utility_metrics.multivariate_predictions_acc_regr module

Computes a utility metric that compares the multivariate prediction scores (classification accuracy and regression R-squared) obtained on the original data with those obtained on the synthetic data.

utility_metrics.multivariate_predictions_acc_regr.avg_abs_prediction_differences_svm(original_data, synthetic_data, cols_cat, frac_training=0.7, random_state=1)

Compute the average absolute difference between the scores of the classification and regression tasks performed on the original dataframe and the scores of the same tasks performed on the synthetic dataframe.

Parameters:

original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Sequence[str]) – A list containing the names of the categorical columns.
frac_training (float) – The fraction of the dataset that is used for training the model. The rest is used for calculating the accuracy and R-squared scores.
random_state (int) – The random seed that is used to split the dataset into training and test sets.

Return type:

float

Returns:

The average absolute difference between the prediction scores of the original data and those of the synthetic data.
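
The aggregation described here can be sketched as follows. This is a minimal illustration, not the module's implementation: it assumes the per-column scores for the original and synthetic data are already available as two equal-length lists, and the helper name `mean_abs_difference` is hypothetical.

```python
from statistics import fmean


def mean_abs_difference(scores_original, scores_synthetic):
    """Average the absolute per-column gaps between two score lists.

    Hypothetical helper for illustration; not part of the module.
    """
    if len(scores_original) != len(scores_synthetic):
        raise ValueError("Both score lists must have one entry per column.")
    return fmean(abs(o - s) for o, s in zip(scores_original, scores_synthetic))


# Made-up scores for three columns, purely for demonstration.
example_difference = mean_abs_difference([0.90, 0.75, 0.60], [0.85, 0.70, 0.65])
print(example_difference)
```

A value close to zero would indicate that models trained on the synthetic data score about as well as models trained on the original data.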

utility_metrics.multivariate_predictions_acc_regr.calculate_prediction_scores(df_train, df_test, cols_cat)

This function uses an SVM to predict each column from the other columns in the dataframe and returns the scores of these classification and regression tasks. df_train is the part of the dataframe used for training and df_test is the part of the dataframe used for testing.

Parameters:

df_train (DataFrame) – A pandas dataframe that contains the training data.
df_test (DataFrame) – A pandas dataframe that contains the test data.
cols_cat (Iterable[str]) – An iterable containing the names of the categorical columns.

Return type:

list[float]

Returns:

A list with the score of the SVM prediction task for each column in the dataframe.
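
The "predict each column from the others" loop can be sketched with scikit-learn as shown below. This is a hedged re-implementation for illustration only: the function name `column_wise_svm_scores` is hypothetical, and the preprocessing (one-hot encoding of categorical predictors, no feature scaling, default SVM hyperparameters) is an assumption rather than the module's documented behaviour. It also assumes that every column not listed in `cols_cat` is numeric.

```python
import pandas as pd
from sklearn.svm import SVC, SVR


def column_wise_svm_scores(df_train, df_test, cols_cat):
    """Predict each column from all others (hypothetical sketch)."""
    cols_cat = set(cols_cat)
    scores = []
    for target in df_train.columns:
        # One-hot encode categorical predictors on train and test together
        # so both feature matrices end up with the same columns.
        features = pd.concat([df_train, df_test]).drop(columns=target)
        cat_features = [c for c in features.columns if c in cols_cat]
        features = pd.get_dummies(features, columns=cat_features, dtype=float)
        x_train = features.iloc[: len(df_train)]
        x_test = features.iloc[len(df_train):]

        # Classifier (score = accuracy) for categorical targets,
        # regressor (score = R-squared) for numerical targets.
        model = SVC() if target in cols_cat else SVR()
        model.fit(x_train, df_train[target])
        scores.append(float(model.score(x_test, df_test[target])))
    return scores


# Toy data, made up for demonstration.
df = pd.DataFrame({
    "age": [23, 35, 47, 51, 62, 29, 41, 38, 55, 33],
    "income": [21.0, 40.5, 52.0, 58.5, 61.0, 30.0, 45.5, 43.0, 60.0, 36.5],
    "smoker": ["no", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "no"],
})
df_train = df.sample(frac=0.7, random_state=1)  # 70% of the rows for training
df_test = df.drop(df_train.index)               # remaining rows for testing
scores = column_wise_svm_scores(df_train, df_test, cols_cat=["smoker"])
print(dict(zip(df.columns, scores)))
```

The split in the usage example mirrors the `frac_training` and `random_state` parameters of the other functions in this module; how the module itself performs the split is not stated here.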

utility_metrics.multivariate_predictions_acc_regr.visualise_prediction_scores(original_data, synthetic_data, cols_cat, frac_training=0.7, random_state=1)

Plot the scores of multivariate prediction models for the original and synthetic data, showing R-squared values and accuracies in one plot.

Parameters:

original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Sequence[str]) – A list containing the names of the categorical columns.
frac_training (float) – The fraction of the dataset that is used for training the model. The rest is used for calculating the accuracy and R-squared scores.
random_state (int) – The random seed that is used to split the dataset into training and test sets.

Return type:

tuple[list[float], list[float], Any]

Returns:

The prediction accuracies for the original and synthetic data and the plot.
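
One way to draw such a comparison is a grouped bar chart with one pair of bars per column. The sketch below uses matplotlib and is an assumption about the layout, since the plot type produced by the module is not specified here; the function name `plot_score_comparison` and the example scores are hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_score_comparison(columns, scores_original, scores_synthetic):
    """Draw original and synthetic scores side by side (hypothetical sketch)."""
    positions = list(range(len(columns)))
    width = 0.4
    fig, ax = plt.subplots()
    ax.bar([p - width / 2 for p in positions], scores_original, width,
           label="original")
    ax.bar([p + width / 2 for p in positions], scores_synthetic, width,
           label="synthetic")
    ax.set_xticks(positions)
    ax.set_xticklabels(columns, rotation=45, ha="right")
    ax.set_ylabel("accuracy / R-squared")
    ax.legend()
    fig.tight_layout()
    return fig


# Made-up scores for three columns.
fig = plot_score_comparison(
    ["age", "income", "smoker"], [0.62, 0.70, 0.81], [0.55, 0.66, 0.78]
)
fig.savefig("prediction_scores.png")
```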

utility_metrics.multivariate_predictions_acc_regr.visualise_regression_scores(original_data, synthetic_data, cols_cat, frac_training=0.7, random_state=1)

Plot the scores of multivariate regression models for the original and synthetic data.

Parameters:

original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Sequence[str]) – A list containing the names of the categorical columns.
frac_training (float) – The fraction of the dataset that is used for training the model. The rest is used for calculating the accuracy and R-squared scores.
random_state (int) – The random seed that is used to split the dataset into training and test sets.

Return type:

tuple[list[float], list[float], Any]

Returns:

The regression scores for the original and synthetic data and the plot.