utility_metrics.multivariate_predictions_acc module

Utility metric that compares the multivariate prediction accuracies obtained on the original and the synthetic data for classification tasks.

utility_metrics.multivariate_predictions_acc.avg_abs_classification_accuracy_difference_svc(original_data, synthetic_data, cols_cat, n_bins=50, frac_training=0.7, random_state=1)

Compute the average absolute difference between the accuracies of classification tasks performed on the original dataframe and the accuracies of the same tasks performed on the synthetic dataframe.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Sequence[str]) – A list containing the names of the categorical columns.

  • n_bins (int) – The number of bins that is used to discretise the numerical columns.

  • frac_training (float) – The fraction of the dataset that is used for training the model. The rest is used for calculating the accuracy.

  • random_state (int) – The random seed that is used to split the dataset into training and testing.

Return type:

float

Returns:

The average absolute difference between the prediction accuracies.
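The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the module's implementation: the helper name avg_abs_difference and the accuracy values are hypothetical, and the sketch assumes that one accuracy per column is already available for both datasets.

```python
from statistics import mean
from typing import Sequence


def avg_abs_difference(acc_original: Sequence[float],
                       acc_synthetic: Sequence[float]) -> float:
    """Mean of |a_original - a_synthetic| over columns.

    Hypothetical helper for illustration; not part of utility_metrics.
    """
    if len(acc_original) != len(acc_synthetic):
        raise ValueError("Both accuracy lists must cover the same columns.")
    return mean(abs(a - b) for a, b in zip(acc_original, acc_synthetic))


# Made-up per-column accuracies, for illustration only.
acc_original = [0.75, 0.5, 1.0, 0.25]
acc_synthetic = [0.5, 0.5, 0.75, 0.25]

print(avg_abs_difference(acc_original, acc_synthetic))  # 0.125
```

A value close to 0 means the classifiers reach similar accuracies on both datasets. Because the differences are taken in absolute value, the result does not indicate which dataset scored higher.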

utility_metrics.multivariate_predictions_acc.calculate_accuracies_scores(df_train, df_test, cols_cat, n_bins)

This function uses an SVM to predict each column from the other columns in the dataframe and returns the accuracy of each of these classification tasks. df_train is the part of the dataframe used for training and df_test is the part used for testing.

Parameters:
  • df_train (DataFrame) – A pandas dataframe that contains the training data.

  • df_test (DataFrame) – A pandas dataframe that contains the test data.

  • cols_cat (Iterable[str]) – An iterable containing the names of the categorical columns.

  • n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

list[float]

Returns:

A list with the accuracies of the SVM classification task for each column in the dataframe.
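To illustrate what discretising a numerical column into n_bins classes can look like, here is a minimal sketch. It assumes equal-width bins, which the documentation above does not confirm, so the module may use a different binning strategy. The helper name equal_width_bins and the sample values are hypothetical.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Map each value to a bin index in [0, n_bins - 1].

    Hypothetical helper; assumes equal-width bins over [min, max].
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1.")
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 31, 47, 52, 68]  # made-up numerical column
print(equal_width_bins(ages, n_bins=5))  # [0, 0, 1, 2, 3, 4]
```

Once a numerical column is turned into bin labels like these, it can serve as a target in a classification task in the same way as a categorical column.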

utility_metrics.multivariate_predictions_acc.visualise_accuracies_scores(original_data, synthetic_data, cols_cat, n_bins=50, frac_training=0.7, random_state=1)

Plot the accuracy scores of the multivariate prediction models for the original and synthetic data.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Sequence[str]) – A list containing the names of the categorical columns.

  • n_bins (int) – The number of bins that is used to discretise the numerical columns.

  • frac_training (float) – The fraction of the dataset that is used for training the model. The rest is used for calculating the accuracy.

  • random_state (int) – The random seed that is used to split the dataset into training and testing.

Return type:

tuple[list[float], list[float], Any]

Returns:

A tuple containing the prediction accuracies for the original data, the prediction accuracies for the synthetic data, and the plot.
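The frac_training and random_state parameters describe a seeded split into a training part and a test part. The sketch below shows one way such a split can behave, using plain Python lists instead of dataframes. The helper name split_rows is hypothetical, and the module's actual splitting routine is not documented here, so treat this as an assumption about the general idea only.

```python
import random
from typing import List, Sequence, Tuple


def split_rows(rows: Sequence, frac_training: float = 0.7,
               random_state: int = 1) -> Tuple[List, List]:
    """Shuffle row indices with a fixed seed and split them.

    Hypothetical helper; not part of utility_metrics.
    """
    if not 0.0 < frac_training < 1.0:
        raise ValueError("frac_training must lie strictly between 0 and 1.")
    indices = list(range(len(rows)))
    # A dedicated Random instance keeps the global random state untouched.
    random.Random(random_state).shuffle(indices)
    n_train = round(len(rows) * frac_training)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


rows = list(range(10))  # stand-in for ten dataframe rows
train, test = split_rows(rows, frac_training=0.7, random_state=1)
print(len(train), len(test))  # 7 3
```

Fixing random_state makes the split reproducible, so repeated calls with the same inputs compare the original and synthetic data on the same partition.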