utility_metrics.multivariate_predictions_acc module
Computation of the utility metric that compares the multivariate scores of the original and synthetic data for classification tasks.
- utility_metrics.multivariate_predictions_acc.avg_abs_classification_accuracy_difference_svc(original_data, synthetic_data, cols_cat, n_bins=50, frac_training=0.7, random_state=1)
Compute the average absolute difference between the accuracies of classification tasks performed with the original dataframe and the accuracies of the same tasks performed with the synthetic dataframe.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Sequence[str]) – A list containing the names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
frac_training (float) – The fraction of the dataset that is used for training the model. The rest is used for calculating the accuracy.
random_state (int) – The random seed that is used to split the dataset into training and testing.
- Return type:
float
- Returns:
The average absolute difference between the prediction accuracies.
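The averaging step can be illustrated on its own. The following is a minimal sketch, assuming the metric is the mean of the per-column absolute accuracy differences; the helper name `average_absolute_difference` and the accuracy values are hypothetical and not part of this module.

```python
def average_absolute_difference(original_accuracies, synthetic_accuracies):
    """Mean of |a_orig - a_synth| over paired per-column accuracies.

    Hypothetical helper for illustration; it is not the library's code.
    """
    if len(original_accuracies) != len(synthetic_accuracies):
        raise ValueError("Both accuracy lists must have the same length.")
    if not original_accuracies:
        raise ValueError("At least one accuracy pair is required.")
    differences = [
        abs(orig - synth)
        for orig, synth in zip(original_accuracies, synthetic_accuracies)
    ]
    return sum(differences) / len(differences)


# Made-up accuracies for three columns.
original = [0.90, 0.75, 0.60]
synthetic = [0.85, 0.80, 0.40]
print(round(average_absolute_difference(original, synthetic), 3))  # prints 0.1
```

Under this reading, a value close to 0 means that models trained on the synthetic data reach roughly the same accuracy as models trained on the original data.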
- utility_metrics.multivariate_predictions_acc.calculate_accuracies_scores(df_train, df_test, cols_cat, n_bins)
This function uses an SVM to predict each column based on the other columns in the dataframe. The accuracies for these classification tasks are then returned. df_train represents the part of the dataframe used for training and df_test represents the part of the dataframe used for testing.
- Parameters:
df_train (DataFrame) – A pandas dataframe that contains the training data.
df_test (DataFrame) – A pandas dataframe that contains the test data.
cols_cat (Iterable[str]) – An iterable containing the names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
- Return type:
list[float]
- Returns:
A list with the accuracies of the SVM classification task for each column in the dataframe.
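Discretising a numerical column turns it into a class label that a classifier can predict. The following is a minimal sketch of what n_bins could mean, assuming equal-width bins; the binning strategy actually used by this module is not stated here, and the helper name `equal_width_bin_indices` is hypothetical.

```python
def equal_width_bin_indices(values, n_bins):
    """Map each numeric value to a bin index in range(n_bins).

    Assumes equal-width bins between the minimum and maximum value.
    Hypothetical helper for illustration; it is not the library's code.
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column collapses into a single bin.
        return [0 for _ in values]
    width = (highest - lowest) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((value - lowest) / width), n_bins - 1) for value in values]


ages = [18, 25, 33, 47, 52, 68]
print(equal_width_bin_indices(ages, n_bins=5))  # prints [0, 0, 1, 2, 3, 4]
```

A larger n_bins keeps more detail of the numerical column but produces more classes, each with fewer training rows.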
- utility_metrics.multivariate_predictions_acc.visualise_accuracies_scores(original_data, synthetic_data, cols_cat, n_bins=50, frac_training=0.7, random_state=1)
Plot the accuracy scores of multivariate prediction models for the original and synthetic data.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Sequence[str]) – A list containing the names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
frac_training (float) – The fraction of the dataset that is used for training the model. The rest is used for calculating the accuracy.
random_state (int) – The random seed that is used to split the dataset into training and testing.
- Return type:
tuple[list[float], list[float], Any]
- Returns:
The prediction accuracies for the original and synthetic data and the plot.
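The frac_training and random_state parameters together describe a reproducible train/test split. The following is a minimal sketch of such a split, assuming a seeded shuffle of row indices; how this module performs the split internally is not documented here, and the helper name `split_indices` is hypothetical.

```python
import random


def split_indices(n_rows, frac_training=0.7, random_state=1):
    """Return (train_indices, test_indices) for a seeded random split.

    Hypothetical helper for illustration; it is not the library's code.
    """
    indices = list(range(n_rows))
    # A dedicated generator keeps the split reproducible for a given seed.
    random.Random(random_state).shuffle(indices)
    n_train = round(n_rows * frac_training)
    return indices[:n_train], indices[n_train:]


train_indices, test_indices = split_indices(10)
print(len(train_indices), len(test_indices))  # prints 7 3
```

Calling the helper again with the same random_state yields the same split, so accuracies computed on it are comparable between runs.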