utility_metrics.multivariate_predictions_acc module
Computation of the utility metric that compares the multivariate scores of the original and synthetic data for classification tasks.
- utility_metrics.multivariate_predictions_acc.avg_abs_classification_accuracy_difference_svc(original_data, synthetic_data, cols_cat, n_bins=50, frac_training=0.7, random_state=1)
Compute the average absolute difference between the accuracies of classification tasks performed on the original dataframe and the accuracies of the same tasks performed on the synthetic dataframe.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Sequence[str]) – A list containing the names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
frac_training (float) – The fraction of the dataset that is used for training the model. The rest is used for calculating the accuracy.
random_state (int) – The random seed that is used to split the dataset into training and testing.
- Return type:
float
- Returns:
The average absolute difference between the prediction accuracies.
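The returned value can be illustrated with a minimal sketch. It assumes the metric is the plain mean of the per-column absolute differences between two equally long lists of accuracies; the helper name `avg_abs_difference` and the accuracy values are hypothetical and not part of this module.

```python
from statistics import mean


def avg_abs_difference(acc_original, acc_synthetic):
    """Mean of |a_orig - a_synth| over per-column accuracies (hypothetical helper)."""
    if len(acc_original) != len(acc_synthetic):
        raise ValueError("accuracy lists must have the same length")
    return mean(abs(o - s) for o, s in zip(acc_original, acc_synthetic))


# Made-up accuracies, one entry per predicted column.
acc_original = [0.5, 0.75, 1.0]
acc_synthetic = [0.25, 0.75, 0.5]

# Absolute differences are 0.25, 0.0 and 0.5, so the mean is 0.25.
result = avg_abs_difference(acc_original, acc_synthetic)
print(result)
```

A value close to 0 would indicate that models trained on the synthetic data reach accuracies similar to those trained on the original data.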
- utility_metrics.multivariate_predictions_acc.calculate_accuracies_scores(df_train, df_test, cols_cat, n_bins)
Use an SVM to predict each column from the other columns in the dataframe and return the accuracy of each of these classification tasks. df_train is the part of the dataframe used for training and df_test is the part used for testing.
- Parameters:
df_train (DataFrame) – A pandas dataframe that contains the training data.
df_test (DataFrame) – A pandas dataframe that contains the test data.
cols_cat (Iterable[str]) – An iterable containing the names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
- Return type:
list[float]
- Returns:
A list with the accuracies of the SVM classification task for each column in the dataframe.
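The role of n_bins can be sketched as follows. The documentation does not state which binning scheme the module uses, so this example assumes equal-width bins between the column minimum and maximum; the function name `discretise` and the sample values are hypothetical.

```python
def discretise(values, n_bins):
    """Map each number to an equal-width bin index in [0, n_bins - 1].

    Assumption: equal-width binning; the module may use another scheme.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column falls into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 40, 63, 90]
bins = discretise(ages, n_bins=4)
print(bins)  # [0, 0, 1, 2, 3]
```

Discretising a numerical column in this way turns it into a class label, which is what allows a classifier to predict it and report an accuracy.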
- utility_metrics.multivariate_predictions_acc.visualise_accuracies_scores(original_data, synthetic_data, cols_cat, n_bins=50, frac_training=0.7, random_state=1)
Plot the accuracy scores of the multivariate prediction models for the original and synthetic data.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Sequence[str]) – A list containing the names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
frac_training (float) – The fraction of the dataset that is used for training the model. The rest is used for calculating the accuracy.
random_state (int) – The random seed that is used to split the dataset into training and testing.
- Return type:
tuple[list[float], list[float], Any]
- Returns:
The prediction accuracies for the original and synthetic data and the plot.
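The effect of frac_training and random_state can be sketched with a seeded split. This is only an illustration using the standard library; the module may split the data with a different routine, and the helper name `split_rows` is hypothetical.

```python
import random


def split_rows(rows, frac_training=0.7, random_state=1):
    """Shuffle rows with a fixed seed and split them into train and test parts.

    Assumption: a simple seeded shuffle; not the module's own implementation.
    """
    rng = random.Random(random_state)
    shuffled = list(rows)  # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * frac_training)
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(10))
train, test = split_rows(rows)
print(len(train), len(test))  # 7 3
```

Fixing random_state makes the split reproducible, so repeated calls with the same inputs evaluate the accuracies on the same test rows.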