utility_metrics.distinguishability module
Computes utility metrics that measure how well a classification task can distinguish the synthetic data from the original data.
- utility_metrics.distinguishability.logistical_regression_auc(original_data, synthetic_data, cols_cat)
Compute the area under the ROC curve (AUC), where the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), for a logistic regression classifier trained to distinguish the original data from the synthetic data.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Iterable[str]) – A list containing the names of the categorical columns.
- Return type:
tuple[ndarray[tuple[Any, ...], dtype[Any]], float]
- Returns:
Area Under the Curve of FPR and TPR
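The AUC can be read as the probability that the classifier gives a randomly chosen synthetic row a higher score than a randomly chosen original row. The following is a minimal pure-Python sketch of that interpretation, not this module's implementation. The function name `pairwise_auc` and the example scores are hypothetical, and treating synthetic rows as the positive class is an assumption.

```python
def pairwise_auc(scores_original, scores_synthetic):
    """Fraction of (original, synthetic) pairs in which the synthetic row
    gets the higher score; ties count as half a win."""
    wins = 0.0
    for synthetic_score in scores_synthetic:
        for original_score in scores_original:
            if synthetic_score > original_score:
                wins += 1.0
            elif synthetic_score == original_score:
                wins += 0.5
    return wins / (len(scores_original) * len(scores_synthetic))


# Hypothetical classifier scores (assumed probability of "synthetic").
auc_separable = pairwise_auc([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])
auc_identical = pairwise_auc([0.4, 0.5, 0.6], [0.4, 0.5, 0.6])

print(auc_separable)  # every synthetic score beats every original score
print(auc_identical)  # the two score sets are the same
```

An AUC close to 0.5 means the classifier cannot tell the two datasets apart, which suggests the synthetic data resembles the original. An AUC close to 1.0 means they are easy to separate.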
- utility_metrics.distinguishability.mean_propensity_difference_logistical_regression(original_data, synthetic_data, cols_cat)
Compute the difference between the mean propensity score for the original dataframe and the mean propensity score for the synthetic dataframe based on a classification task (logistic regression) to distinguish the two classes.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Iterable[str]) – A list containing the names of the categorical columns.
- Return type:
tuple[Any, Any]
- Returns:
The difference between the mean propensity scores.
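The idea behind this metric can be sketched as follows: stack both datasets, label each row by its source, fit a logistic regression, and compare the mean predicted probability per source. This is a minimal pure-Python illustration under stated assumptions, not this module's implementation. The helper `fit_logistic_1d`, the single numeric column, the example values, and the sign convention are all hypothetical, and the sketch yields a single number rather than the tuple documented above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_1d(xs, ys, learning_rate=0.1, epochs=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


# Hypothetical single numeric column from each dataset.
original = [1.0, 2.0, 3.0, 4.0]
synthetic = [3.0, 4.0, 5.0, 6.0]

# Stack both datasets and label the source: 0 = original, 1 = synthetic.
xs = original + synthetic
ys = [0] * len(original) + [1] * len(synthetic)
w, b = fit_logistic_1d(xs, ys)

# Propensity score = predicted probability that a row is synthetic.
mean_original = sum(sigmoid(w * x + b) for x in original) / len(original)
mean_synthetic = sum(sigmoid(w * x + b) for x in synthetic) / len(synthetic)

# The sign convention (synthetic minus original) is an assumption of this sketch.
difference = mean_synthetic - mean_original
print(f"mean propensity difference: {difference:.3f}")
```

A difference near zero suggests the classifier assigns similar scores to both datasets. A larger difference indicates that the synthetic rows are recognisably different from the original ones.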
- utility_metrics.distinguishability.visualise_logistical_regression_auc(original_data, synthetic_data, cols_cat)
Visualise the area under the ROC curve (AUC), where the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR), for a logistic regression classifier trained to distinguish the original data from the synthetic data.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Iterable[str]) – A list containing the names of the categorical columns.
- Return type:
tuple[ndarray[tuple[Any, ...], dtype[Any]], float, Figure]
- Returns:
Area Under the Curve of FPR and TPR
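To make the plotted quantity concrete, the sketch below computes the (FPR, TPR) points of a ROC curve by sweeping a threshold over the scores, then integrates them with the trapezoidal rule. It is a pure-Python illustration only and does not reproduce this module's plotting code. The helpers `roc_points` and `trapezoid_area`, the labels (0 = original, 1 = synthetic), and the scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per distinct threshold, from high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        true_pos = sum(
            1 for y, s in zip(labels, scores) if s >= threshold and y == 1
        )
        false_pos = sum(
            1 for y, s in zip(labels, scores) if s >= threshold and y == 0
        )
        points.append((false_pos / negatives, true_pos / positives))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) pairs."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (0 = original, 1 = synthetic) and classifier scores.
labels = [0, 0, 0, 1, 1, 1]
scores = [0.2, 0.4, 0.6, 0.5, 0.7, 0.9]

points = roc_points(labels, scores)
auc = trapezoid_area(points)
print(points)
print(f"AUC = {auc:.4f}")
```

Plotting the `points` list with FPR on the x-axis and TPR on the y-axis gives the ROC curve. The diagonal from (0, 0) to (1, 1) corresponds to a classifier that cannot distinguish the two datasets.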
- utility_metrics.distinguishability.visualise_propensity_scores(original_data, synthetic_data, cols_cat)
Plot the distribution of the propensity scores for the original dataframe and for the synthetic dataframe based on a classification task to distinguish the two classes.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Iterable[str]) – A list containing the names of the categorical columns.
- Return type:
tuple[ndarray[tuple[Any, ...], dtype[Any]], Any]
- Returns:
The predictions for the original and synthetic data and the plot.
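The kind of distribution plot described above can be approximated in text by binning the propensity scores of each dataset. This is a minimal sketch that assumes scores lie in [0, 1]. The helper `bin_scores`, the bin count, and the example scores are hypothetical, and the module's actual plotting code is not shown in this reference.

```python
def bin_scores(scores, bins=5):
    """Count scores in equal-width bins over [0, 1]."""
    counts = [0] * bins
    for score in scores:
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * bins), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical propensity scores (assumed probability of "synthetic").
scores_original = [0.1, 0.15, 0.3, 0.45, 0.55]
scores_synthetic = [0.5, 0.65, 0.7, 0.85, 0.95]

counts_original = bin_scores(scores_original)
counts_synthetic = bin_scores(scores_synthetic)

# Print a simple text histogram for both datasets.
for i, (n_orig, n_synth) in enumerate(zip(counts_original, counts_synthetic)):
    low, high = i / 5, (i + 1) / 5
    print(f"[{low:.1f}, {high:.1f})  original: {'#' * n_orig:<3} synthetic: {'#' * n_synth}")
```

The less the two histograms overlap, the easier it is for the classifier to distinguish the synthetic data from the original data.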