utility_metrics.distinguishability module

Utility metrics that measure how well a classification task can distinguish the synthetic data from the original data.

utility_metrics.distinguishability.logistical_regression_auc(original_data, synthetic_data, cols_cat)

Compute the area under the ROC curve (AUC), the curve of the true positive rate (TPR) against the false positive rate (FPR), for a logistic regression classifier trained to distinguish the original data from the synthetic data.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Iterable[str]) – An iterable containing the names of the categorical columns.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[Any]], float]

Returns:

A tuple of an array and the area under the ROC curve (AUC) as a float.
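
The idea behind this metric can be sketched without the library: stack both datasets, label each row by its origin, fit a logistic regression on those labels, and score how well it separates them. The following is a minimal pure-Python sketch, not the module's implementation, which the source does not show. The helper names, the label convention (0 = original, 1 = synthetic), the Gaussian toy data, and the in-sample evaluation are all assumptions made for illustration. Only numeric features are used; categorical columns would first need to be encoded.

```python
import math
import random


def predict_proba(weights, bias, row):
    """Propensity score: estimated probability that a row is synthetic."""
    z = bias + sum(w * v for w, v in zip(weights, row))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(rows, labels, lr=0.1, epochs=300):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(rows, labels):
            error = predict_proba(weights, bias, row) - label
            grad_b += error
            for j, value in enumerate(row):
                grad_w[j] += error * value
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def auc(scores, labels):
    """AUC as the probability that a random synthetic row outscores a
    random original row (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def distinguishability_auc(original, synthetic):
    """Label rows by origin (assumed: 0 = original, 1 = synthetic),
    fit the classifier, and return the in-sample AUC."""
    rows = original + synthetic
    labels = [0] * len(original) + [1] * len(synthetic)
    weights, bias = fit_logistic_regression(rows, labels)
    scores = [predict_proba(weights, bias, row) for row in rows]
    return auc(scores, labels)


rng = random.Random(0)


def sample(mean, n=200):
    """Toy numeric data: n rows with two Gaussian features."""
    return [[rng.gauss(mean, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]


original = sample(0.0)
auc_similar = distinguishability_auc(original, sample(0.0))  # same distribution
auc_shifted = distinguishability_auc(original, sample(3.0))  # first feature shifted
print(f"similar: {auc_similar:.2f}, shifted: {auc_shifted:.2f}")
```

An AUC close to 0.5 means the classifier cannot tell the two datasets apart, while an AUC close to 1 means the synthetic data is easy to distinguish from the original.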

utility_metrics.distinguishability.mean_propensity_difference_logistical_regression(original_data, synthetic_data, cols_cat)

Compute the difference between the mean propensity score of the original data and the mean propensity score of the synthetic data, where the propensity scores come from a logistic regression classifier trained to distinguish the two classes.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Iterable[str]) – An iterable containing the names of the categorical columns.

Return type:

tuple[Any, Any]

Returns:

The difference between the mean propensity scores.
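
As a rough illustration of the quantity described here, the sketch below averages the propensity scores per class and subtracts the two means. It assumes the scores are already available as plain floats, that each score is the estimated probability of a row being synthetic, and that the difference is taken as original minus synthetic; the function name, the label convention, and the example scores are hypothetical, and the library's actual return value (typed as a tuple) may differ.

```python
from statistics import mean


def mean_propensity_difference(scores, labels):
    """Mean propensity score of the original rows minus that of the
    synthetic rows (assumed labels: 0 = original, 1 = synthetic)."""
    original_scores = [s for s, y in zip(scores, labels) if y == 0]
    synthetic_scores = [s for s, y in zip(scores, labels) if y == 1]
    return mean(original_scores) - mean(synthetic_scores)


# Hypothetical propensity scores (estimated probability of being synthetic).
labels = [0, 0, 0, 1, 1, 1]
well_separated = [0.2, 0.4, 0.3, 0.7, 0.6, 0.8]
indistinguishable = [0.5, 0.4, 0.6, 0.6, 0.5, 0.4]

diff_separated = mean_propensity_difference(well_separated, labels)
diff_similar = mean_propensity_difference(indistinguishable, labels)
print(diff_separated, diff_similar)
```

A difference near zero suggests the classifier assigns similar scores to both datasets; a large absolute difference suggests the two are easy to tell apart.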

utility_metrics.distinguishability.visualise_logistical_regression_auc(original_data, synthetic_data, cols_cat)

Visualise the ROC curve, the true positive rate (TPR) against the false positive rate (FPR), and its area under the curve (AUC) for a logistic regression classifier trained to distinguish the original data from the synthetic data.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Iterable[str]) – An iterable containing the names of the categorical columns.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[Any]], float, Figure]

Returns:

A tuple of an array, the area under the ROC curve (AUC) as a float, and the figure.
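
The curve being visualised consists of (FPR, TPR) points obtained by sweeping a decision threshold over the propensity scores. The sketch below computes those points and the area under them with the trapezoidal rule, in pure Python. The function names, the label convention (1 = synthetic as the positive class), and the example scores are assumptions for illustration; the actual plotting is left out and could be done with any plotting library.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points for every distinct threshold, from the
    strictest threshold (nothing predicted positive) to the loosest."""
    pairs = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Rows with the same score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical propensity scores; label 1 = synthetic, 0 = original.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = trapezoid_area(points)
print(points)
print(area)
```

Plotting the second coordinate of each point against the first gives the ROC curve; a curve that hugs the diagonal corresponds to an AUC near 0.5.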

utility_metrics.distinguishability.visualise_propensity_scores(original_data, synthetic_data, cols_cat)

Plot the distribution of the propensity scores for the original dataframe and for the synthetic dataframe based on a classification task to distinguish the two classes.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Iterable[str]) – An iterable containing the names of the categorical columns.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[Any]], Any]

Returns:

The predictions for the original and synthetic data and the plot.
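
To give a feel for what such a distribution plot conveys, the sketch below bins propensity scores per class and prints a small text histogram. It is only an illustration under stated assumptions: the scores are hypothetical, each score is taken to be the estimated probability of a row being synthetic, and the bin count and function name are made up here. The source does not say which kind of plot the library draws.

```python
def histogram(scores, n_bins=5):
    """Count scores in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for score in scores:
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical propensity scores for each dataset.
original_scores = [0.05, 0.15, 0.25, 0.45]
synthetic_scores = [0.55, 0.75, 0.85, 0.95]

original_counts = histogram(original_scores)
synthetic_counts = histogram(synthetic_scores)

# Text rendering: 'o' marks original rows, 's' marks synthetic rows.
n_bins = len(original_counts)
for b in range(n_bins):
    low, high = b / n_bins, (b + 1) / n_bins
    bar_o = "o" * original_counts[b]
    bar_s = "s" * synthetic_counts[b]
    print(f"{low:.1f}-{high:.1f} | {bar_o:<4} | {bar_s}")
```

When the two distributions overlap heavily, the classifier cannot separate the datasets; when they sit at opposite ends of the score range, as in this toy example, the synthetic data is easy to distinguish.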