utility_metrics.bivariate_distributions module

Computation of the utility metric that compares the bivariate distributions of the original and synthetic data.

utility_metrics.bivariate_distributions.avg_abs_correlation_difference(original_data, synthetic_data, cols_cat, n_bins=50)

Compute the average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Container[str]) – A list containing the names of the categorical columns.

  • n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

ndarray[tuple[Any, ...], dtype[Any]]

Returns:

The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.
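
The final averaging step can be illustrated in isolation. The following is a minimal sketch, not the library's implementation: it assumes both correlation matrices have already been computed, and it averages over all entries including the diagonal, which may differ from what the function actually does. The name avg_abs_difference_sketch is hypothetical.

```python
import numpy as np
import pandas as pd


def avg_abs_difference_sketch(corr_original, corr_synthetic):
    """Mean absolute elementwise difference between two correlation matrices.

    Hypothetical helper for illustration; averaging over every entry
    (diagonal included) is an assumption.
    """
    # Align the synthetic matrix to the original's row and column order first.
    aligned = corr_synthetic.loc[corr_original.index, corr_original.columns]
    diff = np.abs(corr_original.to_numpy() - aligned.to_numpy())
    return float(diff.mean())


labels = ["a", "b"]
corr_orig = pd.DataFrame([[1.0, 0.8], [0.8, 1.0]], index=labels, columns=labels)
corr_synth = pd.DataFrame([[1.0, 0.5], [0.5, 1.0]], index=labels, columns=labels)

# Entry differences are 0, 0.3, 0.3, 0, so the mean is approximately 0.15.
score = avg_abs_difference_sketch(corr_orig, corr_synth)
```

A value of 0 means the two matrices agree exactly; larger values indicate that the synthetic data preserves the pairwise associations less well.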

utility_metrics.bivariate_distributions.compute_cramers_correlation_matrix(data_frame)

Compute the Cramer’s V correlation matrix of a dataframe with discrete columns.

Parameters:

data_frame (DataFrame) – A discrete pandas dataframe that contains the data.

Return type:

DataFrame

Returns:

The correlation matrix of the dataframe.
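
As an illustration of what such a matrix contains, here is a minimal sketch that fills a square dataframe with the pairwise Cramer’s V of every column pair. The names _cramers_v and cramers_matrix_sketch are hypothetical; the uncorrected Cramer’s V formula and a diagonal fixed at 1 are assumptions, not confirmed behaviour of the library.

```python
import numpy as np
import pandas as pd


def _cramers_v(col_a, col_b):
    # Uncorrected Cramer's V from the contingency table of two discrete columns.
    observed = pd.crosstab(col_a, col_b).to_numpy(dtype=float)
    k = min(observed.shape) - 1
    if k < 1:
        # A constant column carries no association information.
        return 0.0
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * k)))


def cramers_matrix_sketch(data_frame):
    """Symmetric matrix of pairwise Cramer's V values, labelled by column name."""
    cols = list(data_frame.columns)
    # Assumption: the diagonal is set to 1 (a column is fully associated with itself).
    matrix = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            value = _cramers_v(data_frame[a], data_frame[b])
            matrix.loc[a, b] = value
            matrix.loc[b, a] = value
    return matrix


df = pd.DataFrame({"x": [0, 0, 1, 1], "y": [0, 0, 1, 1], "z": [0, 1, 0, 1]})
matrix = cramers_matrix_sketch(df)
```

In this toy data, x and y are identical and x and z are independent, so the corresponding entries sit at the two ends of the [0, 1] range.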

utility_metrics.bivariate_distributions.compute_cramers_v(vector_1, vector_2)

Compute the Cramer’s V correlation between two vectors.

Parameters:
  • vector_1 (ndarray[tuple[Any, ...], dtype[Any]]) – First column from the dataframe.

  • vector_2 (ndarray[tuple[Any, ...], dtype[Any]]) – Second column from the dataframe.

Return type:

float

Returns:

The Cramer’s V correlation between vector_1 and vector_2.
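
For reference, the textbook (uncorrected) Cramer’s V is sqrt(chi2 / (n * (min(r, c) - 1))), where chi2 is the chi-squared statistic of the r-by-c contingency table and n is the number of observations. The sketch below implements that formula with numpy and pandas only. It is an illustration under the assumption that no bias correction is applied; the library may compute the statistic differently, and the name cramers_v_sketch is hypothetical.

```python
import numpy as np
import pandas as pd


def cramers_v_sketch(vector_1, vector_2):
    """Uncorrected Cramer's V between two discrete vectors (illustrative only)."""
    # Contingency table of observed co-occurrence counts.
    observed = pd.crosstab(pd.Series(vector_1), pd.Series(vector_2)).to_numpy(dtype=float)
    n = observed.sum()
    n_rows, n_cols = observed.shape
    # A constant vector carries no association information.
    if min(n_rows, n_cols) < 2:
        return 0.0
    # Expected counts under independence: outer product of the marginals over n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1))))


identical = cramers_v_sketch(np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1]))    # 1.0
independent = cramers_v_sketch(np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1]))  # 0.0
```

The value lies between 0 (no association) and 1 (one vector fully determines the other) and, unlike Pearson correlation, carries no sign.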

utility_metrics.bivariate_distributions.discretize_matrix(data_frame, cols_to_discretize, nr_of_bins)

Discretize a selection of columns of a dataframe.

Parameters:
  • data_frame (DataFrame) – A pandas dataframe that contains the data.

  • cols_to_discretize (Iterable[Any]) – Columns to discretize.

  • nr_of_bins (int) – Number of bins to use for discretization.

Return type:

DataFrame

Returns:

The original dataframe with the selected columns replaced by their discretized values.
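
One common way to discretize numerical columns is equal-width binning with pandas.cut. The sketch below assumes that strategy and returns a copy of the input; the library's actual binning scheme (for example quantile-based bins) and whether it copies the dataframe are not stated here, and the name discretize_sketch is hypothetical.

```python
import pandas as pd


def discretize_sketch(data_frame, cols_to_discretize, nr_of_bins):
    """Return a copy where the selected columns are replaced by integer bin codes."""
    result = data_frame.copy()
    for col in cols_to_discretize:
        # labels=False makes pandas.cut return the bin index instead of an interval.
        result[col] = pd.cut(result[col], bins=nr_of_bins, labels=False)
    return result


df = pd.DataFrame({"age": [0.0, 2.0, 10.0], "sex": ["f", "m", "f"]})
binned = discretize_sketch(df, ["age"], nr_of_bins=2)
# The "age" column now holds bin codes; "sex" is left untouched.
```

After this step every column is discrete, which is the precondition for computing a contingency-table statistic such as Cramer’s V.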

utility_metrics.bivariate_distributions.visualise_cramers_correlation_matrices(original_data, synthetic_data, cols_cat, n_bins=50)

Plot the Cramer’s V correlation matrices of the original and synthetic data.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Container[str]) – A list containing the names of the categorical columns.

  • n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

tuple[DataFrame, DataFrame, Any]

Returns:

The two correlation matrices used for the plots, and the plot itself.