utility_metrics.bivariate_distributions module

Computation of the utility metric that compares the bivariate distributions of the original and synthetic data.

utility_metrics.bivariate_distributions.avg_abs_correlation_difference(original_data, synthetic_data, cols_cat, n_bins=50)

Compute the average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.

Parameters:

original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Container[str]) – A list containing the names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

ndarray[tuple[Any, ...], dtype[Any]]

Returns:

The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.
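
As a rough illustration of the final averaging step only, the sketch below takes two correlation matrices that have already been computed and averages their element-wise absolute differences. The name `avg_abs_difference_sketch` is hypothetical, and the choice to average over every entry (diagonal included) and return a plain float is an assumption; this page does not state how the module itself aggregates the differences.

```python
import pandas as pd


def avg_abs_difference_sketch(corr_original, corr_synthetic):
    """Mean absolute element-wise difference of two correlation matrices.

    Illustrative only; assumes both inputs are square DataFrames that
    share the same row and column labels.
    """
    # Reorder the synthetic matrix so its labels line up with the original.
    aligned = corr_synthetic.loc[corr_original.index, corr_original.columns]
    diff = (corr_original - aligned).abs()
    # Assumption: average over all entries, diagonal included.
    return float(diff.to_numpy().mean())


labels = ["a", "b"]
corr_o = pd.DataFrame([[1.0, 0.5], [0.5, 1.0]], index=labels, columns=labels)
corr_s = pd.DataFrame([[1.0, 0.1], [0.1, 1.0]], index=labels, columns=labels)
score = avg_abs_difference_sketch(corr_o, corr_s)
```

A value close to zero would indicate that the pairwise associations in the synthetic data resemble those in the original data.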

utility_metrics.bivariate_distributions.compute_cramers_correlation_matrix(data_frame)

Compute the Cramer’s V correlation matrix of a dataframe with discrete columns.

Parameters: data_frame (DataFrame) – A discrete pandas dataframe that contains the data.
Return type: DataFrame
Returns: The correlation matrix of the dataframe.

utility_metrics.bivariate_distributions.compute_cramers_v(vector_1, vector_2)

Compute the Cramer’s V correlation between two vectors.

Parameters:

vector_1 (ndarray[tuple[Any, ...], dtype[Any]]) – First column from the dataframe.
vector_2 (ndarray[tuple[Any, ...], dtype[Any]]) – Second column from the dataframe.

Return type:

float

Returns:

The Cramer’s V correlation between vector_1 and vector_2.
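
For orientation, Cramer’s V is commonly defined as sqrt(chi2 / (n * (min(r, c) - 1))), where chi2 is the chi-squared statistic of the r-by-c contingency table of the two vectors and n is the number of observations. The sketch below computes this uncorrected form with pandas and NumPy. It is not the module's implementation: the name `cramers_v_sketch` is hypothetical, and whether `compute_cramers_v` applies a bias correction or treats constant columns differently is not stated on this page.

```python
import numpy as np
import pandas as pd


def cramers_v_sketch(vector_1, vector_2):
    """Uncorrected Cramer's V of two discrete vectors (illustrative only)."""
    # Contingency table of joint counts for the observed categories.
    observed = pd.crosstab(np.asarray(vector_1), np.asarray(vector_2))
    observed = observed.to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the two vectors were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        # Assumption: a constant vector is reported as zero association.
        return 0.0
    return float(np.sqrt(chi2 / (n * min_dim)))


a = np.array([0, 0, 1, 1])
b = np.array([0, 1, 0, 1])
v_identical = cramers_v_sketch(a, a)    # perfectly associated
v_independent = cramers_v_sketch(a, b)  # no association
```

In this toy input, a vector compared with itself gives 1.0 and the two balanced, unrelated vectors give 0.0, which are the two extremes of the measure.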

utility_metrics.bivariate_distributions.discretize_matrix(data_frame, cols_to_discretize, nr_of_bins)

Discretize the selected columns of a dataframe.

Parameters:

data_frame (DataFrame) – A pandas dataframe that contains the data.
cols_to_discretize (Iterable[Any]) – Columns to discretize.
nr_of_bins (int) – Number of bins to use for discretization.

Return type:

DataFrame

Returns:

The input dataframe with the selected columns discretized.
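
One plausible way to discretize numerical columns is equal-width binning with `pandas.cut`, sketched below. The name `discretize_sketch` is hypothetical, and both the binning strategy (equal-width rather than, say, quantile-based) and the decision to return a copy instead of modifying the input are assumptions, since this page does not describe how `discretize_matrix` bins the data.

```python
import pandas as pd


def discretize_sketch(data_frame, cols_to_discretize, nr_of_bins):
    """Replace the selected columns by integer bin codes (illustrative only)."""
    # Assumption: work on a copy so the caller's dataframe is left untouched.
    result = data_frame.copy()
    for col in cols_to_discretize:
        # Assumption: equal-width bins; labels=False yields integer bin codes.
        result[col] = pd.cut(result[col], bins=nr_of_bins, labels=False)
    return result


df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "kind": ["a", "b", "a", "b"]})
binned = discretize_sketch(df, ["x"], nr_of_bins=2)
```

Columns that are not listed in `cols_to_discretize`, such as the categorical `kind` column here, pass through unchanged.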

utility_metrics.bivariate_distributions.visualise_cramers_correlation_matrices(original_data, synthetic_data, cols_cat, n_bins=50)

Plot the correlation matrices of the original and synthetic data.

Parameters:

original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Container[str]) – A list containing the names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

tuple[DataFrame, DataFrame, Any]

Returns:

The two correlation matrices used for the plots, and the plot itself.