utility_metrics.bivariate_correlations module

Computes utility metrics that compare the bivariate correlations of the original and synthetic data. The visualisation of correlations uses only Cramer’s V. For the spider plot, three measures are calculated: Pearson’s r for numerical column pairs, Cramer’s V for categorical column pairs, and ANOVA for numerical-to-categorical column pairs. The differences between the synthetic and original data are then averaged.

utility_metrics.bivariate_correlations.avg_abs_correlation_difference(original_data, synthetic_data, cols_cat, n_bins=50)[source]

Compute the total weighted average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe, using Pearson’s r, Cramer’s V and ANOVA.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Sequence[str]) – The names of the categorical columns.

  • n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

ndarray[tuple[Any, ...], dtype[float64]]

Returns:

The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.
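The weighting scheme is not described on this page. As a minimal sketch, assuming each method's difference is weighted by the number of column pairs it covers (an assumption, as are the names `weighted_average_sketch`, `differences` and `pair_counts`), the combination step could look like this:

```python
def weighted_average_sketch(differences, pair_counts):
    """Combine per-method differences into one weighted average.

    Hypothetical helper: the weights (column-pair counts per method) are an
    assumption and may not match what the module does.
    """
    total_pairs = sum(pair_counts.values())
    if total_pairs == 0:
        raise ValueError("at least one column pair is required")
    weighted_sum = sum(differences[m] * pair_counts[m] for m in differences)
    return weighted_sum / total_pairs


# Made-up per-method differences and pair counts, for illustration only.
differences = {"pearson": 0.10, "cramers": 0.20, "anova": 0.40}
pair_counts = {"pearson": 3, "cramers": 1, "anova": 6}
combined = weighted_average_sketch(differences, pair_counts)
```

With these made-up inputs the ANOVA term dominates, because it covers the most column pairs.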

utility_metrics.bivariate_correlations.avg_abs_correlation_difference_anova(original_data, synthetic_data, cols_cat)[source]

Compute the average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe, using the ANOVA method between unordered/categorical and ordered/numerical columns.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Iterable[str]) – The names of the categorical columns.

Return type:

float

Returns:

The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.

utility_metrics.bivariate_correlations.avg_abs_correlation_difference_cramers(original_data, synthetic_data, cols_cat, n_bins=50)[source]

Compute the average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe, using Cramer’s V for categorical columns.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Container[str]) – The names of the categorical columns.

  • n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

float

Returns:

The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.

utility_metrics.bivariate_correlations.avg_abs_correlation_difference_pearson(original_data_num, synthetic_data_num)[source]

Compute the average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe, using Pearson’s r for numerical columns.

Parameters:
  • original_data_num (DataFrame) – A pandas dataframe that contains the numerical columns of original data.

  • synthetic_data_num (DataFrame) – A pandas dataframe that contains the numerical columns of synthetic data.

Return type:

float

Returns:

The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.
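The idea can be illustrated with plain pandas and NumPy. The sketch below is not the module's implementation. It assumes that each unordered column pair counts once (the strict upper triangle of the correlation matrix) and that both dataframes share the same numerical columns. The name `avg_abs_pearson_difference_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def avg_abs_pearson_difference_sketch(original_num, synthetic_num):
    """Mean absolute difference of pairwise Pearson correlations (sketch)."""
    corr_orig = original_num.corr(method="pearson").to_numpy()
    # Reorder the synthetic columns so both matrices line up.
    corr_synth = synthetic_num[original_num.columns].corr(method="pearson").to_numpy()
    # Assumption: use each column pair once, skipping the diagonal of ones.
    upper = np.triu_indices_from(corr_orig, k=1)
    return float(np.mean(np.abs(corr_orig[upper] - corr_synth[upper])))


original = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 4, 6, 8]})
synthetic = pd.DataFrame({"x": [1, 2, 3, 4], "y": [8, 6, 4, 2]})
difference = avg_abs_pearson_difference_sketch(original, synthetic)
```

Here the original columns are perfectly positively correlated and the synthetic ones perfectly negatively correlated, so the difference takes its largest possible value for a single pair.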

utility_metrics.bivariate_correlations.compute_anova_correlation(data_frame, cols_cat)[source]

Compute the ANOVA correlation between the unordered/categorical features and the ordered/numerical features of a dataframe.

Parameters:
  • data_frame (DataFrame) – A pandas dataframe that contains the data.

  • cols_cat (Iterable[str]) – The names of the categorical columns.

Return type:

list[float]

Returns:

The results of the ANOVA correlation test of the dataframe.
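This page does not state which ANOVA statistic is returned (for example an F statistic, a p-value or an effect size). As a hedged illustration only, the sketch below computes eta squared (between-group sum of squares divided by total sum of squares), one common way to express a one-way ANOVA as an association score between 0 and 1. The function names and the choice of eta squared are assumptions.

```python
import numpy as np
import pandas as pd


def eta_squared_sketch(categories, values):
    """Eta squared of a numerical vector grouped by a categorical vector."""
    frame = pd.DataFrame(
        {"cat": list(categories), "val": np.asarray(values, dtype=float)}
    )
    grand_mean = frame["val"].mean()
    ss_total = ((frame["val"] - grand_mean) ** 2).sum()
    grouped = frame.groupby("cat")["val"]
    # Between-group sum of squares: group size times squared mean offset.
    ss_between = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)


def anova_correlations_sketch(data_frame, cols_cat):
    """One score per (categorical, numerical) column pair (assumed layout)."""
    cols_cat = list(cols_cat)
    cols_num = [c for c in data_frame.columns if c not in cols_cat]
    return [
        eta_squared_sketch(data_frame[cat], data_frame[num])
        for cat in cols_cat
        for num in cols_num
    ]


data = pd.DataFrame({"group": ["a", "a", "b", "b"], "score": [1.0, 1.0, 3.0, 3.0]})
scores = anova_correlations_sketch(data, ["group"])
```

In this toy dataframe the group fully determines the score, so the single pair gets the maximum score.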

utility_metrics.bivariate_correlations.compute_cramers_correlation_matrix(data_frame)[source]

Compute the Cramer’s V correlation matrix of a dataframe with discrete columns.

Parameters:

data_frame (DataFrame) – A discrete pandas dataframe that contains the data.

Return type:

DataFrame

Returns:

The correlation matrix of the dataframe.
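A pairwise matrix of this kind can be sketched as follows. This is an illustration, not the module's code: `_cramers_v` and `cramers_matrix_sketch` are hypothetical names, no bias correction is applied, and the diagonal is set to 1 by convention.

```python
import numpy as np
import pandas as pd


def _cramers_v(column_a, column_b):
    """Uncorrected Cramér's V between two discrete pandas Series."""
    observed = pd.crosstab(column_a, column_b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence of the two columns.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return 0.0  # a constant column carries no association information
    return float(np.sqrt(chi2 / (n * min_dim)))


def cramers_matrix_sketch(data_frame):
    """Symmetric matrix of pairwise Cramér's V values."""
    cols = list(data_frame.columns)
    matrix = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, col_a in enumerate(cols):
        for col_b in cols[i + 1:]:
            value = _cramers_v(data_frame[col_a], data_frame[col_b])
            matrix.loc[col_a, col_b] = value
            matrix.loc[col_b, col_a] = value
    return matrix


toy = pd.DataFrame({"a": list("xxyy"), "b": list("ppqq"), "c": list("uvuv")})
matrix = cramers_matrix_sketch(toy)
```

In the toy data, columns `a` and `b` determine each other completely, while `c` is independent of both.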

utility_metrics.bivariate_correlations.compute_cramers_v(vector_1, vector_2)[source]

Compute the Cramer’s V correlation between two vectors.

Parameters:
  • vector_1 (ndarray[tuple[Any, ...], dtype[Any]]) – First column from the dataframe.

  • vector_2 (ndarray[tuple[Any, ...], dtype[Any]]) – Second column from the dataframe.

Return type:

float

Returns:

The Cramer’s V correlation between vector_1 and vector_2.
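For reference, Cramér's V is the square root of chi-squared divided by n times (min(rows, columns) - 1) of the contingency table. The sketch below follows that textbook definition without bias correction. Whether the module applies a correction is not stated here, and `cramers_v_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def cramers_v_sketch(vector_1, vector_2):
    """Textbook Cramér's V between two discrete vectors (no bias correction)."""
    # Contingency table of observed joint counts.
    observed = pd.crosstab(pd.Series(vector_1), pd.Series(vector_2)).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the two vectors were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        return 0.0  # a constant vector carries no association information
    return float(np.sqrt(chi2 / (n * min_dim)))


perfect = cramers_v_sketch(["x", "x", "y", "y"], ["p", "p", "q", "q"])
independent = cramers_v_sketch(["x", "x", "y", "y"], ["p", "q", "p", "q"])
```

The first pair of vectors is perfectly associated and the second is independent, which marks the two ends of the 0 to 1 range.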

utility_metrics.bivariate_correlations.compute_pearson_correlation_matrix(data_frame)[source]

Compute the Pearson’s r correlation matrix of a dataframe with discrete columns.

Parameters:

data_frame (DataFrame) – A discrete pandas dataframe that contains the data.

Return type:

DataFrame

Returns:

The correlation matrix of the dataframe.

utility_metrics.bivariate_correlations.compute_pearson_rho(vector_1, vector_2)[source]

Compute the Pearson’s r correlation between two vectors.

Parameters:
  • vector_1 (ndarray[tuple[Any, ...], dtype[Any]]) – First column from the dataframe.

  • vector_2 (ndarray[tuple[Any, ...], dtype[Any]]) – Second column from the dataframe.

Return type:

float

Returns:

The Pearson’s r correlation between vector_1 and vector_2.
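Pearson's r is the covariance of the two vectors divided by the product of their standard deviations. A minimal NumPy sketch of that definition follows (`pearson_r_sketch` is a hypothetical name, and the module may well delegate to a library routine instead):

```python
import numpy as np


def pearson_r_sketch(vector_1, vector_2):
    """Pearson's r from centred sums of products."""
    x = np.asarray(vector_1, dtype=float)
    y = np.asarray(vector_2, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    # The 1/n factors of covariance and variances cancel out.
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float((x_centered * y_centered).sum() / denominator)


increasing = pearson_r_sketch([1, 2, 3], [2, 4, 6])
decreasing = pearson_r_sketch([1, 2, 3], [3, 2, 1])
```

The result agrees with the off-diagonal entry of `np.corrcoef(vector_1, vector_2)`. A constant vector makes the denominator zero, in which case the coefficient is undefined.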

utility_metrics.bivariate_correlations.discretize_matrix(data_frame, cols_to_discretize, nr_of_bins)[source]

Discretise a selection of columns of a dataframe.

Parameters:
  • data_frame (DataFrame) – Data.

  • cols_to_discretize (Iterable[Any]) – Columns to discretize.

  • nr_of_bins (int) – Number of bins to use for discretization.

Return type:

DataFrame

Returns:

The original dataframe with the selected columns discretized.
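The binning strategy is not specified on this page. As a sketch, assuming equal-width bins via `pandas.cut` (an assumption, as is the name `discretize_sketch`), the selected columns could be discretised like this:

```python
import pandas as pd


def discretize_sketch(data_frame, cols_to_discretize, nr_of_bins):
    """Replace the selected columns by integer bin indices (equal-width bins)."""
    result = data_frame.copy()  # leave the caller's dataframe untouched
    for col in cols_to_discretize:
        # labels=False yields the bin index of each value instead of intervals.
        result[col] = pd.cut(result[col], bins=nr_of_bins, labels=False)
    return result


people = pd.DataFrame(
    {"age": [0.0, 1.0, 2.0, 9.0, 10.0], "city": ["a", "b", "c", "a", "b"]}
)
binned = discretize_sketch(people, ["age"], 2)
```

Columns that are not listed, such as `city` here, pass through unchanged. Quantile-based binning with `pandas.qcut` would be an alternative when the values are heavily skewed.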

utility_metrics.bivariate_correlations.visualise_cramers_correlation_matrices(original_data, synthetic_data, cols_cat, n_bins=50)[source]

Plot the Cramer’s V correlation matrices of the original and synthetic data.

Parameters:
  • original_data (DataFrame) – A pandas dataframe that contains the original data.

  • synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

  • cols_cat (Container[str]) – The names of the categorical columns.

  • n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

tuple[DataFrame, DataFrame, Any]

Returns:

The correlation matrices for the plots and the plot.