utility_metrics.bivariate_correlations module
Computation of the utility metric that compares the bivariate correlations of the original and synthetic data. For the visualisation of correlations, only Cramer’s V is used. For the spider plot, three measures are calculated: Pearson’s r for numerical pairs, Cramer’s V for categorical pairs, and ANOVA for numerical-to-categorical pairs. The differences between the synthetic and original data are then averaged.
- utility_metrics.bivariate_correlations.avg_abs_correlation_difference(original_data, synthetic_data, cols_cat, n_bins=50)
Compute the total weighted average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe, using Pearson’s r, Cramer’s V and ANOVA.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Sequence[str]) – The names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
- Return type:
ndarray[tuple[Any, ...], dtype[float64]]
- Returns:
The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.
- utility_metrics.bivariate_correlations.avg_abs_correlation_difference_anova(original_data, synthetic_data, cols_cat)
Compute the average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe, using the ANOVA method between unordered (categorical) and ordered (numerical) columns.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Iterable[str]) – The names of the categorical columns.
- Return type:
float
- Returns:
The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.
- utility_metrics.bivariate_correlations.avg_abs_correlation_difference_cramers(original_data, synthetic_data, cols_cat, n_bins=50)
Compute the average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe, using Cramer’s V for categorical columns.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Container[str]) – The names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
- Return type:
float
- Returns:
The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.
- utility_metrics.bivariate_correlations.avg_abs_correlation_difference_pearson(original_data_num, synthetic_data_num)
Compute the average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe, using Pearson’s r for numerical columns.
- Parameters:
original_data_num (DataFrame) – A pandas dataframe that contains the numerical columns of the original data.
synthetic_data_num (DataFrame) – A pandas dataframe that contains the numerical columns of the synthetic data.
- Return type:
float
- Returns:
The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.
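To make the idea of an average absolute correlation difference concrete, here is a minimal sketch for the Pearson case. It is not the library's implementation: the helper name `mean_abs_corr_difference` is hypothetical, and averaging over the strictly upper triangle (each column pair counted once, diagonal ignored) is an assumption, since the documentation does not say which entries are averaged.

```python
import numpy as np
import pandas as pd


def mean_abs_corr_difference(original_num: pd.DataFrame, synthetic_num: pd.DataFrame) -> float:
    """Mean absolute difference between two Pearson correlation matrices.

    Hypothetical helper for illustration only.
    """
    corr_original = original_num.corr(method="pearson").to_numpy()
    corr_synthetic = synthetic_num.corr(method="pearson").to_numpy()
    # Assumption: average over the strictly upper triangle, so every pair of
    # columns is counted once and the trivial diagonal (always 1) is skipped.
    upper = np.triu_indices_from(corr_original, k=1)
    return float(np.mean(np.abs(corr_original[upper] - corr_synthetic[upper])))


# Toy data: "b" rises with "a" in the original but falls with "a" in the synthetic set.
original = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})
synthetic = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [8.0, 6.0, 4.0, 2.0]})
score = mean_abs_corr_difference(original, synthetic)
```

In this toy case the correlation flips from 1 to -1, so the score reaches its maximum of 2. A score of 0 would mean the synthetic data reproduces every pairwise Pearson correlation exactly.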
- utility_metrics.bivariate_correlations.compute_anova_correlation(data_frame, cols_cat)
Compute the ANOVA correlation between the unordered (categorical) features and the ordered (numerical) features of a dataframe.
- Parameters:
data_frame (DataFrame) – A pandas dataframe that contains the data.
cols_cat (Iterable[str]) – The names of the categorical columns.
- Return type:
list[float]
- Returns:
The results of the ANOVA correlation test of the dataframe.
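The documentation does not state which ANOVA statistic is returned. As one plausible reading, the sketch below computes the correlation ratio (eta squared), the share of a numerical column's variance explained by the groups of a categorical column in a one-way ANOVA. The names `correlation_ratio` and `anova_strengths` are hypothetical, and the library may well return a different quantity, such as an F statistic or a p-value.

```python
import pandas as pd


def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Eta squared: between-group sum of squares over total sum of squares."""
    grand_mean = values.mean()
    total_ss = ((values - grand_mean) ** 2).sum()
    if total_ss == 0:
        # A constant numerical column carries no variance to explain.
        return 0.0
    grouped = values.groupby(categories)
    between_ss = (grouped.count() * (grouped.mean() - grand_mean) ** 2).sum()
    return float(between_ss / total_ss)


def anova_strengths(data_frame: pd.DataFrame, cols_cat) -> list:
    """One value per (categorical, numerical) column pair (assumed ordering)."""
    cols_cat = list(cols_cat)
    cols_num = [col for col in data_frame.columns if col not in cols_cat]
    return [
        correlation_ratio(data_frame[cat], data_frame[num])
        for cat in cols_cat
        for num in cols_num
    ]


frame = pd.DataFrame(
    {
        "group": ["x", "x", "y", "y"],
        "separated": [1.0, 1.0, 3.0, 3.0],  # fully determined by the group
        "mixed": [1.0, 3.0, 1.0, 3.0],      # same mean in both groups
    }
)
strengths = anova_strengths(frame, ["group"])
```

Here `separated` is fully explained by `group`, giving a ratio of 1, while `mixed` has identical group means, giving 0.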
- utility_metrics.bivariate_correlations.compute_cramers_correlation_matrix(data_frame)
Compute the Cramer’s V correlation matrix of a dataframe with discrete columns.
- Parameters:
data_frame (DataFrame) – A discrete pandas dataframe that contains the data.
- Return type:
DataFrame
- Returns:
The correlation matrix of the dataframe.
- utility_metrics.bivariate_correlations.compute_cramers_v(vector_1, vector_2)
Compute the Cramer’s V correlation between two vectors.
- Parameters:
vector_1 (ndarray[tuple[Any, ...], dtype[Any]]) – First column from the dataframe.
vector_2 (ndarray[tuple[Any, ...], dtype[Any]]) – Second column from the dataframe.
- Return type:
float
- Returns:
The Cramer’s V correlation between vector_1 and vector_2.
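For reference, Cramer’s V is the square root of the chi-squared statistic divided by n times (min(rows, columns) - 1) of the contingency table. The sketch below applies that textbook formula. The function name `cramers_v` is hypothetical, and whether the library applies a bias correction is not stated in the documentation, so none is applied here.

```python
import numpy as np
import pandas as pd


def cramers_v(vector_1, vector_2) -> float:
    """Uncorrected Cramer's V between two discrete vectors (illustrative)."""
    # Contingency table of observed counts; only observed categories appear,
    # so every row and column sum is positive.
    observed = pd.crosstab(pd.Series(vector_1), pd.Series(vector_2)).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the marginals over n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        # One of the vectors is constant, so no association can be measured.
        return 0.0
    return float(np.sqrt(chi2 / (n * min_dim)))


perfect = cramers_v(["x", "x", "y", "y"], ["p", "p", "q", "q"])
independent = cramers_v(["x", "x", "y", "y"], ["p", "q", "p", "q"])
```

The first call pairs every `x` with `p` and every `y` with `q`, giving the maximum value of 1. The second spreads the categories evenly, giving 0.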
- utility_metrics.bivariate_correlations.compute_pearson_correlation_matrix(data_frame)
Compute the Pearson’s r correlation matrix of a dataframe with discrete columns.
- Parameters:
data_frame (DataFrame) – A discrete pandas dataframe that contains the data.
- Return type:
DataFrame
- Returns:
The correlation matrix of the dataframe.
- utility_metrics.bivariate_correlations.compute_pearson_rho(vector_1, vector_2)
Compute the Pearson’s r correlation between two vectors.
- Parameters:
vector_1 (ndarray[tuple[Any, ...], dtype[Any]]) – First column from the dataframe.
vector_2 (ndarray[tuple[Any, ...], dtype[Any]]) – Second column from the dataframe.
- Return type:
float
- Returns:
The Pearson’s r correlation between vector_1 and vector_2.
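As a reminder of what is being computed, Pearson’s r is the covariance of the two vectors divided by the product of their standard deviations. The sketch below spells this out with NumPy. The name `pearson_r` is hypothetical, and the library's own function may handle constant vectors or missing values differently.

```python
import numpy as np


def pearson_r(vector_1, vector_2) -> float:
    """Pearson's r from centred dot products (illustrative helper)."""
    x = np.asarray(vector_1, dtype=float)
    y = np.asarray(vector_2, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    # The 1/n factors of covariance and variances cancel, leaving dot products.
    numerator = x_centered @ y_centered
    denominator = np.sqrt((x_centered @ x_centered) * (y_centered @ y_centered))
    return float(numerator / denominator)


rising = pearson_r([1, 2, 3], [2, 4, 6])
falling = pearson_r([1, 2, 3], [6, 4, 2])
```

`rising` is 1 and `falling` is -1, the two extremes of the coefficient. The same value can be read from `np.corrcoef(x, y)[0, 1]`.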
- utility_metrics.bivariate_correlations.discretize_matrix(data_frame, cols_to_discretize, nr_of_bins)
Discretise the selected columns of a dataframe.
- Parameters:
data_frame (DataFrame) – Data.
cols_to_discretize (Iterable[Any]) – Columns to discretize.
nr_of_bins (int) – Number of bins to use for discretization.
- Return type:
DataFrame
- Returns:
The dataframe with the selected columns discretized.
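Discretisation lets numerical columns take part in Cramer’s V, which is defined for discrete values. A minimal sketch of the idea with `pandas.cut` follows. The name `discretize_columns` is hypothetical, and equal-width bins are an assumption: the documentation does not say whether the library bins by width or by quantile.

```python
import pandas as pd


def discretize_columns(data_frame: pd.DataFrame, cols_to_discretize, nr_of_bins: int) -> pd.DataFrame:
    """Replace each selected column by its bin index (illustrative helper)."""
    result = data_frame.copy()  # leave the caller's dataframe untouched
    for col in cols_to_discretize:
        # Assumption: equal-width bins. labels=False yields integer bin
        # indices instead of Interval objects.
        result[col] = pd.cut(result[col], bins=nr_of_bins, labels=False)
    return result


frame = pd.DataFrame({"age": [0.0, 1.0, 9.0, 10.0], "city": ["a", "b", "a", "b"]})
binned = discretize_columns(frame, ["age"], nr_of_bins=2)
```

With two bins over the range 0 to 10, the ages 0 and 1 fall into bin 0 and the ages 9 and 10 into bin 1. The `city` column is left as it was.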
- utility_metrics.bivariate_correlations.visualise_cramers_correlation_matrices(original_data, synthetic_data, cols_cat, n_bins=50)
Plot the Cramer’s V correlation matrices of the original and synthetic data.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Container[str]) – The names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
- Return type:
tuple[DataFrame, DataFrame, Any]
- Returns:
The correlation matrices for the plots and the plot.