utility_metrics.bivariate_distributions module
Computation of the utility metric that compares the bivariate distributions of the original and synthetic data.
- utility_metrics.bivariate_distributions.avg_abs_correlation_difference(original_data, synthetic_data, cols_cat, n_bins=50)
Compute the average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Container[str]) – A list containing the names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
- Return type:
ndarray[tuple[Any, ...], dtype[Any]]
- Returns:
The average absolute difference between the correlations in the original dataframe and the correlations in the synthetic dataframe.
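The final averaging step can be illustrated on two correlation matrices that have already been computed. The sketch below is not the library's implementation: the helper name `avg_abs_difference_sketch` is hypothetical, and averaging over every entry (diagonal included) is an assumption made here for simplicity.

```python
import numpy as np
import pandas as pd


def avg_abs_difference_sketch(corr_original, corr_synthetic):
    """Mean absolute entry-wise difference between two correlation matrices."""
    # Align the synthetic matrix on the original's labels so that entries
    # are compared column-for-column.
    corr_synthetic = corr_synthetic.loc[corr_original.index, corr_original.columns]
    # Assumption: every entry, including the diagonal, enters the average.
    return np.mean(np.abs(corr_original.to_numpy() - corr_synthetic.to_numpy()))


cols = ["a", "b"]
corr_orig = pd.DataFrame([[1.0, 0.8], [0.8, 1.0]], index=cols, columns=cols)
corr_syn = pd.DataFrame([[1.0, 0.5], [0.5, 1.0]], index=cols, columns=cols)
score = avg_abs_difference_sketch(corr_orig, corr_syn)  # about 0.15
```

A score of 0 means the two matrices are identical; larger values indicate that the synthetic data reproduces the pairwise associations less faithfully.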
- utility_metrics.bivariate_distributions.compute_cramers_correlation_matrix(data_frame)
Compute the correlation matrix of a dataframe with discrete columns.
- Parameters:
data_frame (DataFrame) – A discrete pandas dataframe that contains the data.
- Return type:
DataFrame
- Returns:
The correlation matrix of the dataframe.
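A pairwise matrix of this kind can be assembled by evaluating an association measure on every pair of columns. The sketch below is only an illustration under stated assumptions: the function name `association_matrix_sketch` and the parameter `pairwise_fn` are hypothetical, the measure is passed in so that any Cramér's V implementation can be plugged in, and setting the diagonal to 1 is a choice made here.

```python
import numpy as np
import pandas as pd


def association_matrix_sketch(data_frame, pairwise_fn):
    """Symmetric matrix of pairwise associations, labelled by column name."""
    cols = list(data_frame.columns)
    # Start from the identity: each column is taken to be perfectly
    # associated with itself (an assumption of this sketch).
    matrix = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, col_i in enumerate(cols):
        # Only visit pairs above the diagonal and mirror the result,
        # because the measure is assumed to be symmetric.
        for col_j in cols[i + 1:]:
            value = pairwise_fn(data_frame[col_i], data_frame[col_j])
            matrix.loc[col_i, col_j] = value
            matrix.loc[col_j, col_i] = value
    return matrix
```

Passing the measure as an argument keeps the loop independent of how the association itself is computed.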
- utility_metrics.bivariate_distributions.compute_cramers_v(vector_1, vector_2)
Compute the Cramer’s V correlation between two vectors.
- Parameters:
vector_1 (ndarray[tuple[Any, ...], dtype[Any]]) – First column from the dataframe.
vector_2 (ndarray[tuple[Any, ...], dtype[Any]]) – Second column from the dataframe.
- Return type:
float
- Returns:
The Cramer’s V correlation between vector_1 and vector_2.
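For reference, the textbook definition of Cramér's V is sqrt(chi2 / (n * (min(r, c) - 1))), where chi2 is the chi-squared statistic of the r-by-c contingency table and n is the number of observations. The sketch below implements that uncorrected formula with NumPy and pandas; the name `cramers_v_sketch` is hypothetical, and whether the library applies a bias correction or handles constant vectors the same way is not stated in this documentation.

```python
import numpy as np
import pandas as pd


def cramers_v_sketch(vector_1, vector_2):
    """Illustrative Cramér's V without bias correction; not the library's code."""
    # Contingency table of observed counts for the two discrete vectors.
    observed = pd.crosstab(np.asarray(vector_1), np.asarray(vector_2)).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the marginals over n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    # Normalise by n * (min(rows, cols) - 1) so the result lies in [0, 1].
    min_dim = min(observed.shape) - 1
    if min_dim == 0:
        # A constant vector carries no association (a choice made in this sketch).
        return 0.0
    return float(np.sqrt(chi2 / (n * min_dim)))
```

Two identical vectors with at least two distinct values give a value of 1, and two vectors whose contingency table matches the independence expectation exactly give 0.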
- utility_metrics.bivariate_distributions.discretize_matrix(data_frame, cols_to_discretize, nr_of_bins)
Discretize the selected columns of a dataframe.
- Parameters:
data_frame (DataFrame) – Data.
cols_to_discretize (Iterable[Any]) – Columns to discretize.
nr_of_bins (int) – Number of bins to use for discretization.
- Return type:
DataFrame
- Returns:
The original dataframe with the selected columns discretized.
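One common way to discretize numerical columns is equal-width binning with `pandas.cut`. The sketch below assumes that approach; the binning rule actually used by the library is not documented here, the name `discretize_sketch` is hypothetical, and working on a copy of the input is a choice of this example.

```python
import pandas as pd


def discretize_sketch(data_frame, cols_to_discretize, nr_of_bins):
    """Illustrative equal-width binning; the library's binning rule may differ."""
    result = data_frame.copy()  # leave the caller's dataframe untouched
    for col in cols_to_discretize:
        # pd.cut splits the column's range into nr_of_bins equal-width intervals;
        # labels=False returns the integer bin index instead of an Interval.
        result[col] = pd.cut(result[col], bins=nr_of_bins, labels=False)
    return result


df = pd.DataFrame({"age": [18, 25, 40, 63, 90], "sex": ["f", "m", "f", "m", "f"]})
binned = discretize_sketch(df, ["age"], nr_of_bins=3)
```

Columns that are not listed, such as `sex` here, pass through unchanged, so the result can be fed directly to a measure for discrete data.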
- utility_metrics.bivariate_distributions.visualise_cramers_correlation_matrices(original_data, synthetic_data, cols_cat, n_bins=50)
Plot the correlation matrices of the original and synthetic data.
- Parameters:
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Container[str]) – A list containing the names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.
- Return type:
tuple[DataFrame, DataFrame, Any]
- Returns:
The two correlation matrices used for the plots, followed by the plot itself.