utility_metrics.univariate_distributions module

Computation of the utility metric that compares the univariate distributions of the original and synthetic data.

utility_metrics.univariate_distributions.average_hellinger_distance(original_data, synthetic_data, cols_cat, n_bins=50)

Compute the average Hellinger distance between an original and synthetic dataframe.

Parameters:

original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols_cat (Iterable[Any]) – A sequence of columns in the dataframes that are categorical.
n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

float

Returns:

The average Hellinger distance.
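To illustrate the idea behind this metric, here is a minimal standalone sketch that uses only the Python standard library. It is not the library's implementation: plain dictionaries of lists stand in for the dataframes, and the helper names (`hellinger`, `distribution`, `bin_indices`, `average_distance`) are hypothetical. It also assumes equal-width bins over the combined range of both datasets and the standard Hellinger normalisation with a factor of 1/sqrt(2); the actual module may differ on both points.

```python
import math
from collections import Counter


def hellinger(p, q):
    # Standard Hellinger distance; the 1/sqrt(2) factor keeps it in [0, 1].
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def distribution(values, categories):
    # Relative frequency of each category, in the order given by `categories`.
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]


def bin_indices(values, lo, hi, n_bins):
    # Assumption: equal-width bins on [lo, hi]; the maximum falls in the last bin.
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def average_distance(original, synthetic, cols_cat, n_bins=50):
    # `original` and `synthetic` map column names to lists of values.
    distances = []
    for col in original:
        orig_col, synth_col = original[col], synthetic[col]
        if col not in cols_cat:
            # Numerical column: discretise both datasets with shared bin edges.
            lo = min(min(orig_col), min(synth_col))
            hi = max(max(orig_col), max(synth_col))
            orig_col = bin_indices(orig_col, lo, hi, n_bins)
            synth_col = bin_indices(synth_col, lo, hi, n_bins)
        # Use the same support for both distributions so the vectors align.
        categories = sorted(set(orig_col) | set(synth_col))
        distances.append(
            hellinger(distribution(orig_col, categories), distribution(synth_col, categories))
        )
    return sum(distances) / len(distances)


original = {"colour": ["red", "red", "blue", "blue"], "age": [20, 30, 40, 50]}
identical = {"colour": ["red", "red", "blue", "blue"], "age": [20, 30, 40, 50]}
skewed = {"colour": ["red", "red", "red", "red"], "age": [20, 30, 40, 50]}

same = average_distance(original, identical, cols_cat=["colour"], n_bins=4)
shifted = average_distance(original, skewed, cols_cat=["colour"], n_bins=4)
```

In this sketch, `same` is 0.0 because every column has an identical distribution, while `shifted` is positive because the `colour` column differs. Lower values indicate that the synthetic data reproduce the univariate distributions more closely.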

utility_metrics.univariate_distributions.compute_hellinger_distance(distribution_1, distribution_2)

Function that computes the Hellinger distance between two distributions.

Parameters:

distribution_1 (ndarray[tuple[Any, ...], dtype[Any]]) – A vector of probabilities that sums to 1.
distribution_2 (ndarray[tuple[Any, ...], dtype[Any]]) – A vector of probabilities that sums to 1, defined over the same categories as distribution_1.

Return type:

float

Returns:

The Hellinger distance between the distributions.
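For reference, the standard Hellinger distance between two discrete distributions P and Q is (1/sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2), which ranges from 0 (identical) to 1 (disjoint support). The following is a minimal sketch of that formula in plain Python. The function name `hellinger_distance` is hypothetical, lists stand in for NumPy arrays, and whether the module applies the same 1/sqrt(2) normalisation is an assumption.

```python
import math


def hellinger_distance(p, q):
    """Standard Hellinger distance between two aligned probability vectors."""
    if len(p) != len(q):
        raise ValueError("both distributions must have the same length")
    squared_diff = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared_diff / 2)


identical = hellinger_distance([0.5, 0.5], [0.5, 0.5])  # identical distributions
disjoint = hellinger_distance([1.0, 0.0], [0.0, 1.0])   # no overlapping mass
partial = hellinger_distance([0.7, 0.3], [0.5, 0.5])    # somewhere in between
```

Both vectors must list probabilities for the same categories in the same order; otherwise the element-wise comparison is meaningless.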

utility_metrics.univariate_distributions.discrete_vector_to_distribution(vector, categories=None)

Return the distribution of discrete values in the vector. If no list of categories is provided, the set of unique values in the vector is taken as the list of categories.

Parameters:

categories (Sequence[Any] | None) – List of categories that can appear in the vector.
vector (ndarray[tuple[Any, ...], dtype[Any]]) – Input vector.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[Any]], ndarray[tuple[Any, ...], dtype[Any]]]

Returns:

An array of all the unique values and an array of the respective probabilities.
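The behaviour described above can be sketched as follows, using only the standard library. This is an illustration rather than the module's code: the function name `to_distribution` is hypothetical, lists stand in for NumPy arrays, and sorting the unique values when no categories are given is an assumption about ordering.

```python
from collections import Counter


def to_distribution(vector, categories=None):
    """Return (categories, probabilities) for the discrete values in `vector`."""
    counts = Counter(vector)
    if categories is None:
        # Assumption: fall back to the sorted unique values of the vector.
        categories = sorted(counts)
    total = len(vector)
    # Categories that never occur get probability 0.
    probabilities = [counts[c] / total for c in categories]
    return list(categories), probabilities


cats, probs = to_distribution(["a", "b", "a", "c"], categories=["a", "b", "c", "d"])
inferred_cats, inferred_probs = to_distribution(["a", "b", "a", "c"])
```

Passing an explicit list of categories is useful when two vectors must be compared: both distributions are then defined over the same categories, including ones that appear in only one of the vectors. In this sketch the probabilities sum to 1 only if every value in the vector is listed in `categories`.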

utility_metrics.univariate_distributions.discretize_vector(vector, nr_of_bins)

Discretize a vector based on a number of bins.

Parameters:

vector (ndarray[tuple[Any, ...], dtype[Any]]) – Input vector.
nr_of_bins (int) – Number of bins to use.

Return type:

ndarray[tuple[Any, ...], dtype[Any]]

Returns:

The discretized vector.
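The documentation does not state the binning strategy. As one possible interpretation, the sketch below assumes equal-width bins between the minimum and maximum of the vector and returns the bin index of each value. The function name `discretize` is hypothetical and plain lists stand in for NumPy arrays.

```python
def discretize(vector, nr_of_bins):
    """Map each value to the index of its equal-width bin (assumed strategy)."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        # All values are equal, so everything falls in a single bin.
        return [0 for _ in vector]
    width = (hi - lo) / nr_of_bins
    # Clamp so that the maximum value lands in the last bin, not one past it.
    return [min(int((x - lo) / width), nr_of_bins - 1) for x in vector]


bins = discretize([0.0, 2.5, 5.0, 7.5, 10.0], nr_of_bins=4)
```

Here the bin width is 2.5, so the values map to bins 0, 1, 2, 3 and 3; the maximum value shares the last bin with 7.5 because of the clamp.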

utility_metrics.univariate_distributions.plot_categorical_bar(vec1, vec2, col, original_data, synthetic_data)

Plot the distributions of a categorical column for the original and synthetic data.

Parameters:

vec1 (ndarray[tuple[Any, ...], dtype[Any]]) – An array containing the discretized original data column.
vec2 (ndarray[tuple[Any, ...], dtype[Any]]) – An array containing the discretized synthetic data column.
col (str) – Selected column for which the distribution is plotted.
original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.

Return type:

tuple[dict[str, float], Figure]

Returns:

A dictionary containing the counts per dataset, and a plot.

utility_metrics.univariate_distributions.plot_numerical_bar(vec1_discretized, vec2_discretized, col, n_bins=50)

Plot the distributions of a numerical column for the original and synthetic data.

Parameters:

vec1_discretized (ndarray[tuple[Any, ...], dtype[Any]]) – An array containing the discretised original data column.
vec2_discretized (ndarray[tuple[Any, ...], dtype[Any]]) – An array containing the discretised synthetic data column.
col (str) – Selected column for which the distribution is plotted.
n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

tuple[list[list[int]], Figure]

Returns:

The counts per dataset and a plot.

utility_metrics.univariate_distributions.visualise_distributions(original_data, synthetic_data, cols, cat_col_names, n_bins=50)

Plot the probability distributions of the selected columns for the original and synthetic data.

Parameters:

original_data (DataFrame) – A pandas dataframe that contains the original data.
synthetic_data (DataFrame) – A pandas dataframe that contains the synthetic data.
cols (Sequence[str]) – List of the selected columns for which the distributions are plotted.
cat_col_names (Sequence[str]) – Names of the categorical columns.
n_bins (int) – The number of bins that is used to discretise the numerical columns.

Return type:

dict[str, Figure]

Returns:

A dictionary mapping each column to an overview of the data in that column and a dictionary mapping each column to a respective plot.