secure_inner_join.lsh module
This implements Locality-Sensitive Hashing for dates and zip2-codes.
- secure_inner_join.lsh.encode(day, month, year, zip4_code)
Encodes day, month, year and zip2 as a tuple.
- Parameters:
day (int) – day of birth
month (int) – month of birth
year (int) – year of birth
zip4_code (int) – the four digits of the postal code
- Return type:
tuple[int, int, int, int]
- Returns:
encoded representation
- secure_inner_join.lsh.get_hyper_planes(amount=2000, seed=42, mask=False)
Construct a specified number of hyper planes with a set seed. We assume the following order: (day, month, year, zip2-code).
- Parameters:
amount (int) – number of hyper planes to construct
seed (int) – seed to use for the random generator
mask (bool) – set to true to generate a bit mask to use for masking
- Return type:
ndarray[Any, dtype[int64]] | tuple[ndarray[Any, dtype[int64]], bitarray]
- Returns:
array containing the random hyper planes
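The role of the seed can be illustrated with a minimal sketch. It is not the module's implementation: `sample_planes` and `RANGES` are hypothetical names, the ranges are the ones documented for `lsh_hash`, and Python's standard `random` module stands in for whatever generator the module actually uses. The sketch only shows that two runs with the same seed yield identical hyper planes, which is what makes hashes computed separately comparable.

```python
import random

# Documented sampling ranges, in the order (day, month, year, zip2-code).
RANGES = [(0, 62), (0, 12), (0, 100), (10, 100)]


def sample_planes(amount, seed):
    """Hypothetical stand-in: draw `amount` integer points, one coordinate per range."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    return [tuple(rng.randrange(lo, hi) for lo, hi in RANGES) for _ in range(amount)]


planes_a = sample_planes(5, seed=42)
planes_b = sample_planes(5, seed=42)  # same seed, so the same planes as planes_a
```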
- secure_inner_join.lsh.lsh_hash(day, month, year, zip4_code, hyper_planes, bit_mask=None)
Computes a hash encoding for a given encoded input, given a collection of hyperplanes.
- Parameters:
day (int) – day of birth
month (int) – month of birth
year (int) – year of birth
zip4_code (int) – the four digits of the postal code
hyper_planes (ndarray[Any, dtype[int64]]) – $n$ hyperplanes sampled from $[0,62)\times[0,12)\times[0,100)\times[10,100)$
bit_mask (bitarray | None) – masking to apply to the hashing
- Return type:
bitarray
- Returns:
an encoded hash; the first $n$ bits belong to day, the second $n$ bits belong to month, etc.
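The bit layout described above can be made concrete with a toy sketch. This is an assumption about how such a hash could be built, not the module's code: each hyperplane is treated as one threshold per component, and a bit records whether the component exceeds that threshold. `toy_planes` and `toy_hash` are hypothetical names, and plain lists of 0/1 stand in for `bitarray`.

```python
import random

# Documented sampling ranges, in the order (day, month, year, zip2-code).
RANGES = [(0, 62), (0, 12), (0, 100), (10, 100)]


def toy_planes(n, seed=42):
    """Draw n points; coordinate k of each point is a threshold for component k."""
    rng = random.Random(seed)
    return [tuple(rng.randrange(lo, hi) for lo, hi in RANGES) for _ in range(n)]


def toy_hash(point, planes):
    """Return 4*n bits: block k holds one bit per plane for component k."""
    bits = []
    for k, value in enumerate(point):
        bits.extend(1 if value > plane[k] else 0 for plane in planes)
    return bits


planes = toy_planes(8)
# Already-encoded input (day, month, year, zip2); the values are illustrative.
h = toy_hash((15, 6, 85, 35), planes)
```

With 8 planes the result has 32 bits: positions 0 to 7 depend only on the day, positions 8 to 15 only on the month, and so on.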
- secure_inner_join.lsh.weighted_hamming_distance(hash_1, hash_2)
Computes a distance score between two hashes. The score represents the actual distance between the two encodings, provided the number of buckets is large enough. If the score is approximately 1, we expect at most one element to be off by one.
- Parameters:
hash_1 (bitarray) – first hash
hash_2 (bitarray) – second hash
- Return type:
tuple[float, tuple[float, float, float, float]]
- Returns:
an x-off distance score, and a tuple of x-off distances per (day, month, year, zip2)
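Why a Hamming distance over such hashes can approximate the actual distance is easiest to see in a small, self-contained sketch. Everything here is an assumption made for illustration, not the module's implementation: `toy_hash` and `toy_distance` are hypothetical names, thresholds are evenly spaced rather than random so the result is deterministic, each component's count of differing bits is rescaled by (range width / bits per component), and the total score is taken to be the sum of the per-component distances.

```python
# Documented sampling ranges, in the order (day, month, year, zip2-code).
RANGES = [(0, 62), (0, 12), (0, 100), (10, 100)]
N = 200  # thresholds (bits) per component


def toy_hash(point):
    """N bits per component; bit j is 1 if the value exceeds the j-th threshold."""
    bits = []
    for value, (lo, hi) in zip(point, RANGES):
        step = (hi - lo) / N  # evenly spaced thresholds, for a deterministic demo
        bits.extend(1 if value > lo + j * step else 0 for j in range(N))
    return bits


def toy_distance(hash_1, hash_2):
    """Return (total, per_component), each rescaled to the component's own units."""
    per_component = []
    for k, (lo, hi) in enumerate(RANGES):
        block_1 = hash_1[k * N:(k + 1) * N]
        block_2 = hash_2[k * N:(k + 1) * N]
        differing = sum(a != b for a, b in zip(block_1, block_2))
        per_component.append(differing * (hi - lo) / N)
    return sum(per_component), tuple(per_component)


# Two encoded records that differ only in the day of birth, by one day.
total, parts = toy_distance(toy_hash((15, 6, 85, 35)), toy_hash((16, 6, 85, 35)))
```

Only thresholds that fall between the two day values produce differing bits, so the rescaled count for the day component lands close to 1 while the other components stay at 0. The estimate tightens as `N` grows, which matches the remark that the score reflects the actual distance when the number of buckets is large enough.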