secure_inner_join.lsh module
This implements Locality-Sensitive Hashing for dates and zip2-codes.
- secure_inner_join.lsh.encode(day, month, year, zip4_code)
Encodes day, month, year and zip2 to a tuple.
- Parameters:
day (int) – day of birth
month (int) – month of birth
year (int) – year of birth
zip4_code (int) – the four digits of the postal code
- Return type:
tuple[int, int, int, int]
- Returns:
encoded representation
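The source does not spell out the encoding itself. The ranges listed for the hyperplanes in `lsh_hash` ($[0,12)$ for month, $[0,100)$ for year, $[10,100)$ for zip2) suggest one possible reading, sketched below. The function name `encode_sketch` and every transformation in it are assumptions made for illustration, not the library's confirmed behaviour; in particular, the day is passed through unchanged because the source gives no basis for its $[0,62)$ range.

```python
from typing import Tuple


def encode_sketch(day: int, month: int, year: int, zip4_code: int) -> Tuple[int, int, int, int]:
    """Hypothetical encoding; NOT the confirmed behaviour of secure_inner_join.lsh.encode."""
    month_index = month - 1    # assumption: map 1..12 to 0..11
    year_2digit = year % 100   # assumption: keep the last two digits of the year
    zip2 = zip4_code // 100    # assumption: zip2 is the first two digits of the four-digit code
    # Assumption: the day is left unchanged; the real function may transform it.
    return (day, month_index, year_2digit, zip2)


print(encode_sketch(15, 3, 1987, 2595))
```

The output order follows the (day, month, year, zip2) order used throughout this module.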
- secure_inner_join.lsh.get_hyper_planes(amount=2000, seed=42, mask=False)
Constructs the specified number of hyperplanes using a fixed seed. The components are assumed to be in the following order: (day, month, year, zip2-code).
- Parameters:
amount (int) – number of hyperplanes to construct
seed (int) – seed to use for the random generator
mask (bool) – set to True to generate a bit mask to use for masking
- Return type:
ndarray[Any, dtype[int64]] | tuple[ndarray[Any, dtype[int64]], bitarray]
- Returns:
array containing the random hyperplanes
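As a rough illustration of seeded hyperplane generation, the sketch below draws integer thresholds per component with NumPy. The name `get_hyper_planes_sketch`, the `(amount, 4)` array layout, the uniform sampling, and the omission of the `mask` option are all assumptions; the per-component ranges are taken from the `hyper_planes` description of `lsh_hash`.

```python
import numpy as np

# Half-open ranges per component, in the order (day, month, year, zip2).
BOUNDS = [(0, 62), (0, 12), (0, 100), (10, 100)]


def get_hyper_planes_sketch(amount: int = 2000, seed: int = 42) -> np.ndarray:
    """Hypothetical stand-in for get_hyper_planes (without the mask option)."""
    rng = np.random.default_rng(seed)  # fixed seed makes the planes reproducible
    columns = [
        rng.integers(low, high, size=amount, dtype=np.int64)  # high is exclusive
        for low, high in BOUNDS
    ]
    # Assumption: one row per hyperplane, one column per component.
    return np.stack(columns, axis=1)


planes = get_hyper_planes_sketch(amount=5)
print(planes.shape)
```

Because both parties in a join must hash against the same planes, the fixed seed is the important part: calling the function twice with the same arguments yields identical arrays.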
- secure_inner_join.lsh.lsh_hash(day, month, year, zip4_code, hyper_planes, bit_mask=None)
Computes a hash encoding for a given encoded input, given a collection of hyperplanes.
- Parameters:
day (int) – day of birth
month (int) – month of birth
year (int) – year of birth
zip4_code (int) – the four digits of the postal code
hyper_planes (ndarray[Any, dtype[int64]]) – $n$ hyperplanes sampled from $[0,62)\times[0,12)\times[0,100)\times[10,100)$
bit_mask (bitarray | None) – masking to apply to the hashing
- Return type:
bitarray
- Returns:
an encoded hash: the first $n$ bits belong to day, the second $n$ bits belong to month, etc.
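The bit layout described above (first $n$ bits for day, next $n$ for month, and so on) can be illustrated with a small threshold-based sketch. It assumes each hyperplane contributes one bit per component by comparing the encoded value against that plane's threshold, that the planes are stored as an `(n, 4)` array, and that a NumPy boolean array stands in for the `bitarray` result; `lsh_hash_sketch` is a hypothetical name, and masking is left out.

```python
import numpy as np


def lsh_hash_sketch(encoded, hyper_planes: np.ndarray) -> np.ndarray:
    """Hypothetical hash: one bit per (component, hyperplane) pair.

    encoded      -- (day, month, year, zip2) values, already encoded
    hyper_planes -- array of shape (n, 4) with one threshold per component
    """
    values = np.asarray(encoded, dtype=np.int64)
    # Broadcasting compares every plane with the four values: shape (n, 4).
    bits = hyper_planes < values
    # Transpose to (4, n) and flatten so that all day bits come first,
    # then all month bits, then year, then zip2.
    return bits.T.reshape(-1)


planes = np.array([[10, 5, 50, 50],
                   [20, 6, 60, 60]], dtype=np.int64)
print(lsh_hash_sketch((15, 5, 87, 25), planes).astype(int))
```

With $n = 2$ planes the result has $4n = 8$ bits; nearby values fall on the same side of most thresholds and therefore share most bits.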
- secure_inner_join.lsh.weighted_hamming_distance(hash_1, hash_2)
If the score is approximately 1, then we expect at most one element to be one-off. The score represents the actual distance between two encodings if the number of buckets is large enough.
- Parameters:
hash_1 (bitarray) – first hash
hash_2 (bitarray) – second hash
- Return type:
tuple[float, tuple[float, float, float, float]]
- Returns:
an x-off distance score, and a tuple of x-off distances per (day, month, year, zip2)
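One way to see why a Hamming distance over such hashes can approximate an "x-off" distance: if the thresholds for a component are spread uniformly over a range of width $w$, two values that differ by $d$ disagree on roughly a fraction $d/w$ of the bits, so the fraction of differing bits times $w$ estimates $d$. The sketch below applies that idea per component. The name `weighted_hamming_distance_sketch`, the weights (taken from the range widths listed for `hyper_planes`), the use of NumPy boolean arrays instead of `bitarray`, and the choice to sum the four estimates into the overall score are all assumptions, not the library's confirmed formula.

```python
import numpy as np

# Assumed range widths for (day, month, year, zip2): 62-0, 12-0, 100-0, 100-10.
FIELD_WIDTHS = (62, 12, 100, 90)


def weighted_hamming_distance_sketch(hash_1, hash_2, widths=FIELD_WIDTHS):
    """Hypothetical per-component distance estimate from two 4n-bit hashes."""
    # Split each hash into four equal segments: day, month, year, zip2.
    h1 = np.asarray(hash_1, dtype=bool).reshape(4, -1)
    h2 = np.asarray(hash_2, dtype=bool).reshape(4, -1)
    # Fraction of differing bits per segment.
    fraction = (h1 != h2).mean(axis=1)
    # Scale each fraction by the width of the range its thresholds cover.
    per_field = tuple(float(f * w) for f, w in zip(fraction, widths))
    # Assumption: the overall score is the sum of the per-component estimates.
    return float(sum(per_field)), per_field


a = [1, 0, 0, 0, 1, 1, 0, 0]
b = [1, 1, 0, 0, 1, 1, 0, 0]
print(weighted_hamming_distance_sketch(a, b))
```

The estimate only becomes meaningful with many hyperplanes; with two bits per component, as in this toy input, a single differing bit already corresponds to half the range.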