flyhash package

Submodules

flyhash.FlyHash module

class flyhash.FlyHash.FlyHash(input_dim: int, hash_dim: int, density: int | float = 0.1, sparsity: float = 0.05, quant_step: float | None = None, dtype: type = <class 'numpy.int32'>, seed: int | None = None)[source]

Bases: object

FlyHash is a local sensitive hashing (LSH) algorithm, based on the paper “A neural algorithm for fundamental computing problems” by S. Dasgupta, C. F. Stevens, and S. Navlakha (2017).

FlyHash is a LSH algorithm that maps input data to a sparse hash embedding, where the dimension of the hash embedding is much larger than the input, and keeps the locality of the input data in the hash embedding.

FlyHash is designed to be cheap to compute, yet not ganranteeing memory efficiency. It is suitable for hashing small to medium sized data (d ~ 10-1000) to a large hash embedding (m ~ 100-10000).

Example

Using a large hash_dim m=100 for a small input_dim d=10:

>>> import numpy as np
>>> from flyhash import FlyHash
>>> d = 10
>>> m = 100
>>> flyhash = FlyHash(d, m)
>>> data = np.random.randn(5, d)
>>> hashed_data = flyhash(data)
__call__(input_data: ndarray) ndarray[source]

Hash input_data to hash_dim vector(s).

Parameters:

input_data (np.ndarray) – Input data to be hashed, must be 1D (input_dim,) or 2D (batch_size, input_dim).

Returns:

Hashed data, 1D (hash_dim,) or 2D (batch_size, hash_dim) according to the shape of input_data.

Return type:

np.ndarray

__init__(input_dim: int, hash_dim: int, density: int | float = 0.1, sparsity: float = 0.05, quant_step: float | None = None, dtype: type = <class 'numpy.int32'>, seed: int | None = None)[source]

Initialize FlyHash object.

Parameters:
  • input_dim (int) –

    Input dimension of the data to be hashed.

    Note that FlyHash can only hash data with a fixed input dimensions.

  • hash_dim (int) – Output dimension of the hash embeddings.

  • density (Union[int, float], optional) –

    The connection density from input to output, either a float between 0 and 1, or an integer greater than 0, by default 0.1.

    If density is a float, each column of the projection matrix has non-zero entries with probability density.

    If density is an integer, each column of the projection matrix has exactly density non-zero entries.

  • sparsity (float, optional) –

    The sparsity level of hash embeddings, must be a float between 0 and 1, by default 0.05.

    The sparsity level is defined as the fraction of non-zero entries in the hash embeddings. The precise number of non-zero entries is determined by hash_dim and sparsity as round(hash_dim * sparsity).

  • quant_step (Optional[float], optional) –

    The quantization step size to use, either a float representing the step size, or None for all-or-none clipping, by default None.

    If quant_step is a float, the hash embeddings will be quantized to an integer multiple of quant_step. To avoid information loss, the ceiling of the quantized value is used, i.e. the quantized value is always greater than 0 iff the original value is greater than 0.

  • dtype (type, optional) –

    Data type of the output hash embeddings, by default np.int32.

    Note that the data type must be one of the integer types.

  • seed (Optional[int], optional) – The random seed to use when generating the projection matrix, by default None.

_construct_projection_matrix()[source]

Construct projection matrix from input_dim to hash_dim.

Notes

The projection matrix W is a binary matrix of shape (hash_dim, input_dim).

This method constructs W by sampling from a binomial distribution with parameters n=input_dim and p=density if density is a float, or by sampling from a uniform distribution over input_dim if density is an integer.

This method is called by the constructor. It SHOULD NOT be called again later to avoid overwriting the projection matrix. Doing so will make the hash embeddings non-deterministic.

_winner_take_all(input_vector: ndarray) ndarray[source]

Winner-take-all operation.

This also includes quantization step using quant_step if specified, to reduce the number of unique values.

Parameters:

input_vector (np.ndarray) – 1D input vector to be winner-take-all-ed. The number of winners is pre-determined by num_winners.

Returns:

1D winner-take-all-ed vector of given integer type dtype.

Return type:

np.ndarray

reset(seed: int | None = None)[source]

Reset the random seed if provided and re-construct the projection matrix.

Parameters:

seed (Optional[int], optional) – The new seed to be used, omitted when set to None, by default None.

Module contents

class flyhash.FlyHash(input_dim: int, hash_dim: int, density: int | float = 0.1, sparsity: float = 0.05, quant_step: float | None = None, dtype: type = <class 'numpy.int32'>, seed: int | None = None)[source]

Bases: object

FlyHash is a local sensitive hashing (LSH) algorithm, based on the paper “A neural algorithm for fundamental computing problems” by S. Dasgupta, C. F. Stevens, and S. Navlakha (2017).

FlyHash is a LSH algorithm that maps input data to a sparse hash embedding, where the dimension of the hash embedding is much larger than the input, and keeps the locality of the input data in the hash embedding.

FlyHash is designed to be cheap to compute, yet not ganranteeing memory efficiency. It is suitable for hashing small to medium sized data (d ~ 10-1000) to a large hash embedding (m ~ 100-10000).

Example

Using a large hash_dim m=100 for a small input_dim d=10:

>>> import numpy as np
>>> from flyhash import FlyHash
>>> d = 10
>>> m = 100
>>> flyhash = FlyHash(d, m)
>>> data = np.random.randn(5, d)
>>> hashed_data = flyhash(data)
__call__(input_data: ndarray) ndarray[source]

Hash input_data to hash_dim vector(s).

Parameters:

input_data (np.ndarray) – Input data to be hashed, must be 1D (input_dim,) or 2D (batch_size, input_dim).

Returns:

Hashed data, 1D (hash_dim,) or 2D (batch_size, hash_dim) according to the shape of input_data.

Return type:

np.ndarray

__init__(input_dim: int, hash_dim: int, density: int | float = 0.1, sparsity: float = 0.05, quant_step: float | None = None, dtype: type = <class 'numpy.int32'>, seed: int | None = None)[source]

Initialize FlyHash object.

Parameters:
  • input_dim (int) –

    Input dimension of the data to be hashed.

    Note that FlyHash can only hash data with a fixed input dimensions.

  • hash_dim (int) – Output dimension of the hash embeddings.

  • density (Union[int, float], optional) –

    The connection density from input to output, either a float between 0 and 1, or an integer greater than 0, by default 0.1.

    If density is a float, each column of the projection matrix has non-zero entries with probability density.

    If density is an integer, each column of the projection matrix has exactly density non-zero entries.

  • sparsity (float, optional) –

    The sparsity level of hash embeddings, must be a float between 0 and 1, by default 0.05.

    The sparsity level is defined as the fraction of non-zero entries in the hash embeddings. The precise number of non-zero entries is determined by hash_dim and sparsity as round(hash_dim * sparsity).

  • quant_step (Optional[float], optional) –

    The quantization step size to use, either a float representing the step size, or None for all-or-none clipping, by default None.

    If quant_step is a float, the hash embeddings will be quantized to an integer multiple of quant_step. To avoid information loss, the ceiling of the quantized value is used, i.e. the quantized value is always greater than 0 iff the original value is greater than 0.

  • dtype (type, optional) –

    Data type of the output hash embeddings, by default np.int32.

    Note that the data type must be one of the integer types.

  • seed (Optional[int], optional) – The random seed to use when generating the projection matrix, by default None.

_construct_projection_matrix()[source]

Construct projection matrix from input_dim to hash_dim.

Notes

The projection matrix W is a binary matrix of shape (hash_dim, input_dim).

This method constructs W by sampling from a binomial distribution with parameters n=input_dim and p=density if density is a float, or by sampling from a uniform distribution over input_dim if density is an integer.

This method is called by the constructor. It SHOULD NOT be called again later to avoid overwriting the projection matrix. Doing so will make the hash embeddings non-deterministic.

_winner_take_all(input_vector: ndarray) ndarray[source]

Winner-take-all operation.

This also includes quantization step using quant_step if specified, to reduce the number of unique values.

Parameters:

input_vector (np.ndarray) – 1D input vector to be winner-take-all-ed. The number of winners is pre-determined by num_winners.

Returns:

1D winner-take-all-ed vector of given integer type dtype.

Return type:

np.ndarray

reset(seed: int | None = None)[source]

Reset the random seed if provided and re-construct the projection matrix.

Parameters:

seed (Optional[int], optional) – The new seed to be used, omitted when set to None, by default None.