See also: Machine learning terms
Hashing in machine learning refers to the use of hash functions to map data into a fixed-size space. The goal is rarely security (as it would be in cryptography); instead, hashing is used as a tool for efficient storage, fast lookup, similarity search, and compact feature representation. A hash function h takes an input of arbitrary size (a word, a URL, a vector, a document) and returns an integer in some bounded range. By mapping arbitrarily sized inputs into one fixed range, hashing turns awkward, variable-sized data into something a model or a data structure can index in constant time.
In modern machine learning practice, two families of techniques dominate. The first is feature hashing, also known as the hashing trick, which maps high-dimensional sparse features into a fixed-dimensional vector for use by linear models. The second is locality-sensitive hashing (LSH), a family of hash schemes designed so that similar inputs collide on purpose, enabling sub-linear approximate nearest neighbor search. A third application area, less glamorous but enormously important in the LLM era, is using hash sketches such as MinHash and Bloom filters to deduplicate the trillions of tokens used to pretrain language models.
Feature hashing was popularized by Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg in their 2009 ICML paper "Feature Hashing for Large Scale Multitask Learning" (Weinberger et al., 2009). The idea is simple: instead of building a vocabulary of every possible feature name (every word, every n-gram, every user ID) and assigning each one a column in a feature vector, hash the feature name directly to a column index in a vector of fixed size d.
Formally, let h be a hash function that maps a feature name to an integer in {0, 1, ..., d-1}. For each occurrence of a feature in a sample, increment position h(feature) in the output vector. To keep the dot product unbiased in expectation, the hashing trick also uses a sign hash s that maps each feature name to either +1 or -1, so the contribution of feature f with weight v becomes s(f) * v at index h(f). With independent hash functions for index and sign, the inner product of two hashed vectors is an unbiased estimator of the inner product of the original vectors, and Weinberger et al. proved exponential tail bounds on the approximation error.
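A minimal sketch makes the mechanics concrete. The code below is illustrative only: it uses Python's standard-library MD5 in place of the MurmurHash3 that production systems use, and the helper names are invented for this example.

```python
import hashlib
import numpy as np

def _hash(name: str, salt: str) -> int:
    # Deterministic integer hash of a feature name; MD5 stands in for the
    # faster non-cryptographic hashes (e.g. MurmurHash3) used in practice.
    digest = hashlib.md5((salt + name).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")

def hash_features(features: dict, d: int = 2**20) -> np.ndarray:
    """Map {feature name: value} pairs into a fixed d-dimensional vector."""
    x = np.zeros(d)
    for name, value in features.items():
        index = _hash(name, "index") % d                  # column index h(f)
        sign = 1.0 if _hash(name, "sign") % 2 else -1.0   # sign hash s(f)
        x[index] += sign * value                          # contribution s(f) * v
    return x

# Two documents as bags of words; the dot product of the hashed vectors
# approximates the dot product of the (never materialized) vocabulary-indexed vectors.
a = hash_features({"the": 2.0, "cat": 1.0, "sat": 1.0})
b = hash_features({"the": 1.0, "cat": 1.0, "mat": 1.0})
print(a @ b)
```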
The practical payoff is large. The model never has to store a vocabulary, never has to do dictionary lookups, and the feature dimension is fixed at training time. This makes feature hashing the workhorse for streaming and online learning systems where new feature names appear continuously. Vowpal Wabbit, John Langford's online learning system, was an early adopter and uses 32-bit MurmurHash3 to hash feature names directly into the weight vector. The same idea powers HashingVectorizer and FeatureHasher in scikit-learn, which are the standard tools for fitting linear models over text or categorical data without holding a vocabulary in memory.
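Assuming scikit-learn is installed, the hashed vectorizers can be used without fitting or storing any vocabulary; the parameter choices below are just examples.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction import FeatureHasher

# Hashed bag-of-words: no vocabulary is stored, so transform() works on text
# the vectorizer has never seen and memory stays bounded.
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=True)
X = vectorizer.transform(["the cat sat on the mat", "the dog sat"])
print(X.shape)  # (2, 1048576), returned as a sparse matrix

# FeatureHasher does the same for arbitrary categorical or numeric features.
hasher = FeatureHasher(n_features=2**18)
X2 = hasher.transform([{"user_id=12345": 1, "country=DE": 1}])
```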
The trade-off is collisions. Two distinct features can hash to the same index, which corrupts the per-feature weight. The signed hash trick mitigates the bias of these collisions because the cross-terms cancel in expectation, but variance remains. With d chosen large enough (typically 2^18 to 2^24 for text problems), collisions are rare enough that downstream accuracy is barely affected. For small or dense problems with only a few thousand features, feature hashing offers little benefit and can hurt model quality, so vocabulary-based vectorizers like CountVectorizer and TfidfVectorizer are preferred there.
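A quick back-of-the-envelope calculation shows why a large d keeps collisions rare. The occupancy formula below is standard balls-into-bins reasoning; the feature count of 100,000 is an arbitrary example.

```python
# Expected number of occupied buckets when m distinct features are hashed
# uniformly into d buckets is d * (1 - (1 - 1/d)**m); the shortfall m - occupied
# counts features that landed in a bucket already holding another feature.
def expected_colliding_features(m: int, d: int) -> float:
    occupied = d * (1 - (1 - 1 / d) ** m)
    return m - occupied

for d in (2**18, 2**20, 2**24):
    print(d, expected_colliding_features(m=100_000, d=d))
# With 100k features, 2**18 buckets leave roughly 17k features sharing a bucket
# with another feature, while 2**24 buckets leave only about 300.
```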
Locality-sensitive hashing was introduced by Piotr Indyk and Rajeev Motwani in their 1998 STOC paper "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality" (Indyk and Motwani, 1998). A family of hash functions H is called locality-sensitive for a similarity measure sim if, for hash function h drawn at random from H, the probability that two items x and y collide is high when they are similar and low when they are dissimilar. Concretely, for the (r, cr, p1, p2)-LSH definition, a random h from H satisfies Pr[h(x) = h(y)] >= p1 whenever dist(x, y) <= r and Pr[h(x) = h(y)] <= p2 whenever dist(x, y) >= cr, with p1 > p2.
Given such a family, you can build an index that answers approximate near-neighbor queries in time roughly O(n^rho) for some rho < 1, beating the brute-force O(n) scan and avoiding the exponential blow-up that classical tree structures (kd-trees, R-trees) suffer in high dimensions. This was the original motivation for LSH and remains the textbook example of an algorithm that side-steps the curse of dimensionality.
The specific LSH family you pick depends on the similarity measure you care about.
| LSH family | Author and year | Similarity measure | Typical use case |
|---|---|---|---|
| MinHash | Broder, 1997 | Jaccard similarity over sets | Near-duplicate web pages, duplicate detection |
| SimHash | Charikar, 2002 | Cosine similarity over vectors | Near-duplicate documents at Google |
| Random hyperplane | Charikar, 2002 | Cosine / angular distance | High-dimensional embedding search |
| p-stable LSH | Datar, Immorlica, Indyk and Mirrokni, 2004 | Lp norm (Euclidean for p=2) | Image features, dense vectors |
| b-bit MinHash | Li and Konig, 2010 | Jaccard, low memory | Large-scale retrieval, ad targeting |
MinHash, defined by Andrei Broder in his 1997 paper "On the Resemblance and Containment of Documents", estimates the Jaccard similarity between two sets by hashing each set with k independent random permutations and recording the minimum hashed element under each. The fraction of permutations on which the two minima agree is an unbiased estimator of the Jaccard similarity, and the variance shrinks as 1/k. MinHash powered AltaVista's near-duplicate detection over a 30 million-document crawl in the 1990s and remains the canonical sketch for set similarity.
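As a sketch, the datasketch library (listed in the tools table below) exposes this estimator directly; num_perm plays the role of k, and the sentences are made up for illustration.

```python
from datasketch import MinHash

def minhash(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf-8"))
    return m

a = minhash("the quick brown fox jumps over the lazy dog".split())
b = minhash("the quick brown fox jumped over a lazy dog".split())

# Fraction of the 128 permutations on which the minima agree; the true Jaccard
# similarity of the two word sets is 0.7, and the estimate will be close to it.
print(a.jaccard(b))
```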
SimHash, introduced by Moses Charikar in his 2002 STOC paper "Similarity Estimation Techniques from Rounding Algorithms", produces a short binary fingerprint such that the Hamming distance between two fingerprints tracks the cosine similarity of the underlying vectors. Google adopted SimHash for near-duplicate detection in its web crawl, and the technique remains widely used wherever cosine similarity is the right notion of closeness.
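A minimal NumPy sketch of the random-hyperplane idea underlying SimHash, with illustrative sizes; real SimHash implementations additionally weight each bit by feature frequency before taking the sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, dim = 64, 300
planes = rng.standard_normal((n_bits, dim))   # one random hyperplane per bit

def simhash(v: np.ndarray) -> np.ndarray:
    # Each bit records which side of a random hyperplane the vector falls on.
    return (planes @ v > 0).astype(np.uint8)

def hamming(f1, f2) -> int:
    return int(np.count_nonzero(f1 != f2))

x = rng.standard_normal(dim)
y = x + 0.1 * rng.standard_normal(dim)   # a near-duplicate of x
z = rng.standard_normal(dim)             # an unrelated vector

# Fewer differing bits corresponds to a smaller angle, i.e. higher cosine similarity.
print(hamming(simhash(x), simhash(y)), hamming(simhash(x), simhash(z)))
```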
The p-stable LSH scheme of Datar, Immorlica, Indyk and Mirrokni (2004) hashes points in Euclidean space directly using random projections drawn from p-stable distributions (Gaussian for L2, Cauchy for L1). Each hash bucket corresponds to a slab of width w in a random direction, and points that are close in the original space tend to land in the same slab. This was the first LSH family that worked natively for Lp norms without an intermediate embedding.
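A sketch of a single p-stable hash for the Euclidean case follows, with arbitrary width and dimension; a real index concatenates several such hashes into one bucket key and uses multiple tables to boost recall.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, w = 128, 4.0                      # w is the slab width
a = rng.standard_normal(dim)           # Gaussian projection is 2-stable, so this targets L2
b = rng.uniform(0, w)                  # random offset

def pstable_hash(x: np.ndarray) -> int:
    # Points falling in the same slab of width w along direction a share a bucket.
    return int(np.floor((a @ x + b) / w))

x = rng.standard_normal(dim)
print(pstable_hash(x), pstable_hash(x + 0.01 * rng.standard_normal(dim)))
```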
The hash functions used in machine learning workloads are almost always non-cryptographic. Speed and good bit-mixing matter; resistance to adversarial inputs usually does not. Cryptographic hashes (SHA-256, SHA-3) appear only when downstream auditing or tamper-evidence matters, which is rare in core ML pipelines.
| Hash function | Author and year | Output | Notes |
|---|---|---|---|
| MurmurHash3 | Austin Appleby, 2011 | 32 or 128 bit | Default in scikit-learn FeatureHasher and Vowpal Wabbit |
| xxHash (xxh64, XXH3) | Yann Collet, 2012 onward | 32, 64, 128 bit | Extremely fast on large inputs, runs at RAM speed |
| CityHash / FarmHash | Google, 2011 onward | 32, 64, 128 bit | Fast on small to medium inputs, used inside Google |
| SipHash | Aumasson and Bernstein, 2012 | 64 bit | Keyed; resists hash-flooding in dictionaries |
| SHA-256 | NIST, 2001 | 256 bit | Cryptographic; used for content addressing, rarely for features |
MurmurHash3 has become the de facto standard for feature hashing because it has good empirical distribution, no obvious bias on real text, and a tiny implementation. xxHash is faster on bulk data and is the typical choice when hashing large blobs (for instance, deduplicating training shards). For dictionary-style usage inside production services, SipHash is preferred because it accepts a secret key and resists adversarial collision attacks.
The largest current use of hashing in machine learning, by raw compute spent, is in deduplicating pretraining corpora for large language models. Web crawls contain enormous amounts of near-duplicate text: boilerplate, scraped copies, machine translations, SEO doorway pages, and so on. Training a language model on duplicated text wastes compute and degrades held-out perplexity, and the model tends to memorize the duplicated passages verbatim, which becomes a privacy and copyright issue.
The standard pipeline uses a two-stage hash-based filter. First, exact deduplication via content hashes (typically SHA-1 or xxHash of the document body, or suffix-array-based exact substring matching at the token level). Second, fuzzy deduplication via MinHash plus LSH banding: each document is shingled into n-grams, MinHashed to a signature of a few hundred values, and bucketed via LSH so that any pair with Jaccard similarity above a threshold (commonly around 0.7 or 0.8) gets compared. The Pile (Gao et al., 2020) used MinHashLSH at the document level, removing roughly 26% of its Common Crawl subset and 28% of OpenWebText2 as duplicates. RefinedWeb (Penedo et al., 2023), the corpus behind the Falcon family of LLMs, applied MinHashLSH plus exact substring deduplication and ended up discarding nearly 90% of the original web content. Similar pipelines power Dolma, RedPajama, and the FineWeb corpora.
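A condensed sketch of the fuzzy stage using datasketch's MinHashLSH follows. The shingle size, num_perm, threshold, and toy corpus are illustrative choices, not the settings of any particular pipeline.

```python
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128, n: int = 5) -> MinHash:
    # Shingle the document into word 5-grams and MinHash the shingle set.
    words = text.split()
    shingles = {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf-8"))
    return m

corpus = [
    ("a", "the small brown cat sat quietly on the warm mat near the window and "
          "watched the birds in the garden for most of the long afternoon"),
    ("b", "the small brown cat sat quietly on the warm mat near the window and "
          "watched the birds in the garden for most of the long afternoon today"),
    ("c", "an entirely different document about locality sensitive hashing"),
]

# Banding parameters are derived internally from the threshold and num_perm.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = {}
for doc_id, text in corpus:
    sig = signature(text)
    if lsh.query(sig):        # collides with an already-kept document: drop it
        continue
    lsh.insert(doc_id, sig)
    kept[doc_id] = text

# "b" is a near-duplicate of "a" and is dropped with high probability; "c" is kept.
print(sorted(kept))
```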
Bloom filters, another classical hashing structure, are used in this same pipeline to do fast set-membership tests across shards without storing every URL or hash explicitly. Count-min sketches are used to estimate token frequencies for filtering and dimensionality reduction of vocabulary tails.
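A toy Bloom filter illustrates the membership test; the bit-array size, hash count, and use of BLAKE2 here are arbitrary choices for the sketch, and production pipelines use tuned, often disk-backed implementations.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1 << 20, n_hashes: int = 7):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item: str):
        # Derive n_hashes independent positions by salting one hash function.
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # May return True for items never added (false positives), but never
        # returns False for an item that was added.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/page-1")
print("https://example.com/page-1" in seen, "https://example.com/page-2" in seen)
```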
Hashing also shows up across the rest of the ML stack, beyond feature vectors and corpus deduplication.
The LSH families discussed above are data-independent; the hash function is chosen at random and works for any input. A natural question is whether you can do better by learning hash functions from data. This is the focus of "learning to hash" and "semantic hashing".
Weiss, Torralba, and Fergus introduced "Spectral Hashing" at NIPS 2008 (Weiss et al., 2008). They framed the problem of finding short binary codes whose Hamming distance reflects semantic similarity as a graph partitioning problem, relaxed it via spectral methods, and showed that the resulting codes outperformed random projection LSH on image retrieval benchmarks. Liu, Wang, Kumar, and Chang followed up with "Hashing with Graphs" at ICML 2011, which used anchor graphs to scale spectral hashing to larger datasets.
For most modern image and text retrieval, learned dense embeddings indexed with HNSW or product quantization (as implemented in libraries such as FAISS) now dominate over learned binary codes, but the underlying intuition (that you can train a hash function jointly with the downstream similarity measure) carried over into modern retrieval systems and contrastive learning.
Feature hashing is unforgiving on small problems and on problems where you need to interpret learned weights per feature. Because the mapping from feature names to indices is not invertible, you cannot easily recover which feature caused a particular weight to grow; debugging requires keeping a parallel dictionary or rerunning the model with vocabulary-based vectorization for inspection. Dense small-feature problems gain little from hashing and pay collision costs.
LSH is approximate by construction. Recall depends on the number of hash tables and bands, and tuning these is a nuisance. Modern dense ANN libraries (FAISS, ScaNN, HNSW) often beat classical LSH on recall at fixed query latency for the kinds of dense embedding vectors that come out of neural networks. LSH still wins in two regimes: extremely high-dimensional sparse data where dense vector methods are wasteful, and pipelines where the cost of generating a dense embedding is itself the bottleneck. For pure web-scale near-duplicate detection over text, MinHashLSH remains the standard.
| Tool | Language | What it provides |
|---|---|---|
| sklearn.feature_extraction.FeatureHasher | Python | General-purpose feature hashing for dict / list inputs |
| sklearn.feature_extraction.text.HashingVectorizer | Python | Hashed bag-of-words / n-grams for text |
| Vowpal Wabbit | C++ / CLI | Online learning with hashing trick as the native representation |
| datasketch | Python | MinHash, MinHash LSH, LSH Forest, Weighted MinHash, HyperLogLog |
| FAISS | C++ / Python | Dense ANN, also includes LSH and binary indexes |
| ScaNN | C++ / Python | Google's anisotropic vector quantization for ANN |
| Annoy | C++ / Python | Tree-based ANN, often cited alongside LSH |
| text-dedup, slimpajama-tools | Python | MinHashLSH pipelines for LLM corpus deduplication |
Imagine you have a huge box of differently-shaped and colored LEGO bricks. Now, you want to group them based on their color and shape to make it easier to find the one you need. Hashing in machine learning is like organizing these LEGO bricks into smaller containers with labels, so you can easily find the one you're looking for.
In machine learning, we deal with lots of data, like words in a book or images on the internet. Hashing helps us organize this data in a simpler and more efficient way, so computers can understand and work with it faster. It's like putting the LEGO bricks into labeled containers, so you can find them more easily when you want to build something. Sometimes two different bricks end up in the same container by accident; we call that a collision, and we either live with the small confusion or pick a bigger set of containers so it almost never happens.