See also: Machine learning terms
Hashing in machine learning refers to the use of hash functions to map data into a fixed-size space. The goal is rarely security (as it would be in cryptography); instead, hashing is used as a tool for efficient storage, fast lookup, similarity search, and compact feature representation. A hash function h takes an input of arbitrary size (a word, a URL, a vector, a document) and returns an integer in some bounded range. By mapping arbitrarily sized inputs into one fixed range, hashing turns awkward, variable-sized data into something a model or a data structure can index in constant time.
In modern machine learning practice, two families of techniques dominate. The first is feature hashing, also known as the hashing trick, which maps high-dimensional sparse features into a fixed-dimensional vector for use by linear models. The second is locality-sensitive hashing (LSH), a family of hash schemes designed so that similar inputs collide on purpose, enabling sub-linear approximate nearest neighbor search. A third application area, less glamorous but enormously important in the LLM era, is using hash sketches such as MinHash and Bloom filters to deduplicate the trillions of tokens used to pretrain language models.
Feature hashing was popularized by Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg in their 2009 ICML paper "Feature Hashing for Large Scale Multitask Learning" (Weinberger et al., 2009). The idea is simple: instead of building a vocabulary of every possible feature name (every word, every n-gram, every user ID) and assigning each one a column in a feature vector, hash the feature name directly to a column index in a vector of fixed size d.
Formally, let h be a hash function that maps a feature name to an integer in {0, 1, ..., d-1}. For each occurrence of a feature in a sample, increment position h(feature) in the output vector. To keep the dot product unbiased in expectation, the hashing trick also uses a sign hash s that maps each feature name to either +1 or -1, so the contribution of feature f with weight v becomes s(f) * v at index h(f). With independent hash functions for index and sign, the inner product of two hashed vectors is an unbiased estimator of the inner product of the original vectors, and Weinberger et al. proved exponential tail bounds on the approximation error.
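A minimal sketch makes the mechanics concrete. The code below is illustrative only: it uses Python's standard-library MD5 in place of the MurmurHash3 that production systems use, and the helper names are invented for this example.

```python
import hashlib
import numpy as np

def _hash(name: str, salt: str) -> int:
    # Deterministic integer hash of a feature name; MD5 stands in for the
    # faster non-cryptographic hashes (e.g. MurmurHash3) used in practice.
    digest = hashlib.md5((salt + name).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")

def hash_features(features: dict, d: int = 2**20) -> np.ndarray:
    """Map {feature name: value} pairs into a fixed d-dimensional vector."""
    x = np.zeros(d)
    for name, value in features.items():
        index = _hash(name, "index") % d                  # column index h(f)
        sign = 1.0 if _hash(name, "sign") % 2 else -1.0   # sign hash s(f)
        x[index] += sign * value                          # contribution s(f) * v
    return x

# Two documents as bags of words; the dot product of the hashed vectors
# approximates the dot product of the (never materialized) vocabulary-indexed vectors.
a = hash_features({"the": 2.0, "cat": 1.0, "sat": 1.0})
b = hash_features({"the": 1.0, "cat": 1.0, "mat": 1.0})
print(a @ b)
```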
The practical payoff is large. The model never has to store a vocabulary, never has to do dictionary lookups, and the feature dimension is fixed at training time. This makes feature hashing the workhorse for streaming and online learning systems where new feature names appear continuously. Vowpal Wabbit, John Langford's online learning system, was an early adopter and uses 32-bit MurmurHash3 to hash feature names directly into the weight vector. The same idea powers HashingVectorizer and FeatureHasher in scikit-learn, which are the standard tools for fitting linear models over text or categorical data without holding a vocabulary in memory.
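Assuming scikit-learn is installed, the hashed vectorizers can be used without fitting or storing any vocabulary; the parameter choices below are just examples.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction import FeatureHasher

# Hashed bag-of-words: no vocabulary is stored, so transform() works on text
# the vectorizer has never seen and memory stays bounded.
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=True)
X = vectorizer.transform(["the cat sat on the mat", "the dog sat"])
print(X.shape)  # (2, 1048576), returned as a sparse matrix

# FeatureHasher does the same for arbitrary categorical or numeric features.
hasher = FeatureHasher(n_features=2**18)
X2 = hasher.transform([{"user_id=12345": 1, "country=DE": 1}])
```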
The trade-off is collisions. Two distinct features can hash to the same index, which corrupts the per-feature weight. The signed hash trick mitigates the bias of these collisions because the cross-terms cancel in expectation, but variance remains. With d chosen large enough (typically 2^18 to 2^24 for text problems), collisions are rare enough that downstream accuracy is barely affected. For small or dense problems with only a few thousand features, feature hashing offers little benefit and can hurt model quality, so vocabulary-based vectorizers like CountVectorizer and TfidfVectorizer are preferred there.
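A quick back-of-the-envelope calculation shows why a large d keeps collisions rare. The occupancy formula below is standard balls-into-bins reasoning; the feature count of 100,000 is an arbitrary example.

```python
# Expected number of occupied buckets when m distinct features are hashed
# uniformly into d buckets is d * (1 - (1 - 1/d)**m); the shortfall m - occupied
# counts features that landed in a bucket already holding another feature.
def expected_colliding_features(m: int, d: int) -> float:
    occupied = d * (1 - (1 - 1 / d) ** m)
    return m - occupied

for d in (2**18, 2**20, 2**24):
    print(d, expected_colliding_features(m=100_000, d=d))
# With 100k features, 2**18 buckets leave roughly 17k features sharing a bucket
# with another feature, while 2**24 buckets leave only about 300.
```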
Locality-sensitive hashing was introduced by Piotr Indyk and Rajeev Motwani in their 1998 STOC paper "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality" (Indyk and Motwani, 1998). A family of hash functions H is called locality-sensitive for a similarity measure sim if, for hash function h drawn at random from H, the probability that two items x and y collide is high when they are similar and low when they are dissimilar. Concretely, for the (r, cr, p1, p2)-LSH definition, a random h from H satisfies Pr[h(x) = h(y)] >= p1 whenever dist(x, y) <= r and Pr[h(x) = h(y)] <= p2 whenever dist(x, y) >= cr, with p1 > p2.
Given such a family, you can build an index that answers approximate near-neighbor queries in time roughly O(n^rho) for some rho < 1, beating the brute-force O(n) scan and avoiding the exponential blow-up that classical tree structures (kd-trees, R-trees) suffer in high dimensions. This was the original motivation for LSH and remains the textbook example of an algorithm that side-steps the curse of dimensionality.
The specific LSH family you pick depends on the similarity measure you care about.
| LSH family | Author and year | Similarity measure | Typical use case |
|---|---|---|---|
| MinHash | Broder, 1997 | Jaccard similarity over sets | Near-duplicate web pages, duplicate detection |
| SimHash | Charikar, 2002 | Cosine similarity over vectors | Near-duplicate documents at Google |
| Random hyperplane | Charikar, 2002 | Cosine / angular distance | High-dimensional embedding search |
| p-stable LSH | Datar, Immorlica, Indyk and Mirrokni, 2004 | Lp norm (Euclidean for p=2) | Image features, dense vectors |
| b-bit MinHash | Li and Konig, 2010 | Jaccard, low memory | Large-scale retrieval, ad targeting |
MinHash, defined by Andrei Broder in his 1997 paper "On the Resemblance and Containment of Documents", estimates the Jaccard similarity between two sets by hashing each set with k independent random permutations and recording the minimum hashed element under each. The fraction of permutations on which the two minima agree is an unbiased estimator of the Jaccard similarity, and the variance shrinks as 1/k. MinHash powered AltaVista's near-duplicate detection over a 30 million-document crawl in the 1990s and remains the canonical sketch for set similarity.
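As a sketch, the datasketch library (listed in the tools table below) exposes this estimator directly; num_perm plays the role of k, and the sentences are made up for illustration.

```python
from datasketch import MinHash

def minhash(tokens, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf-8"))
    return m

a = minhash("the quick brown fox jumps over the lazy dog".split())
b = minhash("the quick brown fox jumped over a lazy dog".split())

# Fraction of the 128 permutations on which the minima agree; the true Jaccard
# similarity of the two word sets is 0.7, and the estimate will be close to it.
print(a.jaccard(b))
```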
SimHash, introduced by Moses Charikar in his 2002 STOC paper "Similarity Estimation Techniques from Rounding Algorithms", produces a short binary fingerprint such that the Hamming distance between two fingerprints tracks the cosine similarity of the underlying vectors. Google adopted SimHash for near-duplicate detection in its web crawl, and the technique remains widely used wherever cosine similarity is the right notion of closeness.
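A minimal NumPy sketch of the random-hyperplane idea underlying SimHash, with illustrative sizes; real SimHash implementations additionally weight each bit by feature frequency before taking the sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, dim = 64, 300
planes = rng.standard_normal((n_bits, dim))   # one random hyperplane per bit

def simhash(v: np.ndarray) -> np.ndarray:
    # Each bit records which side of a random hyperplane the vector falls on.
    return (planes @ v > 0).astype(np.uint8)

def hamming(f1, f2) -> int:
    return int(np.count_nonzero(f1 != f2))

x = rng.standard_normal(dim)
y = x + 0.1 * rng.standard_normal(dim)   # a near-duplicate of x
z = rng.standard_normal(dim)             # an unrelated vector

# Fewer differing bits corresponds to a smaller angle, i.e. higher cosine similarity.
print(hamming(simhash(x), simhash(y)), hamming(simhash(x), simhash(z)))
```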
The p-stable LSH scheme of Datar, Immorlica, Indyk and Mirrokni (2004) hashes points in Euclidean space directly using random projections drawn from p-stable distributions (Gaussian for L2, Cauchy for L1). Each hash bucket corresponds to a slab of width w in a random direction, and points that are close in the original space tend to land in the same slab. This was the first LSH family that worked natively for Lp norms without an intermediate embedding.
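A sketch of a single p-stable hash for the Euclidean case follows, with arbitrary width and dimension; a real index concatenates several such hashes into one bucket key and uses multiple tables to boost recall.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, w = 128, 4.0                      # w is the slab width
a = rng.standard_normal(dim)           # Gaussian projection is 2-stable, so this targets L2
b = rng.uniform(0, w)                  # random offset

def pstable_hash(x: np.ndarray) -> int:
    # Points falling in the same slab of width w along direction a share a bucket.
    return int(np.floor((a @ x + b) / w))

x = rng.standard_normal(dim)
print(pstable_hash(x), pstable_hash(x + 0.01 * rng.standard_normal(dim)))
```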
The hash functions used in machine learning workloads are almost always non-cryptographic. Speed and good bit-mixing matter; resistance to adversarial inputs usually does not. Cryptographic hashes (SHA-256, SHA-3) appear only when downstream auditing or tamper-evidence matters, which is rare in core ML pipelines.
| Hash function | Author and year | Output | Notes |
|---|---|---|---|
| MurmurHash3 | Austin Appleby, 2011 | 32 or 128 bit | Default in scikit-learn FeatureHasher and Vowpal Wabbit |
| xxHash (xxh64, XXH3) | Yann Collet, 2012 onward | 32, 64, 128 bit | Extremely fast on large inputs, runs at RAM speed |
| CityHash / FarmHash | Google, 2011 onward | 32, 64, 128 bit | Fast on small to medium inputs, used inside Google |
| SipHash | Aumasson and Bernstein, 2012 | 64 bit | Keyed; resists hash-flooding in dictionaries |
| SHA-256 | NIST, 2001 | 256 bit | Cryptographic; used for content addressing, rarely for features |
MurmurHash3 has become the de facto standard for feature hashing because it has good empirical distribution, no obvious bias on real text, and a tiny implementation. xxHash is faster on bulk data and is the typical choice when hashing large blobs (for instance, deduplicating training shards). For dictionary-style usage inside production services, SipHash is preferred because it accepts a secret key and resists adversarial collision attacks.
The largest current use of hashing in machine learning, by raw compute spent, is in deduplicating pretraining corpora for large language models. Web crawls contain enormous amounts of near-duplicate text: boilerplate, scraped copies, machine translations, SEO doorway pages, and so on. Training a language model on duplicated text wastes compute and degrades held-out perplexity, and the model tends to memorize the duplicated passages verbatim, which becomes a privacy and copyright issue.
The standard pipeline uses a two-stage hash-based filter. First, exact deduplication via content hashes (typically SHA-1 or xxHash of the document body, or suffix-array-based exact substring matching at the token level). Second, fuzzy deduplication via MinHash plus LSH banding: each document is shingled into n-grams, MinHashed to a signature of a few hundred values, and bucketed via LSH so that any pair with Jaccard similarity above a threshold (commonly around 0.7 or 0.8) gets compared. The Pile (Gao et al., 2020) used MinHashLSH at the document level, removing roughly 26% of its Common Crawl subset and 28% of OpenWebText2 as duplicates. RefinedWeb (Penedo et al., 2023), the corpus behind the Falcon family of LLMs, applied MinHashLSH plus exact substring deduplication and ended up discarding nearly 90% of the original web content. Similar pipelines power Dolma, RedPajama, and the FineWeb corpora.
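A condensed sketch of the fuzzy stage using datasketch's MinHashLSH follows. The shingle size, num_perm, threshold, and toy corpus are illustrative choices, not the settings of any particular pipeline.

```python
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128, n: int = 5) -> MinHash:
    # Shingle the document into word 5-grams and MinHash the shingle set.
    words = text.split()
    shingles = {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}
    m = MinHash(num_perm=num_perm)
    for s in shingles:
        m.update(s.encode("utf-8"))
    return m

corpus = [
    ("a", "the small brown cat sat quietly on the warm mat near the window and "
          "watched the birds in the garden for most of the long afternoon"),
    ("b", "the small brown cat sat quietly on the warm mat near the window and "
          "watched the birds in the garden for most of the long afternoon today"),
    ("c", "an entirely different document about locality sensitive hashing"),
]

# Banding parameters are derived internally from the threshold and num_perm.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = {}
for doc_id, text in corpus:
    sig = signature(text)
    if lsh.query(sig):        # collides with an already-kept document: drop it
        continue
    lsh.insert(doc_id, sig)
    kept[doc_id] = text

# "b" is a near-duplicate of "a" and is dropped with high probability; "c" is kept.
print(sorted(kept))
```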
Bloom filters, another classical hashing structure, are used in this same pipeline to do fast set-membership tests across shards without storing every URL or hash explicitly. Count-min sketches are used to estimate token frequencies for filtering and dimensionality reduction of vocabulary tails.
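A toy Bloom filter illustrates the membership test; the bit-array size, hash count, and use of BLAKE2 here are arbitrary choices for the sketch, and production pipelines use tuned, often disk-backed implementations.

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1 << 20, n_hashes: int = 7):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item: str):
        # Derive n_hashes independent positions by salting one hash function.
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), salt=str(i).encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # May return True for items never added (false positives), but never
        # returns False for an item that was added.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/page-1")
print("https://example.com/page-1" in seen, "https://example.com/page-2" in seen)
```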
Hashing also shows up across the rest of the ML stack, beyond feature vectors and corpus deduplication.
The LSH families discussed above are data-independent; the hash function is chosen at random and works for any input. A natural question is whether you can do better by learning hash functions from data. This is the focus of "learning to hash" and "semantic hashing".
Weiss, Torralba, and Fergus introduced "Spectral Hashing" at NIPS 2008 (Weiss et al., 2008). They framed the problem of finding short binary codes whose Hamming distance reflects semantic similarity as a graph partitioning problem, relaxed it via spectral methods, and showed that the resulting codes outperformed random projection LSH on image retrieval benchmarks. Liu, Wang, Kumar, and Chang followed up with "Hashing with Graphs" at ICML 2011, which used anchor graphs to scale spectral hashing to larger datasets.
For most modern image and text retrieval, learned dense embeddings indexed with HNSW or product quantization (as implemented in libraries such as FAISS) now dominate over learned binary codes, but the underlying intuition (that you can train a hash function jointly with the downstream similarity measure) carried over into modern retrieval systems and contrastive learning.
Feature hashing is unforgiving on small problems and on problems where you need to interpret learned weights per feature. Because the mapping from feature names to indices is not invertible, you cannot easily recover which feature caused a particular weight to grow; debugging requires keeping a parallel dictionary or rerunning the model with vocabulary-based vectorization for inspection. Dense small-feature problems gain little from hashing and pay collision costs.
LSH is approximate by construction. Recall depends on the number of hash tables and bands, and tuning these is a nuisance. Modern dense ANN libraries (FAISS, ScaNN, HNSW) often beat classical LSH on recall at fixed query latency for the kinds of dense embedding vectors that come out of neural networks. LSH still wins in two regimes: extremely high-dimensional sparse data where dense vector methods are wasteful, and pipelines where the cost of generating a dense embedding is itself the bottleneck. For pure web-scale near-duplicate detection over text, MinHashLSH remains the standard.
| Tool | Language | What it provides |
|---|---|---|
| sklearn.feature_extraction.FeatureHasher | Python | General-purpose feature hashing for dict / list inputs |
| sklearn.feature_extraction.text.HashingVectorizer | Python | Hashed bag-of-words / n-grams for text |
| Vowpal Wabbit | C++ / CLI | Online learning with hashing trick as the native representation |
| datasketch | Python | MinHash, MinHash LSH, LSH Forest, Weighted MinHash, HyperLogLog |
| FAISS | C++ / Python | Dense ANN, also includes LSH and binary indexes |
| ScaNN | C++ / Python | Google's anisotropic vector quantization for ANN |
| Annoy | C++ / Python | Tree-based ANN, often cited alongside LSH |
| text-dedup, slimpajama-tools | Python | MinHashLSH pipelines for LLM corpus deduplication |
Imagine you have a huge box of differently-shaped and colored LEGO bricks. Now, you want to group them based on their color and shape to make it easier to find the one you need. Hashing in machine learning is like organizing these LEGO bricks into smaller containers with labels, so you can easily find the one you're looking for.
In machine learning, we deal with lots of data, like words in a book or images on the internet. Hashing helps us organize this data in a simpler and more efficient way, so computers can understand and work with it faster. It's like putting the LEGO bricks into labeled containers, so you can find them more easily when you want to build something. Sometimes two different bricks end up in the same container by accident; we call that a collision, and we either live with the small confusion or pick a bigger set of containers so it almost never happens.