Sketching is a family of techniques in computer science, machine learning, and database systems that use small probabilistic data structures, called sketches, to approximate properties of very large datasets in sub-linear memory. Instead of scanning the full input every time a query is asked, a sketch maintains a compact summary that can be updated incrementally and queried in constant or near-constant time. The price of this compactness is a controlled loss of precision: answers come with a probabilistic accuracy guarantee, usually expressed as an (ε, δ) bound, meaning the error is at most ε with probability at least 1 − δ.
Sketches are foundational tools in streaming data analytics, big data systems, network monitoring, search, and modern LLM training pipelines, where exact computation over petabyte-scale corpora is infeasible. They are a central example of a randomized algorithm that trades a controlled amount of accuracy for a dramatic reduction in memory.
The idea of summarizing a large stream with a small randomized data structure goes back to two foundational lines of work. In 1970, Burton H. Bloom published "Space/Time Trade-offs in Hash Coding with Allowable Errors" in Communications of the ACM, introducing the Bloom filter for approximate set-membership testing. In 1985, Philippe Flajolet and G. Nigel Martin published "Probabilistic Counting Algorithms for Data Base Applications," which used the position of the first one-bit in hash outputs to estimate cardinality with logarithmic memory.
The field crystallized in 1996 when Noga Alon, Yossi Matias, and Mario Szegedy published "The Space Complexity of Approximating the Frequency Moments," introducing the AMS sketch and effectively founding the modern theory of streaming algorithms. The three authors received the Gödel Prize in 2005 for this work. Most popular sketches in production today, including HyperLogLog, Count-Min, MinHash, and quantile summaries, descend from this lineage.
A sketch buys three things at once: small memory, fast updates, and fast queries. It pays for them with approximate answers and a non-zero probability of large error. Formally, most sketches give an (ε, δ) guarantee: the answer is within an error of ε (either additive or multiplicative, depending on the sketch) with probability at least 1 − δ. Memory typically scales as O((1/ε²) log(1/δ)), so halving the error quadruples the space, while shrinking the failure probability by a factor of two costs only about one extra independent repetition (one more hash row in a Count-Min-style sketch).
Many sketches are also mergeable: two sketches built independently over disjoint partitions can be combined into a sketch of the union. This property is what makes sketches a natural fit for map-reduce, Spark, and Flink pipelines. A worker per partition builds a local sketch, and a single reducer merges them.
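A minimal illustration of mergeability, using the `datasketch` library's HyperLogLog (the partition names and sizes are invented for the example): two workers sketch disjoint slices of a dataset, and a reducer merges the summaries to count distinct users across the union.

```python
# pip install datasketch
from datasketch import HyperLogLog

part_a = [f"user-{i}" for i in range(0, 60_000)]
part_b = [f"user-{i}" for i in range(40_000, 100_000)]  # overlaps with part_a

hll_a, hll_b = HyperLogLog(p=12), HyperLogLog(p=12)
for x in part_a:
    hll_a.update(x.encode("utf8"))
for x in part_b:
    hll_b.update(x.encode("utf8"))

hll_a.merge(hll_b)       # in-place union of the two summaries
print(hll_a.count())     # ~100,000 distinct users, within a percent or two
```

Because the merge is a simple register-wise maximum, the merged sketch is exactly what a single pass over the concatenated stream would have produced, which is the property the reducer relies on.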
| Sketch | Year | Authors | Problem solved | Error type |
|---|---|---|---|---|
| Bloom filter | 1970 | Burton H. Bloom | Set membership | False positives, no false negatives |
| Flajolet-Martin (FM) | 1985 | Flajolet, Martin | Cardinality (count distinct) | Multiplicative |
| AMS sketch | 1996 | Alon, Matias, Szegedy | Frequency moments (F₂) | Multiplicative |
| MinHash | 1997 | Andrei Broder | Jaccard similarity | Additive |
| Greenwald-Khanna (GK) | 2001 | Greenwald, Khanna | Quantiles | Additive rank error |
| SimHash | 2002 | Moses Charikar | Cosine similarity | Additive |
| Count Sketch | 2002 | Charikar, Chen, Farach-Colton | Frequency estimation | Both signs |
| Count-Min sketch | 2005 | Cormode, Muthukrishnan | Frequency estimation | One-sided overestimate |
| SpaceSaving | 2005 | Metwally, Agrawal, El Abbadi | Top-K / heavy hitters | Counter-based |
| HyperLogLog | 2007 | Flajolet, Fusy, Gandouet, Meunier | Cardinality | Multiplicative, ~1.04/√m |
| t-digest | ~2013 | Ted Dunning | Quantiles | Adaptive |
| Cuckoo filter | 2014 | Fan, Andersen, Kaminsky, Mitzenmacher | Set membership with deletion | False positives |
| KLL sketch | 2016 | Karnin, Lang, Liberty | Quantiles | Asymptotically optimal |
The Bloom filter maintains a bit array of size m and k hash functions. Inserting an element sets k bits; a lookup checks whether all k bits are set. False negatives are impossible because no bit ever flips back. False positives occur when unrelated insertions have collectively set every bit a query happens to probe. The false positive rate is approximately (1 − e^{-kn/m})^k after n items have been inserted.
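A toy Bloom filter in Python makes the mechanics concrete. Salted uses of Python's built-in `hash` stand in for k independent hash functions, and one byte per bit is used for clarity, so this is an illustration rather than a space-efficient implementation.

```python
import random

class BloomFilter:
    def __init__(self, m, k, seed=0):
        rng = random.Random(seed)
        self.m = m
        self.salts = [rng.getrandbits(64) for _ in range(k)]
        self.bits = bytearray(m)   # one byte per bit, for clarity

    def add(self, x):
        for salt in self.salts:
            self.bits[hash((salt, x)) % self.m] = 1

    def __contains__(self, x):
        # All k probed bits set => "probably present"; any unset bit => definitely absent.
        return all(self.bits[hash((salt, x)) % self.m] for salt in self.salts)

bf = BloomFilter(m=8192, k=5)
bf.add("alice")
print("alice" in bf, "bob" in bf)   # True, almost certainly False
```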
Classic Bloom filters do not support deletion, since clearing bits would silently introduce false negatives. The counting Bloom filter replaces each bit with a small counter, allowing deletes at the cost of more memory. The cuckoo filter, introduced by Bin Fan, Dave Andersen, Michael Kaminsky, and Michael Mitzenmacher at ACM CoNEXT 2014, stores small fingerprints in a cuckoo hash table. It supports deletions, is faster on lookups, and at the low false positive rates typical in practice (roughly 3% and below) uses less space than even space-optimized Bloom filters.
The count-distinct problem asks how many unique items appeared in a stream. Flajolet and Martin's 1985 algorithm hashed each element and tracked the maximum number of leading zeros, an idea that exploits the fact that streams of n uniform random hashes are likely to produce one with about log₂(n) leading zeros. LogLog refined this in 2003, and HyperLogLog (HLL), published by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier at AofA 2007, uses the harmonic mean of register estimates to achieve a standard error of about 1.04/√m, where m is the number of registers. The original paper showed that HLL can estimate cardinalities well beyond 10⁹ with about 2% accuracy using only 1.5 KB of memory.
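A stripped-down HyperLogLog estimator shows the register update and the harmonic-mean combination. This sketch omits the small- and large-range corrections of the full algorithm (and the HLL++ bias tables), so treat it as illustrative only; the bias constant alpha below is the standard large-m approximation.

```python
import hashlib

def hll_estimate(items, p=12):
    m = 1 << p                  # m = 2^p registers
    reg = [0] * m
    for x in items:
        h = int.from_bytes(
            hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")
        j = h >> (64 - p)                        # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rho = (64 - p) - rest.bit_length() + 1   # position of the leftmost one-bit
        reg[j] = max(reg[j], rho)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant, valid for large m
    return alpha * m * m / sum(2.0 ** -r for r in reg)

print(hll_estimate(range(1_000_000)))   # ~1e6; stddev about 1.04/sqrt(4096) ≈ 1.6%
```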
Google's HyperLogLog++ variant powers APPROX_COUNT_DISTINCT in BigQuery and Google Analytics. In one published benchmark, BigQuery counted 3 billion distinct values in 28 seconds with exact COUNT(DISTINCT) and in 5.7 seconds with HLL++, with the HLL result only 0.2% off the true value. HLL sketches are mergeable, which is why they show up in Druid, Presto, ClickHouse, Redshift, Snowflake, and almost every modern OLAP engine.
The Count-Min sketch, published by Graham Cormode and S. Muthukrishnan in the Journal of Algorithms in 2005, estimates how many times a given element has appeared. It maintains a d by w matrix of counters; each insertion increments one counter per row using d independent hash functions, and a query takes the minimum of the d counters indexed by the queried key. The estimator never underestimates, only overestimates, and the space-accuracy guarantee is the canonical (ε, δ) bound: with width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉, the estimate exceeds the true count by at most ε · N (where N is the stream length) with probability 1 − δ.
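To make the dimensioning concrete, here is a minimal Count-Min sketch following the w = ⌈e/ε⌉, d = ⌈ln(1/δ)⌉ recipe above. The salted built-in `hash` stands in for d pairwise-independent hash functions, so this is a sketch of the idea, not a production implementation.

```python
import math
import random

class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)         # width  = ceil(e / epsilon)
        self.d = math.ceil(math.log(1 / delta))  # depth  = ceil(ln(1 / delta))
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def add(self, x, count=1):
        for row, salt in zip(self.table, self.salts):
            row[hash((salt, x)) % self.w] += count

    def query(self, x):
        # Never underestimates; exceeds the true count by at most eps * N
        # with probability at least 1 - delta.
        return min(row[hash((salt, x)) % self.w]
                   for row, salt in zip(self.table, self.salts))
```

With eps = 0.001 and delta = 0.01, for example, this allocates w = 2719 columns and d = 5 rows, about 14,000 counters regardless of how many distinct keys the stream contains.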
Count Sketch, by Moses Charikar, Kevin Chen, and Martin Farach-Colton in 2002, predates Count-Min and uses signed counters; it produces unbiased estimates with two-sided error and is more accurate for skewed distributions. The AMS sketch of Alon, Matias, and Szegedy estimates the second frequency moment F₂ = Σ fᵢ², which is essentially the squared L2 norm of the frequency vector and a building block for many later sketches.
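A toy AMS-style estimator for F₂ clarifies the signed-counter idea. True AMS requires 4-wise independent ±1 hash functions; the parity of a salted built-in hash below is only a stand-in, so this is illustrative rather than faithful to the paper's guarantees.

```python
import random
import statistics

def ams_f2(stream, rows=64, seed=0):
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(rows)]
    sums = [0] * rows
    for x in stream:
        for i, salt in enumerate(salts):
            sums[i] += 1 if hash((salt, x)) & 1 else -1   # random +/-1 sign per key
    # Each Z_i^2 is an unbiased estimate of F2 = sum of squared frequencies;
    # averaging rows (or taking a median of row-means) reduces the variance.
    return statistics.mean(z * z for z in sums)
```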
The heavy-hitters problem asks for the most frequent items, not the count of every item. Misra-Gries (1982) is a simple deterministic counter-based summary that finds all items appearing more than n/k times. The SpaceSaving algorithm of Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi (2005) is its more accurate descendant; empirical studies consistently show SpaceSaving outperforms Misra-Gries and linear sketches in utility. RedisBloom exposes a TOPK data type built on SpaceSaving.
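SpaceSaving is short enough to state in full. A minimal version with k counters, as below: every item occurring more than n/k times in a stream of length n is guaranteed to be retained, and each reported count overestimates the truth by at most n/k.

```python
def space_saving(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Evict the minimum counter; the newcomer inherits its count,
            # which is what makes the estimate a bounded overestimate.
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1
    return counters
```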
Andrei Broder's 1997 paper "On the Resemblance and Containment of Documents," published at the Compression and Complexity of Sequences conference, introduced MinHash for estimating the Jaccard similarity between two sets. Each set is reduced to a small signature consisting of the minima of k random hash functions; the fraction of signature positions on which two sets agree is an unbiased estimator of their Jaccard similarity. Broder built a clustering of 30 million web pages and 150 GB of input from the AltaVista search engine using this technique, producing 3.6 million clusters.
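A small MinHash sketch in Python shows the estimator. Salted uses of the built-in `hash` stand in for k independent hash functions (consistent within a single process); the example sentences are invented.

```python
import random

def minhash_signature(items, k=128, seed=0):
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    # Signature = the minimum hash of the set under each of the k hash functions.
    return [min(hash((salt, x)) for x in items) for salt in salts]

def jaccard_estimate(sig_a, sig_b):
    # Fraction of agreeing positions is an unbiased estimator of Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox jumped over a lazy dog".split())
print(jaccard_estimate(minhash_signature(a), minhash_signature(b)))  # ~0.7
```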
Moses Charikar's STOC 2002 paper "Similarity Estimation Techniques from Rounding Algorithms" showed that rounding algorithms for LP and SDP relaxations correspond to locality-sensitive hashing (LSH) schemes, and introduced what is now called SimHash for estimating cosine similarity. SimHash powered Google's near-duplicate web detection at crawl scale.
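A minimal random-hyperplane sketch in the SimHash spirit, with invented toy dimensions: each signature bit records which side of a random hyperplane a vector falls on, and the probability that two signatures disagree in a bit equals θ/π, so the Hamming distance estimates the angle.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 300, 256                       # vector dimension, signature bits (illustrative)
planes = rng.standard_normal((k, d))  # one random hyperplane per signature bit

def signature(v):
    return planes @ v >= 0            # k boolean values

u, v = rng.standard_normal(d), rng.standard_normal(d)
ham = np.mean(signature(u) != signature(v))  # fraction of disagreeing bits
est_cos = np.cos(np.pi * ham)                # estimated cosine similarity
true_cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(est_cos, true_cos)
```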
Quantile sketches answer rank queries: what is the median, the 95th percentile, the 99.9th percentile? Michael Greenwald and Sanjeev Khanna's 2001 SIGMOD paper gave the GK sketch, a deterministic algorithm using O(ε⁻¹ log(εn)) space; this bound was shown optimal in 2020. t-digest, designed by Ted Dunning and widely used in Elasticsearch, Spark, and Druid, often achieves better practical accuracy than KLL on real-world data, especially at extreme quantiles, but lacks worst-case guarantees. The KLL sketch, by Zohar Karnin, Kevin Lang, and Edo Liberty at FOCS 2016, resolved the long-standing question of optimal quantile approximation in the streaming model and is the default quantile sketch in Apache DataSketches.
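A short usage sketch with the Apache DataSketches Python bindings (`pip install datasketches`); the method names follow that library's documented API, and the exponential latency stream is invented for illustration.

```python
import random
from datasketches import kll_floats_sketch

sk = kll_floats_sketch(200)             # k = 200 controls the accuracy/size trade-off
for _ in range(1_000_000):
    sk.update(random.expovariate(1.0))  # e.g. a stream of latency samples

print(sk.get_quantile(0.5))    # median
print(sk.get_quantile(0.999))  # P99.9, with an additive rank-error guarantee
```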
| Application | Sketch used | Notes |
|---|---|---|
| LLM training data deduplication | MinHash + LSH | Used in The Pile, RefinedWeb, FineWeb, RedPajama, GPT-3, Gopher, Llama 3, OLMo |
| Web crawl near-duplicate detection | SimHash | Originated at Google |
| Database COUNT(DISTINCT) | HyperLogLog / HLL++ | BigQuery, Redshift, Snowflake, ClickHouse, Druid |
| A/B testing unique users | HyperLogLog | Mergeable across experiment buckets |
| Recommendation candidate generation | MinHash, SimHash | Sub-linear nearest neighbor |
| Online learning feature counts | Count-Min sketch, feature hashing | Vowpal Wabbit, online click prediction |
| Genome and metagenome distance | MinHash | Mash (Ondov 2016) clustered all 54,118 NCBI RefSeq genomes in 33 CPU hours |
| Network heavy-hitter detection | Count-Min sketch, SpaceSaving | DDoS detection, traffic engineering |
| Streaming analytics quantiles | t-digest, KLL, GK | Latency P99 monitoring at scale |
| Database join cardinality estimates | HyperLogLog, AGMS | Query optimizer cost models |
Large language model training corpora are dominated by Common Crawl scrapes that contain massive numbers of near-duplicates: boilerplate, scraped versions of the same article, mirror pages, near-identical templates. Deduplication is one of the largest single levers on training quality. Removing duplicates makes models emit memorized text about ten times less frequently and reduces the steps needed to reach a given loss.
MinHash combined with LSH is the standard tool for this job at petabyte scale because it is both mergeable and embarrassingly parallel. The 2023 RefinedWeb paper for the Falcon family computed 9,000 hashes per document over 5-grams, divided into 20 buckets of 450 hashes each. Hugging Face's FineWeb pipeline uses 112 hash functions split into 14 buckets of 8, targeting documents at least 75% similar. Similar pipelines back The Pile, RedPajama, StarCoder, OLMo, and Llama 3 training data preparation.
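A hedged sketch of the MinHash + LSH pattern using the `datasketch` library, with FineWeb-style parameters from the text (112 permutations, 75% similarity threshold). Note that `datasketch` derives its own band layout from the threshold, so the exact 14×8 banding is approximated rather than reproduced; the document keys and token lists are invented.

```python
from datasketch import MinHash, MinHashLSH

def sig(tokens, num_perm=112):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.75, num_perm=112)
lsh.insert("doc-1", sig("the quick brown fox jumps over the lazy dog".split()))
cands = lsh.query(sig("the quick brown fox jumped over the lazy dog".split()))
print(cands)  # ["doc-1"]: a near-duplicate candidate to verify or drop
```

At corpus scale the same logic runs per shard: signatures are computed in parallel, bucketed by band, and only documents sharing a bucket are compared, which is what keeps the job embarrassingly parallel.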
The hashing trick hashes feature names directly into a fixed-size weight vector, side-stepping the need for a vocabulary. It is closely related to Count-Min: both use hash functions to keep memory bounded as the universe of distinct keys grows. Vowpal Wabbit was an early production system to push this to its limits, defaulting to a 2¹⁸ = 262,144-entry weight vector and learning to tolerate the resulting collisions through gradient updates.
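A minimal version of the hashing trick, assuming sparse (name, value) feature pairs; the 18-bit table mirrors Vowpal Wabbit's default size, and the sign bit is the standard trick for keeping collisions unbiased in expectation.

```python
def hashed_features(pairs, bits=18):
    size = 1 << bits                          # 2**18 = 262,144 slots
    vec = [0.0] * size
    for name, value in pairs:
        h = hash(name)                        # stand-in for a fixed string hash
        sign = 1.0 if (h >> 1) & 1 else -1.0  # signed updates cancel collision bias
        vec[h % size] += sign * value
    return vec

x = hashed_features([("word=hello", 1.0), ("word=world", 1.0), ("bias", 1.0)])
```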
Brian Ondov and colleagues introduced Mash in Genome Biology (2016), applying MinHash to k-mers from biological sequences. Mash enabled clustering of all 54,118 NCBI RefSeq genomes in 33 CPU hours and real-time database search against assembled or unassembled Illumina, PacBio, and Oxford Nanopore reads. It triggered a wave of genomic sketching tools including BinDash, sourmash, and skani.
| Library | Language | Sketches included |
|---|---|---|
| Apache DataSketches | Java, C++, Python | HLL, Theta, KLL, REQ, T-Digest, Tuple, Quantiles |
| datasketch | Python | MinHash, LSH, LSH Forest, LSH Ensemble, Weighted MinHash, HyperLogLog, HyperLogLog++, HNSW |
| RedisBloom | Redis module | Bloom, Cuckoo, Count-Min, Top-K, T-Digest |
| postgresql-hll | PostgreSQL extension | HyperLogLog (Aggregate Knowledge port) |
| tdigest | PostgreSQL extension | T-Digest |
| ClickHouse uniqHLL12, quantilesTDigest | Built-in | HLL, T-Digest |
| Spark MLlib BloomFilter, CountMinSketch | Built-in | Bloom, Count-Min |
| BigQuery APPROX_COUNT_DISTINCT, APPROX_QUANTILES | Built-in | HLL++, KLL |
Apache DataSketches began as an internal Yahoo project in 2012, was open-sourced in 2015, and became a top-level Apache project. The core team included Lee Rhodes, Jon Malkin, and Alex Saydakov at Yahoo, with academic contributors including Justin Thaler (Georgetown) and Edo Liberty (then at Yahoo, later AWS). Every algorithm in the library produces mergeable summaries with formal accuracy guarantees, which is the property that lets it slot into BigQuery, Druid, and Spark without losing correctness when results from many partitions are combined.
Sketching overlaps with several adjacent ideas without being identical to them. Dimensionality reduction techniques like random projection (justified by the Johnson-Lindenstrauss lemma) reduce the dimension of vectors while approximately preserving pairwise distances; the resulting reduced vectors can be viewed as sketches for L2 geometry. Approximate nearest neighbor structures like HNSW and IVF used in vector databases are sometimes grouped with sketches in practice, though they are graph- or partition-based rather than hash-based. LSH families like SimHash sit at the intersection: they produce sketches and organize the search space.
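A minimal Gaussian random projection makes the JL connection concrete; the ambient and reduced dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 256                             # ambient and reduced dimension
R = rng.standard_normal((k, d)) / np.sqrt(k)   # JL-style Gaussian projection

x, y = rng.standard_normal(d), rng.standard_normal(d)
print(np.linalg.norm(x - y))          # original distance
print(np.linalg.norm(R @ x - R @ y))  # approximately preserved after projection
```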
Sketches are also a natural fit for differential privacy. Adding small noise to sketch counters yields privatized aggregations whose error is dominated by the sketch's existing approximation noise, giving privacy almost for free. They are increasingly used in federated learning aggregation for the same reason: each client sends a small sketch of its local data, and the server merges them without ever seeing raw records.
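A deliberately simplified illustration of the noise-addition idea, assuming a Count-Min-shaped counter table with invented dimensions. Real deployments must calibrate the noise to the sketch's sensitivity (a single Count-Min update touches d counters, so the L1 sensitivity is d) and account for the estimator's bias; this snippet glosses over both.

```python
import numpy as np

rng = np.random.default_rng(0)
d, w = 5, 2719                                       # toy depth and width
table = rng.integers(0, 1_000, size=(d, w)).astype(float)
eps_dp = 1.0                                         # illustrative privacy budget
noisy = table + rng.laplace(scale=d / eps_dp, size=table.shape)
# Queries then run against `noisy` exactly as before (min over the d hashed cells).
```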
Sketches are powerful but not magic. A few practical caveats:

- The guarantees are probabilistic: with probability up to δ a query can be badly wrong, and nothing in the answer flags which queries failed.
- Error bounds are often relative to the whole input. Count-Min's additive ε · N error, for example, can dwarf the true counts of rare items on long streams.
- Accuracy parameters are fixed at construction time: tightening ε after the fact means rebuilding the sketch, and halving ε can quadruple the memory.
- Deletion support varies. Classic Bloom filters cannot remove elements without risking false negatives; counting Bloom or cuckoo filters are needed instead.
- Practical accuracy depends on hash quality and data skew; the formal bounds assume suitably independent hash functions.
Despite these caveats, sketches are a default tool of modern data infrastructure. They are how Google answers COUNT(DISTINCT) over Search query logs, how Hugging Face deduplicates the tens of terabytes behind FineWeb, and how Cloudflare detects DDoS attacks in real time. Their footprint in LLM data preparation alone has grown from a research curiosity to a load-bearing part of frontier-model pipelines.