Sketching is a family of techniques in computer science, machine learning, and database systems that use small probabilistic data structures, called sketches, to approximate properties of very large datasets in sub-linear memory. Instead of scanning the full input every time a query is asked, a sketch maintains a compact summary that can be updated incrementally and queried in constant or near-constant time. The price of this compactness is a controlled loss of precision: answers come with a probabilistic accuracy guarantee, usually expressed as an (ε, δ) bound, meaning the error is at most ε with probability at least 1 − δ.
Sketches are foundational tools in streaming data analytics, big data systems, network monitoring, search, and modern LLM training pipelines, where exact computation over petabyte-scale corpora is infeasible. They are a central example of a randomized algorithm that trades a controlled amount of accuracy for a dramatic reduction in memory.
The idea of summarizing a large stream with a small randomized data structure goes back to two foundational lines of work. In 1970, Burton H. Bloom published "Space/Time Trade-offs in Hash Coding with Allowable Errors" in Communications of the ACM, introducing the Bloom filter for approximate set-membership testing. In 1985, Philippe Flajolet and G. Nigel Martin published "Probabilistic Counting Algorithms for Data Base Applications," which used the position of the first one-bit in hash outputs to estimate cardinality with logarithmic memory.
The field crystallized in 1996 when Noga Alon, Yossi Matias, and Mario Szegedy published "The Space Complexity of Approximating the Frequency Moments," introducing the AMS sketch and effectively founding the modern theory of streaming algorithms. The three authors received the Gödel Prize in 2005 for this work. Most popular sketches in production today, including HyperLogLog, Count-Min, MinHash, and quantile summaries, descend from this lineage.
A sketch buys three things at once: small memory, fast updates, and fast queries. It pays for them with approximate answers and a non-zero probability of large error. Formally, most sketches give an (ε, δ) guarantee: the answer is within an error of ε (either additive or multiplicative, depending on the sketch) with probability at least 1 − δ. Memory typically scales as O((1/ε²) log(1/δ)), so halving the error quadruples the space, while shrinking the failure probability by a factor of two costs only about one extra independent repetition (one more hash row in a Count-Min-style sketch).
Many sketches are also mergeable: two sketches built independently over disjoint partitions can be combined into a sketch of the union. This property is what makes sketches a natural fit for map-reduce, Spark, and Flink pipelines. A worker per partition builds a local sketch, and a single reducer merges them.
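A minimal illustration of mergeability, using the `datasketch` library's HyperLogLog (the partition names and sizes are invented for the example): two workers sketch disjoint slices of a dataset, and a reducer merges the summaries to count distinct users across the union.

```python
# pip install datasketch
from datasketch import HyperLogLog

part_a = [f"user-{i}" for i in range(0, 60_000)]
part_b = [f"user-{i}" for i in range(40_000, 100_000)]  # overlaps with part_a

hll_a, hll_b = HyperLogLog(p=12), HyperLogLog(p=12)
for x in part_a:
    hll_a.update(x.encode("utf8"))
for x in part_b:
    hll_b.update(x.encode("utf8"))

hll_a.merge(hll_b)       # in-place union of the two summaries
print(hll_a.count())     # ~100,000 distinct users, within a percent or two
```

Because the merge is a simple register-wise maximum, the merged sketch is exactly what a single pass over the concatenated stream would have produced, which is the property the reducer relies on.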
| Sketch | Year | Authors | Problem solved | Error type |
|---|---|---|---|---|
| Bloom filter | 1970 | Burton H. Bloom | Set membership | False positives, no false negatives |
| Flajolet-Martin (FM) | 1985 | Flajolet, Martin | Cardinality (count distinct) | Multiplicative |
| AMS sketch | 1996 | Alon, Matias, Szegedy | Frequency moments (F₂) | Multiplicative |
| MinHash | 1997 | Andrei Broder | Jaccard similarity | Additive |
| Greenwald-Khanna (GK) | 2001 | Greenwald, Khanna | Quantiles | Additive rank error |
| SimHash | 2002 | Moses Charikar | Cosine similarity | Additive |
| Count Sketch | 2002 | Charikar, Chen, Farach-Colton | Frequency estimation | Both signs |
| Count-Min sketch | 2005 | Cormode, Muthukrishnan | Frequency estimation | One-sided overestimate |
| SpaceSaving | 2005 | Metwally, Agrawal, El Abbadi | Top-K / heavy hitters | Counter-based |
| HyperLogLog | 2007 | Flajolet, Fusy, Gandouet, Meunier | Cardinality | Multiplicative, ~1.04/√m |
| t-digest | ~2013 | Ted Dunning | Quantiles | Adaptive |
| Cuckoo filter | 2014 | Fan, Andersen, Kaminsky, Mitzenmacher | Set membership with deletion | False positives |
| KLL sketch | 2016 | Karnin, Lang, Liberty | Quantiles | Asymptotically optimal |
The Bloom filter maintains a bit array of size m and k hash functions. Inserting an element sets k bits; a lookup checks whether all k bits are set. False negatives are impossible because no bit ever flips back. False positives occur when unrelated insertions have collectively set every bit a query happens to probe. The false positive rate is approximately (1 − e^{-kn/m})^k after n items have been inserted.
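A toy Bloom filter in Python makes the mechanics concrete. Salted uses of Python's built-in `hash` stand in for k independent hash functions, and one byte per bit is used for clarity, so this is an illustration rather than a space-efficient implementation.

```python
import random

class BloomFilter:
    def __init__(self, m, k, seed=0):
        rng = random.Random(seed)
        self.m = m
        self.salts = [rng.getrandbits(64) for _ in range(k)]
        self.bits = bytearray(m)   # one byte per bit, for clarity

    def add(self, x):
        for salt in self.salts:
            self.bits[hash((salt, x)) % self.m] = 1

    def __contains__(self, x):
        # All k probed bits set => "probably present"; any unset bit => definitely absent.
        return all(self.bits[hash((salt, x)) % self.m] for salt in self.salts)

bf = BloomFilter(m=8192, k=5)
bf.add("alice")
print("alice" in bf, "bob" in bf)   # True, almost certainly False
```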
Classic Bloom filters do not support deletion, since clearing bits would silently introduce false negatives. The counting Bloom filter replaces each bit with a small counter, allowing deletes at the cost of more memory. The cuckoo filter, introduced by Bin Fan, Dave Andersen, Michael Kaminsky, and Michael Mitzenmacher at ACM CoNEXT 2014, stores small fingerprints in a cuckoo hash table. It supports deletions, is faster on lookups, and at the low false positive rates typical in practice (roughly 3% and below) uses less space than even space-optimized Bloom filters.
The count-distinct problem asks how many unique items appeared in a stream. Flajolet and Martin's 1985 algorithm hashed each element and tracked the maximum number of leading zeros, an idea that exploits the fact that streams of n uniform random hashes are likely to produce one with about log₂(n) leading zeros. LogLog refined this in 2003, and HyperLogLog (HLL), published by Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier at AofA 2007, uses the harmonic mean of register estimates to achieve a standard error of about 1.04/√m, where m is the number of registers. The original paper showed that HLL can estimate cardinalities well beyond 10⁹ with about 2% accuracy using only 1.5 KB of memory.
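A stripped-down HyperLogLog estimator shows the register update and the harmonic-mean combination. This sketch omits the small- and large-range corrections of the full algorithm (and the HLL++ bias tables), so treat it as illustrative only; the bias constant alpha below is the standard large-m approximation.

```python
import hashlib

def hll_estimate(items, p=12):
    m = 1 << p                  # m = 2^p registers
    reg = [0] * m
    for x in items:
        h = int.from_bytes(
            hashlib.blake2b(str(x).encode(), digest_size=8).digest(), "big")
        j = h >> (64 - p)                        # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        rho = (64 - p) - rest.bit_length() + 1   # position of the leftmost one-bit
        reg[j] = max(reg[j], rho)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias constant, valid for large m
    return alpha * m * m / sum(2.0 ** -r for r in reg)

print(hll_estimate(range(1_000_000)))   # ~1e6; stddev about 1.04/sqrt(4096) ≈ 1.6%
```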
Google's HyperLogLog++ variant powers APPROX_COUNT_DISTINCT in BigQuery and Google Analytics. In one published benchmark, BigQuery counted 3 billion distinct values in 28 seconds with exact COUNT(DISTINCT) and in 5.7 seconds with HLL++, with the HLL result only 0.2% off the true value. HLL sketches are mergeable, which is why they show up in Druid, Presto, ClickHouse, Redshift, Snowflake, and almost every modern OLAP engine.
The Count-Min sketch, published by Graham Cormode and S. Muthukrishnan in the Journal of Algorithms in 2005, estimates how many times a given element has appeared. It maintains a d by w matrix of counters; each insertion increments one counter per row using d independent hash functions, and a query takes the minimum of the d counters indexed by the queried key. The estimator never underestimates, only overestimates, and the space-accuracy guarantee is the canonical (ε, δ) bound: with width w = ⌈e/ε⌉ and depth d = ⌈ln(1/δ)⌉, the estimate exceeds the true count by at most ε · N (where N is the stream length) with probability 1 − δ.
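To make the dimensioning concrete, here is a minimal Count-Min sketch following the w = ⌈e/ε⌉, d = ⌈ln(1/δ)⌉ recipe above. The salted built-in `hash` stands in for d pairwise-independent hash functions, so this is a sketch of the idea, not a production implementation.

```python
import math
import random

class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.w = math.ceil(math.e / eps)         # width  = ceil(e / epsilon)
        self.d = math.ceil(math.log(1 / delta))  # depth  = ceil(ln(1 / delta))
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def add(self, x, count=1):
        for row, salt in zip(self.table, self.salts):
            row[hash((salt, x)) % self.w] += count

    def query(self, x):
        # Never underestimates; exceeds the true count by at most eps * N
        # with probability at least 1 - delta.
        return min(row[hash((salt, x)) % self.w]
                   for row, salt in zip(self.table, self.salts))
```

With eps = 0.001 and delta = 0.01, for example, this allocates w = 2719 columns and d = 5 rows, about 14,000 counters regardless of how many distinct keys the stream contains.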
Count Sketch, by Moses Charikar, Kevin Chen, and Martin Farach-Colton in 2002, predates Count-Min and uses signed counters; it produces unbiased estimates with two-sided error and is more accurate for skewed distributions. The AMS sketch of Alon, Matias, and Szegedy estimates the second frequency moment F₂ = Σ fᵢ², which is essentially the squared L2 norm of the frequency vector and a building block for many later sketches.
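A toy AMS-style estimator for F₂ clarifies the signed-counter idea. True AMS requires 4-wise independent ±1 hash functions; the parity of a salted built-in hash below is only a stand-in, so this is illustrative rather than faithful to the paper's guarantees.

```python
import random
import statistics

def ams_f2(stream, rows=64, seed=0):
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(rows)]
    sums = [0] * rows
    for x in stream:
        for i, salt in enumerate(salts):
            sums[i] += 1 if hash((salt, x)) & 1 else -1   # random +/-1 sign per key
    # Each Z_i^2 is an unbiased estimate of F2 = sum of squared frequencies;
    # averaging rows (or taking a median of row-means) reduces the variance.
    return statistics.mean(z * z for z in sums)
```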
The heavy-hitters problem asks for the most frequent items, not the count of every item. Misra-Gries (1982) is a simple deterministic counter-based summary that finds all items appearing more than n/k times. The SpaceSaving algorithm of Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi (2005) is its more accurate descendant; empirical studies consistently show SpaceSaving outperforms Misra-Gries and linear sketches in utility. RedisBloom exposes a TOPK data type built on SpaceSaving.
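SpaceSaving is short enough to state in full. A minimal version with k counters, as below: every item occurring more than n/k times in a stream of length n is guaranteed to be retained, and each reported count overestimates the truth by at most n/k.

```python
def space_saving(stream, k):
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # Evict the minimum counter; the newcomer inherits its count,
            # which is what makes the estimate a bounded overestimate.
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1
    return counters
```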
Andrei Broder's 1997 paper "On the Resemblance and Containment of Documents," published at the Compression and Complexity of Sequences conference, introduced MinHash for estimating the Jaccard similarity between two sets. Each set is reduced to a small signature consisting of the minima of k random hash functions; the fraction of signature positions on which two sets agree is an unbiased estimator of their Jaccard similarity. Broder built a clustering of 30 million web pages and 150 GB of input from the AltaVista search engine using this technique, producing 3.6 million clusters.
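A small MinHash sketch in Python shows the estimator. Salted uses of the built-in `hash` stand in for k independent hash functions (consistent within a single process); the example sentences are invented.

```python
import random

def minhash_signature(items, k=128, seed=0):
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    # Signature = the minimum hash of the set under each of the k hash functions.
    return [min(hash((salt, x)) for x in items) for salt in salts]

def jaccard_estimate(sig_a, sig_b):
    # Fraction of agreeing positions is an unbiased estimator of Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox jumped over a lazy dog".split())
print(jaccard_estimate(minhash_signature(a), minhash_signature(b)))  # ~0.7
```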
Moses Charikar's STOC 2002 paper "Similarity Estimation Techniques from Rounding Algorithms" showed that rounding algorithms for LP and SDP relaxations correspond to locality-sensitive hashing (LSH) schemes, and introduced what is now called SimHash for estimating cosine similarity. SimHash powered Google's near-duplicate web detection at crawl scale.
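A minimal random-hyperplane sketch in the SimHash spirit, with invented toy dimensions: each signature bit records which side of a random hyperplane a vector falls on, and the probability that two signatures disagree in a bit equals θ/π, so the Hamming distance estimates the angle.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 300, 256                       # vector dimension, signature bits (illustrative)
planes = rng.standard_normal((k, d))  # one random hyperplane per signature bit

def signature(v):
    return planes @ v >= 0            # k boolean values

u, v = rng.standard_normal(d), rng.standard_normal(d)
ham = np.mean(signature(u) != signature(v))  # fraction of disagreeing bits
est_cos = np.cos(np.pi * ham)                # estimated cosine similarity
true_cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
print(est_cos, true_cos)
```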
Quantile sketches answer rank queries: what is the median, the 95th percentile, the 99.9th percentile? Michael Greenwald and Sanjeev Khanna's 2001 SIGMOD paper gave the GK sketch, a deterministic algorithm using O(ε⁻¹ log(εn)) space; this bound was shown optimal in 2020. t-digest, designed by Ted Dunning and widely used in Elasticsearch, Spark, and Druid, often achieves better practical accuracy than KLL on real-world data, especially at extreme quantiles, but lacks worst-case guarantees. The KLL sketch, by Zohar Karnin, Kevin Lang, and Edo Liberty at FOCS 2016, resolved the long-standing question of optimal quantile approximation in the streaming model and is the default quantile sketch in Apache DataSketches.
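A short usage sketch with the Apache DataSketches Python bindings (`pip install datasketches`); the method names follow that library's documented API, and the exponential latency stream is invented for illustration.

```python
import random
from datasketches import kll_floats_sketch

sk = kll_floats_sketch(200)             # k = 200 controls the accuracy/size trade-off
for _ in range(1_000_000):
    sk.update(random.expovariate(1.0))  # e.g. a stream of latency samples

print(sk.get_quantile(0.5))    # median
print(sk.get_quantile(0.999))  # P99.9, with an additive rank-error guarantee
```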
| Application | Sketch used | Notes |
|---|---|---|
| LLM training data deduplication | MinHash + LSH | Used in The Pile, RefinedWeb, FineWeb, RedPajama, GPT-3, Gopher, Llama 3, OLMo |
| Web crawl near-duplicate detection | SimHash | Originated at Google |
| Database COUNT(DISTINCT) | HyperLogLog / HLL++ | BigQuery, Redshift, Snowflake, ClickHouse, Druid |
| A/B testing unique users | HyperLogLog | Mergeable across experiment buckets |
| Recommendation candidate generation | MinHash, SimHash | Sub-linear nearest neighbor |
| Online learning feature counts | Count-Min sketch, feature hashing | Vowpal Wabbit, online click prediction |
| Genome and metagenome distance | MinHash | Mash (Ondov 2016) clustered all 54,118 NCBI RefSeq genomes in 33 CPU hours |
| Network heavy-hitter detection | Count-Min sketch, SpaceSaving | DDoS detection, traffic engineering |
| Streaming analytics quantiles | t-digest, KLL, GK | Latency P99 monitoring at scale |
| Database join cardinality estimates | HyperLogLog, AGMS | Query optimizer cost models |
Large language model training corpora are dominated by Common Crawl scrapes that contain massive numbers of near-duplicates: boilerplate, scraped versions of the same article, mirror pages, near-identical templates. Deduplication is one of the largest single levers on training quality. Removing duplicates makes models emit memorized text about ten times less frequently and reduces the steps needed to reach a given loss.
MinHash combined with LSH is the standard tool for this job at petabyte scale because it is both mergeable and embarrassingly parallel. The 2023 RefinedWeb paper for the Falcon family computed 9,000 hashes per document over 5-grams, divided into 20 buckets of 450 hashes each. Hugging Face's FineWeb pipeline uses 112 hash functions split into 14 buckets of 8, targeting documents at least 75% similar. Similar pipelines back The Pile, RedPajama, StarCoder, OLMo, and Llama 3 training data preparation.
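A hedged sketch of the MinHash + LSH pattern using the `datasketch` library, with FineWeb-style parameters from the text (112 permutations, 75% similarity threshold). Note that `datasketch` derives its own band layout from the threshold, so the exact 14×8 banding is approximated rather than reproduced; the document keys and token lists are invented.

```python
from datasketch import MinHash, MinHashLSH

def sig(tokens, num_perm=112):
    m = MinHash(num_perm=num_perm)
    for t in tokens:
        m.update(t.encode("utf8"))
    return m

lsh = MinHashLSH(threshold=0.75, num_perm=112)
lsh.insert("doc-1", sig("the quick brown fox jumps over the lazy dog".split()))
cands = lsh.query(sig("the quick brown fox jumped over the lazy dog".split()))
print(cands)  # ["doc-1"]: a near-duplicate candidate to verify or drop
```

At corpus scale the same logic runs per shard: signatures are computed in parallel, bucketed by band, and only documents sharing a bucket are compared, which is what keeps the job embarrassingly parallel.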
The hashing trick hashes feature names directly into a fixed-size weight vector, side-stepping the need for a vocabulary. It is closely related to Count-Min: both use hash functions to keep memory bounded as the universe of distinct keys grows. Vowpal Wabbit was an early production system to push this to its limits, defaulting to a 2¹⁸ = 262,144-entry weight vector and learning to tolerate the resulting collisions through gradient updates.
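A minimal version of the hashing trick, assuming sparse (name, value) feature pairs; the 18-bit table mirrors Vowpal Wabbit's default size, and the sign bit is the standard trick for keeping collisions unbiased in expectation.

```python
def hashed_features(pairs, bits=18):
    size = 1 << bits                          # 2**18 = 262,144 slots
    vec = [0.0] * size
    for name, value in pairs:
        h = hash(name)                        # stand-in for a fixed string hash
        sign = 1.0 if (h >> 1) & 1 else -1.0  # signed updates cancel collision bias
        vec[h % size] += sign * value
    return vec

x = hashed_features([("word=hello", 1.0), ("word=world", 1.0), ("bias", 1.0)])
```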
Brian Ondov and colleagues introduced Mash in Genome Biology (2016), applying MinHash to k-mers from biological sequences. Mash enabled clustering of all 54,118 NCBI RefSeq genomes in 33 CPU hours and real-time database search against assembled or unassembled Illumina, PacBio, and Oxford Nanopore reads. It triggered a wave of genomic sketching tools including BinDash, sourmash, and skani.
| Library | Language | Sketches included |
|---|---|---|
| Apache DataSketches | Java, C++, Python | HLL, Theta, KLL, REQ, T-Digest, Tuple, Quantiles |
| datasketch | Python | MinHash, LSH, LSH Forest, LSH Ensemble, Weighted MinHash, HyperLogLog, HyperLogLog++, HNSW |
| RedisBloom | Redis module | Bloom, Cuckoo, Count-Min, Top-K, T-Digest |
| postgresql-hll | PostgreSQL extension | HyperLogLog (Aggregate Knowledge port) |
| tdigest | PostgreSQL extension | T-Digest |
| ClickHouse uniqHLL12, quantilesTDigest | Built-in | HLL, T-Digest |
| Spark MLlib BloomFilter, CountMinSketch | Built-in | Bloom, Count-Min |
| BigQuery APPROX_COUNT_DISTINCT, APPROX_QUANTILES | Built-in | HLL++, KLL |
Apache DataSketches began as an internal Yahoo project in 2012, was open-sourced in 2015, and became a top-level Apache project. The core team included Lee Rhodes, Jon Malkin, and Alex Saydakov at Yahoo, with academic contributors including Justin Thaler (Georgetown) and Edo Liberty (then at Yahoo, later AWS). Every algorithm in the library produces mergeable summaries with formal accuracy guarantees, which is the property that lets it slot into BigQuery, Druid, and Spark without losing correctness when results from many partitions are combined.
Sketching overlaps with several adjacent ideas without being identical to them. Dimensionality reduction techniques like random projection (justified by the Johnson-Lindenstrauss lemma) reduce the dimension of vectors while approximately preserving pairwise distances; the resulting reduced vectors can be viewed as sketches for L2 geometry. Approximate nearest neighbor structures like HNSW and IVF used in vector databases are sometimes grouped with sketches in practice, though they are graph- or partition-based rather than hash-based. LSH families like SimHash sit at the intersection: they produce sketches and organize the search space.
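A minimal Gaussian random projection makes the JL connection concrete; the ambient and reduced dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 256                             # ambient and reduced dimension
R = rng.standard_normal((k, d)) / np.sqrt(k)   # JL-style Gaussian projection

x, y = rng.standard_normal(d), rng.standard_normal(d)
print(np.linalg.norm(x - y))          # original distance
print(np.linalg.norm(R @ x - R @ y))  # approximately preserved after projection
```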
Sketches are also a natural fit for differential privacy. Adding small noise to sketch counters yields privatized aggregations whose error is dominated by the sketch's existing approximation noise, giving privacy almost for free. They are increasingly used in federated learning aggregation for the same reason: each client sends a small sketch of its local data, and the server merges them without ever seeing raw records.
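A deliberately simplified illustration of the noise-addition idea, assuming a Count-Min-shaped counter table with invented dimensions. Real deployments must calibrate the noise to the sketch's sensitivity (a single Count-Min update touches d counters, so the L1 sensitivity is d) and account for the estimator's bias; this snippet glosses over both.

```python
import numpy as np

rng = np.random.default_rng(0)
d, w = 5, 2719                                       # toy depth and width
table = rng.integers(0, 1_000, size=(d, w)).astype(float)
eps_dp = 1.0                                         # illustrative privacy budget
noisy = table + rng.laplace(scale=d / eps_dp, size=table.shape)
# Queries then run against `noisy` exactly as before (min over the d hashed cells).
```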
Sketches are powerful but not magic. A few practical caveats:

- The guarantees are probabilistic: with probability up to δ a query can be badly wrong, and nothing in the answer flags which queries failed.
- Error bounds are often relative to the whole input. Count-Min's additive ε · N error, for example, can dwarf the true counts of rare items on long streams.
- Accuracy parameters are fixed at construction time: tightening ε after the fact means rebuilding the sketch, and halving ε can quadruple the memory.
- Deletion support varies. Classic Bloom filters cannot remove elements without risking false negatives; counting Bloom or cuckoo filters are needed instead.
- Practical accuracy depends on hash quality and data skew; the formal bounds assume suitably independent hash functions.
Despite these caveats, sketches are a default tool of modern data infrastructure. They are how Google answers COUNT(DISTINCT) over Search query logs, how Hugging Face deduplicates the tens of terabytes behind FineWeb, and how Cloudflare detects DDoS attacks in real time. Their footprint in LLM data preparation alone has grown from a research curiosity to a load-bearing part of frontier-model pipelines.