DCLM (DataComp for Language Models)
Last reviewed
May 16, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,347 words
DCLM, short for DataComp for Language Models (also styled DataComp-LM), is an open benchmark, software framework, and family of pretraining datasets for large language models. It was introduced in June 2024 by a consortium of researchers from Apple, the University of Washington, the Toyota Research Institute, Stanford University, and roughly 21 other institutions, organized through the ML Foundations lab. The project was first described in the paper "DataComp-LM: In Search of the Next Generation of Training Sets for Language Models," posted to arXiv as preprint 2406.11794 on 17 June 2024, with Jeffrey Li and Alex Fang as lead authors among 59 co-authors.
The central artifact of the project is DCLM-Pool, a standardized, unfiltered web text corpus of approximately 240 trillion GPT-NeoX tokens extracted from every public Common Crawl snapshot collected before 2023. DCLM-Pool is paired with a fixed set of pretraining recipes built on the open_lm framework and a suite of 53 downstream evaluations, so that different filtering and curation strategies can be compared on equal footing across compute scales from 412 million up to 7 billion parameters. The headline result of the original paper was DCLM-Baseline, a filtered subset of roughly 3.8 to 4 trillion tokens that, when used to pretrain a 7B model for 2.6 trillion tokens, reached 64% five-shot accuracy on MMLU, beating MAP-Neo by 6.6 points while using 40% less compute, and approaching the open-weights performance of Mistral 7B v0.3 and Llama 3 8B with a fully open data pipeline.
DCLM sits in a small but growing family of large open web corpora released alongside training code and evaluations, next to FineWeb, Dolma, RedPajama, Nemotron-CC, the Common Pile, and the Common Corpus. It is the largest of the group in terms of unfiltered token pool, and it is unusual in shipping a benchmark on top of the data rather than just the data itself.
By mid-2024 it was widely accepted that the quality of pretraining data, not just its quantity, was a primary driver of language model performance. The data that goes into models such as GPT-4, Llama 3, and Claude is mostly proprietary, and the public open data ecosystem had grown into a patchwork of separately curated corpora using different filtering pipelines, different evaluation protocols, and different scales. C4, The Pile, RefinedWeb, RedPajama, Dolma, and FineWeb all reported strong numbers, but it was hard to know which design decisions actually mattered, because no two of them held the rest of the pipeline constant.
The ML Foundations group had already attacked a similar problem on the multimodal side. In 2023 the same lab released DataComp, a CLIP-focused benchmark in which competitors curated subsets of a fixed pool of 12.8 billion image and text pairs from Common Crawl, then trained CLIP models under a fixed compute budget and scored them on standardized zero-shot evaluations. That paper, led by Samir Yitzhak Gadre and Gabriel Ilharco and published at NeurIPS 2023, established a useful pattern: hold the model, the training recipe, and the evaluation fixed, and let the data be the variable.
DCLM is the deliberate language-model analog of that effort. Where the original DataComp tested image-text curation, DCLM tests text-only filtering, deduplication, and mixing for autoregressive language models. The motivation is to turn data work from a private craft into a reproducible science with a public leaderboard.
DCLM-Pool is the raw, unfiltered side of the benchmark. The authors reprocessed every public Common Crawl snapshot taken before 2023, extracted the main text from each web page with the Resiliparse HTML parser, applied light language identification to keep English content, and tokenized the result with the GPT-NeoX tokenizer. The output is a corpus of roughly 240 trillion tokens spanning approximately 200 billion documents. This is by some margin the largest published English web corpus drawn from Common Crawl.
DCLM-Pool is intentionally minimally filtered. The point is not to give downstream model trainers a clean dataset; it is to give researchers a common starting point from which any filtering pipeline can be reproduced and compared. The pool is hosted on the Hugging Face Hub under the CC-BY-4.0 license.
To make the benchmark tractable on smaller hardware budgets, the DCLM team defined five competition scales, each with its own pool of pre-sharded data and a fixed training token budget. Filtering recipes are applied within a scale, and Pareto frontiers are plotted across scales to see whether a recipe that wins at 400M parameters also wins at 7B.
| Scale | Parameters | Training tokens | Pool size | Approx. H100 hours |
|---|---|---|---|---|
| 400M-1x | 412M | 8.2B | 469B | 26 |
| 1B-1x | 1.4B | 28.8B | 1.64T | 240 |
| 1B-5x | 1.4B | 144B | 8.20T | 1,200 |
| 7B-1x | 6.9B | 138B | 7.85T | 3,700 |
| 7B-2x | 6.9B | 276B | 15.7T | 7,300 |
The 400M-1x scale is small enough to run on a single multi-GPU node in under a day, which is the main reason DCLM has been adopted by groups outside well-funded industrial labs. The 7B-2x scale is roughly the size of a single seed run for a frontier-adjacent open model.
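The ratios implied by the scale table can be checked with a few lines of arithmetic. The sketch below uses only the values from the table; the 8-GPU node size is an assumption made for illustration and is not a figure from the DCLM release.

```python
# Values copied from the scale table:
# (parameters, training tokens, pool tokens, approximate H100 hours).
SCALES = {
    "400M-1x": (412e6, 8.2e9, 469e9, 26),
    "1B-1x": (1.4e9, 28.8e9, 1.64e12, 240),
    "1B-5x": (1.4e9, 144e9, 8.20e12, 1200),
    "7B-1x": (6.9e9, 138e9, 7.85e12, 3700),
    "7B-2x": (6.9e9, 276e9, 15.7e12, 7300),
}

GPUS_PER_NODE = 8  # assumption for illustration; the table does not specify a node size


def scale_ratios(params, train_tokens, pool_tokens, gpu_hours):
    """Return (tokens per parameter, pool-to-training ratio, hours on one node)."""
    return (
        train_tokens / params,
        pool_tokens / train_tokens,
        gpu_hours / GPUS_PER_NODE,
    )


for name, row in SCALES.items():
    tokens_per_param, pool_ratio, node_hours = scale_ratios(*row)
    print(
        f"{name}: {tokens_per_param:.0f} tokens/param, "
        f"pool is {pool_ratio:.0f}x the training budget, "
        f"~{node_hours:.1f} h on one {GPUS_PER_NODE}-GPU node"
    )
```

With the table's numbers, each pool works out to roughly 57 times its training budget, the 1x scales sit near 20 training tokens per parameter, and the 400M-1x run takes a little over three hours on the assumed 8-GPU node.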
The DCLM repository on GitHub at mlfoundations/dclm ships three things alongside the data: a reference filtering and deduplication pipeline, a standardized training stack, and a benchmark harness.
Filtering and deduplication pipeline. The reference pipeline reads WARC files, runs Resiliparse for HTML to text extraction, performs language identification with a fastText classifier, applies heuristic quality filters derived from the RefinedWeb recipe, deduplicates with a Bloom filter at the document level, and finally runs a model-based quality classifier that scores each document. The exact thresholds are configurable, and the pipeline is designed so that swapping in a new filter or a new classifier is a small code change rather than a rewrite.
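To illustrate the deduplication step, here is a minimal sketch of document-level exact deduplication with a Bloom filter. It is not the DCLM implementation: the class name, the bit-array size, the number of hash functions, and the whitespace and case normalization are all assumptions chosen to keep the example small.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, small chance of false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, text):
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add_if_new(self, text):
        """Insert text; return True if it was (probably) not seen before."""
        positions = self._positions(text)
        seen = all(self.bits[p >> 3] & (1 << (p & 7)) for p in positions)
        for p in positions:
            self.bits[p >> 3] |= 1 << (p & 7)
        return not seen


def dedup_documents(docs, bloom=None):
    """Keep the first occurrence of each document, in input order."""
    if bloom is None:
        bloom = BloomFilter()
    kept = []
    for doc in docs:
        key = " ".join(doc.split()).lower()  # assumed normalization, for illustration
        if bloom.add_if_new(key):
            kept.append(doc)
    return kept


pages = ["Hello  world", "hello world", "Another page entirely"]
unique_pages = dedup_documents(pages)
```

The appeal of a Bloom filter at this scale is that memory is fixed up front and does not grow with the number of documents. The cost is a small, tunable false-positive rate, meaning an occasional unique document is dropped as a presumed duplicate.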
Training stack. Models are trained with open_lm, an open-source framework from the same lab originally written for decoder-only language model experiments. The DCLM authors fixed model architectures, learning rate schedules, batch sizes, and tokenizer per scale, so that a submission to the benchmark differs from a baseline only in the data fed to the trainer. open_lm supports gradient accumulation, mixed precision, FSDP, and torchrun distributed training.
Evaluation suite. DCLM defines 53 downstream evaluations grouped into a CORE set of 22 low-variance tasks and an EXTENDED set covering all 53. The CORE set is used as the primary metric for ranking submissions because it produces small standard errors at small training budgets. MMLU 5-shot accuracy is reported as a separate headline number because of its visibility in the broader LLM community.
| Category | Examples | Notes |
|---|---|---|
| Commonsense reasoning | HellaSwag, COPA, Winograd, PIQA | Mostly zero-shot |
| Reading comprehension | BoolQ, SQuAD, OpenBookQA | Closed-book where relevant |
| World knowledge | ARC-Easy, ARC-Challenge, MMLU | MMLU reported 5-shot |
| Math and symbolic | GSM8K, MATH | Reported but not in CORE |
| Code | HumanEval, MBPP | Reported separately |
| Language understanding | LAMBADA, WSC, RTE | Standard NLU subset |
The CORE score is a normalized average across the 22 stable tasks. The EXTENDED score adds the remaining 31, including harder math and code tasks where small models score near floor and the variance is high.
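One way to read "normalized average" is as a chance-centered accuracy, where each task is rescaled so that random guessing maps to 0 and a perfect score maps to 1 before averaging. The sketch below assumes that normalization. The function names and the task numbers are illustrative placeholders, not results from the paper.

```python
def centered_accuracy(accuracy, chance):
    """Rescale accuracy so that chance level maps to 0 and perfect maps to 1."""
    return (accuracy - chance) / (1.0 - chance)


def core_style_score(results):
    """Average centered accuracy over tasks, reported on a 0-100 scale.

    `results` maps a task name to (raw accuracy, chance accuracy).
    """
    centered = [centered_accuracy(acc, chance) for acc, chance in results.values()]
    return 100.0 * sum(centered) / len(centered)


# Hypothetical numbers for illustration only.
example_results = {
    "four_way_multiple_choice": (0.625, 0.25),
    "binary_choice": (0.75, 0.50),
}
example_score = core_style_score(example_results)
```

Centering matters when tasks have different numbers of answer options: a raw 50% is chance level on a binary task but well above chance on a four-way task, and averaging raw accuracies would blur that difference.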
DCLM-Baseline is the best published filtering recipe produced by the DCLM team themselves, intended both as a strong starting point for the community and as a sanity check on the benchmark. It is also the artifact most widely used outside the project.
The pipeline takes DCLM-Pool, applies RefinedWeb-style heuristic filters, deduplicates with a Bloom filter, and then runs a fastText binary classifier trained to distinguish high-quality content from generic web text. The positive class for the classifier is built from a mixture of OpenHermes 2.5 instruction-style data and threads from the r/ExplainLikeImFive subreddit, both of which the authors argue are clear, well-written, and information-dense. The negative class is sampled from heuristically cleaned web text. The classifier uses bigram features and is small enough to score the entire pool in a few days on commodity CPUs.
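As a rough illustration of how training data for such a binary classifier is usually prepared, the sketch below formats documents in the `__label__` line format that fastText's supervised mode expects. The label names `hq` and `lq`, the helper name, and the sample texts are hypothetical and are not taken from the DCLM code.

```python
def to_fasttext_lines(texts, label):
    """Format documents as fastText supervised lines: '__label__<label> <text>'."""
    lines = []
    for text in texts:
        flat = " ".join(text.split())  # one example per line, so collapse newlines
        if flat:
            lines.append(f"__label__{label} {flat}")
    return lines


# Placeholder examples standing in for the positive and negative classes.
positives = ["Explain like I'm five:\nwhy is the sky blue?"]
negatives = ["click here   to win a prize"]

training_lines = to_fasttext_lines(positives, "hq") + to_fasttext_lines(negatives, "lq")

# With the fasttext Python package installed, a file containing these lines could
# be passed to fasttext.train_supervised(input=path, wordNgrams=2); wordNgrams=2
# turns on the bigram features described above.
```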
Documents are sorted by the classifier's probability score and only the top decile is retained, with the exact threshold set at 0.018112 in the released code. The result is DCLM-Baseline 1.0, a corpus of approximately 3.8 trillion tokens across about 3 billion documents. The authors report that the fastText filter alone confers a roughly 3.5 point gain on the CORE metric over heuristic filtering, which makes it the single largest contribution of the recipe.
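The selection step can be expressed in two equivalent ways: rank the documents and keep a fixed fraction, or keep everything at or above a fixed score. The sketch below shows both. The function names are hypothetical, and whether the released code uses an inclusive or strict comparison at the threshold is an assumption here.

```python
def keep_top_fraction(scored_docs, fraction=0.10):
    """Sort (doc, score) pairs by score and keep the highest-scoring fraction."""
    ranked = sorted(scored_docs, key=lambda item: item[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def keep_above_threshold(scored_docs, threshold=0.018112):
    """Keep (doc, score) pairs at or above a fixed score (inclusive is assumed)."""
    return [(doc, score) for doc, score in scored_docs if score >= threshold]


# Synthetic scores for illustration; real scores come from the classifier.
scored = [(f"doc{i}", i / 100) for i in range(20)]
top_decile = keep_top_fraction(scored)
above_cutoff = keep_above_threshold(scored)
```

A fixed threshold is the more practical form for a pool of this size, since each shard can be filtered independently without a global sort. The threshold only corresponds to a given fraction for the score distribution it was calibrated on, which is why the synthetic scores above do not reproduce a 10% cut.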
For the public 7B model release, DCLM-Baseline was combined with code data from StarCoder and mathematical data from ProofPile2 to produce a final 4.1 trillion token mixture. This combined corpus, not the bare baseline, is what was used to train the published DCLM-7B model.
DCLM-Baseline 1.0 is released on Hugging Face at mlfoundations/dclm-baseline-1.0 under CC-BY-4.0.
Alongside the framework and the dataset, the team released DCLM-7B, a decoder-only Transformer language model trained from scratch on the DCLM-Baseline mixture. The model is hosted under the apple namespace on Hugging Face at apple/DCLM-7B, with a long-context variant at apple/DCLM-7B-8k.
| Property | Value |
|---|---|
| Parameters | 6.9 billion |
| Layers | 32 |
| Hidden size | 4,096 |
| Attention heads | 32 |
| Context length | 2,048 tokens (8,192 in the 8k variant) |
| Tokenizer | GPT-NeoX |
| Pretraining tokens | 2.5 trillion (2.6T core, then continued training) |
| Training data | DCLM-Baseline + StarCoder + ProofPile2 (4.1T mixture) |
| Training hardware | H100 GPUs |
| Batch size | 2,048 sequences |
| Framework | open_lm |
| License | Apple Sample Code License (apple-ascl) |
The 8k variant was produced with a continued pretraining stage at extended context length using a small amount of long-context data, not by training from scratch at 8k tokens.
All scores below are reported in the DCLM paper and on the Hugging Face model card, on the standard 5-shot MMLU and zero-shot CORE benchmarks. Numbers for non-DCLM models are as reported by their respective authors.
| Model | Open data | Tokens trained | MMLU (5-shot) | CORE |
|---|---|---|---|---|
| DCLM-7B (2.6T) | Yes | 2.6T | 63.7 | 57.1 |
| DCLM-7B (2.5T) | Yes | 2.5T | 57.7 (0-shot) / 63.7 (few-shot) | 56.1 |
| MAP-Neo 7B | Yes | 4.5T | 57.1 | 50.2 |
| OLMo 1.7 7B | Yes | 2.05T | 54.0 | 47.0 |
| Falcon 7B | Partial | 1.5T | 27.8 | 44.1 |
| Mistral 7B v0.3 | No | not disclosed | ~63 | similar to DCLM |
| Llama 3 8B | No | 15T | ~66 | higher, with 6.6x more compute |
The headline comparison in the paper is against MAP-Neo, the strongest fully open-data 7B model at the time. DCLM-7B beats it by 6.6 points on MMLU while using roughly 40% less training compute. The comparison against Llama 3 8B is more nuanced: Llama 3 was trained on 15T proprietary tokens with about 6.6 times more compute, and DCLM-7B performs comparably on the CORE evaluation, but Llama 3 still leads on MMLU and on instruction-following tasks not measured by CORE.
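The compute figures quoted above can be sanity-checked with the common rule of thumb that training compute is about 6 × parameters × tokens. The sketch below uses that approximation, which is an assumption rather than the paper's own accounting. It takes 6.9B parameters for DCLM-7B from the specification table and reads the MAP-Neo and Llama 3 parameter counts off their nominal "7B" and "8B" names.

```python
def training_flops(params, tokens):
    """Approximate training compute with the 6 * N * D rule of thumb."""
    return 6 * params * tokens


dclm = training_flops(6.9e9, 2.6e12)     # DCLM-7B, 2.6T-token run
map_neo = training_flops(7e9, 4.5e12)    # nominal 7B parameters, 4.5T tokens
llama3 = training_flops(8e9, 15e12)      # nominal 8B parameters, 15T tokens

savings_vs_map_neo = 1 - dclm / map_neo  # fraction of compute saved
llama3_multiple = llama3 / dclm          # how many times more compute Llama 3 used

print(f"DCLM-7B uses {savings_vs_map_neo:.0%} less compute than MAP-Neo 7B")
print(f"Llama 3 8B uses {llama3_multiple:.1f}x the compute of DCLM-7B")
```

Under these assumptions the estimate comes out at a little over 40% less compute than MAP-Neo and between six and seven times less than Llama 3 8B, in line with the rounded figures in the text.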
DCLM is the second project in the DataComp series. The first, simply called DataComp, was introduced at NeurIPS 2023 by Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, and 31 co-authors in the paper "DataComp: In Search of the Next Generation of Multimodal Datasets." That benchmark used a fixed candidate pool of 12.8 billion image-text pairs scraped from Common Crawl, fixed CLIP training recipes, and a fixed set of 38 zero-shot evaluations. Its headline result was DataComp-1B, a 1.4 billion image-text subset that, when used to train a CLIP ViT-L/14 from scratch, reached 79.2% zero-shot accuracy on ImageNet, beating OpenAI's original CLIP ViT-L/14 by 3.7 points at the same compute.
DCLM ports the same idea to autoregressive text. Many of the design choices are direct analogs: a large unfiltered Common Crawl pool, a fixed training framework, a fixed evaluation suite, multiple compute scales for fair comparison, and a baseline filtering recipe that the authors release as a strong starting point. Several authors appear on both papers, most notably Alex Fang and lead organizers Ludwig Schmidt and Vaishaal Shankar.
The two efforts share infrastructure (open_clip and open_lm), Common Crawl-based pools, and the broader thesis that progress on foundation models is often progress on data, and that a public benchmark with a controlled training stack is the right way to study it.
DCLM-Baseline competes most directly with the FineWeb, Nemotron-CC, RedPajama, Dolma, and Common Pile corpora. The table below compares the published characteristics of each at the time of their initial release.
| Dataset | Released | Tokens (filtered) | Source | Filtering approach | License |
|---|---|---|---|---|---|
| DCLM-Baseline 1.0 | Jun 2024 | ~3.8T | Common Crawl pre-2023 | Heuristic + Bloom dedup + fastText OpenHermes/ELI5 classifier | CC-BY-4.0 |
| DCLM-Pool | Jun 2024 | ~240T (unfiltered) | All Common Crawl pre-2023 | Resiliparse extract + light language ID only | CC-BY-4.0 |
| FineWeb | Apr 2024 | ~15T | 96 Common Crawl snapshots, 2013 to Apr 2024 | Heuristic filters + MinHash dedup, no model classifier | ODC-By 1.0 |
| FineWeb-Edu | Jun 2024 | 1.3T | Subset of FineWeb | Educational-content classifier trained on Llama 3 70B labels | ODC-By 1.0 |
| Dolma 1.7 | Apr 2024 | ~2.3T (3T raw) | Common Crawl + books + code + papers + Reddit | Heuristic + dedup + Gopher-style quality | ImpACT (medium risk) |
| RedPajama v2 | Oct 2023 | ~30T deduplicated, from 100T+ raw | 84 Common Crawl snapshots | Quality signals provided, filtering left to user | Apache 2.0 |
| Nemotron-CC | Aug 2024 | ~6.3T | Common Crawl | Heuristic + classifier + LLM rewriting | Permissive |
| Common Pile v0.1 | Jun 2025 | 8T | 30 curated permissive sources | Copyright-aware sourcing only | Mixed permissive |
A few observations are worth pulling out. DCLM-Pool is approximately 16 times larger than FineWeb in raw tokens (about 240T against about 15T), because it applies no aggressive heuristic filtering and includes every snapshot back to the beginning of Common Crawl. DCLM-Baseline is smaller than FineWeb in filtered tokens, but it scores higher on MMLU when used to pretrain models of equal size for equal token counts, mainly because of the fastText model-based filter. FineWeb-Edu later closed much of that gap by adding its own quality classifier trained on Llama 3 70B labels. The Common Pile has a different goal, prioritizing copyright-clean data over scale, and is best compared with DCLM only where permissively licensed data is a requirement.
In the months following DCLM's release, the FineWeb team published FineWeb-Edu, Nvidia released Nemotron-CC, and Hugging Face and AI2 each shipped updated mixes that explicitly cited DCLM-Baseline as a strong baseline to beat. Many subsequent papers on open pretraining data report results at the DCLM benchmark scales.
DCLM was the subject of significant attention from the machine learning press during the summer of 2024. The Apple Machine Learning Research page hosting the paper and the open-source release framed DCLM as Apple's most substantive contribution to fully open language model research to date, since Apple Intelligence and the company's internal foundation models remain closed. Outlets including MarkTechPost, AIBase, and Tom's Guide covered the 7B model release in July 2024, with most stories focusing on the gap that DCLM-7B closes against proprietary 7-to-8B models while using a fully transparent dataset.
Within the research community, DCLM was accepted as a workshop paper at the Efficient Natural Language and Speech Processing (ENLSP) Workshop at NeurIPS 2024 and was widely cited in 2024 and 2025 data-curation work. The benchmark scales have become a de facto standard for reporting open pretraining-data experiments at the 400M and 1B levels, where compute is cheap enough that academic groups can participate.
DCLM has also been a useful reference point for arguments about open versus proprietary data in AI policy discussions. It is one of the few openly released artifacts that comes close to the performance of Mistral and Llama at the 7B scale while shipping the full data pipeline, which has made it a touchstone for groups arguing that competitive open models are feasible without proprietary scraping.
The DCLM authors are explicit about several limitations of the benchmark, and a number of others have been raised in follow-up work.
The reference data class is small and idiosyncratic. The fastText classifier that drives DCLM-Baseline was trained on a positive class of OpenHermes 2.5 and r/ExplainLikeImFive samples. Both are useful sources, but they are heavily slanted toward conversational and explanatory English. Some follow-up work, including FineWeb-Edu and Nemotron-CC, replaces this with classifiers tuned to educational or instruction-following content.
Domain coverage is uneven. DCLM-Baseline carries the same biases as Common Crawl: heavy weight on Wikipedia mirrors, forums, blogs, and news; light weight on books, academic papers, and code. The published 7B model addresses this by mixing in StarCoder and ProofPile2 at train time, but the bare dataset is web-only.
Multilingual content is limited. The pipeline filters aggressively for English. Non-English documents are dropped, which makes DCLM-Pool effectively a 240T-token English corpus rather than a true multilingual web pool.
Licensing is permissive but not fully clean. Like all Common Crawl-derived corpora, DCLM-Pool and DCLM-Baseline contain copyrighted material whose redistribution under CC-BY-4.0 has not been approved by individual rights holders. Projects such as the Common Pile and the Common Corpus have positioned themselves as copyright-clean alternatives at the cost of significantly smaller scale.
Benchmark coverage is biased toward static knowledge tasks. The CORE 22 leans on commonsense reasoning, multiple-choice knowledge, and reading comprehension. It does not capture instruction-following, conversational ability, agentic tool use, or long-context reasoning, all of which matter increasingly in 2025 and 2026 model evaluations. Several research groups have proposed extensions or replacements; none have displaced DCLM as the default at the time of writing.
Baseline calculations were revised. In September 2025 the DCLM team announced a set of bug fixes in their baseline scoring code that shifted some numbers slightly. The repository notes these in its changelog and recommends comparing against the post-September-2025 baselines for new work.