DCLM (DataComp for Language Models)

Updated Jun 24, 2026

DCLM, short for DataComp for Language Models (also styled DataComp-LM), is an open benchmark, dataset, and software framework, released in June 2024, for studying how data curation affects large language model quality. Its headline result is DCLM-Baseline, a dataset built by model-based quality filtering of web text: a 7-billion-parameter model trained on it from scratch for 2.6 trillion tokens reaches 64% five-shot accuracy on MMLU, a 6.6 percentage-point gain over the prior best open-data model (MAP-Neo) with 40% less compute.^[1] DCLM was built by a consortium of researchers from Apple, the University of Washington, the Toyota Research Institute, Stanford University, and roughly 21 other institutions, organized through the ML Foundations lab. The project was first described in the paper "DataComp-LM: In Search of the Next Generation of Training Sets for Language Models," posted to arXiv as preprint 2406.11794 on 17 June 2024, with Jeffrey Li and Alex Fang as lead authors among 59 co-authors, and was accepted as a poster in the Datasets and Benchmarks Track of NeurIPS 2024.^[1]^[12]

The central artifact of the project is DCLM-Pool, a standardized, unfiltered web text corpus of approximately 240 trillion GPT-NeoX tokens extracted from every public Common Crawl snapshot collected before 2023, described by the authors as "the largest public corpus for language model training."^[1] DCLM-Pool is paired with a fixed set of pretraining recipes built on the OpenLM framework and a suite of 53 downstream evaluations, so that different filtering and curation strategies can be compared on equal footing across compute scales from 412 million up to 7 billion parameters.^[1] The headline dataset, DCLM-Baseline, is a filtered subset of roughly 3.8 to 4 trillion tokens; the 7B model pretrained on it for 2.6 trillion tokens approaches the performance of the open-weights models Mistral 7B v0.3 and Llama 3 8B (63% and 66% MMLU respectively) while keeping the data pipeline fully open.^[1]

DCLM sits in a small but growing family of large open web corpora released alongside training code and evaluations, next to FineWeb, Dolma, RedPajama, Nemotron-CC, the Common Pile, and the Common Corpus. It is the largest of the group in terms of unfiltered token pool, and it is unusual in shipping a benchmark on top of the data rather than just the data itself.

Why was DCLM created?

By mid-2024 it was widely accepted that the quality of pretraining data, not just its quantity, was a primary driver of language model performance. The training data that goes into models such as GPT-4, Llama 3, and Claude is mostly proprietary, and the public open data ecosystem had grown into a patchwork of separately curated corpora using different filtering pipelines, different evaluation protocols, and different scales. C4, The Pile, RefinedWeb, RedPajama, Dolma, and FineWeb all reported strong numbers, but it was hard to know which design decisions actually mattered, because no two of them held the rest of the pipeline constant.

The ML Foundations group had already attacked a similar problem on the multimodal side. In 2023 the same lab released DataComp, a CLIP-focused benchmark in which competitors curated subsets of a fixed pool of 12.8 billion image and text pairs from Common Crawl, then trained CLIP models under a fixed compute budget and scored them on standardized zero-shot evaluations.^[7] That paper, led by Samir Yitzhak Gadre and Gabriel Ilharco and presented as an Oral at NeurIPS 2023, established a useful pattern: hold the model, the training recipe, and the evaluation fixed, and let the data be the variable.^[7]

DCLM is the deliberate language-model analog of that effort. Where the original DataComp tested image-text curation, DCLM tests text-only filtering, deduplication, and mixing for autoregressive language models. As the authors put it, their results "highlight the importance of dataset design for training language models and offer a starting point for further research on data curation."^[1] The motivation is to turn data work from a private craft into a reproducible science with a public leaderboard.

What is DCLM-Pool?

DCLM-Pool is the raw, unfiltered side of the benchmark. The authors reprocessed every public Common Crawl snapshot taken before 2023, extracted the main text from each web page with the Resiliparse HTML parser, applied light language identification to keep English content, and tokenized the result with the GPT-NeoX tokenizer. The output is a corpus of roughly 240 trillion tokens spanning approximately 200 billion documents. This is by some margin the largest published English web corpus drawn from Common Crawl, and the paper describes it as "the largest public corpus for language model training."^[1]

DCLM-Pool is intentionally minimally filtered. The point is not to give downstream model trainers a clean dataset; it is to give researchers a common starting point from which any filtering pipeline can be reproduced and compared. The pool is hosted on the Hugging Face Hub under the CC-BY-4.0 license.^[4]

To make the benchmark tractable on smaller hardware budgets, the DCLM team defined five competition scales, each with its own pool of pre-sharded data and a fixed training token budget. Filtering recipes are applied within a scale, and Pareto frontiers are plotted across scales to see whether a recipe that wins at 400M parameters also wins at 7B.

DCLM competition scales

Scale	Parameters	Training tokens	Pool size	Approx. H100 hours
400M-1x	412M	8.2B	469B	26
1B-1x	1.4B	28.8B	1.64T	240
1B-5x	1.4B	144B	8.20T	1,200
7B-1x	6.9B	138B	7.85T	3,700
7B-2x	6.9B	276B	15.7T	7,300

At about 26 H100-hours, the 400M-1x scale is small enough to run on a single multi-GPU node in under a day, which is the main reason DCLM has been adopted by groups outside well-funded industrial labs. The 7B-2x scale, at about 7,300 H100-hours, is closer in cost to a single seed run for a frontier-adjacent open model.
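The figures in the scales table can be cross-checked with a little arithmetic. The sketch below computes two ratios for each scale: training tokens per parameter, and pool tokens per training token. The ratios are derived here from the table values and are not quoted from the paper; the function name `scale_ratios` is illustrative only.

```python
# Values copied from the competition-scales table above.
SCALES = {
    # name: (parameters, training tokens, pool tokens, approx. H100 hours)
    "400M-1x": (412e6, 8.2e9, 469e9, 26),
    "1B-1x": (1.4e9, 28.8e9, 1.64e12, 240),
    "1B-5x": (1.4e9, 144e9, 8.20e12, 1200),
    "7B-1x": (6.9e9, 138e9, 7.85e12, 3700),
    "7B-2x": (6.9e9, 276e9, 15.7e12, 7300),
}


def scale_ratios(params, train_tokens, pool_tokens):
    """Return (training tokens per parameter, pool tokens per training token)."""
    return train_tokens / params, pool_tokens / train_tokens


for name, (params, train, pool, hours) in SCALES.items():
    per_param, pool_ratio = scale_ratios(params, train, pool)
    print(f"{name:8s} tokens/param={per_param:6.1f}  "
          f"pool/train={pool_ratio:5.1f}  H100-hours={hours}")
```

On these numbers the 1x scales train on roughly 20 tokens per parameter, the 2x and 5x scales scale that budget up accordingly, and every pool is a little under 60 times the size of its training budget, which leaves room for aggressive filtering within a scale.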

How does the DCLM framework work?

The DCLM repository on GitHub at mlfoundations/dclm ships three things alongside the data: a reference filtering and deduplication pipeline, a standardized training stack, and a benchmark harness.^[3]

Filtering and deduplication pipeline. The reference pipeline reads WARC files, runs Resiliparse for HTML to text extraction, performs language identification with a fastText classifier, applies heuristic quality filters derived from the RefinedWeb recipe, deduplicates with a Bloom filter at the document level, and finally runs a model-based quality classifier that scores each document. The exact thresholds are configurable, and the pipeline is designed so that swapping in a new filter or a new classifier is a small code change rather than a rewrite.
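To make the deduplication step concrete, here is a minimal sketch of document-level deduplication with a Bloom filter. It is not the code from the DCLM repository: the class, the whitespace and case normalization, the bit-array size, and the number of hash functions are all assumptions chosen for illustration, and it only catches exact matches after normalization.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: fixed bit array, k positions derived from one SHA-256 digest."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, text):
        for pos in self._positions(text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, text):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(text))


def dedup_documents(docs, bloom=None):
    """Keep the first occurrence of each document (case and whitespace normalized)."""
    if bloom is None:
        bloom = BloomFilter()
    kept = []
    for doc in docs:
        key = " ".join(doc.lower().split())
        if key in bloom:  # probably seen before (false positives are possible)
            continue
        bloom.add(key)
        kept.append(doc)
    return kept


print(dedup_documents(["Hello world", "hello   WORLD", "Another page"]))
```

The appeal of a Bloom filter at web scale is that memory stays fixed regardless of how many documents have been seen, at the cost of occasionally discarding a document that was never seen before.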

Training stack. Models are trained with OpenLM, an open-source framework from the same lab originally written for decoder-only language model experiments. The DCLM authors fixed model architectures, learning rate schedules, batch sizes, and tokenizer per scale, so that a submission to the benchmark differs from a baseline only in the data fed to the trainer. OpenLM supports gradient accumulation, mixed precision, FSDP, and torchrun distributed training.

Evaluation suite. DCLM defines 53 downstream evaluations grouped into a CORE set of 22 low-variance tasks and an EXTENDED set covering all 53.^[1] The CORE set is used as the primary metric for ranking submissions because it produces small standard errors at small training budgets. MMLU 5-shot accuracy is reported as a separate headline number because of its visibility in the broader LLM community.

Evaluation suite at a glance

Category	Examples	Notes
Commonsense reasoning	HellaSwag, COPA, Winograd, PIQA	Mostly zero-shot
Reading comprehension	BoolQ, SQuAD, OpenBookQA	Closed-book where relevant
World knowledge	ARC-Easy, ARC-Challenge, MMLU	MMLU reported 5-shot
Math and symbolic	GSM8K, MATH	Reported but not in CORE
Code	HumanEval, MBPP	Reported separately
Language understanding	LAMBADA, WSC, RTE	Standard NLU subset

The CORE score is a normalized average across the 22 stable tasks. The EXTENDED score adds the remaining 31, including harder math and code tasks where small models score near floor and the variance is high.
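One plausible reading of "normalized average" is sketched below: each task's accuracy is rescaled so that chance performance maps to 0 and a perfect score maps to 1, and the rescaled values are then averaged. This rescaling formula is an assumption made for illustration, not a definition taken from the article's sources, and the task names and numbers are hypothetical.

```python
def centered_accuracy(accuracy, chance):
    """Rescale accuracy so that chance level maps to 0.0 and perfect maps to 1.0."""
    return (accuracy - chance) / (1.0 - chance)


def aggregate_score(results):
    """Average the centered accuracies and express the result on a 0-100 scale.

    results: mapping of task name -> (accuracy, chance accuracy), both in [0, 1].
    """
    centered = [centered_accuracy(acc, chance) for acc, chance in results.values()]
    return 100.0 * sum(centered) / len(centered)


# Hypothetical per-task results, for illustration only.
example = {
    "four_way_multiple_choice": (0.55, 0.25),
    "binary_choice": (0.70, 0.50),
    "open_ended": (0.30, 0.00),
}
print(round(aggregate_score(example), 2))
```

Under this kind of scheme a task with a high chance baseline contributes no more than one with a low baseline, which is also why a change to the baseline values used for multiple-choice tasks would shift aggregate scores.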

What is DCLM-Baseline?

DCLM-Baseline is the best published filtering recipe produced by the DCLM team themselves, intended both as a strong starting point for the community and as a sanity check on the benchmark. It is the artifact most often used outside the project. The paper's central empirical claim is blunt: "model-based filtering is key to assembling a high-quality training set."^[1]

The pipeline takes DCLM-Pool, applies RefinedWeb-style heuristic filters, deduplicates with a Bloom filter, and then runs a fastText binary classifier trained to distinguish high-quality content from generic web text. The positive class for the classifier is built from a mixture of OpenHermes 2.5 instruction-style data and threads from the r/ExplainLikeImFive subreddit, both of which the authors argue are clear, well-written, and information-dense. The negative class is sampled from heuristically cleaned web text. The classifier uses bigram features and is small enough to score the entire pool in a few days on commodity CPUs.

Documents are sorted by the classifier's probability score and only the top decile is retained, with the exact threshold set at 0.018112 in the released code. The result is DCLM-Baseline 1.0, a corpus of approximately 3.8 trillion tokens across about 3 billion documents.^[4] The authors report that the fastText filter alone confers a roughly 3.5 point gain on the CORE metric over heuristic filtering, which makes it the single largest contribution of the recipe.
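The selection step described above can be sketched as a quantile cut over classifier scores. The helper names are hypothetical and the scores are synthetic; in the released pipeline the cut-off is a fixed number (0.018112), whereas this sketch derives a cut-off from whatever scores it is given.

```python
def top_fraction_threshold(scores, keep_fraction=0.10):
    """Return the lowest score that still falls in the top keep_fraction of scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[keep - 1]


def filter_by_score(docs_with_scores, threshold):
    """Keep documents whose quality score is at or above the threshold."""
    return [doc for doc, score in docs_with_scores if score >= threshold]


# Synthetic (document id, score) pairs, for illustration only.
pairs = [(f"doc{i}", i / 100) for i in range(100)]
threshold = top_fraction_threshold([score for _, score in pairs])
kept = filter_by_score(pairs, threshold)
print(threshold, len(kept))
```

Once a threshold has been fixed, as in the released code, filtering becomes a single pass over the data with no global sort, which matters when the pool is far too large to rank in memory.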

For the public 7B model release, DCLM-Baseline was combined with code data from StarCoder and mathematical data from ProofPile2 to produce a final 4.1 trillion token mixture.^[5] This combined corpus, not the bare baseline, is what was used to train the published DCLM-7B model.

DCLM-Baseline 1.0 is released on Hugging Face at mlfoundations/dclm-baseline-1.0 under CC-BY-4.0.^[4]

The DCLM-7B model

Alongside the framework and the dataset, the team released DCLM-7B, a decoder-only Transformer language model trained from scratch on the DCLM-Baseline mixture.^[5] The model is hosted under the apple namespace on Hugging Face at apple/DCLM-7B, with a long-context variant at apple/DCLM-7B-8k.^[5]^[6]

DCLM-7B specifications

Property	Value
Parameters	6.9 billion (7B class)
Layers	32
Hidden size	4,096
Attention heads	32
Context length	2,048 tokens (8,192 in the 8k variant)
Tokenizer	GPT-NeoX
Pretraining tokens	2.5 trillion (the paper's headline 7B run reports 2.6 trillion)
Training data	DCLM-Baseline + StarCoder + ProofPile2 (4.1T mixture)
Optimizer	AdamW, peak learning rate 2e-3
Training hardware	H100 GPUs
Batch size	2,048 sequences
Framework	OpenLM (PyTorch)
License	Apple Sample Code License (apple-ascl)

The 8k variant was produced with a continued pretraining stage at extended context length using a small amount of long-context data, not by training from scratch at 8k tokens.^[6]
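The batch size and context length in the specification table imply a rough number of optimizer steps for the base model. The sketch below does that arithmetic under two assumptions that the table does not state: every sequence is filled to the full 2,048-token context, and the batch size stays constant for the whole run.

```python
BATCH_SEQUENCES = 2048   # sequences per batch, from the specification table
CONTEXT_TOKENS = 2048    # tokens per sequence for the base model
TOTAL_TOKENS = 2.5e12    # pretraining tokens listed in the table


def tokens_per_step(batch_sequences, context_tokens):
    """Tokens consumed by one optimizer step, assuming fully packed sequences."""
    return batch_sequences * context_tokens


def optimizer_steps(total_tokens, batch_sequences, context_tokens):
    """Whole optimizer steps needed to consume total_tokens at a constant batch size."""
    return int(total_tokens // tokens_per_step(batch_sequences, context_tokens))


print(tokens_per_step(BATCH_SEQUENCES, CONTEXT_TOKENS))
print(optimizer_steps(TOTAL_TOKENS, BATCH_SEQUENCES, CONTEXT_TOKENS))
```

Under these assumptions each step consumes about 4.2 million tokens and the run takes on the order of 600,000 steps.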

How does DCLM-7B compare to other open models?

All scores below are reported in the DCLM paper and on the Hugging Face model card, on the standard 5-shot MMLU and zero-shot CORE benchmarks. Numbers for non-DCLM models are as reported by their respective authors.

Model	Open data	Tokens trained	MMLU (5-shot)	CORE
DCLM-7B (2.6T)	Yes	2.6T	63.7	57.1
DCLM-7B (2.5T)	Yes	2.5T	57.7 (0-shot) / 63.7 (few-shot)	56.1
MAP-Neo 7B	Yes	4.5T	57.1	50.2
OLMo 1.7 7B	Yes	2.05T	54.0	47.0
Falcon 7B	Partial	1.5T	27.4	44.1
Mistral 7B v0.3	No	not disclosed	~63	similar to DCLM
Llama 3 8B	No	15T	~66	higher, with 6.6x more compute

The headline comparison in the paper is against MAP-Neo, the previous strongest fully open-data 7B model at the time. According to the authors, "DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute" than MAP-Neo.^[1] The comparison against Llama 3 8B is more nuanced: Llama 3 was trained on 15T proprietary tokens with substantially more compute, and the paper reports that DCLM-7B "performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B," though Llama 3 still leads on MMLU (66% vs 64%) and on instruction-following tasks not measured by CORE.^[1]
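The compute comparisons quoted above can be sanity-checked with the common rule of thumb that training a dense Transformer costs about 6 FLOPs per parameter per token. That rule is an approximation brought in here for illustration and is not stated in the article's sources; the parameter counts for MAP-Neo (7e9) and Llama 3 8B (8e9) are nominal values read off the model names.

```python
def approx_train_flops(params, tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


# Token counts come from the comparison table; parameter counts are nominal.
dclm = approx_train_flops(6.9e9, 2.6e12)
map_neo = approx_train_flops(7e9, 4.5e12)
llama3_8b = approx_train_flops(8e9, 15e12)

saving_vs_map_neo = 1.0 - dclm / map_neo
llama_multiple = llama3_8b / dclm

print(f"DCLM-7B uses about {saving_vs_map_neo:.0%} less compute than MAP-Neo 7B")
print(f"Llama 3 8B uses about {llama_multiple:.1f}x the compute of DCLM-7B")
```

The estimate lands close to the quoted "40% less compute" and "6.6x less compute" figures, although the paper's own accounting may differ in detail from this approximation.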

How does DCLM relate to the original DataComp project?

DCLM is the second project in the DataComp series. The first, simply called DataComp, was introduced at NeurIPS 2023 by Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, and 31 co-authors in the paper "DataComp: In Search of the Next Generation of Multimodal Datasets."^[7] That benchmark used a fixed candidate pool of 12.8 billion image-text pairs scraped from Common Crawl, fixed CLIP training recipes, and a fixed set of 38 zero-shot evaluations. Its headline result was DataComp-1B, a 1.4 billion image-text subset that, when used to train a CLIP ViT-L/14 from scratch, reached 79.2% zero-shot accuracy on ImageNet, beating OpenAI's original CLIP ViT-L/14 by 3.7 points at the same compute.^[7]

DCLM ports the same idea to autoregressive text. Many of the design choices are direct analogs: a large unfiltered Common Crawl pool, a fixed training framework, a fixed evaluation suite, multiple compute scales for fair comparison, and a baseline filtering recipe that the authors release as a strong starting point. Several authors appear on both papers, most notably Alex Fang and lead organizers Ludwig Schmidt and Vaishaal Shankar.

The two efforts share infrastructure (open_clip and OpenLM), Common Crawl-based pools, and the broader thesis that progress on foundation models is often progress on data, and that a public benchmark with a controlled training stack is the right way to study it.

How does DCLM compare to peer datasets?

DCLM-Baseline competes most directly with the FineWeb, Nemotron-CC, RedPajama, Dolma, and Common Pile corpora. The table below compares the published characteristics of each at the time of their initial release.

Dataset	Released	Tokens (filtered)	Source	Filtering approach	License
DCLM-Baseline 1.0	Jun 2024	~3.8T	Common Crawl pre-2023	Heuristic + Bloom dedup + fastText OpenHermes/ELI5 classifier	CC-BY-4.0
DCLM-Pool	Jun 2024	~240T (unfiltered)	All Common Crawl pre-2023	Resiliparse extract + light language ID only	CC-BY-4.0
FineWeb	Apr 2024	~15T	96 Common Crawl snapshots, 2013 to Apr 2024	Heuristic filters + MinHash dedup, no model classifier	ODC-By 1.0
FineWeb-Edu	Jun 2024	1.3T	Subset of FineWeb	Educational-content classifier trained on Llama 3 70B labels	ODC-By 1.0
Dolma 1.7	Apr 2024	~2.3T (3T raw)	Common Crawl + books + code + papers + Reddit	Heuristic + dedup + Gopher-style quality	ImpACT (medium risk)
RedPajama v2	Oct 2023	~20T raw, ~30T with quality signals	84 Common Crawl snapshots	Quality signals provided, filtering left to user	Apache 2.0
Nemotron-CC	Dec 2024	~6.3T	Common Crawl	Heuristic + classifier + LLM rewriting	Permissive
Common Pile v0.1	Jun 2025	~8 TB of text	30 curated permissive sources	Copyright-aware sourcing only	Mixed permissive

A few observations are worth pulling out. DCLM-Pool is approximately 16 times larger than FineWeb in raw tokens, simply because it does no aggressive heuristic filtering and includes every snapshot back to the beginning of Common Crawl. DCLM-Baseline is smaller than FineWeb in filtered tokens, but it scores higher on MMLU when used to pretrain models of equal size for equal token counts, mainly because of the fastText model-based filter. FineWeb-Edu later closed much of that gap by adding its own quality classifier trained on Llama 3 70B labels. The Common Pile is a different beast, focused on copyright-clean data rather than scale, and is best compared with DCLM only on permissively-licensed downstream use.

FineWeb-Edu appeared in the same month as DCLM. In the months that followed, Nvidia released Nemotron-CC, and Hugging Face and AI2 each shipped updated mixes that explicitly cited DCLM-Baseline as a strong baseline to beat. Many subsequent papers on open pretraining data have been written against the DCLM benchmark scales.

Reception and impact

DCLM was the subject of significant attention from the machine learning press during the summer of 2024. The Apple Machine Learning Research page hosting the paper and the open-source release framed DCLM as Apple's most substantive contribution to fully open language model research to date, since Apple Intelligence and the company's internal foundation models remain closed.^[2] Outlets including MarkTechPost, AIBase, and Tom's Guide covered the 7B model release in July 2024, with most stories focusing on the gap that DCLM-7B closes against proprietary 7-to-8B models while using a fully transparent dataset.^[10]^[11]

Within the research community, DCLM was accepted as a poster in the Datasets and Benchmarks Track of NeurIPS 2024 and was widely cited in 2024 and 2025 data-curation work.^[12] The benchmark scales have become a de facto standard for reporting open pretraining-data experiments at the 400M and 1B levels, where compute is cheap enough that academic groups can participate.

DCLM has also been a useful reference point for arguments about open versus proprietary data in AI policy discussions. It is one of the few openly released artifacts that comes close to the performance of Mistral and Llama at the 7B scale while shipping the full data pipeline, which has made it a touchstone for groups arguing that competitive open models are feasible without proprietary scraping.

Is DCLM open source?

Yes. DCLM is one of the more completely open releases in the open-data ecosystem: the dataset, the curation and training code, the evaluation harness, and the trained model weights are all public. DCLM-Pool and DCLM-Baseline are released on the Hugging Face Hub under the permissive CC-BY-4.0 license, the framework and reference pipeline are on GitHub at mlfoundations/dclm, and the DCLM-7B weights are published under the Apple Sample Code License at apple/DCLM-7B.^[3]^[4]^[5] This end-to-end transparency, raw pool through to weights, is the main reason DCLM is cited as a reproducibility benchmark rather than just a dataset.

Limitations and criticism

The DCLM authors are explicit about several limitations of the benchmark, and a number of others have been raised in follow-up work.

The reference data class is small and idiosyncratic. The fastText classifier that drives DCLM-Baseline was trained on a positive class of OpenHermes 2.5 and r/ExplainLikeImFive samples. Both are useful sources, but they are heavily slanted toward conversational and explanatory English. Some follow-up work, including FineWeb-Edu and Nemotron-CC, replaces this with classifiers tuned to educational or instruction-following content.

Domain coverage is uneven. DCLM-Baseline carries the same biases as Common Crawl: heavy weight on Wikipedia mirrors, forums, blogs, and news; light weight on books, academic papers, and code. The published 7B model addresses this by mixing in StarCoder and ProofPile2 at train time, but the bare dataset is web-only.

Multilingual content is limited. The pipeline filters aggressively for English. Non-English documents are dropped, which makes DCLM-Pool effectively a 240T-token English corpus rather than a true multilingual web pool.

Licensing is permissive but not fully clean. Like all Common Crawl-derived corpora, DCLM-Pool and DCLM-Baseline contain copyrighted material whose redistribution under CC-BY-4.0 has not been blessed by individual rights holders. Projects such as the Common Pile and the Common Corpus have positioned themselves as copyright-clean alternatives at the cost of significantly smaller scale.

Benchmark coverage is biased toward static knowledge tasks. The CORE 22 leans on commonsense reasoning, multiple-choice knowledge, and reading comprehension. It does not capture instruction-following, conversational ability, agentic tool use, or long-context reasoning, all of which matter increasingly in 2025 and 2026 model evaluations. Several research groups have proposed extensions or replacements; none have displaced DCLM as the default at the time of writing.

Baseline calculations were revised. In September 2025 the DCLM team announced a set of bug fixes in their baseline scoring code that shifted some numbers slightly. The repository notes that "CORE/EXTENDED scores from before Sept 2025 are not directly comparable to new results," because of corrected baseline values for multiple-choice tasks; rank orderings stayed consistent, and the changelog recommends comparing against the post-September-2025 baselines for new work.^[3]

References

1. Li, J., Fang, A., Smyrnis, G., et al. "DataComp-LM: In Search of the Next Generation of Training Sets for Language Models." arXiv:2406.11794, 17 June 2024. https://arxiv.org/abs/2406.11794
2. "DataComp-LM: In Search of the Next Generation of Training Sets for Language Models." Apple Machine Learning Research. https://machinelearning.apple.com/research/datacomp-lm-search
3. mlfoundations. "DCLM: DataComp for Language Models." GitHub repository. https://github.com/mlfoundations/dclm
4. mlfoundations. "dclm-baseline-1.0." Hugging Face dataset card. https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0
5. Apple. "DCLM-7B." Hugging Face model card. https://huggingface.co/apple/DCLM-7B
6. Apple. "DCLM-7B-8k." Hugging Face model card. https://huggingface.co/apple/DCLM-7B-8k
7. Gadre, S. Y., Ilharco, G., Fang, A., et al. "DataComp: In Search of the Next Generation of Multimodal Datasets." arXiv:2304.14108, NeurIPS 2023 (Oral). https://arxiv.org/abs/2304.14108
8. Penedo, G., Kydlicek, H., Ben Allal, L., et al. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." NeurIPS 2024 Datasets and Benchmarks Track.
9. Soldaini, L., Kinney, R., Bhagia, A., et al. "Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." Allen Institute for AI, 2024.
10. "DataComp for Language Models (DCLM): An AI Benchmark for Language Model Training Data Curation." MarkTechPost, 19 June 2024.
11. "Apple AI Released a 7B Open-Source Language Model Trained on 2.5T Tokens on Open Datasets." MarkTechPost, 21 July 2024.
12. "DataComp-LM: In search of the next generation of training sets for language models." NeurIPS 2024 Datasets and Benchmarks Track (Poster 97814). https://neurips.cc/virtual/2024/poster/97814
