Dolma

Updated Jun 22, 2026

Dolma is an open three-trillion-token English pretraining corpus released by the Allen Institute for AI (AI2) to power its fully open OLMo language models and to let researchers study how training data shapes large language models.^[1]^[2] First published on August 18, 2023 and described in a peer-reviewed paper that won the Best Resource Paper Award at ACL 2024 (Soldaini et al., arXiv:2402.00159), Dolma was distilled from roughly 200 terabytes of raw text down to an 11-terabyte curated dataset spanning web crawls, code, academic papers, books, social media, and encyclopedic sources.^[1]^[2]^[13] It ships not only as data but as an open source toolkit that lets any group rebuild or remix the corpus from scratch.^[1]^[4]

Dolma was built specifically to feed the OLMo family of fully open language models.^[5] AI2 released the data, the filtering and deduplication code, the training framework, intermediate checkpoints, and the model weights together, treating the corpus as one piece of a single reproducible artifact rather than a separate product.^[5] Among the first generation of open pretraining sets, Dolma stands out for the depth of its release: every step from raw HTML to final tokens has been documented, the toxicity and personal information filters are described in detail, and the per-document metadata is preserved so researchers can audit which sources actually fed the model.^[1] The Dolma team framed the project as a direct response to a structural problem, writing that the "lack of access to pretraining corpora alongside corresponding language models has been a major obstacle for the broader research community."^[2]

The corpus has since gone through several revisions (v1.5, v1.6, v1.7) and has been folded into the larger pretraining mixes used for OLMo 2 and later AI2 models.^[6] It has become a common reference point for studies of data quality, deduplication, and contamination, and it sits alongside The Pile, RefinedWeb, RedPajama, FineWeb, and DataComp-LM as one of the main openly available pretraining datasets used by the wider research community.

Why was Dolma created?

The project came out of an explicit complaint from AI2's OLMo team. By 2023, frontier language models were trained on multi-trillion token corpora that researchers outside the labs that built them could not see, study, or reproduce. Even nominally open models like LLaMA documented their data only at the level of broad source categories.^[1] Without access to the actual text, questions about memorization, contamination of evaluation sets, demographic bias in the corpus, or the effect of any specific filtering decision could not be answered from the outside.^[1]

The Dolma authors argued that this opacity slowed the field down. If a model produced a surprising behavior, no one could check whether it came from the architecture, the training recipe, or some quirk of the underlying text. If a deduplication trick or a quality classifier improved benchmark scores, no one could replicate the experiment without building a comparable corpus from scratch. AI2 framed Dolma as the data half of a fully open release: the corpus, the toolkit that produced it, the model trained on it, and the evaluation harness used to measure it would all live in public, under permissive terms, with enough documentation to let anyone repeat the work.^[1] AI2 said it released Dolma so that researchers could "independently create better versions of this dataset, study the relationship between the data and any model trained on it, report any issues they observe when inspecting our data, and critique our curation practices."^[2]

The practical goal was a corpus large enough to train a competitive seven-billion-parameter model under modern compute budgets, broad enough in source coverage to mirror what closed labs were rumored to use, and clean enough that the resulting model would be worth studying rather than serving as a cautionary tale. Three trillion tokens is far more than the Chinchilla-optimal training budget for a model in that size range, which at roughly 20 tokens per parameter would be about 140 billion tokens; the larger size leaves headroom for the much longer training runs the OLMo team eventually pursued.^[1]
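To put the three-trillion-token figure in context, the comparison against the Chinchilla heuristic can be worked through in a few lines. This is a back-of-the-envelope sketch: the ratio of roughly 20 training tokens per parameter is the commonly quoted rule of thumb, not a number taken from the Dolma paper, and the function name is illustrative.

```python
# Rule-of-thumb comparison only. TOKENS_PER_PARAM is an assumed heuristic
# (about 20 tokens per parameter), not a figure from the Dolma paper.
TOKENS_PER_PARAM = 20


def chinchilla_tokens(n_params: float) -> float:
    """Approximate compute-optimal token budget for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


model_params = 7e9     # seven-billion-parameter model
corpus_tokens = 3e12   # Dolma's headline size

optimal = chinchilla_tokens(model_params)
multiple = corpus_tokens / optimal

print(f"Heuristic budget for 7B parameters: {optimal / 1e9:.0f}B tokens")
print(f"Three trillion tokens is about {multiple:.0f}x that budget")
```

Under this heuristic the corpus is roughly twenty times larger than a compute-optimal budget for a 7B model, which is consistent with a corpus sized for long, over-trained runs.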

What is Dolma made of?

Dolma draws on seven main source families. The web component is by far the largest, consistent with how most modern pretraining mixes are weighted, while books, code, papers, and reference text round out the corpus.^[1]

Source	Origin	Approximate tokens in v1.6
Common Crawl web pages	Snapshots of Common Crawl processed with the CCNet pipeline	~2.4 trillion
The Stack code subset	Permissively licensed source code from GitHub via The Stack	~430 billion
C4	The web subset from the C4 corpus, re-deduplicated	~175 billion
Reddit	Submissions and comments collected via the Pushshift archive	~80 billion
peS2o	Open-access academic papers from Semantic Scholar's peS2o collection	~60 billion
Project Gutenberg	Public-domain English books	~5 billion
Wikipedia and Wikibooks	English Wikimedia dumps	~3.5 billion

Exact totals shift between releases. The original v1 announcement put the corpus at three trillion tokens across about five billion documents, and the v1.6 figures in the table above still sum to roughly three trillion.^[2] v1.7 trims the web portion while adding more code and paper data, and lands closer to 2.3 trillion tokens.^[3] The current default version, v1.7, breaks down to roughly 1,195.5 billion tokens from Dolma's own Common Crawl pipeline, 456.4 billion from RefinedWeb, 263.8 billion from StarCoder, 138.4 billion from C4, and 79.9 billion from Reddit, with the remainder drawn from arXiv, StackExchange, and other curated sources.^[3] The published paper covers v1.5 and v1.6 in detail; v1.7 was released alongside OLMo 1.7 with notes in the project's GitHub repository rather than a new paper.^[1]^[4]
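The v1.7 figures quoted above can be tallied to see how much of the mix the five named sources account for. This is a simple arithmetic sketch; the overall total of about 2.3 trillion tokens is an assumed round figure, so the remainder and the percentages are approximate.

```python
# Token counts in billions for the v1.7 sources named in the text.
v1_7_sources = {
    "Dolma Common Crawl pipeline": 1195.5,
    "RefinedWeb": 456.4,
    "StarCoder": 263.8,
    "C4": 138.4,
    "Reddit": 79.9,
}

# Assumed overall size (about 2.3 trillion tokens); only an approximate total is given.
APPROX_TOTAL_B = 2300.0

named_total = round(sum(v1_7_sources.values()), 1)
remainder = round(APPROX_TOTAL_B - named_total, 1)

for name, tokens in v1_7_sources.items():
    print(f"{name:30s} {tokens:8.1f}B  {tokens / APPROX_TOTAL_B:6.1%}")
print(f"{'Named sources':30s} {named_total:8.1f}B")
print(f"{'Other sources (approx.)':30s} {remainder:8.1f}B")
```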

A few choices shape the character of the corpus. The web tier uses Common Crawl's WARC archives processed through CCNet, which preserves more long-form text than the WET-based pipelines used by older corpora.^[1] The code tier is restricted to permissively licensed repositories from The Stack rather than the full GitHub crawl, which keeps the legal status clearer.^[1] The Reddit tier is a partial source: only top-level submissions and comments above a length threshold are included, and entire subreddits identified as toxic during pilot studies are dropped before deduplication.^[1] Books are limited to Project Gutenberg, which sidesteps the copyright disputes around the Books3 corpus that earlier projects relied on.^[1]

Language coverage is English-only in v1 through v1.7. AI2 has discussed multilingual extensions in later work, and the Dolma toolkit is language-agnostic, but the released corpus targets English alone.^[1]

How is Dolma filtered and deduplicated?

The pipeline that produced Dolma is published as the open source Dolma Toolkit.^[4] It is built around a series of independent filters that read and write the same JSONL format, which makes it easy to add or remove a step and rerun the corpus.^[4] The paper describes ablation experiments where each filter is held out in turn so the marginal contribution of every step can be measured against downstream task scores.^[1]
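The idea of independent steps that share one JSONL format, and of holding a step out for an ablation, can be sketched as follows. This is a minimal illustration and not the Dolma Toolkit's API: the helper names (`read_jsonl`, `min_length`, `drop_source`, `run_pipeline`) and the record fields are hypothetical.

```python
import io
import json
from typing import Callable, Iterable, Iterator

Doc = dict
Step = Callable[[Iterable[Doc]], Iterator[Doc]]


def read_jsonl(stream) -> Iterator[Doc]:
    """Yield one document per non-empty JSONL line."""
    for line in stream:
        if line.strip():
            yield json.loads(line)


def write_jsonl(docs: Iterable[Doc], stream) -> int:
    """Write documents back out in the same format; return how many were written."""
    written = 0
    for doc in docs:
        stream.write(json.dumps(doc) + "\n")
        written += 1
    return written


def min_length(min_chars: int) -> Step:
    """Hypothetical step: keep documents with at least min_chars characters."""
    def step(docs):
        return (d for d in docs if len(d["text"]) >= min_chars)
    return step


def drop_source(name: str) -> Step:
    """Hypothetical step: drop every document from one source."""
    def step(docs):
        return (d for d in docs if d.get("source") != name)
    return step


def run_pipeline(raw: str, steps: list[Step]) -> str:
    """Read JSONL text, apply each step in order, and return JSONL text."""
    docs = read_jsonl(io.StringIO(raw))
    for step in steps:
        docs = step(docs)
    out = io.StringIO()
    write_jsonl(docs, out)
    return out.getvalue()


raw = "\n".join(json.dumps(d) for d in [
    {"id": "1", "source": "web", "text": "A long enough paragraph of ordinary prose."},
    {"id": "2", "source": "web", "text": "Too short"},
    {"id": "3", "source": "forum", "text": "Another reasonably long piece of text."},
])

full = run_pipeline(raw, [min_length(20), drop_source("forum")])
ablated = run_pipeline(raw, [drop_source("forum")])  # length step held out
```

Because every step consumes and produces the same record shape, removing one from the list is enough to rerun the pipeline without it.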

The stages, in roughly the order they run on the web tier, are:

Language identification. A fastText classifier tags each document with a language code; documents predicted to be English with confidence above a threshold are kept, and the rest are dropped.^[1]
Quality heuristics. A set of Gopher- and C4-style rules removes pages with too few alphabetic characters, too many short lines, excessive symbol density, repeated n-grams above a threshold, or a high ratio of stop words to content words. These rules are intentionally conservative; the goal is to drop obvious junk without aggressively shaping style.^[1]
Toxicity filtering. A classifier trained on the Jigsaw toxic comment dataset scores each document. Documents above a tuned threshold are dropped from the web and Reddit tiers. The paper reports the false-positive trade-offs at several thresholds and the cumulative effect on downstream scores.^[1]
Personal information removal. Email addresses, phone numbers, and IP addresses are masked using regular expressions tuned to keep recall high without rewriting too much surrounding context. The paper acknowledges that this is a coarse approach and not a substitute for a full PII pipeline.^[1]
Deduplication. Dolma applies several rounds: URL-level dedup across snapshots, paragraph-level exact-match dedup across the whole web tier, document-level near-duplicate detection using MinHash and Locality-Sensitive Hashing, and finally cross-source dedup so that material that exists in both Wikipedia and the web crawl is not counted twice.^[1]
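The near-duplicate stage in the list above combines MinHash signatures with Locality-Sensitive Hashing. The sketch below shows the general technique using only the standard library; it is not Dolma's implementation, and the signature length, band count, and shingle size are illustrative assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64  # signature length (illustrative)
BANDS = 16       # LSH bands; rows per band = NUM_HASHES // BANDS
SHINGLE = 3      # words per shingle (illustrative)


def shingles(text: str) -> set[str]:
    """Split text into overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < SHINGLE:
        return {" ".join(words)}
    return {" ".join(words[i:i + SHINGLE]) for i in range(len(words) - SHINGLE + 1)}


def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> tuple[int, ...]:
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    items = shingles(text)
    return tuple(min(_hash(seed, s) for s in items) for seed in range(NUM_HASHES))


def candidate_pairs(signatures: dict[str, tuple[int, ...]]) -> set[tuple[str, str]]:
    """LSH banding: documents sharing any full band become candidate duplicates."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


texts = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "the quick brown fox jumps over the lazy dog near the river bank",
    "c": "completely unrelated sentence about training language models on open corpora",
}
sigs = {doc_id: minhash(text) for doc_id, text in texts.items()}
pairs = candidate_pairs(sigs)
```

Banding trades precision for speed: only documents that collide in at least one band are compared, so most unrelated pairs are never scored at all.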

Non-web tiers go through analogous but lighter pipelines. peS2o documents are filtered by license metadata and language ID, code is filtered for parseable file extensions and reasonable line lengths, and the Reddit tier inherits the toxicity and PII steps from the web pipeline.^[1]
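The regular-expression masking step, which the Reddit tier also inherits, can be illustrated with a small example. The patterns and placeholder tokens below are assumptions written for this sketch, not the expressions used in the Dolma pipeline, and they share the coarseness the paper acknowledges.

```python
import re

# Illustrative patterns only; real pipelines tune these against labelled data.
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
IP_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
PHONE_RE = re.compile(
    r"(?<!\w)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\w)"
)


def mask_pii(text: str) -> tuple[str, dict[str, int]]:
    """Replace matches with placeholder tokens and count how many were masked."""
    counts = {}
    for label, pattern in (("EMAIL", EMAIL_RE), ("IP", IP_RE), ("PHONE", PHONE_RE)):
        text, n = pattern.subn(f"[{label}]", text)
        counts[label] = n
    return text, counts


masked, counts = mask_pii(
    "Contact jane.doe@example.com or 555-123-4567 from 192.168.0.1."
)
print(masked)
print(counts)
```

Masking in place, rather than dropping the document, leaves the surrounding context untouched, which is the trade-off the stage description points to.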

Two design decisions deserve a note. First, the team chose not to apply a learned quality classifier on top of the heuristic filters in the released corpus, which differentiates Dolma from later projects like FineWeb-Edu and DCLM that lean heavily on classifier-based selection. The Dolma paper presents this as a deliberate choice: with a learned classifier, the quality signal becomes harder to interpret and reproducing the corpus requires the classifier itself, which complicates the open-data story.^[1] Second, the team logged not only the documents that survived but also the rules that dropped each one, so anyone using the toolkit can recover counts of how often each filter fired.^[1]
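Recording which rule removed each document is what makes per-filter counts recoverable. The following sketch shows one way to do that with two simplified heuristics; the rule names and thresholds are hypothetical and are not the values used for Dolma.

```python
from collections import Counter


def alpha_ratio(text: str) -> float:
    """Share of characters that are alphabetic."""
    return sum(c.isalpha() for c in text) / max(len(text), 1)


def short_line_fraction(text: str, max_len: int = 20) -> float:
    """Share of non-empty lines shorter than max_len characters."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 1.0
    return sum(len(ln) < max_len for ln in lines) / len(lines)


# (name, predicate) pairs; a predicate returning True means "drop this document".
# Thresholds are illustrative assumptions.
RULES = [
    ("too_few_alpha", lambda t: alpha_ratio(t) < 0.5),
    ("too_many_short_lines", lambda t: short_line_fraction(t) > 0.8),
]


def apply_rules(docs):
    """Return surviving documents plus a count of how often each rule fired.

    Only the first rule that matches a document is recorded.
    """
    kept, fired = [], Counter()
    for doc in docs:
        reason = next((name for name, rule in RULES if rule(doc["text"])), None)
        if reason is None:
            kept.append(doc)
        else:
            fired[reason] += 1
    return kept, fired


docs = [
    {"id": "a", "text": "This is a perfectly ordinary sentence with enough letters to pass."},
    {"id": "b", "text": "1234 5678 $$$ ### 9012 %%%"},
    {"id": "c", "text": "Home\nAbout\nContact\nLogin\nSign up"},
]
kept, fired = apply_rules(docs)
print([d["id"] for d in kept], dict(fired))
```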

What versions of Dolma exist?

Dolma has been published as a sequence of incremental releases, each tied to an OLMo training run.^[1] Researchers using Dolma should be careful to specify which version they mean; results from one version do not always carry over to the next.

Version	Release date	Headline change	Used to train
Dolma v1.0	August 2023	First public release at ~3 trillion tokens, five billion documents	OLMo 1B and 7B preview runs
Dolma v1.5	January 2024	Refined web pipeline, broader source coverage, became the corpus described in the ACL paper	OLMo 7B (February 2024 release)
Dolma v1.6	February 2024	More aggressive cross-source deduplication, better metadata	OLMo 7B Twin 2T
Dolma v1.7	April 2024	Reweighted mix with more code and paper data, extra web snapshots	OLMo 1.7 7B
OLMo 2 mix ("Dolma 2" or DCLM-style mix)	November 2024	Mixed Dolma with DCLM-Baseline-1.0 and other curated sources	OLMo 2 7B and 13B
OLMo 2 32B mix	March 2025	Further refined multi-source pretraining mix	OLMo 2 32B

The ACL 2024 paper covers v1.5 and v1.6.^[1] v1.7 and the OLMo 2 mixes are documented in the OLMo technical reports and the AI2 GitHub repositories rather than as new corpus papers.^[6] AI2 has been careful to keep older versions available so that prior OLMo checkpoints can still be reproduced from scratch.^[4]

Is Dolma open source, and how do you access it?

Dolma was originally released in August 2023 under AI2's tiered ImpACT license, which placed the corpus at a medium-risk tier that permitted research and commercial use while banning applications such as surveillance, weapons development, and unauthorized law enforcement decisions.^[7] On April 15, 2024, alongside the v1.7 release, AI2 relicensed the corpus to the fully permissive Open Data Commons Attribution License (ODC-BY), aligning it with the attribution-only licenses used by C4, RefinedWeb, FineWeb, and other open corpora.^[3] The dataset card states that "we have updated the license of Dolma to ODC-BY," which only requires attribution and lifts the use-restriction tiers that ImpACT imposed.^[3] Users remain subject to the underlying licenses of the source datasets that Dolma was built from.^[3]

The data is hosted on the Hugging Face Hub under the allenai/dolma repository and mirrored on AI2's own infrastructure.^[3] Files are stored as gzipped JSONL with one document per line; per-document metadata records the source, the original URL or document identifier, and which filters were applied.^[3] The full v1.6 corpus is roughly 4.5 terabytes after compression, and the curated dataset as a whole was distilled from about 200 terabytes of raw text down to roughly 11 terabytes.^[3]^[13] At that size, streaming access through Hugging Face's datasets library is the standard way to consume the corpus for training. The Dolma Toolkit repository on GitHub (allenai/dolma) ships with a command-line interface, configuration files reproducing every released version, and example pipelines for tagging custom corpora.^[4]
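Because each shard is gzipped JSONL with one document per line, a shard can be read lazily without loading it into memory. The sketch below builds a tiny stand-in shard so that it runs offline; the field names (`id`, `text`, `source`) and the file name are assumptions for illustration and should be checked against the dataset card before use on real files.

```python
import gzip
import json
import os
import tempfile
from collections import Counter


def iter_documents(path):
    """Stream documents from a gzipped JSONL shard, one JSON object per line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Stand-in shard with assumed field names, written to a temporary directory.
sample = [
    {"id": "doc-1", "source": "common-crawl", "text": "First web page."},
    {"id": "doc-2", "source": "wiki", "text": "An encyclopedia entry."},
    {"id": "doc-3", "source": "common-crawl", "text": "Second web page."},
]
shard_path = os.path.join(tempfile.mkdtemp(), "shard-0000.json.gz")
with gzip.open(shard_path, "wt", encoding="utf-8") as handle:
    for doc in sample:
        handle.write(json.dumps(doc) + "\n")

# Audit which sources appear in the shard without holding it all in memory.
source_counts = Counter(doc["source"] for doc in iter_documents(shard_path))
print(source_counts)
```

The same generator pattern applies whether a shard has been downloaded locally or is read through a streaming loader, since both yield one record at a time.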

How is Dolma used in OLMo and other models?

Dolma's primary user is the OLMo family of fully open models.^[5] OLMo 1B and OLMo 7B, released in February 2024, were each trained on roughly 2.5 trillion tokens of Dolma.^[5] OLMo 7B Twin 2T was a parallel run on a slightly different mix that AI2 released to study the variance of large training runs.^[5] OLMo 1.7 7B (April 2024) used Dolma v1.7 and reached substantially higher scores on standard benchmarks than the original OLMo 7B, which AI2 attributed largely to the data changes rather than to any architectural difference.

The November 2024 OLMo 2 release introduced a new pretraining mix that combined Dolma with other curated sources, including DCLM-Baseline-1.0.^[6] AI2 describes this mix as the successor to Dolma rather than a replacement; the toolkit and the older corpora remain available, and the OLMo 2 documentation gives the exact mixture proportions for anyone who wants to reproduce a run.^[6] OLMo 2 32B, released in March 2025, used a further-refined version of this mix and remains the largest fully open model trained on a Dolma-derived corpus.^[6]

Beyond AI2, Dolma has been used in academic studies of data attribution, memorization, and bias. Because every document in the corpus carries its source metadata, researchers can ask questions like "how often does a specific subreddit appear in the training data of this model" or "which Wikipedia articles are most likely to be memorized" without needing private access to the lab that trained the model.^[1] This kind of study is mostly impossible with closed corpora and was a central motivation for the project.^[1]

How does Dolma compare with other open pretraining datasets?

Dolma is one of several large open corpora that became available between 2019 and 2025. The table below gathers the headline numbers for the main alternatives. Token counts are approximate and depend on the tokenizer; reported figures follow the original release notes for each project.

Dataset	Released by	Year	Approximate size	Source mix	License
The Pile	EleutherAI	2020	~825 GB / 340 billion tokens	22 curated sources including web, books, papers, code	MIT (with mixed source licenses)
C4	Google / Allen Institute for AI	2019	~750 GB / ~175 billion tokens	English Common Crawl, heuristically filtered	ODC-By 1.0
RefinedWeb	TII	2023	~5 trillion tokens total, ~600 billion released	Common Crawl, heavily deduplicated	ODC-By 1.0
RedPajama v1	Together AI and partners	2023	~1.2 trillion tokens	Reproduction of LLaMA-1 mix	Apache 2.0 (per-source)
RedPajama v2	Together AI	2023	~30 trillion raw tokens	Common Crawl with quality scores	Apache 2.0
Dolma v1.7	Allen Institute for AI	2024	~2.3 trillion tokens	Web, code, papers, books, Reddit, Wikipedia	ODC-BY
FineWeb	Hugging Face	2024	~15 trillion tokens	Common Crawl with empirical filtering	ODC-By 1.0
FineWeb-Edu	Hugging Face	2024	~1.3 trillion tokens	FineWeb subset filtered for educational content	ODC-By 1.0
DataComp-LM	DCLM consortium	2024	~240 trillion raw tokens	Common Crawl with model-based selection	CC-BY-4.0

Dolma's distinguishing feature is the depth of its release rather than its raw size. FineWeb is larger, RedPajama v2 and DCLM are larger still, and several proprietary corpora used by frontier labs are believed to be larger by another order of magnitude.^[8]^[9]^[12] What Dolma offers is the toolkit, the per-document filter logs, the documented ablations, and the alignment with the OLMo training stack.^[1] For a researcher who wants to study how a specific filter changes downstream behavior, Dolma is one of the few corpora where the experiment can actually be run end to end on public data.

The trade-off is that Dolma's web tier is heuristically filtered rather than classifier-filtered, so models trained on Dolma alone tend to score below models trained on FineWeb-Edu or DCLM-Baseline at matched token counts.^[8]^[12] AI2's response in the OLMo 2 release was to mix Dolma with DCLM-Baseline rather than abandon it; the resulting corpus is treated as the operational successor.^[6]

Influence on open-data practice

The Dolma release set a baseline for what an open pretraining corpus should ship with. Before Dolma, most public corpora released the data and a paper describing it; the actual processing code was either missing or scattered across personal repositories.^[1] Dolma packaged the toolkit, the configurations, the metadata, and the corpus as a single artifact, which made it possible for other projects to inherit the pipeline rather than rebuild it.^[4] FineWeb, DCLM, and several smaller projects use Dolma's tagger format or its deduplication primitives as starting points.^[8]^[12]

The project also pushed forward the conversation about licensing for pretraining data. AI2's initial choice of the restrictive ImpACT license, followed by its 2024 switch to fully permissive ODC-BY, mirrored a broader move across the field toward attribution-only licensing for large web-scraped corpora.^[3]^[7] FineWeb and RedPajama opted for permissive licenses from the start, partly to avoid the legal complexity that tiered licenses introduce for downstream models.^[8]^[9] The debate is still live, and there is no consensus yet on the right license tier for a corpus that contains personal information, copyrighted material, and toxic content scraped from the public web.

Dolma's most lasting contribution may be its insistence on documentation. The published paper includes filter ablation tables, deduplication statistics, source breakdowns by token count and document count, and a discussion of the limits of the toxicity and PII pipelines.^[1] This level of detail, which earned the paper the Best Resource Paper Award at ACL 2024, has become a soft expectation for new corpora; FineWeb's data card and DCLM's technical report both adopt similar formats, and reviewers at major conferences increasingly ask for it.^[8]^[12]^[13]

Limitations and open questions

The Dolma authors are explicit about the corpus's limitations.^[1] The toxicity classifier is trained on Jigsaw data, which has known biases against African American English and other dialects; some of those biases propagate into what gets dropped.^[1] The PII regexes are coarse and miss obvious patterns like usernames embedded in URLs.^[1] The deduplication matches exact or near-identical text rather than meaning, so paraphrased duplicates remain.^[1] The English-language filter uses a high confidence threshold, which means some borderline-multilingual pages are dropped that a human annotator would keep.^[1]

A broader open question is how much further heuristic filtering can be pushed before classifier-based selection takes over. The OLMo 2 mixes already lean on DCLM-Baseline, which uses a learned classifier, and the gap between heuristic and classifier-filtered web data has widened over time.^[6]^[12] Whether future Dolma releases stay heuristic or adopt classifier-based selection had not been settled in AI2's public communications as of early 2026.

A second question is multilingual coverage. The current corpus is English-only, and there is real demand for non-English open pretraining data at the multi-trillion token scale.^[1] The Dolma Toolkit is language-agnostic and could in principle be applied to other languages, but no large multilingual Dolma release has been announced.^[4]

References

1. Soldaini, L., Kinney, R., Bhagia, A., Schwenk, D., Atkinson, D., Authur, R., Bogin, B., Chandu, K., Dumas, J., Elazar, Y., Hofmann, V., Jha, A. H., Kumar, S., Lucy, L., Lyu, X., Lambert, N., Magnusson, I., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M. E., Ravichander, A., Richardson, K., Shen, Z., Strubell, E., Subramani, N., Tafjord, O., Walsh, P., Zettlemoyer, L., Smith, N. A., Hajishirzi, H., Groeneveld, D., Beltagy, I., and Lo, K. (2024). "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), pages 15725-15788. arXiv:2402.00159.
2. Allen Institute for AI. (2023). "Ai2 Dolma: 3 trillion token open corpus for language model pretraining." Blog post, August 18, 2023, allenai.org/blog/dolma-3-trillion-tokens-open-llm-corpus.
3. Hugging Face. (2024). Dataset card for `allenai/dolma`. huggingface.co/datasets/allenai/dolma.
4. Allen Institute for AI. "Dolma Toolkit." GitHub repository, github.com/allenai/dolma.
5. Groeneveld, D., Beltagy, I., Walsh, P., Bhagia, A., Kinney, R., Tafjord, O., Jha, A. H., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M. E., Pyatkin, V., Ravichander, A., Schwenk, D., Shah, S., Smith, W., Strubell, E., Subramani, N., Wortsman, M., Dasigi, P., Lambert, N., Richardson, K., Zettlemoyer, L., Dodge, J., Lo, K., Soldaini, L., Smith, N. A., and Hajishirzi, H. (2024). "OLMo: Accelerating the Science of Language Models." arXiv:2402.00838.
6. OLMo 2 Team, Allen Institute for AI. (2024). "OLMo 2: The best fully open language model to date." Technical report and blog post, allenai.org.
7. Allen Institute for AI. "AI2 ImpACT License." allenai.org/licenses/impact.
8. Penedo, G., Kydlicek, H., Ben Allal, L., Lozhkov, A., Mitchell, M., Raffel, C., Von Werra, L., and Wolf, T. (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." NeurIPS 2024 Datasets and Benchmarks Track.
9. Together AI. (2023). "RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models." together.ai blog.
10. Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. (2023). "The RefinedWeb Dataset for Falcon LLM." arXiv:2306.01116.
11. Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027.
12. Li, J., Fang, A., Smyrnis, G., et al. (2024). "DataComp-LM: In Search of the Next Generation of Training Sets for Language Models." arXiv:2406.11794.
13. Association for Computational Linguistics. (2024). "Best Paper Awards: ACL 2024." 2024.aclweb.org/program/best_papers. Dolma named Best Resource Paper.
