Dolma
Last reviewed
May 2, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,434 words
Dolma is an open pretraining corpus released by the Allen Institute for AI (AI2) to support reproducible research into how training data shapes large language models. The first release in August 2023 contained roughly three trillion tokens drawn from web crawls, code, academic papers, books, social media, and encyclopedic sources. A peer-reviewed description of the corpus and its curation pipeline appeared at ACL 2024 (Soldaini et al., arXiv:2402.00159), and an accompanying open source toolkit lets other groups rebuild or remix the corpus from scratch.
Dolma was built specifically to feed the OLMo family of fully open language models. AI2 released the data, the filtering and deduplication code, the training framework, intermediate checkpoints, and the model weights together, treating the corpus as one piece of a single reproducible artifact rather than a separate product. Among the first generation of open pretraining sets, Dolma stands out for the depth of its release: every step from raw HTML to final tokens has been documented, the toxicity and personal information filters are described in detail, and the per-document metadata is preserved so researchers can audit which sources actually fed the model.
The corpus has since gone through several revisions (v1.5, v1.6, v1.7) and has been folded into the larger pretraining mixes used for OLMo 2 and later AI2 models. It has become a common reference point for studies of data quality, deduplication, and contamination, and it sits alongside The Pile, RefinedWeb, RedPajama, FineWeb, and DataComp-LM as one of the main openly available pretraining datasets used by the wider research community.
The project came out of an explicit complaint from AI2's OLMo team. By 2023, frontier language models were trained on multi-trillion token corpora that researchers outside the labs that built them could not see, study, or reproduce. Even nominally open models like LLaMA documented their data only at the level of broad source categories. Without access to the actual text, questions about memorization, contamination of evaluation sets, demographic bias in the corpus, or the effect of any specific filtering decision could not be answered from the outside.
The Dolma authors argued that this opacity slowed the field down. If a model produced a surprising behavior, no one could check whether it came from the architecture, the training recipe, or some quirk of the underlying text. If a deduplication trick or a quality classifier improved benchmark scores, no one could replicate the experiment without building a comparable corpus from scratch. AI2 framed Dolma as the data half of a fully open release: the corpus, the toolkit that produced it, the model trained on it, and the evaluation harness used to measure it would all live in public, under permissive terms, with enough documentation to let anyone repeat the work.
The practical goal was a corpus large enough to train a competitive seven-billion-parameter model under modern compute budgets, broad enough in source coverage to mirror what closed labs were rumored to use, and clean enough that the resulting model would be worth studying rather than serving as a cautionary tale. Three trillion tokens is well beyond the Chinchilla-optimal budget for a model in that size range (at the commonly cited ratio of roughly 20 training tokens per parameter, about 140 billion tokens for seven billion parameters), which left ample headroom for the longer training runs the OLMo team eventually pursued.
Dolma draws on seven main source families. The web component is by far the largest, consistent with how most modern pretraining mixes are weighted, while books, code, papers, and reference text round out the corpus.
| Source | Origin | Approximate share of v1.6 (tokens) |
|---|---|---|
| Common Crawl web pages | Snapshots of Common Crawl processed with the CCNet pipeline | ~2.4 trillion |
| The Stack code subset | Permissively licensed source code from GitHub via The Stack | ~430 billion |
| C4 | The web subset from the C4 corpus, re-deduplicated | ~175 billion |
| Reddit | Submissions and comments collected via the Pushshift archive | ~80 billion |
| peS2o | Open-access academic papers from Semantic Scholar's peS2o collection | ~60 billion |
| Project Gutenberg | Public-domain English books | ~5 billion |
| Wikipedia and Wikibooks | English Wikimedia dumps | ~3.5 billion |
Exact totals shift slightly between releases. The original v1 announcement put the corpus at three trillion tokens across about five billion documents; v1.6 lands closer to 2.3 trillion tokens after additional deduplication and filtering, and v1.7 trims the web portion further while adding more code and paper data. The published paper covers v1.5 and v1.6 in detail; v1.7 was released alongside OLMo 1.7 with notes in the project's GitHub repository rather than a new paper.
A few choices shape the character of the corpus. The web tier uses Common Crawl's WARC archives processed through CCNet, which preserves more long-form text than the WET-based pipelines used by older corpora. The code tier is restricted to permissively licensed repositories from The Stack rather than the full GitHub crawl, which keeps the legal status clearer. The Reddit tier is a partial source: only top-level submissions and comments above a length threshold are included, and entire subreddits identified as toxic during pilot studies are dropped before deduplication. Books are limited to Project Gutenberg, which sidesteps the copyright disputes around the Books3 corpus that earlier projects relied on.
Language coverage is English-only in v1 through v1.7. AI2 has discussed multilingual extensions in later work, and the Dolma toolkit is language-agnostic, but the released corpus targets English alone.
The pipeline that produced Dolma is published as the open source Dolma Toolkit. It is built around a series of independent filters that read and write the same JSONL format, which makes it easy to add or remove a step and rerun the corpus. The paper describes ablation experiments where each filter is held out in turn so the marginal contribution of every step can be measured against downstream task scores.
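The stages that run on the web tier include language identification, heuristic quality filtering, toxicity filtering, removal of personally identifiable information (PII), and deduplication.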
Non-web tiers go through analogous but lighter pipelines. peS2o documents are filtered by license metadata and language ID, code is filtered for parseable file extensions and reasonable line lengths, and the Reddit tier inherits the toxicity and PII steps from the web pipeline.
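The idea of independent filters over a shared JSONL format, with one filter held out at a time for ablation, can be sketched in a few lines. This is a minimal illustration and not the Dolma Toolkit's actual API: the filter names, the `text` field, and the `hold_out` parameter are all hypothetical, and the "blocked term" check is only a stand-in for a real toxicity step.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, Optional

# A filter takes a parsed document and returns True to keep it.
Filter = Callable[[dict], bool]

def min_words(doc: dict) -> bool:
    # Hypothetical heuristic: drop very short documents.
    return len(doc["text"].split()) >= 5

def no_blocked_terms(doc: dict) -> bool:
    # Stand-in for a toxicity step; a real pipeline would use a classifier.
    return "badword" not in doc["text"].lower()

FILTERS: Dict[str, Filter] = {
    "min_words": min_words,
    "no_blocked_terms": no_blocked_terms,
}

def run_pipeline(lines: Iterable[str], hold_out: Optional[str] = None) -> Iterator[str]:
    """Apply every filter except `hold_out` to a stream of JSONL lines."""
    active = [f for name, f in FILTERS.items() if name != hold_out]
    for line in lines:
        doc = json.loads(line)
        if all(f(doc) for f in active):
            # Output uses the same one-document-per-line format as the input.
            yield json.dumps(doc)

raw = [
    json.dumps({"id": "a", "text": "a long enough sentence with several words"}),
    json.dumps({"id": "b", "text": "too short"}),
    json.dumps({"id": "c", "text": "this one has a badword in the middle of it"}),
]

kept_all = [json.loads(line)["id"] for line in run_pipeline(raw)]
kept_ablate = [
    json.loads(line)["id"]
    for line in run_pipeline(raw, hold_out="no_blocked_terms")
]
```

Because each step reads and writes the same record shape, holding one out only changes which documents survive. Comparing `kept_all` with `kept_ablate` is the toy analogue of measuring a filter's marginal contribution.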
Two design decisions deserve a note. First, the team chose not to apply a learned quality classifier on top of the heuristic filters in the released corpus, which differentiates Dolma from later projects like FineWeb-Edu and DCLM that lean heavily on classifier-based selection. The Dolma paper presents this as a deliberate choice: with a learned classifier, the quality signal becomes harder to interpret and reproducing the corpus requires the classifier itself, which complicates the open-data story. Second, the team logged not only the documents that survived but also the rules that dropped each one, so anyone using the toolkit can recover counts of how often each filter fired.
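Recording which rule dropped each document, rather than only the survivors, might look like the sketch below. The rule names and thresholds are invented for illustration and do not correspond to the toolkit's real taggers. The point is only that a per-rule counter falls out naturally once the drop reason is recorded.

```python
from collections import Counter

def too_short(doc):
    # Hypothetical rule: fewer than five whitespace-separated words.
    return len(doc["text"].split()) < 5

def mostly_symbols(doc):
    # Hypothetical rule: fewer than half of the characters are letters.
    text = doc["text"]
    letters = sum(ch.isalpha() for ch in text)
    return letters < 0.5 * max(len(text), 1)

# Rules are checked in order; the first one that fires is logged as the reason.
RULES = [("too_short", too_short), ("mostly_symbols", mostly_symbols)]

def filter_with_log(docs):
    kept, fired = [], Counter()
    for doc in docs:
        reason = next((name for name, rule in RULES if rule(doc)), None)
        if reason is None:
            kept.append(doc)
        else:
            fired[reason] += 1
    return kept, fired

docs = [
    {"id": "1", "text": "plain readable sentence with enough words"},
    {"id": "2", "text": "hi"},
    {"id": "3", "text": "#### $$$$ %%%% ^^^^ &&&& **** (((( ))))"},
    {"id": "4", "text": "ok then"},
]

kept, fired = filter_with_log(docs)
```

Here `fired` holds how often each rule removed a document, which is the kind of count the paragraph above says can be recovered.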
Dolma has been published as a sequence of incremental releases, each tied to an OLMo training run. Researchers using Dolma should be careful to specify which version they mean; results from one version do not always carry over to the next.
| Version | Release date | Headline change | Used to train |
|---|---|---|---|
| Dolma v1.0 | August 2023 | First public release at ~3 trillion tokens, five billion documents | OLMo 1B and 7B preview runs |
| Dolma v1.5 | January 2024 | Refined web pipeline, broader source coverage, became the corpus described in the ACL paper | OLMo 7B (February 2024 release) |
| Dolma v1.6 | February 2024 | More aggressive cross-source deduplication, better metadata | OLMo 7B Twin 2T |
| Dolma v1.7 | April 2024 | Reweighted mix with more code and paper data, extra web snapshots | OLMo 1.7 7B |
| OLMo 2 mix ("Dolma 2" or DCLM-style mix) | November 2024 | Mixed Dolma with DCLM-Baseline-1.0 and other curated sources | OLMo 2 7B and 13B |
| OLMo 2 32B mix | March 2025 | Further refined multi-source pretraining mix | OLMo 2 32B |
The ACL 2024 paper covers v1.5 and v1.6. v1.7 and the OLMo 2 mixes are documented in the OLMo technical reports and the AI2 GitHub repositories rather than as new corpus papers. AI2 has been careful to keep older versions available so that prior OLMo checkpoints can still be reproduced from scratch.
The Dolma corpus is distributed under AI2's ImpACT license. ImpACT is a tiered framework that distinguishes between low-risk, medium-risk, and high-risk artifacts and applies stricter use restrictions as the risk level rises. The Dolma data sits at the medium-risk tier, which permits research and commercial use but bans applications such as surveillance, weapons development, and unauthorized law enforcement decisions. The license also requires downstream users to honor takedown requests from individuals whose data appears in the corpus and to acknowledge the source datasets that Dolma was built from.
The data is hosted on the Hugging Face Hub under the allenai/dolma repository and mirrored on AI2's own infrastructure. Files are stored as gzipped JSONL with one document per line; per-document metadata records the source, the original URL or document identifier, and which filters were applied. The full v1.6 corpus is roughly 4.5 terabytes after compression, which makes streaming access through Hugging Face's datasets library the standard way to consume it for training. The Dolma Toolkit repository on GitHub (allenai/dolma) ships with a command-line interface, configuration files reproducing every released version, and example pipelines for tagging custom corpora.
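Reading a shard in the gzipped, one-document-per-line layout described above needs only the Python standard library. The sketch below writes a tiny stand-in shard to a temporary directory and reads it back. The field names (`id`, `text`, `source`, `metadata.url`) and the file name are assumptions chosen for illustration, so check a real shard before relying on them.

```python
import gzip
import json
import os
import tempfile

def iter_documents(path):
    """Yield one parsed document per line from a gzipped JSONL shard."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)

# Build a tiny stand-in shard so the example is self-contained.
sample = [
    {"id": "doc-0", "text": "First document.", "source": "wiki",
     "metadata": {"url": "https://example.org/a"}},
    {"id": "doc-1", "text": "Second document.", "source": "reddit",
     "metadata": {"url": "https://example.org/b"}},
]
shard_path = os.path.join(tempfile.mkdtemp(), "shard-0000.json.gz")
with gzip.open(shard_path, "wt", encoding="utf-8") as handle:
    for doc in sample:
        handle.write(json.dumps(doc) + "\n")

# Documents are decoded lazily, one line at a time.
loaded = list(iter_documents(shard_path))
sources = [doc["source"] for doc in loaded]
```

Because the generator decodes one line at a time, the same loop works on shards far larger than memory. Materializing the list here is only for the demonstration.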
Dolma's primary user is the OLMo family of fully open models. OLMo 1B and OLMo 7B, released in February 2024, were each trained on roughly 2.5 trillion tokens of Dolma. OLMo 7B Twin 2T was a parallel run on a slightly different mix that AI2 released to study the variance of large training runs. OLMo 1.7 7B (April 2024) used Dolma v1.7 and reached substantially higher scores on standard benchmarks than the original OLMo 7B, which AI2 attributed largely to the data changes rather than to any architectural difference.
The November 2024 OLMo 2 release introduced a new pretraining mix that combined Dolma with other curated sources, including DCLM-Baseline-1.0. AI2 describes this mix as the successor to Dolma rather than a replacement; the toolkit and the older corpora remain available, and the OLMo 2 documentation gives the exact mixture proportions for anyone who wants to reproduce a run. OLMo 2 32B, released in March 2025, used a further-refined version of this mix and remains the largest fully open model trained on a Dolma-derived corpus.
Beyond AI2, Dolma has been used in academic studies of data attribution, memorization, and bias. Because every document in the corpus carries its source metadata, researchers can ask questions like "how often does a specific subreddit appear in the training data of this model" or "which Wikipedia articles are most likely to be memorized" without needing private access to the lab that trained the model. This kind of study is mostly impossible with closed corpora and was a central motivation for the project.
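A question like "how often does a specific subreddit appear" reduces to a count over per-document metadata. The sketch below assumes each record carries a `metadata` dictionary with a `subreddit` key for Reddit documents. That key name is hypothetical, and the four records are made up.

```python
from collections import Counter

def count_by(docs, key):
    """Count documents by a metadata field, skipping records that lack it."""
    counts = Counter()
    for doc in docs:
        value = doc.get("metadata", {}).get(key)
        if value is not None:
            counts[value] += 1
    return counts

docs = [
    {"id": "r1", "source": "reddit", "metadata": {"subreddit": "askscience"}},
    {"id": "r2", "source": "reddit", "metadata": {"subreddit": "askscience"}},
    {"id": "r3", "source": "reddit", "metadata": {"subreddit": "history"}},
    {"id": "w1", "source": "wiki", "metadata": {"title": "Dolma"}},
]

subreddit_counts = count_by(docs, "subreddit")
# Share of all documents (not only Reddit ones) from a given subreddit.
share = subreddit_counts["askscience"] / len(docs)
```

On the real corpus the same loop would run over streamed shards rather than an in-memory list, but the logic is unchanged.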
Dolma is one of several large open corpora that became available between 2020 and 2025. The table below gathers the headline numbers for the main alternatives. Token counts are approximate and depend on the tokenizer; reported figures use the original release notes for each project.
| Dataset | Released by | Year | Approximate size | Source mix | License |
|---|---|---|---|---|---|
| The Pile | EleutherAI | 2020 | ~825 GB / 340 billion tokens | 22 curated sources including web, books, papers, code | MIT (with mixed source licenses) |
| C4 | Google / Allen Institute for AI | 2019 | ~750 GB / ~175 billion tokens | English Common Crawl, heuristically filtered | ODC-By 1.0 |
| RefinedWeb | TII | 2023 | ~5 trillion total, ~600 billion released | Common Crawl, heavily deduplicated | ODC-By 1.0 |
| RedPajama v1 | Together AI and partners | 2023 | ~1.2 trillion tokens | Reproduction of LLaMA-1 mix | Apache 2.0 (per-source) |
| RedPajama v2 | Together AI | 2023 | ~30 trillion raw tokens | Common Crawl with quality scores | Apache 2.0 |
| Dolma v1.6 | Allen Institute for AI | 2024 | ~2.3 trillion tokens | Web, code, papers, books, Reddit, Wikipedia | AI2 ImpACT |
| FineWeb | Hugging Face | 2024 | ~15 trillion tokens | Common Crawl with empirical filtering | ODC-By 1.0 |
| FineWeb-Edu | Hugging Face | 2024 | ~1.3 trillion tokens | FineWeb subset filtered for educational content | ODC-By 1.0 |
| DataComp-LM | DCLM consortium | 2024 | ~240 trillion raw tokens | Common Crawl with model-based selection | CC-BY-4.0 |
Dolma's distinguishing feature is the depth of its release rather than its raw size. FineWeb is larger, RedPajama v2 and DCLM are larger still, and several proprietary corpora used by frontier labs are believed to be larger by another order of magnitude. What Dolma offers is the toolkit, the per-document filter logs, the documented ablations, and the alignment with the OLMo training stack. For a researcher who wants to study how a specific filter changes downstream behavior, Dolma is one of the few corpora where the experiment can actually be run end to end on public data.
The trade-off is that Dolma's web tier is heuristically filtered rather than classifier-filtered, so models trained on Dolma alone tend to score below models trained on FineWeb-Edu or DCLM-Baseline at matched token counts. AI2's response in the OLMo 2 release was to mix Dolma with DCLM-Baseline rather than abandon it; the resulting corpus is treated as the operational successor.
The Dolma release set a baseline for what an open pretraining corpus should ship with. Before Dolma, most public corpora released the data and a paper describing it; the actual processing code was either missing or scattered across personal repositories. Dolma packaged the toolkit, the configurations, the metadata, and the corpus as a single artifact, which made it possible for other projects to inherit the pipeline rather than rebuild it. FineWeb, DCLM, and several smaller projects use Dolma's tagger format or its deduplication primitives as starting points.
The project also pushed forward the conversation about licensing for pretraining data. By choosing ImpACT rather than a permissive license like ODC-By, AI2 took a position on what use restrictions should accompany large web-scraped corpora. Other groups disagreed: FineWeb and RedPajama opted for fully permissive licenses, partly to avoid the legal complexity that ImpACT introduces for downstream models. The debate is still live, and there is no consensus yet on the right license tier for a corpus that contains personal information, copyrighted material, and toxic content scraped from the public web.
Dolma's most lasting contribution may be its insistence on documentation. The published paper includes filter ablation tables, deduplication statistics, source breakdowns by token count and document count, and a discussion of the limits of the toxicity and PII pipelines. This level of detail has become a soft expectation for new corpora; FineWeb's data card and DCLM's technical report both adopt similar formats, and reviewers at major conferences increasingly ask for it.
The Dolma authors are explicit about the corpus's limitations. The toxicity classifier is trained on Jigsaw data, which has known biases against African American English and other dialects; some of those biases propagate into what gets dropped. The PII regexes are coarse and miss obvious patterns like usernames embedded in URLs. The deduplication is paragraph-level rather than semantic, so paraphrased duplicates remain. The English-only filter errs on the strict side, which means some borderline-multilingual pages are dropped that a human annotator would keep.
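The paragraph-level deduplication limitation is easy to see in a toy version. The sketch below uses an exact in-memory hash set over whitespace- and case-normalized paragraphs. That is a simplification chosen for clarity and is not necessarily how the toolkit implements the step. The function name and sample texts are illustrative.

```python
import hashlib

def dedupe_paragraphs(docs):
    """Drop any paragraph whose normalized text has been seen before."""
    seen = set()
    out = []
    for text in docs:
        kept = []
        for para in text.split("\n\n"):
            # Normalize case and whitespace so trivial variants collide.
            normalized = " ".join(para.lower().split())
            if not normalized:
                continue
            digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            kept.append(para)
        out.append("\n\n".join(kept))
    return out

docs = [
    "The cat sat on the mat.\n\nTerms of service apply.",
    "A different opening.\n\nTerms of  service apply.",
    "On the mat, the cat was sitting.",
]

result = dedupe_paragraphs(docs)
```

The repeated boilerplate paragraph in the second document is removed, but the third document, a paraphrase of the first sentence, survives untouched because its normalized text hashes differently.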
A broader open question is how much further heuristic filtering can be pushed before classifier-based selection takes over. The OLMo 2 mixes already lean on DCLM-Baseline, which uses a learned classifier, and the gap between heuristic and classifier-filtered web data has widened over time. Whether future Dolma releases stay heuristic or adopt classifier-based selection is unsettled in the public AI2 communications as of early 2026.
A second question is multilingual coverage. The current corpus is English-only, and there is real demand for non-English open pretraining data at the multi-trillion token scale. The Dolma Toolkit is language-agnostic and could in principle be applied to other languages, but no large multilingual Dolma release has been announced.