Common Corpus

Updated Jun 28, 2026

Common Corpus is the largest fully open, multilingual dataset for pretraining large language models, assembled and released by the French AI research lab Pleias. It contains roughly two trillion tokens of text that is either in the public domain or distributed under explicitly permissive licenses, with documented provenance for every document so that model builders can avoid copyright and data-security risks. ^[1]^[3] The accompanying research paper describes it as "the largest open dataset for LLM pre-training" whose data "are either uncopyrighted or under open licenses, totaling about two trillion tokens." ^[1]

More precisely, the canonical Hugging Face release reports about 1.998 trillion tokens across roughly 517 million documents, and the expanded version reports 2.27 trillion tokens in 4.49 terabytes of Parquet files. According to Pleias, this makes it the largest collection of fully openly licensed text ever published for language model training. ^[1]^[3] Every document is in the public domain or under an explicitly permissive license, including Creative Commons CC-BY and CC0, MIT, Apache 2.0, BSD, and the Open Data Commons family. An initial version was made public on the Hugging Face Hub on 13 November 2024 in partnership with Hugging Face, EleutherAI, Nomic AI, Occiglot, Mozilla Builders, the AI Alliance, and other collaborators. ^[2]^[4] A substantially expanded version was released in February 2025 for the Paris AI Action Summit. The research paper, titled Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training, was posted to arXiv in June 2025 and accepted as an oral presentation at the ICLR 2026 conference. ^[1]^[11]

Common Corpus is organised into six thematic partitions covering government and legal documents, cultural heritage texts, scientific publications, source code, web encyclopedic content, and semantic knowledge graphs. ^[1] According to its authors, it is the first openly licensed pretraining corpus to simultaneously satisfy four properties that competing datasets had typically traded off against one another: scale at the trillion-token level, fully permissive licensing with documented provenance, coverage across many domains beyond web crawl, and substantial multilingual representation. More than 40 percent of the data is non-English, with nine languages exceeding ten billion tokens each and more than thirty languages exceeding one billion tokens. ^[1] The corpus was used to train the Pleias 1.0 family of small language models in December 2024, which Pleias describes as the first family of competitive language models pretrained exclusively on openly licensed data. ^[6]^[7]

What is Common Corpus?

Common Corpus is an open training data collection built so that a language model can be pretrained at competitive scale without ingesting copyrighted or proprietary material. The paper frames the problem directly: "Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations." ^[1] The authors position Common Corpus as the answer to that need, describing it as "the largest fully open pre-training dataset at about 2 trillion tokens and the only one in its size range having high multilingual diversity." ^[1]

Why was Common Corpus created?

The choice of pretraining data is one of the most consequential decisions in modern language model development. Web-scraped corpora such as Common Crawl, C4, The Pile, RefinedWeb, RedPajama, Dolma, FineWeb, and Nemotron-CC have powered nearly every state-of-the-art LLM released since the early 2020s. These datasets, however, are assembled by scraping the open web and filtering out only the most obvious problems. They include very large quantities of copyrighted material that has not been explicitly licensed for reuse, ranging from news articles and book passages to web fiction and software documentation. Several high-profile lawsuits, most notably The New York Times v. OpenAI and parallel author and music publisher actions against Anthropic, Meta, and others, have placed the legality of training on such mixtures in question. Regulators have responded as well, with the European Union's AI Act requiring that providers of general-purpose AI models publish detailed summaries of their training data and respect machine-readable opt-outs.

A small but growing body of work has attempted to address this provenance problem by curating openly licensed alternatives. The Pile included a number of explicitly licensed components but mixed them with web content and copyrighted books. Books3, once a popular component of open corpora, was withdrawn after it became clear that most of its contents were pirated. Stack Exchange, Wikipedia, government publications, and certain scientific repositories have long been used as licensed seed sources, but they were generally too small to support pretraining at competitive scale. The KL3M dataset from 273 Ventures focused on administrative and legal English. The Common Pile, released by EleutherAI and collaborators in 2025, assembled eight terabytes of public domain and openly licensed text, but is English-only. ^[8]

Common Corpus was motivated by the observation that an open and ethically sourced corpus large enough to train competitive multilingual models was achievable if one drew systematically on government archives, digitised cultural heritage holdings, scientific repositories with open access policies, permissively licensed code, and structured knowledge bases. The project's stated goal was to demonstrate, with reproducible artefacts, that there is no fundamental tradeoff between legal openness and pretraining utility. The authors describe the result as a moment when "there has been sufficient knowledge and infrastructure to collect and clean a dataset on this scale, which meets the legal and ethical criteria we have outlined." ^[1]

Who built Common Corpus, and when?

The corpus was led by Pleias, a startup co-founded by Pierre-Carl Langlais, Anastasia Stasenko, and Ivan Yamshchikov in Paris in 2023. Pleias positioned itself as a research lab and small model builder focused on data ethics, archival digitisation, and specialised LLMs for cultural heritage and information science applications. The Common Corpus initiative grew out of work on Open Culture and Open Government collections that Langlais and collaborators had been assembling for several years, and was formalised in early 2024 with funding and infrastructure support from the French Ministry of Culture's ALT-EDIC programme, the GENCI Jean Zay supercomputer, the Nvidia Inception Program, and a number of cloud partners. ^[2]

The AI Alliance, a consortium founded by IBM and Meta in late 2023, adopted Common Corpus as a flagship project of its Open Trusted Data Initiative, lending coordination, governance review, and additional partners. ^[4] Hugging Face hosted the dataset and contributed engineering support. EleutherAI, Nomic AI, and Occiglot collaborated on filtering, evaluation, and multilingual coverage. Mozilla Builders supported the release through its 2024 cohort. ^[5] Wikimedia Enterprise contributed the Wikidata and YouTube Commons partitions, and Libraries Without Borders supported access to cultural heritage materials.

The project followed a phased release. A preview spanning roughly 500 billion tokens was made available in March 2024 alongside the Pleias OCRonos error correction model. The full first release on 13 November 2024 contained approximately two trillion tokens, accompanied by a detailed blog post by Langlais on Hugging Face. ^[2] A substantially curated and rebalanced version was released in February 2025 at the Paris AI Action Summit, with cleaner formatting, additional code, and expanded language coverage. By the time of the arXiv paper in June 2025, the canonical Hugging Face dataset reported about 1.998 trillion tokens across roughly 517 million documents, while the expanded version comprised 2.27 trillion tokens in 4.49 terabytes of Parquet files. ^[1]^[3]

What is in Common Corpus?

Common Corpus is divided into six top-level partitions, each corresponding to a sourcing strategy and a set of source repositories. The total token counts as reported in the June 2025 paper, computed with the Pleias tokenizer and the Gemma 3 tokenizer for non-Western languages, are summarised below. ^[1]

Partition	Tokens (approx.)	Documents (approx.)	Primary sources
Open Culture	886 billion	93.2 million	Public domain books and newspapers from cultural heritage institutions, Project Gutenberg, Wikisource, Internet Archive holdings
Open Government	407 billion	75.6 million	SEC EDGAR filings, WTO documents, Europarl proceedings, Chinese case law, Finance Commons, Legal Commons
Open Code	283 billion	202.8 million	Permissively licensed GitHub repositories filtered with the ArmoRM quality classifier, retaining the top 80 percent
Open Science	281 billion	19.2 million	OpenAlex, open access journal articles, preprints, dissertations
Open Web	73 billion	96.2 million	Wikipedia (CC-BY-SA), YouTube Commons transcripts, Stack Exchange (CC-BY-SA)
Open Semantic	68 billion	30.1 million	Wikidata statements and triplets across 300+ languages
Total	~1.998 trillion	~517 million

Open Code accounts for roughly 14.2 percent of the corpus (283 of 1,998 billion tokens) and Open Culture for slightly more than 44 percent, making cultural heritage the single largest source. Government and legal text together with scientific content provide the dominant share of high-formality prose. The Open Web partition is intentionally small because most of the open web is not openly licensed, so Wikipedia, YouTube Commons audio transcripts, and Stack Exchange are the principal contributors. Open Semantic is unique among large pretraining corpora in including structured semantic triplets at scale, which the authors argue improves knowledge grounding for very small models. ^[1]
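The partition shares can be recomputed directly from the table. The following is a minimal sketch that uses the rounded token counts from the table (in billions); the variable and function names are illustrative and are not part of any Pleias tooling.

```python
# Token counts in billions, copied from the partition table above (rounded values).
PARTITION_TOKENS_B = {
    "Open Culture": 886,
    "Open Government": 407,
    "Open Code": 283,
    "Open Science": 281,
    "Open Web": 73,
    "Open Semantic": 68,
}


def partition_shares(tokens_by_partition):
    """Return each partition's share of the total token count, as a percentage."""
    total = sum(tokens_by_partition.values())
    return {name: 100.0 * count / total for name, count in tokens_by_partition.items()}


total_b = sum(PARTITION_TOKENS_B.values())
shares = partition_shares(PARTITION_TOKENS_B)

if __name__ == "__main__":
    print(f"total: {total_b} billion tokens")
    # Largest partition first.
    for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:16s} {pct:5.1f}%")
```

Because the inputs are rounded to the nearest billion, the resulting percentages are approximate.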

How multilingual is Common Corpus?

A central design goal of Common Corpus was multilingual breadth. Roughly 41 percent of the tokens are in languages other than English. The paper reports nine languages with more than ten billion tokens, and counts more than thirty languages with more than one billion tokens. ^[1] The high-resource European languages dominate because of the large public domain holdings digitised by European national libraries, but the corpus also includes substantial low-resource and historical language material, including Latin, Ancient Greek, Old French, and several non-Indo-European languages from cultural heritage collections.

Language	Tokens (approx.)
English	867 billion
French	266 billion
German	112 billion
Spanish	46 billion
Latin	34 billion
Dutch	29 billion
Italian	24 billion
Polish	11 billion
Greek	11 billion
Portuguese	9 billion
30+ additional languages	1 to 8 billion each

The paper highlights that the French and German shares are particularly high relative to other open corpora, a direct consequence of the long-running digitisation programmes run by the Bibliothèque nationale de France, the Deutsche Nationalbibliothek, and partner institutions. Chinese, Japanese, Korean, Arabic, and Hindi are also present in smaller but meaningful quantities, drawn primarily from open access scientific content, Wikidata, and the Open Government partition.
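The threshold counts quoted in this section can be checked against the language table. The sketch below is illustrative only: it uses the approximate per-language figures listed in the table (in billions) and a hypothetical helper name, and it covers only the ten languages the table itemises.

```python
# Approximate token counts in billions, copied from the language table above.
LANGUAGE_TOKENS_B = {
    "English": 867,
    "French": 266,
    "German": 112,
    "Spanish": 46,
    "Latin": 34,
    "Dutch": 29,
    "Italian": 24,
    "Polish": 11,
    "Greek": 11,
    "Portuguese": 9,
}


def languages_above(tokens_by_language, threshold_b):
    """List languages with more than `threshold_b` billion tokens, largest first."""
    ranked = sorted(tokens_by_language.items(), key=lambda kv: -kv[1])
    return [language for language, count in ranked if count > threshold_b]


above_ten = languages_above(LANGUAGE_TOKENS_B, 10)

if __name__ == "__main__":
    print(len(above_ten), "languages above ten billion tokens:", ", ".join(above_ten))
```

With the table's figures, the languages above the ten-billion threshold are English through Greek, which matches the count of nine reported in the paper.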

Is Common Corpus open source and properly licensed?

Every document in Common Corpus carries a recorded license string. The paper reports that the majority of the corpus, more than 1.1 trillion tokens, is in the public domain. The second largest license category is Creative Commons Attribution at approximately 288 billion tokens. ^[1] Substantial volumes of CC0, CC-BY-SA, MIT, Apache 2.0, BSD, and Open Data Commons material make up the remainder. Users can filter the dataset by license type, for example to retain only public domain and attribution-only content for the strictest downstream applications. ^[3]
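License filtering of this kind amounts to a predicate over each record's license field. The following is a minimal sketch on in-memory records; the field name `license`, the label strings, and the choice to treat CC0 as part of the strict set are assumptions for illustration, and the strings stored in the published dataset may differ.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical license labels (assumption): the real dataset may spell these differently.
STRICT_LICENSES = frozenset({"public domain", "cc0", "cc-by"})


def normalise_license(value) -> str:
    """Lower-case and trim a license string so that comparisons are case-insensitive."""
    return (value or "").strip().lower()


def filter_by_license(records: Iterable[Dict], allowed=STRICT_LICENSES) -> Iterator[Dict]:
    """Yield only the records whose license label is in the allowed set."""
    for record in records:
        if normalise_license(record.get("license")) in allowed:
            yield record


# Made-up sample records, not taken from the corpus.
SAMPLE = [
    {"identifier": "a", "license": "Public Domain"},
    {"identifier": "b", "license": "CC-BY-SA"},
    {"identifier": "c", "license": "CC-BY"},
    {"identifier": "d", "license": None},
]

if __name__ == "__main__":
    print([r["identifier"] for r in filter_by_license(SAMPLE)])
```

The same predicate could be applied lazily to streamed records, so that the full corpus never has to be held in memory.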

The authors argue that Common Corpus satisfies the Open Source Initiative's Open Source AI Definition with respect to training data and that it exceeds the data transparency thresholds set by the EU AI Act's Code of Practice for General-Purpose AI. Provenance information at the document level allows downstream model builders to publish detailed training data summaries, attribute sources, and respect any opt-out signals. As the project summarises it, Common Corpus "contains only data that is permissively licensed and provenance is documented." ^[2]

How was Common Corpus cleaned and curated?

Much of the technical work behind Common Corpus involved producing usable text from low quality digitisations and heterogeneous source formats. The project produced several specialised tools and intermediate models, some of which were released independently on Hugging Face. ^[1]

OCRonos. Cultural heritage holdings are typically distributed as image scans with optical character recognition applied at the time of digitisation, often decades ago. Pleias trained a 124-million-parameter model called OCRonos-Vintage and a larger Llama 3 8B-based model called OCRonos to correct OCR errors, repair broken word boundaries, and reconstruct text structure. On heavily degraded inputs the model functions more like synthetic rewriting than strict correction, while remaining faithful to the underlying material. Without OCRonos, the authors estimate that hundreds of billions of tokens from newspaper and book scans would have been unusable.

Vision-language layout extraction. Scientific PDFs, government reports, and legal documents carry information in tables, figures, equations, and footnotes that simple text extraction destroys. The pipeline uses vision-language models to preserve document structure before tokenisation, particularly for the Open Science partition.

Code quality filtering. The Open Code partition was filtered using the ArmoRM quality classifier to retain only repositories scoring in the top 80 percent of the quality distribution. Permissive license verification was performed at the repository level using SPDX identifiers.
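Retaining the top 80 percent of a score distribution can be sketched as a rank-and-cut operation. The example below does not reproduce the ArmoRM scoring itself: the scores are made up, and the function name and the rounding rule (rounding the kept count up) are assumptions for illustration.

```python
import math


def keep_top_fraction(scored_items, fraction=0.8):
    """Keep the highest-scoring `fraction` of (item, score) pairs, best first."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    # Round up so that small inputs never lose more than the stated share.
    keep = math.ceil(fraction * len(ranked))
    return ranked[:keep]


# Hypothetical repositories with made-up quality scores between 0.0 and 0.9.
SCORED_REPOS = [(f"repo{i}", i / 10) for i in range(10)]

if __name__ == "__main__":
    for name, score in keep_top_fraction(SCORED_REPOS):
        print(name, score)
```

On a corpus-scale run the cut-off would more likely be expressed as a score threshold computed once from the distribution, so that repositories can be filtered in a single streaming pass.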

Toxicity and PII filtering. Toxic content was identified and removed using the Celadon multilingual toxicity classifier developed for the project. Personally identifiable information was removed using Microsoft Presidio extended with language-specific and country-specific patterns to handle European document conventions.
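To illustrate what a country-specific PII pattern looks like, the sketch below redacts e-mail addresses and French-style ten-digit phone numbers with plain regular expressions. It is a simplified stand-in and does not use Presidio or reproduce the project's actual recognisers; both patterns and the placeholder tokens are assumptions chosen for the example.

```python
import re

# Generic e-mail pattern (deliberately simple; it will miss unusual addresses).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical country-specific pattern: French numbers such as "01 23 45 67 89".
FR_PHONE_RE = re.compile(r"\b0[1-9](?:[ .]?\d{2}){4}\b")


def redact_pii(text: str) -> str:
    """Replace matched e-mail addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = FR_PHONE_RE.sub("<PHONE>", text)
    return text


if __name__ == "__main__":
    print(redact_pii("Contact: jean.dupont@example.org, tel. 01 23 45 67 89."))
```

Production systems typically combine such patterns with named-entity recognition and checksum validation, since regular expressions alone produce both false positives and misses.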

Deduplication. Near-duplicate documents were identified across partitions using MinHash and locality-sensitive hashing, with thresholds calibrated to preserve legitimate near duplicates such as multiple translations of the same legal text.
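The general MinHash and locality-sensitive hashing technique can be sketched with the standard library alone. This is not the project's pipeline: the shingle size, the number of hash functions, the band count, and all function names are assumptions chosen to keep the example small.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of k-word shingles of a text (the whole text if it is shorter)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle under a given seed."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=64, k=5):
    """One minimum per seeded hash function; similar texts share many minima."""
    sh = shingles(text, k)
    return [min(_hash(s, seed) for s in sh) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_band_keys(signature, bands=16):
    """Split a signature into bands; each band becomes one bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


def candidate_pairs(signatures, bands=16):
    """Return pairs of document ids that share at least one LSH bucket."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for key in lsh_band_keys(sig, bands):
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

In use, only the candidate pairs returned by the bucketing step are compared with `estimated_jaccard`, and a pair is treated as a duplicate when the estimate exceeds a chosen threshold. Raising that threshold is one way to keep legitimately similar documents, such as parallel translations, from being merged.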

What models were trained on Common Corpus?

Common Corpus was used to train the Pleias 1.0 family of small language models, announced in December 2024. ^[6] The family includes three base models at 350 million, 1.2 billion, and 3 billion parameters, branded Pleias-Pico, Pleias-Nano, and Pleias-Mono respectively. The 350M and 3B models were trained on the Jean Zay supercomputer in France, while the 1.2B model was trained in collaboration with Tracto AI. Two retrieval-augmented generation variants, Pleias-RAG-350M and Pleias-RAG-1B, were also released and have been reported to lead public RAG benchmarks in their parameter range. ^[7]

The arXiv paper reports two reference training runs to validate the corpus. A 350-million-parameter model was trained on approximately one trillion tokens for 2,944 H100 GPU hours, and a 1.2-billion-parameter model was trained on the full corpus plus three additional epochs of a filtered subset for 23,040 H100 GPU hours. On multilingual evaluation benchmarks, the 350M model scored 0.774 on MultiBLiMP, 0.509 on XStoryCloze, and 0.533 on XCOPA, while the 1.2B model scored 0.797, 0.526, and 0.541 respectively. ^[1] The authors report that the 350M model outperforms most other models in the 1B range on multilingual benchmarks, with Gemma 3 1B being the only exception, and conclude that the two models "perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pretraining." ^[1]

Beyond the Pleias models, Common Corpus has been adopted in part by several other organisations. The arXiv paper notes that components have been used in pretraining and continued training experiments by industry labs and academic groups, and frames the dataset as "an open science infrastructure dedicated to the entire lifecycle of language models." ^[1]

How does Common Corpus compare to other pretraining datasets?

Common Corpus occupies a distinctive position in the landscape of open pretraining corpora. The paper presents a comparison table summarising how Common Corpus differs from contemporaneous datasets along the axes of scale, licensing, multilinguality, and domain diversity. ^[1]

Dataset	Released	Approx. tokens	Multilingual	Fully open license	Multidomain
The Pile	2020	340 billion	No	Partial	Yes
C4	2020	175 billion	No	ODC-BY	Web only
RedPajama v2	2023	30 trillion (raw)	Yes	Mixed	Yes
Dolma	2024	~3 trillion	Limited	ODC-BY	Yes
FineWeb	2024	15 trillion	No	ODC-BY	Web only
FineWeb 2	2024	~3 trillion	Yes	ODC-BY	Web only
DCLM baseline	2024	4 trillion	No	Mixed	Web only
Nemotron-CC	2024	6.3 trillion	No	Mixed	Web only
Common Pile v0.1	2025	~1 trillion	No	Yes	Yes
Common Corpus	2024 to 2025	~2 trillion	Yes	Yes	Yes

The paper observes that less than two percent of pages and one percent of domain names overlap between FineWeb and Common Corpus, indicating that the two datasets capture substantially different slices of the textual world. ^[1] FineWeb is built from web crawls and is therefore dominated by contemporary online prose, while Common Corpus draws heavily on archival books, newspapers, parliamentary debates, court filings, and scientific repositories. The two corpora are arguably complementary rather than competing.
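An overlap figure of this kind can be computed as a set intersection over page URLs and over the hostnames extracted from them. The sketch below is a hedged illustration on made-up URLs; the paper's exact methodology and choice of denominator are not described here, so measuring overlap relative to the first collection is an assumption.

```python
from urllib.parse import urlsplit


def domains(urls):
    """Return the set of lower-cased hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return hosts


def overlap_fraction(reference, other):
    """Fraction of distinct items in `reference` that also appear in `other`."""
    reference, other = set(reference), set(other)
    if not reference:
        return 0.0
    return len(reference & other) / len(reference)


# Made-up URLs on reserved example domains, for illustration only.
CORPUS_A = [
    "https://library.example/ark/1",
    "https://library.example/ark/2",
    "https://archive.example/b",
]
CORPUS_B = [
    "https://news.example/x",
    "https://archive.example/b",
]

if __name__ == "__main__":
    print("page overlap:", overlap_fraction(CORPUS_A, CORPUS_B))
    print("domain overlap:", overlap_fraction(domains(CORPUS_A), domains(CORPUS_B)))
```

At corpus scale the same comparison would normally be run on hashed URLs or with approximate set structures rather than in-memory Python sets.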

Relative to Common Pile, which is roughly half the size and English-only, Common Corpus is larger, multilingual, and includes a much greater volume of cultural heritage content. ^[8] Relative to Dolma, which is approximately 50 percent larger but mostly English and partly assembled from sources that are not openly licensed, released under ODC-BY, Common Corpus is smaller but stricter on licensing and provenance. ^[10] Relative to RedPajama and Nemotron-CC, which derive much of their content from Common Crawl, Common Corpus is smaller and not web-focused but offers cleaner licensing guarantees.

The authors note that the multilingual and multidomain combination remains unique. No other dataset in the open pretraining ecosystem at the time of writing simultaneously offered trillion-token scale, full provenance, broad domain coverage, and significant non-English content. ^[1]

What has been the reception and impact of Common Corpus?

Common Corpus received broad coverage in the AI press at its initial release. Simon Willison covered the dataset and the Pleias 1.0 models on his blog, MarkTechPost and VentureBeat published detailed write-ups, and the project was featured in Mozilla Builders' showcase and in posts by Hugging Face and the AI Alliance. ^[5]^[7]^[12] The Walled Culture blog and Techdirt highlighted the legal and policy significance, framing Common Corpus as a counterexample to the argument that competitive LLM training necessarily requires unlicensed copyrighted material.

The paper was accepted as an oral presentation at the International Conference on Learning Representations (ICLR) 2026, a leading venue for machine learning research. ^[11] As of mid-2026, the dataset has been mirrored or partially incorporated by several downstream projects, including continued pretraining of open models for European languages and use as a licensed seed corpus for retrieval-augmented generation systems in regulated industries such as finance and law.

Critics have noted that, even at two trillion tokens, Common Corpus is substantially smaller than the multitrillion-token corpora used to train frontier closed models, and that some of its strongest content categories, particularly older cultural heritage material, may not align well with the prose styles encountered in modern downstream use. Proponents reply that strict licensing was the project's defining constraint and that the demonstration of competitive multilingual small models trained exclusively on the corpus is itself a proof of concept rather than a final word on scale.

How is Common Corpus accessed and governed?

Common Corpus is hosted on the Hugging Face Hub at huggingface.co/datasets/PleIAs/common_corpus and is freely downloadable in Parquet format using the standard datasets library. ^[3] Each record carries metadata fields including a unique identifier, the source collection, the partition (open_type), the license, a creation or publication date, the title, the originating creator institution, the detected language, a word count, and a token count, in addition to the cleaned full text. This metadata enables downstream filtering by license, date range, language, or domain.
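The per-record metadata lends itself to simple selection and aggregation. The sketch below works on in-memory dictionaries shaped like the records described above; apart from `open_type`, which is named in the text, the field names (`identifier`, `license`, `language`, `token_count`, `text`) and all sample values are assumptions for illustration.

```python
from collections import Counter

# Made-up records; only the "open_type" field name is confirmed by the description above.
SAMPLE_RECORDS = [
    {"identifier": "doc-1", "open_type": "Open Culture", "license": "Public Domain",
     "language": "French", "token_count": 1200, "text": "sample text one"},
    {"identifier": "doc-2", "open_type": "Open Culture", "license": "Public Domain",
     "language": "German", "token_count": 800, "text": "sample text two"},
    {"identifier": "doc-3", "open_type": "Open Government", "license": "CC-BY",
     "language": "French", "token_count": 500, "text": "sample text three"},
]


def select(records, language=None, open_type=None):
    """Yield records matching every criterion that is not None."""
    for record in records:
        if language is not None and record["language"] != language:
            continue
        if open_type is not None and record["open_type"] != open_type:
            continue
        yield record


def tokens_by(records, field):
    """Sum token counts grouped by the value of a metadata field."""
    totals = Counter()
    for record in records:
        totals[record[field]] += record["token_count"]
    return totals


if __name__ == "__main__":
    print([r["identifier"] for r in select(SAMPLE_RECORDS, language="French")])
    print(dict(tokens_by(SAMPLE_RECORDS, "open_type")))
```

With the Hugging Face `datasets` library, `load_dataset("PleIAs/common_corpus", split="train", streaming=True)` should yield dictionary rows that can be passed to the same functions without downloading the full dataset first; the split name `train` is an assumption here and should be checked against the dataset card.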

The project is governed informally by Pleias with input from partner organisations through the AI Alliance Open Trusted Data Initiative. ^[4] Updates and new partitions are coordinated through public discussions on the Hugging Face dataset repository and the Pleias GitHub organisation. The authors invite contributions and corrections from the community, including the addition of new openly licensed sources for future versions.

References

1. Langlais, P.-C., Chizhov, P., Arnett, C., Hinostroza, C. R., Nee, M., Jones, E. K., Girard, I., Mach, D., Stasenko, A., and Yamshchikov, I. P. *Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training*. arXiv:2506.01732, June 2025. https://arxiv.org/abs/2506.01732
2. Pleias and Hugging Face. "Releasing the largest multilingual open pretraining dataset." Hugging Face blog, 13 November 2024. https://huggingface.co/blog/Pclanglais/two-trillion-tokens-open
3. PleIAs. *common_corpus* dataset card, Hugging Face Hub. https://huggingface.co/datasets/PleIAs/common_corpus
4. AI Alliance. "Pleias Releases Common Corpus, The Largest Open Multilingual Dataset for LLM training." thealliance.ai blog, November 2024. https://thealliance.ai/blog/pleias-releases-common-corpus-open-multilingual-dataset-for-llm-training
5. Mozilla Builders. "Announcing Common Corpus." builders.mozilla.org, November 2024. https://builders.mozilla.org/announcing-common-corpus/
6. Langlais, P.-C. "They Said It Couldn't Be Done." Hugging Face blog, December 2024, on the Pleias 1.0 model family. https://huggingface.co/blog/Pclanglais/common-models
7. Willison, S. "New Pleias 1.0 LLMs trained exclusively on openly licensed data." simonwillison.net, 5 December 2024. https://simonwillison.net/2024/Dec/5/pleias-llms/
8. Kandpal, N. et al. *The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text*. arXiv:2506.05209, June 2025. https://arxiv.org/abs/2506.05209
9. Penedo, G. et al. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." NeurIPS 2024 Datasets and Benchmarks Track.
10. Soldaini, L. et al. "Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." ACL 2024.
11. ICLR 2026 program listing for *Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training* (oral). https://iclr.cc/virtual/2026/poster/10011885
12. MarkTechPost. "Pleias Introduces Common Corpus: The Largest Multilingual Dataset for Pretraining Language Models." 18 November 2024. https://www.marktechpost.com/2024/11/18/pleias-introduces-common-corpus-the-largest-multilingual-dataset-for-pretraining-language-models/
