Common Corpus
Last reviewed
May 16, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,270 words
Common Corpus is a large-scale, multilingual pretraining dataset for large language models assembled and released by the French AI research lab Pleias. At more than two trillion tokens spread across roughly 517 million documents, it is the largest collection of fully openly licensed text ever published for language model training. Every document in the corpus is either in the public domain or distributed under an explicitly permissive license, including Creative Commons CC-BY and CC0, MIT, Apache 2.0, BSD, and the Open Data Commons family of licenses. An initial version was made public on the Hugging Face Hub on 13 November 2024 in partnership with Hugging Face, EleutherAI, Nomic AI, Occiglot, Mozilla Builders, the AI Alliance, and other collaborators. A substantially expanded version was released in February 2025 for the Paris AI Action Summit, and the accompanying research paper, titled Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training, was posted to arXiv in June 2025 and accepted as an oral presentation at the ICLR 2026 conference.
Common Corpus is organised into six thematic partitions covering government and legal documents, cultural heritage texts, scientific publications, source code, web encyclopedic content, and semantic knowledge graphs. It is the first openly licensed pretraining corpus to simultaneously satisfy four properties that competing datasets had typically traded off against one another: scale at the trillion-token level, fully permissive licensing with documented provenance, coverage across many domains beyond web crawl, and substantial multilingual representation. More than 40 percent of the data is non-English, with eight languages exceeding ten billion tokens each and more than thirty languages exceeding one billion tokens. The corpus was used to train the Pleias 1.0 family of small language models in December 2024, the first family of competitive language models pretrained exclusively on openly licensed data.
The choice of pretraining data is one of the most consequential decisions in modern language model development. Web-scraped corpora such as Common Crawl, C4, The Pile, RefinedWeb, RedPajama, Dolma, FineWeb, and Nemotron-CC have powered nearly every state-of-the-art LLM released since the early 2020s. These datasets, however, are assembled by scraping the open web and filtering out only the most obvious problems. They include very large quantities of copyrighted material that has not been explicitly licensed for reuse, ranging from news articles and book passages to web fiction and software documentation. Several high-profile lawsuits, most notably The New York Times v. OpenAI and parallel author and music publisher actions against Anthropic, Meta, and others, have placed the legality of training on such mixtures in question. Regulators have responded as well, with the European Union's AI Act requiring that providers of general-purpose AI models publish detailed summaries of their training data and respect machine-readable opt-outs.
A small but growing body of work has attempted to address this provenance problem by curating openly licensed alternatives. The Pile included a number of explicitly licensed components but mixed them with web content and copyrighted books. Books3, once a popular component of open corpora, was withdrawn after it became clear that most of its contents were pirated. Stack Exchange, Wikipedia, government publications, and certain scientific repositories have long been used as licensed seed sources, but they were generally too small to support pretraining at competitive scale. The KL3M dataset from 273 Ventures focused on administrative and legal English. The Common Pile, released by EleutherAI and collaborators in 2025, assembled eight terabytes of public domain and openly licensed text, but is English only.
Common Corpus was motivated by the observation that an open and ethically sourced corpus large enough to train competitive multilingual models was achievable if one drew systematically on government archives, digitised cultural heritage holdings, scientific repositories with open access policies, permissively licensed code, and structured knowledge bases. The project's stated goal was to demonstrate, with reproducible artefacts, that there is no fundamental tradeoff between legal openness and pretraining utility.
The project was led by Pleias, a startup co-founded by Pierre-Carl Langlais, Anastasia Stasenko, and Ivan Yamshchikov in Paris in 2023. Pleias positioned itself as a research lab and small model builder focused on data ethics, archival digitisation, and specialised LLMs for cultural heritage and information science applications. The Common Corpus initiative grew out of work on Open Culture and Open Government collections that Langlais and collaborators had been assembling for several years, and was formalised in early 2024 with funding and infrastructure support from the French Ministry of Culture's ALT-EDIC programme, the GENCI Jean Zay supercomputer, the Nvidia Inception Program, and a number of cloud partners.
The AI Alliance, a consortium founded by IBM and Meta in late 2023, adopted Common Corpus as a flagship project of its Open Trusted Data Initiative, lending coordination, governance review, and additional partners. Hugging Face hosted the dataset and contributed engineering support. EleutherAI, Nomic AI, and Occiglot collaborated on filtering, evaluation, and multilingual coverage. Mozilla Builders supported the release through its 2024 cohort. Wikimedia Enterprise contributed the Wikidata and YouTube Commons partitions, and Libraries Without Borders supported access to cultural heritage materials.
The project followed a phased release. A preview spanning roughly 500 billion tokens was made available in March 2024 alongside the Pleias OCRonos error correction model. The full first release on 13 November 2024 contained approximately two trillion tokens, accompanied by a detailed blog post by Langlais on Hugging Face. A substantially curated and rebalanced version was released in February 2025 at the Paris AI Action Summit, with cleaner formatting, additional code, and expanded language coverage. By the time of the arXiv paper in June 2025, the canonical Hugging Face dataset reported 2.27 trillion tokens across 517 million documents and 4.49 terabytes of Parquet files.
Common Corpus is divided into six top-level partitions, each corresponding to a sourcing strategy and a set of source repositories. The total token counts as reported in the June 2025 paper, computed with the Pleias tokenizer and the Gemma 3 tokenizer for non-Western languages, are summarised below.
| Partition | Tokens (approx.) | Documents (approx.) | Primary sources |
|---|---|---|---|
| Open Culture | 886 billion | 93.2 million | Public domain books and newspapers from cultural heritage institutions, Project Gutenberg, Wikisource, Internet Archive holdings |
| Open Government | 407 billion | 75.6 million | SEC EDGAR filings, WTO documents, Europarl proceedings, Chinese case law, Finance Commons, Legal Commons |
| Open Code | 283 billion | 202.8 million | Permissively licensed GitHub repositories filtered with the ArmoRM quality classifier, retaining the top 80 percent |
| Open Science | 281 billion | 19.2 million | OpenAlex, open access journal articles, preprints, dissertations |
| Open Web | 73 billion | 96.2 million | Wikipedia (CC-BY-SA), YouTube Commons transcripts, Stack Exchange (CC-BY-SA) |
| Open Semantic | 68 billion | 30.1 million | Wikidata statements and triplets across 300+ languages |
| Total | ~1.998 trillion | ~517 million | |
Going by the table above, code accounts for roughly 14 percent of the corpus and Open Culture for slightly more than 44 percent, making cultural heritage the single largest source. Government and legal text together with scientific content provide the dominant share of high-formality prose. The Open Web partition is intentionally small because most of the open web is not openly licensed, so Wikipedia, YouTube Commons audio transcripts, and Stack Exchange are the principal contributors. Open Semantic is unique among large pretraining corpora in including structured semantic triplets at scale, which the authors argue improves knowledge grounding for very small models.
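The partition shares can be recomputed directly from the table. The following is a minimal sketch that uses only the rounded token counts listed above; the function and variable names are illustrative, and because the inputs are rounded the results may differ slightly from percentages quoted for other releases of the corpus.

```python
# Token counts in billions, copied from the partition table above.
PARTITION_TOKENS_B = {
    "Open Culture": 886,
    "Open Government": 407,
    "Open Code": 283,
    "Open Science": 281,
    "Open Web": 73,
    "Open Semantic": 68,
}


def partition_shares(tokens_by_partition):
    """Return each partition's share of the total, as a percentage."""
    total = sum(tokens_by_partition.values())
    return {
        name: 100.0 * count / total
        for name, count in tokens_by_partition.items()
    }


if __name__ == "__main__":
    total = sum(PARTITION_TOKENS_B.values())
    print(f"total: {total} billion tokens")
    for name, share in partition_shares(PARTITION_TOKENS_B).items():
        print(f"{name:16s} {share:5.1f}%")
```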
A central design goal of Common Corpus was multilingual breadth. Roughly 41 percent of the tokens are in languages other than English. The paper reports eight languages with more than ten billion tokens each, and more than thirty languages with more than one billion tokens. The high-resource European languages dominate because of the large public domain holdings digitised by European national libraries, but the corpus also includes substantial low-resource and historical language material, including Latin, Ancient Greek, Old French, and several non-Indo-European languages from cultural heritage collections.
| Language | Tokens (approx.) |
|---|---|
| English | 867 billion |
| French | 266 billion |
| German | 112 billion |
| Spanish | 46 billion |
| Latin | 34 billion |
| Dutch | 29 billion |
| Italian | 24 billion |
| Polish | 11 billion |
| Greek | 11 billion |
| Portuguese | 9 billion |
| 30+ additional languages | 1 to 8 billion each |
The paper highlights that the French and German shares are particularly high relative to other open corpora, a direct consequence of the long-running digitisation programmes run by the Bibliothèque nationale de France, the Deutsche Nationalbibliothek, and partner institutions. Coverage of Chinese, Japanese, Korean, Arabic, and Hindi is also present in smaller but meaningful quantities, drawn primarily from open access scientific content, Wikidata, and the Open Government partition.
Every document in Common Corpus carries a recorded license string. The paper reports that the majority of the corpus, more than 1.1 trillion tokens, is in the public domain. The second largest license category is Creative Commons Attribution at approximately 288 billion tokens. Substantial volumes of CC0, CC-BY-SA, MIT, Apache 2.0, BSD, and Open Data Commons material make up the remainder. Users can filter the dataset by license type, for example to retain only public domain and attribution-only content for the strictest downstream applications.
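License filtering of this kind can be sketched as a simple predicate over records. The example below is illustrative only: it assumes each record is a dictionary with a `license` key, and the label strings in `STRICT_LICENSES` are hypothetical placeholders that would need to be checked against the values actually stored in the dataset.

```python
# Hypothetical license labels: the exact strings stored in the dataset's
# `license` field are an assumption and should be verified on real records.
STRICT_LICENSES = {"public domain", "cc0", "cc-by"}


def normalise_license(value):
    """Lower-case and collapse whitespace so label variants compare equal."""
    return " ".join(str(value).lower().split())


def is_strictly_licensed(record, allowed=STRICT_LICENSES):
    """True when the record's license label is in the allowed set."""
    return normalise_license(record.get("license", "")) in allowed


def filter_by_license(records, allowed=STRICT_LICENSES):
    """Lazily yield only the records whose license is allowed."""
    return (r for r in records if is_strictly_licensed(r, allowed))
```

Because `filter_by_license` returns a generator, it can be applied to a streamed dataset without holding the corpus in memory. Records with a missing license are dropped rather than kept, which is the conservative choice for strict downstream use.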
The authors argue that Common Corpus satisfies the Open Source Initiative's Open Source AI Definition with respect to training data and that it exceeds the data transparency thresholds set by the EU AI Act's Code of Practice for General-Purpose AI. Provenance information at the document level allows downstream model builders to publish detailed training data summaries, attribute sources, and respect any opt-out signals.
Much of the technical work behind Common Corpus involved producing usable text from low quality digitisations and heterogeneous source formats. The project produced several specialised tools and intermediate models, some of which were released independently on Hugging Face.
OCRonos. Cultural heritage holdings are typically distributed as image scans with optical character recognition applied at the time of digitisation, often decades ago. Pleias trained a 124-million-parameter model called OCRonos-Vintage and a larger Llama 3 8B-based model called OCRonos to correct OCR errors, repair broken word boundaries, and reconstruct text structure. On heavily degraded inputs the model functions more like synthetic rewriting than strict correction, while remaining faithful to the underlying material. Without OCRonos, the authors estimate that hundreds of billions of tokens from newspaper and book scans would have been unusable.
Vision-language layout extraction. Scientific PDFs, government reports, and legal documents carry information in tables, figures, equations, and footnotes that simple text extraction destroys. The pipeline uses vision-language models to preserve document structure before tokenisation, particularly for the Open Science partition.
Code quality filtering. The Open Code partition was filtered using the ArmoRM quality classifier to retain only repositories scoring in the top 80 percent of the quality distribution. Permissive license verification was performed at the repository level using SPDX identifiers.
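The two steps described for Open Code, a license check on SPDX identifiers followed by a percentile cut on quality scores, can be sketched as follows. This is a minimal illustration and not the project's pipeline: the allowlist, the `spdx` and `quality` field names, and the decision to compute the cut-off over licensed repositories only are all assumptions, and the scores are taken as given rather than produced by any particular classifier.

```python
import math

# SPDX identifiers for permissive licenses named in this article. The
# project's actual allowlist is not given here, so this set is an assumption.
PERMISSIVE_SPDX = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "CC0-1.0"}


def top_fraction_threshold(scores, keep_fraction=0.8):
    """Return the score cut-off that keeps roughly the top `keep_fraction`."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    n_keep = max(1, math.floor(len(ranked) * keep_fraction))
    return ranked[n_keep - 1]


def filter_repositories(repos, keep_fraction=0.8):
    """Keep permissively licensed repos whose score reaches the cut-off.

    `repos` is a list of dicts with hypothetical keys `spdx` and `quality`.
    Ties at the cut-off are all kept, so slightly more than `keep_fraction`
    of the licensed repositories may survive.
    """
    licensed = [r for r in repos if r["spdx"] in PERMISSIVE_SPDX]
    if not licensed:
        return []
    cutoff = top_fraction_threshold([r["quality"] for r in licensed], keep_fraction)
    return [r for r in licensed if r["quality"] >= cutoff]
```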
Toxicity and PII filtering. Toxic content was identified and removed using the Celadon multilingual toxicity classifier developed for the project. Personally identifiable information was removed using Microsoft Presidio extended with language-specific and country-specific patterns to handle European document conventions.
Deduplication. Near-duplicate documents were identified across partitions using MinHash and locality-sensitive hashing, with thresholds calibrated to preserve legitimate near duplicates such as multiple translations of the same legal text.
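The general MinHash and locality-sensitive hashing technique named above can be illustrated with a small standard-library sketch. It is not the project's implementation: the word-level shingling (which does not suit languages written without spaces), the 64 hash functions, the 16 bands, and the 0.8 similarity threshold are all illustrative assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Set of overlapping word k-grams; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text, num_perm=64, k=5):
    """One slot per seeded hash: the minimum hash over the shingle set."""
    grams = shingles(text, k)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_perm))


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures, bands=16):
    """LSH banding: documents sharing any identical band become candidates."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def near_duplicates(docs, threshold=0.8, num_perm=64, bands=16):
    """Return pairs of document ids whose estimated similarity >= threshold."""
    sigs = {doc_id: minhash_signature(text, num_perm) for doc_id, text in docs.items()}
    return {
        pair
        for pair in candidate_pairs(sigs, bands)
        if estimated_jaccard(sigs[pair[0]], sigs[pair[1]]) >= threshold
    }
```

Raising the threshold makes the filter more conservative, which is one way to keep legitimate near duplicates such as parallel translations; the banding step only limits how many pairs need a full signature comparison.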
Common Corpus was used to train the Pleias 1.0 family of small language models, announced in December 2024. The family includes three base models at 350 million, 1.2 billion, and 3 billion parameters, branded Pleias-Pico, Pleias-Nano, and Pleias-Mono respectively. The 350M and 3B models were trained on the Jean Zay supercomputer in France, while the 1.2B model was trained in collaboration with Tracto AI. Two retrieval-augmented generation variants, Pleias-RAG-350M and Pleias-RAG-1B, were also released and have been reported to lead public RAG benchmarks in their parameter range.
The arXiv paper reports two reference training runs to validate the corpus. A 350 million parameter model was trained on approximately one trillion tokens for 2,944 H100 GPU hours, and a 1.2 billion parameter model was trained on the full corpus plus three additional epochs of a filtered subset for 23,040 H100 GPU hours. On multilingual evaluation benchmarks, the 350M model scored 0.774 on MultiBLiMP, 0.509 on XStoryCloze, and 0.533 on XCOPA, while the 1.2B model scored 0.797, 0.526, and 0.541 respectively. The authors report that the 350M model outperforms most other models in the 1B range on multilingual benchmarks, with Gemma 3 1B being the only exception.
Beyond the Pleias models, Common Corpus has been adopted in part by several other organisations. The arXiv paper notes that components have been used in pretraining and continued training experiments by industry labs and academic groups. The dataset is downloaded over one hundred thousand times per month from the Hugging Face Hub.
Common Corpus occupies a distinctive position in the landscape of open pretraining corpora. The paper presents a comparison table summarising how Common Corpus differs from contemporaneous datasets along the axes of scale, licensing, multilinguality, and domain diversity.
| Dataset | Released | Approx. tokens | Multilingual | Fully open license | Multidomain |
|---|---|---|---|---|---|
| The Pile | 2020 | 340 billion | No | Partial | Yes |
| C4 | 2020 | 175 billion | No | ODC-BY | Web only |
| RedPajama v2 | 2023 | 30 trillion (raw) | Yes | Mixed | Yes |
| Dolma | 2024 | ~3 trillion | Limited | ODC-BY | Yes |
| FineWeb | 2024 | 15 trillion | No | ODC-BY | Web only |
| FineWeb 2 | 2024 | ~3 trillion | Yes | ODC-BY | Web only |
| DCLM baseline | 2024 | 4 trillion | No | Mixed | Web only |
| Nemotron-CC | 2024 | 6.3 trillion | No | Mixed | Web only |
| Common Pile v0.1 | 2025 | ~1 trillion | No | Yes | Yes |
| Common Corpus | 2024 to 2025 | ~2 trillion | Yes | Yes | Yes |
The paper observes that less than two percent of pages and one percent of domain names overlap between FineWeb and Common Corpus, indicating that the two datasets capture substantially different slices of the textual world. FineWeb is built from web crawls and is therefore dominated by contemporary online prose, while Common Corpus draws heavily on archival books, newspapers, parliamentary debates, court filings, and scientific repositories. The two corpora are arguably complementary rather than competing.
Relative to Common Pile, which is roughly half the size and English-only, Common Corpus is larger, multilingual, and includes a much greater volume of cultural heritage content. Relative to Dolma, which is approximately 50 percent larger but mostly English and partly assembled, under ODC-BY, from sources that are not openly licensed, Common Corpus is smaller but stricter on licensing and provenance. Relative to RedPajama and Nemotron-CC, which derive much of their content from Common Crawl, Common Corpus is smaller and not web-focused but offers cleaner licensing guarantees.
The authors note that the multilingual and multidomain combination remains unique. No other dataset in the open pretraining ecosystem at the time of writing simultaneously offered trillion-token scale, full provenance, broad domain coverage, and significant non-English content.
Common Corpus received broad coverage in the AI press at its initial release. Simon Willison covered the dataset and the Pleias 1.0 models on his blog, MarkTechPost and VentureBeat published detailed write-ups, and the project was featured in Mozilla Builders' showcase and in posts by Hugging Face and the AI Alliance. The Walled Culture blog and Techdirt highlighted the legal and policy significance, framing Common Corpus as a counterexample to the argument that competitive LLM training necessarily requires unlicensed copyrighted material.
The paper was accepted as an oral presentation at the International Conference on Learning Representations (ICLR) 2026, a leading venue for machine learning research. As of mid-2026, the dataset has been mirrored or partially incorporated by several downstream projects, which use it for continued pretraining of open models for European languages and as a licensed seed corpus for retrieval-augmented generation systems in regulated industries such as finance and law.
Critics have noted that, even at two trillion tokens, Common Corpus is substantially smaller than the multitrillion-token corpora used to train frontier closed models, and that some of its strongest content categories, particularly older cultural heritage material, may not align well with the prose styles encountered in modern downstream use. Proponents reply that strict licensing was the project's defining constraint and that the demonstration of competitive multilingual small models trained exclusively on the corpus is itself a proof of concept rather than a final word on scale.
Common Corpus is hosted on the Hugging Face Hub at huggingface.co/datasets/PleIAs/common_corpus and is freely downloadable in Parquet format using the standard datasets library. Each record carries metadata fields including a unique identifier, the source collection, the partition (open_type), the license, a creation or publication date, the title, the originating creator institution, the detected language, a word count, and a token count, in addition to the cleaned full text. This metadata enables downstream filtering by license, date range, language, or domain.
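Access through the datasets library might look like the following sketch. The repository identifier comes from the URL above and `open_type` is the partition field named in the text; the `"train"` split name and the `token_count` key are assumptions about the published schema, and the helper names are illustrative. Streaming mode is used so that the multi-terabyte corpus is not downloaded in full.

```python
from collections import Counter


def stream_common_corpus():
    """Stream records from the Hub without downloading the full dataset.

    Requires the `datasets` package and network access. The split name
    "train" is an assumption about how the repository is laid out.
    """
    # Imported lazily so the offline helper below works without the package.
    from datasets import load_dataset

    return load_dataset("PleIAs/common_corpus", split="train", streaming=True)


def tokens_by_partition(records, limit=10_000):
    """Sum `token_count` per `open_type` over the first `limit` records."""
    totals = Counter()
    for i, record in enumerate(records):
        if i >= limit:
            break
        totals[record["open_type"]] += int(record["token_count"])
    return totals


if __name__ == "__main__":
    # Inspect a small sample only; a full pass would read terabytes of data.
    for partition, tokens in tokens_by_partition(stream_common_corpus(), limit=1_000).items():
        print(partition, tokens)
```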
The project is governed informally by Pleias with input from partner organisations through the AI Alliance Open Trusted Data Initiative. Updates and new partitions are coordinated through public discussions on the Hugging Face dataset repository and the Pleias GitHub organisation. The authors invite contributions and corrections from the community, including the addition of new openly licensed sources for future versions.