Pleias
Last reviewed: May 16, 2026
Sources: 21 citations
Review status: Source-backed
Revision: v1 · 3,455 words
Pleias (stylized PleIAs) is a Paris based artificial intelligence laboratory and small company that designs, pretrains, and releases large language models trained exclusively on public domain and permissively licensed data. Founded in 2024 by Pierre-Carl Langlais, Anastasia Stasenko, and Ivan Yamshchikov, the firm has positioned itself as one of the most prominent advocates of what it calls "ethical pretraining," the view that competitive language models can be built without scraping copyrighted material. Pleias is the principal coordinator of Common Corpus, a multilingual open dataset that exceeded two trillion tokens by 2025, and the maker of the Pleias 1.0 family of small language models, the first such family pretrained from scratch on a fully open corpus. The lab's stance has drawn attention from the open source community, European policymakers, and observers of the broader debate over copyright, data provenance, and AI ethics.
The company is small by industry standards, with roughly fifteen people listed on its website as of 2025, and has not disclosed a venture funding round. Despite that modest profile, Pleias has built a reputation for producing technical artifacts that influence larger players, including curated public domain datasets, multilingual tokenizers, optical character recognition correction models, and small reasoning models with native citation support. Its work has been adopted or referenced by organizations including Hugging Face, EleutherAI, Mozilla, Wikimedia Enterprise, Nomic AI, and Occiglot, and it is one of the founding members of the Open Trusted Data Initiative.
Pleias was incorporated in 2024 in Paris, France, with a small founding team drawn from academia, digital humanities, and industry. The three co-founders bring complementary backgrounds: Pierre-Carl Langlais holds a PhD in information science and is an associate researcher at the Sorbonne Center for Artificial Intelligence, with a long publication record in computational humanities and digitized text studies. Anastasia Stasenko, who serves as chief executive, completed a PhD in philosophy at the École Normale Supérieure (ENS Ulm) and previously worked at Hachette Publishing before becoming an associate senior lecturer at Sorbonne-Nouvelle. Ivan Yamshchikov, the third co-founder, is a research professor at CAIRO and was previously an executive at Yandex and ABBYY, where he worked on natural language systems and document understanding.
The name evokes the Greek Πλειάς and the sixteenth century French poetic group the Pléiade, while the stylization PleIAs embeds the French abbreviation IA (intelligence artificielle). The founders framed Pleias from the outset not as a frontier laboratory chasing scale, but as a research and engineering team focused on energy efficient models, data quality, and verifiability.
| Name | Role | Background |
|---|---|---|
| Anastasia Stasenko | Co-founder and CEO | PhD philosophy, ENS Ulm; ex Hachette Publishing |
| Pierre-Carl Langlais | Co-founder and CTO | PhD information science; researcher at Sorbonne Center for AI |
| Ivan Yamshchikov | Co-founder | PhD financial mathematics; research professor at CAIRO; ex Yandex, ABBYY |
Beyond the founders, the extended team includes AI scientists, data engineers, and product specialists with prior experience at institutions such as KU Leuven, Apple, Aleph Alpha, and ENS Paris. The lab has remained small, and its founders have repeatedly emphasized that the team prioritizes data and methods research over headcount growth.
Pleias's defining commitment is that competitive language models can be trained without using copyrighted material that has been scraped without permission. The argument, often associated with Pierre-Carl Langlais in public talks and blog posts, is twofold. First, the legal and ethical position of training on web scrapes is unsettled, and the prevailing industry practice of using opt-out only data collection is incompatible with European norms and likely with parts of the EU AI Act. Second, as an engineering matter, the universe of openly licensed and public domain text is far larger than commonly assumed, especially once digitized cultural heritage, scientific literature, government documents, and code repositories are aggregated.
In April 2024 a profile by Euronews described Pleias as "a French start-up that just proved OpenAI wrong" after the company published evidence that training on non copyrighted data was tractable. The same year, in a column for the Orange research outlet Hello Future, Langlais argued that the field had drifted toward an "extractive" relationship with the open web and that public institutions, cultural heritage collections, and scientific archives could supply enough text for general purpose pretraining. The thesis has since been picked up by parallel projects at EleutherAI (Common Pile), the Allen Institute for Artificial Intelligence (OLMo 2 and the Dolma corpus), and Hugging Face (FineWeb, which is openly licensed but does not enforce the same provenance discipline).
Pleias also frames its position around energy efficiency and EU regulatory alignment. The lab reports CO2 emissions per training run and emphasizes that its models are designed for CPU inference, on device deployment, and high regulation industries such as legal, financial, and healthcare services, where data lineage matters.
Pleias has organized its work into three product lines: the Common Corpus open data resource, the Pleias 1.0 base models, and a suite of specialized models and tools that includes Pleias-RAG, OCRonos-Vintage, the Celadon toxicity classifier, and the SYNTH synthetic data pipeline. The company also operates an enterprise stack branded STRATUM for agentic applications.
Common Corpus is Pleias's flagship dataset and the project that gave the lab its public profile. The first version was released in March 2024, comprising around 500 billion words of public domain text and described at the time as the largest fully open multilingual dataset for language model pretraining. The release was coordinated with Hugging Face, EleutherAI, Occiglot, Nomic AI, and Lang:IA, a French government program supported by the Ministry of Culture.
A substantially expanded version was announced in November 2024 in partnership with Mozilla Builders, reaching more than two trillion tokens and reorganized into six thematic collections. A peer reviewed description of the corpus was accepted for oral presentation at the International Conference on Learning Representations (ICLR) 2026.
| Collection | Approximate tokens | Description |
|---|---|---|
| Open Government | 406.6 billion | Financial, legal, and administrative documents in the public domain |
| Open Culture | 886 billion | Cultural heritage monographs and periodicals in more than thirteen languages |
| Open Science | 281.2 billion | Scientific papers and research under permissive licenses |
| Open Code | 283.2 billion | Source code from The Stack v1 and v2 under free licenses |
| Open Web | 73.2 billion | Wikipedia, Wikisource, YouTube transcripts, and StackExchange |
| Open Semantic | 68 billion | Wikidata converted into natural language |
The total corpus sits at roughly two trillion tokens, with public domain content accounting for about 1.14 trillion tokens (around 57 percent), Creative Commons Attribution material adding 287.7 billion tokens, MIT licensed content contributing 142.7 billion tokens, and Creative Commons Attribution ShareAlike material providing another 74.8 billion tokens. English remains the largest single language at roughly 867 billion tokens, but more than 40 percent of the corpus is non English, with French at 266 billion tokens, German at 112 billion, and significant collections in Spanish, Italian, Dutch, Latin, and Portuguese, among others. More than thirty languages have at least one billion tokens of representation.
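The proportions quoted above can be rechecked from the collection table. The following is a minimal sketch that only re-derives shares from the figures already given in this article; the variable and function names are illustrative and not part of any Pleias tooling.

```python
# Token counts in billions, copied from the collection table above.
collections = {
    "Open Government": 406.6,
    "Open Culture": 886.0,
    "Open Science": 281.2,
    "Open Code": 283.2,
    "Open Web": 73.2,
    "Open Semantic": 68.0,
}


def shares(counts):
    """Return each entry's share of the total, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * value / total for name, value in counts.items()}


total_tokens = sum(collections.values())
collection_shares = shares(collections)

# Figures quoted in the text, in billions of tokens.
public_domain_share = 100.0 * 1140.0 / total_tokens
english_share = 100.0 * 867.0 / total_tokens

print(f"total: {total_tokens:.1f}B tokens")
for name, pct in collection_shares.items():
    print(f"{name}: {pct:.1f}%")
print(f"public domain: {public_domain_share:.1f}%")
print(f"English: {english_share:.1f}%")
```

The six collections sum to just under two trillion tokens, which is consistent with the "around 57 percent" public domain share stated in the text.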
The dataset is hosted on Hugging Face and released under licenses consistent with the original sources. Pleias also published a series of language specific subsets, including French-PD-Newspapers, French-PD-Books, German-PD, Italian-PD, Spanish-PD-Books, Dutch-PD, Portuguese-PD, and Greek-PD.
On December 5, 2024, Pleias released the Pleias 1.0 family, the first set of language models pretrained from scratch entirely on Common Corpus. The release marked a milestone for the open data thesis: until that point most claims that ethical pretraining was viable had relied on assertions about dataset availability rather than fully trained models. Pleias 1.0 produced models that, while small, demonstrated coherent multilingual generation, low toxicity, and competitive performance on retrieval augmented generation benchmarks.
| Model | Parameters | GPUs | GPU type | Training time | Reported CO2 |
|---|---|---|---|---|---|
| Pleias-Pico | 350 million | 64 | H100 | 1.92 days | 0.5 tCO2eq |
| Pleias-Nano | 1.2 billion | 192 | H100 | 5 days | 4 tCO2eq |
| Pleias-1.0 (base) | 3 billion | 192 | H100 | 20 days | 16 tCO2eq |
The models use a transformer architecture similar to Llama and GPT-NeoX, which Pleias justified as a deliberate choice to maximize compatibility with existing inference stacks. They were trained with Nanotron, the open source framework maintained by Hugging Face, on the TractoAI compute platform. A custom multilingual tokenizer, optimized for the writing systems represented in Common Corpus, was developed to address inefficiencies that Pleias reported in tokenizers trained on web scrapes.
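The training table can be read in terms of GPU-hours, which makes the three runs easier to compare. The sketch below assumes that every GPU was in use for the full reported duration, an assumption not stated in the source, and the `TrainingRun` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    name: str
    gpus: int
    days: float
    tco2eq: float  # reported emissions, tonnes of CO2 equivalent

    @property
    def gpu_hours(self) -> float:
        # Assumption: all GPUs were busy for the whole wall-clock duration.
        return self.gpus * self.days * 24

    @property
    def kg_co2eq_per_gpu_hour(self) -> float:
        return self.tco2eq * 1000 / self.gpu_hours


# Values copied from the table above.
runs = [
    TrainingRun("Pleias-Pico", 64, 1.92, 0.5),
    TrainingRun("Pleias-Nano", 192, 5, 4),
    TrainingRun("Pleias-1.0 (base)", 192, 20, 16),
]

for run in runs:
    print(
        f"{run.name}: {run.gpu_hours:,.0f} GPU-hours, "
        f"{run.kg_co2eq_per_gpu_hour:.3f} kg CO2eq per GPU-hour"
    )
```

Under that assumption the three runs come out at roughly 0.17 kg CO2eq per GPU-hour each, which suggests the reported emissions scale almost linearly with compute.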
In an internal RAG tournament reported by the company, Pleias-Nano-1.2B-RAG achieved an Elo rating of 1137.5, outperforming Llama 3.2 1.1B and EuroLLM 1.7B, while Pleias-Pico-350M-RAG reached an Elo rating of 1051.2, beating SmolLM 360M and Qwen2.5 500M. On multilingual language adherence, a metric for whether models continue in the prompted language, Pleias 350M scored 89.8 percent, the 1.2B model 90.4 percent, and the 3B model 90.7 percent, compared with 65.6 percent for SmolLM 360M and 86.9 percent for EuroLLM 1.7B. Toxic generation rates were lower than those of peers, with the 350M model producing toxic outputs in 22.9 percent of stressed prompts versus 41.4 percent for OLMo 1B.
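To give the rating gap some intuition, the sketch below converts two ratings into an expected head-to-head score. It assumes the conventional Elo formula with a 400 point scale; the source does not say how the tournament ratings were actually computed, so the function is an illustration rather than the company's method.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the conventional Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings reported for the two Pleias RAG models.
nano_rating = 1137.5
pico_rating = 1051.2

nano_vs_pico = elo_expected_score(nano_rating, pico_rating)
print(f"Expected score of Nano against Pico: {nano_vs_pico:.3f}")
```

Under this assumption, the 86.3 point gap between the two models corresponds to an expected score of about 0.62 for the larger model in a pairwise comparison.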
The models are released under an Apache 2.0 license, can run on CPU only hardware without quantization, and are described by Pleias as fully compliant with the EU AI Act's data provenance requirements.
In April 2025 Pleias released a follow up family of small reasoning models named Pleias-RAG, focused on retrieval augmented generation with native citations. Two checkpoints were published: Pleias-RAG-350M and Pleias-RAG-1B. The release was accompanied by a paper titled "Even Small Reasoners Should Quote Their Sources," posted to arXiv on April 25, 2025.
The Pleias-RAG models were mid trained on a large synthetic dataset that simulated retrieval of multilingual passages from Common Corpus. They were designed to handle four RAG workflow components in a single forward pass: query routing, query reformulation, source reranking, and citation grounded answer generation. Citations use a custom syntax inspired by Wikipedia and include shortened excerpts when source passages are long. Unlike approaches such as Anthropic's Citation mode, where citations are produced through external chunking, Pleias-RAG generates citations integrally within the model.
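One practical benefit of citations that carry excerpts is that they can be checked mechanically against the retrieved passages. The sketch below shows such a check. The `<ref name="...">` tag format is a hypothetical stand-in modelled on Wikipedia references, not the documented Pleias-RAG syntax, and the sample sources and answer are invented for illustration.

```python
import re
from typing import NamedTuple

# Hypothetical Wikipedia-style tag; NOT the documented Pleias-RAG syntax.
REF_PATTERN = re.compile(
    r'<ref name="(?P<source>[^"]+)">(?P<excerpt>.*?)</ref>', re.DOTALL
)


class Citation(NamedTuple):
    source_id: str
    excerpt: str
    grounded: bool  # True when the excerpt appears verbatim in the source


def check_citations(answer: str, sources: dict[str, str]) -> list[Citation]:
    """Extract citations and test each excerpt against its source passage."""
    citations = []
    for match in REF_PATTERN.finditer(answer):
        source_id = match.group("source")
        excerpt = match.group("excerpt").strip()
        passage = sources.get(source_id, "")
        grounded = bool(excerpt) and excerpt in passage
        citations.append(Citation(source_id, excerpt, grounded))
    return citations


# Invented sample data for illustration only.
sources = {
    "source_1": "Common Corpus was expanded to more than two trillion tokens in November 2024.",
    "source_2": "Pleias 1.0 models are released under an Apache 2.0 license.",
}
answer = (
    "The corpus passed two trillion tokens in late 2024"
    '<ref name="source_1">expanded to more than two trillion tokens</ref> '
    "and the models are MIT licensed"
    '<ref name="source_2">released under an MIT license</ref>.'
)

results = check_citations(answer, sources)
for citation in results:
    status = "grounded" if citation.grounded else "NOT grounded"
    print(f"{citation.source_id}: {status}")
```

The first citation quotes its source verbatim and passes, while the second quotes text that does not appear in the passage and is flagged. An exact substring match is a deliberately strict choice; a real checker would likely need to tolerate whitespace and truncation differences.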
On the standardized RAG benchmarks 2WikiMultiHopQA, HotPotQA, and MuSiQue, the two Pleias-RAG models outperformed every published small language model below four billion parameters and were reported as competitive with Qwen-2.5-7B, Llama-3.1-8B, and Gemma-3-4B, despite being four to twenty times smaller. The release emphasized that the models maintained consistent performance across major European languages, a property Pleias attributed to the multilingual balance of Common Corpus.
In 2025 Pleias began publicizing SYNTH, described as the first fully autonomous dataset for synthetic pretraining. SYNTH packages a synthetic data pipeline that allows the construction of structured training material for specialized tasks without relying on external scraped data.
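The lab has also released several auxiliary tools used to clean and curate Common Corpus, all of which are themselves open source. They include the OCRonos-Vintage model for optical character recognition correction and the Celadon toxicity classifier.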
A demonstration application named ScholasticAI runs the Pico model locally and is published as open source.
For a small lab, Pleias has assembled an unusually broad partner network. The collaborations cluster around three themes: data sourcing, model release infrastructure, and applied deployments in regulated industries.
| Partner | Date | Nature |
|---|---|---|
| Hugging Face | 2024 | Dataset and model hosting; Nanotron training framework; co-announcement of Common Corpus |
| EleutherAI | 2024 | Co-coordination of Common Corpus; later parallel work on the Common Pile |
| Occiglot | 2024 | European multilingual model collaboration |
| Nomic AI | 2024 | Data infrastructure and embeddings |
| Lang:IA / French Ministry of Culture | 2024 | Funding and access to public cultural heritage data |
| Mozilla Builders | November 2024 | Co-announcement of expanded two trillion token Common Corpus |
| AI Alliance | 2024 onward | Open Trusted Data Initiative founding membership |
| Wikimedia Enterprise | February 18, 2025 | Access to structured Wikimedia datasets for the annealing phase of training |
| Mukwege Foundation | 2024 onward | Applied work on conflict related trauma support |
| Kajou | 2024 onward | Offline first systems for community health workers |
| SpineDAO | 2024 onward | Specialized medical knowledge applications |
The Wikimedia Enterprise agreement, announced in February 2025, gave Pleias access to pre-processed Wikipedia content, including infoboxes, sections, and RevertRisk credibility signals, intended for use in the annealing phase of model training. The Mozilla Builders relationship has been the most visible commercial validation of the Common Corpus thesis, since Mozilla used the partnership to argue that openly licensed alternatives to opaque pretraining datasets are feasible.
As of late 2025 Pleias has not publicly disclosed any venture funding round. Industry databases including Crunchbase and Tracxn list the company as unfunded, suggesting that its operations are sustained by a combination of consulting revenue, applied contracts, and grant or public funding tied to the Common Corpus effort. The founders have indicated in interviews that the lab takes a deliberately conservative approach to capital, on the grounds that a small open data team can ship competitive small models without frontier scale compute budgets.
Pleias sits inside a small but growing cohort of labs and consortia building openly licensed pretraining data and the models that result from it. The closest peers differ in geography, scale, and philosophy.
| Organization | Flagship dataset | Models | Approach to copyright | Scale |
|---|---|---|---|---|
| Pleias | Common Corpus (about 2T tokens) | Pleias 1.0 (350M, 1.2B, 3B); Pleias-RAG (350M, 1B) | Public domain or permissively licensed only | Small lab, Paris, no disclosed funding |
| EleutherAI | Common Pile v0.1 (8 TB; about 2T tokens) | Comma v0.1-2T | Public domain or openly licensed only | Nonprofit research collective, US |
| Allen Institute for AI | Dolma | OLMo 2 family | Mostly openly licensed; web crawl included with documentation | Major nonprofit institute, Seattle |
| Hugging Face | FineWeb, FineWeb-Edu | Various community models | Web filtered for permissive licenses, less strict on provenance | Large open source platform |
The most direct comparison is with EleutherAI's Common Pile, released in June 2025. Both projects share the goal of openly licensed pretraining at trillion token scale, and both organizations are co-coordinators of the AI Alliance Open Trusted Data Initiative. The Common Pile paper, published on arXiv as 2506.05209, reports that models trained on Common Pile outperform models trained on KL3M, OLC, and Common Corpus and perform comparably to those trained on the Pile or OSCAR. The same paper notes that Common Corpus models consistently outperform OLMo 1B, which was also pre-trained on a publicly released dataset.
The Allen Institute's Dolma dataset and the OLMo 2 family take a slightly different view, including some web crawled content with careful documentation rather than restricting the corpus to public domain or permissively licensed sources only. Hugging Face's FineWeb is openly licensed in distribution but is filtered from Common Crawl and therefore relies on the underlying copyright posture of web scraping. Pleias's strict public domain stance is arguably the most conservative among these efforts, which has both costs (less raw content per unit effort) and benefits (cleaner legal and ethical posture).
The shared paper "Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training," posted to arXiv as 2506.01732 in June 2025, makes the case for the broader category of "ethical data" and explicitly aligns Pleias with the AI Alliance Open Trusted Data Initiative.
The reception of Pleias's work in 2024 and 2025 reflected its dual status as both a niche technical project and a public argument about the future of AI training. Coverage by Euronews, Digital Watch Observatory, MarkTechPost, Slashdot, and VentureBeat consistently framed the lab as evidence that copyrighted scrapes are not strictly necessary for building useful language models. The November 2024 expansion of Common Corpus to two trillion tokens was described by Mozilla as a milestone for what the foundation called the open AI commons.
The Pleias 1.0 release in December 2024 prompted discussion within the small model community over how much performance can be recovered with a strict provenance discipline. Analysts noted that the 3B base model traded some benchmark performance against models such as OLMo and Llama 3.2 of similar size but compensated with stronger multilingual adherence, lower toxic generation rates, and dramatically lower training emissions.
The April 2025 Pleias-RAG release attracted broader attention because the citation grounded reasoning behavior, packaged in a sub one billion parameter model, was unusual in the small model landscape. Commentators including Simon Willison and Rohan Paul wrote that the result demonstrated that small models could be specialized for retrieval workflows without losing the citation discipline normally associated with much larger systems.
Pleias itself has reported reuse of its tools and datasets by researchers and engineers at Anthropic, IBM, StepFun, ElasticSearch, and Morgan Stanley, although the firm has not detailed those engagements publicly. Such adoption, while not equivalent to commercial partnerships, illustrates that the artifacts produced by a small team in Paris have circulated into industry research workflows.
The lab also participates in policy conversations about the EU AI Act's data transparency obligations. Pleias has argued in submissions and public posts that the act's requirements for documented training data are achievable without sacrificing model quality, and that the existence of Common Corpus and the Pleias 1.0 family is direct evidence for that position. Whether other firms follow remains an open question, but the lab has succeeded in making the question harder to ignore.
As of 2026 Pleias remains a small Paris based laboratory with a focused research agenda, no disclosed venture funding, and a growing catalog of open data, open models, and open tools. The trajectory of the field has begun to converge on the lab's central claim: that careful curation of permissively licensed content can support useful pretraining, and that the ethical and regulatory case for doing so is strengthening rather than fading. Whether the broader industry adopts the strict public domain discipline that Pleias practices, or settles on the looser openly licensed posture preferred by FineWeb and similar projects, is one of the defining open questions in large language model training in the late 2020s.