Pleias


Updated Jul 16, 2026

Pleias (stylized PleIAs) is a Paris-based artificial intelligence laboratory and small company that designs, pretrains, and releases large language models trained exclusively on public domain and permissively licensed data.^[1] Founded in 2024 by Pierre-Carl Langlais, Anastasia Stasenko, and Ivan Yamshchikov, the firm has positioned itself as one of the most prominent advocates of what it calls "ethical pretraining," the view that competitive language models can be built without scraping copyrighted material. Pleias is the principal coordinator of Common Corpus, a multilingual open dataset that exceeded two trillion tokens by 2025, and the maker of the Pleias 1.0 family of small language models, the first such family pretrained from scratch on a fully open corpus.^[6] The lab's stance has drawn attention from the open source community, European policymakers, and observers of the broader debate over copyright, data provenance, and AI ethics.

The company is small by industry standards, with roughly fifteen people listed on its website as of 2025, and has not disclosed a venture funding round.^[1]^[20] Despite that modest profile, Pleias has built a reputation for producing technical artifacts that influence larger players, including curated public domain datasets, multilingual tokenizers, optical character recognition correction models, and small reasoning models with native citation support. Its work has been adopted or referenced by organizations including Hugging Face, EleutherAI, Mozilla, Wikimedia Enterprise, Nomic AI, and Occiglot, and it is one of the founding members of the Open Trusted Data Initiative.^[19]

Company overview

Pleias was incorporated in 2024 in Paris, France, with a small founding team drawn from academia, digital humanities, and industry. The three co-founders bring complementary backgrounds: Pierre-Carl Langlais holds a PhD in information science and is an associate researcher at the Sorbonne Center for Artificial Intelligence, with a long publication record in computational humanities and digitized text studies. Anastasia Stasenko, who serves as chief executive, completed a PhD in philosophy at the École Normale Supérieure (ENS Ulm) and previously worked at Hachette Publishing before becoming an associate senior lecturer at Sorbonne-Nouvelle. Ivan Yamshchikov, the third co-founder, is a research professor at CAIRO and was previously an executive at Yandex and ABBYY, where he worked on natural language systems and document understanding.^[1]

The name evokes the Greek Πλειάς and the sixteenth-century French poetic group the Pléiade, while the stylization PleIAs embeds the French abbreviation IA (intelligence artificielle). The founders framed Pleias from the outset not as a frontier laboratory chasing scale, but as a research and engineering team focused on energy-efficient models, data quality, and verifiability.

Founders and leadership

Name	Role	Background
Anastasia Stasenko	Co-founder and CEO	PhD philosophy, ENS Ulm; ex Hachette Publishing
Pierre-Carl Langlais	Co-founder and CTO	PhD information science; researcher at Sorbonne Center for AI
Ivan Yamshchikov	Co-founder	PhD financial mathematics; research professor at CAIRO; ex Yandex, ABBYY

Beyond the founders, the extended team includes AI scientists, data engineers, and product specialists with prior experience at institutions such as KU Leuven, Apple, Aleph Alpha, and ENS Paris. The lab has remained small, and its founders have repeatedly emphasized that the team prioritizes data and methods research over headcount growth.

Mission and the ethical pretraining thesis

Pleias's defining claim is that competitive language models can be trained without copyrighted material scraped without permission. The argument, often associated with Pierre-Carl Langlais in public talks and blog posts, is twofold. First, the legal and ethical status of training on web scrapes is unsettled, and the prevailing industry practice of opt-out-only data collection is incompatible with European norms and likely with parts of the EU AI Act. Second, the universe of openly licensed and public domain text is far larger than commonly assumed, especially once digitized cultural heritage, scientific literature, government documents, and code repositories are aggregated.^[12]

In April 2024 a Euronews profile ran under the headline "This French start-up just proved OpenAI wrong" after the company published evidence that training on non-copyrighted data was tractable.^[11] The same year, in a column for the Orange research outlet Hello Future, Langlais argued that the field had drifted toward an "extractive" relationship with the open web and that public institutions, cultural heritage collections, and scientific archives could supply enough text for general-purpose pretraining.^[12] The thesis has since been picked up by parallel projects at EleutherAI (Common Pile), the Allen Institute for Artificial Intelligence (OLMo 2 and the Dolma corpus), and Hugging Face (FineWeb, which is openly licensed but does not enforce the same provenance discipline).

Pleias also frames its position around energy efficiency and EU regulatory alignment. The lab reports CO2 emissions per training run and emphasizes that its models are designed for CPU inference, on-device deployment, and highly regulated industries such as legal, financial, and healthcare services, where data lineage matters.

Products and research outputs

Pleias has organized its work into three product lines: the Common Corpus open data resource, the Pleias 1.0 base models, and a suite of specialized models and tools that includes Pleias-RAG, OCRonos-Vintage, the Celadon toxicity classifier, and the SYNTH synthetic data pipeline. The company also operates an enterprise stack branded STRATUM for agentic applications.

Common Corpus

Common Corpus is Pleias's flagship dataset and the project that gave the lab its public profile. The first version was released in March 2024, comprising around 500 billion words of public domain text and described at the time as the largest fully open multilingual dataset for language model pretraining.^[2] The release was coordinated with Hugging Face, EleutherAI, Occiglot, Nomic AI, and Lang:IA, a French government program supported by the Ministry of Culture.

A substantially expanded version was announced in November 2024 in partnership with Mozilla Builders, reaching more than two trillion tokens and reorganized into six thematic collections.^[3]^[4]^[21] A peer-reviewed description of the corpus was accepted for oral presentation at the International Conference on Learning Representations (ICLR) 2026.

Common Corpus collections

Collection	Approximate tokens	Description
Open Government	406.6 billion	Financial, legal, and administrative documents in the public domain
Open Culture	886 billion	Cultural heritage monographs and periodicals in more than thirteen languages
Open Science	281.2 billion	Scientific papers and research under permissive licenses
Open Code	283.2 billion	Source code from The Stack v1 and v2 under free licenses
Open Web	73.2 billion	Wikipedia, Wikisource, YouTube transcripts, and StackExchange
Open Semantic	68 billion	Wikidata converted into natural language

The total corpus sits at roughly two trillion tokens, with public domain content accounting for about 1.14 trillion tokens (around 57 percent), Creative Commons Attribution material adding 287.7 billion tokens, MIT-licensed content contributing 142.7 billion tokens, and Creative Commons Attribution-ShareAlike material providing another 74.8 billion tokens.^[5] English remains the largest single language at roughly 867 billion tokens, but more than 40 percent of the corpus is non-English, with French at 266 billion tokens, German at 112 billion, and significant collections in Spanish, Italian, Dutch, Latin, and Portuguese, among others.^[9] More than thirty languages are represented by at least one billion tokens each.
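The figures above can be cross-checked with a few lines of arithmetic. The following Python sketch sums the six collection sizes from the table and derives each license category's share of that total. The dictionary names and the helper `share_of_total` are illustrative assumptions, not part of any Pleias tooling, and the token counts are the rounded values quoted in this article.

```python
# Token counts in billions, copied from the rounded figures in this article.
COLLECTIONS = {
    "Open Government": 406.6,
    "Open Culture": 886.0,
    "Open Science": 281.2,
    "Open Code": 283.2,
    "Open Web": 73.2,
    "Open Semantic": 68.0,
}

# License categories itemized in the text (also in billions of tokens).
LICENSES = {
    "Public domain": 1140.0,
    "CC BY": 287.7,
    "MIT": 142.7,
    "CC BY-SA": 74.8,
}


def share_of_total(tokens_billion: float, total_billion: float) -> float:
    """Return a percentage share, rounded to one decimal place."""
    return round(100.0 * tokens_billion / total_billion, 1)


# Sum of the six collections, rounded to hide floating-point noise.
total = round(sum(COLLECTIONS.values()), 1)

if __name__ == "__main__":
    print(f"Total: {total} billion tokens")
    for name, tokens in LICENSES.items():
        print(f"{name}: {share_of_total(tokens, total)}%")
```

The six collections sum to 1,998.2 billion tokens, which matches the "roughly two trillion" figure, and the public domain share comes out at about 57 percent. The four itemized license categories cover a little over 80 percent of the total; this article does not break down the remainder.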

The dataset is hosted on Hugging Face and released under licenses consistent with the original sources. Pleias also published a series of language-specific subsets, including French-PD-Newspapers, French-PD-Books, German-PD, Italian-PD, Spanish-PD-Books, Dutch-PD, Portuguese-PD, and Greek-PD.^[9]

Pleias 1.0 family

On December 5, 2024, Pleias released the Pleias 1.0 family, the first set of language models pretrained from scratch entirely on Common Corpus.^[6] The release marked a milestone for the open data thesis: until that point most claims that ethical pretraining was viable had relied on assertions about dataset availability rather than fully trained models. Pleias 1.0 produced models that, while small, demonstrated coherent multilingual generation, low toxicity, and competitive performance on retrieval-augmented generation (RAG) benchmarks.^[7]

Pleias 1.0 base models

Model	Parameters	GPUs	GPU type	Training time	Reported CO2
Pleias-Pico	350 million	64	H100	1.92 days	0.5 tCO2eq
Pleias-Nano	1.2 billion	192	H100	5 days	4 tCO2eq
Pleias-1.0 (base)	3 billion	192	H100	20 days	16 tCO2eq
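As a rough consistency check on the table above, the sketch below converts each row into GPU-hours and into reported emissions per GPU-hour. The `TrainingRun` class and its property names are hypothetical helpers introduced for illustration, and the calculation assumes every GPU ran for the full stated training time, which the source does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    """One row of the table: GPU count, wall-clock days, reported tCO2eq."""
    name: str
    gpus: int
    days: float
    tco2eq: float

    @property
    def gpu_hours(self) -> float:
        # Assumes all GPUs were busy for the whole run (a simplification).
        return self.gpus * self.days * 24

    @property
    def kg_co2_per_gpu_hour(self) -> float:
        # Convert tonnes to kilograms before dividing by GPU-hours.
        return 1000 * self.tco2eq / self.gpu_hours


RUNS = [
    TrainingRun("Pleias-Pico", gpus=64, days=1.92, tco2eq=0.5),
    TrainingRun("Pleias-Nano", gpus=192, days=5, tco2eq=4),
    TrainingRun("Pleias-1.0 (base)", gpus=192, days=20, tco2eq=16),
]

if __name__ == "__main__":
    for run in RUNS:
        print(
            f"{run.name}: {run.gpu_hours:,.0f} GPU-hours, "
            f"{run.kg_co2_per_gpu_hour:.3f} kg CO2eq per GPU-hour"
        )
```

Under that assumption the three runs come to about 2,949, 23,040, and 92,160 GPU-hours, and each works out to roughly 0.17 kg CO2eq per GPU-hour, so the reported emissions scale consistently with compute across the family.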

The models use a transformer architecture similar to Llama and GPT-NeoX, which Pleias justified as a deliberate choice to maximize compatibility with existing inference stacks. They were trained with Nanotron, the open source framework maintained by Hugging Face, on the TractoAI compute platform. A custom multilingual tokenizer, optimized for the writing systems represented in Common Corpus, was developed to address inefficiencies that Pleias reported in tokenizers trained on web scrapes.^[7]

In an internal RAG tournament reported by the company, Pleias-Nano-1.2B-RAG achieved an Elo rating of 1137.5, outperforming Llama 3.2 1.1B and EuroLLM 1.7B, while Pleias-Pico-350M-RAG reached an Elo rating of 1051.2, beating SmolLM 360M and Qwen2.5 500M.^[7] On multilingual language adherence, a metric for whether models continue in the prompted language, Pleias 350M scored 89.8 percent, the 1.2B model 90.4 percent, and the 3B model 90.7 percent, compared with 65.6 percent for SmolLM 360M and 86.9 percent for EuroLLM 1.7B. Toxic generation rates were lower than those of peers, with the 350M model producing toxic outputs in 22.9 percent of stressed prompts versus 41.4 percent for OLMo 1B.^[7]
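To give a sense of what a rating gap of this size means, the sketch below applies the standard logistic Elo expected-score formula to the two reported ratings. This is an assumption: the source does not say which Elo variant or scale factor the tournament used, so the conventional 400-point scale is used here and the function name `elo_expected_score` is illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model.

    The 400-point scale is the conventional default, assumed here because
    the source does not document the tournament's exact rating scheme.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


NANO_RAG = 1137.5  # Pleias-Nano-1.2B-RAG, as reported
PICO_RAG = 1051.2  # Pleias-Pico-350M-RAG, as reported

if __name__ == "__main__":
    p = elo_expected_score(NANO_RAG, PICO_RAG)
    print(f"Expected score of Nano vs Pico: {p:.2f}")
```

Under this assumption, the 86.3-point gap corresponds to an expected score of roughly 0.62 for the larger model in a head-to-head comparison, a clear but not overwhelming advantage.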

The models are released under an Apache 2.0 license, can run on CPU-only hardware without quantization, and are described by Pleias as fully compliant with the EU AI Act's data provenance requirements.^[7]

Pleias-RAG

In April 2025 Pleias released a follow-up family of small reasoning models named Pleias-RAG, focused on retrieval-augmented generation with native citations. Two checkpoints were published: Pleias-RAG-350M and Pleias-RAG-1B. The release was accompanied by a paper titled "Even Small Reasoners Should Quote Their Sources," posted to arXiv on April 25, 2025.^[8]

The Pleias-RAG models were mid-trained on a large synthetic dataset that simulated retrieval of multilingual passages from Common Corpus. They were designed to handle four RAG workflow components in a single forward pass: query routing, query reformulation, source reranking, and citation-grounded answer generation.^[8] Citations use a custom syntax inspired by Wikipedia and include shortened excerpts when source passages are long. Unlike approaches such as Anthropic's Citation mode, where citations are produced through external chunking, Pleias-RAG generates citations as part of the model's own output.^[10]

On the standardized RAG benchmarks 2WikiMultiHopQA, HotPotQA, and MuSiQue, the two Pleias-RAG models outperformed every published small language model below four billion parameters and were reported as competitive with Qwen-2.5-7B, Llama-3.1-8B, and Gemma-3-4B, despite being four to twenty times smaller.^[8] The release emphasized that the models maintained consistent performance across major European languages, a property Pleias attributed to the multilingual balance of Common Corpus.

SYNTH and supporting tools

In 2025 Pleias began publicizing SYNTH, described as the first fully autonomous dataset for synthetic pretraining. SYNTH packages a synthetic data pipeline that allows the construction of structured training material for specialized tasks without relying on external scraped data.

The lab has also released several auxiliary tools used to clean and curate Common Corpus, all of which are themselves open source:

OCRonos-Vintage, a small language model fine-tuned to correct optical character recognition errors in digitized historical newspapers and books.
Celadon, a toxicity classifier used to filter pretraining data.
Nanotron-pleias, a fork of Hugging Face's Nanotron framework adapted to the data formats and training schedules used for the Pleias 1.0 family.

A demonstration application named ScholasticAI runs the Pico model locally and is published as open source.

Partnerships and ecosystem

For a small lab, Pleias has assembled an unusually broad partner network. The collaborations cluster around three themes: data sourcing, model release infrastructure, and applied deployments in regulated industries.

Major partnerships

Partner	Date	Nature
Hugging Face	2024	Dataset and model hosting; Nanotron training framework; co-announcement of Common Corpus
EleutherAI	2024	Co-coordination of Common Corpus; later parallel work on the Common Pile
Occiglot	2024	European multilingual model collaboration
Nomic AI	2024	Data infrastructure and embeddings
Lang:IA / French Ministry of Culture	2024	Funding and access to public cultural heritage data
Mozilla Builders	November 2024	Co-announcement of the expanded two-trillion-token Common Corpus
AI Alliance	2024 onward	Open Trusted Data Initiative founding membership
Wikimedia Enterprise	February 18, 2025	Access to structured Wikimedia datasets for the annealing phase of training
Mukwege Foundation	2024 onward	Applied work on conflict-related trauma support
Kajou	2024 onward	Offline-first systems for community health workers
SpineDAO	2024 onward	Specialized medical knowledge applications

The Wikimedia Enterprise agreement, announced in February 2025, gave Pleias access to pre-processed Wikipedia content, including infoboxes, sections, and RevertRisk credibility signals, intended for use in the annealing phase of model training.^[15] The Mozilla Builders relationship has been the most visible commercial validation of the Common Corpus thesis, since Mozilla used the partnership to argue that openly licensed alternatives to opaque pretraining datasets are feasible.^[4]

As of late 2025 Pleias had not publicly disclosed any venture funding round. Industry databases including Crunchbase and Tracxn list the company as unfunded, suggesting that its operations are sustained by a combination of consulting revenue, applied contracts, and grant or public funding tied to the Common Corpus effort.^[20] The founders have indicated in interviews that the lab takes a deliberately conservative approach to capital, on the grounds that a small open data team can ship competitive small models without frontier-scale compute budgets.

Comparison to peer organizations

Pleias sits inside a small but growing cohort of labs and consortia building openly licensed pretraining data and the models that result from it. The closest peers differ in geography, scale, and philosophy.

Organization	Flagship dataset	Models	Approach to copyright	Scale
Pleias	Common Corpus (about 2T tokens)	Pleias 1.0 (350M, 1.2B, 3B); Pleias-RAG (350M, 1B)	Public domain or permissively licensed only	Small lab, Paris, no disclosed funding
EleutherAI	Common Pile v0.1 (8 TB; about 2T tokens)	Comma v0.1-2T	Public domain or openly licensed only	Nonprofit research collective, US
Allen Institute for AI	Dolma	OLMo 2 family	Mostly openly licensed; web crawl included with documentation	Major nonprofit institute, Seattle
Hugging Face	FineWeb, FineWeb-Edu	Various community models	Web filtered for permissive licenses, less strict on provenance	Large open source platform

The most direct comparison is with EleutherAI's Common Pile, released in June 2025.^[18] Both projects pursue openly licensed pretraining at trillion-token scale, and Pleias and EleutherAI are both co-coordinators of the AI Alliance Open Trusted Data Initiative. The Common Pile paper, published on arXiv as 2506.05209, reports that models trained on Common Pile outperform models trained on KL3M, OLC, and Common Corpus and perform comparably to those trained on the Pile or OSCAR. The same paper notes that Common Corpus models consistently outperform OLMo 1B, which was also pretrained on a publicly released dataset.^[17]

The Allen Institute's Dolma dataset and the OLMo 2 family take a slightly different view, including some web-crawled content with careful documentation rather than restricting the corpus to public domain or permissively licensed sources. Hugging Face's FineWeb is openly licensed in distribution but is filtered from Common Crawl and therefore relies on the underlying copyright posture of web scraping. Pleias's strict public domain stance is arguably the most conservative among these efforts, which has both costs (less raw content per unit of effort) and benefits (a cleaner legal and ethical posture).

The paper "Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training," posted to arXiv as 2506.01732 in June 2025, makes the case for the broader category of "ethical data" and explicitly aligns Pleias with the AI Alliance Open Trusted Data Initiative.^[5]

Reception and influence

The reception of Pleias's work in 2024 and 2025 reflected its dual status as both a niche technical project and a public argument about the future of AI training. Coverage by Euronews, Digital Watch Observatory, MarkTechPost, Slashdot, and VentureBeat consistently framed the lab as evidence that copyrighted scrapes are not strictly necessary for building useful language models.^[11]^[13]^[14] The November 2024 expansion of Common Corpus to two trillion tokens was described by Mozilla as a milestone for what the foundation called the open AI commons.^[4]^[16]

The Pleias 1.0 release in December 2024 prompted discussion within the small model community over how much performance can be recovered with a strict provenance discipline. Analysts noted that the 3B base model traded some benchmark performance against models such as OLMo and Llama 3.2 of similar size but compensated with stronger multilingual adherence, lower toxic generation rates, and dramatically lower training emissions.^[7]

The April 2025 Pleias-RAG release attracted broader attention because citation-grounded reasoning behavior, packaged in models as small as 350 million parameters, was unusual in the small model landscape.^[13] Commentators including Simon Willison and Rohan Paul wrote that the result demonstrated that small models could be specialized for retrieval workflows without losing the citation discipline normally associated with much larger systems.

Pleias itself has reported reuse of its tools and datasets by researchers and engineers at Anthropic, IBM, StepFun, ElasticSearch, and Morgan Stanley, although the firm has not detailed those engagements publicly. Such adoption, while not equivalent to commercial partnerships, illustrates that the artifacts produced by a small team in Paris have circulated into industry research workflows.

The lab also participates in policy conversations about the EU AI Act's data transparency obligations. Pleias has argued in submissions and public posts that the act's requirements for documented training data are achievable without sacrificing model quality, and that the existence of Common Corpus and the Pleias 1.0 family is direct evidence for that position. Whether other firms follow remains an open question, but the lab has succeeded in making the question harder to ignore.

Outlook

As of 2026 Pleias remains a small Paris-based laboratory with a focused research agenda, no disclosed venture funding, and a growing catalog of open data, open models, and open tools. The trajectory of the field has begun to converge on the lab's central claim: that careful curation of permissively licensed content can support useful pretraining, and that the ethical and regulatory case for doing so is strengthening rather than fading. Whether the broader industry adopts the strict public domain discipline that Pleias practices, or settles on the looser openly licensed posture preferred by FineWeb and similar projects, is one of the defining open questions in large language model training in the late 2020s.

References

1. Pleias, official site, https://pleias.ai/
2. Langlais, Pierre-Carl. "Releasing Common Corpus: the largest public domain dataset for training LLMs." Hugging Face blog, March 20, 2024.
3. Langlais, Pierre-Carl. "Releasing the largest multilingual open pretraining dataset." Hugging Face blog, November 14, 2024.
4. Mozilla Builders. "Announcing Common Corpus: A 2+ trillion token dataset that's fully open and accessible." November 2024.
5. Langlais, Pierre-Carl et al. "Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training." arXiv:2506.01732, June 2025.
6. Langlais, Pierre-Carl. "They Said It Couldn't Be Done." Hugging Face blog, December 5, 2024.
7. Pleias. "Pleias 1.0: the First Ever Family of Language Models Trained Exclusively on Open Data." Procedia Computer Science, 2025.
8. Pleias. "Even Small Reasoners Should Quote Their Sources: Introducing the Pleias-RAG Model Family." arXiv:2504.18225, April 25, 2025.
9. PleIAs/common_corpus dataset card. Hugging Face, accessed 2025.
10. PleIAs/Pleias-RAG-350M model card. Hugging Face, accessed 2025.
11. "This French start-up just proved OpenAI wrong." Euronews, April 2, 2024.
12. "P-C. Langlais (PLEAIS): Our language models are trained on open corpora." Hello Future (Orange), 2024.
13. "Ethically trained AI startup Pleias releases new small reasoning models optimized for RAG with built-in citations." VentureBeat, April 24, 2025.
14. "Pleias Introduces Common Corpus: The Largest Multilingual Dataset for Pretraining Language Models." MarkTechPost, November 18, 2024.
15. "Wikimedia Enterprise & Pleias Partner for Ethical AI Innovation." Wikimedia Enterprise blog, February 18, 2025.
16. "Common Corpus: building AI as Commons." Open Future, 2024.
17. Kandpal, Nikhil et al. "The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text." arXiv:2506.05209, June 2025.
18. EleutherAI. "The Common Pile v0.1." EleutherAI blog, June 2025.
19. "Pleias Releases Common Corpus." AI Alliance blog, 2024.
20. Pleias company profile, Crunchbase and Tracxn, accessed 2025.
21. Willison, Simon. "Releasing the largest multilingual open pretraining dataset." simonwillison.net, November 14, 2024.
