CamemBERT
Last reviewed: May 31, 2026
Sources: 8 citations
Review status: Source-backed
Revision: v1 · 2,986 words
CamemBERT is a French monolingual language model based on the RoBERTa architecture, released in late 2019 by researchers at Inria, Facebook AI Research, and Sorbonne Université. It is a transformer encoder pretrained on a large quantity of French text, and it was the first widely adopted language model dedicated specifically to French. The model was introduced in the paper "CamemBERT: a Tasty French Language Model" by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, and colleagues, published at the 58th Annual Meeting of the Association for Computational Linguistics (ACL) in 2020. Its playful name follows the fashion, popular at the time, of naming BERT derivatives after foods and characters, in this case the French cheese camembert.
CamemBERT is distributed through Hugging Face, where it lives under the Almanach organization at almanach/camembert-base (the checkpoint was originally published as camembert-base). It is released under the permissive MIT license. Like other BERT-style encoders, it is not a text generator; instead it produces contextual representations of French text that are meant to be adapted, through fine-tuning, to downstream tasks in natural language processing such as part-of-speech tagging, dependency parsing, named entity recognition, and natural language inference. On all four of these tasks the model reached or improved on the state of the art for French at the time of its release, and it has since become one of the default backbones for French NLP, downloaded more than a million times per month and reused in well over a hundred fine-tuned derivatives.
When CamemBERT appeared, the dominant pretrained models were either English-only systems such as the original BERT and RoBERTa, or multilingual models such as multilingual BERT (mBERT) and XLM-RoBERTa that spread their capacity across a hundred or more languages at once. The multilingual models made French accessible, but they devoted only a fraction of their parameters and training data to it, which left room for a dedicated French model to do better. CamemBERT was built to test exactly that hypothesis: that a model dedicated to a single language, trained on enough high-quality monolingual text, could yield stronger representations of that language than a general multilingual model.
The central empirical claims of the paper were twofold. First, a monolingual French model could outperform the leading multilingual models on French tasks while using far fewer parameters, largely because it could afford a small French-only vocabulary instead of a large shared multilingual one. Second, and more surprisingly, the source of the pretraining text mattered more than its sheer volume: a relatively small web-crawled corpus of about 4 gigabytes produced results essentially as good as a much larger 138-gigabyte corpus, and web-crawled text proved preferable to clean but stylistically narrow Wikipedia text. This finding pushed back against the assumption that more data is always better, and it influenced how later monolingual models for other languages were trained.
The model was developed largely within the Almanach (formerly Alpage) project team at Inria, the French national institute for research in computer science, in collaboration with Facebook AI Research's Paris lab and Sorbonne Université. Inria has continued to maintain and extend the CamemBERT family, releasing successor architectures and updated checkpoints over the following years.
CamemBERT inherits its architecture directly from RoBERTa, which is itself a carefully retuned version of BERT. It is an encoder-only transformer that stacks layers of multi-head self-attention and position-wise feed-forward blocks, producing a contextual embedding for every input token. There is no decoder and no autoregressive generation. The base configuration matches BERT-base, with 12 transformer layers, a hidden size of 768, 12 attention heads, and roughly 110 million parameters. The paper also trained a larger variant matching BERT-large, with 24 layers, a hidden size of 1024, and about 335 million parameters.
The pretraining objective is masked language modeling. Following the RoBERTa recipe, CamemBERT drops BERT's next-sentence-prediction task entirely and uses dynamic masking, meaning the choice of which tokens to mask is regenerated on the fly rather than fixed once during preprocessing, which makes the model more robust when training for many epochs. Around 15 percent of tokens are selected for prediction; of those, 80 percent are replaced with a special mask token, 10 percent are replaced with a random token, and 10 percent are left unchanged. The single most important departure from RoBERTa is the use of whole-word masking. Because SentencePiece tokenization can split a single word into several subword pieces, CamemBERT masks all of the pieces belonging to a chosen word at once and trains the model to recover the entire word rather than guessing a missing fragment from its neighbors. The authors report that whole-word masking gives a measurable boost, including a roughly 1.3-point gain on named entity recognition.
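The masking procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' training code: the helper names, the hand-written subword segmentation, and the choice to draw the 80/10/10 outcome independently for each piece are all assumptions made for the example. It relies only on the SentencePiece convention that the first piece of each word starts with the "▁" marker.

```python
import random

MASK = "<mask>"        # mask token string (assumed here for illustration)
WORD_START = "\u2581"  # SentencePiece prefixes the first piece of a word with "▁"


def group_into_words(pieces):
    """Group piece indices so that each inner list covers one whole word."""
    words = []
    for i, piece in enumerate(pieces):
        if piece.startswith(WORD_START) or not words:
            words.append([i])      # a new word begins here
        else:
            words[-1].append(i)    # continuation piece of the current word
    return words


def whole_word_mask(pieces, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels); labels[i] is None where nothing is predicted."""
    rng = rng or random.Random()
    corrupted = list(pieces)
    labels = [None] * len(pieces)
    for word in group_into_words(pieces):
        if rng.random() >= mask_prob:
            continue               # this word is not selected for prediction
        for i in word:             # every piece of a selected word is a target
            labels[i] = pieces[i]
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = MASK               # 80%: replace with mask token
            elif roll < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: replace with random token
            # remaining 10%: leave the piece unchanged
    return corrupted, labels


# Hypothetical segmentation of "Le camembert est délicieux".
pieces = ["▁Le", "▁cam", "ember", "t", "▁est", "▁délicieux"]
corrupted, labels = whole_word_mask(pieces, vocab=pieces, rng=random.Random(0))
```

Calling `whole_word_mask` afresh each time a sentence is served, rather than storing its output once, is what makes the masking dynamic in the sense used above.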
The training corpus is the French portion of OSCAR (Open Super-large Crawled Aggregated coRpus), a multilingual collection extracted from Common Crawl web snapshots. Language separation in OSCAR is performed with a fastText linear classifier trained on Wikipedia, Tatoeba, and other sources, after which no further filtering is applied. CamemBERT uses a non-shuffled version of the French data, amounting to about 138 gigabytes of raw text, which is roughly 32.7 billion tokens after subword tokenization. Optimization used Adam with beta1 of 0.9 and beta2 of 0.98, run for 100,000 steps with very large batches of 8,192 sequences. The learning rate was warmed up over the first 10,000 steps and then decayed toward zero. Pretraining was carried out on 256 NVIDIA V100 GPUs, each with 32 gigabytes of memory. The authors noted that validation loss was still decreasing at 100,000 steps, suggesting the released models were, if anything, undertrained and could have improved with a longer schedule.
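The warm-up and decay schedule described above can be written as a small function. The sketch below is a reading of that description rather than the authors' configuration: the step counts and batch size come from the text, while the peak learning rate of `7e-4` and the linear shape of both ramps are assumptions chosen for illustration.

```python
def learning_rate(step, peak_lr=7e-4, warmup_steps=10_000, total_steps=100_000):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps.

    peak_lr and the linear decay shape are illustrative assumptions.
    """
    if step < warmup_steps:
        return peak_lr * (step / warmup_steps)
    remaining = max(total_steps - step, 0)
    return peak_lr * (remaining / (total_steps - warmup_steps))


# Total number of training sequences implied by the figures in the text.
BATCH_SIZE = 8_192
TOTAL_STEPS = 100_000
sequences_seen = BATCH_SIZE * TOTAL_STEPS  # 819,200,000 sequences
```

Under these assumptions the rate peaks at step 10,000 and reaches zero exactly at step 100,000, the point at which the authors observed that validation loss was still falling.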
CamemBERT tokenizes text with SentencePiece, a subword segmentation method that is an extension of byte-pair encoding and WordPiece. SentencePiece operates directly on raw Unicode text and treats whitespace as an ordinary character, so it does not require any language-specific pre-tokenizer to split words first. This makes it a clean fit for French, where contractions, elisions marked by an apostrophe (as in "l'eau" or "j'aime"), and accented characters complicate naive word splitting.
The model uses a vocabulary of 32,000 subword units. This small French-only vocabulary is one of the main reasons CamemBERT is so parameter-efficient relative to multilingual models: the embedding and output-projection matrices scale with the vocabulary size, and a multilingual system such as XLM-RoBERTa needs around 250,000 entries to cover a hundred languages, against CamemBERT's 32,000 for French alone. Because so much less of the parameter budget is consumed by the vocabulary, a CamemBERT base model can match or beat much larger multilingual models on French while carrying a fraction of their total parameters.
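A back-of-the-envelope calculation makes the vocabulary argument concrete. The sketch below counts only token embeddings and the transformer layers of a base-size encoder. The feed-forward width of 3,072 is the usual BERT-base value and is an assumption here, as is the simplification of ignoring position embeddings and the language-modeling head, so the totals are rough estimates rather than official parameter counts.

```python
HIDDEN = 768   # hidden size of the base configuration
LAYERS = 12    # transformer layers in the base configuration
FFN = 3_072    # feed-forward width (assumed: standard BERT-base value)


def embedding_params(vocab_size, hidden=HIDDEN):
    """Parameters in the token-embedding matrix (one row per vocabulary entry)."""
    return vocab_size * hidden


def encoder_layer_params(hidden=HIDDEN, ffn=FFN):
    """Parameters in one transformer layer: attention, feed-forward, layer norms."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V, output projections + biases
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)              # two norms, each with scale and bias
    return attention + feed_forward + layer_norms


def rough_total(vocab_size):
    """Approximate size of a base encoder with the given vocabulary."""
    return embedding_params(vocab_size) + LAYERS * encoder_layer_params()


french_only = rough_total(32_000)     # 109,630,464, close to the ~110M quoted
multilingual = rough_total(250_000)   # 277,054,464, close to the ~270M quoted
```

With identical transformer layers, the 250,000-entry vocabulary alone accounts for about 192 million embedding parameters against roughly 24.6 million for the 32,000-entry French vocabulary, which is nearly the whole gap between the two model sizes.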
The original release shipped several checkpoints that vary along three axes: model size (base or large), the source corpus (OSCAR, CCNet, or Wikipedia), and the amount of data (the full corpus or a 4-gigabyte subset). The smaller-data checkpoints exist mainly to support the paper's analysis of how corpus size and source affect quality, and they remain useful for low-resource experimentation. The principal variants are summarized below.
| Model | Parameters | Architecture | Pretraining corpus |
|---|---|---|---|
| camembert-base | 110M | RoBERTa base (12 layers) | OSCAR, 138 GB |
| camembert-large | 335M | RoBERTa large (24 layers) | CCNet, 135 GB |
| camembert-base-ccnet | 110M | RoBERTa base | CCNet, 135 GB |
| camembert-base-wikipedia-4gb | 110M | RoBERTa base | French Wikipedia, 4 GB |
| camembert-base-oscar-4gb | 110M | RoBERTa base | OSCAR subset, 4 GB |
| camembert-base-ccnet-4gb | 110M | RoBERTa base | CCNet subset, 4 GB |
Beyond these original RoBERTa-style checkpoints, the same research group released a distinct architectural successor in 2023 called CamemBERTa, introduced in the paper "Data-Efficient French Language Modeling with CamemBERTa" by Wissam Antoun, Benoît Sagot, and Djamé Seddah at Findings of ACL 2023. CamemBERTa is not a RoBERTa model: it is built on DeBERTaV3 and trained with the Replaced Token Detection (RTD) objective borrowed from ELECTRA, in which a small generator network proposes replacement tokens and the main discriminator network learns to flag which tokens were swapped, using gradient-disentangled embedding sharing to keep the two networks' updates separate. The headline result is data efficiency: CamemBERTa, with about 110 million parameters and trained on the French CCNet data, matched or exceeded the original CamemBERT on most downstream tasks while seeing only around 30 percent as many training tokens. It was notable at the time as the first publicly available DeBERTaV3-style model outside the original DeBERTa work.
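The data flow of Replaced Token Detection can be illustrated with a toy example. The sketch below builds the corrupted input and the binary labels the discriminator would be trained on. It is not CamemBERTa's training code: the function names are hypothetical, the generator is replaced by an arbitrary callable, and the 15 percent replacement rate is an assumption for illustration.

```python
import random


def rtd_example(tokens, generator_sample, replace_prob=0.15, rng=None):
    """Build one Replaced Token Detection training example.

    generator_sample(position, tokens) stands in for the small generator
    network and returns a proposed token for that position.
    Returns (corrupted_tokens, labels) with label 1 = replaced, 0 = original.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for position, original in enumerate(tokens):
        if rng.random() < replace_prob:
            proposal = generator_sample(position, tokens)
        else:
            proposal = original
        corrupted.append(proposal)
        # If the generator happens to propose the original token,
        # the position still counts as "original" for the discriminator.
        labels.append(int(proposal != original))
    return corrupted, labels


# Toy generator: proposes a random token from a tiny hypothetical vocabulary.
toy_vocab = ["le", "fromage", "est", "bon", "chat"]
toy_rng = random.Random(0)
corrupted, labels = rtd_example(
    ["le", "fromage", "est", "bon"],
    generator_sample=lambda position, tokens: toy_rng.choice(toy_vocab),
    rng=random.Random(1),
)
```

Because every position carries a label, the discriminator receives a training signal from the whole sequence rather than only from the masked positions, which is one common explanation for the data efficiency reported above.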
In November 2024, the team released a refreshed generation under the banner "CamemBERT 2.0," described in a paper by Wissam Antoun, Francis Kulumba, Rian Touchent, and colleagues. This release pairs two updated checkpoints, CamemBERTv2 (RoBERTa, masked language modeling) and CamemBERTav2 (DeBERTaV3, Replaced Token Detection), both trained on a substantially larger and more recent French corpus with a longer context window and an updated tokenizer. The motivation was temporal concept drift: the original 2019 models were trained on older text and had begun to show their age on contemporary French, so the 2.0 models were designed as more current drop-in replacements.
The original paper evaluated CamemBERT on four French tasks and reported gains across all of them. For part-of-speech tagging and dependency parsing, the model was tested on four treebanks from the Universal Dependencies project: GSD, Sequoia, Spoken, and ParTUT. Fine-tuned CamemBERT reached state-of-the-art scores on every treebank and metric, beating both multilingual models such as mBERT and earlier task-specific systems such as UDify, in some cases by wide margins (for example, more than 4 points of labeled attachment score on Sequoia and over 5 points on ParTUT relative to UDify). The model performed comparably whether it was fine-tuned end to end or used as frozen input embeddings to a downstream parser, which the authors took as evidence that elaborate task-specific architectures are often unnecessary once a strong encoder is available.
The following table reports CamemBERT's fine-tuned scores against mBERT on the four treebanks, using UPOS accuracy for tagging and labeled attachment score (LAS) for parsing.
| Treebank | mBERT UPOS | mBERT LAS | CamemBERT UPOS | CamemBERT LAS |
|---|---|---|---|---|
| GSD | 97.48 | 89.73 | 98.18 | 92.57 |
| Sequoia | 98.41 | 91.24 | 99.29 | 94.20 |
| Spoken | 96.02 | 78.63 | 96.99 | 81.37 |
| ParTUT | 97.35 | 91.37 | 97.65 | 93.43 |
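For readers unfamiliar with the two metrics in the table, the sketch below shows how they are computed on a single toy sentence. It is a simplified illustration, not the official evaluation script: the function names and the three-word annotation are made up for the example, and real evaluations aggregate over a whole treebank and handle tokenization mismatches.

```python
def upos_accuracy(gold_tags, predicted_tags):
    """Percentage of tokens whose predicted universal POS tag matches the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    hits = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * hits / len(gold_tags)


def labeled_attachment_score(gold_arcs, predicted_arcs):
    """Percentage of tokens with both the correct head and the correct relation.

    Each arc is a (head_index, relation) pair; head 0 denotes the root.
    """
    if len(gold_arcs) != len(predicted_arcs):
        raise ValueError("gold and predicted sequences must have the same length")
    hits = sum(g == p for g, p in zip(gold_arcs, predicted_arcs))
    return 100.0 * hits / len(gold_arcs)


# Toy sentence "Le chat dort" with a hypothetical parser output.
gold_tags = ["DET", "NOUN", "VERB"]
pred_tags = ["DET", "NOUN", "VERB"]
gold_arcs = [(2, "det"), (3, "nsubj"), (0, "root")]
pred_arcs = [(2, "det"), (3, "obj"), (0, "root")]  # right head, wrong label on "chat"

upos = upos_accuracy(gold_tags, pred_tags)
las = labeled_attachment_score(gold_arcs, pred_arcs)
```

In this example all three tags are correct, while one of the three arcs has the right head but the wrong relation label, so it counts as an error for LAS.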
For named entity recognition, evaluated on the French Treebank (FTB), CamemBERT reached an F1 of 89.08 when fine-tuned and 89.55 when used as embeddings inside an LSTM-CRF tagger. Both figures improve on fine-tuned mBERT at 87.35 and on the prior best non-neural and neural CRF systems, which scored about 85.0 and 85.6 respectively.
On natural language inference, measured on the French portion of the XNLI benchmark, CamemBERT base reached 82.5 percent accuracy. That result beat mBERT at 76.9 percent by 5.6 points, the translation-augmented XLM model at 80.2 percent, and XLM-RoBERTa base at 80.1 percent, while using fewer parameters than any of those multilingual baselines and well under half as many as XLM-RoBERTa base (110 million against 270 million). The larger CamemBERT-large model reached 85.7 percent on French XNLI, edging past XLM-RoBERTa large at 85.2 percent even though XLM-RoBERTa large carries roughly 550 million parameters against CamemBERT-large's 335 million. The XNLI comparison is summarized below.
| Model | French XNLI accuracy | Parameters |
|---|---|---|
| mBERT | 76.9 | ~175M |
| XLM (MLM+TLM) | 80.2 | ~250M |
| XLM-RoBERTa base | 80.1 | ~270M |
| CamemBERT base | 82.5 | ~110M |
| XLM-RoBERTa large | 85.2 | ~550M |
| CamemBERT large | 85.7 | ~335M |
The authors attributed CamemBERT's edge over XLM-RoBERTa partly to data: although XLM-RoBERTa was trained on an enormous 2.5 terabytes of text, only about 57 gigabytes of that was French, whereas CamemBERT trained on 138 gigabytes of French. They also noted that when XLM-RoBERTa was restricted to a handful of languages rather than a hundred, its French accuracy rose markedly, reinforcing the general lesson that concentrating capacity on the target language helps.
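CamemBERT is a general-purpose French encoder, and in practice it is almost always fine-tuned for a specific task rather than used directly. Its common applications mirror those of BERT and RoBERTa in English, and include the tasks on which it was originally evaluated: part-of-speech tagging, dependency parsing, named entity recognition, and natural language inference.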
The model integrates cleanly with the Hugging Face Transformers library, exposing the usual task heads for masked language modeling, sequence classification, token classification, multiple choice, and question answering. It can be loaded in a couple of lines of code and run a fill-mask pipeline out of the box, which makes it a low-friction starting point for French NLP projects. Its small size relative to large multilingual models also makes it comparatively cheap to fine-tune and to deploy, an advantage for teams without large GPU budgets.
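As an illustration of the fill-mask usage mentioned above, the following sketch wraps the Transformers `pipeline` helper in a small function. It assumes the `transformers` package (with a backend such as PyTorch, plus `sentencepiece`) is installed and that the checkpoint can be downloaded. The `___` placeholder convention and the helper names are inventions of this example, not part of the library.

```python
BLANK = "___"  # placeholder convention used only in this example


def insert_mask(sentence, mask_token):
    """Replace the example's blank placeholder with the model's mask token."""
    return sentence.replace(BLANK, mask_token)


def top_predictions(sentence, model_name="almanach/camembert-base", top_k=5):
    """Return (token, score) pairs for the blank in a French sentence.

    The checkpoint is downloaded on first use.
    """
    # Imported lazily so that defining the function does not require the library.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_name)
    # Ask the tokenizer for its mask token instead of hard-coding the string.
    text = insert_mask(sentence, unmasker.tokenizer.mask_token)
    results = unmasker(text, top_k=top_k)
    return [(result["token_str"], result["score"]) for result in results]
```

A call such as `top_predictions("Le camembert est ___ !")` returns the highest-scoring candidate words for the blank. The exact candidates and scores depend on the checkpoint and library version.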
CamemBERT is not a generative model. As a masked-language-model encoder, it is built for understanding tasks that read an entire input to make a decision, and it cannot produce fluent free-form text the way an autoregressive system can. Using it for open-ended generation is a misuse of the architecture.
The original checkpoints were trained on web text crawled before 2019, which carries two practical consequences. First, the model knows nothing about events, terminology, or named entities that emerged afterward. Second, because language and usage shift over time, the model's representations gradually drift out of step with contemporary French, a phenomenon the CamemBERT 2.0 release explicitly set out to address. The training text was also case-sensitive and drawn largely from well-capitalized written prose, which can hurt performance on inputs that depart from those conventions, such as transcribed speech where words are not capitalized. The paper observed exactly this effect on the Spoken treebank.
Because OSCAR is an unfiltered crawl of the open web, CamemBERT can absorb and reproduce social biases, stereotypes, and other undesirable content present in its training data, and those biases can carry through into fine-tuned systems. As with other web-pretrained models, predictions should be audited before use in sensitive applications. Finally, the model is monolingual by design: it handles French and is not intended for other languages or for cross-lingual transfer, which remain the province of multilingual models such as XLM-RoBERTa.
CamemBERT belongs to a wave of monolingual BERT-style models that followed the 2018 release of BERT and the 2019 release of RoBERTa, as research groups around the world built dedicated encoders for languages other than English. For French specifically, CamemBERT was developed essentially in parallel with FlauBERT, a second French model introduced around the same time by a separate group at the LIG laboratory in Grenoble and trained on a heterogeneous mixture of French corpora. The two models were released within weeks of each other in late 2019, were frequently benchmarked against each other, and together established strong French baselines; the FLUE evaluation suite that accompanied FlauBERT became a common testbed for both.
The model is best understood in contrast to the multilingual encoders it was designed to beat. Multilingual BERT (mBERT) trained a single BERT on Wikipedia text in over a hundred languages, and XLM-RoBERTa scaled that idea to 2.5 terabytes of filtered Common Crawl text across a hundred languages using the RoBERTa recipe. Both made French usable, but both also illustrated the trade-off CamemBERT exploited: spreading a fixed model across many languages dilutes per-language capacity and forces a very large shared vocabulary, leaving a well-resourced monolingual model room to win on its single language with far fewer parameters.
The CamemBERT line has continued to evolve. CamemBERTa in 2023 replaced the RoBERTa backbone with a DeBERTaV3 architecture and the Replaced Token Detection objective, achieving comparable quality with far less training data. The CamemBERT 2.0 release in 2024 then refreshed both the RoBERTa and DeBERTaV3 variants on newer and larger French text to counter temporal drift. Across these generations, the project has remained a flagship example of French open-source NLP from Inria, and the original camembert-base checkpoint remains one of the most heavily used French models on Hugging Face, valued for being compact, well documented, openly licensed, and reliable across a broad range of French language tasks.