BELEBELE
Last reviewed
Jun 8, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,557 words
Belebele is a multiple-choice machine reading comprehension (MRC) AI benchmark that is fully parallel across 122 language variants, meaning the same questions, passages, and answer choices are translated into every language so that model performance can be compared directly across languages. It was created by researchers at Meta AI and its Fundamental AI Research (FAIR) division and is built on passages drawn from the FLORES-200 translation dataset [1][2]. Because it covers high-, medium-, and low-resource languages with a single, perfectly aligned test set, Belebele is widely regarded as the de facto massively multilingual reading-comprehension evaluation and serves as a natural language understanding (NLU) complement to the FLORES machine-translation benchmark [1].
The benchmark was introduced in the paper "The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants" by Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. It first appeared on arXiv on August 31, 2023, and was published in the Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), pages 749 to 775 [1][2]. The dataset and evaluation code are released publicly by Meta under a CC-BY-SA 4.0 license [3]. The name "Belebele" is a Bambara word meaning, loosely, "big, large, fat, or great" [3].
Belebele is a discriminative reading-comprehension test rather than a generative or open-ended one. Each item presents a short reading passage, a single question about that passage, and four candidate answers, exactly one of which is correct [2][3]. A model (or human) must select the correct option, so the task reduces to four-way classification and is scored simply by accuracy, with random guessing yielding 25 percent. The questions were deliberately authored to require genuine comprehension of the passage rather than surface pattern matching or memorized world knowledge, and they were curated to discriminate between models with different levels of generalizable language understanding [1][2].
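The item format and scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation code: the `Item` class and its field names are hypothetical, chosen only to mirror the passage, question, and four-choice structure.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Item:
    """One multiple-choice item (hypothetical field names, not the dataset schema)."""
    passage: str
    question: str
    choices: Tuple[str, str, str, str]  # exactly four candidate answers
    correct: int                        # index 0..3 of the single correct choice


def accuracy(items: Sequence[Item], predictions: Sequence[int]) -> float:
    """Fraction of items whose predicted choice index matches the label."""
    if len(items) != len(predictions):
        raise ValueError("one prediction is required per item")
    if not items:
        raise ValueError("cannot score an empty item list")
    hits = sum(item.correct == pred for item, pred in zip(items, predictions))
    return hits / len(items)


# Expected accuracy of uniform random guessing over four choices.
CHANCE_ACCURACY = 1 / 4
```

Because every item has exactly four options and one label, any score is directly comparable to the 25 percent chance level.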
The defining property of Belebele is that it is fully parallel. The 900 unique question-and-passage instances are the same in every one of the 122 variants, professionally translated so that item N tests the same comprehension skill regardless of language. This design isolates language as the variable: a drop in accuracy from English to a low-resource language reflects the model's competence in that language rather than any difference in question difficulty [1][3]. For many of the languages it includes, Belebele was the first NLU benchmark of any kind to become available [1].
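One way to exploit the parallel design is to compare each language against English on exactly the same items. The sketch below assumes per-item correctness flags are already available for each language in the same item order. The function name is hypothetical, and the FLORES-style language codes (such as `eng_Latn`) are used here only as illustrative dictionary keys.

```python
from typing import Dict, Mapping, Sequence


def accuracy_gap_vs_reference(
    correct_by_language: Mapping[str, Sequence[bool]],
    reference: str = "eng_Latn",  # assumed key for English
) -> Dict[str, float]:
    """Return reference accuracy minus each language's accuracy.

    Every language must score the same aligned items, which is what a
    fully parallel benchmark provides.
    """
    ref_flags = correct_by_language[reference]
    n = len(ref_flags)
    if n == 0:
        raise ValueError("reference language has no items")
    ref_acc = sum(ref_flags) / n

    gaps: Dict[str, float] = {}
    for lang, flags in correct_by_language.items():
        if len(flags) != n:
            raise ValueError(f"{lang} is not aligned with {reference}")
        gaps[lang] = ref_acc - sum(flags) / n
    return gaps
```

A positive gap means the model answers fewer of the same questions correctly in that language than in English. Since the items are identical, the difference cannot be attributed to question difficulty.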
Belebele consists of 900 unique multiple-choice questions, each rendered in all 122 language variants, for a total of 900 times 122, or 109,800 individual labeled items [2][3]. The questions are tied to 488 distinct passages taken from the FLORES-200 dataset; most passages carry two associated questions and some carry only one [2][3]. None of the underlying passages belong to the hidden FLORES test split, which keeps the source text usable as an open benchmark [3].
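The counts above can be checked with simple arithmetic. The split between one-question and two-question passages is derived here, not quoted from the sources, and it assumes every passage carries exactly one or two questions, as the text states.

```python
UNIQUE_QUESTIONS = 900
LANGUAGE_VARIANTS = 122
PASSAGES = 488

# Every question exists in every variant.
total_items = UNIQUE_QUESTIONS * LANGUAGE_VARIANTS

# With one or two questions per passage:
#   two_q + one_q     = PASSAGES
#   2 * two_q + one_q = UNIQUE_QUESTIONS
# Subtracting the first equation from the second gives two_q.
two_q = UNIQUE_QUESTIONS - PASSAGES  # passages with two questions
one_q = PASSAGES - two_q             # passages with one question

print(total_items, two_q, one_q)  # prints: 109800 412 76
```

Under that assumption, 412 passages would carry two questions and 76 would carry one, which is consistent with "most passages carry two."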
The 122 language variants span 115 distinct languages drawn from 27 language families and written in 29 different scripts, ranging from very high-resource languages such as English, Spanish, French, and Chinese to low-resource languages such as Yoruba [3]. Seven languages appear in two separate scripts, which accounts for the difference between 122 variants and 115 languages. Among these, Belebele provides one of the first NLP benchmarks for the romanized variants of Hindi, Urdu, Bengali, Nepali, and Sinhala, in addition to their native-script forms [2].
| Attribute | Value |
|---|---|
| Unique questions | 900 |
| Language variants | 122 |
| Total items (900 x 122) | 109,800 |
| Distinct passages (from FLORES-200) | 488 |
| Answer choices per question | 4 (one correct) |
| Distinct languages | 115 |
| Language families | 27 |
| Scripts | 29 |
| Random-guess baseline | 25% |
| Human accuracy (English) | 97.6% |
Belebele was built entirely through human expertise, with no machine translation used at any stage; all translations were produced by experts fluent in both English and the target language [2]. The construction proceeded in two phases. First, the passages, questions, and answers were written and quality-controlled in English in collaboration with a Language Service Provider over five iterations, with feedback exchanged in each round [2]. The authors applied both manual inspection and programmatic statistical checks, using low-level textual features to detect questions that were too easy or guessable; roughly 20 percent of candidate questions were filtered out in the final iteration [2].
In the second phase, the finished English questions and answers were professionally translated into the remaining 121 variants, and each translation was proofread and edited by an additional annotator to maintain quality and parallelism [2][3]. The English subset alone proves difficult enough to challenge state-of-the-art language models, which the authors used as evidence that the items measure real comprehension rather than artifacts [1]. To establish a human ceiling, four of the authors each answered about 120 English questions, achieving a mean accuracy of 97.6 percent (with a 95 percent confidence interval of 93.1 to 99.5 percent) [2].
To support fine-tuning experiments, the authors also assembled a separate English training set drawn from existing reading-comprehension corpora, including RACE, SciQ, MultiRC, MCTest, MCScript2.0, and ReClor; this training set carries its own non-commercial license and is kept distinct from the evaluation data [3].
Belebele is used to evaluate the multilingual capabilities of large language models, multilingual masked language models, and machine-translation or representation systems. Because the task is multiple choice, it can be run in several modes, including fine-tuning (for example with translate-train-all), zero-shot prompting, and few-shot in-context learning [2][3].
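As an illustration of prompting-based evaluation, the sketch below builds a zero-shot or few-shot prompt from an item and parses a letter answer. The prompt template, function names, and parsing rule are assumptions made for this example. They are not the template used in the paper or in any official evaluation code.

```python
from typing import Optional, Sequence, Tuple

LETTERS = "ABCD"

# (passage, question, choices, answer letter): a solved demonstration.
Shot = Tuple[str, str, Sequence[str], str]


def format_item(passage: str, question: str, choices: Sequence[str],
                answer: Optional[str] = None) -> str:
    """Render one item; leave the answer slot open unless a letter is given."""
    if len(choices) != 4:
        raise ValueError("each item needs exactly four choices")
    lines = [f"Passage: {passage}", f"Question: {question}"]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_prompt(target: Tuple[str, str, Sequence[str]],
                 shots: Sequence[Shot] = ()) -> str:
    """Zero-shot when `shots` is empty; k-shot when it holds k solved items."""
    blocks = [format_item(*shot) for shot in shots]
    blocks.append(format_item(*target))
    return "\n\n".join(blocks)


def parse_choice(model_output: str) -> Optional[int]:
    """Accept only a bare leading letter A-D; return its index, else None."""
    text = model_output.strip().upper()
    if not text or text[0] not in LETTERS:
        return None
    if len(text) > 1 and text[1].isalpha():
        return None  # starts with a word such as "Answer", not a bare letter
    return LETTERS.index(text[0])
```

Treating unparseable output as `None` (and therefore incorrect) is one possible convention. A different harness might instead compare the model's likelihood of each option.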
The original paper reported full results across all 122 languages for several systems, including Llama 1 and 2, Falcon, and GPT-3.5-turbo, alongside the multilingual encoders XLM-R, InfoXLM, and XLM-V [1][2]. In the best configurations reported, the fine-tuned encoder XLM-V with translate-train-all reached about 60.2 percent average accuracy (above 50 percent on 76.2 percent of languages); GPT-3.5-turbo reached about 51.1 percent average in zero-shot; and Llama 2 70B reached about 48.0 percent average with five-shot in-context learning [2]. All of these sit far below the roughly 97.6 percent human ceiling, underscoring how hard truly multilingual comprehension remains.
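Summary figures of this kind (an average accuracy plus the share of languages above 50 percent) can be computed from per-language accuracies as sketched below. The function name is hypothetical, and the sketch assumes the average is an unweighted mean over languages. Since every language has the same 900 items, that mean equals the item-weighted one.

```python
from typing import Mapping, Tuple


def summarize(accuracy_by_language: Mapping[str, float],
              threshold: float = 0.5) -> Tuple[float, float]:
    """Return (mean accuracy over languages, share of languages above threshold)."""
    if not accuracy_by_language:
        raise ValueError("no languages to summarize")
    scores = list(accuracy_by_language.values())
    mean_accuracy = sum(scores) / len(scores)
    share_above = sum(score > threshold for score in scores) / len(scores)
    return mean_accuracy, share_above
```

Reporting both numbers is useful because a high average can hide a long tail of languages near the 25 percent chance level.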
Two findings stand out. First, there are large gaps between high-resource and low-resource languages: English comprehension is consistently strong while accuracy on many low-resource languages falls close to the random-guess floor [1]. Second, despite the strong cross-lingual transfer shown by English-centric LLMs, much smaller masked language models pretrained on balanced multilingual data understood far more languages, and the authors found that larger and more carefully constructed vocabularies correlated with better low-resource performance [1]. Since its release, Belebele has been widely adopted in 2024 and 2025 evaluation studies as a standard probe of multilingual understanding for models such as Llama 3 and 3.1, Qwen 2, and Aya Expanse [2]. A follow-up dataset, 2M-Belebele, extends the questions to highly multilingual speech and American Sign Language comprehension [2].
Belebele is closely tied to the FLORES-200 evaluation suite: its passages are sampled directly from FLORES-200, which itself was produced as part of Meta's No Language Left Behind (NLLB) machine-translation project [1][2]. In effect, Belebele turns the translation-oriented FLORES corpus into a comprehension benchmark, pairing the FLORES test of whether a model can translate a sentence with a test of whether a model can understand a passage and reason about it. Because the two share source text and language coverage, they are frequently reported together as complementary measures of multilingual translation and understanding [1].
Belebele was explicitly designed to expand the language coverage of NLU evaluation far beyond earlier cross-lingual benchmarks. The authors note that prior multilingual reading-comprehension and inference datasets, including XNLI, XQuAD, MLQA, and XL-Sum, together cover fewer than 30 languages, whereas Belebele covers 122 variants in a single parallel set [1][2]. Unlike extractive question-answering benchmarks such as XQuAD, TyDiQA, or MLQA, which require selecting a span of text and are harder to evaluate consistently across scripts and tokenizers, Belebele's multiple-choice format makes scoring uniform and language-agnostic. The paper reports that Belebele results correlate very strongly with XNLI performance (a correlation of about 0.85), supporting its validity as a measure of general multilingual competence while greatly broadening the set of languages that can be assessed [2].
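A correlation of this kind can be computed from paired scores on the two benchmarks. The sketch below assumes the Pearson coefficient. The sources cited here do not specify which coefficient or which set of scores the paper used, so this illustrates the general calculation only.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` and `ys` would hold, for example, Belebele and XNLI accuracies for the same models or languages, paired position by position.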