BELEBELE

Updated Jun 8, 2026

Overview

Belebele is a multiple-choice machine reading comprehension (MRC) benchmark that is fully parallel across 122 language variants: the same passages, questions, and answer choices are translated into every language, so model performance can be compared directly across languages. It was created by researchers at Meta AI and its Fundamental AI Research (FAIR) division and is built on passages drawn from the FLORES-200 translation dataset ^[1]^[2]. Because it covers high-, medium-, and low-resource languages with a single aligned test set, Belebele is widely used as a massively multilingual reading-comprehension evaluation and serves as a natural language understanding (NLU) complement to the FLORES machine-translation benchmark ^[1].

The benchmark was introduced in the paper "The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants" by Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. It first appeared on arXiv on August 31, 2023, and was published in the Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), pages 749 to 775 ^[1]^[2]. The dataset and evaluation code are released publicly by Meta under a CC-BY-SA 4.0 license ^[3]. The name "Belebele" is a Bambara word meaning, loosely, "big, large, fat, or great" ^[3].

What Belebele is

Belebele is a discriminative reading-comprehension test rather than a generative or open-ended one. Each item presents a short reading passage, a single question about that passage, and four candidate answers, exactly one of which is correct ^[2]^[3]. A model (or human) must select the correct option, so the task reduces to four-way classification and is scored simply by accuracy, with random guessing yielding 25 percent. The questions were deliberately authored to require genuine comprehension of the passage rather than surface pattern matching or memorized world knowledge, and they were curated to discriminate between models with different levels of generalizable language understanding ^[1]^[2].
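
The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the official evaluation code: the function name `accuracy`, the 1-to-4 option numbering, and the toy labels are all assumptions made for the example.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of items where the predicted option matches the gold option."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set of items")
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Hypothetical toy data: option numbers (1-4) for eight items.
gold = [1, 3, 2, 4, 4, 1, 2, 3]
preds = [1, 3, 2, 1, 4, 1, 3, 3]

# With four options and one correct answer, uniform guessing expects 1/4.
CHANCE = 1 / 4

print(accuracy(preds, gold))  # 6 of 8 correct -> 0.75
print(CHANCE)                 # 0.25
```

Any reported score is best read against the 0.25 chance floor rather than against zero.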

The defining property of Belebele is that it is fully parallel. The 900 unique question-and-passage instances are the same in every one of the 122 variants, professionally translated so that item N tests the same comprehension skill regardless of language. This design isolates language as the variable: a drop in accuracy from English to a low-resource language reflects the model's competence in that language rather than any difference in question difficulty ^[1]^[3]. For many of the languages it includes, Belebele was the first NLU benchmark of any kind to become available ^[1].
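
Because item N is the same question in every variant, predictions in two languages can be compared item by item instead of only through aggregate accuracy. The sketch below assumes predictions are already aligned by item index; the function name `paired_breakdown` and the toy data are hypothetical.

```python
from typing import Dict, Sequence


def paired_breakdown(
    gold: Sequence[int], preds_a: Sequence[int], preds_b: Sequence[int]
) -> Dict[str, int]:
    """Count aligned items by which of two languages was answered correctly."""
    if not (len(gold) == len(preds_a) == len(preds_b)):
        raise ValueError("all sequences must cover the same aligned items")
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for g, a, b in zip(gold, preds_a, preds_b):
        if a == g and b == g:
            counts["both"] += 1
        elif a == g:
            counts["only_a"] += 1
        elif b == g:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


# Hypothetical toy data: the same six items answered in two language variants.
gold = [2, 1, 4, 3, 1, 2]
preds_lang_a = [2, 1, 4, 3, 1, 1]
preds_lang_b = [2, 3, 4, 1, 1, 1]

print(paired_breakdown(gold, preds_lang_a, preds_lang_b))
# {'both': 3, 'only_a': 2, 'only_b': 0, 'neither': 1}
```

The `only_a` bucket is the interesting one here: those are items the model demonstrably can answer, but fails on once the language changes.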

Structure: 122 languages and parallel design

Belebele consists of 900 unique multiple-choice questions, each rendered in all 122 language variants, for a total of 900 times 122, or 109,800 individual labeled items ^[2]^[3]. The questions are tied to 488 distinct passages taken from the FLORES-200 dataset; most passages carry two associated questions and some carry only one ^[2]^[3]. None of the underlying passages belong to the hidden FLORES test split, which keeps the source text usable as an open benchmark ^[3].
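
The counts above fit together arithmetically. The short check below reproduces the total and, assuming every passage carries exactly one or two questions as stated, derives how many passages fall into each group; the variable names are chosen for the example.

```python
QUESTIONS = 900   # unique questions
VARIANTS = 122    # language variants
PASSAGES = 488    # distinct FLORES-200 passages

# Every question exists in every variant.
total_items = QUESTIONS * VARIANTS

# Let `one` and `two` be the number of passages with one and two questions:
#   one + two     = PASSAGES
#   one + 2 * two = QUESTIONS
# Subtracting the first equation from the second gives `two`.
two_question_passages = QUESTIONS - PASSAGES
one_question_passages = PASSAGES - two_question_passages

print(total_items)            # 109800
print(two_question_passages)  # 412
print(one_question_passages)  # 76
```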

The 122 language variants span 115 distinct languages drawn from 27 language families and written in 29 different scripts, ranging from very high-resource languages such as English, Spanish, French, and Chinese to low-resource languages such as Yoruba ^[3]. Seven languages appear in two separate scripts. As a result, Belebele provides one of the first NLP benchmarks for the romanized variants of Hindi, Urdu, Bengali, Nepali, and Sinhala, alongside their native-script forms ^[2].

Attribute	Value
Unique questions	900
Language variants	122
Total items (900 x 122)	109,800
Distinct passages (from FLORES-200)	488
Answer choices per question	4 (one correct)
Distinct languages	115
Language families	27
Scripts	29
Random-guess baseline	25%
Human accuracy (English)	97.6%

Construction

Belebele was built entirely through human expertise, with no machine translation used at any stage; all translations were produced by experts fluent in both English and the target language ^[2]. The construction proceeded in two phases. First, the questions and answer choices for the FLORES-200 passages were written and quality-controlled in English in collaboration with a Language Service Provider over five iterations, with feedback exchanged in each round ^[2]. The authors applied both manual inspection and programmatic statistical checks, using low-level textual features to detect questions that were too easy or guessable; roughly 20 percent of candidate questions were filtered out in the final iteration ^[2].
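
To give a sense of what a programmatic guessability check can look like, the sketch below measures how often a passage-blind heuristic ("always pick the longest option") hits the correct answer. This is an illustrative stand-in, not the authors' actual feature set, which is not specified here; the item layout (`options`, `answer`) and the toy items are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def pick_longest(options: Sequence[str]) -> int:
    """Return the 1-based index of the longest option (first one on ties)."""
    lengths = [len(option) for option in options]
    return lengths.index(max(lengths)) + 1


def heuristic_accuracy(
    items: List[Dict], heuristic: Callable[[Sequence[str]], int]
) -> float:
    """Accuracy of a heuristic that never reads the passage or question."""
    hits = sum(heuristic(item["options"]) == item["answer"] for item in items)
    return hits / len(items)


# Hypothetical toy items; `answer` is the 1-based index of the correct option.
items = [
    {"options": ["red", "a deep shade of blue", "green", "pink"], "answer": 2},
    {"options": ["in 1990", "in 1985", "shortly after the war ended", "never"], "answer": 1},
    {"options": ["Paris", "Rome", "the capital city of Peru", "Oslo"], "answer": 3},
    {"options": ["yes", "no", "maybe", "only on weekends"], "answer": 4},
]

print(heuristic_accuracy(items, pick_longest))  # 3 of 4 -> 0.75
```

A heuristic that scores well above the 25 percent chance level on a candidate set suggests the answer options leak the label, which is the kind of signal that would justify rewriting or dropping questions.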

In the second phase, the finished English questions and answers were professionally translated into the remaining 121 variants, and each translation was proofread and edited by an additional annotator to maintain quality and parallelism ^[2]^[3]. The English subset alone proves difficult enough to challenge state-of-the-art language models, which the authors used as evidence that the items measure real comprehension rather than artifacts ^[1]. To establish a human ceiling, four of the authors each answered about 120 English questions, achieving a mean accuracy of 97.6 percent (with a 95 percent confidence interval of 93.1 to 99.5 percent) ^[2].

To support fine-tuning experiments, the authors also assembled a separate English training set drawn from existing reading-comprehension corpora, including RACE, SciQ, MultiRC, MCTest, MCScript2.0, and ReClor; this training set carries its own non-commercial license and is kept distinct from the evaluation data ^[3].

Use and findings

Belebele is used to evaluate the multilingual capabilities of large language models, multilingual masked language models, and machine-translation or representation systems. Because the task is multiple choice, it can be run in several modes ^[2]^[3]:

Zero-shot: the model is given the passage, question, and options with natural-language instructions, in English or translated into the target language.
Few-shot in-context learning: typically five-shot, using examples drawn from the English training set.
Full fine-tuning: a model is fine-tuned on English data and evaluated zero-shot across languages, or trained on machine-translated data in a "translate-train-all" setting.
Translate-test: the target-language passage and question are machine-translated back into English before the model answers.
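
The zero-shot and few-shot modes differ mainly in how the prompt is assembled. The sketch below shows one plausible layout and a simple answer parser; the template wording, the A-D lettering, and the dictionary keys (`passage`, `question`, `options`, `answer`) are assumptions for illustration and are not taken from the paper or the released code.

```python
from typing import Dict, List, Optional, Sequence

LETTERS = "ABCD"


def format_item(
    passage: str, question: str, options: Sequence[str], answer: Optional[int] = None
) -> str:
    """Render one item; include the gold letter only for in-context examples."""
    lines = [f"Passage: {passage}", f"Question: {question}"]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    lines.append("Answer:" if answer is None else f"Answer: {LETTERS[answer - 1]}")
    return "\n".join(lines)


def build_prompt(instruction: str, shots: List[Dict], target: Dict) -> str:
    """Zero-shot when `shots` is empty; few-shot when it holds solved examples."""
    blocks = [instruction]
    blocks += [format_item(**shot) for shot in shots]
    blocks.append(format_item(**target))
    return "\n\n".join(blocks)


def parse_choice(completion: str) -> Optional[int]:
    """Map a completion that starts with A-D to an option number 1-4."""
    text = completion.strip().upper()
    if text and text[0] in LETTERS:
        return LETTERS.index(text[0]) + 1
    return None


# Hypothetical items for illustration only.
shot = {
    "passage": "The market opens at dawn.",
    "question": "When does the market open?",
    "options": ["At noon", "At dawn", "At dusk", "At midnight"],
    "answer": 2,
}
target = {
    "passage": "The river floods every spring.",
    "question": "When does the river flood?",
    "options": ["Spring", "Summer", "Autumn", "Winter"],
}

prompt = build_prompt("Answer with a single letter.", [shot], target)
print(prompt)
print(parse_choice(" a"))  # 1
```

For translate-test, the same functions would apply after the passage, question, and options have been machine-translated into English by a separate system.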

The original paper reported full results across all 122 languages for several systems, including Llama 1 and 2, Falcon, and GPT-3.5-turbo, alongside the multilingual encoders XLM-R, InfoXLM, and XLM-V ^[1]^[2]. In the best configurations reported, the fine-tuned encoder XLM-V with translate-train-all reached about 60.2 percent average accuracy (above 50 percent on 76.2 percent of languages); GPT-3.5-turbo reached about 51.1 percent average in zero-shot; and Llama 2 70B reached about 48.0 percent average with five-shot in-context learning ^[2]. All of these sit far below the roughly 97.6 percent human ceiling, underscoring how hard truly multilingual comprehension remains.
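
Summary figures of the kind quoted above (an average accuracy plus the share of languages clearing 50 percent) are straightforward to derive from per-language scores. The sketch below uses made-up scores and placeholder language labels; only the two summary statistics mirror the reporting style in the text.

```python
from typing import Dict


def summarize(per_language_accuracy: Dict[str, float], threshold: float = 0.5) -> Dict[str, float]:
    """Macro-average accuracy and the share of languages strictly above a threshold."""
    if not per_language_accuracy:
        raise ValueError("need at least one language")
    scores = list(per_language_accuracy.values())
    return {
        "mean_accuracy": sum(scores) / len(scores),
        "share_above_threshold": sum(s > threshold for s in scores) / len(scores),
    }


# Hypothetical scores for four placeholder languages (not real results).
scores = {"lang_a": 0.90, "lang_b": 0.60, "lang_c": 0.40, "lang_d": 0.30}

summary = summarize(scores)
print(round(summary["mean_accuracy"], 3))  # 0.55
print(summary["share_above_threshold"])    # 0.5
```

The average here is a macro-average: each language counts equally, which is natural for Belebele because every variant has the same 900 items.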

Two findings stand out. First, there are large gaps between high-resource and low-resource languages: English comprehension is consistently strong while accuracy on many low-resource languages falls close to the random-guess floor ^[1]. Second, despite the strong cross-lingual transfer shown by English-centric LLMs, much smaller masked language models pretrained on balanced multilingual data understood far more languages, and the authors found that larger and more carefully constructed vocabularies correlated with better low-resource performance ^[1]. Since its release, Belebele has been widely adopted in 2024 and 2025 evaluation studies as a standard probe of multilingual understanding for models such as Llama 3 and 3.1, Qwen 2, and Aya Expanse ^[2]. A follow-up dataset, 2M-Belebele, extends the questions to highly multilingual speech and American Sign Language comprehension ^[2].

Relationship to FLORES-200 and other multilingual benchmarks

Belebele is closely tied to the FLORES-200 evaluation suite: its passages are sampled directly from FLORES-200, which itself was produced as part of Meta's No Language Left Behind (NLLB) machine-translation project ^[1]^[2]. In effect, Belebele turns the translation-oriented FLORES corpus into a comprehension benchmark, pairing the FLORES test of whether a model can translate a sentence with a test of whether a model can understand a passage and reason about it. Because the two share source text and language coverage, they are frequently reported together as complementary measures of multilingual translation and understanding ^[1].

Belebele was explicitly designed to expand the language coverage of NLU evaluation far beyond earlier cross-lingual benchmarks. The authors note that prior multilingual reading-comprehension and inference datasets, including XNLI, XQuAD, and MLQA, each cover fewer than 30 languages, whereas Belebele covers 122 variants in a single parallel set ^[1]^[2]. Unlike extractive question-answering benchmarks such as XQuAD, TyDiQA, or MLQA, which require selecting a span of text and are harder to evaluate consistently across scripts and tokenizers, Belebele's multiple-choice format makes scoring uniform and language-agnostic. The paper reports that Belebele results correlate very strongly with XNLI performance (a correlation of about 0.85), supporting its validity as a measure of general multilingual competence while greatly broadening the set of languages that can be assessed ^[2].
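
A correlation between two benchmarks is typically computed over the languages (or models) they share. The sketch below implements the Pearson coefficient directly; whether the paper used Pearson or another coefficient is not stated here, and the score lists are invented for the example.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-language accuracies on two benchmarks, same language order.
benchmark_a = [0.80, 0.65, 0.50, 0.35]
benchmark_b = [0.75, 0.70, 0.45, 0.40]

r = pearson(benchmark_a, benchmark_b)
print(round(r, 2))
```

Only languages present in both benchmarks can enter such a comparison, so the coefficient describes agreement on the overlap, not on all 122 variants.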

References

1. Bandarkar, L., Liang, D., Muller, B., Artetxe, M., Shukla, S. N., Husa, D., Goyal, N., Krishnan, A., Zettlemoyer, L., and Khabsa, M. "The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants." AI at Meta, Research Publications. https://ai.meta.com/research/publications/the-belebele-benchmark-a-parallel-reading-comprehension-dataset-in-122-language-variants/
2. Bandarkar, L., et al. "The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants." arXiv:2308.16884; published in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024), pp. 749-775. https://arxiv.org/abs/2308.16884 and https://aclanthology.org/2024.acl-long.44/
3. facebookresearch/belebele. "Repo for the Belebele dataset, a massively multilingual reading comprehension dataset." GitHub. https://github.com/facebookresearch/belebele
