FLORES-200 is a multilingual evaluation benchmark for machine translation systems, covering 200 languages across a wide range of language families, scripts, and resource levels. Developed by the FAIR (Fundamental AI Research) team at Meta, the dataset provides professionally translated parallel sentences that enable direct evaluation of translation quality across more than 40,000 language direction pairs without requiring English as a pivot language. FLORES-200 was introduced as part of the No Language Left Behind (NLLB) project in 2022 and has since become one of the most widely used benchmarks for assessing multilingual translation capabilities.
The name "FLORES" originally stood for "Facebook LOw RESource" translation evaluation, reflecting its initial focus on under-resourced languages. Over time, the benchmark expanded dramatically in scope, evolving from a two-language-pair dataset in 2019 to a 200-language benchmark in 2022. As of 2024, the actively maintained successor is called FLORES+, managed by the Open Language Data Initiative (OLDI) under a CC-BY-SA-4.0 license.
The FLORES project began in 2019 with a paper titled "The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English," presented at EMNLP-IJCNLP 2019 in Hong Kong. The authors were Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato, all affiliated with Facebook AI Research or Johns Hopkins University.
This initial release focused on just two low-resource language pairs: Nepali to English and Sinhala to English. The sentences were drawn from English Wikipedia and professionally translated by native speakers. The paper described a rigorous quality-checking process and reported baseline results using several training paradigms: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. The key finding was that existing translation systems performed poorly on these low-resource pairs, highlighting a significant gap between high-resource and low-resource language support.
In June 2021, the benchmark received a major expansion with FLORES-101, introduced in the paper "The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation" by Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. This paper was published in the Transactions of the Association for Computational Linguistics (TACL) in 2022.
FLORES-101 scaled the dataset from 2 languages to 101 languages. It introduced a new set of 3,001 sentences extracted from multiple Wikimedia projects (not only Wikipedia, but also Wikinews, Wikijunior, and Wikivoyage) to ensure broad topical diversity. Every sentence was professionally translated into all 101 languages, enabling direct many-to-many evaluation. The dataset was divided into three splits: a dev set (997 sentences), a devtest set (1,012 sentences), and a hidden test set (992 sentences). FLORES-101 also introduced the spBLEU metric and an accompanying SentencePiece model trained on all 101 languages to standardize tokenization across typologically diverse languages.
FLORES-200 was released alongside the No Language Left Behind (NLLB) project in July 2022. The corresponding paper, "No Language Left Behind: Scaling Human-Centered Machine Translation," was authored by the NLLB Team led by Marta R. Costa-jussà and many collaborators at Meta AI. A condensed version was later published in Nature in 2024 under the title "Scaling Neural Machine Translation to 200 Languages."
The expansion from 101 to 200 languages nearly doubled the benchmark's coverage. The same 3,001 source sentences were translated into an additional 99 languages, maintaining full alignment across all languages. Importantly, not all translations originated from English. Several languages were translated from Spanish, French, Russian, or Modern Standard Arabic when those served as more natural source languages for the target language community. The dataset retained the same three-split structure (dev, devtest, and hidden test).
Following the initial release, management of the dataset transitioned from Meta to the Open Language Data Initiative (OLDI), a community-driven organization. The actively maintained version was renamed FLORES+ (currently at version 4.4) to distinguish it from the original static releases. FLORES+ has expanded coverage to 228 language varieties and accepts community contributions through pull requests. The hidden test set remains managed by Meta separately.
The source sentences in FLORES-200 were selected from 842 distinct web articles across three Wikimedia projects:
| Source | Description | Content Type |
|---|---|---|
| Wikinews | International news articles | Current events, politics, world affairs |
| Wikijunior | Age-appropriate non-fiction books | Science, history, geography for younger audiences |
| Wikivoyage | Travel guide content | Destinations, culture, practical travel information |
Approximately one-third of the 3,001 sentences were drawn from each source. This distribution ensures topical diversity, covering news, educational content, and travel writing. The average sentence length is roughly 21 words. Wikimedia was chosen because its content is freely available under permissive licenses, allowing open redistribution of the benchmark.
The construction of FLORES-200 followed a four-phase workflow for each language: translation by professional translators, automated quality checks, independent human quality assessment by a separate group of evaluators, and re-translation wherever quality fell below the acceptance threshold.
On average, the entire process took 119 days per language, with some languages requiring up to 287 days. This lengthy timeline reflects the difficulty of finding qualified professional translators for many of the world's under-resourced languages.
Several automated and manual quality checks were applied throughout this pipeline before a language's translations were accepted.
FLORES-200 is organized into three data splits per language:
| Split | Sentences | Purpose |
|---|---|---|
| dev | 997 | Development and model tuning |
| devtest | 1,012 | Development testing and reported evaluation |
| test | 992 | Hidden test set for blind evaluation (managed by Meta) |
All splits are fully parallel, meaning every sentence has a corresponding translation in every language. This alignment enables evaluation on any of the roughly 40,000 possible translation direction pairs (approximately 200 × 200) without needing to pivot through a high-resource language.
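As an illustration of this many-to-many alignment, the sketch below pairs two arbitrary languages (Swahili and Finnish) by their shared sentence identifiers; it assumes the per-language configurations of the FLORES+ dataset on Hugging Face described later in this article:

```python
from datasets import load_dataset

# Fully parallel data: any two languages can be aligned on the shared id
src = load_dataset("openlanguagedata/flores_plus", "swh_Latn", split="devtest")
tgt = load_dataset("openlanguagedata/flores_plus", "fin_Latn", split="devtest")

tgt_by_id = {row["id"]: row["text"] for row in tgt}
pairs = [(row["text"], tgt_by_id[row["id"]]) for row in src]  # (source, reference)
```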
Each language in FLORES-200 is identified using a code that combines an ISO 639-3 language code with an ISO 15924 script code, separated by an underscore. For example:
- `eng_Latn` for English in Latin script
- `hin_Deva` for Hindi in Devanagari script
- `arb_Arab` for Modern Standard Arabic in Arabic script
- `zho_Hans` for Chinese in Simplified Han script

This system accommodates languages written in multiple scripts. For example, Acehnese appears as both `ace_Arab` (Arabic script) and `ace_Latn` (Latin script), and Kashmiri appears as both `kas_Arab` and `kas_Deva`.
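A trivial helper illustrates how these identifiers decompose (the function name is ours, not part of any official tooling):

```python
def parse_flores_code(code: str) -> tuple[str, str]:
    """Split a FLORES-200 code into (ISO 639-3 language, ISO 15924 script)."""
    lang, script = code.split("_")
    return lang, script

assert parse_flores_code("ace_Arab") == ("ace", "Arab")
assert parse_flores_code("eng_Latn") == ("eng", "Latn")
```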
FLORES-200 covers 200 languages (204 language-script combinations) spanning dozens of language families and every inhabited continent. The dataset deliberately emphasizes low-resource languages; roughly three times as many low-resource languages as high-resource languages are included (where "high-resource" is defined as having at least one million parallel sentences available with another language).
| Language Family | Example Languages | Approximate Count |
|---|---|---|
| Indo-European | English, French, Hindi, Bengali, Russian, Portuguese, German, Spanish, Polish, Marathi | ~55 |
| Atlantic-Congo (Niger-Congo) | Yoruba, Igbo, Swahili, Zulu, Shona, Lingala, Wolof, Kikuyu, Ganda | ~25 |
| Austronesian | Indonesian, Tagalog, Javanese, Cebuano, Malay, Sundanese, Ilocano, Samoan, Fijian | ~15 |
| Afro-Asiatic | Arabic (multiple varieties), Hausa, Amharic, Somali, Tigrinya, Kabyle, Hebrew | ~15 |
| Turkic | Turkish, Azerbaijani, Kazakh, Kyrgyz, Uzbek, Turkmen, Tatar, Crimean Tatar, Uyghur | ~9 |
| Dravidian | Tamil, Telugu, Kannada, Malayalam | ~4 |
| Sino-Tibetan | Chinese (Simplified and Traditional), Burmese, Standard Tibetan | ~5 |
| Austroasiatic | Khmer, Vietnamese | ~2 |
| Tai-Kadai | Thai, Lao | ~2 |
| Japonic | Japanese | 1 |
| Koreanic | Korean | 1 |
| Uralic | Finnish, Estonian, Hungarian | 3 |
| Kartvelian | Georgian | 1 |
| Other families | Basque, Mongolian, Armenian, and others | ~15+ |
Several languages appear in multiple varieties or scripts: Acehnese (Arabic and Latin scripts), Kashmiri (Arabic and Devanagari scripts), Chinese (Simplified and Traditional Han), and Arabic (multiple regional varieties), among others.
FLORES-200 standardized two primary automatic evaluation metrics for multilingual translation, addressing long-standing inconsistencies in how translation quality was measured across different languages.
The spBLEU metric was introduced alongside FLORES-101 to solve a fundamental problem with the standard BLEU score: different tokenizers produce different BLEU scores for the same translation, making cross-language comparisons unreliable. Languages with complex morphology or non-Latin scripts are particularly affected, as standard tokenizers may split words differently depending on the language.
spBLEU works by first tokenizing both the reference translation and the system output using a single, language-agnostic SentencePiece model. This model was trained on monolingual data from all FLORES languages with a vocabulary of 256,000 subword tokens. Temperature upsampling during training ensured adequate representation of low-resource languages. After tokenization, the standard BLEU calculation (n-gram precision with brevity penalty) is applied.
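A minimal sketch of temperature upsampling shows its effect on the sampling distribution; the temperature value here is illustrative, not necessarily the one used to train the actual SPM model:

```python
def temperature_upsample(sizes: list[float], T: float = 5.0) -> list[float]:
    # Raise each language's data share to the power 1/T and renormalize;
    # T > 1 flattens the distribution toward low-resource languages.
    total = sum(sizes)
    probs = [s / total for s in sizes]
    scaled = [p ** (1.0 / T) for p in probs]
    z = sum(scaled)
    return [s / z for s in scaled]

# High-resource (1M sentences) vs. low-resource (10k): shares move toward parity
print(temperature_upsample([1_000_000, 10_000]))  # ~[0.72, 0.28]
```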
The spBLEU metric has the advantage of being directly comparable across all 200 languages, since the same tokenizer is applied universally. The original SPM model released with FLORES-101 was extended for FLORES-200 (referred to as SPM-200).
To compute spBLEU on FLORES-200, researchers first tokenize both the hypothesis and reference files using the provided SentencePiece model, then run SacreBLEU on the tokenized files:
```bash
# Step 1: Tokenize hypothesis and reference with SentencePiece
python scripts/spm_encode.py --model flores200_sacrebleu_spm.model \
    --output_format=piece < hyp.txt > hyp.spm
python scripts/spm_encode.py --model flores200_sacrebleu_spm.model \
    --output_format=piece < ref.txt > ref.spm

# Step 2: Compute BLEU on the tokenized files
cat hyp.spm | sacrebleu ref.spm
```
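The same computation can be done in-process; the following is a minimal sketch using the `sentencepiece` and `sacrebleu` Python packages, assuming the SPM model file has been downloaded locally:

```python
import sentencepiece as spm
import sacrebleu

sp = spm.SentencePieceProcessor(model_file="flores200_sacrebleu_spm.model")

def spbleu(hypotheses: list[str], references: list[str]) -> float:
    # Tokenize both sides with the shared SPM model, then score with
    # standard BLEU; tokenize="none" because the text is pre-tokenized.
    hyp_tok = [" ".join(sp.encode(h, out_type=str)) for h in hypotheses]
    ref_tok = [" ".join(sp.encode(r, out_type=str)) for r in references]
    return sacrebleu.corpus_bleu(hyp_tok, [ref_tok], tokenize="none").score
```

Recent SacreBLEU releases also bundle SPM-based tokenizers (e.g., `--tokenize flores200`), which fold both steps into a single command.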
chrF++ (character n-gram F-score with word n-grams) serves as the recommended primary metric for FLORES-200 evaluation. Unlike BLEU, which operates on word-level n-grams, chrF++ computes precision and recall at the character level (using character n-grams of order 1 through 6) and incorporates word-level unigrams and bigrams. This makes chrF++ naturally more robust to morphological variation and does not require any external tokenizer.
The metric is computed using SacreBLEU:
```bash
sacrebleu -m chrf --chrf-word-order 2 ref.txt < hyp.txt
```
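The equivalent computation through sacrebleu's Python API, with illustrative sentences:

```python
import sacrebleu

hyps = ["The quick brown fox jumped."]
refs = [["The quick brown fox jumps over the lazy dog."]]

# word_order=2 adds word unigrams and bigrams, turning chrF into chrF++
score = sacrebleu.corpus_chrf(hyps, refs, word_order=2)
print(score.score)
```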
Research conducted during the NLLB project found strong correlation between spBLEU and chrF++ across all language pairs, with Pearson correlation coefficients ranging from 0.94 to 0.98. Both metrics also showed reasonable correlation with human judgments of translation quality.
In addition to automatic metrics, the NLLB project developed a human evaluation protocol called Cross-lingual Semantic Textual Similarity (XSTS). Evaluators rate translations on a five-point scale, where a score of 3 represents the threshold of acceptable quality. The NLLB team reported the following Spearman correlation coefficients between aggregated XSTS scores and automatic metrics:
| Metric | Spearman's R |
|---|---|
| spBLEU | 0.710 |
| chrF++ (corpus-level) | 0.687 |
| chrF++ (sentence-level average) | 0.694 |
Of 55 translation directions evaluated with XSTS, 38 (approximately 69%) achieved scores above 4.0, indicating high translation quality.
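Correlations of this kind are straightforward to reproduce; the sketch below uses hypothetical per-direction scores in place of the real XSTS and metric data:

```python
from scipy.stats import spearmanr

# Hypothetical per-direction scores (human XSTS vs. automatic spBLEU)
xsts = [4.2, 3.1, 4.8, 2.9, 3.7]
spbleu = [31.0, 18.5, 40.2, 15.1, 25.4]

rho, p_value = spearmanr(xsts, spbleu)
print(f"Spearman's R = {rho:.3f}")
```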
FLORES-200 was developed as a core component of Meta's No Language Left Behind (NLLB) initiative, which aimed to build a single translation model capable of supporting 200 languages with high quality. The benchmark served as the primary evaluation tool throughout the NLLB project.
The flagship NLLB-200 model uses a Sparsely Gated Mixture of Experts (MoE) architecture. In this design, a quarter of the feed-forward layers in both the encoder and decoder are replaced with MoE layers, where each token is routed to the top-2 experts out of a larger expert pool. This conditional compute approach allows the model to maintain a large total parameter count (54.5 billion) while keeping the computation per token manageable.
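An illustrative PyTorch sketch of top-2 routing appears below; gating, load balancing, and parallelism are greatly simplified relative to the actual NLLB implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Toy sparsely gated mixture-of-experts feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Each token is processed by only its
        # top-2 experts, so per-token compute stays roughly constant as
        # the total number of experts (and parameters) grows.
        scores = F.softmax(self.gate(x), dim=-1)       # (tokens, experts)
        weights, idx = scores.topk(2, dim=-1)          # top-2 per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens routed to e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```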
The NLLB project released models at multiple scales:
| Model Variant | Parameters | Type |
|---|---|---|
| NLLB-200 MoE | 54.5B | Sparse Mixture of Experts |
| NLLB-200 Dense | 3.3B | Standard dense transformer |
| NLLB-200 Dense | 1.3B | Standard dense transformer |
| NLLB-200 Distilled | 1.3B | Distilled from the 54.5B model |
| NLLB-200 Distilled | 600M | Distilled from the 54.5B model |
All models were trained using a single SentencePiece vocabulary of 256,000 tokens and a maximum sequence length of 512 tokens.
The NLLB-200 54.5B MoE model achieved an average improvement of +7.3 spBLEU over the previous state-of-the-art system across all evaluated translation directions, representing a 44% relative improvement. For some African and Indian languages, the accuracy improvement exceeded 70%. The model was evaluated across more than 40,000 translation directions using the FLORES-200 devtest split.
The NLLB project generated over 1.1 billion parallel sentence pairs through automated data mining using LASER3 (Language-Agnostic SEntence Representations, version 3). LASER3 employed a teacher-student training approach to produce sentence embeddings for low-resource languages, using 12-layer transformer encoders with approximately 250 million parameters each.
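A toy sketch of embedding-based bitext mining conveys the core idea; production pipelines use margin-based scoring and approximate nearest-neighbor search (e.g., FAISS) to scale to billions of sentences:

```python
import numpy as np

def mine_pairs(src_embs: np.ndarray, tgt_embs: np.ndarray,
               threshold: float = 0.8) -> list[tuple[int, int, float]]:
    """Return (src_idx, tgt_idx, score) for likely parallel sentences."""
    # Cosine similarity via normalized dot products
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = src @ tgt.T
    best = sims.argmax(axis=1)  # best target candidate per source sentence
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]
```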
Training data included a combination of mined parallel data, open-source parallel corpora, seed data (small professionally translated datasets), and back-translated monolingual data. A toxicity-based filtering procedure removed approximately 30% of mined parallel sentences while improving translation quality by roughly 5%.
Two additional training innovations proved important for low-resource languages: Expert Output Masking (EOM), a regularization strategy that masks the output of entire experts during training, and curriculum learning, which introduces lower-resource language pairs in phased stages of training.
These techniques combined improved performance on low-resource and very low-resource languages by approximately 2 chrF++ points.
FLORES-200 has been used as an evaluation benchmark in multiple editions of the Workshop on Machine Translation (WMT). The WMT 2023 Large-Scale Multilingual Translation shared task used the FLORES devtest split for evaluating participating systems. In 2024, the WMT shared task organized by OLDI continued to use FLORES+ as its evaluation framework, with community efforts focused on expanding and correcting translations for additional languages.
The parallel sentence structure of FLORES-200 has enabled the creation of several derivative evaluation resources:
| Benchmark | Year | Description |
|---|---|---|
| Belebele | 2023 | A multiple-choice machine reading comprehension benchmark spanning 122 language variants. Each question is based on a short passage drawn from FLORES-200. |
| SIB-200 | 2023 | A topic classification dataset covering 200+ languages and dialects, built on FLORES-200 passages. |
| FLORES+ Emakhuwa | 2024 | An expansion of FLORES+ for Portuguese-Emakhuwa translation, addressing orthographic variation. |
| FLORES-200 Corrections | 2024 | Error corrections for four African languages: Hausa, Northern Sotho (Sepedi), Xitsonga, and isiZulu. |
Research comparing large language models against specialized translation systems on FLORES-200 has revealed interesting patterns. Studies have found that models such as GPT-4 are competitive with NLLB-54B on high-resource language pairs but fall behind on low-resource languages. One study examining Claude 3 Opus found that it outperformed NLLB-54B on 55.6% of evaluated language pairs when translating into English, but only on 33.3% of pairs when translating out of English. These results suggest that while general-purpose LLMs have strong multilingual capabilities, dedicated translation models retain an advantage for lower-resource directions.
Despite its significance as a benchmark, FLORES-200 has faced several criticisms, particularly as researchers have examined translation quality in detail for specific languages.
A 2025 study titled "Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark" conducted human re-evaluation of FLORES+ translations across four typologically diverse languages and found significant quality issues:
| Language | Correct Sentences (out of 50) | Key Issues |
|---|---|---|
| Jinghpaw (Kachin) | 1 | Fundamental lexical gaps; 76% Translation Edit Rate |
| Japanese | 34 | Inappropriately formal register; 4 critical errors |
| South Azerbaijani | 12 | Systematic orthographic errors |
| Asante Twi | 27 | Minor issues; no critical errors |
These findings suggest that the benchmark's claimed 90% quality standard is not met uniformly across all languages, particularly for very low-resource languages where qualified translators are scarce.
The source material drawn from Wikimedia projects introduces certain biases. Annotators have reported that many sentences contain specialized jargon, culturally English-centric references, and content that lacks natural equivalents in target languages. Examples include sports terminology (tennis "net point," soccer "goal") that may not have direct translations, and seasonal references (e.g., "spring") that are irrelevant for tropical-region languages. This domain specificity can penalize translation systems that perform well on more naturalistic text.
Researchers demonstrated that simply copying named entities from source sentences into the hypothesis (without any actual translation) achieved non-zero scores across all languages, with an average BLEU score of 0.29. This vulnerability means that scores can be partially inflated by name overlap alone, rather than reflecting genuine translation capability.
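This kind of check is easy to reproduce; in the sketch below (with invented sentences), the hypothesis consists only of copied entity names, yet it still receives a non-zero BLEU score:

```python
import sacrebleu

hyps = ["Serena Williams Australian Open"]                   # copied names only
refs = [["Serena Williams gewann die Australian Open 2017."]]

print(sacrebleu.corpus_bleu(hyps, refs).score)  # small but greater than zero
```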
A notable finding is that models fine-tuned on naturalistic, domain-general datasets sometimes underperform on FLORES+ relative to their real-world translation quality. This mismatch between benchmark scores and practical utility raises questions about how well FLORES-200 performance predicts actual deployment effectiveness.
The original FLORES-200 dataset is available through multiple channels:
- GitHub: `facebookresearch/flores` (archived; no longer updated)
- Hugging Face: `openlanguagedata/flores_plus` (actively maintained FLORES+ version) and `facebook/flores` (original version)

The hidden test set is not publicly distributed and remains managed by Meta for blind evaluation purposes.
In the FLORES+ version, each data point is stored as a JSON Lines record containing:
- `id`: Sentence identifier, aligned across all languages
- `iso_639_3`: ISO 639-3 language code
- `iso_15924`: ISO 15924 script code
- `glottocode`: Glottolog language identifier
- `text`: The translated sentence
- `url`: Source article URL
- `domain`: Source domain (`wikinews`, `wikijunior`, or `wikivoyage`)
- `topic`: Topic classification
- `split`: Dataset split (`dev` or `devtest`)

Researchers can load the dataset using the Hugging Face Datasets library:
```python
from datasets import load_dataset

# Load all languages
ds = load_dataset("openlanguagedata/flores_plus")

# Load a specific language
ds_fra = load_dataset("openlanguagedata/flores_plus", "fra_Latn")

# Load a specific language and split
ds_fra_dev = load_dataset("openlanguagedata/flores_plus", "fra_Latn", split="dev")
```
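Each row then exposes the fields described above (exact values depend on the dataset version):

```python
# Inspect one aligned record from the French dev split
row = ds_fra_dev[0]
print(row["id"], row["iso_639_3"], row["iso_15924"])
print(row["text"])
```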
FLORES-200 represents a landmark contribution to multilingual NLP research. Before its release, evaluation of massively multilingual translation systems was hindered by the lack of a single, consistent benchmark covering a large number of languages. Most existing benchmarks covered fewer than 30 languages, and evaluation often required pivoting through English, which introduced confounding factors.
By providing fully parallel translations across 200 languages, FLORES-200 enabled several advances: direct many-to-many evaluation without pivoting through English, automatic metrics (spBLEU and chrF++) that are comparable across typologically diverse languages, and a foundation for derivative benchmarks such as Belebele and SIB-200 built on its aligned sentences.
The benchmark has been cited in hundreds of research papers and is used by both academic and industrial research groups working on multilingual AI systems.