FLORES-200 is a multilingual evaluation benchmark for machine translation systems, covering 200 languages across a wide range of language families, scripts, and resource levels. Developed by the FAIR (Fundamental AI Research) team at Meta, the dataset provides professionally translated parallel sentences that enable direct evaluation of translation quality across more than 40,000 language direction pairs without requiring English as a pivot language. FLORES-200 was introduced as part of the No Language Left Behind (NLLB) project in 2022 and has since become one of the most widely used benchmarks for assessing multilingual translation capabilities.
The name "FLORES" originally stood for "Facebook LOw RESource" translation evaluation, reflecting its initial focus on under-resourced languages. Over time, the benchmark expanded dramatically in scope, evolving from a two-language-pair dataset in 2019 to a 200-language benchmark in 2022. As of 2024, the actively maintained successor is called FLORES+, managed by the Open Language Data Initiative (OLDI) under a CC-BY-SA-4.0 license.
The FLORES project began in 2019 with a paper titled "The FLoRes Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English," presented at EMNLP-IJCNLP 2019 in Hong Kong. The authors were Francisco Guzmán, Peng-Jen Chen, Myle Ott, Juan Pino, Guillaume Lample, Philipp Koehn, Vishrav Chaudhary, and Marc'Aurelio Ranzato, all affiliated with Facebook AI Research or Johns Hopkins University.
This initial release focused on just two low-resource language pairs: Nepali to English and Sinhala to English. The sentences were drawn from English Wikipedia and professionally translated by native speakers. The paper described a rigorous quality-checking process and reported baseline results using several training paradigms: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. The key finding was that existing translation systems performed poorly on these low-resource pairs, highlighting a significant gap between high-resource and low-resource language support.
In June 2021, the benchmark received a major expansion with FLORES-101, introduced in the paper "The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation" by Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc'Aurelio Ranzato, Francisco Guzmán, and Angela Fan. This paper was published in the Transactions of the Association for Computational Linguistics (TACL) in 2022.
FLORES-101 scaled the dataset from 2 languages to 101 languages. It introduced a new set of 3,001 sentences extracted from multiple Wikimedia projects (not only Wikipedia, but also Wikinews, Wikijunior, and Wikivoyage) to ensure broad topical diversity. Every sentence was professionally translated into all 101 languages, enabling direct many-to-many evaluation. The dataset was divided into three splits: a dev set (997 sentences), a devtest set (1,012 sentences), and a hidden test set (992 sentences). FLORES-101 also introduced the spBLEU metric and an accompanying SentencePiece model trained on all 101 languages to standardize tokenization across typologically diverse languages.
FLORES-200 was released alongside the No Language Left Behind (NLLB) project in July 2022. The corresponding paper, "No Language Left Behind: Scaling Human-Centered Machine Translation," was authored by the NLLB Team led by Marta R. Costa-jussà and many collaborators at Meta AI. A condensed version was later published in Nature in 2024 under the title "Scaling Neural Machine Translation to 200 Languages."
The expansion from 101 to 200 languages nearly doubled the benchmark's coverage. The same 3,001 source sentences were translated into an additional 99 languages, maintaining full alignment across all languages. Importantly, not all translations originated from English. Several languages were translated from Spanish, French, Russian, or Modern Standard Arabic when those served as more natural source languages for the target language community. The dataset retained the same three-split structure (dev, devtest, and hidden test).
Following the initial release, management of the dataset transitioned from Meta to the Open Language Data Initiative (OLDI), a community-driven organization. The actively maintained version was renamed FLORES+ (currently at version 4.4) to distinguish it from the original static releases. FLORES+ has expanded coverage to 228 language varieties and accepts community contributions through pull requests. The hidden test set remains managed by Meta separately.
The source sentences in FLORES-200 were selected from 842 distinct web articles across three Wikimedia projects:
| Source | Description | Content Type |
|---|---|---|
| Wikinews | International news articles | Current events, politics, world affairs |
| Wikijunior | Age-appropriate non-fiction books | Science, history, geography for younger audiences |
| Wikivoyage | Travel guide content | Destinations, culture, practical travel information |
Approximately one-third of the 3,001 sentences were drawn from each source. This distribution ensures topical diversity, covering news, educational content, and travel writing. The average sentence length is roughly 21 words. Wikimedia was chosen because its content is freely available under permissive licenses, allowing open redistribution of the benchmark.
The construction of FLORES-200 followed a four-phase workflow for each language: translation by professional translators, automated quality checks, independent human quality assessment by a separate group of evaluators, and re-translation wherever quality fell below the acceptance threshold.
On average, the entire process took 119 days per language, with some languages requiring up to 287 days. This lengthy timeline reflects the difficulty of finding qualified professional translators for many of the world's under-resourced languages.
Several automated and manual quality checks were applied throughout this pipeline before a language's translations were accepted.
FLORES-200 is organized into three data splits per language:
| Split | Sentences | Purpose |
|---|---|---|
| dev | 997 | Development and model tuning |
| devtest | 1,012 | Development testing and reported evaluation |
| test | 992 | Hidden test set for blind evaluation (managed by Meta) |
All splits are fully parallel, meaning every sentence has a corresponding translation in every language. This alignment enables evaluation on any of the roughly 40,000 possible translation direction pairs (approximately 200 × 200) without needing to pivot through a high-resource language.
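As an illustration of this many-to-many alignment, the sketch below pairs two arbitrary languages (Swahili and Finnish) by their shared sentence identifiers; it assumes the per-language configurations of the FLORES+ dataset on Hugging Face described later in this article:

```python
from datasets import load_dataset

# Fully parallel data: any two languages can be aligned on the shared id
src = load_dataset("openlanguagedata/flores_plus", "swh_Latn", split="devtest")
tgt = load_dataset("openlanguagedata/flores_plus", "fin_Latn", split="devtest")

tgt_by_id = {row["id"]: row["text"] for row in tgt}
pairs = [(row["text"], tgt_by_id[row["id"]]) for row in src]  # (source, reference)
```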
Each language in FLORES-200 is identified using a code that combines an ISO 639-3 language code with an ISO 15924 script code, separated by an underscore. For example:
- `eng_Latn` for English in Latin script
- `hin_Deva` for Hindi in Devanagari script
- `arb_Arab` for Modern Standard Arabic in Arabic script
- `zho_Hans` for Chinese in Simplified Han script

This system accommodates languages written in multiple scripts. For example, Acehnese appears as both `ace_Arab` (Arabic script) and `ace_Latn` (Latin script), and Kashmiri appears as both `kas_Arab` and `kas_Deva`.
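A trivial helper illustrates how these identifiers decompose (the function name is ours, not part of any official tooling):

```python
def parse_flores_code(code: str) -> tuple[str, str]:
    """Split a FLORES-200 code into (ISO 639-3 language, ISO 15924 script)."""
    lang, script = code.split("_")
    return lang, script

assert parse_flores_code("ace_Arab") == ("ace", "Arab")
assert parse_flores_code("eng_Latn") == ("eng", "Latn")
```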
FLORES-200 covers 200 languages (204 language-script combinations) spanning dozens of language families and every inhabited continent. The dataset deliberately emphasizes low-resource languages; roughly three times as many low-resource languages as high-resource languages are included (where "high-resource" is defined as having at least one million parallel sentences available with another language).
| Language Family | Example Languages | Approximate Count |
|---|---|---|
| Indo-European | English, French, Hindi, Bengali, Russian, Portuguese, German, Spanish, Polish, Marathi | ~55 |
| Atlantic-Congo (Niger-Congo) | Yoruba, Igbo, Swahili, Zulu, Shona, Lingala, Wolof, Kikuyu, Ganda | ~25 |
| Austronesian | Indonesian, Tagalog, Javanese, Cebuano, Malay, Sundanese, Ilocano, Samoan, Fijian | ~15 |
| Afro-Asiatic | Arabic (multiple varieties), Hausa, Amharic, Somali, Tigrinya, Kabyle, Hebrew | ~15 |
| Turkic | Turkish, Azerbaijani, Kazakh, Kyrgyz, Uzbek, Turkmen, Tatar, Crimean Tatar, Uyghur | ~9 |
| Dravidian | Tamil, Telugu, Kannada, Malayalam | ~4 |
| Sino-Tibetan | Chinese (Simplified and Traditional), Burmese, Standard Tibetan | ~5 |
| Austroasiatic | Khmer, Vietnamese | ~2 |
| Tai-Kadai | Thai, Lao | ~2 |
| Japonic | Japanese | 1 |
| Koreanic | Korean | 1 |
| Uralic | Finnish, Estonian, Hungarian | 3 |
| Kartvelian | Georgian | 1 |
| Other families | Basque, Mongolian, Armenian, and others | ~15+ |
Several languages appear in multiple varieties or scripts: Acehnese (Arabic and Latin scripts), Kashmiri (Arabic and Devanagari scripts), Chinese (Simplified and Traditional Han), and Arabic (multiple regional varieties), among others.
FLORES-200 standardized two primary automatic evaluation metrics for multilingual translation, addressing long-standing inconsistencies in how translation quality was measured across different languages.
The spBLEU metric was introduced alongside FLORES-101 to solve a fundamental problem with the standard BLEU score: different tokenizers produce different BLEU scores for the same translation, making cross-language comparisons unreliable. Languages with complex morphology or non-Latin scripts are particularly affected, as standard tokenizers may split words differently depending on the language.
spBLEU works by first tokenizing both the reference translation and the system output using a single, language-agnostic SentencePiece model. This model was trained on monolingual data from all FLORES languages with a vocabulary of 256,000 subword tokens. Temperature upsampling during training ensured adequate representation of low-resource languages. After tokenization, the standard BLEU calculation (n-gram precision with brevity penalty) is applied.
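A minimal sketch of temperature upsampling shows its effect on the sampling distribution; the temperature value here is illustrative, not necessarily the one used to train the actual SPM model:

```python
def temperature_upsample(sizes: list[float], T: float = 5.0) -> list[float]:
    # Raise each language's data share to the power 1/T and renormalize;
    # T > 1 flattens the distribution toward low-resource languages.
    total = sum(sizes)
    probs = [s / total for s in sizes]
    scaled = [p ** (1.0 / T) for p in probs]
    z = sum(scaled)
    return [s / z for s in scaled]

# High-resource (1M sentences) vs. low-resource (10k): shares move toward parity
print(temperature_upsample([1_000_000, 10_000]))  # ~[0.72, 0.28]
```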
The spBLEU metric has the advantage of being directly comparable across all 200 languages, since the same tokenizer is applied universally. The original SPM model released with FLORES-101 was extended for FLORES-200 (referred to as SPM-200).
To compute spBLEU on FLORES-200, researchers first tokenize both the hypothesis and reference files using the provided SentencePiece model, then run SacreBLEU on the tokenized files:
```bash
# Step 1: Tokenize hypothesis and reference with SentencePiece
python scripts/spm_encode.py --model flores200_sacrebleu_spm.model \
    --output_format=piece < hyp.txt > hyp.spm
python scripts/spm_encode.py --model flores200_sacrebleu_spm.model \
    --output_format=piece < ref.txt > ref.spm

# Step 2: Compute BLEU on the tokenized files
cat hyp.spm | sacrebleu ref.spm
```
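The same computation can be done in-process; the following is a minimal sketch using the `sentencepiece` and `sacrebleu` Python packages, assuming the SPM model file has been downloaded locally:

```python
import sentencepiece as spm
import sacrebleu

sp = spm.SentencePieceProcessor(model_file="flores200_sacrebleu_spm.model")

def spbleu(hypotheses: list[str], references: list[str]) -> float:
    # Tokenize both sides with the shared SPM model, then score with
    # standard BLEU; tokenize="none" because the text is pre-tokenized.
    hyp_tok = [" ".join(sp.encode(h, out_type=str)) for h in hypotheses]
    ref_tok = [" ".join(sp.encode(r, out_type=str)) for r in references]
    return sacrebleu.corpus_bleu(hyp_tok, [ref_tok], tokenize="none").score
```

Recent SacreBLEU releases also bundle SPM-based tokenizers (e.g., `--tokenize flores200`), which fold both steps into a single command.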
chrF++ (character n-gram F-score with word n-grams) serves as the recommended primary metric for FLORES-200 evaluation. Unlike BLEU, which operates on word-level n-grams, chrF++ computes precision and recall at the character level (using character n-grams of order 1 through 6) and incorporates word-level unigrams and bigrams. This makes chrF++ naturally more robust to morphological variation and does not require any external tokenizer.
The metric is computed using SacreBLEU:
```bash
sacrebleu -m chrf --chrf-word-order 2 ref.txt < hyp.txt
```
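The equivalent computation through sacrebleu's Python API, with illustrative sentences:

```python
import sacrebleu

hyps = ["The quick brown fox jumped."]
refs = [["The quick brown fox jumps over the lazy dog."]]

# word_order=2 adds word unigrams and bigrams, turning chrF into chrF++
score = sacrebleu.corpus_chrf(hyps, refs, word_order=2)
print(score.score)
```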
Research conducted during the NLLB project found strong correlation between spBLEU and chrF++ across all language pairs, with Pearson correlation coefficients ranging from 0.94 to 0.98. Both metrics also showed reasonable correlation with human judgments of translation quality.
In addition to automatic metrics, the NLLB project developed a human evaluation protocol called Cross-lingual Semantic Textual Similarity (XSTS). Evaluators rate translations on a five-point scale, where a score of 3 represents the threshold of acceptable quality. The NLLB team reported the following Spearman correlation coefficients between aggregated XSTS scores and automatic metrics:
| Metric | Spearman's R |
|---|---|
| spBLEU | 0.710 |
| chrF++ (corpus-level) | 0.687 |
| chrF++ (sentence-level average) | 0.694 |
Of 55 translation directions evaluated with XSTS, 38 (approximately 69%) achieved scores above 4.0, indicating high translation quality.
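Correlations of this kind are straightforward to reproduce; the sketch below uses hypothetical per-direction scores in place of the real XSTS and metric data:

```python
from scipy.stats import spearmanr

# Hypothetical per-direction scores (human XSTS vs. automatic spBLEU)
xsts = [4.2, 3.1, 4.8, 2.9, 3.7]
spbleu = [31.0, 18.5, 40.2, 15.1, 25.4]

rho, p_value = spearmanr(xsts, spbleu)
print(f"Spearman's R = {rho:.3f}")
```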
FLORES-200 was developed as a core component of Meta's No Language Left Behind (NLLB) initiative, which aimed to build a single translation model capable of supporting 200 languages with high quality. The benchmark served as the primary evaluation tool throughout the NLLB project.
The flagship NLLB-200 model uses a Sparsely Gated Mixture of Experts (MoE) architecture. In this design, a quarter of the feed-forward layers in both the encoder and decoder are replaced with MoE layers, where each token is routed to the top-2 experts out of a larger expert pool. This conditional compute approach allows the model to maintain a large total parameter count (54.5 billion) while keeping the computation per token manageable.
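An illustrative PyTorch sketch of top-2 routing appears below; gating, load balancing, and parallelism are greatly simplified relative to the actual NLLB implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Toy sparsely gated mixture-of-experts feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Each token is processed by only its
        # top-2 experts, so per-token compute stays roughly constant as
        # the total number of experts (and parameters) grows.
        scores = F.softmax(self.gate(x), dim=-1)       # (tokens, experts)
        weights, idx = scores.topk(2, dim=-1)          # top-2 per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(2):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens routed to e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```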
The NLLB project released models at multiple scales:
| Model Variant | Parameters | Type |
|---|---|---|
| NLLB-200 MoE | 54.5B | Sparse Mixture of Experts |
| NLLB-200 Dense | 3.3B | Standard dense transformer |
| NLLB-200 Dense | 1.3B | Standard dense transformer |
| NLLB-200 Distilled | 1.3B | Distilled from the 54.5B model |
| NLLB-200 Distilled | 600M | Distilled from the 54.5B model |
All models were trained using a single SentencePiece vocabulary of 256,000 tokens and a maximum sequence length of 512 tokens.
The NLLB-200 54.5B MoE model achieved an average improvement of +7.3 spBLEU over the previous state-of-the-art system across all evaluated translation directions, representing a 44% relative improvement. For some African and Indian languages, the accuracy improvement exceeded 70%. The model was evaluated across more than 40,000 translation directions using the FLORES-200 devtest split.
The NLLB project generated over 1.1 billion parallel sentence pairs through automated data mining using LASER3 (Language-Agnostic SEntence Representations, version 3). LASER3 employed a teacher-student training approach to produce sentence embeddings for low-resource languages, using 12-layer transformer encoders with approximately 250 million parameters each.
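A toy sketch of embedding-based bitext mining conveys the core idea; production pipelines use margin-based scoring and approximate nearest-neighbor search (e.g., FAISS) to scale to billions of sentences:

```python
import numpy as np

def mine_pairs(src_embs: np.ndarray, tgt_embs: np.ndarray,
               threshold: float = 0.8) -> list[tuple[int, int, float]]:
    """Return (src_idx, tgt_idx, score) for likely parallel sentences."""
    # Cosine similarity via normalized dot products
    src = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = src @ tgt.T
    best = sims.argmax(axis=1)  # best target candidate per source sentence
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]
```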
Training data included a combination of mined parallel data, open-source parallel corpora, seed data (small professionally translated datasets), and back-translated monolingual data. A toxicity-based filtering procedure removed approximately 30% of mined parallel sentences while improving translation quality by roughly 5%.
Two additional training innovations proved important for low-resource languages: Expert Output Masking (EOM), a regularization strategy that masks the output of entire experts during training, and curriculum learning, which introduces lower-resource language pairs in phased stages of training.
These techniques combined improved performance on low-resource and very low-resource languages by approximately 2 chrF++ points.
FLORES-200 has been used as an evaluation benchmark in multiple editions of the Workshop on Machine Translation (WMT). The WMT 2023 Large-Scale Multilingual Translation shared task used the FLORES devtest split for evaluating participating systems. In 2024, the WMT shared task organized by OLDI continued to use FLORES+ as its evaluation framework, with community efforts focused on expanding and correcting translations for additional languages.
The parallel sentence structure of FLORES-200 has enabled the creation of several derivative evaluation resources:
| Benchmark | Year | Description |
|---|---|---|
| Belebele | 2023 | A multiple-choice machine reading comprehension benchmark spanning 122 language variants. Each question is based on a short passage drawn from FLORES-200. |
| SIB-200 | 2023 | A topic classification dataset covering 200+ languages and dialects, built on FLORES-200 passages. |
| FLORES+ Emakhuwa | 2024 | An expansion of FLORES+ for Portuguese-Emakhuwa translation, addressing orthographic variation. |
| FLORES-200 Corrections | 2024 | Error corrections for four African languages: Hausa, Northern Sotho (Sepedi), Xitsonga, and isiZulu. |
Research comparing large language models against specialized translation systems on FLORES-200 has revealed interesting patterns. Studies have found that models such as GPT-4 are competitive with NLLB-54B on high-resource language pairs but fall behind on low-resource languages. One study examining Claude 3 Opus found that it outperformed NLLB-54B on 55.6% of evaluated language pairs when translating into English, but only on 33.3% of pairs when translating out of English. These results suggest that while general-purpose LLMs have strong multilingual capabilities, dedicated translation models retain an advantage for lower-resource directions.
Despite its significance as a benchmark, FLORES-200 has faced several criticisms, particularly as researchers have examined translation quality in detail for specific languages.
A 2025 study titled "Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark" conducted human re-evaluation of FLORES+ translations across four typologically diverse languages and found significant quality issues:
| Language | Correct Sentences (out of 50) | Key Issues |
|---|---|---|
| Jinghpaw (Kachin) | 1 | Fundamental lexical gaps; 76% Translation Edit Rate |
| Japanese | 34 | Inappropriately formal register; 4 critical errors |
| South Azerbaijani | 12 | Systematic orthographic errors |
| Asante Twi | 27 | Minor issues; no critical errors |
These findings suggest that the benchmark's claimed 90% quality standard is not met uniformly across all languages, particularly for very low-resource languages where qualified translators are scarce.
The source material drawn from Wikimedia projects introduces certain biases. Annotators have reported that many sentences contain specialized jargon, culturally English-centric references, and content that lacks natural equivalents in target languages. Examples include sports terminology (tennis "net point," soccer "goal") that may not have direct translations, and seasonal references (e.g., "spring") that are irrelevant for tropical-region languages. This domain specificity can penalize translation systems that perform well on more naturalistic text.
Researchers demonstrated that simply copying named entities from source sentences into the hypothesis (without any actual translation) achieved non-zero scores across all languages, with an average BLEU score of 0.29. This vulnerability means that scores can be partially inflated by name overlap alone, rather than reflecting genuine translation capability.
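This kind of check is easy to reproduce; in the sketch below (with invented sentences), the hypothesis consists only of copied entity names, yet it still receives a non-zero BLEU score:

```python
import sacrebleu

hyps = ["Serena Williams Australian Open"]                   # copied names only
refs = [["Serena Williams gewann die Australian Open 2017."]]

print(sacrebleu.corpus_bleu(hyps, refs).score)  # small but greater than zero
```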
A notable finding is that models fine-tuned on naturalistic, domain-general datasets sometimes underperform on FLORES+ relative to their real-world translation quality. This mismatch between benchmark scores and practical utility raises questions about how well FLORES-200 performance predicts actual deployment effectiveness.
The original FLORES-200 dataset is available through multiple channels:
- GitHub: `facebookresearch/flores` (archived; no longer updated)
- Hugging Face: `openlanguagedata/flores_plus` (actively maintained FLORES+ version) and `facebook/flores` (original version)

The hidden test set is not publicly distributed and remains managed by Meta for blind evaluation purposes.
In the FLORES+ version, each data point is stored as a JSON Lines record containing:
- `id`: Sentence identifier, aligned across all languages
- `iso_639_3`: ISO 639-3 language code
- `iso_15924`: ISO 15924 script code
- `glottocode`: Glottolog language identifier
- `text`: The translated sentence
- `url`: Source article URL
- `domain`: Source domain (`wikinews`, `wikijunior`, or `wikivoyage`)
- `topic`: Topic classification
- `split`: Dataset split (`dev` or `devtest`)

Researchers can load the dataset using the Hugging Face Datasets library:
```python
from datasets import load_dataset

# Load all languages
ds = load_dataset("openlanguagedata/flores_plus")

# Load a specific language
ds_fra = load_dataset("openlanguagedata/flores_plus", "fra_Latn")

# Load a specific language and split
ds_fra_dev = load_dataset("openlanguagedata/flores_plus", "fra_Latn", split="dev")
```
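Each row then exposes the fields described above (exact values depend on the dataset version):

```python
# Inspect one aligned record from the French dev split
row = ds_fra_dev[0]
print(row["id"], row["iso_639_3"], row["iso_15924"])
print(row["text"])
```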
FLORES-200 represents a landmark contribution to multilingual NLP research. Before its release, evaluation of massively multilingual translation systems was hindered by the lack of a single, consistent benchmark covering a large number of languages. Most existing benchmarks covered fewer than 30 languages, and evaluation often required pivoting through English, which introduced confounding factors.
By providing fully parallel translations across 200 languages, FLORES-200 enabled several advances: direct many-to-many evaluation without pivoting through English, automatic metrics (spBLEU and chrF++) that are comparable across typologically diverse languages, and a foundation for derivative benchmarks such as Belebele and SIB-200 built on its aligned sentences.
The benchmark has been cited in hundreds of research papers and is used by both academic and industrial research groups working on multilingual AI systems.