Global-MMLU

Updated Jun 9, 2026

Global-MMLU is a multilingual evaluation benchmark that extends the MMLU question-answering dataset across 42 languages, with designated subsets labeled culturally sensitive (CS) and culturally agnostic (CA). It was released in December 2024 by researchers at Cohere Labs (Cohere For AI) together with collaborators from EPFL, Hugging Face, Mila, AI Singapore, MIT, KAIST and other institutions. Beyond expanding language coverage, Global-MMLU was built to quantify and correct the Western-centric cultural bias embedded in the original English MMLU and its machine-translated derivatives, and to improve translation quality by engaging compensated professional and community annotators. The work was published at ACL 2025.^[1]^[2]^[3]

Overview

The Massive Multitask Language Understanding (MMLU) dataset, introduced by Hendrycks et al. in 2020, covers 57 subjects spanning STEM, humanities, social sciences and other areas, and has become a standard test of knowledge and reasoning for large language models. Because MMLU was written in English, multilingual evaluation has typically relied on machine-translated copies of it, which the Global-MMLU authors collectively term transMMLU.^[1]

Global-MMLU addresses two problems with that practice at once. First, it improves linguistic quality by replacing or post-editing machine translations with human-verified ones for many languages. Second, it adds a cultural-bias layer: every question in an annotated subsample is labeled according to whether answering it correctly requires culture-, geography-, or dialect-specific knowledge. This lets evaluators report separately on culturally sensitive and culturally agnostic questions, exposing how much a model's MMLU score depends on Western-centric knowledge rather than general reasoning ability.^[1]^[4]

The full Global-MMLU test set contains all roughly 14,000 MMLU samples translated into 42 languages including English, for a total of 589,764 question-answer pairs. The dataset is distributed on Hugging Face under the Cohere Labs (CohereForAI) organization and is integrated as a task in EleutherAI's lm-evaluation-harness.^[5]^[6]

Motivation: Western bias in translated MMLU

The central argument of the paper is that translating a benchmark into many languages produces multilinguality but not multiculturalism. The original MMLU contains subjects that are explicitly United States-specific, such as US history, US-focused jurisprudence and US accounting, and even nominally universal subjects encode Western framing. The authors note, for example, that the Moral Scenarios subject is anchored to "moral standards in the US," making it culturally loaded once translated.^[1]

When such a dataset is machine-translated and adopted as a global yardstick, two distortions follow. Models that have absorbed more Western cultural and geographic facts gain an advantage that has little to do with cross-lingual competence, so rankings reward Western knowledge rather than genuine multilingual ability. In addition, naive machine translation introduces translationese and other artifacts that degrade question clarity, an effect that is more severe for low-resource languages where translation systems are weaker. Global-MMLU was constructed to separate these confounds from true model capability.^[1]

To quantify the bias, the authors had 200 compensated professional and community annotators review a representative random sample of the original English MMLU. Out of that annotated set, 28% of questions were found to require culturally sensitive knowledge to answer correctly. Among questions needing geographic knowledge, 84.9% concerned North America or Europe. Of all culturally sensitive questions, 86.5% were tagged as requiring Western cultural knowledge, with the next-largest category, South Asian culture, at only about 4%. Within Western-culture questions, 73.9% specifically required knowledge of the United States. These findings were independently summarized by AI press coverage and reproduced in the ACL and IBM Research publication records.^[1]^[2]^[3]

The 42 languages and the CS/CA subsets

Global-MMLU covers the following 42 languages: Amharic, Arabic, Bengali, Chinese, Czech, Dutch, English, Filipino, French, German, Greek, Hausa, Hebrew, Hindi, Igbo, Indonesian, Italian, Japanese, Korean, Kyrgyz, Lithuanian, Malagasy, Malay, Nepali, Nyanja, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Sinhala, Somali, Shona, Spanish, Swahili, Swedish, Telugu, Turkish, Ukrainian, Vietnamese and Yoruba.^[1]

For reporting, the paper groups these by resource availability following the taxonomy of Joshi et al. (2020): 18 high-resource languages (including Arabic, Chinese, English, French, German, Hindi, Russian and Spanish), 11 mid-resource languages (such as Bengali, Indonesian, Korean and Ukrainian) and 13 low-resource languages (such as Amharic, Hausa, Swahili, Telugu and Yoruba).^[1]

The cultural-sensitivity labels come from annotating MMLU Annotated (MA), a uniform subsample of 2,850 English questions (50 per subject, about 20% of the original data). Each question received three categories of judgment: cultural knowledge, geographic knowledge, and dialect knowledge. A question is labeled culturally sensitive (CS) if it requires any one of those; otherwise it is culturally agnostic (CA). By majority vote of the annotators, 792 of the 2,850 MA questions were CS and 2,058 were CA. Extended across all 42 languages, this yields 119,700 MA samples, 33,264 CS samples and 86,436 CA samples.^[1]

Subset	English samples	All-language samples	Description
Full Global-MMLU	~14,000 per language	589,764	All MMLU items translated into 42 languages^[1]
MMLU Annotated (MA)	2,850	119,700	Uniform 50-per-subject sample carrying CS/CA labels^[1]
Culturally Sensitive (CS)	792	33,264	Requires cultural, geographic or dialect knowledge^[1]
Culturally Agnostic (CA)	2,058	86,436	No cultural, regional or dialectal references^[1]
Global-MMLU-Lite	400 per language	6,000	200 CS + 200 CA per language for 15 languages (the fully human-translated languages and English)^[1]
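The subset sizes in the table follow from a few multiplications, and a short check makes the relationships explicit. The sketch below uses only figures quoted in this article; the variable names are illustrative and nothing is read from the dataset itself.

```python
# Figures quoted in the article; no dataset access is involved.
LANGUAGES = 42
SUBJECTS = 57
PER_SUBJECT = 50
FULL_TOTAL = 589_764
CS_ENGLISH = 792
CA_ENGLISH = 2_058

# MMLU Annotated: a uniform 50-question sample from each subject.
ma_english = SUBJECTS * PER_SUBJECT

# Size of the full test set in a single language.
per_language = FULL_TOTAL // LANGUAGES

# Every annotated English question exists in all 42 languages.
subset_totals = {
    "MA": ma_english * LANGUAGES,
    "CS": CS_ENGLISH * LANGUAGES,
    "CA": CA_ENGLISH * LANGUAGES,
}

print(ma_english, per_language, subset_totals)
print(f"MA share of one language: {ma_english / per_language:.1%}")
```

The CS and CA counts partition the annotated sample, so their sum equals the MA count in both the English and all-language columns.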

The subsets are not just smaller slices; they have different subject mixes. Because cultural references cluster in the humanities and social sciences, the CS subset over-represents those fields, while STEM, medical and business questions dominate the CA subset. For example, STEM supplies 33.3% of MA but only 2.9% of CS, whereas humanities jumps from 22.8% of MA to 55.8% of CS.^[1]
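To make the shift in subject mix concrete, the ratio of a category's share in CS to its share in MA shows how strongly it is over- or under-represented. This is a small illustrative calculation using only the four percentages quoted above; the function name is hypothetical.

```python
# Category shares (percent) quoted in the article.
ma_share = {"STEM": 33.3, "Humanities": 22.8}
cs_share = {"STEM": 2.9, "Humanities": 55.8}


def representation_ratio(category: str) -> float:
    """Share in CS divided by share in MA; above 1 means over-represented."""
    return cs_share[category] / ma_share[category]


for name in ma_share:
    print(f"{name}: {representation_ratio(name):.2f}x its MA share")
```

On these figures, humanities questions appear in CS at roughly two and a half times their MA share, while STEM questions appear at under a tenth of theirs.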

Construction and translation

The dataset was assembled by combining several translation sources, all anchored to a common machine-translation baseline. The English MMLU was first translated into 41 languages with the Google Translate API; the authors chose Google Translate deliberately because using an LLM-based translator could bias evaluations toward models that favor their own outputs, and because Google Translate scored higher on ChrF++ than GPT-3.5-turbo across subjects. Native speakers then reviewed and edited those translations.^[1]

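Editing came from three streams. The two professional streams together cover 14 professionally translated languages, and community editing adds 11 more, for 25 languages with some form of human translation or post-editing:^[1]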

Professional gold-set translations for four languages: Arabic, French, Hindi and Spanish, where compensated professional annotators reviewed and corrected the machine output for fluency and cultural appropriateness.
Community translations for 11 languages that met a threshold of at least 50 human-edited samples: Amharic, Czech, Malay, Persian, Romanian, Russian, Sinhala, Telugu, Turkish, Ukrainian and Vietnamese, managed through the Argilla annotation platform.
MMMLU professional translations for 10 additional languages, drawn from OpenAI's human-translated MMMLU release: Bengali, Chinese, German, Indonesian, Italian, Japanese, Korean, Portuguese, Swahili and Yoruba. (MMMLU covers 14 languages, four of which overlap with the gold set.)

The remaining 16 languages, namely Polish, Greek, Dutch, Swedish, Filipino, Lithuanian, Hausa, Nepali, Somali, Hebrew, Shona, Malagasy, Igbo, Serbian, Kyrgyz and Nyanja, rely on the Google Translate baseline without human post-editing. In total, annotators made 7,565 edits, about 36.9% of reviewed samples; professional annotators edited on average 789 samples per language (38.5%) and community contributors 362 samples per language (17.7%).^[1]

For the cultural annotation, each of the 2,850 questions was reviewed by at least three annotators, and 96.4% by more than three (up to a maximum of ten). Labels were assigned by majority vote, and inter-annotator agreement was measured with Krippendorff's Alpha, which was high for most subjects (unanimous for Anatomy) and lowest for Moral Scenarios. Only 2.4% of samples were found to depend on time-sensitive knowledge.^[1]
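The labeling rule can be sketched in a few lines. The article states that a question is CS if it needs cultural, geographic or dialect knowledge and that labels come from a majority vote; the exact aggregation order and tie handling are not stated, so the sketch below assumes each annotator's three judgments are first collapsed into one verdict and that a tie falls back to CA. The `Judgment` class and `label_question` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One annotator's answers for one question."""
    cultural: bool
    geographic: bool
    dialect: bool

    def is_sensitive(self) -> bool:
        # Sensitive if any of the three kinds of knowledge is required.
        return self.cultural or self.geographic or self.dialect


def label_question(judgments: list[Judgment]) -> str:
    """Return "CS" or "CA" by strict majority (assumption: ties give "CA")."""
    if len(judgments) < 3:
        raise ValueError("each question needs at least three annotators")
    sensitive_votes = sum(j.is_sensitive() for j in judgments)
    return "CS" if sensitive_votes * 2 > len(judgments) else "CA"


votes = [
    Judgment(cultural=True, geographic=False, dialect=False),
    Judgment(cultural=False, geographic=True, dialect=False),
    Judgment(cultural=False, geographic=False, dialect=False),
]
print(label_question(votes))
```

In this example two of three annotators flag some form of culture-specific knowledge, so the question is labeled CS even though they disagree on which kind.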

Evaluation methodology

Global-MMLU is designed to be reported on three subsets: MMLU Annotated, the culturally agnostic (CA) subset, and the culturally sensitive (CS) subset, so that a model's overall score can be decomposed by cultural dependence. The authors evaluated 14 state-of-the-art open-weight and proprietary models from nine model families, chosen for strong multilingual performance.^[1]

The evaluated models fell into four groups: small models, namely Aya Expanse 8B, Gemma 2 9B, SEA-LION v3 (9B), Llama 3.1 8B, Mistral Nemo 12B and Qwen 2.5 7B; mid-size models, namely Aya Expanse 32B, Command R (34B), Gemma 2 27B and Qwen 2.5 32B; the large models Llama 3.1 70B and Command R+; and the closed-weight models GPT-4o and Claude 3.5 Sonnet.^[1]

Open models were run through EleutherAI's lm-evaluation-harness in a 5-shot setting. The closed models were also evaluated 5-shot, but because token log-probabilities are unavailable through their APIs, the harness sent the 5-shot prompt and parsed the generated answer instead. Prompt instructions were given in the same language as each sample, following the original MMLU protocol.^[1]
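When log-probabilities are unavailable, scoring depends on pulling an answer letter out of free-form text. The article does not describe the parsing rule that was used, so the following is only a minimal sketch of one plausible approach: take the first standalone capital letter A to D in the generation. The function names are hypothetical.

```python
import re
from typing import Optional

# A standalone capital A-D, e.g. "B", "(C)" or "Answer: D".
_CHOICE = re.compile(r"\b([ABCD])\b")


def extract_choice(generation: str) -> Optional[str]:
    """Return the first standalone answer letter, or None if there is none."""
    match = _CHOICE.search(generation)
    return match.group(1) if match else None


def generative_accuracy(generations: list[str], gold: list[str]) -> float:
    """Fraction of generations whose parsed letter equals the gold letter."""
    if len(generations) != len(gold):
        raise ValueError("generations and gold must have the same length")
    hits = sum(extract_choice(g) == answer for g, answer in zip(generations, gold))
    return hits / len(gold)


outputs = ["B", "The answer is (C).", "no idea"]
print([extract_choice(text) for text in outputs])
print(generative_accuracy(outputs, ["B", "C", "A"]))
```

A rule this simple has obvious failure modes: an English sentence that begins with the article "A" would be read as choice A, and an unparseable reply counts as wrong. A production parser would need stricter prompting or a more constrained output format.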

Notable results and findings on cultural bias

The headline result is that model rankings shift substantially between the CA and CS subsets, demonstrating that scores on translated MMLU are distorted by cultural content. Measured relative to ranks on MMLU Annotated and averaged across the 14 human-translated languages, CA rankings moved by 3.4 rank changes and 3.7 position shifts, while CS rankings were far more volatile at 5.7 rank changes and 7.3 position shifts.^[1]

Finding	Value	Source
MMLU questions requiring culturally sensitive knowledge	28%	^[1]^[2]^[3]
Geography questions focused on North America or Europe	84.9%	^[1]^[2]^[3]
Culturally sensitive questions tagged as Western culture	86.5%	^[1]^[2]
Western-culture questions specific to the United States	73.9%	^[1]
CS questions tagged as North American (geographic)	64.5%	^[1]
Avg. rank changes on CA subset (human-translated langs)	3.4	^[1]
Avg. rank changes on CS subset (human-translated langs)	5.7	^[1]
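The rank-change and position-shift figures compare a leaderboard on one subset against the leaderboard on MMLU Annotated. The paper's exact definitions are not reproduced in this article, so the sketch below assumes that "rank changes" counts the models whose rank differs and "position shifts" sums the absolute rank differences. The model names and scores are made up for illustration.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its 1-based rank, highest score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position for position, model in enumerate(ordered, start=1)}


def rank_volatility(reference: dict[str, float],
                    subset: dict[str, float]) -> tuple[int, int]:
    """Return (rank_changes, position_shifts) of subset relative to reference.

    Both dictionaries must cover the same set of models.
    """
    ref_ranks = ranks(reference)
    sub_ranks = ranks(subset)
    diffs = [abs(ref_ranks[m] - sub_ranks[m]) for m in ref_ranks]
    rank_changes = sum(d > 0 for d in diffs)   # models that moved at all
    position_shifts = sum(diffs)               # total places moved
    return rank_changes, position_shifts


# Hypothetical accuracies for four placeholder models.
ma_scores = {"model_a": 70.0, "model_b": 65.0, "model_c": 60.0, "model_d": 55.0}
cs_scores = {"model_a": 60.0, "model_b": 72.0, "model_c": 66.0, "model_d": 50.0}
print(rank_volatility(ma_scores, cs_scores))
```

Under these assumed definitions the reported averages would come from running the comparison once per language and averaging the two numbers across languages.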

Volatility grows as language resources shrink. The mean standard deviation of accuracy across high-resource languages was 3.21 on CA and 3.86 on CS; for low-resource languages it rose to 6.37 and 6.78, increases of 98% and 75% over the high-resource case. Across every resource level, CS accuracy varied more than CA accuracy, underlining how sensitive culturally loaded questions are to translation quality and training-data coverage.^[1]
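The quoted percentage increases can be recomputed from the four standard deviations. This is a plain relative-change calculation on the rounded figures given above, so its results may differ in the last digit from values the authors computed on unrounded data.

```python
def relative_increase(low_resource: float, high_resource: float) -> float:
    """Percentage by which the low-resource value exceeds the high-resource one."""
    return (low_resource - high_resource) / high_resource * 100


# Mean standard deviation of accuracy, as quoted in the article.
ca_increase = relative_increase(6.37, 3.21)
cs_increase = relative_increase(6.78, 3.86)

print(f"CA: {ca_increase:.1f}%  CS: {cs_increase:.1f}%")
```

From the rounded inputs the CA increase comes out just above 98% and the CS increase between 75% and 76%, in line with the figures reported.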

Average accuracy also depends on model size, and counterintuitively was higher on CS than CA in the full set, because CS draws heavily on humanities and social-science questions where models tend to score well, whereas CA contains more difficult STEM and medical items.^[1]

Model tier	CA accuracy	CS accuracy	Avg. CA rank change	Avg. CS rank change
Small models	51.3%	54.8%	0.35	0.45
Mid-size models	59.1%	61.7%	0.33	1.97
Large models	61.6%	66.8%	0.21	0.67

When the CS and CA samples are balanced, as in Global-MMLU-Lite, the pattern inverts: CS tasks then show lower average accuracy and greater variance than CA tasks, showing that cultural specificity increases performance instability once the subject mix is held even. The closed proprietary models GPT-4o and Claude 3.5 Sonnet consistently outscored the smaller open models, though their lead was narrower on CS data than on CA data.^[1]

Based on these results, the authors recommend that practitioners report on Global-MMLU rather than translated MMLU, and that they report CS and CA performance separately rather than as a single aggregate number.^[1]
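Separate reporting is straightforward once each scored question carries its CS or CA label. The sketch below assumes results are available as `(label, is_correct)` pairs for the annotated questions; that input shape and the function name are illustrative choices, not part of the benchmark's tooling.

```python
from collections import defaultdict
from typing import Iterable


def accuracy_by_subset(records: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Return accuracy on MA, CS and CA from (label, is_correct) pairs.

    Each label must be "CS" or "CA"; every record also counts toward "MA".
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for label, is_correct in records:
        if label not in ("CS", "CA"):
            raise ValueError(f"unexpected label: {label!r}")
        for bucket in ("MA", label):
            total[bucket] += 1
            correct[bucket] += int(is_correct)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}


# Hypothetical scored answers for four annotated questions.
scored = [("CS", True), ("CS", False), ("CA", True), ("CA", True)]
print(accuracy_by_subset(scored))
```

Reporting the three numbers side by side keeps the aggregate visible while showing how much of it comes from culturally loaded questions.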

Significance

Global-MMLU has been adopted as a more reliable multilingual evaluation than raw machine-translated MMLU. It is built into EleutherAI's lm-evaluation-harness, where the global_mmlu_full_* tasks cover all 42 languages and the lighter global_mmlu_* tasks cover the human-translated subset, including subject-category splits for STEM, humanities, social sciences and other.^[6]
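A harness run for one language can be assembled as a command line. The sketch below only builds and prints the command; it does not execute anything. The `lm_eval` flags shown are standard harness options, but the task suffix is an assumption: it presumes each language task is named by appending a language code such as `sw` to the prefixes mentioned above, which should be checked against the harness README. The model path is a placeholder.

```python
import shlex


def harness_command(model_path: str, language_code: str,
                    full: bool = True, num_fewshot: int = 5) -> list[str]:
    """Build an lm_eval command for one Global-MMLU language task.

    Assumption: task names are the prefix plus a language code.
    """
    prefix = "global_mmlu_full_" if full else "global_mmlu_"
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={model_path}",
        "--tasks", f"{prefix}{language_code}",
        "--num_fewshot", str(num_fewshot),
    ]


# "org/model-name" is a placeholder, not a real checkpoint.
command = harness_command("org/model-name", "sw")
print(shlex.join(command))
```

Defaulting to five shots mirrors the evaluation setting described in the methodology section; passing `full=False` targets the lighter task family instead.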

More broadly, the work is part of a wave of research arguing that multilingual natural language understanding benchmarks must account for culture, not just language. By providing per-question metadata on cultural, geographic and dialectal dependence, Global-MMLU lets researchers audit other datasets for similar bias and gives model developers a way to check whether multilingual gains reflect genuine cross-cultural competence or merely transferred Western knowledge. The accompanying Global-MMLU-Lite release, with 200 CS and 200 CA samples per language for 15 languages (the fully human-translated languages and English), provides a compact high-quality option for routine evaluation.^[1]^[5]

Limitations

The authors are explicit about several constraints. Only 14 of the 42 languages are professionally translated and another 11 carry community post-edits; the remaining 16 non-English languages depend entirely on machine translation, so apparent performance on those languages may partly reflect translation artifacts rather than model ability. The cultural-sensitivity labels were derived from a 2,850-question English subsample rather than the entire dataset, and the dialect-knowledge signal is sparse, accounting for only about 0.5% of questions. Some subjects, notably Moral Scenarios, showed substantial annotator disagreement, indicating that cultural-sensitivity judgments are themselves subjective at the margins.^[1]

Because the CS and CA subsets differ in subject composition, raw accuracy comparisons between them conflate cultural sensitivity with subject difficulty; this is why the authors stress that ranking changes, not absolute accuracy gaps, are the meaningful signal of cultural bias, and why the balanced Global-MMLU-Lite is provided for cleaner comparisons. Finally, the benchmark inherits MMLU's multiple-choice format and its underlying subject coverage, so it measures factual recall and reasoning over a fixed curriculum rather than open-ended or generative multilingual ability.^[1]

References

1. Singh, Shivalika; Romanou, Angelika; Fourrier, Clémentine; Adelani, David I.; et al. "Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation." arXiv:2412.03304, December 2024. https://arxiv.org/abs/2412.03304
2. "Global-MMLU: A World-class Benchmark Redefining Multilingual AI by Bridging Cultural and Linguistic Gaps for Equitable Evaluation Across 42 Languages and Diverse Contexts." MarkTechPost, December 7, 2024. https://www.marktechpost.com/2024/12/07/global-mmlu-a-world-class-benchmark-redefining-multilingual-ai-by-bridging-cultural-and-linguistic-gaps-for-equitable-evaluation-across-42-languages-and-diverse-contexts/
3. Singh, Shivalika; et al. "Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025), Vol. 1, pp. 18761-18799. ACL Anthology. https://aclanthology.org/2025.acl-long.919/
4. "Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation." Hugging Face Papers. https://huggingface.co/papers/2412.03304
5. "CohereLabs/Global-MMLU." Hugging Face Datasets. https://huggingface.co/datasets/CohereLabs/Global-MMLU
6. "Global-MMLU." lm-evaluation-harness, EleutherAI. https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/global_mmlu/README.md
