| MMMLU | |
|---|---|
| Overview | |
| Full name | Multilingual Massive Multitask Language Understanding |
| Abbreviation | Multilingual MMLU |
| Description | Multilingual evaluation frameworks based on the Massive Multitask Language Understanding benchmark, including translations and adaptations for 26+ languages |
| Release date | ~2023 |
| Latest version | Various |
| Benchmark updated | 2024 |
| Authors | University of Oregon NLP group, Various research teams |
| Organization | University of Oregon, Multiple institutions |
| Technical Details | |
| Type | Knowledge Assessment, Multilingual Evaluation |
| Modality | Text |
| Task format | Multiple choice |
| Number of tasks | Varies (MMLU: 57 subjects, 15,908 questions; MMLU-ProX: 11,829 questions per language) |
| Total examples | Varies by implementation |
| Evaluation metric | Accuracy (zero-shot and few-shot settings) |
| Domains | STEM, Humanities, Social Sciences, Professional |
| Languages | 26-29 languages including English, Chinese, Spanish, French, German, Russian, Arabic, Hindi, and others |
| Performance | |
| Human performance | 89.8% (domain experts on English MMLU) |
| Baseline | 25% (random chance) |
| SOTA score | ~88% (on English MMLU) |
| SOTA model | Claude 3.5 Sonnet, GPT-4o, Llama 3.1 405B |
| SOTA date | 2024 |
| Saturated | Partially (English version) |
| Resources | |
| Paper | (original MMLU) Paper |
| GitHub | Repository |
| Predecessor | MMLU |
| Successor | MMLU-ProX |
Multilingual MMLU refers to various multilingual adaptations and extensions of the Massive Multitask Language Understanding (MMLU) benchmark, designed to evaluate large language models' knowledge and reasoning capabilities across multiple languages and diverse subject areas. The most notable implementations are the mlmm-evaluation framework, covering 26 languages, and MMLU-ProX, covering 29 languages. Both address the need for culturally and linguistically diverse AI assessment.
Multilingual MMLU initiatives extend AI benchmarking beyond English, addressing a key limitation of the original MMLU benchmark. MMLU became the standard for evaluating language models' multitask capabilities, with over 100 million downloads by 2024, but its English-only format overlooked the linguistic and cultural diversity essential for global AI deployment.
The mlmm-evaluation framework, developed by the University of Oregon NLP group, covers 26 languages.
MMLU-ProX is a more recent, more comprehensive benchmark covering 29 languages, with 11,829 questions per language.
The original MMLU benchmark, released on September 7, 2020, by Dan Hendrycks and colleagues, tests models across 57 subjects with 15,908 questions ranging from STEM fields to the humanities. Its English-only format, however, limits what it can reveal about model performance in other languages.
This limitation motivated the development of multilingual versions that offer more inclusive and comprehensive evaluation.
The mlmm-evaluation framework supports 26 languages: Russian, German, Chinese, French, Spanish, Italian, Dutch, Vietnamese, Indonesian, Arabic, Hungarian, Romanian, Danish, Slovak, Ukrainian, Catalan, Serbian, Croatian, Hindi, Bengali, Tamil, Nepali, Malayalam, Marathi, Telugu, and Kannada.
MMLU-ProX covers 29 languages for broader evaluation.
| Language Family | Example Languages | Script Type |
|---|---|---|
| Indo-European | English, Spanish, French, German, Italian, Dutch, Catalan, Russian, Ukrainian, Romanian, Danish, Slovak, Serbian, Croatian, Hindi, Bengali, Marathi, Nepali | Latin, Cyrillic, Devanagari, Bengali |
| Sino-Tibetan | Chinese (Simplified/Traditional) | Chinese characters |
| Dravidian | Tamil, Telugu, Malayalam, Kannada | Tamil, Telugu, Malayalam, and Kannada scripts |
| Austronesian | Indonesian, Malay | Latin |
| Afro-Asiatic | Arabic | Arabic script |
| Uralic | Hungarian | Latin |
| Austroasiatic | Vietnamese | Latin with diacritics |
The benchmarks maintain the original MMLU's 57 subjects across four major categories:
| Category | Number of Subjects | Example Topics |
|---|---|---|
| STEM | 18 | Mathematics, Physics, Chemistry, Biology, Computer Science, Engineering |
| Humanities | 14 | History, Philosophy, Literature, Law, Ethics |
| Social Sciences | 13 | Psychology, Sociology, Economics, Political Science, Geography |
| Other | 12 | Professional Medicine, Business, Nutrition, Marketing |
The mlmm-evaluation translation process:
1. Automated Translation: Using ChatGPT for initial translation
2. Direct Translation: Maintaining question structure and format
3. Batch Processing: Efficient translation of large question sets
The MMLU-ProX translation process:
1. Multi-LLM Translation: Multiple powerful LLMs for initial translation
2. Expert Review: Native speaker verification
3. Cultural Adaptation: Ensuring cultural relevance
4. Terminology Consistency: Standardized technical terms
5. Quality Assurance: Rigorous validation process
| Metric | Description | Implementation |
|---|---|---|
| Zero-shot Accuracy | Direct answer without examples | Standard evaluation |
| Few-shot Accuracy | Performance with 5 examples | Enhanced context |
| Cross-lingual Transfer | Performance correlation across languages | Language comparison |
| Performance Gap | Difference from English baseline | Fairness assessment |
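The zero-shot and few-shot settings in the table differ only in how many solved examples precede the target question. The sketch below shows one plausible way to build such prompts and score the answers. The `Question` class, the prompt layout, and the function names are illustrative assumptions, not the format used by mlmm-evaluation or MMLU-ProX.

```python
from dataclasses import dataclass

CHOICE_LABELS = ["A", "B", "C", "D"]


@dataclass
class Question:
    # Hypothetical container for one multiple-choice item.
    text: str
    choices: list[str]  # four answer options
    answer: int         # index of the correct option


def format_question(q: Question, include_answer: bool) -> str:
    """Render a question, optionally followed by its gold answer letter."""
    lines = [q.text]
    lines += [f"{label}. {choice}" for label, choice in zip(CHOICE_LABELS, q.choices)]
    suffix = f" {CHOICE_LABELS[q.answer]}" if include_answer else ""
    lines.append("Answer:" + suffix)
    return "\n".join(lines)


def build_prompt(target: Question, examples: list[Question]) -> str:
    """Zero-shot when `examples` is empty; few-shot (e.g. 5-shot) otherwise."""
    blocks = [format_question(e, include_answer=True) for e in examples]
    blocks.append(format_question(target, include_answer=False))
    return "\n\n".join(blocks)


def accuracy(predictions: list[str], questions: list[Question]) -> float:
    """Fraction of predicted letters that match the gold answer."""
    if not questions or len(predictions) != len(questions):
        raise ValueError("predictions and questions must be non-empty and aligned")
    correct = sum(
        pred.strip().upper() == CHOICE_LABELS[q.answer]
        for pred, q in zip(predictions, questions)
    )
    return correct / len(questions)
```

With four answer options, a model that guesses uniformly at random is expected to score about 0.25 under `accuracy`, which matches the random-chance baseline.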
Research reveals significant performance disparities:
| Language Category | Performance Gap | Key Factors |
|---|---|---|
| High-resource languages | Minimal drop | Large training data, similar structure |
| Medium-resource languages | Moderate drop | Adequate training data |
| Low-resource languages | Up to 24.3% drop | Limited training data |
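A performance gap like those in the table can be expressed either in absolute percentage points or relative to the English score, and the two can differ noticeably. The table does not say which convention its figures use, so the sketch below computes both. The function names and the example accuracies are hypothetical and not taken from any published result.

```python
def performance_gap(english_acc: float, target_acc: float) -> dict[str, float]:
    """Gap between English and target-language accuracy (both in [0, 1]).

    Returns the absolute gap in percentage points and the drop
    relative to the English score, in percent.
    """
    if not 0.0 < english_acc <= 1.0 or not 0.0 <= target_acc <= 1.0:
        raise ValueError("accuracies must lie in [0, 1] and English must be > 0")
    absolute = english_acc - target_acc
    return {
        "points": absolute * 100,
        "relative_pct": absolute / english_acc * 100,
    }


def gaps_by_language(acc: dict[str, float], baseline: str = "en") -> dict[str, float]:
    """Percentage-point gap of every language against the baseline language."""
    base = acc[baseline]
    return {
        lang: performance_gap(base, value)["points"]
        for lang, value in acc.items()
        if lang != baseline
    }


if __name__ == "__main__":
    # Made-up accuracies, for illustration only.
    scores = {"en": 0.80, "de": 0.76, "te": 0.60}
    for lang, gap in gaps_by_language(scores).items():
        print(f"{lang}: {gap:.1f} points below English")
```

For example, a fall from 0.80 to 0.60 is a gap of 20 percentage points but a relative drop of 25%, so a report should state which convention it uses.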
Leading models report similar scores on English MMLU (2024):
| Model | English MMLU | Notes |
|---|---|---|
| Claude 3.5 Sonnet | ~88% | Strong multilingual capabilities |
| GPT-4o | ~88% | Consistent across languages |
| Llama 3.1 405B | ~88% | Open-source leader |
| Earlier models | <70% | Significant multilingual gaps |
| Issue | Description | Impact | Mitigation |
|---|---|---|---|
| Semantic Drift | Meaning changes in translation | Accuracy reduction | Human verification |
| Cultural Concepts | Untranslatable terms | Question invalidity | Cultural adaptation |
| Technical Terms | Inconsistent terminology | Confusion | Standardized glossaries |
| Idiomatic Expressions | Loss of nuance | Comprehension issues | Contextual rewriting |
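The "standardized glossaries" mitigation in the table can be partly automated: given a list of approved term translations, flag any translated question that omits the approved rendering of a term found in the source. The sketch below is a deliberately naive version using case-insensitive substring matching. The function name and the glossary entries are hypothetical, and a real pipeline would need to handle inflection and word boundaries.

```python
def find_glossary_violations(
    source: str, translation: str, glossary: dict[str, str]
) -> list[str]:
    """Return source terms whose approved translation is missing.

    `glossary` maps an English term to its approved target-language term.
    Matching is plain case-insensitive substring search, so inflected
    forms of the approved term will be reported as violations.
    """
    src = source.casefold()
    tgt = translation.casefold()
    return [
        term
        for term, approved in glossary.items()
        if term.casefold() in src and approved.casefold() not in tgt
    ]


if __name__ == "__main__":
    # Hypothetical English-to-German glossary for a calculus question.
    glossary = {"derivative": "Ableitung", "integral": "Integral"}
    source = "What is the derivative of x^2?"
    print(find_glossary_violations(source, "Was ist die Ableitung von x^2?", glossary))
    print(find_glossary_violations(source, "Was ist das Derivat von x^2?", glossary))
```

Flagged items would still go to a human reviewer, since the check cannot judge whether a different rendering is acceptable in context.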
Multilingual versions also inherit problems from the original MMLU.
| Application | Use Case | Key Benefit |
|---|---|---|
| Global Deployment | Multinational AI services | Fair performance assessment |
| Education Technology | Multilingual tutoring systems | Language-appropriate evaluation |
| Healthcare AI | Medical diagnosis systems | Culturally sensitive assessment |
| Legal Technology | International law applications | Jurisdiction-specific validation |
1. Native Content Creation: Questions originally written in each language
2. Cultural Validation: Expert review by cultural specialists
3. Dynamic Question Generation: Preventing data contamination
4. Multimodal Extensions: Adding visual and audio components
5. Code-switching Tests: Evaluating multilingual mixing