MGSM
| MGSM | |
|---|---|
| Overview | |
| Full name | Multilingual Grade School Math |
| Abbreviation | MGSM |
| Description | A multilingual benchmark evaluating mathematical reasoning across 10 typologically diverse languages using grade-school math problems |
| Release date | 2022-10-06 |
| Latest version | 1.0 |
| Benchmark updated | 2022-10 |
| Authors | Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, et al. |
| Organization | Google Research |
| Technical Details | |
| Type | Mathematical Reasoning, Multilingual Evaluation |
| Modality | Text |
| Task format | Word problems requiring multi-step arithmetic |
| Number of tasks | 10 (one evaluation task per language) |
| Total examples | 2,500 (250 per language) |
| Evaluation metric | Exact match accuracy |
| Domains | Elementary mathematics, Arithmetic word problems |
| Languages | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| Performance | |
| Human performance | Not separately measured; problems inherited from GSM8K are solvable by grade-school students |
| Baseline | Varies by language and model |
| SOTA score | 91.60% |
| SOTA model | Claude 3.5 Sonnet, Meta Llama 3.1 405B |
| SOTA date | 2025 |
| Saturated | Nearly |
| Resources | |
| Website | Official website |
| Paper | [Language Models are Multilingual Chain-of-Thought Reasoners (arXiv:2210.03057)](https://arxiv.org/abs/2210.03057) |
| GitHub | Repository |
| Dataset | Download |
| License | CC-BY-SA 4.0 |
| Predecessor | GSM8K |
MGSM (Multilingual Grade School Math) is a benchmark dataset designed to evaluate the mathematical reasoning capabilities of large language models across multiple languages. Released in October 2022 by Google Research[1], MGSM extends the popular GSM8K benchmark by manually translating 250 grade-school math problems into 10 typologically diverse languages. The benchmark specifically evaluates whether chain-of-thought reasoning capabilities transfer across languages, revealing that mathematical reasoning emerges as a universal capability in sufficiently large language models.
Overview
MGSM addresses a critical gap in AI evaluation by extending mathematical reasoning assessment beyond English to include diverse languages from different linguistic families and writing systems. The benchmark consists of elementary-level arithmetic word problems that require multi-step reasoning to solve, making it an ideal test for evaluating whether models can perform complex cognitive tasks in languages beyond their primary training language. By maintaining consistent problem structure across all languages, MGSM enables direct comparison of reasoning capabilities across linguistic boundaries[1].
Significance
The development of MGSM has several important implications for multilingual AI:
- **Universal Reasoning**: Demonstrates that mathematical reasoning transfers across languages
- **Inclusive Evaluation**: Includes underrepresented languages like Swahili, Bengali, and Telugu
- **Chain-of-Thought Testing**: Evaluates reasoning process quality across languages
- **Scaling Insights**: Shows that reasoning capabilities emerge with model scale regardless of language
- **Benchmark Standardization**: Provides consistent evaluation methodology across diverse languages
Language Coverage
Languages Included
MGSM covers 10 languages selected for typological diversity and global representation:
| Language | Code | Script | Language Family | Speakers (millions) | Resource Level |
|---|---|---|---|---|---|
| **Spanish** | es | Latin | Indo-European (Romance) | 559 | High |
| **French** | fr | Latin | Indo-European (Romance) | 280 | High |
| **German** | de | Latin | Indo-European (Germanic) | 132 | High |
| **Russian** | ru | Cyrillic | Indo-European (Slavic) | 258 | High |
| **Chinese** | zh | Chinese characters | Sino-Tibetan | 1,118 | High |
| **Japanese** | ja | Mixed (Kanji/Kana) | Japonic | 128 | High |
| **Thai** | th | Thai | Kra-Dai | 61 | Medium |
| **Swahili** | sw | Latin | Niger-Congo | 200 | Low |
| **Bengali** | bn | Bengali | Indo-European (Indo-Aryan) | 273 | Medium-Low |
| **Telugu** | te | Telugu | Dravidian | 96 | Low |
Language Selection Rationale
The languages were chosen to represent:
- **Script Diversity**: Latin, Cyrillic, Chinese characters, Japanese kanji/kana, and Brahmic-derived scripts (Thai, Bengali, Telugu)
- **Typological Variety**: Different word orders, morphological systems, and syntactic structures
- **Resource Availability**: Mix of high-resource and low-resource languages
- **Global Coverage**: Languages from different continents and cultural contexts
Dataset Structure
Problem Characteristics
Each problem in MGSM follows the grade-school math format:
| Characteristic | Description | Details |
|---|---|---|
| **Problem Type** | Multi-step arithmetic word problems | "John has 5 apples..." |
| **Operations Required** | Addition, subtraction, multiplication, division | 2-8 steps typical |
| **Answer Format** | Single numerical value | Integer or decimal |
| **Complexity Level** | Elementary school (grades 3-5) | Basic arithmetic |
| **Average Length** | 2-5 sentences | Context-dependent |
Data Format
Each MGSM example contains:
```json
{
  "question": "Problem text in target language",
  "answer": "Step-by-step solution with reasoning",
  "answer_number": 42,
  "equation_solution": "(5 × 8) + 2 = 42"
}
```
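For quick experimentation, the dataset can be loaded per language with the Hugging Face `datasets` library. The sketch below assumes the community-hosted hub ID `juletxara/mgsm` and the field names shown above; verify both on the hub before relying on them.
```python
# Minimal loading sketch (assumes the community hub ID "juletxara/mgsm").
from datasets import load_dataset

mgsm_sw = load_dataset("juletxara/mgsm", "sw", split="test")  # 250 Swahili problems
example = mgsm_sw[0]
print(example["question"])       # problem text in Swahili
print(example["answer_number"])  # gold numeric answer used for exact-match scoring
```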
Translation Process
The translation methodology ensured quality and consistency[1]:
1. **Professional Translation**: Native speakers with mathematical knowledge
2. **Cultural Adaptation**: Numbers and contexts adapted where necessary
3. **Verification**: Multiple reviewers checked each translation
4. **Consistency Checks**: Ensuring mathematical equivalence across languages
Evaluation Methodology
Standard Evaluation Protocol
MGSM employs a standardized evaluation approach:
| Component | Specification | Purpose |
|---|---|---|
| **Metric** | Exact match accuracy | Clear, unambiguous scoring |
| **Temperature** | 0 (deterministic) | Reproducible results |
| **Few-shot Examples** | 8 per language | Consistent prompting |
| **Answer Extraction** | "Answer:" format | Standardized parsing |
| **Numerical Tolerance** | Exact integer match | Objective evaluation |
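A minimal sketch of the scoring step under this protocol, assuming completions end with an `Answer: <number>` line as specified above; the regex and helper names are illustrative, not official harness code:
```python
import re

def extract_answer(completion: str):
    """Return the last number following 'Answer:', or None if absent."""
    matches = re.findall(r"Answer:\s*(-?[\d,]+(?:\.\d+)?)", completion)
    return matches[-1].replace(",", "") if matches else None  # strip thousands separators

def exact_match(completion: str, gold) -> bool:
    """Exact-match accuracy: the extracted number must equal the gold answer."""
    pred = extract_answer(completion)
    return pred is not None and float(pred) == float(gold)

# A correct completion scores 1; a missing or wrong answer scores 0.
assert exact_match("5 boxes of 8 plus 2 left over gives 42. Answer: 42", 42)
assert not exact_match("I am not sure.", 42)
```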
Evaluation Modes
MGSM supports two primary evaluation modes:
| Mode | Description | Code | Use Case |
|---|---|---|---|
| **Direct Answer** | Model provides answer only | `mgsm_direct_*` | Baseline performance |
| **Chain-of-Thought** | Model shows reasoning steps | `mgsm_cot_native_*` | Reasoning evaluation |
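The two modes differ only in how the few-shot exemplars are rendered into the prompt. A sketch, assuming the record fields shown earlier (`question`, `answer`, `answer_number`) and that the `answer` field already contains the worked solution:
```python
def build_prompt(exemplars: list, question: str, mode: str = "cot") -> str:
    """Render a few-shot MGSM prompt in either evaluation mode (8 shots is standard)."""
    if mode == "direct":
        # Direct answer: exemplars show only the final number.
        shots = [f"{ex['question']}\nAnswer: {ex['answer_number']}" for ex in exemplars]
    else:
        # Chain-of-thought: exemplars include the step-by-step solution.
        shots = [f"{ex['question']}\n{ex['answer']}" for ex in exemplars]
    return "\n\n".join(shots) + f"\n\n{question}\n"
```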
Performance Analysis
Current Leaderboard (2025)
Top-performing models on MGSM (average across all languages):
| Rank | Model | Average Accuracy | Best Language | Worst Language |
|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet | 91.60% | English/German | Bengali |
| 1 | Meta Llama 3.1 405B | 91.60% | Spanish | Telugu |
| 3 | GPT-4o | ~90.5% | French | Swahili |
| 4 | Gemini Ultra | ~89% | Chinese | Bengali |
| 5 | GPT-4 | ~88% | German | Telugu |
Language-Specific Performance
Performance varies significantly across languages[1]:
| Language Category | Languages | Typical Performance | Performance Gap |
|---|---|---|---|
| **High-Resource** | Spanish, French, German | 85-92% | Baseline |
| **Asian Languages** | Chinese, Japanese, Thai | 80-88% | -5% to -10% |
| **Low-Resource** | Swahili, Bengali, Telugu | 70-85% | -10% to -20% |
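Headline MGSM scores are macro-averages over the per-language accuracies, so the resource gap can be read off directly. An illustrative computation with made-up numbers chosen to match the ranges above (not real results):
```python
# Illustrative per-language accuracies (fabricated for this example).
accuracy = {"es": 0.91, "fr": 0.90, "de": 0.92, "ru": 0.89, "zh": 0.88,
            "ja": 0.87, "th": 0.84, "sw": 0.80, "bn": 0.82, "te": 0.79}

def avg(langs):
    return sum(accuracy[lang] for lang in langs) / len(langs)

print(f"macro-average: {avg(accuracy):.1%}")  # averaged over all 10 languages
print(f"resource gap:  {avg(['es', 'fr', 'de']) - avg(['sw', 'bn', 'te']):+.1%}")
```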
Chain-of-Thought Impact
The effect of chain-of-thought prompting on MGSM:
| Model Size | Direct Answer | With CoT | Improvement |
|---|---|---|---|
| Small (<10B) | 20-30% | 25-35% | +5% |
| Medium (10-50B) | 40-60% | 55-75% | +15% |
| Large (>50B) | 70-85% | 85-92% | +10-15% |
Key Findings
Emergent Multilingual Reasoning
Research using MGSM revealed several important insights[1]:
1. **Scale-Dependent Emergence**: Mathematical reasoning in non-English languages emerges at similar model scales as in English
2. **Universal Capabilities**: Sufficiently large models show consistent reasoning across all languages
3. **Transfer Learning**: Models trained primarily on English data can reason in other languages
4. **Resource Gap**: The performance gap between high- and low-resource languages decreases with scale
Cross-Lingual Consistency
Analysis of reasoning patterns across languages shows:
| Aspect | Finding | Implication |
|---|---|---|
| **Solution Strategies** | Consistent across languages | Universal mathematical reasoning |
| **Error Patterns** | Similar mistakes in all languages | Common failure modes |
| **CoT Quality** | Varies by language resource level | Training data influence |
| **Numerical Accuracy** | Uniform across languages | Arithmetic is language-agnostic |
Limitations and Challenges
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| **Limited Scope** | Only 250 problems per language | Limited statistical power |
| **Translation Bias** | Some cultural contexts don't translate | Reduced naturalness |
| **Elementary Level** | Grade-school problems only | Doesn't test advanced math |
| **Near Saturation** | Top models achieve >90% | Limited discrimination |
Technical Challenges
- **Tokenization Differences**: Various scripts affect token efficiency (see the token-count sketch after this list)
- **Number Representation**: Different numeral systems across languages
- **Cultural Context**: Some word problems assume Western contexts
- **Evaluation Consistency**: Ensuring fair comparison across languages
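As a quick illustration of the tokenization point, the sketch below counts tokens for rough translations of the same sentence. This assumes OpenAI's `tiktoken` with the `cl100k_base` encoding, which does not reflect every model's tokenizer, and the non-English strings are approximate translations added for this example:
```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = {
    "en": "Janet has 5 apples.",
    "de": "Janet hat 5 Äpfel.",
    "th": "เจเน็ตมีแอปเปิ้ล 5 ผล",
    "te": "జానెట్ వద్ద 5 ఆపిల్స్ ఉన్నాయి.",
}
for lang, text in samples.items():
    # Scripts poorly covered by the tokenizer expand into many more tokens,
    # which shortens the effective context and raises inference cost.
    print(lang, len(enc.encode(text)))
```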
Research Impact
Influence on Multilingual AI
MGSM has significantly influenced multilingual AI research:
| Area | Impact | Development |
|---|---|---|
| **Evaluation Standards** | Established multilingual reasoning benchmarks | Adoption in major evaluations |
| **Model Development** | Drove improvements in multilingual capabilities | Better cross-lingual transfer |
| **Low-Resource Languages** | Highlighted performance gaps | Targeted improvements |
| **Reasoning Research** | Showed universal reasoning emergence | Theoretical insights |
Related Benchmarks
| Benchmark | Focus | Relation to MGSM |
|---|---|---|
| GSM8K | English math reasoning | Parent dataset |
| MGSM | Multilingual math | Current benchmark |
| MSVAMP | Multilingual variations | Complementary |
| MathQA-ML | Multilingual complex math | More advanced |
Applications and Use Cases
Practical Applications
Technologies evaluated on MGSM support:
- **Educational Technology**: Multilingual tutoring systems
- **Financial Services**: Cross-border calculation assistance
- **Scientific Computing**: International collaboration tools
- **Translation Services**: Mathematical document translation
Future Directions
Potential Extensions
1. **More Languages**: Expanding to 50+ languages
2. **Difficulty Levels**: Adding algebra and geometry
3. **Multimodal Problems**: Including diagrams and graphs
4. **Cultural Adaptation**: Region-specific problem contexts
5. **Dynamic Generation**: Procedurally generated problems
Significance
MGSM has demonstrated that mathematical reasoning is a universal capability that emerges in large language models regardless of language, challenging assumptions about language-specific limitations in AI systems. By showing that models can perform complex reasoning in low-resource languages like Bengali and Telugu nearly as well as in high-resource languages, MGSM provides evidence that fundamental cognitive capabilities in AI transcend linguistic boundaries.
The benchmark's near-saturation with top models achieving over 90% accuracy suggests both the remarkable progress in multilingual AI and the need for more challenging multilingual reasoning benchmarks. MGSM remains valuable for evaluating new models and understanding how reasoning capabilities transfer across languages, particularly for underrepresented languages in AI research.
See Also
- GSM8K
- Multilingual AI Evaluation
- Mathematical Reasoning Benchmarks
- Chain-of-Thought Prompting
- Google Research
- Low-Resource Languages in AI
References
- ↑ 1.0 1.1 1.2 1.3 1.4 Shi, F., Suzgun, M., Freitag, M., Wang, X., Srivats, S., Vosoughi, S., Chung, H. W., Tay, Y., Ruder, S., Zhou, D., Das, D., & Wei, J. (2022). "Language Models are Multilingual Chain-of-Thought Reasoners". arXiv:2210.03057. Retrieved from https://arxiv.org/abs/2210.03057