| MGSM | |
|---|---|
| Overview | |
| Full name | Multilingual Grade School Math |
| Abbreviation | MGSM |
| Description | A multilingual benchmark evaluating mathematical reasoning across 10 typologically diverse languages using grade-school math problems |
| Release date | 2022-10-06 |
| Latest version | 1.0 |
| Benchmark updated | 2022-10 |
| Authors | Freda Shi, Mirac Suzgun, Markus Freitag, Nathanael Schärli, And others |
| Organization | Google Research |
| Technical Details | |
| Type | Mathematical Reasoning, Multilingual Evaluation |
| Modality | Text |
| Task format | Word problems requiring multi-step arithmetic |
| Number of tasks | 10 languages |
| Total examples | 2,500 (250 per language) |
| Evaluation metric | Exact match accuracy |
| Domains | Elementary mathematics, Arithmetic word problems |
| Languages | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| Performance | |
| Human performance | Based on GSM8K validation |
| Baseline | Varies by language and model |
| SOTA score | 91.60% |
| SOTA model | Claude 3.5 Sonnet, Meta Llama 3.1 405B |
| SOTA date | 2025 |
| Saturated | Nearly |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | CC-BY-SA 4.0 |
| Predecessor | GSM8K |
MGSM (Multilingual Grade School Math) is a benchmark dataset designed to evaluate the mathematical reasoning capabilities of large language models across multiple languages. Released in October 2022 by Google Research[1], MGSM extends the popular GSM8K benchmark by manually translating 250 grade-school math problems into 10 typologically diverse languages. The benchmark specifically evaluates whether chain-of-thought reasoning capabilities transfer across languages, revealing that mathematical reasoning emerges as a universal capability in sufficiently large language models.
MGSM addresses a critical gap in AI evaluation by extending mathematical reasoning assessment beyond English to include diverse languages from different linguistic families and writing systems. The benchmark consists of elementary-level arithmetic word problems that require multi-step reasoning to solve, making it an ideal test for evaluating whether models can perform complex cognitive tasks in languages beyond their primary training language. By maintaining consistent problem structure across all languages, MGSM enables direct comparison of reasoning capabilities across linguistic boundaries[1].
The development of MGSM has several important implications for multilingual AI:
MGSM covers 10 languages selected for typological diversity and global representation:
| Language | Code | Script | Language Family | Speakers (millions) | Resource Level |
|---|---|---|---|---|---|
| **Spanish** | es | Latin | Indo-European (Romance) | 559 | High |
| **French** | fr | Latin | Indo-European (Romance) | 280 | High |
| **German** | de | Latin | Indo-European (Germanic) | 132 | High |
| **Russian** | ru | Cyrillic | Indo-European (Slavic) | 258 | High |
| **Chinese** | zh | Chinese characters | Sino-Tibetan | 1,118 | High |
| **Japanese** | ja | Mixed (Kanji/Kana) | Japonic | 128 | High |
| **Thai** | th | Thai | Kra-Dai | 61 | Medium |
| **Swahili** | sw | Latin | Niger-Congo | 200 | Low |
| **Bengali** | bn | Bengali | Indo-European (Indo-Aryan) | 273 | Medium-Low |
| **Telugu** | te | Telugu | Dravidian | 96 | Low |
The languages were chosen to represent:
Each problem in MGSM follows the grade-school math format:
| Characteristic | Description | Example |
|---|---|---|
| **Problem Type** | Multi-step arithmetic word problems | "John has 5 apples..." |
| **Operations Required** | Addition, subtraction, multiplication, division | 2-8 steps typical |
| **Answer Format** | Single numerical value | Integer or decimal |
| **Complexity Level** | Elementary school (grades 3-5) | Basic arithmetic |
| **Average Length** | 2-5 sentences | Context-dependent |
Each MGSM example contains:
```json {
"question": "Problem text in target language", "answer": "Step-by-step solution with reasoning", "answer_number": 42, "equation_solution": "(5 × 8) + 2 = 42"
} ```
The translation methodology ensured quality and consistency[1]:
1. **Professional Translation**: Native speakers with mathematical knowledge 2. **Cultural Adaptation**: Numbers and contexts adapted where necessary 3. **Verification**: Multiple reviewers checked each translation 4. **Consistency Checks**: Ensuring mathematical equivalence across languages
MGSM employs a standardized evaluation approach:
| Component | Specification | Purpose |
|---|---|---|
| **Metric** | Exact match accuracy | Clear, unambiguous scoring |
| **Temperature** | 0 (deterministic) | Reproducible results |
| **Few-shot Examples** | 8 per language | Consistent prompting |
| **Answer Extraction** | "Answer:" format | Standardized parsing |
| **Numerical Tolerance** | Exact integer match | Objective evaluation |
MGSM supports two primary evaluation modes:
| Mode | Description | Code | Use Case |
|---|---|---|---|
| **Direct Answer** | Model provides answer only | `mgsm_direct_*` | Baseline performance |
| **Chain-of-Thought** | Model shows reasoning steps | `mgsm_cot_native_*` | Reasoning evaluation |
Top performing models on MGSM (average across all languages):
| Rank | Model | Average Accuracy | Best Language | Worst Language |
|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet | 91.60% | English/German | Bengali |
| 1 | Meta Llama 3.1 405B | 91.60% | Spanish | Telugu |
| 3 | GPT-4o | ~90.5% | French | Swahili |
| 4 | Gemini Ultra | ~89% | Chinese | Bengali |
| 5 | GPT-4 | ~88% | German | Telugu |
Performance varies significantly across languages[1]:
| Language Category | Languages | Typical Performance | Performance Gap |
|---|---|---|---|
| **High-Resource** | Spanish, French, German | 85-92% | Baseline |
| **Asian Languages** | Chinese, Japanese, Thai | 80-88% | -5% to -10% |
| **Low-Resource** | Swahili, Bengali, Telugu | 70-85% | -10% to -20% |
The effect of chain-of-thought prompting on MGSM:
| Model Size | Direct Answer | With CoT | Improvement |
|---|---|---|---|
| Small (<10B) | 20-30% | 25-35% | +5% |
| Medium (10-50B) | 40-60% | 55-75% | +15% |
| Large (>50B) | 70-85% | 85-92% | +10-15% |
Research using MGSM revealed several important insights[1]:
1. **Scale-Dependent Emergence**: Mathematical reasoning in non-English languages emerges at similar model scales as English 2. **Universal Capabilities**: Sufficiently large models show consistent reasoning across all languages 3. **Transfer Learning**: Models trained primarily on English data can reason in other languages 4. **Resource Gap**: Performance gap between high and low-resource languages decreases with scale
Analysis of reasoning patterns across languages shows:
| Aspect | Finding | Implication |
|---|---|---|
| **Solution Strategies** | Consistent across languages | Universal mathematical reasoning |
| **Error Patterns** | Similar mistakes in all languages | Common failure modes |
| **CoT Quality** | Varies by language resource level | Training data influence |
| **Numerical Accuracy** | Uniform across languages | Arithmetic is language-agnostic |
| Limitation | Description | Impact |
|---|---|---|
| **Limited Scope** | Only 250 problems per language | Statistical significance |
| **Translation Bias** | Some cultural contexts don't translate | Reduced naturalness |
| **Elementary Level** | Grade-school problems only | Doesn't test advanced math |
| **Near Saturation** | Top models achieve >90% | Limited discrimination |
MGSM has significantly influenced multilingual AI research:
| Area | Impact | Development |
|---|---|---|
| **Evaluation Standards** | Established multilingual reasoning benchmarks | Adoption in major evaluations |
| **Model Development** | Drove improvements in multilingual capabilities | Better cross-lingual transfer |
| **Low-Resource Languages** | Highlighted performance gaps | Targeted improvements |
| **Reasoning Research** | Showed universal reasoning emergence | Theoretical insights |
| Benchmark | Focus | Relation to MGSM |
|---|---|---|
| GSM8K | English math reasoning | Parent dataset |
| MGSM | Multilingual math | Current benchmark |
| MSVAMP | Multilingual variations | Complementary |
| MathQA-ML | Multilingual complex math | More advanced |
Technologies evaluated on MGSM support:
1. **More Languages**: Expanding to 50+ languages 2. **Difficulty Levels**: Adding algebra and geometry 3. **Multimodal Problems**: Including diagrams and graphs 4. **Cultural Adaptation**: Region-specific problem contexts 5. **Dynamic Generation**: Procedurally generated problems
MGSM has demonstrated that mathematical reasoning is a universal capability that emerges in large language models regardless of language, challenging assumptions about language-specific limitations in AI systems. By showing that models can perform complex reasoning in low-resource languages like Bengali and Telugu nearly as well as in high-resource languages, MGSM provides evidence that fundamental cognitive capabilities in AI transcend linguistic boundaries.
The benchmark's near-saturation with top models achieving over 90% accuracy suggests both the remarkable progress in multilingual AI and the need for more challenging multilingual reasoning benchmarks. MGSM remains valuable for evaluating new models and understanding how reasoning capabilities transfer across languages, particularly for underrepresented languages in AI research.