MGSM

**MGSM**

| | |
|---|---|
| **Overview** | |
| Full name | Multilingual Grade School Math |
| Abbreviation | MGSM |
| Description | A multilingual benchmark evaluating mathematical reasoning across 10 typologically diverse languages using grade-school math problems |
| Release date | 2022-10-06 |
| Latest version | 1.0 |
| Benchmark updated | 2022-10 |
| Authors | Freda Shi, Mirac Suzgun, Markus Freitag, et al. |
| Organization | Google Research |
| **Technical Details** | |
| Type | Mathematical Reasoning, Multilingual Evaluation |
| Modality | Text |
| Task format | Word problems requiring multi-step arithmetic |
| Number of tasks | 10 (one per language) |
| Total examples | 2,500 (250 per language) |
| Evaluation metric | Exact match accuracy |
| Domains | Elementary mathematics, arithmetic word problems |
| Languages | Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| **Performance** | |
| Human performance | Based on GSM8K validation |
| Baseline | Varies by language and model |
| SOTA score | 91.60% |
| SOTA model | Claude 3.5 Sonnet, Meta Llama 3.1 405B (tie) |
| SOTA date | 2025 |
| Saturated | Nearly |
| **Resources** | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | CC-BY-SA 4.0 |
| Predecessor | GSM8K |


MGSM (Multilingual Grade School Math) is a benchmark dataset designed to evaluate the mathematical reasoning capabilities of large language models across multiple languages. Released in October 2022 by Google Research[1], MGSM extends the popular GSM8K benchmark by manually translating 250 grade-school math problems into 10 typologically diverse languages. The benchmark specifically evaluates whether chain-of-thought reasoning capabilities transfer across languages, revealing that mathematical reasoning emerges as a universal capability in sufficiently large language models.

Overview

MGSM addresses a critical gap in AI evaluation by extending mathematical reasoning assessment beyond English to include diverse languages from different linguistic families and writing systems. The benchmark consists of elementary-level arithmetic word problems that require multi-step reasoning to solve, making it an ideal test for evaluating whether models can perform complex cognitive tasks in languages beyond their primary training language. By maintaining consistent problem structure across all languages, MGSM enables direct comparison of reasoning capabilities across linguistic boundaries[1].

Significance

The development of MGSM has several important implications for multilingual AI:

  • **Universal Reasoning**: Demonstrates that mathematical reasoning transfers across languages
  • **Inclusive Evaluation**: Includes underrepresented languages like Swahili, Bengali, and Telugu
  • **Chain-of-Thought Testing**: Evaluates reasoning process quality across languages
  • **Scaling Insights**: Shows that reasoning capabilities emerge with model scale regardless of language
  • **Benchmark Standardization**: Provides consistent evaluation methodology across diverse languages

Language Coverage

Languages Included

MGSM covers 10 languages selected for typological diversity and global representation:

| Language | Code | Script | Language Family | Speakers (millions) | Resource Level |
|---|---|---|---|---|---|
| **Spanish** | es | Latin | Indo-European (Romance) | 559 | High |
| **French** | fr | Latin | Indo-European (Romance) | 280 | High |
| **German** | de | Latin | Indo-European (Germanic) | 132 | High |
| **Russian** | ru | Cyrillic | Indo-European (Slavic) | 258 | High |
| **Chinese** | zh | Chinese characters | Sino-Tibetan | 1,118 | High |
| **Japanese** | ja | Mixed (Kanji/Kana) | Japonic | 128 | High |
| **Thai** | th | Thai | Kra-Dai | 61 | Medium |
| **Swahili** | sw | Latin | Niger-Congo | 200 | Low |
| **Bengali** | bn | Bengali | Indo-European (Indo-Aryan) | 273 | Medium-Low |
| **Telugu** | te | Telugu | Dravidian | 96 | Low |

Language Selection Rationale

The languages were chosen to represent:

  • **Script Diversity**: Latin, Cyrillic, Chinese characters, Japanese kanji/kana, Thai, Bengali, and Telugu scripts
  • **Typological Variety**: Different word orders, morphological systems, and syntactic structures
  • **Resource Availability**: Mix of high-resource and low-resource languages
  • **Global Coverage**: Languages from different continents and cultural contexts

Dataset Structure

Problem Characteristics

Each problem in MGSM follows the grade-school math format:

| Characteristic | Description | Example |
|---|---|---|
| **Problem Type** | Multi-step arithmetic word problems | "John has 5 apples..." |
| **Operations Required** | Addition, subtraction, multiplication, division | 2-8 steps typical |
| **Answer Format** | Single numerical value | Integer or decimal |
| **Complexity Level** | Elementary school (grades 3-5) | Basic arithmetic |
| **Average Length** | 2-5 sentences | Context-dependent |

Data Format

Each MGSM example contains:

```json
{
  "question": "Problem text in target language",
  "answer": "Step-by-step solution with reasoning",
  "answer_number": 42,
  "equation_solution": "(5 × 8) + 2 = 42"
}
```
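
For hands-on use, individual language splits can be loaded directly. Below is a minimal sketch using the Hugging Face `datasets` library; the dataset ID `juletxara/mgsm` is the community mirror of the official release, so verify the ID and field names against your source:

```python
# Minimal sketch: load one MGSM language split with Hugging Face `datasets`.
# The dataset ID "juletxara/mgsm" is a community mirror; adjust if you
# use a different host for the official data.
from datasets import load_dataset

mgsm_sw = load_dataset("juletxara/mgsm", "sw", split="test")  # Swahili, 250 problems

example = mgsm_sw[0]
print(example["question"])       # problem text in the target language
print(example["answer_number"])  # gold numeric answer used for exact match
```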

Translation Process

The translation methodology ensured quality and consistency[1]:

1. **Professional Translation**: Native speakers with mathematical knowledge
2. **Cultural Adaptation**: Numbers and contexts adapted where necessary
3. **Verification**: Multiple reviewers checked each translation
4. **Consistency Checks**: Ensuring mathematical equivalence across languages

Evaluation Methodology

Standard Evaluation Protocol

MGSM employs a standardized evaluation approach:

| Component | Specification | Purpose |
|---|---|---|
| **Metric** | Exact match accuracy | Clear, unambiguous scoring |
| **Temperature** | 0 (deterministic) | Reproducible results |
| **Few-shot Examples** | 8 per language | Consistent prompting |
| **Answer Extraction** | "Answer:" format | Standardized parsing |
| **Numerical Tolerance** | Exact integer match | Objective evaluation |
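
A minimal sketch of this protocol's scoring step, with hypothetical helper names (official harness implementations differ in detail):

```python
import re

def extract_answer(completion: str) -> str | None:
    # Take the text after the final "Answer:" marker (standardized parsing),
    # then pull out the first number, ignoring thousands separators.
    tail = completion.rsplit("Answer:", 1)[-1]
    numbers = re.findall(r"-?\d+(?:\.\d+)?", tail.replace(",", ""))
    return numbers[0] if numbers else None

def exact_match(completion: str, gold: float) -> bool:
    # Exact numerical match against the gold answer, per the table above.
    predicted = extract_answer(completion)
    return predicted is not None and float(predicted) == float(gold)

assert exact_match("3 * 5 = 15, plus 27 gives 42. Answer: 42", 42)
```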

Evaluation Modes

MGSM supports two primary evaluation modes:

| Mode | Description | Code | Use Case |
|---|---|---|---|
| **Direct Answer** | Model provides answer only | `mgsm_direct_*` | Baseline performance |
| **Chain-of-Thought** | Model shows reasoning steps | `mgsm_cot_native_*` | Reasoning evaluation |
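
The `mgsm_direct_*` and `mgsm_cot_native_*` names match the task naming in EleutherAI's lm-evaluation-harness. A hedged sketch of running both modes through its Python API follows (argument details can vary between harness versions, and the model ID is purely illustrative):

```python
# Sketch: evaluate one model on both MGSM modes via lm-evaluation-harness
# (pip install lm-eval). Exact API details may differ across versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=google/gemma-2-9b-it",  # illustrative model ID
    tasks=["mgsm_direct_es", "mgsm_cot_native_es"],
    num_fewshot=8,  # matches the 8-shot protocol described above
)
print(results["results"])
```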

Performance Analysis

Current Leaderboard (2025)

Top performing models on MGSM (average across all languages):

| Rank | Model | Average Accuracy | Best Language | Worst Language |
|---|---|---|---|---|
| 1 | Claude 3.5 Sonnet | 91.60% | English/German | Bengali |
| 1 | Meta Llama 3.1 405B | 91.60% | Spanish | Telugu |
| 3 | GPT-4o | ~90.5% | French | Swahili |
| 4 | Gemini Ultra | ~89% | Chinese | Bengali |
| 5 | GPT-4 | ~88% | German | Telugu |

Language-Specific Performance

Performance varies significantly across languages[1]:

| Language Category | Languages | Typical Performance | Performance Gap |
|---|---|---|---|
| **High-Resource** | Spanish, French, German | 85-92% | Baseline |
| **Asian Languages** | Chinese, Japanese, Thai | 80-88% | -5% to -10% |
| **Low-Resource** | Swahili, Bengali, Telugu | 70-85% | -10% to -20% |

Chain-of-Thought Impact

The effect of chain-of-thought prompting on MGSM:

| Model Size | Direct Answer | With CoT | Improvement |
|---|---|---|---|
| Small (<10B) | 20-30% | 25-35% | +5% |
| Medium (10-50B) | 40-60% | 55-75% | +15% |
| Large (>50B) | 70-85% | 85-92% | +10-15% |
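
The gap quantified above is driven by prompt format. A minimal sketch contrasting the two prompt styles with a single few-shot exemplar (the exemplar text is invented for illustration; MGSM uses 8 exemplars per language):

```python
# Sketch: direct-answer vs. chain-of-thought few-shot prompt construction.
# Exemplar problem and solution are illustrative, not taken from MGSM.
EXEMPLAR_Q = "Roger has 3 boxes with 5 pens each. How many pens does he have?"
EXEMPLAR_COT = "There are 3 boxes and each holds 5 pens, so 3 * 5 = 15."

def direct_prompt(question: str) -> str:
    # Direct mode: the exemplar shows only the final answer.
    return f"Q: {EXEMPLAR_Q}\nAnswer: 15\n\nQ: {question}\nAnswer:"

def cot_prompt(question: str) -> str:
    # CoT mode: the exemplar shows the reasoning before the answer.
    return (f"Q: {EXEMPLAR_Q}\nA: {EXEMPLAR_COT} Answer: 15\n\n"
            f"Q: {question}\nA:")
```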

Key Findings

Emergent Multilingual Reasoning

Research using MGSM revealed several important insights[1]:

1. **Scale-Dependent Emergence**: Mathematical reasoning in non-English languages emerges at similar model scales to English
2. **Universal Capabilities**: Sufficiently large models show consistent reasoning across all languages
3. **Transfer Learning**: Models trained primarily on English data can reason in other languages
4. **Resource Gap**: The performance gap between high- and low-resource languages decreases with scale

Cross-Lingual Consistency

Analysis of reasoning patterns across languages shows:

| Aspect | Finding | Implication |
|---|---|---|
| **Solution Strategies** | Consistent across languages | Universal mathematical reasoning |
| **Error Patterns** | Similar mistakes in all languages | Common failure modes |
| **CoT Quality** | Varies by language resource level | Training data influence |
| **Numerical Accuracy** | Uniform across languages | Arithmetic is language-agnostic |

Limitations and Challenges

Current Limitations

| Limitation | Description | Impact |
|---|---|---|
| **Limited Scope** | Only 250 problems per language | Limits statistical significance |
| **Translation Bias** | Some cultural contexts don't translate | Reduced naturalness |
| **Elementary Level** | Grade-school problems only | Doesn't test advanced math |
| **Near Saturation** | Top models achieve >90% | Limited discrimination |

Technical Challenges

  • **Tokenization Differences**: Various scripts affect token efficiency (see the sketch after this list)
  • **Number Representation**: Different numeral systems across languages
  • **Cultural Context**: Some word problems assume Western contexts
  • **Evaluation Consistency**: Ensuring fair comparison across languages
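
A minimal sketch of the tokenization point above, using an illustrative tokenizer and a rough Thai rendering of an English sentence (both are assumptions for demonstration):

```python
# Sketch: the same short problem costs very different token budgets
# depending on script. Tokenizer choice and Thai text are illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

samples = {
    "en": "Janet has 3 apples and buys 5 more.",
    "th": "เจเน็ตมีแอปเปิ้ล 3 ลูกและซื้อเพิ่มอีก 5 ลูก",  # rough Thai rendering
}
for lang, text in samples.items():
    # A byte-level BPE trained mostly on English splits Thai into many tokens.
    print(lang, len(tok(text)["input_ids"]))
```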

Research Impact

Influence on Multilingual AI

MGSM has significantly influenced multilingual AI research:

| Area | Impact | Development |
|---|---|---|
| **Evaluation Standards** | Established multilingual reasoning benchmarks | Adoption in major evaluations |
| **Model Development** | Drove improvements in multilingual capabilities | Better cross-lingual transfer |
| **Low-Resource Languages** | Highlighted performance gaps | Targeted improvements |
| **Reasoning Research** | Showed universal reasoning emergence | Theoretical insights |

Related Benchmarks

| Benchmark | Focus | Relation to MGSM |
|---|---|---|
| GSM8K | English math reasoning | Parent dataset |
| MGSM | Multilingual math | Current benchmark |
| MSVAMP | Multilingual variations | Complementary |
| MathQA-ML | Multilingual complex math | More advanced |

Applications and Use Cases

Practical Applications

Technologies evaluated on MGSM support:

  • **Educational Technology**: Multilingual tutoring systems
  • **Financial Services**: Cross-border calculation assistance
  • **Scientific Computing**: International collaboration tools
  • **Translation Services**: Mathematical document translation

Future Directions

Potential Extensions

1. **More Languages**: Expanding to 50+ languages
2. **Difficulty Levels**: Adding algebra and geometry
3. **Multimodal Problems**: Including diagrams and graphs
4. **Cultural Adaptation**: Region-specific problem contexts
5. **Dynamic Generation**: Procedurally generated problems

Significance

MGSM has demonstrated that mathematical reasoning is a universal capability that emerges in large language models regardless of language, challenging assumptions about language-specific limitations in AI systems. By showing that models can perform complex reasoning in low-resource languages like Bengali and Telugu nearly as well as in high-resource languages, MGSM provides evidence that fundamental cognitive capabilities in AI transcend linguistic boundaries.

The benchmark's near-saturation with top models achieving over 90% accuracy suggests both the remarkable progress in multilingual AI and the need for more challenging multilingual reasoning benchmarks. MGSM remains valuable for evaluating new models and understanding how reasoning capabilities transfer across languages, particularly for underrepresented languages in AI research.

References

  1. Shi, F., Suzgun, M., Freitag, M., Wang, X., Srivats, S., Vosoughi, S., Chung, H. W., Tay, Y., Ruder, S., Zhou, D., Das, D., & Wei, J. (2022). "Language Models are Multilingual Chain-of-Thought Reasoners". arXiv:2210.03057. Retrieved from https://arxiv.org/abs/2210.03057