MGSM (Multilingual Grade School Math)

MGSM
Overview
Full name	Multilingual Grade School Math
Abbreviation	MGSM
Description	A multilingual benchmark evaluating mathematical reasoning across 10 typologically diverse languages using grade-school math problems
Release date	2022-10-06
Latest version	1.0 (original); Rev2 (corrected, 2025)
Benchmark updated	2025 (MGSM-Rev2)
Authors	Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason Wei
Organization	Google Research
Technical Details
Type	Mathematical Reasoning, Multilingual Evaluation
Modality	Text
Task format	Word problems requiring multi-step arithmetic
Number of tasks	10 languages (plus English source)
Total examples	2,750 (250 per language, 11 languages including English)
Evaluation metric	Exact match accuracy
Domains	Elementary mathematics, Arithmetic word problems
Languages	English, Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu
Performance
Human performance	Based on GSM8K validation
Baseline	Varies by language and model
SOTA score	~96% (frontier models, 2025)
SOTA model	GPT-5.3 Codex, Claude Opus 4.5 (Thinking)
SOTA date	2025
Saturated	Yes (top models exceed 95%)
Resources
Website	Official website
Paper	Paper (arXiv)
Venue	ICLR 2023
GitHub	Repository
Dataset	Download (Hugging Face)
Corrected version	MGSM-Rev2
License	CC-BY-SA 4.0
Predecessor	GSM8K

MGSM (Multilingual Grade School Math) is a benchmark dataset that evaluates the mathematical reasoning capabilities of large language models across multiple languages. Released in October 2022 by Google Research and published at ICLR 2023, MGSM extends the popular GSM8K benchmark by manually translating 250 grade-school math problems into 10 typologically diverse languages^[1]. The benchmark specifically tests whether chain-of-thought reasoning capabilities transfer across languages. Its central finding is that mathematical reasoning emerges as a universal capability in sufficiently large language models, even in underrepresented languages such as Bengali and Swahili that make up less than 0.01% of PaLM's pretraining corpus^[1].

As of 2025, MGSM has become one of the most widely used multilingual reasoning benchmarks. It is integrated into major evaluation frameworks such as EleutherAI's Language Model Evaluation Harness and appears on leaderboards maintained by Kaggle, Vals AI, and others. However, frontier models now routinely exceed 90% average accuracy across all languages, pushing the benchmark toward saturation and prompting the development of harder successors like MGSM-Pro^[15].

Background and motivation

Before MGSM, most evaluations of mathematical reasoning in language models were conducted exclusively in English. The GSM8K dataset, introduced by Cobbe et al. at OpenAI in 2021, provided 8,500 linguistically diverse grade-school math word problems and quickly became a standard benchmark for testing arithmetic reasoning^[2]. However, GSM8K told researchers nothing about whether the reasoning abilities it measured would generalize to other languages.

This gap mattered for several reasons. First, the majority of the world's population does not speak English as a first language, so practical deployment of reasoning-capable models requires multilingual competence. Second, researchers wanted to understand whether chain-of-thought reasoning, the prompting technique introduced by Wei et al. in 2022 that dramatically improved math performance in English, would work across languages with different syntactic structures, writing systems, and levels of representation in training data^[3]. Third, the field lacked a controlled experimental setup for measuring cross-lingual reasoning transfer, since existing multilingual benchmarks focused on tasks like natural language inference and question answering rather than multi-step mathematical problem-solving.

Shi et al. designed MGSM to fill this gap. By taking a fixed set of 250 problems from GSM8K's test split and translating them into 10 carefully selected languages, the researchers created a controlled environment where problem content remained constant and only the language varied. This design made it possible to isolate the effect of language on reasoning performance^[1].

Dataset construction

Source problems

The 250 source problems were drawn from the GSM8K test set. Each problem is an elementary-level arithmetic word problem requiring between 2 and 8 sequential reasoning steps. The problems involve basic operations (addition, subtraction, multiplication, and division) and produce integer answers. A typical problem describes a real-world scenario, such as buying fruit at a market or dividing objects among people, and asks for a single numerical result. The GSM8K dataset itself was written by human problem authors with comprehensive quality control, achieving an estimated error rate below 2%^[2].

Language selection

The 10 target languages were chosen to maximize typological diversity while spanning a range of resource levels in LLM pretraining data. Together with English, they span six top-level language families (Indo-European, Sino-Tibetan, Japonic, Kra-Dai, Niger-Congo, and Dravidian), multiple writing systems, and both high-resource and low-resource settings.

Language	ISO Code	Script	Language Family	Approximate speakers (millions)	Pretraining corpus share (PaLM)
English	en	Latin	Indo-European (Germanic)	1,452	~30%
Spanish	es	Latin	Indo-European (Romance)	559	~4.9%
French	fr	Latin	Indo-European (Romance)	280	~3.3%
German	de	Latin	Indo-European (Germanic)	132	~4.4%
Russian	ru	Cyrillic	Indo-European (Slavic)	258	~1.8%
Chinese	zh	Chinese characters	Sino-Tibetan	1,118	~0.78%
Japanese	ja	Mixed (Kanji, Hiragana, Katakana)	Japonic	128	~1.2%
Thai	th	Thai script	Kra-Dai	61	~0.15%
Bengali	bn	Bengali script	Indo-European (Indo-Aryan)	273	<0.01%
Swahili	sw	Latin	Niger-Congo (Bantu)	200	<0.01%
Telugu	te	Telugu script	Dravidian	96	~0.0002%

The selection intentionally included languages at the extremes of pretraining data availability. English, the dominant language in most LLM training sets, sits at one end, while Telugu, which constitutes roughly 0.0002% of PaLM's pretraining corpus, sits at the other^[1]^[4]. This range allowed the researchers to study how pretraining data volume affects multilingual reasoning.

Translation process

The translation methodology prioritized accuracy and naturalness. Between 1 and 5 professional native-speaker translators worked on each language. Every translator had at least 2 years of professional experience and was contractually prohibited from using machine translation^[1]^[5]. The process included several quality-control steps:

Initial translation: A native speaker translated each problem and its solution into the target language.
Cultural adaptation: Where necessary, translators adapted names, units, and contextual references to be natural in the target language while preserving the mathematical structure.
Verification: At least one additional reviewer checked each translation for mathematical equivalence and linguistic naturalness.
Consistency checks: The research team verified that answer values remained identical across all language versions.

Despite these precautions, later research revealed that some translation errors survived the review process, leading to the development of MGSM-Rev2 (discussed in a later section)^[6].

Evaluation methodology

Prompting strategies

Shi et al. tested four distinct prompting strategies on MGSM, each designed to probe a different aspect of multilingual reasoning:

Strategy	Abbreviation	Description	Question language	Reasoning language
Direct answer	DIRECT	Model provides a numerical answer without showing intermediate steps	Target language	None (answer only)
Native chain-of-thought	NATIVE-COT	Model reasons step by step in the same language as the question	Target language	Target language
English chain-of-thought	EN-COT	Model reasons step by step in English, regardless of question language	Target language	English
Translate to English	TRANSLATE-EN	Question is machine-translated to English before the model solves it in English	English (translated)	English

All experiments used 8-shot prompting, meaning 8 exemplar problem-solution pairs were provided in the prompt before the test question. The exemplars were drawn from the GSM8K training set, and for the NATIVE-COT and DIRECT strategies, the exemplars were translated into the corresponding target language^[1].
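The few-shot setup described above can be sketched as simple string assembly. This is a minimal illustration, not the paper's code: the `Question:` and `Step-by-Step Answer:` prefixes, the function name, and the placeholder exemplars are all assumptions made for the example.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    exemplars: List[Tuple[str, str]],
    question: str,
    q_prefix: str = "Question: ",            # assumed prefix, not from the paper
    a_prefix: str = "Step-by-Step Answer: ",  # assumed prefix, not from the paper
) -> str:
    """Concatenate solved exemplars, then append the unsolved test question."""
    blocks = [f"{q_prefix}{q}\n{a_prefix}{a}" for q, a in exemplars]
    # The final block ends with an open answer prefix for the model to complete.
    blocks.append(f"{q_prefix}{question}\n{a_prefix}")
    return "\n\n".join(blocks)


# Eight placeholder exemplars (illustrative only, not real GSM8K items).
demo_exemplars = [
    (
        f"Tom has {n} apples and buys {n} more. How many apples does he have?",
        f"Tom starts with {n} apples and buys {n} more, so he has "
        f"{n} + {n} = {2 * n}. The answer is {2 * n}.",
    )
    for n in range(1, 9)
]

prompt = build_few_shot_prompt(
    demo_exemplars, "A pen costs 3 dollars. How much do 4 pens cost?"
)
```

For NATIVE-COT and DIRECT, the exemplar strings and prefixes would be in the target language; for EN-COT, only the reasoning text would be in English.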

Scoring

MGSM uses exact match on the final numerical answer. The model's output is parsed for a number following a standardized "The answer is" format, and this number is compared to the gold answer. The evaluation uses greedy decoding (temperature 0) to ensure reproducibility. No partial credit is given for correct intermediate steps with a wrong final answer^[1].
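A minimal sketch of this scoring rule follows. The regular expression and function names are assumptions for illustration; the original evaluation code may parse answers differently, and this version handles only English-style number formatting.

```python
import re
from typing import Optional

# Capture the number that follows "The answer is", allowing an optional
# dollar sign, a minus sign, thousands commas, and a decimal part.
ANSWER_RE = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")


def extract_answer(completion: str) -> Optional[str]:
    """Return the last number following 'The answer is', or None if absent."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # drop thousands separators


def exact_match(completion: str, gold: int) -> bool:
    """Score 1/0 on the final answer only; intermediate steps earn no credit."""
    pred = extract_answer(completion)
    return pred is not None and float(pred) == float(gold)
```

For example, `exact_match("3 + 4 = 7. The answer is 7.", 7)` is true, while a completion that never states "The answer is" scores as incorrect even if its reasoning is right.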

Self-consistency

In addition to standard greedy decoding, the researchers evaluated self-consistency (SC), a technique introduced by Wang et al. (2022) in which multiple reasoning chains are sampled and the most common final answer is selected by majority vote^[7]. Self-consistency produced consistent improvements across all languages, with particularly large gains on low-resource languages.
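The majority-vote step of self-consistency can be sketched as below. The function name is hypothetical, and the sampling of reasoning chains (done with a nonzero temperature) is assumed to have already happened; the input is the list of final answers parsed from those samples.

```python
from collections import Counter
from typing import List, Optional


def majority_vote(final_answers: List[Optional[str]]) -> Optional[str]:
    """Pick the most frequent final answer among sampled reasoning chains.

    Samples with no parseable answer (None) are ignored. Ties resolve to the
    answer that was seen first.
    """
    votes = Counter(a for a in final_answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Five sampled chains for one question: three agree on "18".
sampled = ["18", "18", "20", None, "18"]
voted = majority_vote(sampled)
```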

Original experimental results

Models tested

The original paper evaluated five models from two developers:

Model	Parameters	Developer	Notes
PaLM 540B	540 billion	Google	Largest model tested; primary focus of the study
PaLM 62B	62 billion	Google	Medium-scale model for studying scale effects
PaLM 8B	8 billion	Google	Smallest model for studying scale effects
text-davinci-002	~175 billion	OpenAI	GPT-3.5 variant available via API at the time
code-davinci-002	~175 billion	OpenAI	Codex variant, tested on a subset

Key results from PaLM-540B

PaLM-540B with the TRANSLATE-EN strategy achieved the best overall performance, with an average accuracy of 55.0% across all languages. The EN-COT strategy followed at 51.3% average accuracy, and NATIVE-COT achieved 48.1%. The DIRECT strategy, which provided no intermediate reasoning steps, performed substantially worse at under 20% on most languages^[1]^[8].

Selected per-language results for PaLM-540B (TRANSLATE-EN strategy)^[1]:

Language	Accuracy
English	62.4%
Spanish	60.0%
German	57.2%
Chinese	55.6%
French	55.2%
Bengali	53.2%
Swahili	51.2%

These results were surprising because Bengali and Swahili, which each account for less than 0.01% of PaLM's pretraining data, still achieved over 50% accuracy. The gap between the best-performing language (English at 62.4%) and the worst-performing languages was roughly 10 to 12 percentage points with the TRANSLATE-EN approach^[1].
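As a quick arithmetic check, the gap to English for each language in the table above can be computed directly. The dictionary simply restates the listed values; the function name is an assumption for the example.

```python
from typing import Dict

# PaLM-540B, TRANSLATE-EN accuracies (percent) from the table above.
translate_en: Dict[str, float] = {
    "English": 62.4,
    "Spanish": 60.0,
    "German": 57.2,
    "Chinese": 55.6,
    "French": 55.2,
    "Bengali": 53.2,
    "Swahili": 51.2,
}


def gap_to_english(scores: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point gap between English and every other language."""
    english = scores["English"]
    return {
        lang: round(english - acc, 1)
        for lang, acc in scores.items()
        if lang != "English"
    }


gaps = gap_to_english(translate_en)
largest_gap_language = max(gaps, key=gaps.get)
```

Among the languages listed, Swahili shows the largest gap at 11.2 percentage points and Spanish the smallest at 2.4.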

Flan-PaLM results

Chung et al. (2022) later reported MGSM results for Flan-PaLM, the instruction-finetuned version of PaLM. Flan-PaLM with chain-of-thought prompting and self-consistency achieved substantially higher scores, including 69.6% on Bengali, demonstrating that instruction tuning can meaningfully improve multilingual reasoning^[9].

Scale dependence

One of the paper's central findings was that multilingual reasoning ability is strongly scale-dependent. At 8 billion parameters, PaLM showed minimal ability to solve MGSM problems in any language, with chain-of-thought prompting providing little benefit. At 62 billion parameters, English performance improved meaningfully, but non-English languages lagged behind. At 540 billion parameters, reasoning ability emerged across all tested languages, including low-resource ones^[1].

This pattern matched the broader finding from Wei et al. (2022) that chain-of-thought reasoning is an emergent capability that appears at sufficient model scale^[3]. MGSM extended this observation to the multilingual setting, showing that the scale threshold for non-English reasoning is roughly similar to that for English reasoning.

Impact of chain-of-thought prompting

Chain-of-thought prompting improved on direct answer prediction at every model size, but the gain grew sharply with scale, from a few percentage points at 8B to more than 30 at 540B:

Model size	DIRECT (avg)	NATIVE-COT (avg)	EN-COT (avg)	Improvement over DIRECT (percentage points)
PaLM 8B	~5%	~7%	~8%	+2 to 3
PaLM 62B	~15%	~25%	~28%	+10 to 13
PaLM 540B	~18%	48.1%	51.3%	+30 to 33

The results showed that English chain-of-thought (EN-COT) slightly outperformed native-language chain-of-thought (NATIVE-COT) for most languages. This suggests that models have stronger reasoning capabilities in English, their dominant training language, and can leverage English reasoning even when the question is posed in another language^[1].

Cross-lingual performance analysis

The language gap

The original MGSM results showed a notable performance gap between English and other languages, particularly low-resource ones. This "language gap" was initially reported as ranging from 15 to 40 percentage points depending on the model and prompting strategy^[1]. For instance, with the NATIVE-COT strategy on PaLM-540B, the gap between English and the lowest-performing language could exceed 20 percentage points.

However, as discussed in the section on MGSM-Rev2 below, subsequent research showed that a substantial portion of this gap was artifactual, caused by translation errors and inconsistent answer-extraction scripts rather than genuine differences in model capability^[6].

Resource level and performance

Despite the caveats about translation quality, a real correlation between pretraining data volume and benchmark performance does exist. High-resource languages (English, Spanish, French, German) consistently score higher than medium-resource languages (Chinese, Japanese, Thai), which in turn score higher than low-resource languages (Bengali, Swahili, Telugu). The magnitude of this gap decreases with model scale, however. In the largest models, even low-resource languages achieve competitive performance^[1].

Error pattern analysis

Analysis of model errors across languages revealed several patterns:

Error type	Description	Language correlation
Arithmetic mistakes	Model computes a step incorrectly	Roughly uniform across languages
Step omission	Model skips a required reasoning step	More common in low-resource languages
Question misunderstanding	Model misinterprets what is being asked	More common in languages with complex morphology or word order
Unit/format confusion	Model produces an answer in the wrong format	More common in languages with different numeral systems (Bengali, Thai)

Arithmetic errors occurred at similar rates regardless of language, supporting the idea that basic computation is language-agnostic. However, errors in problem comprehension and step organization were more frequent in low-resource languages, suggesting that the language modeling component (as opposed to the arithmetic component) is the primary bottleneck in multilingual math reasoning^[1].

MGSM-Rev2: Addressing translation errors

Discovery of errors

In 2025, Mohn et al. published "Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results," which systematically examined translation quality in the original MGSM dataset^[6]. The study uncovered several categories of errors:

Semantic distortions: In the German translation, the English phrase "An orange costs 5 less than what a watermelon cost" was rendered as "Eine Orange kostet 5 Mal weniger" ("5 times less"), fundamentally changing the mathematical relationship in the problem^[6].

Logic inversions: One German translation changed "How many girls are not in the girl scout?" to "Wie viele Mädchen sind bei den Pfadfinderinnen?" ("How many girls are in the scout?"), inverting the question entirely^[6].

Constraint changes: "Round all his prices to the nearest dollar" was translated into German as rounding up specifically, adding a constraint not present in the original English^[6].

Temporal errors: In at least one case, "Tuesday" was mistranslated as "Thursday" in the German version^[6].

Answer extraction issues

Beyond translation errors, Mohn et al. found that inconsistent answer-extraction scripts compounded the problem. The original evaluation code used language-specific answer prefixes and assumed English number formatting (periods for decimals, commas for thousands separators). For languages like French, where number formatting conventions differ, this caused correct answers to be scored as incorrect. Bengali numerals posed a particular challenge; when Bengali native numerals were not converted to Arabic equivalents, accuracy dropped dramatically^[6].

Improving answer extraction alone yielded a 10-percentage-point boost for French on GPT-5^[6].
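The kind of normalization that avoids these scoring artifacts can be sketched as follows. This is an assumed, simplified routine rather than the code used by Mohn et al.: it maps Bengali digits to ASCII and lets the caller state which character is the decimal separator.

```python
import re
from typing import Optional

# Bengali digits occupy U+09E6..U+09EF; map each one to its ASCII digit.
BENGALI_TO_ASCII = {0x09E6 + i: str(i) for i in range(10)}


def normalize_number(text: str, decimal_sep: str = ".") -> Optional[float]:
    """Convert a locale-formatted number string to a float, or None on failure."""
    text = text.translate(BENGALI_TO_ASCII)
    # Remove spaces used for digit grouping, including no-break variants.
    text = re.sub(r"[\s\u00a0\u202f]", "", text)
    if decimal_sep == ",":
        # Convention such as "1.234,5" or "1 234,5": comma marks decimals.
        text = text.replace(".", "").replace(",", ".")
    else:
        # English convention "1,234.5": comma marks thousands.
        text = text.replace(",", "")
    try:
        return float(text)
    except ValueError:
        return None


bengali_eighteen = "".join(chr(0x09E6 + d) for d in (1, 8))
```

With this in place, a French-formatted `"1 234,5"` (parsed with `decimal_sep=","`) and an English-formatted `"1,234.5"` both resolve to the same value before comparison with the gold answer.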

Corrections and the revised dataset

The corrections fell into three main categories:

Clarity improvements: Resolving ambiguous phrasing (the most frequent correction type)
Numerical and factual fixes: Correcting incorrect values or logical relationships
Unit specification: Explicitly stating required answer units

Google released MGSM-Rev2 on GitHub, incorporating all identified corrections. Two erroneous English source questions were corrected, ambiguous questions were rephrased, and all translations were redone using Gemini with subsequent verification to ensure every question was answerable^[10].

Impact on the language gap

The corrections had a dramatic effect on reported cross-lingual performance gaps:

Model	English (original)	French (original)	Gap (original)	English (corrected)	French (corrected)	Gap (corrected)
Gemini 2.5 Pro	97.6%	82.4%	15.2%	99.6%	98.8%	0.8%
GPT-5	97.6%	80.0%	17.6%	99.6%	98.4%	1.2%
Claude Sonnet	~97%	~85%	~12%	~99%	~97%	~2.0%

For Gemma 3 27B, the Bengali accuracy jumped from 45.2% to 91.2% after correcting numeral handling, a 46-percentage-point improvement that was entirely an artifact of evaluation methodology rather than model capability^[6].

After combining corrected translations with improved answer extraction, the maximum cross-lingual accuracy gap for strong models shrank to under 2 percentage points for Gemini 2.5 Pro, under 1.6 percentage points for GPT-5, and under 2 percentage points for Claude Sonnet. Mohn et al. concluded that the language gap "mostly disappears" for frontier models, leading to "completely different conclusions" from those in the original paper^[6].

MGSM-Pro: Robustness evaluation

Motivation

Even with MGSM-Rev2 correcting translation errors, researchers identified another limitation: models might have memorized the specific numerical values in MGSM's 250 problems, since the benchmark has been publicly available since 2022 and is widely used in evaluation suites. To test this concern, Xiao et al. (2026) introduced MGSM-Pro, which applies the GSM-Symbolic approach of varying surface-level details while preserving mathematical structure^[15].

Methodology

MGSM-Pro creates multiple instantiations of each MGSM problem through two series of modifications:

Series	Variant	Modification
Symbolic (SYM)	SYM_N	Replace names with culturally relevant alternatives
Symbolic (SYM)	SYM_#	Change numerical values
Symbolic (SYM)	SYM_N#	Change both names and numbers
Irrelevant Context (IC)	IC_N	Add irrelevant sentences + change names
Irrelevant Context (IC)	IC_#	Add irrelevant sentences + change numbers
Irrelevant Context (IC)	IC_N#	Add irrelevant sentences + change both names and numbers

The dataset covers nine languages: four high-resource (English, Chinese, French, Japanese) and five low-resource (Swahili, Amharic, Igbo, Yoruba, Twi). Templates were created for 225 of the 250 MGSM questions, translated using Gemini 2.0 Flash, and verified by native speakers^[15].
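The symbolic variants in the table above can be illustrated with a toy template. Everything here is hypothetical: the template text, the name pool, and the number ranges are invented for the example and are not taken from MGSM-Pro.

```python
import random
from typing import Tuple

# Hypothetical template; the gold answer is derived from the slots (n * k),
# so it stays correct whenever the numbers change.
TEMPLATE = (
    "{name} buys {n} boxes of pencils. Each box holds {k} pencils. "
    "How many pencils does {name} have?"
)
NAME_POOL = ["Amina", "Kofi", "Chidi"]  # illustrative names only


def instantiate(
    rng: random.Random, vary_names: bool, vary_numbers: bool
) -> Tuple[str, int]:
    """Create one instantiation: SYM_N, SYM_#, or SYM_N# depending on the flags."""
    name = rng.choice(NAME_POOL) if vary_names else "Maria"
    n = rng.randint(2, 9) if vary_numbers else 3
    k = rng.randint(5, 20) if vary_numbers else 12
    return TEMPLATE.format(name=name, n=n, k=k), n * k


original_q, original_a = instantiate(random.Random(0), False, False)
varied_q, varied_a = instantiate(random.Random(0), True, True)
```

An irrelevant-context (IC) variant would additionally insert a sentence that mentions numbers but does not affect the computation.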

Key findings

Changing entity names had minimal impact on performance. However, changing numerical values caused large accuracy drops, particularly for low-resource languages. The IC_# configuration (irrelevant context plus changed numbers) produced the most severe degradation^[15].

Robustness varied significantly across models:

Model	Original accuracy (avg)	IC_N# accuracy (Avg-5)	Drop
Claude Sonnet 4	84.8%	74.8%	-10.0
DeepSeek V3	81.2%	71.8%	-9.4
GPT-OSS 120B	79.5%	71.4%	-8.1
Gemini 2.5 Flash	86.2%	71.2%	-15.0
GPT-4.1	80.4%	64.0%	-16.4
Gemma 3 27B	63.0%	54.8%	-8.2

Claude Sonnet 4 achieved the highest accuracy under variation, moving from 2nd place on the original MGSM to 1st place on MGSM-Pro, although GPT-OSS 120B showed the smallest absolute drop (8.1 points). Gemini 2.5 Flash, despite having the highest original accuracy, dropped to 4th place, indicating that strong performance on static benchmarks does not guarantee robustness to surface-level variations^[15].

Low-resource languages suffered the most severe degradation. For Gemma 3 27B, Twi accuracy dropped from 19.6% to 9.4% and Yoruba from 45.3% to 31.2% under the IC_# configuration^[15]. The authors recommended evaluating each problem using at least five digit-varying instantiations to obtain reliable accuracy estimates.
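The recommendation to average over several instantiations can be sketched as below. The per-question outcomes are toy values made up for the example, and the function name is an assumption.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize_instantiations(runs: List[List[bool]]) -> Tuple[float, float]:
    """Return mean accuracy and sample standard deviation across instantiations.

    Each inner list holds per-question correctness for one digit-varying
    instantiation of the benchmark.
    """
    accuracies = [sum(run) / len(run) for run in runs]
    return mean(accuracies), stdev(accuracies)


# Five toy instantiations of a four-question benchmark (made-up outcomes).
toy_runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
    [False, True, False, True],
    [True, True, False, False],
]
mean_acc, spread = summarize_instantiations(toy_runs)
```

Reporting the spread alongside the mean shows how much a single fixed instantiation could over- or understate a model's accuracy.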

Current leaderboard and benchmark saturation

Model performance (2025)

As of early 2025, the MGSM leaderboard (based on the original dataset) shows the following top performers:

Rank	Model	Average accuracy (all languages)
1	Llama 4 Maverick	92.3%
2	o3-mini	92.0%
3	Claude 3.5 Sonnet (June 2024)	91.6%
3	Claude 3.5 Sonnet (October 2024)	91.6%
5	Llama 3.3 70B Instruct	91.1%
6	o1-preview	90.8%
7	Claude 3 Opus	90.7%
8	Llama 4 Scout	90.6%
9	GPT-4o	90.5%
10	o1	89.3%
11	GPT-4 Turbo	88.5%
12	Gemini 1.5 Pro	87.5%
13	GPT-4o mini	87.0%
14	Llama 3.2 90B Instruct	86.9%
15	Claude 3.5 Haiku	85.6%

Frontier reasoning models such as Claude Opus 4.5 (Thinking) have reached 95.2% on MGSM, and GPT-5.3 Codex has been reported at 96%^[11].

Saturation

The benchmark is now considered saturated for frontier models. When the leading models all exceed 90% average accuracy and the best exceed 95%, the benchmark loses its ability to distinguish between models' multilingual reasoning capabilities. Vals AI, which maintains a model evaluation platform, has stopped running MGSM on new model releases for this reason^[12]. The benchmark remains useful for evaluating smaller or open-source models, where a wider performance spread still exists. Models below 10 billion parameters, for example, typically score under 70% on MGSM^[13].

Adoption and integration

Evaluation frameworks

MGSM is integrated into several widely used evaluation tools:

Framework	Organization	MGSM support	Notes
lm-evaluation-harness	EleutherAI	Full (11 languages, direct + CoT)	Backend for the Hugging Face Open LLM Leaderboard
Inspect Evals	UK AI Safety Institute	Full	Used for UK government model evaluations
Kaggle Open Benchmarks	Kaggle	Full	Public leaderboard with community submissions
Vals AI Benchmarks	Vals AI	Full (deprecated for new runs)	Tracked model releases through early 2025

The EleutherAI evaluation harness supports both mgsm_direct and mgsm_cot_native task variants for all 11 languages, with configuration files specifying few-shot counts, answer extraction patterns, and scoring logic^[14]. The harness is used internally by organizations including NVIDIA, Cohere, BigScience, and Mosaic ML.

Use in model releases

MGSM scores frequently appear in technical reports for major model releases. Google's PaLM 2 technical report used MGSM to demonstrate multilingual improvements over the original PaLM^[4]. Meta's Llama 3 series reports include MGSM results. Anthropic, OpenAI, and Google DeepMind all reference MGSM when reporting multilingual capabilities of their respective model families.

Downstream research

MGSM has spawned several lines of follow-up research:

MGSM8KInstruct: An extension that automatically translated all 7,473 GSM8K training problems into 10 languages, producing training data for multilingual math instruction tuning. The MathOctopus family of models, fine-tuned on MGSM8KInstruct, demonstrated that cross-lingual fine-tuning (training on English questions with target-language reasoning) can boost even monolingual English performance; MathOctopus-7B gained 8.4 percentage points on the English GSM8K through multilingual training^[16].

mCoT (Multilingual Chain-of-Thought): Research on multilingual instruction tuning for reasoning consistency, using MGSM as the primary evaluation benchmark^[17].

Zero-shot multilingual CoT: Studies on improving zero-shot chain-of-thought reasoning across languages, building on the MGSM findings that few-shot prompting is not strictly necessary for multilingual reasoning in large enough models^[18].

Relationship to GSM8K

MGSM is a direct subset and multilingual extension of GSM8K. The relationship between the two benchmarks is straightforward:

Aspect	GSM8K	MGSM
Created by	Cobbe et al. (OpenAI, 2021)	Shi et al. (Google Research, 2022)
Problem count	~8,500 (about 7,500 train + 1,000 test, as described by Cobbe et al.)	250 (from GSM8K test set)
Languages	English only	11 (English + 10 translations)
Primary use	English math reasoning evaluation	Multilingual math reasoning evaluation
Problem format	Grade-school word problems, 2-8 steps	Identical to GSM8K
Answer type	Integer	Integer

The 250 MGSM problems are a proper subset of the GSM8K test set. A model's MGSM English score should therefore be comparable to, though not identical to, its GSM8K score, because MGSM covers only part of the test set and is typically run with different few-shot exemplars^[1]^[2].

Related benchmarks

Benchmark	Focus	Languages	Relation to MGSM
GSM8K	English math reasoning	1 (English)	Parent dataset; MGSM draws 250 problems from GSM8K
MSVAMP	Multilingual math variations	10	Complementary; uses different problem templates
MATH	Competition-level math	1 (English)	Much harder; different difficulty level
MGSM-Rev2	Corrected MGSM translations	11	Direct replacement for original MGSM
MGSM-Pro	Robustness-tested multilingual math	9	Extension testing digit and name variation robustness
GSM-Symbolic	Symbolic math variation (English)	1 (English)	Inspired the MGSM-Pro approach
BIG-Bench Hard	Diverse hard reasoning tasks	1 (English)	Includes some math tasks but is broader
XCOPA	Cross-lingual commonsense reasoning	11	Shi et al. also tested on XCOPA alongside MGSM
XL-WiC	Cross-lingual word-in-context	12	Shi et al. also tested on XL-WiC alongside MGSM
BenchMAX	Comprehensive multilingual evaluation	16	Includes MGSM as one of several multilingual benchmarks

Limitations

Dataset size

With only 250 problems per language, MGSM is a relatively small benchmark. Statistical noise from this small sample size can affect conclusions, particularly when comparing models with similar performance levels. A 2-percentage-point difference (5 problems) may not be statistically significant^[1].
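A rough sense of this noise comes from the binomial standard error. The sketch below treats each problem as an independent trial and the two models as independent samples, which is a simplifying assumption: since both models answer the same 250 problems, a paired test would be more appropriate in practice. The function names are hypothetical.

```python
import math


def accuracy_standard_error(p: float, n: int = 250) -> float:
    """Binomial standard error of an accuracy p measured on n problems."""
    return math.sqrt(p * (1.0 - p) / n)


def difference_z_score(p1: float, p2: float, n: int = 250) -> float:
    """z-score for the difference of two accuracies, each measured on n problems."""
    se = math.sqrt(p1 * (1.0 - p1) / n + p2 * (1.0 - p2) / n)
    return (p1 - p2) / se


se_90 = accuracy_standard_error(0.90)   # roughly 0.019, i.e. about 1.9 points
z = difference_z_score(0.92, 0.90)      # roughly 0.78, well below 1.96
```

Under these assumptions, a 92% versus 90% comparison on 250 problems falls well short of the conventional 1.96 threshold for significance at the 5% level.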

Problem difficulty

The problems are elementary-level, requiring only basic arithmetic. This means MGSM does not test higher mathematical reasoning such as algebra, geometry, probability, or calculus. Models that struggle with MGSM problems face fundamental limitations, but models that ace MGSM may still fail on more advanced mathematics^[1].

Translation artifacts

Even with the Rev2 corrections, machine-translated or human-translated math problems may contain subtle cultural or linguistic artifacts that affect difficulty in ways unrelated to mathematical reasoning. Problem contexts (shopping scenarios, school situations) may feel more or less natural depending on the cultural background of the target language^[6].

Contamination risk

Because MGSM has been publicly available since October 2022 and is included in many benchmark suites, there is a risk of data contamination, where models may have seen MGSM problems during pretraining or fine-tuning. MGSM-Pro partially addresses this through numerical variation, but the original 250 problems remain fixed^[15].

Limited language coverage

Ten languages, while typologically diverse, represent a small fraction of the world's approximately 7,000 languages. Notable omissions include Arabic, Hindi, Korean, Indonesian, and virtually all indigenous and minority languages. Efforts like BenchMAX are beginning to address this gap^[19].

References

1. Shi, F., Suzgun, M., Freitag, M., Wang, X., Srivats, S., Vosoughi, S., Chung, H. W., Tay, Y., Ruder, S., Zhou, D., Das, D., & Wei, J. (2022). "Language Models are Multilingual Chain-of-Thought Reasoners." arXiv:2210.03057. Published at ICLR 2023. https://arxiv.org/abs/2210.03057
2. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). "Training Verifiers to Solve Math Word Problems." arXiv:2110.14168. https://arxiv.org/abs/2110.14168
3. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., & Zhou, D. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. https://arxiv.org/abs/2201.11903
4. Google. (2023). "PaLM 2 Technical Report." https://ai.google/static/documents/palm2techreport.pdf
5. Emergent Mind. "MGSM Benchmark: Multilingual Grade School Math." https://www.emergentmind.com/topics/multilingual-grade-school-math-mgsm-benchmark
6. Mohn, L. et al. (2025). "Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results." arXiv:2511.05162. https://arxiv.org/abs/2511.05162
7. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171. https://arxiv.org/abs/2203.11171
8. Liner. "Quick Review: Language Models are Multilingual Chain-of-Thought Reasoners." https://liner.com/review/language-models-are-multilingual-chainofthought-reasoners
9. Chung, H. W., Hou, L., Longpre, S., et al. (2022). "Scaling Instruction-Finetuned Language Models." arXiv:2210.11416. https://arxiv.org/abs/2210.11416
10. Google Research Datasets. "MGSM-Rev2." GitHub. https://github.com/google-research-datasets/MGSM-Rev2
11. llm-stats.com. "MGSM Benchmark Leaderboard." https://llm-stats.com/benchmarks/mgsm
12. Vals AI. "MGSM Benchmark." https://www.vals.ai/benchmarks/mgsm
13. llmdb.com. "MGSM - LLM Benchmark." https://llmdb.com/benchmarks/mgsm
14. EleutherAI. "lm-evaluation-harness: MGSM task." GitHub. https://github.com/EleutherAI/lm-evaluation-harness/blob/main/lm_eval/tasks/mgsm/README.md
15. Xiao, Y. et al. (2026). "MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation." arXiv:2601.21225. https://arxiv.org/abs/2601.21225
16. Chen, X. et al. (2024). "Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations." JMLR 25. https://jmlr.org/papers/volume25/23-0870/23-0870.pdf
17. Qin, L. et al. (2024). "mCoT: Multilingual Instruction Tuning for Reasoning Consistency in Language Models." ACL 2024. https://aclanthology.org/2024.acl-long.649.pdf
18. Huang, Y. et al. (2023). "Improving Zero-shot Chain-of-Thought Reasoning across Languages." EMNLP 2023. https://aclanthology.org/2023.emnlp-main.163.pdf
19. Li, Y. et al. (2025). "BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models." arXiv:2502.07346. https://arxiv.org/abs/2502.07346
