| MGSM | |
|---|---|
| Overview | |
| Full name | Multilingual Grade School Math |
| Abbreviation | MGSM |
| Description | A multilingual benchmark evaluating mathematical reasoning across 10 typologically diverse languages using grade-school math problems |
| Release date | 2022-10-06 |
| Latest version | 1.0 (original); Rev2 (corrected, 2025) |
| Benchmark updated | 2025 (MGSM-Rev2) |
| Authors | Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason Wei |
| Organization | Google Research |
| Technical Details | |
| Type | Mathematical Reasoning, Multilingual Evaluation |
| Modality | Text |
| Task format | Word problems requiring multi-step arithmetic |
| Number of tasks | 10 languages (plus English source) |
| Total examples | 2,750 (250 per language, 11 languages including English) |
| Evaluation metric | Exact match accuracy |
| Domains | Elementary mathematics, Arithmetic word problems |
| Languages | English, Spanish, French, German, Russian, Chinese, Japanese, Thai, Swahili, Bengali, Telugu |
| Performance | |
| Human performance | Based on GSM8K validation |
| Baseline | Varies by language and model |
| SOTA score | ~96% (frontier models, 2025) |
| SOTA model | GPT-5.3 Codex, Claude Opus 4.5 (Thinking) |
| SOTA date | 2025 |
| Saturated | Yes (top models exceed 95%) |
| Resources | |
| Website | Official website |
| Paper | Paper (arXiv) |
| Venue | ICLR 2023 |
| GitHub | Repository |
| Dataset | Download (Hugging Face) |
| Corrected version | MGSM-Rev2 |
| License | CC-BY-SA 4.0 |
| Predecessor | GSM8K |
**MGSM** (Multilingual Grade School Math) is a benchmark dataset that evaluates the mathematical reasoning capabilities of large language models across multiple languages. Released in October 2022 by Google Research and published at ICLR 2023, MGSM extends the popular GSM8K benchmark by manually translating 250 grade-school math problems into 10 typologically diverse languages[1]. The benchmark specifically tests whether chain-of-thought reasoning capabilities transfer across languages. Its central finding is that mathematical reasoning emerges as a universal capability in sufficiently large language models, even in underrepresented languages such as Bengali and Swahili that make up less than 0.01% of typical pretraining corpora[1].
As of 2025, MGSM has become one of the most widely used multilingual reasoning benchmarks. It is integrated into major evaluation frameworks such as EleutherAI's Language Model Evaluation Harness and appears on leaderboards maintained by Kaggle, Vals AI, and others. However, frontier models now routinely exceed 90% average accuracy across all languages, pushing the benchmark toward saturation and prompting the development of harder successors like MGSM-Pro[15].
Before MGSM, most evaluations of mathematical reasoning in language models were conducted exclusively in English. The GSM8K dataset, introduced by Cobbe et al. at OpenAI in 2021, provided 8,500 linguistically diverse grade-school math word problems and quickly became a standard benchmark for testing arithmetic reasoning[2]. However, GSM8K told researchers nothing about whether the reasoning abilities it measured would generalize to other languages.
This gap mattered for several reasons. First, the majority of the world's population does not speak English as a first language, so practical deployment of reasoning-capable models requires multilingual competence. Second, researchers wanted to understand whether chain-of-thought reasoning, the prompting technique introduced by Wei et al. in 2022 that dramatically improved math performance in English, would work across languages with different syntactic structures, writing systems, and levels of representation in training data[3]. Third, the field lacked a controlled experimental setup for measuring cross-lingual reasoning transfer, since existing multilingual benchmarks focused on tasks like natural language inference and question answering rather than multi-step mathematical problem-solving.
Shi et al. designed MGSM to fill this gap. By taking a fixed set of 250 problems from GSM8K's test split and translating them into 10 carefully selected languages, the researchers created a controlled environment where problem content remained constant and only the language varied. This design made it possible to isolate the effect of language on reasoning performance[1].
The 250 source problems were drawn from the GSM8K test set. Each problem is an elementary-level arithmetic word problem requiring between 2 and 8 sequential reasoning steps. The problems involve basic operations (addition, subtraction, multiplication, and division) and produce integer answers. A typical problem describes a real-world scenario, such as buying fruit at a market or dividing objects among people, and asks for a single numerical result. The GSM8K dataset itself was written by human problem authors with comprehensive quality control, achieving an estimated error rate below 2%[2].
The 10 target languages were chosen to maximize typological diversity while spanning a range of resource levels in LLM pretraining data. As the table below shows, they cover six language families (with four distinct branches of Indo-European represented), multiple writing systems, and both high-resource and low-resource settings.
| Language | ISO Code | Script | Language Family | Approximate speakers (millions) | Pretraining corpus share (PaLM) |
|---|---|---|---|---|---|
| English | en | Latin | Indo-European (Germanic) | 1,452 | ~30% |
| Spanish | es | Latin | Indo-European (Romance) | 559 | ~4.9% |
| French | fr | Latin | Indo-European (Romance) | 280 | ~3.3% |
| German | de | Latin | Indo-European (Germanic) | 132 | ~4.4% |
| Russian | ru | Cyrillic | Indo-European (Slavic) | 258 | ~1.8% |
| Chinese | zh | Chinese characters | Sino-Tibetan | 1,118 | ~0.78% |
| Japanese | ja | Mixed (Kanji, Hiragana, Katakana) | Japonic | 128 | ~1.2% |
| Thai | th | Thai script | Kra-Dai | 61 | ~0.15% |
| Bengali | bn | Bengali script | Indo-European (Indo-Aryan) | 273 | <0.01% |
| Swahili | sw | Latin | Niger-Congo (Bantu) | 200 | <0.01% |
| Telugu | te | Telugu script | Dravidian | 96 | ~0.0002% |
The selection intentionally included languages at the extremes of pretraining data availability. English, the dominant language in most LLM training sets, sits at one end, while Telugu, which constitutes roughly 0.0002% of PaLM's pretraining corpus, sits at the other[1][4]. This range allowed the researchers to study how pretraining data volume affects multilingual reasoning.
The translation methodology prioritized accuracy and naturalness. Between 1 and 5 professional native-speaker translators worked on each language. Every translator had at least 2 years of professional experience and was contractually prohibited from using machine translation, and the translations went through additional quality-control review before release[1][5].
Despite these precautions, later research revealed that some translation errors survived the review process, leading to the development of MGSM-Rev2 (discussed in a later section)[6].
Shi et al. tested four distinct prompting strategies on MGSM, each designed to probe a different aspect of multilingual reasoning:
| Strategy | Abbreviation | Description | Question language | Reasoning language |
|---|---|---|---|---|
| Direct answer | DIRECT | Model provides a numerical answer without showing intermediate steps | Target language | None (answer only) |
| Native chain-of-thought | NATIVE-COT | Model reasons step by step in the same language as the question | Target language | Target language |
| English chain-of-thought | EN-COT | Model reasons step by step in English, regardless of question language | Target language | English |
| Translate to English | TRANSLATE-EN | Question is machine-translated to English before the model solves it in English | English (translated) | English |
All experiments used 8-shot prompting, meaning 8 exemplar problem-solution pairs were provided in the prompt before the test question. The exemplars were drawn from the GSM8K training set, and for the NATIVE-COT and DIRECT strategies, the exemplars were translated into the corresponding target language[1].
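A minimal sketch of how such an 8-shot prompt can be assembled is shown below; the `Question:` and `Step-by-Step Answer:` field labels are illustrative assumptions, not the paper's exact template.

```python
# Minimal sketch of MGSM-style few-shot prompt assembly. Field labels
# are illustrative assumptions, not the original prompt format.
def build_prompt(exemplars, question, chain_of_thought=True):
    """Assemble a few-shot prompt from (question, reasoning, answer) triples."""
    parts = []
    for q, reasoning, answer in exemplars:
        if chain_of_thought:
            # NATIVE-COT / EN-COT: show the worked solution before the answer.
            parts.append(f"Question: {q}\nStep-by-Step Answer: {reasoning} "
                         f"The answer is {answer}.")
        else:
            # DIRECT: final answer only, no intermediate steps.
            parts.append(f"Question: {q}\nAnswer: {answer}.")
    suffix = "Step-by-Step Answer:" if chain_of_thought else "Answer:"
    parts.append(f"Question: {question}\n{suffix}")
    return "\n\n".join(parts)
```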
MGSM uses exact match on the final numerical answer. The model's output is parsed for a number following a standardized "The answer is" format, and this number is compared to the gold answer. The evaluation uses greedy decoding (temperature 0) to ensure reproducibility. No partial credit is given for correct intermediate steps with a wrong final answer[1].
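The scoring logic fits in a few lines. The regular expression and normalization below are a simplified sketch of this protocol, not the original evaluation code:

```python
import re

def extract_answer(completion: str) -> str | None:
    """Pull the final number after the last 'The answer is' marker."""
    matches = re.findall(r"The answer is\s*(-?[\d,\.]+)", completion)
    if not matches:
        return None
    # Strip a trailing period and English-style thousands separators.
    return matches[-1].rstrip(".").replace(",", "")

def exact_match(completion: str, gold: str) -> bool:
    """Score 1 only if the extracted final number equals the gold answer."""
    pred = extract_answer(completion)
    return pred is not None and pred == gold.replace(",", "")
```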
In addition to standard greedy decoding, the researchers evaluated self-consistency (SC), a technique introduced by Wang et al. (2022) in which multiple reasoning chains are sampled and the most common final answer is selected by majority vote[7]. Self-consistency produced consistent improvements across all languages, with particularly large gains on low-resource languages.
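A sketch of the voting procedure, reusing `extract_answer` from above (`sample_fn` is a hypothetical stand-in for a temperature-sampled model call, and `k=40` reflects a typical sample count from Wang et al.):

```python
from collections import Counter

def self_consistency(sample_fn, question, k=40):
    """Sample k reasoning chains (temperature > 0) and majority-vote
    on the extracted final answers, per Wang et al. (2022)."""
    answers = []
    for _ in range(k):
        completion = sample_fn(question)        # one sampled chain of thought
        answer = extract_answer(completion)
        if answer is not None:
            answers.append(answer)
    if not answers:
        return None
    # The most common final answer wins the vote.
    return Counter(answers).most_common(1)[0][0]
```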
The original paper evaluated three model families:
| Model | Parameters | Developer | Notes |
|---|---|---|---|
| PaLM 540B | 540 billion | Google | Largest model tested; primary focus of the study |
| PaLM 62B | 62 billion | Google | Medium-scale model for studying scale effects |
| PaLM 8B | 8 billion | Google | Smallest model for studying scale effects |
| text-davinci-002 | ~175 billion | OpenAI | GPT-3.5 variant available via API at the time |
| code-davinci-002 | ~175 billion | OpenAI | Codex variant, tested on a subset |
PaLM-540B with the TRANSLATE-EN strategy achieved the best overall performance, with an average accuracy of 55.0% across all languages. The EN-COT strategy followed at 51.3% average accuracy, and NATIVE-COT achieved 48.1%. The DIRECT strategy, which provided no intermediate reasoning steps, performed substantially worse at under 20% on most languages[1][8].
Selected per-language results for PaLM-540B (TRANSLATE-EN strategy)[1]:
| Language | Accuracy |
|---|---|
| English | 62.4% |
| Spanish | 60.0% |
| German | 57.2% |
| Chinese | 55.6% |
| French | 55.2% |
| Bengali | 53.2% |
| Swahili | 51.2% |
These results were surprising because Bengali and Swahili, which each account for less than 0.01% of PaLM's pretraining data, still achieved over 50% accuracy. The gap between the best-performing language (English at 62.4%) and the worst-performing languages was roughly 10 to 12 percentage points with the TRANSLATE-EN approach[1].
The paper also reported results for Flan-PaLM, the instruction-finetuned version of PaLM. Flan-PaLM with chain-of-thought prompting and self-consistency achieved substantially higher scores, including 69.6% on Bengali, demonstrating that instruction tuning can meaningfully improve multilingual reasoning[9].
One of the paper's central findings was that multilingual reasoning ability is strongly scale-dependent. At 8 billion parameters, PaLM showed minimal ability to solve MGSM problems in any language, with chain-of-thought prompting providing little benefit. At 62 billion parameters, English performance improved meaningfully, but non-English languages lagged behind. At 540 billion parameters, reasoning ability emerged across all tested languages, including low-resource ones[1].
This pattern matched the broader finding from Wei et al. (2022) that chain-of-thought reasoning is an emergent capability that appears at sufficient model scale[3]. MGSM extended this observation to the multilingual setting, showing that the scale threshold for non-English reasoning is similar to the threshold for English reasoning.
Across all languages and model sizes, chain-of-thought prompting produced large improvements over direct answer prediction:
| Model size | DIRECT (avg) | NATIVE-COT (avg) | EN-COT (avg) | CoT gain over DIRECT (pp) |
|---|---|---|---|---|
| PaLM 8B | ~5% | ~7% | ~8% | +2 to +3 |
| PaLM 62B | ~15% | ~25% | ~28% | +10 to +13 |
| PaLM 540B | ~18% | 48.1% | 51.3% | +30 to +33 |
The results showed that English chain-of-thought (EN-COT) slightly outperformed native-language chain-of-thought (NATIVE-COT) for most languages. This suggests that models have stronger reasoning capabilities in English, their dominant training language, and can leverage English reasoning even when the question is posed in another language[1].
The original MGSM results showed a notable performance gap between English and other languages, particularly low-resource ones. This "language gap" was initially reported as ranging from 15 to 40 percentage points depending on the model and prompting strategy[1]. For instance, with the NATIVE-COT strategy on PaLM-540B, the gap between English and the lowest-performing language could exceed 20 percentage points.
However, as discussed in the section on MGSM-Rev2 below, subsequent research showed that a substantial portion of this gap was artifactual, caused by translation errors and inconsistent answer-extraction scripts rather than genuine differences in model capability[6].
Despite the caveats about translation quality, a real correlation between pretraining data volume and benchmark performance does exist. High-resource languages (English, Spanish, French, German) consistently score higher than medium-resource languages (Chinese, Japanese, Thai), which in turn score higher than low-resource languages (Bengali, Swahili, Telugu). The magnitude of this gap decreases with model scale, however. In the largest models, even low-resource languages achieve competitive performance[1].
Analysis of model errors across languages revealed several patterns:
| Error type | Description | Language correlation |
|---|---|---|
| Arithmetic mistakes | Model computes a step incorrectly | Roughly uniform across languages |
| Step omission | Model skips a required reasoning step | More common in low-resource languages |
| Question misunderstanding | Model misinterprets what is being asked | More common in languages with complex morphology or word order |
| Unit/format confusion | Model produces an answer in the wrong format | More common in languages with different numeral systems (Bengali, Thai) |
Arithmetic errors occurred at similar rates regardless of language, supporting the idea that basic computation is language-agnostic. However, errors in problem comprehension and step organization were more frequent in low-resource languages, suggesting that the language modeling component (as opposed to the arithmetic component) is the primary bottleneck in multilingual math reasoning[1].
In 2025, Mohn et al. published "Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results," which systematically examined translation quality in the original MGSM dataset[6]. The study uncovered several categories of errors:
Semantic distortions: In the German translation, the English phrase "An orange costs 5 less than what a watermelon cost" was rendered as "Eine Orange kostet 5 Mal weniger" ("5 times less"), fundamentally changing the mathematical relationship in the problem[6].
Logic inversions: One German translation changed "How many girls are not in the girl scout?" to "Wie viele Mädchen sind bei den Pfadfinderinnen?" ("How many girls are in the scouts?"), inverting the question entirely[6].
Constraint changes: "Round all his prices to the nearest dollar" was translated into German as rounding up specifically, adding a constraint not present in the original English[6].
Temporal errors: In at least one case, "Tuesday" was mistranslated as "Thursday" in the German version[6].
Beyond translation errors, Mohn et al. found that inconsistent answer-extraction scripts compounded the problem. The original evaluation code used language-specific answer prefixes and assumed English number formatting (periods for decimals, commas for thousands separators). For languages like French, where number formatting conventions differ, this caused correct answers to be scored as incorrect. Bengali numerals posed a particular challenge; when Bengali native numerals were not converted to Arabic equivalents, accuracy dropped dramatically[6].
Improving answer extraction alone yielded a 10-percentage-point boost for French on GPT-5[6].
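The following sketch illustrates the kind of normalization involved; the digit mapping and locale rules are simplified assumptions for illustration, not the MGSM-Rev2 evaluation code:

```python
# Illustrative normalization addressing the two failure modes described
# above: non-Arabic numerals (e.g. Bengali digits) and locale-dependent
# number formatting (e.g. French "3,5" for 3.5). Simplified assumptions.
BENGALI_DIGITS = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")

def normalize_number(raw: str, lang: str) -> str:
    s = raw.translate(BENGALI_DIGITS)            # Bengali numerals -> Arabic
    if lang == "fr":
        # French: space/period group thousands, comma marks decimals.
        s = s.replace(" ", "").replace(".", "").replace(",", ".")
    else:
        s = s.replace(",", "")                   # English thousands separator
    return s
```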
The corrections fell into three main categories: fixing mistranslations in the target-language questions, repairing errors and ambiguities in the English source questions, and standardizing answer extraction (including numeral conversion).
Google released MGSM-Rev2 on GitHub, incorporating all identified corrections. Two erroneous English source questions were corrected, ambiguous questions were rephrased, and all translations were redone using Gemini with subsequent verification to ensure every question was answerable[10].
The corrections had a dramatic effect on reported cross-lingual performance gaps:
| Model | English (original) | French (original) | Gap (original) | English (corrected) | French (corrected) | Gap (corrected) |
|---|---|---|---|---|---|---|
| Gemini 2.5 Pro | 97.6% | 82.4% | 15.2% | 99.6% | 98.8% | 0.8% |
| GPT-5 | 97.6% | 80.0% | 17.6% | 99.6% | 98.4% | 1.2% |
| Claude Sonnet | ~97% | ~85% | ~12% | ~99% | ~97% | ~2.0% |
For Gemma 3 27B, the Bengali accuracy jumped from 45.2% to 91.2% after correcting numeral handling, a 46-percentage-point improvement that was entirely an artifact of evaluation methodology rather than model capability[6].
After combining corrected translations with improved answer extraction, the maximum cross-lingual accuracy gap for strong models shrank to under 2 percentage points for Gemini 2.5 Pro, under 1.6 points for GPT-5, and under 2 points for Claude Sonnet. Mohn et al. concluded that the language gap "mostly disappears" for frontier models, leading to "completely different conclusions" from those in the original paper[6].
Even with MGSM-Rev2 correcting translation errors, researchers identified another limitation: models might memorize the specific numerical values in MGSM's 250 problems, since the benchmark has been publicly available since 2022 and is widely used in training evaluations. To test this concern, Xiao et al. (2026) introduced MGSM-Pro, which applies the GSM-Symbolic approach of varying surface-level details while preserving mathematical structure[15].
MGSM-Pro creates multiple instantiations of each MGSM problem through two series of modifications (a toy sketch follows the table):
| Series | Variant | Modification |
|---|---|---|
| Symbolic (SYM) | SYM_N | Replace names with culturally relevant alternatives |
| Symbolic (SYM) | SYM_# | Change numerical values |
| Symbolic (SYM) | SYM_N# | Change both names and numbers |
| Irrelevant Context (IC) | IC_N | Add irrelevant sentences + change names |
| Irrelevant Context (IC) | IC_# | Add irrelevant sentences + change numbers |
| Irrelevant Context (IC) | IC_N# | Add irrelevant sentences + change both names and numbers |
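As referenced above, here is a toy sketch of the instantiation scheme. The template, name pool, and value ranges are invented for illustration; the real dataset uses 225 hand-built templates verified by native speakers:

```python
import random

# Toy MGSM-Pro-style perturbation: "N" variants swap the entity name,
# "#" variants resample the numbers, "IC" variants add a distractor.
TEMPLATE = ("{name} buys {a} apples at {p} dollars each. "
            "How much does {name} spend?")

def instantiate(variant: str, rng: random.Random):
    """Return (question, gold_answer) for a variant such as 'SYM_N#' or 'IC_#'."""
    name = rng.choice(["Amina", "Kofi", "Mei", "Lucie"]) if "N" in variant else "Sara"
    a, p = (rng.randint(2, 20), rng.randint(1, 9)) if "#" in variant else (6, 3)
    question = TEMPLATE.format(name=name, a=a, p=p)
    if variant.startswith("IC"):
        # Irrelevant-context variants prepend a distractor sentence.
        question = f"{name}'s friend prefers oranges. " + question
    return question, a * p

rng = random.Random(0)
print(instantiate("SYM_N#", rng))
print(instantiate("IC_#", rng))
```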
The dataset covers nine languages: four high-resource (English, Chinese, French, Japanese) and five low-resource (Swahili, Amharic, Igbo, Yoruba, Twi). Templates were created for 225 of the 250 MGSM questions, translated using Gemini 2.0 Flash, and verified by native speakers[15].
Changing entity names had minimal impact on performance. However, changing numerical values caused large accuracy drops, particularly for low-resource languages. The IC_# configuration (irrelevant context plus changed numbers) produced the most severe degradation[15].
Robustness varied significantly across models:
| Model | Original accuracy (avg) | IC_N# accuracy (avg over 5 instantiations) | Drop (pp) |
|---|---|---|---|
| Claude Sonnet 4 | 84.8% | 74.8% | -10.0 |
| DeepSeek V3 | 81.2% | 71.8% | -9.4 |
| GPT-OSS 120B | 79.5% | 71.4% | -8.1 |
| Gemini 2.5 Flash | 86.2% | 71.2% | -15.0 |
| GPT-4.1 | 80.4% | 64.0% | -16.4 |
| Gemma 3 27B | 63.0% | 54.8% | -8.2 |
Claude Sonnet 4 proved the most robust, moving from 2nd place on the original MGSM to 1st place on MGSM-Pro. Gemini 2.5 Flash, despite having the highest original accuracy, dropped to 4th place, indicating that strong performance on static benchmarks does not guarantee robustness to surface-level variations[15].
Low-resource languages suffered the most severe degradation. For Gemma 3 27B, Twi accuracy dropped from 19.6% to 9.4% and Yoruba from 45.3% to 31.2% under the IC_# configuration[15]. The authors recommended evaluating each problem using at least five digit-varying instantiations to obtain reliable accuracy estimates.
As of early 2025, the MGSM leaderboard (based on the original dataset) shows the following top performers:
| Rank | Model | Average accuracy (all languages) |
|---|---|---|
| 1 | Llama 4 Maverick | 92.3% |
| 2 | o3-mini | 92.0% |
| 3 | Claude 3.5 Sonnet (June 2024) | 91.6% |
| 3 | Claude 3.5 Sonnet (October 2024) | 91.6% |
| 5 | Llama 3.3 70B Instruct | 91.1% |
| 6 | o1-preview | 90.8% |
| 7 | Claude 3 Opus | 90.7% |
| 8 | Llama 4 Scout | 90.6% |
| 9 | GPT-4o | 90.5% |
| 10 | o1 | 89.3% |
| 11 | GPT-4 Turbo | 88.5% |
| 12 | Gemini 1.5 Pro | 87.5% |
| 13 | GPT-4o mini | 87.0% |
| 14 | Llama 3.2 90B Instruct | 86.9% |
| 15 | Claude 3.5 Haiku | 85.6% |
Frontier reasoning models such as Claude Opus 4.5 (Thinking) have reached 95.2% on MGSM, and GPT-5.3 Codex has been reported at 96%[11].
The benchmark is now considered saturated for frontier models. When the top 10 models all exceed 90% average accuracy and the best exceed 95%, the benchmark loses its ability to distinguish between models' multilingual reasoning capabilities. Vals AI, which maintains a model evaluation platform, has stopped running MGSM on new model releases for this reason[12]. The benchmark remains useful for evaluating smaller or open-source models, where a wider performance spread still exists. Models below 10 billion parameters, for example, typically score under 70% on MGSM[13].
MGSM is integrated into several widely used evaluation tools:
| Framework | Organization | MGSM support | Notes |
|---|---|---|---|
| lm-evaluation-harness | EleutherAI | Full (11 languages, direct + CoT) | Backend for the Hugging Face Open LLM Leaderboard |
| Inspect Evals | UK AI Safety Institute | Full | Used for UK government model evaluations |
| Kaggle Open Benchmarks | Kaggle | Full | Public leaderboard with community submissions |
| Vals AI Benchmarks | Vals AI | Full (deprecated for new runs) | Tracked model releases through early 2025 |
The EleutherAI evaluation harness supports both `mgsm_direct` and `mgsm_cot_native` task variants for all 11 languages, with configuration files specifying few-shot counts, answer-extraction patterns, and scoring logic[14]. The harness is used internally by organizations including NVIDIA, Cohere, BigScience, and MosaicML.
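For example, the MGSM tasks can be run through the harness's Python API. This is a sketch assuming harness v0.4+; the task names are taken from the description above and, like the model checkpoint, should be verified against the installed version:

```python
from lm_eval import simple_evaluate

# Evaluate an arbitrary Hugging Face causal LM on the MGSM task variants.
results = simple_evaluate(
    model="hf",                                  # Hugging Face backend
    model_args="pretrained=google/gemma-2-9b",   # illustrative checkpoint
    tasks=["mgsm_direct", "mgsm_cot_native"],
    num_fewshot=8,                               # matches the original setup
)
print(results["results"])
```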
MGSM scores frequently appear in technical reports for major model releases. Google's PaLM 2 technical report used MGSM to demonstrate multilingual improvements over the original PaLM[4]. Meta's Llama 3 series reports include MGSM results. Anthropic, OpenAI, and Google DeepMind all reference MGSM when reporting multilingual capabilities of their respective model families.
MGSM has spawned several lines of follow-up research:
MGSM8KInstruct: An extension that automatically translated all 7,473 GSM8K training problems into 10 languages, producing training data for multilingual math instruction tuning. The MathOctopus family of models, fine-tuned on MGSM8KInstruct, demonstrated that cross-lingual fine-tuning (training on English questions with target-language reasoning) can boost even monolingual English performance; MathOctopus-7B gained 8.4 percentage points on the English GSM8K through multilingual training[16].
mCoT (Multilingual Chain-of-Thought): Research on multilingual instruction tuning for reasoning consistency, using MGSM as the primary evaluation benchmark[17].
Zero-shot multilingual CoT: Studies on improving zero-shot chain-of-thought reasoning across languages, building on the MGSM findings that few-shot prompting is not strictly necessary for multilingual reasoning in large enough models[18].
MGSM is a direct subset and multilingual extension of GSM8K. The relationship between the two benchmarks is straightforward:
| Aspect | GSM8K | MGSM |
|---|---|---|
| Created by | Cobbe et al. (OpenAI, 2021) | Shi et al. (Google Research, 2022) |
| Problem count | ~8,500 (7,473 train + 1,319 test) | 250 (from GSM8K test set) |
| Languages | English only | 11 (English + 10 translations) |
| Primary use | English math reasoning evaluation | Multilingual math reasoning evaluation |
| Problem format | Grade-school word problems, 2-8 steps | Identical to GSM8K |
| Answer type | Integer | Integer |
The 250 MGSM problems are a proper subset of the 1,319 GSM8K test problems. A model's MGSM English score should therefore be broadly comparable to its GSM8K score, though not identical: MGSM uses only a 250-problem subset and different few-shot exemplars[1][2].
| Benchmark | Focus | Languages | Relation to MGSM |
|---|---|---|---|
| GSM8K | English math reasoning | 1 (English) | Parent dataset; MGSM draws 250 problems from GSM8K |
| MSVAMP | Multilingual math variations | 10 | Complementary; uses different problem templates |
| MATH | Competition-level math | 1 (English) | Much harder; different difficulty level |
| MGSM-Rev2 | Corrected MGSM translations | 11 | Direct replacement for original MGSM |
| MGSM-Pro | Robustness-tested multilingual math | 9 | Extension testing digit and name variation robustness |
| GSM-Symbolic | Symbolic math variation (English) | 1 (English) | Inspired the MGSM-Pro approach |
| BIG-Bench Hard | Diverse hard reasoning tasks | 1 (English) | Includes some math tasks but is broader |
| XCOPA | Cross-lingual commonsense reasoning | 11 | Shi et al. also tested on XCOPA alongside MGSM |
| XL-WiC | Cross-lingual word-in-context | 12 | Shi et al. also tested on XL-WiC alongside MGSM |
| BenchMAX | Comprehensive multilingual evaluation | 16 | Includes MGSM as one of several multilingual benchmarks |
With only 250 problems per language, MGSM is a relatively small benchmark. Statistical noise from this small sample size can affect conclusions, particularly when comparing models with similar performance levels. A 2-percentage-point difference (5 problems) may not be statistically significant[1].
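A quick normal-approximation calculation makes this noise floor concrete (a standard Wald interval, not anything reported in the MGSM paper):

```python
import math

# 95% Wald confidence interval for accuracy measured on n problems.
# With p = 0.90 and n = 250, the half-width is about 3.7 points, so a
# 2-point gap between two models sits well within sampling noise.
def wald_ci(correct: int, n: int = 250, z: float = 1.96):
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

lo, hi = wald_ci(225)                     # 225/250 = 90% observed accuracy
print(f"95% CI: [{lo:.3f}, {hi:.3f}]")    # roughly [0.863, 0.937]
```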
The problems are elementary-level, requiring only basic arithmetic. This means MGSM does not test higher mathematical reasoning such as algebra, geometry, probability, or calculus. Models that struggle with MGSM problems face fundamental limitations, but models that ace MGSM may still fail on more advanced mathematics[1].
Even with the Rev2 corrections, machine-translated or human-translated math problems may contain subtle cultural or linguistic artifacts that affect difficulty in ways unrelated to mathematical reasoning. Problem contexts (shopping scenarios, school situations) may feel more or less natural depending on the cultural background of the target language[6].
Because MGSM has been publicly available since October 2022 and is included in many benchmark suites, there is a risk of data contamination, where models may have seen MGSM problems during pretraining or fine-tuning. MGSM-Pro partially addresses this through numerical variation, but the original 250 problems remain fixed[15].
The ten target languages, while typologically diverse, represent a small fraction of the world's approximately 7,000 languages. Notable omissions include Arabic, Hindi, Korean, Indonesian, and virtually all indigenous and minority languages. Efforts like BenchMAX are beginning to address this gap[19].