Minerva (language model)
Last reviewed
Jun 3, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,548 words
Minerva is a large language model developed by Google Research that specializes in quantitative reasoning, meaning it answers mathematics, science, and engineering questions by writing out step-by-step solutions in natural language. It was introduced in a research paper and accompanying blog post at the end of June 2022. The defining idea behind Minerva is that strong mathematical reasoning can emerge from a general-purpose language model that is further trained on a large, carefully prepared corpus of technical text, with no calculator, code interpreter, or other external tool involved at inference time. At the time of its release it set state-of-the-art results on several standard reasoning benchmarks, most notably the MATH dataset of competition-style problems.
Minerva was built on top of PaLM, the 540-billion-parameter decoder-only transformer that Google had described earlier in 2022 [1][2]. Large language models had by then shown strong performance on tasks involving common-sense reasoning, question answering, and summarization, but they struggled with problems requiring precise multi-step calculation and symbolic manipulation [2]. Quantitative reasoning is a demanding test because a solver has to parse a natural-language prompt, recall relevant facts, and then carry out a chain of computations correctly to reach the answer; a single arithmetic slip is enough to spoil the result [2].
The work came out of the Blueshift team within Google Research, and the paper, "Solving Quantitative Reasoning Problems with Language Models," was authored by Aitor Lewkowycz, Guy Gur-Ari, Vedant Misra, Behnam Neyshabur, and ten co-authors [1][2]. It was posted to arXiv on 29 June 2022, and Google published its companion blog post on 30 June 2022 [1][2]. (The research division that produced it has since been folded into Google DeepMind.)
The central contribution of the paper is the training corpus rather than any change to the model architecture. Starting from already-pretrained PaLM checkpoints, the authors continued training (a form of finetuning) on a dataset that pairs natural language with formal mathematical notation [2]. The blog post describes this technical corpus as roughly 118 GB of scientific papers from the arXiv preprint server together with web pages containing mathematical expressions [1]. In token terms the paper reports about 38.5 billion tokens of this math-and-science content [2].
What made the corpus unusual was its handling of formatting. Standard text-cleaning pipelines tend to strip out HTML and markup, which destroys equations. Minerva's data pipeline instead preserved mathematical notation written in LaTeX, MathJax, and similar typesetting formats, so that expressions such as e^(iπ) + 1 = 0 reached the model intact [1][2]. The breakdown of the technical dataset was an even split between the two main sources, with general natural language data making up a small remainder that was drawn from the original PaLM pretraining mix [2]:
| Data source | Share of training mix | Tokens |
|---|---|---|
| Math web pages | 47.5% | 17.5B |
| arXiv | 47.5% | 21.0B |
| General natural language | 5% | over 100B |
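The notation-preserving cleaning step described above can be illustrated with a minimal sketch. It is not Minerva's actual pipeline, which the sources do not describe at code level; the class `MathPreservingExtractor` and the function `extract_text` are hypothetical names, and the sketch assumes TeX source appears either inline in the page text or inside MathJax-style `<script type="math/tex">` elements.

```python
from html.parser import HTMLParser


class MathPreservingExtractor(HTMLParser):
    """Strip HTML tags but keep TeX source (hypothetical helper, not Minerva's code)."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._mode = "text"  # one of "text", "math", "skip"

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            script_type = dict(attrs).get("type") or ""
            # Assumption: MathJax-style pages mark TeX with type="math/tex".
            is_tex = tag == "script" and script_type.startswith("math/tex")
            self._mode = "math" if is_tex else "skip"

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._mode = "text"

    def handle_data(self, data):
        if self._mode == "math":
            # Re-wrap the TeX source in dollar delimiters so it survives as text.
            self.parts.append(f"${data.strip()}$")
        elif self._mode == "text":
            self.parts.append(data)
        # In "skip" mode (ordinary scripts and stylesheets) the data is dropped.


def extract_text(html: str) -> str:
    """Return page text with tags removed and TeX expressions kept intact."""
    parser = MathPreservingExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join("".join(parser.parts).split())


page = (
    '<p>Euler: <script type="math/tex">e^{i\\pi} + 1 = 0</script> holds.</p>'
    "<style>p{color:red}</style><script>alert(1)</script>"
)
print(extract_text(page))
```

A conventional cleaner that discards every `<script>` element would lose the equation entirely; the only design choice here is to treat math-typed scripts as content rather than code.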
At inference time Minerva relies on prompting and sampling techniques rather than fine-tuned task heads. It is given a few worked examples in the prompt (few-shot prompting), produces a chain-of-thought or "scratchpad" solution that spells out intermediate steps, and marks a final answer at the end [2]. Crucially, the model performs its own arithmetic and algebra within that text; it is never handed a calculator or external solver [1][2]. To improve reliability the authors sample many candidate solutions with nucleus sampling and then take the most common final answer, a procedure known as majority voting or self-consistency [2]. Answers are checked for mathematical equivalence using the SymPy library, so that forms like 1/√3 and √3/3 are treated as the same [2].
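As a rough illustration of the voting step, the sketch below groups sampled final answers by mathematical equivalence and returns a representative of the largest group. It is a simplified stand-in rather than the authors' evaluation code: the function names `equivalent` and `majority_vote` are hypothetical, answers are assumed to be plain SymPy-parsable strings instead of LaTeX, and the sketch falls back to exact string matching when SymPy is not installed.

```python
from collections import Counter  # not required; kept out of the hot path below

try:
    import sympy
    from sympy.parsing.sympy_parser import parse_expr
except ImportError:  # assumption: degrade to exact string matching without SymPy
    sympy = None


def equivalent(a: str, b: str) -> bool:
    """Return True if two final-answer strings denote the same value."""
    if a.strip() == b.strip():
        return True
    if sympy is None:
        return False
    try:
        # parse_expr evaluates its input, so only use it on trusted strings.
        difference = sympy.simplify(parse_expr(a) - parse_expr(b))
    except Exception:
        return False  # unparsable answers never match anything else
    return difference == 0


def majority_vote(final_answers: list[str]) -> str:
    """Group equivalent answers; return the first member of the largest group."""
    groups: list[list[str]] = []
    for answer in final_answers:
        for group in groups:
            if equivalent(group[0], answer):
                group.append(answer)
                break
        else:
            groups.append([answer])
    # max() keeps the earliest group on ties, so ordering is deterministic.
    return max(groups, key=len)[0]


samples = ["1/sqrt(3)", "sqrt(3)/3", "1/3", "sqrt(3)/3", "1/3"]
print(majority_vote(samples))
```

With SymPy available, `1/sqrt(3)` and `sqrt(3)/3` fall into one group of three, which outvotes the two `1/3` answers. Without the equivalence check, the same samples would split into three smaller groups and the vote would be less reliable.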
The team trained three Minerva models, each continued from the correspondingly sized PaLM checkpoint. The largest was built on PaLM 540B, so Minerva reached the 540-billion-parameter scale. The table below lists each model's parameter count, number of layers, and the amount of additional training it received on the technical dataset [2].
| Model | Parameters | Layers | Continued-training tokens |
|---|---|---|---|
| Minerva 8B | 8.63B | 32 | 164B |
| Minerva 62B | 62.50B | 64 | 109B |
| Minerva 540B | 540.35B | 118 | 26B |
The authors noted that the 540B model was relatively undertrained on the technical corpus compared with the smaller models, yet still delivered the best results [2].
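One way to see the imbalance is to divide each model's continued-training tokens by its parameter count, using the figures from the table above. Tokens per parameter is an illustrative ratio chosen for this sketch, not a metric taken from the paper, and `tokens_per_parameter` is a hypothetical helper.

```python
# Figures copied from the table: (parameters, continued-training tokens), in billions.
MODELS = {
    "Minerva 8B": (8.63, 164.0),
    "Minerva 62B": (62.50, 109.0),
    "Minerva 540B": (540.35, 26.0),
}


def tokens_per_parameter(params_b: float, tokens_b: float) -> float:
    """Continued-training tokens seen per model parameter."""
    return tokens_b / params_b


ratios = {name: tokens_per_parameter(p, t) for name, (p, t) in MODELS.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} tokens per parameter")
```

The ratios come out at roughly 19, 1.7, and 0.05 respectively, so the largest model saw several hundred times fewer technical tokens per parameter than the smallest.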
Minerva was evaluated on three existing benchmarks plus a new set the authors assembled. MATH is a collection of about 12,000 middle- and high-school competition problems written in LaTeX; GSM8K consists of grade-school math word problems; and MMLU-STEM is the science, technology, engineering, and mathematics subset of the MMLU exam [2]. The authors also gathered 272 undergraduate-level problems from MIT OpenCourseWare, which they call OCWCourses, to probe multi-step scientific reasoning [2].
With majority voting, Minerva 540B achieved the best score on all four benchmarks. The figures below are the majority-vote (maj1@k) results for each model size, followed by OpenAI's davinci-002 evaluated under the same conditions and the previously published state of the art [1][2]:
| Model | MATH | GSM8K | MMLU-STEM | OCWCourses |
|---|---|---|---|---|
| Minerva 8B | 25.4% | 28.4% | 43.4% | 12.5% |
| Minerva 62B | 43.4% | 68.5% | 63.5% | 23.5% |
| Minerva 540B | 50.3% | 78.5% | 75.0% | 30.8% |
| OpenAI davinci-002 | 19.1% | n/a | n/a | 14.8% |
| Previous SOTA | 6.9% | 74.4% | 54.9% | n/a |
The jump on MATH was the most striking: from a previous best of 6.9% to 50.3% [2]. The MMLU-STEM result rose from the prior best of 54.9% to 75.0%, while the GSM8K gain was more modest because PaLM 540B had already reached 74.4% on that benchmark using majority voting [2]. Majority voting consistently raised accuracy over single-sample (greedy) decoding; for example, the 540B model scored 33.6% on MATH with one sample versus 50.3% with voting [2]. The number of samples used for voting varied by benchmark, with k = 256 for MATH on the smaller models, k = 64 for the 540B model on MATH, and k = 16 for MMLU-STEM [2]. On the 272 OCWCourses problems, the 540B model answered 30.8%, close to a third, correctly [1][2]. As a further illustration, Minerva 62B scored 57% on Poland's 2022 national math exam, matching that exam's 2021 national average, and the 540B model reached 65% [2].
The authors were explicit about what the approach could not do. Because Minerva has no built-in way to verify its own chain of reasoning, it can reach a correct final answer through faulty intermediate steps, a phenomenon they term false positives. On a sample of MATH problems they estimated the false-positive rate of the 62B model at about 8% on average, rising with problem difficulty [2]. The model can also make ordinary calculation errors and reasoning errors, and in a minority of cases it hallucinates an equation or mathematical fact that is not real [2]. In their analysis of mistakes made by the 8B model, incorrect reasoning and incorrect calculation were the two largest categories [2].
Three broader limitations are listed in the paper. First, there is no automatic way to check the correctness of a solution's reasoning, in contrast to formal-proof systems where verification is intrinsic. Second, with no access to external tools such as a calculator or Python interpreter, Minerva is limited on tasks that need heavy numerical computation. Third, because it was trained on a very large corpus, the authors had little direct control over exactly which capabilities it acquired [2]. The team also ran several memorization checks, including evaluating modified versions of problems, and found little evidence that the model's scores could be explained by having memorized test items [2].
Minerva was an influential demonstration that a general language model, with the right data and inference-time techniques and without any external calculator, could close much of the gap on quantitative reasoning benchmarks that had long resisted such models [1][2]. Its leap on the MATH dataset reframed expectations for how far scaling and high-quality domain data could push mathematical performance, and the work fed into Google's later reasoning research, including the multimodal Gemini models. The MATH benchmark it helped popularize remained a standard yardstick; later systems such as DeepSeekMath surpassed Minerva's MATH score while using far fewer parameters, underscoring how quickly the field advanced [3]. In its discussion of societal impact, the paper suggested an accessible and affordable math tutor as a natural application, while cautioning that performance remained well below human level and that answers still could not be automatically verified [2].