Magistral
Last reviewed
May 31, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,012 words
Magistral is a family of reasoning models from Mistral AI, the French AI company, first released on June 10, 2025. [1] It is Mistral's first dedicated line of reasoning models, built to work through problems step by step rather than answering in a single pass. The launch included two models. Magistral Small is a 24-billion-parameter model released with open weights under the Apache 2.0 license, and Magistral Medium is a larger, more capable variant offered through Mistral's commercial products and API. [1][2] A few days after the announcement, Mistral published a technical report on arXiv describing the training pipeline and the reasoning behind its design. [3]
Magistral arrived when reasoning models had moved to the center of the field. OpenAI's o-series and DeepSeek-R1 had shown that letting a model produce a long internal chain of thought before its final answer could lift performance on math, coding, and logic problems by a wide margin. Until then Mistral had shipped general-purpose chat and instruct models, so Magistral marked its entry into the category. It entered partly on its own terms, training the models on its own infrastructure rather than borrowing reasoning traces from existing systems. [3]
Magistral is a large language model tuned to reason explicitly. Given a question, it writes out a working-through stage, usually wrapped in dedicated thinking tokens, and then gives a final answer that follows from that work. [2] This is the chain-of-thought pattern that defines the current generation of reasoning systems, and it tends to help most on tasks where a single forward guess is unreliable, such as competition mathematics, multi-step coding, and structured logical problems.
Mistral framed two qualities as the point of the model. The first is transparency. Because the reasoning trace is exposed rather than hidden, a user can follow how the model reached a conclusion, which Mistral pitched at regulated settings where every step of a decision may need to be auditable. [1] In the company's words, professionals get reasoning that can be traced back through its logical steps. [1] The second quality is multilingual reasoning, where the chain of thought itself happens in the user's language rather than defaulting to English. [1][3]
The two variants split along an open versus commercial line that runs through much of Mistral's catalog. Magistral Small is downloadable and modifiable under Apache 2.0, which lets developers run it locally and fine-tune it freely. [2] Magistral Medium is not open. It is reached through Le Chat, Mistral's assistant, and through the La Plateforme API, with availability also planned across cloud marketplaces including Amazon SageMaker, IBM watsonx, Azure AI, and Google Cloud. [1]
Magistral does not introduce a new pretrained backbone. Each variant adds reasoning behavior on top of an existing Mistral model. Magistral Small is built on Mistral Small 3.1, specifically the Mistral-Small-3.1-24B-Instruct-2503 checkpoint, which fixes its size at 24 billion parameters. [2] Magistral Medium is built on Mistral Medium 3, a larger model that Mistral keeps closed, so its parameter count has not been published. [3]
Both variants carry a 128k token context window, though Mistral notes that quality can fall off past roughly 40k tokens of generation and recommends keeping within that range for the open model. [2] Because Magistral Small inherits a 24B footprint, it is small enough to run on a single high-end consumer GPU or a well-equipped laptop once quantized, which is part of the appeal of the open release. [4]
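The claim about consumer hardware follows from simple arithmetic on the parameter count. The sketch below is a rough estimate only: it counts the weights alone and ignores the KV cache, activations, and runtime overhead, and the function name `weight_memory_gb` and the chosen bit widths are illustrative assumptions rather than anything Mistral specifies.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# 24 billion parameters, the size stated for Magistral Small.
MAGISTRAL_SMALL_PARAMS = 24e9

# Bit widths are illustrative: 16-bit as an unquantized baseline,
# 8-bit and 4-bit as common quantization levels (assumption).
for bits in (16, 8, 4):
    gb = weight_memory_gb(MAGISTRAL_SMALL_PARAMS, bits)
    print(f"{bits}-bit weights: ~{gb:.0f} GB")
```

Under these assumptions the weights drop from about 48 GB at 16 bits to about 12 GB at 4 bits, which is why quantization is what brings the model within reach of a single consumer card.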
The training method is the most distinctive part of the project. Mistral built Magistral with reinforcement learning using a setup it describes as a ground-up approach that relied only on its own models and infrastructure, without distilling reasoning traces from any outside model. [3] The two variants reached their reasoning ability by different routes. Magistral Medium was trained with reinforcement learning alone on top of Mistral Medium 3, and the report states that this pure RL run produced close to a 50 percent gain in AIME 2024 pass@1 over the starting checkpoint. [3] Magistral Small then learned from the larger model. It was trained with a round of supervised fine-tuning on reasoning traces drawn from Magistral Medium, used as cold-start data, followed by its own reinforcement learning stage. [2][3]
The reinforcement learning algorithm builds on GRPO, the group relative policy optimization method that DeepSeek popularized with DeepSeek-R1, but Mistral changed several pieces. [3] The report describes removing the KL divergence penalty entirely, normalizing the loss by the total length of generations in a group, normalizing advantages within each minibatch, and widening the upper clipping bound through a clip-higher setting in roughly the 0.26 to 0.28 range to keep the policy from collapsing to low entropy. It also filters out groups whose samples all receive the same advantage so they do not waste a training step. [3] This style of training, where rewards come from automatically checkable signals such as whether a math answer is correct or code passes its tests, is often grouped under reinforcement learning with verifiable rewards.
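The modifications listed above can be sketched in a few lines of plain Python. This is a simplified illustration and not Mistral's implementation: it works on precomputed per-token probability ratios instead of a real model, the lower clipping bound `eps_low=0.2` is an assumed placeholder (the source gives only the upper range), centering the advantages before rescaling is an assumption about the normalization, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards):
    """Group-relative advantage: each sample's reward minus its group mean."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def keep_informative_groups(groups):
    """Drop groups where every sample got the same reward (all advantages zero)."""
    return [g for g in groups if max(g) != min(g)]


def normalize_minibatch(advantages, eps=1e-8):
    """Center and rescale advantages using statistics of the whole minibatch."""
    mu = mean(advantages)
    sigma = pstdev(advantages)
    return [(a - mu) / (sigma + eps) for a in advantages]


def clipped_token_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped term with a wider upper bound (clip-higher)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def group_objective(token_ratios, advantages, eps_low=0.2, eps_high=0.28):
    """Average the clipped terms over all tokens in the group.

    token_ratios holds one list of per-token ratios per sampled completion.
    Dividing by the total token count normalizes by total generation length.
    No KL penalty term is added.
    """
    total_tokens = sum(len(ratios) for ratios in token_ratios)
    total = sum(
        clipped_token_objective(r, a, eps_low, eps_high)
        for ratios, a in zip(token_ratios, advantages)
        for r in ratios
    )
    return total / total_tokens


# Toy rewards: the first group is uniform and gets filtered out.
reward_groups = [[1, 1, 1, 1], [1, 0, 0, 1]]
kept = keep_informative_groups(reward_groups)
advantages = normalize_minibatch(group_advantages(kept[0]))
print(kept, advantages)
```

The uniform group is removed because every advantage in it would be zero, so it would contribute no gradient signal to the update.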
Multilingual reasoning was handled through reward shaping rather than a separate model. During training Mistral translated a fraction of the problems into languages including French, Spanish, Italian, German, Russian, and Chinese, then used a classifier to check that the question, the reasoning, and the answer all stayed in the same language, granting a small extra reward when they did. [3] As a result, the chain of thought is written in the user's language instead of being translated after the fact. The report also notes a useful side effect: running reinforcement learning on text alone tended to preserve, and sometimes improve, the base model's other abilities, such as instruction following, function calling, and multimodal understanding, rather than eroding them. [3]
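The language-consistency check can be pictured as a small bonus added on top of the correctness reward. The following is a minimal sketch under stated assumptions: the bonus value of 0.1 is a placeholder, `toy_detector` is a deliberately crude stand-in for the classifier described in the report, and both function names are hypothetical.

```python
from typing import Callable


def language_bonus(
    question: str,
    reasoning: str,
    answer: str,
    detect_language: Callable[[str], str],
    bonus: float = 0.1,  # placeholder value (assumption)
) -> float:
    """Return the bonus only if all three parts are in the same language."""
    languages = {
        detect_language(question),
        detect_language(reasoning),
        detect_language(answer),
    }
    return bonus if len(languages) == 1 else 0.0


def toy_detector(text: str) -> str:
    """Crude stand-in for a real language classifier (assumption)."""
    french_markers = {"le", "la", "est", "donc"}
    words = set(text.lower().split())
    return "fr" if words & french_markers else "en"


# Reasoning that stays in French earns the bonus.
print(language_bonus(
    "Quelle est la somme ?", "La somme est donc 4.", "La réponse est 4.",
    toy_detector,
))

# Reasoning that drifts into English does not.
print(language_bonus(
    "Quelle est la somme ?", "The sum is 4.", "La réponse est 4.",
    toy_detector,
))
```

Keeping the bonus small relative to the correctness reward means the model is nudged toward language consistency without being rewarded for fluent but wrong answers.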
To make large-scale online reinforcement learning practical, Mistral leaned on an asynchronous system. Generators produce completions continuously while training proceeds, and updated weights are pushed to those generators using NCCL without stopping generation or discarding the in-progress cache. [3] Much of the technical report is given over to these infrastructure choices, since keeping the generation and training loops fed is a large part of what makes pure RL at this scale workable.
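One consequence of this design is that a single completion can contain tokens produced under more than one version of the weights. The toy simulation below illustrates only that idea. It involves no real model, no NCCL, and no concurrency, and the class `ToyGenerator` and its methods are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGenerator:
    """Emits one token per step, tagged with the weights version that produced it."""

    weights_version: int = 0
    in_progress: list = field(default_factory=list)

    def step(self) -> None:
        # Record which weights version generated this token.
        self.in_progress.append(self.weights_version)

    def update_weights(self, new_version: int) -> None:
        # Swap weights in place; the partial completion is kept, not discarded.
        self.weights_version = new_version


gen = ToyGenerator()
for t in range(6):
    if t == 3:
        gen.update_weights(1)  # the trainer finishes an update mid-completion
    gen.step()

print(gen.in_progress)  # [0, 0, 0, 1, 1, 1]
```

The first three tokens come from the old weights and the last three from the new ones, so the generator never idles while it waits for the trainer.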
Mistral reported results on the standard reasoning benchmarks of the moment, including the AIME competition mathematics sets for 2024 and 2025, GPQA Diamond for graduate-level science questions, and LiveCodeBench for coding. The figures below are the pass@1 numbers from the technical report, with majority-vote results shown where Mistral provided them. [3]
| Benchmark | Magistral Medium | Magistral Small |
|---|---|---|
| AIME 2024 (pass@1) | 73.6% | 70.7% |
| AIME 2024 (maj@64) | 90.0% | 83.3% |
| AIME 2025 (pass@1) | 64.9% | 62.8% |
| AIME 2025 (maj@64) | 83.3% | 76.7% |
| GPQA Diamond | 70.8% | 68.2% |
| LiveCodeBench v5 | 59.4% | 55.8% |
The pattern is what you would expect from the two-tier design. Magistral Medium leads on every measure, while Magistral Small lands a few points behind despite being the open, smaller model that learned partly from its larger sibling. [3] The majority-vote rows, where the model samples many answers and takes the most common one, show how much headroom the sampling strategy can add on AIME, lifting Magistral Medium from 73.6 percent to 90 percent on the 2024 set. [1][3] These figures are Mistral's own, and comparisons against other vendors' models should be read with care, since benchmark setups and decoding settings differ between labs.
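The difference between the two metrics is easy to see on a toy example. The sketch below uses made-up sample answers for a single problem and is not Mistral's evaluation harness. The function names are illustrative, and ties in the vote are broken by whichever answer appeared first, which is an arbitrary choice.

```python
from collections import Counter


def pass_at_1(samples, correct):
    """Fraction of individual samples that match the correct answer."""
    return sum(s == correct for s in samples) / len(samples)


def majority_vote(samples):
    """Most common answer among the samples (ties go to the earliest seen)."""
    return Counter(samples).most_common(1)[0][0]


# Eight hypothetical sampled answers to one problem whose answer is "42".
samples = ["42", "42", "17", "42", "9", "42", "17", "42"]

print(pass_at_1(samples, "42"))        # 0.625
print(majority_vote(samples) == "42")  # True
```

Here a single sample is right only 62.5 percent of the time, yet the vote is correct because wrong answers are scattered while correct ones agree. The same effect, averaged over a whole benchmark, produces the gap between the pass@1 and maj@64 rows.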
Multilingual reasoning is the feature Mistral leaned on most when separating Magistral from its rivals. The blog post highlighted strong reasoning in English, French, Spanish, German, Italian, Arabic, Russian, and Simplified Chinese, and the Magistral Small model card lists support across roughly two dozen languages in total. [1][2] What matters here is not only that the model can read a non-English prompt but that its internal reasoning stays in that language, which is the behavior the language-consistency reward was designed to produce. [3] For a European company that has often positioned itself around sovereignty and language coverage, keeping the reasoning trace in the user's own language fits the broader pitch.
Magistral sits in the same space as DeepSeek-R1 and OpenAI's o-series, and the comparison is mostly about how open the models are. Like DeepSeek-R1, the Small variant ships with open weights, so it can be inspected, self-hosted, and fine-tuned, which the closed o-series models cannot be. [4] Unlike DeepSeek-R1, which is a very large mixture-of-experts model, Magistral Small is a compact 24B dense model that an individual can run on local hardware, trading peak benchmark scores for accessibility. [2][4] Several outlets described Magistral as a European entry into reasoning models, with Mistral pairing the open Small release against an enterprise Medium tier reached through Le Chat and the API. [1][5] The reasoning-heavy approach was also tied to product features. In Le Chat, a Flash Answers mode used Magistral Medium to return responses at up to ten times the token throughput of some competing setups, according to Mistral. [1]
Magistral has moved through point releases since launch. Mistral shipped Magistral 1.1 on July 24, 2025, covering both Magistral Medium 1.1 and Magistral Small 1.1 under the version codes magistral-medium-2507 and magistral-small-2507. [6]
The larger step came with Magistral 1.2 in mid-September 2025, released as magistral-medium-2509 and magistral-small-2509. [6][7] This update rebased Magistral Small on Mistral Small 3.2 and added a vision encoder, so the model could take image inputs and extend its reasoning to visual questions for the first time. [7][8] Mistral also listed quality-of-life changes, including better tone, cleaner LaTeX and Markdown formatting, shorter answers on easy prompts, and a lower tendency to fall into runaway generation loops, along with the [THINK] and [/THINK] tokens used to wrap the reasoning span. [7] The benchmark gains were sizable. On Mistral's reported figures, Magistral Small 1.2 reached 86.14 percent on AIME 2024 and 77.34 percent on AIME 2025, well above the 70.52 percent and 62.03 percent posted by Magistral Small 1.1 on the same measures, with GPQA Diamond and LiveCodeBench also improving. [7] Magistral Small 1.2 stayed open under Apache 2.0 and small enough to run on a single consumer card or a 32GB laptop once quantized. [7][8]
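Because the reasoning span is delimited by [THINK] and [/THINK], client code can separate the trace from the final answer. The following is a minimal sketch that assumes the markers appear as literal text in the decoded output. In practice a tokenizer or API may expose them as special tokens or structured fields instead, and the function name `split_reasoning` is hypothetical.

```python
import re

# Non-greedy match so only the first reasoning span is captured;
# DOTALL lets the span run across line breaks.
THINK_PATTERN = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no span is found."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched span is treated as the answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning(
    "[THINK]2 + 2 is 4.[/THINK]The answer is 4."
)
print(reasoning)  # 2 + 2 is 4.
print(answer)     # The answer is 4.
```

Separating the two parts lets an application log the trace for auditing while showing users only the final answer.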
Some limits follow directly from the design. Reasoning models spend extra tokens thinking before they answer, which raises latency and cost on hard questions and can be wasteful on simple ones, a point Mistral partly addressed in 1.2 by shortening answers on easy prompts. [7] Magistral Small's quality can degrade past roughly 40k tokens of generation even though the context window is 128k, so the practical reasoning budget is smaller than the headline figure suggests. [2] The first release of Magistral Small was text only, with image support arriving only with the 1.2 vision encoder. [7] And because Magistral Medium is closed and its parameters are unpublished, outside researchers cannot directly study the stronger of the two models, so independent verification rests mainly on the open Small variant and on Mistral's own technical report. [3] As with any single-vendor benchmark disclosure, the reported scores are best treated as a starting point rather than a settled ranking.