Phi-4 Reasoning
Last reviewed
May 16, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 3,767 words
Phi-4-reasoning is a 14 billion parameter open weight reasoning model released by Microsoft Research on April 30, 2025. It was launched together with Phi-4-reasoning-plus, a sibling checkpoint that adds a short reinforcement learning phase on top of the same supervised fine tuning recipe. Both models are post trained from the same base, the dense Phi-4 released in December 2024, and inherit its decoder only Transformer architecture, 32 thousand token context window, and English first training distribution. The reasoning variants were distributed through Hugging Face under the MIT license, with weights, tokenizer, and inference code published the same day as the technical report.
What makes the pair notable is the gap between their size and their benchmark numbers. Phi-4-reasoning-plus reaches 78 percent on AIME 2025 and 81.3 percent on AIME 2024, scores that compete directly with OpenAI o1 and with the full 671 billion parameter DeepSeek R1 on the same problems. Microsoft positioned the release as evidence that a supervised fine tuning curriculum on roughly 1.4 million curated prompts, paired with chain of thought traces distilled from OpenAI's o3-mini, can lift a small dense model to frontier reasoning behavior without the massive scale that defined earlier reasoning systems. Phi-4-reasoning sits between the larger frontier reasoners and the much smaller Phi-4-mini-reasoning, released at around the same time, which applies a related distillation recipe at a 3.8 billion parameter footprint.
The Phi family began at Microsoft Research in mid 2023 with the Textbooks Are All You Need paper, which argued that a 1.3 billion parameter Python coding model trained on a small but high quality synthetic dataset could rival models several times its size. That paper introduced Phi-1, and the team followed it with Phi-1.5, Phi-2, and the Phi-3 generation through 2024. Each release kept the same recipe: dense Transformers, heavy synthetic data, and a focus on what Microsoft calls textbook reasoning. By the end of 2024, the Phi line had become one of the most cited examples of how data quality can substitute for parameter count in language models.
Phi-4 itself shipped in December 2024 as a 14 billion parameter dense model that was built almost entirely on synthetic and filtered web data. It was strong on math word problems and competition style questions but, like every base model in the family, it answered in a single pass without an explicit chain of thought. Through 2024 and into 2025, the rest of the field had moved in a different direction. OpenAI's o1 in September 2024, DeepSeek R1 in January 2025, and a small wave of open weight reasoners such as QwQ from Alibaba in late 2024 all demonstrated that test time compute, in the form of long chain of thought traces and self verification, could lift math and code performance well beyond what fixed pretraining could deliver.
Microsoft entered that wave on April 30, 2025 with two simultaneous releases. Phi-4-reasoning and Phi-4-reasoning-plus were positioned as small dense reasoning models that could match or exceed much larger systems on math, science, and code benchmarks while remaining cheap to run. The technical report, posted to arXiv as 2504.21318 the same day, was unusually candid about which design choices mattered: data curation, boundary case selection, and a deliberately conservative reinforcement learning stage that ran for only about 90 steps. The same week, Microsoft published a companion paper on Phi-4-mini-reasoning, a 3.8 billion parameter model built with a related distillation recipe and aimed at on device deployment.
The Phi-4 reasoning release also fit into a broader argument Microsoft Research had been making since Phi-1. Frontier behavior, the team kept claiming, was not solely a function of parameter count. Phi-4-reasoning was framed as the latest data point in that argument: a 14 billion parameter model that, after a post training run measured in days rather than months, scored within striking distance of the largest closed systems on a battery of competition problems.
The April 30 release packaged two models that share the same base, the same tokenizer, the same 32 thousand token context window, and almost all of the same supervised fine tuning data. They differ only in a final reinforcement learning stage.
Phi-4-reasoning is the supervised fine tuned checkpoint. It takes the base Phi-4 weights and trains them on roughly 16 billion tokens of synthetic chain of thought traces, of which about 8.3 billion are unique, generated by OpenAI's o3-mini in high reasoning mode. The model learns to produce a two section response: a <think> block that contains its reasoning, followed by a final solution section that the user is expected to consume. The chat template enforces this structure, and the system prompt that ships with the model gives the model explicit instructions on how to use the thinking block.
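Because the response has two sections, downstream code usually wants to separate the reasoning from the final answer. The following is a minimal sketch of that split, assuming the output contains a literal `<think>...</think>` pair as described above; the `split_response` helper and `ParsedResponse` type are hypothetical names introduced here, not part of any Microsoft tooling.

```python
import re
from typing import NamedTuple


class ParsedResponse(NamedTuple):
    reasoning: str  # text found inside <think>...</think>, or "" if absent
    solution: str   # everything after the closing </think> tag


# Non-greedy match on the first <think>...</think> pair; DOTALL lets "." span newlines.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(text: str) -> ParsedResponse:
    """Separate the reasoning block from the user-facing solution."""
    match = _THINK_RE.search(text)
    if match is None:
        # No complete thinking block (for example, generation was truncated):
        # treat the whole output as the solution rather than guessing.
        return ParsedResponse(reasoning="", solution=text.strip())
    return ParsedResponse(
        reasoning=match.group(1).strip(),
        solution=text[match.end():].strip(),
    )


sample = "<think>\n2 + 2 is 4; double-check: 4 - 2 = 2.\n</think>\nThe answer is 4."
parsed = split_response(sample)
print(parsed.solution)
```

Falling back to the whole text when no closing tag is found is a design choice for this sketch; a stricter caller might prefer to raise an error on truncated output instead.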
Phi-4-reasoning-plus is the same checkpoint after an additional reinforcement learning stage using Group Relative Policy Optimization, the same algorithm DeepSeek used for R1. The RL stage focused exclusively on mathematical reasoning, drawing from a seed set of 72,401 problems and running for only about 90 optimization steps over roughly 6,000 examples, with 8 sampled trajectories per example. Microsoft selected the final RL checkpoint based on the best observed AIME 2024 score during training. The plus variant generates noticeably longer responses than the SFT only model, roughly 50 percent more tokens on average, which translates into higher accuracy at the cost of more inference compute.
The table below summarizes the headline specifications, which are identical for both models except where noted.
| Attribute | Phi-4-reasoning | Phi-4-reasoning-plus |
|---|---|---|
| Parameters | 14 billion | 14 billion |
| Architecture | Dense decoder only Transformer | Dense decoder only Transformer |
| Context length | 32,000 tokens | 32,000 tokens (tested to 64K) |
| Tokenizer | Same as Phi-4 | Same as Phi-4 |
| Training tokens | About 16B (8.3B unique) | Same SFT plus RL |
| RL algorithm | None | GRPO, about 90 steps |
| Reasoning trace teacher | OpenAI o3-mini | OpenAI o3-mini |
| Training hardware | 32 H100-80GB GPUs | 32 H100-80GB GPUs |
| Training duration | About 2.5 days | About 2.5 days plus RL |
| Training data cutoff | March 2025 | March 2025 |
| Release date | April 30, 2025 | April 30, 2025 |
| License | MIT | MIT |
A practical consequence of having two checkpoints is that users with different latency budgets can pick the one that fits. Phi-4-reasoning is faster because it terminates its thought block sooner; Phi-4-reasoning-plus answers harder problems more reliably at the cost of generating roughly half again as many tokens. Both models use the same recommended sampling configuration, with temperature 0.8, top_p 0.95, top_k 50, and do_sample=True. The Hugging Face model cards explicitly note that deterministic decoding tends to hurt reasoning quality.
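To illustrate how the three sampling knobs interact, here is a toy, pure-Python sketch of temperature scaling followed by top-k and then top-p (nucleus) filtering. It uses the values quoted above as defaults, but it is an illustration of the general technique under assumed ordering (temperature, then top-k, then top-p), not the code path of any particular inference library; `sampling_distribution` and `sample_token` are hypothetical helper names.

```python
import math
import random

# Values quoted in the text; everything else here is illustrative.
TEMPERATURE, TOP_K, TOP_P = 0.8, 50, 0.95


def sampling_distribution(logits, temperature=TEMPERATURE, top_k=TOP_K, top_p=TOP_P):
    """Return {token_id: prob} after temperature scaling, top-k, then top-p filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]

    # Rank tokens by weight and keep only the k most likely ones.
    ranked = sorted(((w, i) for i, w in enumerate(weights)), reverse=True)[:top_k]
    total = sum(w for w, _ in ranked)
    ranked = [(w / total, i) for w, i in ranked]

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


def sample_token(logits, rng):
    """Draw one token id from the filtered distribution (the do_sample=True case)."""
    dist = sampling_distribution(logits)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -5.0]
print(sampling_distribution(toy_logits))
print(sample_token(toy_logits, random.Random(0)))
```

Setting `top_k=1` in this sketch collapses the distribution onto a single token, which corresponds to the deterministic decoding that the model cards advise against.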
Neither variant changes the underlying architecture. Both still run as dense decoder only Transformers without speculative decoding, mixture of experts routing, or any sparse attention pattern. That made integration into existing inference stacks straightforward; within a week of release, llama.cpp, vLLM, and Ollama all had working configurations, and Unsloth published GGUF quantized builds for local deployment on consumer GPUs.
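As a rough guide to why quantized builds matter for consumer GPUs, the sketch below estimates the memory needed for the weights alone at a few storage widths. The bits-per-weight values are illustrative assumptions; real quantization formats mix bit widths and add metadata, and the estimate ignores the KV cache and activations, which grow with the long reasoning traces these models produce.

```python
PARAMS = 14e9  # parameter count stated in the text

# Assumed average bits per weight for each storage format (illustrative, not measured).
BITS_PER_WEIGHT = {"fp16/bf16": 16, "8-bit": 8, "4-bit": 4}


def weight_gib(params: float, bits: float) -> float:
    """Memory for the weights alone, in GiB (ignores KV cache and activations)."""
    return params * bits / 8 / 2**30


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>10}: {weight_gib(PARAMS, bits):5.1f} GiB")
```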
The technical report breaks the post training stack into two phases that the team treats as distinct experiments. The first is supervised fine tuning on chain of thought traces. The second is a short reinforcement learning phase that the team applied only to Phi-4-reasoning-plus.
Microsoft built a corpus of roughly 1.4 million prompts paired with long form reasoning answers. The prompts were collected from public competition archives and from internal synthetic generation pipelines, then filtered against two criteria. First, the team kept prompts that the base Phi-4 model could not reliably solve, so the SFT data would push the model to learn new behaviors rather than reinforce existing ones. Second, prompts had to be answerable with a clear final solution that could be automatically verified, so the team could measure pass rates and prune low quality items.
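The two filtering criteria can be sketched as a simple predicate over candidate prompts. This is a hypothetical illustration only: the `Prompt` fields, the `select_for_sft` helper, and the 0.5 pass-rate cutoff are assumptions made for the example, and the report's actual thresholds and tooling are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Prompt:
    text: str
    verifiable: bool       # True if a final answer can be checked automatically
    base_pass_rate: float  # fraction of base-model samples judged correct (0.0 to 1.0)


# Hypothetical cutoff: prompts the base model solves at or above this rate are dropped.
MAX_BASE_PASS_RATE = 0.5


def select_for_sft(prompts, max_pass_rate=MAX_BASE_PASS_RATE):
    """Keep prompts that are verifiable and not yet reliably solved by the base model."""
    return [p for p in prompts if p.verifiable and p.base_pass_rate < max_pass_rate]


candidates = [
    Prompt("easy arithmetic", verifiable=True, base_pass_rate=0.95),   # already solved
    Prompt("hard geometry", verifiable=True, base_pass_rate=0.10),     # kept
    Prompt("open-ended essay", verifiable=False, base_pass_rate=0.0),  # not checkable
]
print([p.text for p in select_for_sft(candidates)])
```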
Responses were not written by humans. The team generated them by sampling OpenAI o3-mini in high reasoning mode and collecting both the visible solution and the model's reasoning trace where available. That choice is what makes Phi-4-reasoning a distillation in the technical sense, and the report is candid about it. The team also noted that the prompt set leans heavily on math, with substantial slices for science (especially physics and chemistry at the competition level) and code. A smaller portion covers safety and alignment data, generated through Microsoft's standard red teaming pipeline.
The SFT stage ran on 32 NVIDIA H100-80GB GPUs for about 2.5 days. The model was trained on roughly 16 billion total tokens, with around 8.3 billion unique tokens, suggesting at most a couple of passes over the data. The chat template was modified during SFT to introduce the <think> and </think> markers that wrap the model's reasoning, and the system prompt was written to encourage the model to explore alternative approaches inside the thinking block before settling on a final answer.
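The "couple of passes" reading follows from simple arithmetic on the figures above. The sketch below also derives an average throughput per GPU, which rests on the assumption that all 2.5 days on all 32 GPUs were spent on useful training; that assumption is not stated in the source, so the throughput should be read as a rough lower-bound style estimate.

```python
TOTAL_TOKENS = 16e9     # tokens processed during SFT, from the text
UNIQUE_TOKENS = 8.3e9   # unique tokens, from the text
GPUS = 32
DAYS = 2.5

# Average number of times each unique token was seen.
passes = TOTAL_TOKENS / UNIQUE_TOKENS

# Assumes the full wall-clock time on every GPU was spent training.
gpu_seconds = GPUS * DAYS * 24 * 60 * 60
tokens_per_gpu_second = TOTAL_TOKENS / gpu_seconds

print(f"passes over unique data: {passes:.2f}")
print(f"tokens per GPU-second:   {tokens_per_gpu_second:.0f}")
```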
The RL stage was the more selective half of the recipe. Microsoft started from a seed set of 72,401 verifiable math problems and ran Group Relative Policy Optimization, the same algorithm that DeepSeek used to train R1 and R1-Zero. In each RL iteration, the model produced 8 rollouts per problem on a subsample of 64 seeds, and the policy was updated based on the relative reward across those rollouts. The reward signal was a simple correctness check against ground truth answers, which is why the team restricted RL to math problems where answers can be automatically verified.
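The "relative reward across those rollouts" idea can be sketched as follows: each rollout's reward is compared with the mean of its own group and scaled by the group's spread. This shows only the advantage normalization step commonly associated with GRPO, under the assumptions of a binary correctness reward and population standard deviation; it omits the policy-gradient update, clipping, and any KL term, and it is not taken from Microsoft's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (one group per problem)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption of this sketch
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight rollouts for one problem: 1 = final answer matched ground truth, 0 = it did not.
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r}  advantage={a:+.3f}")
```

One consequence visible in the sketch is that a group in which all eight rollouts are correct, or all eight are wrong, yields zero advantage everywhere, so such problems contribute no learning signal in that step.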
The surprising part of the RL stage is how short it was. The team reported that the checkpoint they shipped as Phi-4-reasoning-plus was trained for only about 90 steps and saw roughly 6,000 unique examples in total. Despite the small scale, that stage lifted AIME 2025 accuracy by about 15 percentage points over the SFT only baseline (62.9 to 78.0) and AIME 2024 accuracy by 6 points (75.3 to 81.3). The team selected the final RL checkpoint by tracking AIME 2024 scores during training, since AIME 2025 problems were held out as an evaluation set.
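The "roughly 6,000 examples" figure is consistent with the per-step numbers given for the RL stage, as the arithmetic below shows. The sketch assumes that no problem was sampled twice across steps, which the text implies by calling the examples unique but which is treated here as an assumption.

```python
STEPS = 90                # optimization steps, from the text
PROMPTS_PER_STEP = 64     # seed problems subsampled per step, from the text
ROLLOUTS_PER_PROMPT = 8   # sampled trajectories per problem, from the text
SEED_SET = 72_401         # size of the RL seed set, from the text

prompts_seen = STEPS * PROMPTS_PER_STEP            # assumes no repeats across steps
total_rollouts = prompts_seen * ROLLOUTS_PER_PROMPT
seed_fraction = prompts_seen / SEED_SET

print(f"prompts seen:   {prompts_seen}")
print(f"total rollouts: {total_rollouts}")
print(f"share of seeds: {seed_fraction:.1%}")
```

Under these assumptions the shipped checkpoint saw well under a tenth of the available seed problems.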
The report attributes the efficiency of the RL stage to the fact that the SFT model was already producing high quality reasoning traces. The RL stage did not have to teach the model to think; it had to teach the model to favor traces that resolved correctly. The team also noted that response lengths grew during RL, with Phi-4-reasoning-plus producing about 50 percent more tokens per response than Phi-4-reasoning at convergence. That tradeoff matched what other reasoning groups had reported, and the team chose to ship the longer responding checkpoint rather than impose a length penalty.
The technical report and the Hugging Face model cards publish results across math, science, and coding benchmarks. The numbers below are taken directly from those sources.
| Benchmark | Phi-4-reasoning | Phi-4-reasoning-plus | DeepSeek R1 Distill 70B | DeepSeek R1 (671B) | OpenAI o1-mini |
|---|---|---|---|---|---|
| AIME 2024 (math) | 75.3 | 81.3 | 70.0 | 79.8 | 63.6 |
| AIME 2025 (math) | 62.9 | 78.0 | 51.5 | 70.4 | 54.8 |
| OmniMath (math) | 76.6 | 81.9 | 63.4 | 85.0 | not reported |
| GPQA Diamond (science) | 65.8 | 68.9 | 66.2 | 73.0 | 60.0 |
| HMMT February 2025 (math) | 53.1 | 69.0 | not reported | not reported | not reported |
| LiveCodeBench (code) | 53.8 | 53.1 | 57.5 | 62.8 | 53.8 |
A few patterns stand out. On both the 2024 and 2025 editions of AIME, the plus variant is the strongest of the small open weight reasoners in the table and is competitive with the full 671 billion parameter DeepSeek R1. The SFT only Phi-4-reasoning is already strong on AIME 2024 but loses ground on AIME 2025, where the harder problems and the lack of RL reward shaping leave it about 15 points below the plus checkpoint. The plus variant's advantage shrinks when problems shift toward open ended physics and chemistry, as on GPQA Diamond, where DeepSeek R1 (73.0), OpenAI o1 (76.7), and Gemini 2.5 Pro (84.0) still lead.
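The size of the RL effect per benchmark can be read off the table by subtracting the SFT only score from the plus score. The short sketch below does that subtraction using the table's values; the `SCORES` dictionary is simply a transcription of those two columns.

```python
# (Phi-4-reasoning, Phi-4-reasoning-plus) scores transcribed from the table above.
SCORES = {
    "AIME 2024": (75.3, 81.3),
    "AIME 2025": (62.9, 78.0),
    "OmniMath": (76.6, 81.9),
    "GPQA Diamond": (65.8, 68.9),
    "HMMT Feb 2025": (53.1, 69.0),
    "LiveCodeBench": (53.8, 53.1),
}

# Gain from the RL stage in percentage points, rounded to one decimal place.
gains = {name: round(plus - sft, 1) for name, (sft, plus) in SCORES.items()}

for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<14} {gain:+.1f}")
```

The largest gains fall on the math contests, which matches the math-only focus of the RL stage, while the code benchmark is flat to slightly negative.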
LiveCodeBench tells a slightly different story. The plus variant does not improve over the SFT only checkpoint on code, scoring 53.1 versus 53.8, and both trail DeepSeek R1 and the larger frontier closed models. The team attributes the flat code performance to the fact that the RL stage focused entirely on math, leaving code generation to whatever the SFT data taught the model. Future iterations could plausibly close that gap by extending GRPO to verifiable coding tasks, but the April 2025 release did not.
The report also publishes results on more general benchmarks. Phi-4-reasoning-plus scored 84.9 on IFEval, 76.5 on Arena Hard, and 81.6 on HumanEvalPlus. Those numbers indicate that the reasoning post training did not destroy the base Phi-4 model's instruction following or chat behavior; the reasoning models can still answer ordinary questions in a normal one shot style if the thinking block is suppressed.
On HMMT February 2025, a contest whose problems were not in the training pool, Phi-4-reasoning-plus scored 69 percent, comfortably ahead of the SFT only variant at 53.1. The report used that result as evidence that the RL stage generalized rather than overfitting to a specific contest year. The team also tracked AIME 2025 specifically because, like HMMT 2025, it was held out of training data, providing a clean evaluation set.
Microsoft released both Phi-4-reasoning and Phi-4-reasoning-plus under the MIT license. The license, which dates back to the original MIT software releases of the 1980s, is one of the most permissive in widespread use. It allows commercial use, modification, redistribution, sublicensing, and private use, with the only requirement being that the original copyright notice and license text accompany substantial portions of the software. There are no use case restrictions, no revenue thresholds, no acceptable use policies, and no obligation to share derivatives.
That licensing choice put Phi-4-reasoning on a notably more open footing than several of its peers. Meta's Llama 3 series is governed by the Llama 3 community license, which adds a 700 million monthly active user threshold for commercial deployment. Google's Gemma terms of use include an acceptable use policy. Among open weight reasoners, DeepSeek R1 was also released under the MIT license, and Alibaba's QwQ 32B Preview ships under Apache 2.0, a similarly permissive license that adds explicit patent and notice terms.
The MIT license also covers the model weights, the tokenizer, the chat template, and the inference code published on Hugging Face. It does not cover the training data, which Microsoft has consistently declined to redistribute for the Phi family. Phi-4-reasoning is therefore an open weight model in the strict sense: the architecture, the weights, and the code are open, but the training corpus is not. That distinction matters for groups attempting to independently reproduce the model from scratch, since the SFT data and the 72,401 RL math problems are not publicly available even though the training recipe is described in the report.
One secondary licensing question concerns the o3-mini generated reasoning traces used for SFT. Microsoft is OpenAI's largest commercial partner and has access to o3-mini under whatever terms govern that relationship; the publicly available OpenAI usage policies restrict using model outputs to train competing models, but Microsoft's status as an OpenAI investor and Azure partner gives it a separate negotiated position. The technical report acknowledges the distillation explicitly and frames it as a deliberate choice rather than an attempt to obscure the data source.
The table below collects published specifications and headline benchmark scores for the leading reasoning oriented open weight and closed models as of mid 2025. Numbers come from each model's official report, model card, or product page.
| Model | Developer | Parameters | License | AIME 2024 | AIME 2025 | GPQA Diamond | Released |
|---|---|---|---|---|---|---|---|
| Phi-4-reasoning-plus | Microsoft | 14B | MIT | 81.3 | 78.0 | 68.9 | Apr 2025 |
| Phi-4-reasoning | Microsoft | 14B | MIT | 75.3 | 62.9 | 65.8 | Apr 2025 |
| DeepSeek R1 | DeepSeek | 671B (37B active) | MIT | 79.8 | 70.4 | 73.0 | Jan 2025 |
| DeepSeek R1 Distill Llama 70B | DeepSeek | 70B | MIT | 70.0 | 51.5 | 66.2 | Jan 2025 |
| OpenAI o1-mini | OpenAI | undisclosed | proprietary | 63.6 | 54.8 | 60.0 | Sep 2024 |
| OpenAI o1 | OpenAI | undisclosed | proprietary | 74.4 | 75.3 | 76.7 | Dec 2024 |
| QwQ 32B Preview | Alibaba | 32B | Apache 2.0 | 79.5 | 56.0 | 65.2 | Nov 2024 |
| Phi-4-mini-reasoning | Microsoft | 3.8B | MIT | 57.5 | not reported | 52.0 | Apr 2025 |
The most direct comparison is with DeepSeek R1 Distill Llama 70B, since it is also a distilled dense reasoner. Phi-4-reasoning-plus has roughly one fifth the parameter count and beats it on AIME 2024 (81.3 vs 70.0) and AIME 2025 (78.0 vs 51.5), and is slightly ahead on GPQA Diamond (68.9 vs 66.2; the gap can reverse depending on the eval prompt template). Against the full DeepSeek R1, Phi-4-reasoning-plus is competitive on AIME 2024 and AIME 2025 but trails on harder open ended science questions where the larger parameter count helps recall.
Against OpenAI's reasoning line, Phi-4-reasoning-plus beats o1-mini on every published math benchmark and, on the figures in the table above, edges past o1 on both AIME 2024 (81.3 vs 74.4) and AIME 2025 (78.0 vs 75.3). It trails o1 on GPQA Diamond and on more open ended reasoning. The team frames that gap as expected given the parameter count difference and the closed model's larger training budget, and uses the result to argue that small models can close most of the gap with frontier reasoners on verifiable tasks like math, where reinforcement learning has a clear reward signal.
The Phi-4-reasoning models sit alongside Phi-4-mini-reasoning, the 3.8 billion parameter variant released at almost the same time. Phi-4-mini-reasoning uses a different recipe, distilling from DeepSeek R1 rather than o3-mini, and targets on device deployment rather than competition level scores. Together the two sizes bracket Microsoft's reasoning offering, with Phi-4-reasoning aimed at competition style problems on a single workstation GPU and Phi-4-mini-reasoning aimed at agents and assistants on consumer laptops.
Reception inside the open weight research community was broadly positive, with most commentary focused on three points. The first was the gap between parameter count and benchmark score. Several research blogs and Hugging Face discussions noted that a 14 billion parameter dense model approaching DeepSeek R1 on AIME was a striking data point, and it reinforced the Phi line's central claim that data curation can substitute for scale. The second was the candor of the technical report. Microsoft published not only the recipe but also negative results, including the observation that the RL stage did not improve code performance, and that the team's instinct to run a longer RL stage was not supported by the validation curves.
The third was the o3-mini distillation. Some commenters described the work as an OpenAI distillation in disguise, since the SFT data was generated by an OpenAI model. The technical report addresses that directly and frames the o3-mini choice as a deliberate one. Microsoft's position is that distillation from a strong teacher is a reasonable way to bootstrap a small open weight reasoner, and that the resulting MIT licensed weights are still useful to the open weight community even if the underlying traces were produced by a proprietary system. That argument carried more weight after DeepSeek R1, since DeepSeek itself distilled R1 into smaller dense checkpoints such as R1 Distill Llama 70B.
Critics raised familiar points about Phi family releases. Several reviewers noted that Microsoft has never published the full training data composition, which makes it hard to independently audit or reproduce the model. Others observed that the strong scores on competition math benchmarks do not always translate to open ended chat or agent behavior, where larger models still have an advantage in breadth of knowledge. A few reviewers also flagged the 32 thousand token context window as a limitation for long form reasoning tasks, since some frontier reasoners support 128 thousand tokens or more.
On the commercial side, Phi-4-reasoning anchored Microsoft's mid 2025 reasoning offering on Azure AI Foundry. Both variants were added to the Foundry model catalog the same week, alongside the Phi-4-mini-reasoning checkpoint, and the team published deployment guides for Azure Machine Learning, ONNX Runtime, and Foundry Local. NVIDIA promoted the models through its developer materials as a target workload for H100 and L40S deployments, and the Ollama community library shipped quantized builds within a few days of the Hugging Face release.
A more practical critique came from the local model community. Phi-4-reasoning-plus generates noticeably longer responses than the SFT only checkpoint, and several reviewers reported that median response times on consumer hardware nearly doubled when switching from Phi-4-reasoning to Phi-4-reasoning-plus. For users who needed the highest reasoning accuracy, that tradeoff was acceptable; for users running interactive assistants or agent loops, the SFT only variant remained the more popular choice. By late 2025, the open weight community had largely converged on the pattern of using Phi-4-reasoning for general agent work and reserving Phi-4-reasoning-plus for batch tasks that benefited from the longer thinking budget.