Phi-4 Reasoning

Updated Jul 16, 2026

Phi-4-reasoning is a 14 billion parameter open weight reasoning model released by Microsoft Research on April 30, 2025. It was launched together with Phi-4-reasoning-plus, a sibling checkpoint that adds a short reinforcement learning phase on top of the same supervised fine tuning recipe.^[1] Both models are post trained from the same base, the dense Phi-4 released in December 2024, and inherit its decoder only Transformer architecture and English first training distribution,^[9] while the context window is extended from Phi-4's 16 thousand tokens to 32 thousand.^[2] The reasoning variants were distributed through Hugging Face under the MIT license, with weights, tokenizer, and inference code published the same day as the technical report.^[3]^[4]

What makes the pair notable is the gap between their size and their benchmark numbers. Phi-4-reasoning-plus reaches 78.0 percent on AIME 2025 and 81.3 percent on AIME 2024, scores that match or exceed those reported for OpenAI o1 and the full 671 billion parameter DeepSeek R1 on the same problems.^[1] Microsoft positioned the release as evidence that supervised fine tuning on roughly 1.4 million curated prompts, paired with chain of thought traces distilled from OpenAI's o3-mini, can lift a small dense model to frontier reasoning behavior without the massive scale that defined earlier reasoning systems.^[1] Phi-4-reasoning sits between the larger frontier reasoners and the much smaller Phi-4-mini-reasoning, released the same week, which applies a related distillation recipe at a 3.8 billion parameter footprint.^[8]

Background

The Phi family began at Microsoft Research in mid 2023 with the Textbooks Are All You Need paper, which argued that a 1.3 billion parameter Python coding model trained on a small but high quality synthetic dataset could rival models several times its size. That paper introduced Phi-1, and the team followed it with Phi-1.5, Phi-2, and the Phi-3 generation through 2024. Each release kept the same recipe: dense Transformers, heavy synthetic data, and a focus on what Microsoft calls textbook reasoning. By the end of 2024, the Phi line had become one of the most cited examples of how data quality can substitute for parameter count in language models.

Phi-4 itself shipped in December 2024 as a 14 billion parameter dense model that was built almost entirely on synthetic and filtered web data. It was strong on math word problems and competition style questions but, like every base model in the family, it answered in a single pass without an explicit chain of thought.^[9] Through 2024 and into 2025, the rest of the field had moved in a different direction. OpenAI's o1 in September 2024, DeepSeek R1 in January 2025,^[10] and a small wave of open weight reasoners such as QwQ from Alibaba in late 2024 all demonstrated that test time compute, in the form of long chain of thought traces and self verification, could lift math and code performance well beyond what fixed pretraining could deliver.

Microsoft entered that wave on April 30, 2025 with two simultaneous releases. Phi-4-reasoning and Phi-4-reasoning-plus were positioned as small dense reasoning models that could match or exceed much larger systems on math, science, and code benchmarks while remaining cheap to run.^[1] The technical report, posted to arXiv as 2504.21318 the same day, was unusually candid about which design choices mattered: data curation, boundary case selection, and a deliberately conservative reinforcement learning stage that ran for only about 90 steps.^[2] The same week, Microsoft published a companion paper on Phi-4-mini-reasoning, a 3.8 billion parameter model that applies a related distillation recipe and targets on device deployment.^[8]

The Phi-4 reasoning release also fit into a broader argument Microsoft Research had been making since Phi-1. Frontier behavior, the team kept claiming, was not solely a function of parameter count. Phi-4-reasoning was framed as the latest data point in that argument: a 14 billion parameter model that, after about 2.5 days of supervised fine tuning and a short reinforcement learning stage, scored within striking distance of the largest closed systems on a battery of competition problems.^[5]

Phi-4-reasoning and Phi-4-reasoning-plus

The April 30 release packaged two models that share the same base, the same tokenizer, the same 32 thousand token context window, and almost all of the same supervised fine tuning data. They differ only in a final reinforcement learning stage.^[1]

Phi-4-reasoning is the supervised fine tuned checkpoint. It takes the base Phi-4 weights and trains them on roughly 16 billion tokens of synthetic chain of thought traces, of which about 8.3 billion are unique, generated by OpenAI's o3-mini in high reasoning mode.^[2] The model learns to produce a two section response: a <think> block that contains its reasoning, followed by a final solution section that the user is expected to consume. The chat template enforces this structure, and the system prompt that ships with the model gives explicit instructions on how to use the thinking block.^[3]
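The two section response format can be separated with a few lines of code. The sketch below is a minimal illustration that assumes the reasoning is wrapped in literal <think> and </think> markers, as described above; the helper names `ParsedResponse` and `split_response` are hypothetical and are not part of the published inference code.

```python
import re
from typing import NamedTuple


class ParsedResponse(NamedTuple):
    reasoning: str  # text found inside <think>...</think>, or "" if absent
    solution: str   # everything after the closing </think> marker


# Non-greedy match so only the first <think>...</think> pair is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_response(text: str) -> ParsedResponse:
    """Separate the reasoning block from the final solution section."""
    match = _THINK_RE.search(text)
    if match is None:
        # No complete thinking block, for example when generation was truncated.
        return ParsedResponse(reasoning="", solution=text.strip())
    return ParsedResponse(
        reasoning=match.group(1).strip(),
        solution=text[match.end():].strip(),
    )


example = "<think>2 + 2 is 4, double-check: yes.</think>\nThe answer is 4."
parsed = split_response(example)
print(parsed.solution)
```

An application that only needs the final answer would display `parsed.solution` and could log `parsed.reasoning` separately for debugging.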

Phi-4-reasoning-plus is the same checkpoint after an additional reinforcement learning stage using Group Relative Policy Optimization,^[2] the same algorithm DeepSeek used for R1.^[10] The RL stage focused exclusively on mathematical reasoning, drawing from a seed set of 72,401 problems and running for only about 90 optimization steps over roughly 6,000 examples, with 8 sampled trajectories per example. Microsoft selected the final RL checkpoint based on the best observed AIME 2024 score during training. The plus variant generates noticeably longer responses than the SFT only model, roughly 50 percent more tokens on average, which translates into higher accuracy at the cost of more inference compute.^[2]

The table below summarizes the headline specifications, which are identical for both models except where noted.

Attribute	Phi-4-reasoning	Phi-4-reasoning-plus
Parameters	14 billion	14 billion
Architecture	Dense decoder only Transformer	Dense decoder only Transformer
Context length	32,000 tokens	32,000 tokens (tested to 64K)
Tokenizer	Same as Phi-4	Same as Phi-4
Training tokens	About 16B (8.3B unique)	Same SFT plus RL
RL algorithm	None	GRPO, about 90 steps
Reasoning trace teacher	OpenAI o3-mini	OpenAI o3-mini
Training hardware	32 H100-80GB GPUs	32 H100-80GB GPUs
Training duration	About 2.5 days	About 2.5 days plus RL
Training data cutoff	March 2025	March 2025
Release date	April 30, 2025	April 30, 2025
License	MIT	MIT

A practical consequence of having two checkpoints is that users with different latency budgets can pick the one that fits. Phi-4-reasoning is faster because it terminates its thought block sooner; Phi-4-reasoning-plus answers harder problems more reliably at the cost of generating roughly half again as many tokens. Both models use the same recommended sampling configuration, with temperature 0.8, top_p 0.95, top_k 50, and do_sample=True. The Hugging Face model cards explicitly note that deterministic decoding tends to hurt reasoning quality.^[3]^[4]
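The recommended sampling settings can be captured in a small configuration helper. The sketch below is illustrative only: the four sampling values come from the text above, the key names match keyword arguments accepted by `generate()` in the Hugging Face transformers library, and the `max_new_tokens` default and the `generation_kwargs` helper are assumptions introduced for the example.

```python
# Recommended sampling settings quoted in the text above.
RECOMMENDED_SAMPLING = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 50,
}


def generation_kwargs(max_new_tokens: int = 4096, **overrides) -> dict:
    """Merge the recommended settings with caller overrides.

    The default of 4096 new tokens is an illustrative assumption, not a
    value taken from the model cards.
    """
    kwargs = {**RECOMMENDED_SAMPLING, "max_new_tokens": max_new_tokens, **overrides}
    if not kwargs["do_sample"]:
        # Greedy decoding is possible, but the model cards advise against it.
        print("warning: deterministic decoding tends to hurt reasoning quality")
    return kwargs


print(generation_kwargs())
```

With transformers, the resulting dictionary could be unpacked into a call such as `model.generate(**inputs, **generation_kwargs())`, where `model` and `inputs` are whatever the caller has already loaded and tokenized.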

Neither variant changes the underlying architecture. Both still run as dense decoder only Transformers without speculative decoding, mixture of experts routing, or any sparse attention pattern. That made integration into existing inference stacks straightforward; within a week of release, llama.cpp, vLLM, and Ollama all had working configurations, and Unsloth published GGUF quantized builds for local deployment on consumer GPUs.
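To see why quantized builds matter for consumer GPUs, it helps to estimate the weights-only memory footprint of a 14 billion parameter dense model. The sketch below is a back of the envelope calculation, not a measurement: it ignores the KV cache, activations, and the per-block scale overhead that real quantization formats add, so actual requirements are somewhat higher.

```python
PARAMS = 14e9  # parameter count stated in the article

# Bytes per weight for common storage formats. The 4-bit figure ignores the
# small per-block scale overhead that real quantization formats add.
BYTES_PER_WEIGHT = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_gigabytes(params: float, bytes_per_weight: float) -> float:
    """Weights-only footprint in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_weight / 1e9


for name, width in BYTES_PER_WEIGHT.items():
    print(f"{name:>5}: {weight_gigabytes(PARAMS, width):.0f} GB")
```

Under these assumptions the weights take about 28 GB at 16-bit precision and about 7 GB at 4 bits, which is the difference between needing a data center accelerator and fitting on a single consumer card.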

Training methodology

The technical report breaks the post training stack into two phases that the team treats as distinct experiments. The first is supervised fine tuning on chain of thought traces. The second is a short reinforcement learning phase that the team applied only to Phi-4-reasoning-plus.

Supervised fine tuning

Microsoft built a corpus of roughly 1.4 million prompts paired with long form reasoning answers. The prompts were collected from public competition archives and from internal synthetic generation pipelines, then filtered against two criteria. First, the team kept prompts that the base Phi-4 model could not reliably solve, so the SFT data would push the model to learn new behaviors rather than reinforce existing ones. Second, prompts had to be answerable with a clear final solution that could be automatically verified, so the team could measure pass rates and prune low quality items.^[2]
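The two filtering criteria can be pictured as a simple predicate over candidate prompts. The sketch below is a hypothetical illustration: the `Prompt` record, the `keep_for_sft` function, the sample data, and the 0.5 pass rate threshold are all assumptions made for the example, since the description above does not give the exact cutoff or data schema.

```python
from dataclasses import dataclass


@dataclass
class Prompt:
    text: str
    base_pass_rate: float  # fraction of base-model samples that were correct
    verifiable: bool       # True if the final answer can be checked automatically


def keep_for_sft(prompt: Prompt, max_pass_rate: float = 0.5) -> bool:
    """Keep prompts that are hard for the base model and checkable.

    `max_pass_rate` is a hypothetical threshold, not a value from the report.
    """
    return prompt.verifiable and prompt.base_pass_rate <= max_pass_rate


candidates = [
    Prompt("easy arithmetic", base_pass_rate=0.95, verifiable=True),
    Prompt("olympiad geometry", base_pass_rate=0.10, verifiable=True),
    Prompt("open-ended essay", base_pass_rate=0.20, verifiable=False),
]
selected = [p.text for p in candidates if keep_for_sft(p)]
print(selected)  # → ['olympiad geometry']
```

The easy prompt is dropped because the base model already solves it, and the essay prompt is dropped because its answer cannot be checked automatically.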

Responses were not written by humans. The team generated them by sampling OpenAI o3-mini in high reasoning mode and collecting both the visible solution and the model's reasoning trace where available.^[2] That choice is what makes Phi-4-reasoning a distillation in the technical sense, and the report is candid about it. The team also noted that the prompt set leans heavily on math, with substantial slices for science (especially physics and chemistry at the competition level) and code. A smaller portion covers safety and alignment data, generated through Microsoft's standard red teaming pipeline.

The SFT stage ran on 32 NVIDIA H100-80GB GPUs for about 2.5 days. The model was trained on roughly 16 billion total tokens, with around 8.3 billion unique tokens, suggesting at most a couple of passes over the data.^[2] The chat template was modified during SFT to introduce the <think> and </think> markers that wrap the model's reasoning, and the system prompt was written to encourage the model to explore alternative approaches inside the thinking block before settling on a final answer.
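The "couple of passes" estimate and the overall SFT compute budget follow directly from the reported figures. The short calculation below uses only the numbers quoted above and treats the 2.5 day duration as exact, which it is not.

```python
total_tokens = 16e9      # total SFT tokens reported
unique_tokens = 8.3e9    # unique SFT tokens reported
gpus = 32                # H100-80GB GPUs
days = 2.5               # approximate SFT wall-clock time

passes = total_tokens / unique_tokens   # average passes over the unique data
gpu_hours = gpus * days * 24            # total GPU-hours for the SFT stage

print(f"passes over unique data: {passes:.2f}")
print(f"SFT compute: {gpu_hours:.0f} GPU-hours")
```

This works out to just under two passes over the unique tokens and roughly 1,920 GPU-hours.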

Reinforcement learning on math

The RL stage was the more selective half of the recipe. Microsoft started from a seed set of 72,401 verifiable math problems and ran Group Relative Policy Optimization,^[2] the same algorithm that DeepSeek used to train R1 and R1-Zero.^[10] In each RL iteration, the model produced 8 rollouts per problem on a subsample of 64 seeds, and the policy was updated based on the relative reward across those rollouts. The reward signal was a simple correctness check against ground truth answers, which is why the team restricted RL to math problems where answers can be automatically verified.
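The "relative reward" idea at the core of Group Relative Policy Optimization can be sketched in a few lines. The example below follows the commonly published formulation, in which each rollout's reward is normalized by the mean and standard deviation of its own group, and assumes a binary correctness reward as described above. It is not the report's training code, and it omits the clipped policy update and any other loss terms.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group.

    Rollouts that beat the group average get a positive advantage, the rest
    a negative one. `eps` avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight rollouts for one problem, scored 1.0 if the final answer is correct.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])
```

A group in which every rollout is correct, or every rollout is wrong, produces all-zero advantages and therefore no learning signal, which is one reason problem difficulty matters when choosing seeds.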

The surprising part of the RL stage is how short it was. The team reported that the checkpoint they shipped as Phi-4-reasoning-plus was trained for only about 90 steps and saw roughly 6,000 unique examples in total. Despite the small scale, that stage lifted AIME 2025 accuracy by about 15 percentage points and AIME 2024 accuracy by 6 points over the SFT only baseline. The team selected the final RL checkpoint by tracking AIME 2024 scores during training, since AIME 2025 problems were held out as an evaluation set.^[2]
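The "roughly 6,000 examples" figure is consistent with the per-step numbers given earlier. The calculation below multiplies them out, assuming every step draws a fresh set of 64 seeds, which makes the result an upper bound on unique examples.

```python
steps = 90             # approximate number of RL optimization steps
seeds_per_step = 64    # problems subsampled per iteration
rollouts_per_seed = 8  # sampled trajectories per problem
seed_pool = 72_401     # size of the math seed set

examples_seen = steps * seeds_per_step               # problems visited
rollouts_total = examples_seen * rollouts_per_seed   # trajectories generated
pool_fraction = examples_seen / seed_pool            # share of the seed set used

print(examples_seen, rollouts_total, f"{pool_fraction:.1%}")
```

Under that assumption the run touched 5,760 problems, generated 46,080 rollouts, and used only about 8 percent of the available seed set.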

The report attributes the efficiency of the RL stage to the fact that the SFT model was already producing high quality reasoning traces. The RL stage did not have to teach the model to think; it had to teach the model to favor traces that resolved correctly. The team also noted that response lengths grew during RL, with Phi-4-reasoning-plus producing about 50 percent more tokens per response than Phi-4-reasoning at convergence.^[2] That tradeoff matched what other reasoning groups had reported, and the team chose to ship the longer responding checkpoint rather than impose a length penalty.

Benchmark performance

The technical report and the Hugging Face model cards publish results across math, science, and coding benchmarks. The numbers below are taken directly from those sources.^[2]^[3]^[4]

Benchmark	Phi-4-reasoning	Phi-4-reasoning-plus	DeepSeek R1 Distill 70B	DeepSeek R1 (671B)	OpenAI o1-mini
AIME 2024 (math)	75.3	81.3	70.0	79.8	63.6
AIME 2025 (math)	62.9	78.0	51.5	70.4	54.8
OmniMath (math)	76.6	81.9	63.4	85.0	not reported
GPQA Diamond (science)	65.8	68.9	66.2	73.0	60.0
HMMT February 2025 (math)	53.1	69.0	not reported	not reported	not reported
LiveCodeBench (code)	53.8	53.1	57.5	62.8	53.8

A few patterns stand out. On both the 2024 and 2025 editions of AIME, the plus variant is the strongest of the small open weight reasoners and is competitive with the full 671 billion parameter DeepSeek R1. The SFT only Phi-4-reasoning is already strong on AIME 2024 but loses ground on AIME 2025, where the harder problems and the lack of RL reward shaping leave it about 15 points below the plus checkpoint. The plus variant's advantage narrows when problems shift toward open ended physics and chemistry, as on GPQA Diamond, where it trails DeepSeek R1 (73.0) and where OpenAI o1 (76.7) and Gemini 2.5 Pro (84.0) still lead.

LiveCodeBench tells a slightly different story. The plus variant does not improve over the SFT only checkpoint on code, scoring 53.1 versus 53.8, and both trail DeepSeek R1 and the larger frontier closed models. The team attributes the flat code performance to the fact that the RL stage focused entirely on math, leaving code generation to whatever the SFT data taught the model.^[2] Future iterations could plausibly close that gap by extending GRPO to verifiable coding tasks, but the April 2025 release did not.

The report also publishes results on more general benchmarks. Phi-4-reasoning-plus scored 84.9 on IFEval, 76.5 on Arena Hard, and 81.6 on HumanEvalPlus.^[2] Those numbers indicate that the reasoning post training did not destroy the base Phi-4 model's instruction following or chat behavior; the reasoning models can still answer ordinary questions in a normal single pass style if the thinking block is suppressed.

On HMMT February 2025, Phi-4-reasoning-plus scored 69.0 percent, comfortably ahead of the SFT only variant at 53.1. That result was used in the report as evidence that the RL stage generalized rather than overfitting to a specific contest year, since HMMT problems were not in the training pool.^[2] The team also tracked AIME 2025 specifically because, like HMMT 2025, it was held out of training data, providing a clean evaluation set.
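The size of the RL effect on each benchmark can be read off the results table by subtracting the SFT only score from the plus score. The snippet below does that for the six rows listed in the table; the scores are copied from it and nothing else is assumed.

```python
# (Phi-4-reasoning, Phi-4-reasoning-plus) scores copied from the table above.
SCORES = {
    "AIME 2024": (75.3, 81.3),
    "AIME 2025": (62.9, 78.0),
    "OmniMath": (76.6, 81.9),
    "GPQA Diamond": (65.8, 68.9),
    "HMMT February 2025": (53.1, 69.0),
    "LiveCodeBench": (53.8, 53.1),
}

# Gain of the RL checkpoint over the SFT-only checkpoint, in points.
gains = {name: round(plus - sft, 1) for name, (sft, plus) in SCORES.items()}

for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<20} {gain:+.1f}")
```

The largest gains appear on the two held out math contests, HMMT February 2025 and AIME 2025, while LiveCodeBench is the only row where the plus checkpoint scores lower.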

Licensing

Microsoft released both Phi-4-reasoning and Phi-4-reasoning-plus under the MIT license.^[3]^[4] The license, which dates back to the original MIT software releases of the 1980s, is one of the most permissive in widespread use. It allows commercial use, modification, redistribution, sublicensing, and private use, with the only requirement being that the original copyright notice and license text accompany substantial portions of the software. There are no use case restrictions, no revenue thresholds, no acceptable use policies, and no obligation to share derivatives.

That licensing choice put Phi-4-reasoning on a notably more open footing than several of its peers. Meta's Llama 3 series is governed by the Llama 3 community license, which adds a 700 million monthly active user threshold for commercial deployment. Google's Gemma terms of use include an acceptable use policy. DeepSeek R1 was released under the MIT license, matching Phi-4-reasoning,^[10] and Alibaba's QwQ 32B Preview ships under Apache 2.0, which is similarly permissive but adds explicit patent and notice terms.

The MIT license also covers the model weights, the tokenizer, the chat template, and the inference code published on Hugging Face.^[3] It does not cover the training data, which Microsoft has consistently declined to redistribute for the Phi family. Phi-4-reasoning is therefore an open weight model in the strict sense: the architecture, the weights, and the code are open, but the training corpus is not. That distinction matters for groups attempting to independently reproduce the model from scratch, since the SFT data and the 72,401 RL math problems are not publicly available even though the training recipe is described in the report.^[2]

One secondary licensing question concerns the o3-mini generated reasoning traces used for SFT. Microsoft is OpenAI's largest commercial partner and has access to o3-mini under whatever terms govern that relationship; the publicly available OpenAI usage policies restrict using model outputs to train competing models, but Microsoft's status as an OpenAI investor and Azure partner gives it a separate negotiated position. The technical report acknowledges the distillation explicitly and frames it as a deliberate choice rather than an attempt to obscure the data source.^[2]

Comparison to peers

The table below collects published specifications and headline benchmark scores for the leading reasoning oriented open weight and closed models as of mid 2025. Numbers come from each model's official report, model card, or product page.

Model	Developer	Parameters	License	AIME 2024	AIME 2025	GPQA Diamond	Released
Phi-4-reasoning-plus^[1]	Microsoft	14B	MIT	81.3	78.0	68.9	Apr 2025
Phi-4-reasoning^[1]	Microsoft	14B	MIT	75.3	62.9	65.8	Apr 2025
DeepSeek R1^[10]	DeepSeek	671B (37B active)	MIT	79.8	70.4	73.0	Jan 2025
DeepSeek R1 Distill Llama 70B^[10]	DeepSeek	70B	MIT	70.0	51.5	66.2	Jan 2025
OpenAI o1-mini	OpenAI	undisclosed	proprietary	63.6	54.8	60.0	Sep 2024
OpenAI o1	OpenAI	undisclosed	proprietary	74.4	75.3	76.7	Dec 2024
QwQ 32B Preview	Alibaba	32B	Apache 2.0	79.5	56.0	65.2	Nov 2024
Phi-4-mini-reasoning^[8]	Microsoft	3.8B	MIT	57.5	not reported	52.0	Apr 2025

The most direct comparison is with DeepSeek R1 Distill Llama 70B, since it is also a distilled dense reasoner. Phi-4-reasoning-plus has one fifth the parameter count and beats it on AIME 2024 (81.3 vs 70.0) and AIME 2025 (78.0 vs 51.5), and it leads slightly on GPQA Diamond (68.9 vs 66.2, although the ordering can reverse depending on the eval prompt template).^[1] Against the full DeepSeek R1, Phi-4-reasoning-plus is competitive on AIME 2024 and AIME 2025 but trails on harder open ended science questions where the larger parameter count helps recall.

Against OpenAI's reasoning line, Phi-4-reasoning-plus beats o1-mini on every published math benchmark and scores above o1 on both AIME editions in the table (81.3 vs 74.4 on AIME 2024, 78.0 vs 75.3 on AIME 2025). It trails o1 on GPQA Diamond (68.9 vs 76.7) and on more open ended reasoning. The team frames that gap as expected given the parameter count difference and the closed model's larger training budget, and uses the result to argue that small models can close most of the gap with frontier reasoners on verifiable tasks like math, where reinforcement learning has a clear reward signal.

The Phi-4-reasoning models sit alongside Phi-4-mini-reasoning, the 3.8 billion parameter variant released at almost the same time. Phi-4-mini-reasoning uses a different recipe, distilling from DeepSeek R1 rather than o3-mini, and targets on device deployment rather than competition level scores.^[8] Together the two model lines bracket Microsoft's reasoning offering, with Phi-4-reasoning aimed at competition style problems on a single workstation GPU and Phi-4-mini-reasoning aimed at agents and assistants on consumer laptops.

Reception

Reception inside the open weight research community was broadly positive, with most commentary focused on three points. The first was the gap between parameter count and benchmark score. Several research blogs and Hugging Face discussions noted that a 14 billion parameter dense model approaching DeepSeek R1 on AIME was a striking data point, and it reinforced the Phi line's central claim that data curation can substitute for scale. The second was the candor of the technical report. Microsoft published not only the recipe but also negative results, including the observation that the RL stage did not improve code performance, and that the team's instinct to run a longer RL stage was not supported by the validation curves.

The third was the o3-mini distillation. Some commenters described the work as an OpenAI distillation in disguise, since the SFT data was generated by an OpenAI model. The technical report addresses that directly and frames the o3-mini choice as a deliberate one. Microsoft's position is that distillation from a strong teacher is a reasonable way to bootstrap a small open weight reasoner, and that the resulting MIT licensed weights are still useful to the open weight community even if the underlying traces were produced by a proprietary system. That argument carried more weight after DeepSeek R1, where R1's own training pipeline included extensive distillation steps.

Critics raised familiar points about Phi family releases. Several reviewers noted that Microsoft has never published the full training data composition, which makes it hard to independently audit or reproduce the model. Others observed that the strong scores on competition math benchmarks do not always translate to open ended chat or agent behavior, where larger models still have an advantage in breadth of knowledge. A few reviewers also flagged the 32 thousand token context window as a limitation for long form reasoning tasks, since some frontier reasoners support 128 thousand tokens or more.

On the commercial side, Phi-4-reasoning anchored Microsoft's mid 2025 reasoning offering on Azure AI Foundry. Both variants were added to the Foundry model catalog the same week, alongside the Phi-4-mini-reasoning checkpoint, and the team published deployment guides for Azure Machine Learning, ONNX Runtime, and Foundry Local.^[6] NVIDIA promoted the models through its developer materials as a target workload for H100 and L40S deployments, and the Ollama community library shipped quantized builds within a few days of the Hugging Face release.

A more practical critique came from the local model community. Phi-4-reasoning-plus generates noticeably longer responses than the SFT only checkpoint, and several reviewers reported that median response times on consumer hardware nearly doubled when switching from Phi-4-reasoning to Phi-4-reasoning-plus. For users who needed the highest reasoning accuracy, that tradeoff was acceptable; for users running interactive assistants or agent loops, the SFT only variant remained the more popular choice. By late 2025, the open weight community had largely converged on the pattern of using Phi-4-reasoning for general agent work and reserving Phi-4-reasoning-plus for batch tasks that benefited from the longer thinking budget.

References

1. Abdin, Marah, Sahaj Agarwal, Ahmed Awadallah, et al. "Phi-4-reasoning Technical Report." Microsoft Research, April 30, 2025. https://www.microsoft.com/en-us/research/publication/phi-4-reasoning-technical-report/
2. Abdin, Marah, et al. "Phi-4-reasoning Technical Report." arXiv preprint 2504.21318, April 30, 2025. https://arxiv.org/abs/2504.21318
3. Microsoft. "microsoft/Phi-4-reasoning." Hugging Face model card. https://huggingface.co/microsoft/Phi-4-reasoning
4. Microsoft. "microsoft/Phi-4-reasoning-plus." Hugging Face model card. https://huggingface.co/microsoft/Phi-4-reasoning-plus
5. Microsoft Research. "Phi-Reasoning: Once again redefining what is possible with small and efficient AI." Microsoft Research Articles, April 30, 2025. https://www.microsoft.com/en-us/research/articles/phi-reasoning-once-again-redefining-what-is-possible-with-small-and-efficient-ai/
6. Microsoft Azure. "Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning." Azure AI Foundry Blog. https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/
7. Microsoft. "microsoft/Phi-4-mini-reasoning." Hugging Face model card. https://huggingface.co/microsoft/Phi-4-mini-reasoning
8. Xu, Haoran, et al. "Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math." arXiv preprint 2504.21233, April 30, 2025. https://arxiv.org/abs/2504.21233
9. Microsoft. "Phi-4 Technical Report." Microsoft Research, December 2024. https://www.microsoft.com/en-us/research/publication/phi-4-technical-report/
10. DeepSeek AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv preprint 2501.12948, January 2025. https://arxiv.org/abs/2501.12948
