Phi-4-mini-flash-reasoning

Updated Jul 16, 2026

Phi-4-mini-flash-reasoning is a 3.8 billion parameter open weight reasoning model released by Microsoft in July 2025.^[1] It is a latency optimized sibling of Phi-4-mini and the earlier Phi-4-mini-reasoning checkpoint, built on a new hybrid architecture called SambaY rather than the standard dense Transformer backbone used elsewhere in the Phi-4 family.^[3] The model targets math heavy reasoning workloads where decoding speed and memory footprint matter as much as raw accuracy, and Microsoft positions it for edge deployment, on device tutoring, and other resource constrained settings.^[1]

The model was announced on the Azure blog under the title Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning, with weights distributed under the MIT license on Hugging Face, Azure AI Foundry, and the NVIDIA API Catalog.^[1] The underlying SambaY architecture, described in a paper titled Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation and posted to arXiv on July 9, 2025, combines a Samba based self decoder using Mamba (State Space Model) layers and Sliding Window Attention with a cross decoder that interleaves cross attention with a new component called the Gated Memory Unit (GMU).^[3] Microsoft reports up to 10 times higher decoding throughput than Phi-4-mini-reasoning on 2K token prompts with 32K token generations, and a 2 to 3 times average reduction in latency, while matching or beating the older model on Math500, AIME 2024, AIME 2025, and GPQA Diamond.^[1]

Background

The Phi-4 reasoning family grew out of a year of work on small dense Transformers at Microsoft Research. The flagship Phi-4 model arrived in December 2024 as a 14 billion parameter dense model that competed with much larger systems on math benchmarks, and Microsoft followed it on February 26, 2025 with a refresh that added the 3.8 billion parameter Phi-4-mini and the multimodal Phi-4-multimodal.^[9] Both of those models were trained on roughly 5 trillion tokens with heavy emphasis on synthetic reasoning data, but neither was a reasoning specialist; they handled chain of thought tasks reasonably well, yet were tuned mainly for general instruction following.

In April 2025 Microsoft released Phi-4-mini-reasoning, a math focused fine tune of Phi-4-mini that used about 150 billion tokens of synthetic mathematical content distilled from DeepSeek R1.^[10] The team also published a heavier sibling, Phi-4 Reasoning, built on the 14 billion parameter Phi-4 base, in late April 2025. Phi-4-mini-reasoning kept the same dense Transformer backbone as Phi-4-mini and reached strong scores on AIME 2024, Math500, and GPQA Diamond despite its small parameter budget.^[10] Its weakness was the one every dense Transformer has on long reasoning traces: each new token attends over everything generated so far, so per token decoding cost and key value cache size grow with the length of the trace and total attention work grows quadratically. That makes 32K token generations expensive on edge hardware.

Phi-4-mini-flash-reasoning was Microsoft's answer to that bottleneck. Rather than push the parameter count higher, the team kept the 3.8 billion parameter budget but swapped the dense Transformer for a hybrid State Space Model plus attention design. The Azure blog framed the release as the first Phi model to ship with a non Transformer core, and as a step toward reasoning that fits inside a phone or a laptop without the latency penalty of full attention.^[1]

SambaY architecture

SambaY is a decoder hybrid decoder architecture introduced in the same paper as Phi-4-mini-flash-reasoning.^[3] It is not a pure State Space Model and not a pure Transformer; it is a two stage stack where the first stage handles most of the sequence processing with linear time components and the second stage uses a small number of attention layers to handle the parts that benefit from global context.

The self decoder, which is the first stage, is a Samba block that combines Mamba (State Space Model) layers with Sliding Window Attention (SWA) and a single layer of full attention. Mamba layers run in linear time over the input sequence and carry information forward through a learned state rather than a key value cache. SWA gives the model the ability to attend over a local window without paying for full attention across the whole context. The single full attention layer adds a global look up so the model can still resolve dependencies that span the entire 64K token context.^[3]
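The contrast between a state carried forward and a local attention window can be shown with a toy sketch. It is not the Mamba or Samba implementation: the recurrence uses fixed scalar coefficients (`a` and `b` are hypothetical values), whereas real Mamba layers use input dependent, multi dimensional parameters, and the window size is arbitrary.

```python
# Toy illustration only. Real Mamba layers use input-dependent parameters
# and vector-valued states; the coefficients here are made up.

def ssm_scan(xs, a=0.9, b=0.1):
    """Linear recurrence h_t = a * h_{t-1} + b * x_t.

    Memory use is a single running state, however long the sequence is.
    """
    h = 0.0
    outputs = []
    for x in xs:
        h = a * h + b * x  # constant work per token
        outputs.append(h)
    return outputs


def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal and local: each query sees itself and at most window - 1
    earlier positions.
    """
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


ys = ssm_scan([1.0, 0.0, 0.0])
mask = sliding_window_mask(seq_len=5, window=2)
attended_per_query = [sum(row) for row in mask]
print(ys)
print(attended_per_query)
```

The number of keys each query attends to stops growing once the window is full, which is why the cost of a Sliding Window Attention layer is bounded by the window size rather than by the context length.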

The cross decoder, which is the second stage, is where the new Gated Memory Unit (GMU) lives. A standard cross decoder would interleave several layers of cross attention against the self decoder's outputs, which is expensive in both compute and memory. SambaY replaces roughly half of those cross attention layers with GMUs. A GMU is an element wise gating function that reuses the hidden state from the final SSM layer of the self decoder, so it does not have to recompute attention against the full prefix. The paper describes the GMU as a way to share memory readout states from the Samba based self decoder, which keeps representation sharing across layers without redundant computation.^[3]
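A minimal sketch of the gating idea follows. It assumes a SiLU gate and two projection matrices (`w_gate` and `w_out` are hypothetical names), and it uses plain Python lists for a single token; the exact formulation, shapes, and normalization should be taken from the paper rather than from this example.

```python
import math

# Illustrative sketch of an element-wise gated memory read-out.
# The SiLU gate and the projection layout are assumptions.


def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def gated_memory_unit(x, memory, w_gate, w_out):
    """Gate a shared memory vector with the current layer's input.

    x:      current hidden state, length d_model
    memory: state read-out shared from the self decoder, length d_mem
    w_gate: d_mem x d_model projection producing the gate
    w_out:  d_model x d_mem output projection
    """
    gate = [silu(g) for g in matvec(w_gate, x)]
    gated = [m * g for m, g in zip(memory, gate)]  # element-wise product
    return matvec(w_out, gated)


x = [1.0, -1.0]
memory = [2.0, 0.5, -1.0]
w_gate = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
y = gated_memory_unit(x, memory, w_gate, w_out)
print(y)
```

Nothing in the function depends on the length of the prefix: the memory vector is reused as is, so the cost per token is two small matrix products and an element wise multiply, with no attention over earlier positions.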

The architecture also draws on differential attention from related Microsoft research. Together these choices give the model linear prefill time complexity in the sequence length and lower decoder I/O than a standard Transformer of the same parameter count. Crucially, SambaY does not need an explicit positional encoding scheme such as RoPE for its long context behavior, because the State Space Model layers carry order information through their state.^[3]

The table below summarizes the headline architectural choices, drawn from the Hugging Face model card and the arXiv paper.^[2]^[3]

Attribute	Value
Parameters	3.8 billion
Architecture	SambaY (decoder hybrid decoder)
Self decoder	Mamba SSM, Sliding Window Attention, one full attention layer
Cross decoder	Cross attention layers interleaved with Gated Memory Units
Attention features	Grouped query attention, single global attention layer, differential attention
Context length	64,000 tokens
Vocabulary	200,064 tokens
Embeddings	Shared input and output (tied)
Positional encoding	None (state carries order)
Precision	bfloat16
KV cache	Shared key value cache

The 200,064 entry vocabulary and tied embeddings are inherited from Phi-4-mini. The 64K context window is shorter than Phi-4-mini's 128K window, which reflects the fact that the flash variant is tuned for long generations rather than long inputs. In practice the model is intended for prompts in the low thousands of tokens with reasoning traces that can stretch out to 32K tokens or more.^[2]

Training

The pretraining recipe carries over the synthetic data approach from earlier Phi releases but uses a different compute profile because of the hybrid architecture. According to the Hugging Face model card, Phi-4-mini-flash-reasoning was pretrained on 5 trillion tokens using 1,024 NVIDIA A100 80GB GPUs over 14 days. The reasoning post training stage ran on 128 NVIDIA H100 80GB GPUs for 2 days using 150 billion tokens of reasoning data.^[2]
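The reported figures translate into a rough compute budget. The sketch below assumes every GPU was busy for the full wall clock duration, which makes the results an upper bound, and A100 hours and H100 hours are not directly comparable.

```python
# Back-of-envelope GPU-hour estimate from the model card figures.
# Assumes full utilization for the whole reported duration.

def gpu_hours(num_gpus, days):
    return num_gpus * days * 24


pretrain_gpu_hours = gpu_hours(num_gpus=1024, days=14)   # A100 80GB
posttrain_gpu_hours = gpu_hours(num_gpus=128, days=2)    # H100 80GB

print(pretrain_gpu_hours)   # 344064
print(posttrain_gpu_hours)  # 6144
```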

The reasoning corpus is dominated by synthetic mathematical content, including over 1 million math problems spanning middle school through Ph.D. level, and roughly 30 billion tokens of verified mathematical content. Microsoft used DeepSeek R1 to generate problem and solution traces for that corpus.^[2] The data composition is similar to the corpus used for Phi-4-mini-reasoning, which makes the architecture itself the main controlled difference between the two models.

The post training pipeline is multi stage supervised fine tuning followed by Direct Preference Optimization. The model card notes that Phi-4-mini-flash-reasoning does not use reinforcement learning, which separates it from Phi-4-mini-reasoning. The training cutoff for source data is May 2025.^[2]

Capabilities

The model is a specialist rather than a generalist. The Hugging Face card states explicitly that it is designed for advanced math reasoning and is not intended for general purpose language understanding tasks.^[2] The table below summarizes its intended uses and supported workflows.

Capability	Notes
Multi step math reasoning	Primary use; trained on synthetic problem solution traces
Formal proof generation	Supported via chain of thought style outputs
Symbolic computation	Listed in model card as an intended use
Long generation under tight memory budgets	SambaY design keeps decoder I/O low
Edge and mobile deployment	Targeted use case in Azure announcement
Educational and tutoring applications	Cited as a primary scenario
General chat and world knowledge	Not the intended use; limited capacity at 3.8B parameters
Non English reasoning	Limited; English is the primary training language
Code generation	Mostly Python; other languages require verification

Extended conversational sessions can produce repetition or drift, and the model is not appropriate for high risk legal or medical advice without additional safeguards. Microsoft also notes elevated defect rates on election related queries, in line with its broader responsible AI guidance.^[2]

Inference speed claims

The headline efficiency claim from both the Azure blog and the Hugging Face card is up to 10 times higher decoding throughput than Phi-4-mini-reasoning on a workload with 2K token prompts and 32K token generations. The same comparison reports a 2 to 3 times average reduction in latency.^[1]^[2]

The driver behind those numbers is the SambaY architecture itself. Because most of the sequence processing in the self decoder runs on State Space Model layers, prefill cost grows linearly with prompt length instead of quadratically. The Gated Memory Units in the cross decoder remove the need to recompute attention against the full prefix for roughly half of the layers that would otherwise carry it. Full attention in the self decoder is confined to a single layer, so its key value cache is built once and shared rather than repeated in every layer of the stack.
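A rough way to see why the memory footprint shrinks is to count key value cache bytes. Every configuration number in the sketch below (layer counts, head count, head dimension, window size) is a hypothetical placeholder, not the published configuration of either model, and the estimate ignores the State Space Model states, so it illustrates the scaling argument rather than reproducing Microsoft's throughput figures.

```python
# Hypothetical cache-size comparison for a 2K prompt plus 32K generation.
# All architecture numbers are made-up placeholders for illustration.

def kv_cache_bytes(layers, cached_tokens, kv_heads, head_dim, bytes_per_value=2):
    """Bytes for keys and values across layers (bfloat16 is 2 bytes)."""
    return 2 * layers * cached_tokens * kv_heads * head_dim * bytes_per_value


TOKENS = 2_000 + 32_000
KV_HEADS, HEAD_DIM = 8, 128          # hypothetical
DENSE_LAYERS = 32                    # hypothetical dense Transformer depth
SWA_LAYERS, WINDOW = 8, 512          # hypothetical hybrid settings

# Dense Transformer: every layer caches every token.
dense_bytes = kv_cache_bytes(DENSE_LAYERS, TOKENS, KV_HEADS, HEAD_DIM)

# Hybrid: one full attention cache shared across the cross decoder, plus
# sliding window layers whose caches are capped at the window size.
hybrid_bytes = (
    kv_cache_bytes(1, TOKENS, KV_HEADS, HEAD_DIM)
    + kv_cache_bytes(SWA_LAYERS, min(TOKENS, WINDOW), KV_HEADS, HEAD_DIM)
)

print(dense_bytes, hybrid_bytes)
```

Under these placeholder settings the dense cache grows with both depth and sequence length, while the hybrid cache grows with sequence length in only one layer.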

Microsoft tested the model on NVIDIA A100 GPUs with vLLM and on H100 GPUs through Azure AI Foundry.^[2] Flash Attention 2 support is a hardware requirement, which limits deployment to recent NVIDIA accelerators. SGLang offers streaming support, and NVIDIA NIM packages the model as a commercial inference microservice.^[4]

Benchmark performance

The Hugging Face model card publishes a head to head comparison against Phi-4-mini-reasoning and several DeepSeek R1 distilled models. The numbers below are taken directly from that card. AIME accuracy is averaged over 64 samples and Math500 and GPQA Diamond are averaged over 8 samples, all pass at 1.^[2]
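Averaged pass at 1 can be computed as the mean per problem success rate over repeated samples. The sketch below shows that calculation on made-up results; it is an illustration of the metric, not the evaluation harness Microsoft used, and the function name is hypothetical.

```python
# Averaged pass@1: for each problem, the fraction of sampled answers that
# are correct; then the mean over problems, reported as a percentage.

def mean_pass_at_1(results):
    """results[p][s] is True when sample s for problem p was correct."""
    per_problem = [sum(samples) / len(samples) for samples in results]
    return 100.0 * sum(per_problem) / len(per_problem)


# Made-up data: 2 problems, 4 samples each.
toy_results = [
    [True, True, False, False],   # 50% of samples correct
    [True, False, False, False],  # 25% of samples correct
]
score = mean_pass_at_1(toy_results)
print(score)  # 37.5
```

Averaging over many samples reduces the variance that a single sampled answer per problem would introduce, which matters on small test sets such as AIME.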

Model	Parameters	AIME 2024	AIME 2025	Math500	GPQA Diamond
Phi-4-mini-flash-reasoning	3.8B	52.29	33.59	92.45	45.08
Phi-4-mini-reasoning	3.8B	48.13	31.77	91.20	44.51
DeepSeek-R1-Distill-Qwen-7B	7B	53.70	35.94	93.03	47.85
DeepSeek-R1-Distill-Llama-8B	8B	43.96	27.34	87.48	45.83

A few patterns are worth calling out. Phi-4-mini-flash-reasoning beats its dense Transformer sibling Phi-4-mini-reasoning on every benchmark in the table, with the largest gap on AIME 2024 (52.29 versus 48.13). It trails DeepSeek-R1-Distill-Qwen-7B on all four benchmarks, which is expected given that the Qwen distill has almost twice the parameter count, but the gap is narrow on Math500 (92.45 versus 93.03). Against the larger DeepSeek-R1-Distill-Llama-8B, Phi-4-mini-flash-reasoning leads on AIME 2024, AIME 2025, and Math500 while losing slightly on GPQA Diamond.^[2]
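The per benchmark differences can be recomputed directly from the table. The sketch below copies the published scores and prints the gap between Phi-4-mini-flash-reasoning and each other model; the helper name `gaps` is hypothetical.

```python
# Scores copied from the benchmark table (pass@1, percent).
BENCHMARKS = ["AIME 2024", "AIME 2025", "Math500", "GPQA Diamond"]
SCORES = {
    "Phi-4-mini-flash-reasoning": [52.29, 33.59, 92.45, 45.08],
    "Phi-4-mini-reasoning": [48.13, 31.77, 91.20, 44.51],
    "DeepSeek-R1-Distill-Qwen-7B": [53.70, 35.94, 93.03, 47.85],
    "DeepSeek-R1-Distill-Llama-8B": [43.96, 27.34, 87.48, 45.83],
}


def gaps(model, baseline):
    """Score difference (model minus baseline) per benchmark, in points."""
    return {
        name: round(a - b, 2)
        for name, a, b in zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    }


FLASH = "Phi-4-mini-flash-reasoning"
for baseline in SCORES:
    if baseline != FLASH:
        print(baseline, gaps(FLASH, baseline))
```

Positive values mean the flash model scores higher than the baseline on that benchmark; negative values mean it scores lower.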

The paper frames these numbers as evidence that the SambaY architecture does not sacrifice reasoning quality to gain its efficiency. In the paper's words, the flash variant achieves significantly better performance than Phi-4-mini-reasoning on reasoning tasks while delivering up to 10 times higher decoding throughput.^[3]

Licensing

Microsoft released Phi-4-mini-flash-reasoning under the MIT License.^[1]^[2] The MIT license is one of the most permissive licenses in widespread use. It allows commercial use, modification, redistribution, private use, and sublicensing, with the only requirement being that the original copyright and license notice be included in all copies or substantial portions of the software.

This matches the licensing of the other models in the Phi-4 family, including Phi-4, Phi-4-mini, Phi-4-multimodal, Phi-4-mini-reasoning, and Phi-4 Reasoning. The licensing posture is more permissive than Meta's Llama 3.x community license and Google's Gemma terms of use, both of which include use case restrictions. The MIT license has helped the Phi family spread quickly through open weight tooling such as vLLM, Ollama, llama.cpp, and downstream quantizations.

As with the rest of the Phi family, the weights are open but the training data is not redistributed. Phi-4-mini-flash-reasoning is therefore an open weight model rather than a fully open source model in the strict sense; the architecture, code, and weights are open, while the training corpus is not.

Comparison to related Phi variants

The table below collects published specifications and headline benchmark scores for five models in the Phi-4 family: the two base models and the three reasoning variants. Math500 and AIME numbers are taken from each model's Hugging Face model card or technical report.

Model	Parameters	Backbone	Context	Reasoning training	Math500	AIME 2024	Released
Phi-4	14B	Dense Transformer	16K	None (base)	n/a	n/a	Dec 2024
Phi-4-mini	3.8B	Dense Transformer	128K	None (base)	n/a	n/a	Feb 2025
Phi-4-mini-reasoning	3.8B	Dense Transformer	128K	SFT, DPO, RL on synthetic math	91.20	48.13	Apr 2025
Phi-4 Reasoning	14B	Dense Transformer	32K	SFT, DPO, RL on synthetic math	n/a	n/a	Apr 2025
Phi-4-mini-flash-reasoning	3.8B	SambaY hybrid SSM and attention	64K	SFT and DPO on synthetic math	92.45	52.29	Jul 2025

Two differences between Phi-4-mini-reasoning and Phi-4-mini-flash-reasoning are worth emphasizing. The first is the backbone: Phi-4-mini-reasoning uses the same dense Transformer as Phi-4-mini, while the flash variant uses SambaY. The second is the training pipeline: Phi-4-mini-reasoning includes a reinforcement learning stage on top of supervised fine tuning and Direct Preference Optimization, while Phi-4-mini-flash-reasoning skips RL entirely and relies on multi stage SFT and DPO.^[2] The shorter context window of the flash variant (64K versus 128K) reflects its emphasis on long generation rather than long input.

Against the larger 14 billion parameter Phi-4 Reasoning, Phi-4-mini-flash-reasoning trades raw accuracy on hard math benchmarks for a much smaller memory footprint and much faster decoding. The two models target different deployment scenarios: Phi-4 Reasoning is intended for server side inference where capacity matters most, and Phi-4-mini-flash-reasoning is intended for edge and mobile workloads where latency and memory matter more.

Reception

Reception inside the open weight community focused on the architectural novelty more than the benchmark numbers. Coverage on MarkTechPost, several Medium technical deep dives, and the Microsoft Azure Insider blog highlighted SambaY as the first time a major lab had shipped a production reasoning model with a hybrid State Space Model and attention design.^[5] The Mamba and Mamba 2 community had spent two years showing that SSMs could match Transformers on standard language modeling benchmarks; Phi-4-mini-flash-reasoning was widely framed as the first commercial deployment of those ideas inside a reasoning specialist.

The Gated Memory Unit attracted attention as a specific technique. Several reviewers pointed out that replacing half of the cross attention layers with a cheaper element wise gate is a simple idea that may transfer to other hybrid stacks, and the paper's NeurIPS 2025 acceptance gave it additional visibility.

Criticism centered on the same issues that affect every small reasoning model. The 3.8 billion parameter budget limits factual knowledge, and the model can produce inaccurate facts on world knowledge prompts. Performance on non English queries degrades noticeably. The hardware requirement for Flash Attention 2 support restricts deployment to recent NVIDIA accelerators, which complicates use on Apple Silicon and on older edge devices. And as with every Phi release, the training corpus is not redistributed, so the model's behavior cannot be independently reproduced or audited at the data level.^[2]

On the commercial side, the model became part of Microsoft's small model strategy for the second half of 2025 alongside Phi-4-mini, Phi-4-multimodal, and Phi-4 Reasoning. It was included as a default option in Azure AI Foundry's reasoning tier and packaged as a NIM microservice through the NVIDIA API Catalog.^[8]^[4]

References

1. Microsoft Azure. "Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning." Azure Blog, July 2025. https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/
2. Microsoft. "microsoft/Phi-4-mini-flash-reasoning." Hugging Face model card. https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning
3. Ren, Liliang; Chen, Congcong; Xu, Haoran; Kim, Young Jin; Atkinson, Adam; Zhan, Zheng; Sun, Jiankai; Peng, Baolin; Liu, Liyuan; Wang, Shuohang; Cheng, Hao; Gao, Jianfeng; Chen, Weizhu; Shen, Yelong. "Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation." arXiv preprint 2507.06607, July 9, 2025. https://arxiv.org/abs/2507.06607
4. NVIDIA. "phi-4-mini-flash-reasoning Model Card." NVIDIA Build. https://build.nvidia.com/microsoft/phi-4-mini-flash-reasoning/modelcard
5. MarkTechPost. "Microsoft Releases Phi-4-mini-Flash-Reasoning: Efficient Long-Context Reasoning with Compact Architecture." July 10, 2025. https://www.marktechpost.com/2025/07/10/microsoft-releases-phi-4-mini-flash-reasoning-efficient-long-context-reasoning-with-compact-architecture/
6. Microsoft. "PhiCookBook." GitHub repository. https://github.com/microsoft/PhiCookBook
7. Microsoft. "ArchScale training codebase." GitHub repository. https://github.com/microsoft/ArchScale
8. Microsoft Azure. "Phi-4-mini-flash-reasoning in Azure AI Foundry catalog." https://ai.azure.com/catalog/models/Phi-4-mini-flash-reasoning
9. Microsoft Azure. "Empowering innovation: The next generation of the Phi family." Azure Blog, February 26, 2025. https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/
10. Microsoft. "microsoft/Phi-4-mini-reasoning." Hugging Face model card. https://huggingface.co/microsoft/Phi-4-mini-reasoning
