Phi-4-mini-flash-reasoning
Last reviewed
May 16, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,867 words
Phi-4-mini-flash-reasoning is a 3.8 billion parameter open weight reasoning model released by Microsoft in July 2025. It is a latency optimized sibling of Phi-4-mini and the earlier Phi-4-mini-reasoning checkpoint, built on a new hybrid architecture called SambaY rather than the standard dense Transformer backbone used elsewhere in the Phi-4 family. The model targets math heavy reasoning workloads where decoding speed and memory footprint matter as much as raw accuracy, and Microsoft positions it for edge deployment, on device tutoring, and other resource constrained settings.
The model was announced on the Azure blog under the title Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning, with weights distributed under the MIT license on Hugging Face, Azure AI Foundry, and the NVIDIA API Catalog. The underlying SambaY architecture, described in a paper titled Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation and posted to arXiv on July 9, 2025, combines a Samba based self decoder using Mamba (State Space Model) layers and Sliding Window Attention with a cross decoder that interleaves cross attention with a new component called the Gated Memory Unit (GMU). Microsoft reports up to 10 times higher decoding throughput than Phi-4-mini-reasoning on 2K token prompts with 32K token generations, and a 2 to 3 times average reduction in latency, while matching or beating the older model on Math500, AIME 2024, AIME 2025, and GPQA Diamond.
The Phi-4 reasoning family grew out of a year of work on small dense Transformers at Microsoft Research. The flagship Phi-4 model arrived in December 2024 as a 14 billion parameter dense model that competed with much larger systems on math benchmarks, and Microsoft followed it on February 26, 2025 with a refresh that added the 3.8 billion parameter Phi-4-mini and the multimodal Phi-4-multimodal. Both of those models were trained on roughly 5 trillion tokens with heavy emphasis on synthetic reasoning data, but neither was a reasoning specialist; they handled chain of thought tasks reasonably well, yet were tuned mainly for general instruction following.
In April 2025 Microsoft released Phi-4-mini-reasoning, a math focused fine tune of Phi-4-mini that used about 150 billion tokens of synthetic mathematical content distilled from DeepSeek R1. The team also published a heavier sibling, Phi-4 Reasoning, built on the 14 billion parameter Phi-4 base, in late April 2025. Phi-4-mini-reasoning kept the same dense Transformer backbone as Phi-4-mini and reached strong scores on AIME 2024, Math500, and GPQA Diamond despite its small parameter budget. Its weakness was the one every dense Transformer has on long reasoning traces: each new token attends over a key value cache that grows with the sequence, so total attention cost scales quadratically with trace length and 32K token generations become expensive on edge hardware.
Phi-4-mini-flash-reasoning was Microsoft's answer to that bottleneck. Rather than push the parameter count higher, the team kept the 3.8 billion parameter budget but swapped the dense Transformer for a hybrid State Space Model plus attention design. The Azure blog framed the release as the first Phi model to ship with a non Transformer core, and as a step toward reasoning that fits inside a phone or a laptop without the latency penalty of full attention.
SambaY is a decoder hybrid decoder architecture introduced in the same paper as Phi-4-mini-flash-reasoning. It is not a pure State Space Model and not a pure Transformer; it is a two stage stack where the first stage handles most of the sequence processing with linear time components and the second stage uses a small number of attention layers to handle the parts that benefit from global context.
The self decoder, which is the first stage, is a Samba block that combines Mamba (State Space Model) layers with Sliding Window Attention (SWA) and a single layer of full attention. Mamba layers run in linear time over the input sequence and carry information forward through a learned state rather than a key value cache. SWA gives the model the ability to attend over a local window without paying for full attention across the whole context. The single full attention layer adds a global look up so the model can still resolve dependencies that span the entire 64K token context.
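The difference between sliding window and full attention can be illustrated with a causal mask. The following is a minimal sketch, not the model's implementation: the function names and the window size of 3 are hypothetical, chosen only to keep the example small.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal sliding window mask.

    Entry [i][j] is True when token i may attend to token j, which here
    means j is not in the future (j <= i) and lies within the last
    `window` positions (j > i - window).
    """
    return [
        [(j <= i) and (j > i - window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_counts(mask: list[list[bool]]) -> list[int]:
    """Number of positions each token attends to."""
    return [sum(row) for row in mask]


# Hypothetical sizes for illustration only.
swa_counts = attended_counts(sliding_window_mask(seq_len=6, window=3))
# A window as long as the sequence reduces to ordinary causal attention.
full_counts = attended_counts(sliding_window_mask(seq_len=6, window=6))

print(swa_counts)   # [1, 2, 3, 3, 3, 3]
print(full_counts)  # [1, 2, 3, 4, 5, 6]
```

With a window, the per token count stops growing once the window is full, whereas under full causal attention it keeps growing with position.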
The cross decoder, which is the second stage, is where the new Gated Memory Unit (GMU) lives. A standard cross decoder would interleave several layers of cross attention against the self decoder's outputs, which is expensive in both compute and memory. SambaY replaces roughly half of those cross attention layers with GMUs. A GMU is an element wise gating function that reuses the hidden state from the final SSM layer of the self decoder, so it does not have to recompute attention against the full prefix. The paper describes the GMU as a way to share memory readout states from the Samba based self decoder, which keeps representation sharing across layers without redundant computation.
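The gating idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's code: it assumes the gate is a SiLU activation applied to a linear projection of the current hidden state, multiplied element wise with the shared memory state and followed by an output projection. The weights, dimensions, and function names are all hypothetical.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x). Assumed here as the gate nonlinearity."""
    return x / (1.0 + math.exp(-x))


def matvec(weights: list[list[float]], vec: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def gated_memory_unit(x, memory, w_gate, w_out):
    """Hypothetical GMU sketch.

    x:      current layer hidden state for one token, length d
    memory: shared state from the self decoder for that token, length d_m
    w_gate: d_m x d projection producing one gate value per memory channel
    w_out:  d x d_m projection back to the hidden size
    """
    gate = [silu(g) for g in matvec(w_gate, x)]
    gated = [m * g for m, g in zip(memory, gate)]  # element wise gating
    return matvec(w_out, gated)


# Toy values, d = 2 and d_m = 3, chosen only so the arithmetic is easy to follow.
x = [1.0, -2.0]
memory = [0.5, 0.25, -1.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

y = gated_memory_unit(x, memory, w_gate, w_out)
print(y)
```

Nothing in this computation iterates over earlier tokens, so under these assumptions the per token cost depends on the hidden sizes and not on how long the prefix is.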
The architecture also draws on differential attention from related Microsoft research. Together these choices give the model linear prefill time complexity in the sequence length and lower decoder I/O than a standard Transformer of the same parameter count. Crucially, SambaY does not need an explicit positional encoding scheme such as RoPE for its long context behavior, because the State Space Model layers carry order information through their state.
The table below summarizes the headline architectural choices, drawn from the Hugging Face model card and the arXiv paper.
| Attribute | Value |
|---|---|
| Parameters | 3.8 billion |
| Architecture | SambaY (decoder hybrid decoder) |
| Self decoder | Mamba SSM, Sliding Window Attention, one full attention layer |
| Cross decoder | Cross attention layers interleaved with Gated Memory Units |
| Attention features | Grouped query attention, single global attention layer, differential attention |
| Context length | 64,000 tokens |
| Vocabulary | 200,064 tokens |
| Embeddings | Shared input and output (tied) |
| Positional encoding | None (state carries order) |
| Precision | bfloat16 |
| KV cache | Shared key value cache |
The 200,064 entry vocabulary and tied embeddings are inherited from Phi-4-mini. The 64K context window is shorter than Phi-4-mini's 128K window, which reflects the fact that the flash variant is tuned for long generations rather than long inputs. In practice the model is intended for prompts in the low thousands of tokens with reasoning traces that can stretch out to 32K tokens or more.
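A simple budget check makes the trade off concrete. The sketch below assumes that prompt and generated tokens share the 64,000 token window listed in the table; the helper name is hypothetical and not part of any library.

```python
CONTEXT_LENGTH = 64_000  # context length as listed in the table


def max_new_tokens(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Tokens left for generation, assuming prompt and output share one window."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_length - prompt_tokens, 0)


# A prompt of about 2,000 tokens leaves room for a long reasoning trace.
remaining = max_new_tokens(2_000)
print(remaining)  # 62000
```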
The pretraining recipe carries over the synthetic data approach from earlier Phi releases but uses a different compute profile because of the hybrid architecture. According to the Hugging Face model card, Phi-4-mini-flash-reasoning was pretrained on 5 trillion tokens using 1,024 NVIDIA A100 80GB GPUs over 14 days. The reasoning post training stage ran on 128 NVIDIA H100 80GB GPUs for 2 days using 150 billion tokens of reasoning data.
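The reported figures translate into rough GPU hour totals. The arithmetic below is a back of the envelope sketch that assumes every GPU ran continuously for the stated number of days, which the model card does not confirm.

```python
def gpu_hours(num_gpus: int, days: int) -> int:
    """GPU hours, assuming all GPUs run 24 hours a day for the whole period."""
    return num_gpus * days * 24


pretrain_hours = gpu_hours(num_gpus=1024, days=14)  # A100 80GB pretraining
posttrain_hours = gpu_hours(num_gpus=128, days=2)   # H100 80GB reasoning stage

print(pretrain_hours)   # 344064
print(posttrain_hours)  # 6144

# Rough pretraining throughput implied by 5 trillion tokens.
tokens_per_gpu_hour = 5e12 / pretrain_hours
print(f"{tokens_per_gpu_hour:,.0f} tokens per A100 hour")
```

Under that assumption the post training stage uses under 2 percent of the GPU hours of pretraining, although the two stages ran on different hardware and are not directly comparable.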
The reasoning corpus is dominated by synthetic mathematical content, including over 1 million math problems spanning middle school through Ph.D. level, and roughly 30 billion tokens of verified mathematical content. Microsoft used DeepSeek R1 to generate problem and solution traces for that corpus. The data composition is similar to the corpus used for Phi-4-mini-reasoning, which makes the architecture itself the main controlled difference between the two models.
The post training pipeline is multi stage supervised fine tuning followed by Direct Preference Optimization. The model card notes that Phi-4-mini-flash-reasoning does not use reinforcement learning, which separates it from Phi-4-mini-reasoning. The training cutoff for source data is May 2025.
The model is a specialist rather than a generalist. The Hugging Face card states explicitly that it is designed for advanced math reasoning and is not intended for general purpose language understanding tasks. The table below summarizes its intended uses and supported workflows.
| Capability | Notes |
|---|---|
| Multi step math reasoning | Primary use; trained on synthetic problem solution traces |
| Formal proof generation | Supported via chain of thought style outputs |
| Symbolic computation | Listed in model card as an intended use |
| Long generation under tight memory budgets | SambaY design keeps decoder I/O low |
| Edge and mobile deployment | Targeted use case in Azure announcement |
| Educational and tutoring applications | Cited as a primary scenario |
| General chat and world knowledge | Not the intended use; limited capacity at 3.8B parameters |
| Non English reasoning | Limited; English is the primary training language |
| Code generation | Mostly Python; other languages require verification |
Extended conversational sessions can produce repetition or drift, and the model is not appropriate for high risk legal or medical advice without additional safeguards. Microsoft also notes elevated defect rates on election related queries, in line with its broader responsible AI guidance.
The headline efficiency claim from both the Azure blog and the Hugging Face card is up to 10 times higher decoding throughput than Phi-4-mini-reasoning on a workload with 2K token prompts and 32K token generations. The same comparison reports a 2 to 3 times average reduction in latency.
The driver behind those numbers is the SambaY architecture itself. Because most of the sequence processing in the self decoder runs on State Space Model layers, prefill cost grows linearly with prompt length instead of quadratically. The Gated Memory Units in the cross decoder remove the need to attend over the full prefix in roughly half of the layers that would otherwise carry cross attention. Full attention is confined to a single layer, so its cost over the whole context is paid in one layer per forward pass rather than in every layer.
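To see why limiting full attention matters on this workload, one can count how many cached positions a single layer reads while decoding. The sketch below is a counting exercise only: it assumes 2K means 2,048 and 32K means 32,768 tokens, uses a hypothetical window of 512, and does not model real throughput or reproduce Microsoft's measurements.

```python
def full_attention_reads(prompt: int, gen: int) -> int:
    """Cached positions read by one full attention layer while decoding.

    The t-th generated token attends over the prompt plus the t tokens
    produced so far (counting itself).
    """
    return sum(prompt + t for t in range(1, gen + 1))


def sliding_window_reads(prompt: int, gen: int, window: int) -> int:
    """Same count for a sliding window layer, capped at `window` positions."""
    return sum(min(prompt + t, window) for t in range(1, gen + 1))


PROMPT = 2_048   # assumed value for a 2K prompt
GEN = 32_768     # assumed value for a 32K generation
WINDOW = 512     # hypothetical window size, not taken from the model card

full_reads = full_attention_reads(PROMPT, GEN)
swa_reads = sliding_window_reads(PROMPT, GEN, WINDOW)

print(full_reads)  # 603996160
print(swa_reads)   # 16777216
print(f"ratio: {full_reads / swa_reads:.1f}x")
```

A State Space Model layer reads a fixed size state at every step, so its count would grow only with the number of generated tokens. The ratio printed here depends on the assumed window and describes one layer, so it should not be read as a speedup estimate for the whole model.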
Microsoft tested the model on NVIDIA A100 GPUs with vLLM and on H100 GPUs through Azure AI Foundry. Flash Attention 2 support is a hardware requirement, which limits deployment to recent NVIDIA accelerators. SGLang offers streaming support, and NVIDIA NIM packages the model as a commercial inference microservice.
The Hugging Face model card publishes a head to head comparison against Phi-4-mini-reasoning and several DeepSeek R1 distilled models. The numbers below are taken directly from that card. All scores are pass@1 accuracy in percent, averaged over 64 samples for AIME and over 8 samples for Math500 and GPQA Diamond.
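Averaging pass at 1 over several samples can be sketched as follows. This is an assumption about the procedure rather than the card's evaluation code: each problem is sampled several times, the fraction of correct samples is taken per problem, and those fractions are averaged. The function name and the toy results are hypothetical.

```python
def mean_pass_at_1(results: list[list[bool]]) -> float:
    """Average single sample accuracy, in percent.

    results[p][s] is True when sample s for problem p was graded correct.
    """
    per_problem = [sum(samples) / len(samples) for samples in results]
    return 100.0 * sum(per_problem) / len(per_problem)


# Two toy problems with four samples each (a real run would use 8 or 64).
toy_results = [
    [True, False, True, True],    # 3 of 4 correct
    [False, False, True, False],  # 1 of 4 correct
]

score = mean_pass_at_1(toy_results)
print(score)  # 50.0
```

Averaging over many samples reduces the run to run variance that a single sampled answer per problem would show, which matters on small test sets such as AIME.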
| Model | Parameters | AIME 2024 | AIME 2025 | Math500 | GPQA Diamond |
|---|---|---|---|---|---|
| Phi-4-mini-flash-reasoning | 3.8B | 52.29 | 33.59 | 92.45 | 45.08 |
| Phi-4-mini-reasoning | 3.8B | 48.13 | 31.77 | 91.20 | 44.51 |
| DeepSeek-R1-Distill-Qwen-7B | 7B | 53.70 | 35.94 | 93.03 | 47.85 |
| DeepSeek-R1-Distill-Llama-8B | 8B | 43.96 | 27.34 | 87.48 | 45.83 |
A few patterns are worth calling out. Phi-4-mini-flash-reasoning beats its dense Transformer sibling Phi-4-mini-reasoning on every benchmark in the table, with the largest gap on AIME 2024 (52.29 versus 48.13). It trails DeepSeek-R1-Distill-Qwen-7B on all four benchmarks, which is expected given that the Qwen distill has almost twice the parameter count, but the gap is narrow on Math500 (92.45 versus 93.03). Against the larger DeepSeek-R1-Distill-Llama-8B, Phi-4-mini-flash-reasoning leads on AIME 2024, AIME 2025, and Math500 while losing slightly on GPQA Diamond (45.08 versus 45.83).
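The comparisons can be rechecked directly from the table. The short script below copies the published scores and computes the per benchmark difference between Phi-4-mini-flash-reasoning and each other model; the variable and function names are illustrative.

```python
# Scores copied from the benchmark table (pass@1, percent).
SCORES = {
    "Phi-4-mini-flash-reasoning": {
        "AIME 2024": 52.29, "AIME 2025": 33.59, "Math500": 92.45, "GPQA Diamond": 45.08,
    },
    "Phi-4-mini-reasoning": {
        "AIME 2024": 48.13, "AIME 2025": 31.77, "Math500": 91.20, "GPQA Diamond": 44.51,
    },
    "DeepSeek-R1-Distill-Qwen-7B": {
        "AIME 2024": 53.70, "AIME 2025": 35.94, "Math500": 93.03, "GPQA Diamond": 47.85,
    },
    "DeepSeek-R1-Distill-Llama-8B": {
        "AIME 2024": 43.96, "AIME 2025": 27.34, "Math500": 87.48, "GPQA Diamond": 45.83,
    },
}

FLASH = "Phi-4-mini-flash-reasoning"


def gaps(model: str, baseline: str) -> dict[str, float]:
    """Score of `model` minus score of `baseline`, per benchmark."""
    return {
        bench: round(SCORES[model][bench] - SCORES[baseline][bench], 2)
        for bench in SCORES[model]
    }


for other in SCORES:
    if other != FLASH:
        print(other, gaps(FLASH, other))
```

A positive value means the flash variant scores higher on that benchmark, and a negative value means it scores lower.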
The paper frames these numbers as evidence that the SambaY architecture does not sacrifice reasoning quality to gain its efficiency. In the paper's words, the flash variant achieves significantly better performance than Phi-4-mini-reasoning on reasoning tasks while delivering up to 10 times higher decoding throughput.
Microsoft released Phi-4-mini-flash-reasoning under the MIT License, one of the most permissive licenses in widespread use. It allows commercial use, modification, redistribution, private use, and sublicensing, with the only requirement being that the original copyright and license notice be included in all copies or substantial portions of the software.
This matches the licensing of every other model in the Phi-4 family released in 2025, including Phi-4, Phi-4-mini, Phi-4-multimodal, Phi-4-mini-reasoning, and Phi-4 Reasoning. The licensing posture is more permissive than Meta's Llama 3.x community license and Google's Gemma terms of use, both of which include use case restrictions. The MIT license has helped the Phi family spread quickly through open weight tooling such as vLLM, Ollama, llama.cpp, and downstream quantizations.
As with the rest of the Phi family, the weights are open but the training data is not redistributed. Phi-4-mini-flash-reasoning is therefore an open weight model rather than a fully open source model in the strict sense; the architecture, code, and weights are open, while the training corpus is not.
The table below collects published specifications and headline benchmark scores for five models in the Phi-4 family: the two base models and the three reasoning variants. Math500 and AIME numbers are taken from each model's Hugging Face model card or technical report.
| Model | Parameters | Backbone | Context | Reasoning training | Math500 | AIME 2024 | Released |
|---|---|---|---|---|---|---|---|
| Phi-4 | 14B | Dense Transformer | 16K | None (base) | n/a | n/a | Dec 2024 |
| Phi-4-mini | 3.8B | Dense Transformer | 128K | None (base) | n/a | n/a | Feb 2025 |
| Phi-4-mini-reasoning | 3.8B | Dense Transformer | 128K | SFT, DPO, RL on synthetic math | 91.20 | 48.13 | Apr 2025 |
| Phi-4 Reasoning | 14B | Dense Transformer | 32K | SFT, DPO, RL on synthetic math | n/a | n/a | Apr 2025 |
| Phi-4-mini-flash-reasoning | 3.8B | SambaY hybrid SSM and attention | 64K | SFT and DPO on synthetic math | 92.45 | 52.29 | Jul 2025 |
Two differences between Phi-4-mini-reasoning and Phi-4-mini-flash-reasoning are worth emphasizing. The first is the backbone: the reasoning variant uses the same dense Transformer as Phi-4-mini, while the flash variant uses SambaY. The second is the training pipeline: Phi-4-mini-reasoning includes a reinforcement learning stage on top of supervised fine tuning and Direct Preference Optimization, while Phi-4-mini-flash-reasoning skips RL entirely and relies on multi stage SFT and DPO. The shorter context window of the flash variant (64K versus 128K) reflects its emphasis on long generation rather than long input.
Against the larger 14 billion parameter Phi-4 Reasoning, Phi-4-mini-flash-reasoning trades raw accuracy on hard math benchmarks for a much smaller memory footprint and much faster decoding. The two models target different deployment scenarios: Phi-4 Reasoning is intended for server side inference where capacity matters most, and Phi-4-mini-flash-reasoning is intended for edge and mobile workloads where latency and memory matter more.
Reception inside the open weight community focused on the architectural novelty more than the benchmark numbers. Coverage on MarkTechPost, several Medium technical deep dives, and the Microsoft Azure Insider blog highlighted SambaY as the first time a major lab had shipped a production reasoning model with a hybrid State Space Model and attention design. The Mamba and Mamba 2 community had spent two years showing that SSMs could match Transformers on standard language modeling benchmarks; Phi-4-mini-flash-reasoning was widely framed as the first commercial deployment of those ideas inside a reasoning specialist.
The Gated Memory Unit attracted attention as a specific technique. Several reviewers pointed out that replacing half of the cross attention layers with a cheaper element wise gate is a simple idea that may transfer to other hybrid stacks, and the paper's NeurIPS 2025 acceptance gave it additional visibility.
Criticism centered on the same issues that affect every small reasoning model. The 3.8 billion parameter budget limits factual knowledge, and the model can produce inaccurate facts on world knowledge prompts. Performance on non English queries degrades noticeably. The hardware requirement for Flash Attention 2 support restricts deployment to recent NVIDIA accelerators, which complicates use on Apple Silicon and on older edge devices. And as with every Phi release, the training corpus is not redistributed, so the model's behavior cannot be independently reproduced or audited at the data level.
On the commercial side, the model became part of Microsoft's small model strategy for the second half of 2025 alongside Phi-4-mini, Phi-4-multimodal, and Phi-4 Reasoning. It was included as a default option in Azure AI Foundry's reasoning tier and packaged as a NIM microservice through the NVIDIA API Catalog.