Phi-4-mini
Last reviewed
May 16, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 ยท 3,200 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 ยท 3,200 words
Add missing citations, update stale details, or suggest a clearer explanation.
Phi-4-mini is a 3.8 billion parameter open weight small language model released by Microsoft Research on February 26, 2025. It is the compact text only entry in the second wave of the Phi-4 family, sitting alongside the larger 14 billion parameter Phi-4 and the multimodal Phi-4-multimodal model. Phi-4-mini is a dense, decoder only Transformer trained on roughly 5 trillion tokens of curated educational text, code, and synthetic data, with a 128,000 token context window and a 200,064 entry vocabulary that was deliberately expanded for multilingual support across 22 languages. Microsoft released the weights under the MIT license through Hugging Face, the Azure AI Foundry Model Catalog, GitHub Models, Ollama, and the NVIDIA API Catalog.
The model continues the Phi series philosophy that began with Phi-1 and Phi-3: high quality data, especially synthetic textbook style training data, can produce a small model that competes with much larger systems on reasoning, math, and coding tasks. On standard small model benchmarks, Phi-4-mini outperforms Llama 3.2 3B Instruct and most other models in the 3B to 4B class. It is also the language backbone of Phi-4-multimodal, a 5.6 billion parameter model that bolts vision and speech encoders onto the same frozen Phi-4-mini weights using a mixture of LoRA adapters. Together, the two models are positioned for on device deployment, low latency inference, and privacy sensitive workloads on consumer hardware.
The Phi project started inside Microsoft Research with a 2023 paper titled Textbooks Are All You Need, which argued that training a 1.3 billion parameter Python coding model on a small but high quality synthetic dataset could rival models several times its size. That experiment, Phi-1, kicked off a sequence of releases (Phi-1.5, Phi-2, and the Phi-3 line) that all kept the same recipe: relatively small dense Transformers, heavy investment in synthetic data, and a focus on tasks the team called textbook reasoning. Each generation pushed the parameter budget up modestly while adding to the Small Language Models conversation that emerged through 2024.
Phi-4 itself debuted in December 2024 as a 14 billion parameter dense model that beat several larger systems on math benchmarks. When Microsoft refreshed the line on February 26, 2025, the company added two siblings rather than a single successor. Phi-4-mini takes the data recipe and architecture lessons from Phi-4 and ports them to a 3.8 billion parameter footprint that can run on a laptop or a phone class accelerator. Phi-4-multimodal, announced the same day, glues vision and speech adapters onto the Phi-4-mini base. Both releases were positioned in the Azure blog as part of Microsoft's effort to push capable models out of the data center and onto edge devices, where latency, memory, and privacy constraints rule out frontier scale systems.
The Phi family also sits inside a broader market shift. By early 2025, the gap between very small open models (1B to 4B parameters) and mid size open models (7B to 13B) had narrowed sharply because of better training data, instruction tuning, and reasoning distillation. Llama 3.2, Qwen 2.5, Gemma 2, and SmolLM had all released competitive small models in the prior six months. Phi-4-mini is Microsoft's answer in that segment, and Microsoft's marketing emphasizes its strength on math and reasoning rather than raw multilingual knowledge.
Phi-4-mini is a dense decoder only Transformer. It is not a mixture of experts model, and it does not use any sparse attention or routing tricks at inference time. The architecture is designed to be straightforward enough that it runs cleanly through standard inference engines such as vLLM, llama.cpp, and ONNX Runtime, and to map well onto consumer GPUs, NPUs, and Apple Silicon.
The table below summarizes the model's headline parameters as documented in the technical report and the Hugging Face model card.
| Attribute | Value |
|---|---|
| Parameters | 3.8 billion |
| Transformer blocks | 32 |
| Hidden size | 3,072 |
| Query heads | 24 |
| Key/value heads | 8 |
| Attention mechanism | Grouped query attention (GQA) |
| Vocabulary | 200,064 tokens (o200k_base tiktoken) |
| Embeddings | Shared input/output (tied) |
| Context length | 128,000 tokens |
| Long context method | LongRoPE |
| Precision | bfloat16 |
Grouped query attention is one of the most important practical choices. Rather than giving every query head its own key and value projection, Phi-4-mini uses 24 query heads but only 8 key/value heads, so each KV head is shared by 3 query heads. The result is a much smaller KV cache during inference, which matters far more than parameter count when serving long contexts. For a 128K token prompt, the KV cache savings can be the difference between fitting on a single consumer GPU and not.
The shared input and output embedding is a second compactness trick borrowed from earlier Phi generations and from models like Gemma. Tying the embedding and the language modeling head means the same 200,064 by 3,072 matrix is used at both ends of the network. That alone saves roughly 615 million parameters that would otherwise live in a separate output projection, and frees up budget for the Transformer blocks themselves.
The 200,064 entry vocabulary is unusually large for a model in this class. Microsoft adopted OpenAI's o200k_base tokenizer (the same tokenizer used by GPT-4o) specifically to give better coverage for non English scripts. A larger vocabulary means fewer tokens per word in languages like Chinese, Japanese, Korean, Arabic, Hebrew, and Thai, which both reduces inference cost and improves quality on multilingual benchmarks. The trade off is that the embedding matrix itself eats a sizable fraction of the model's parameters, but the tied embedding partially offsets that cost.
Long context support up to 128,000 tokens is implemented with LongRoPE, the positional encoding scheme Microsoft introduced in 2024. LongRoPE rescales rotary position embeddings in a way that lets a model trained mostly on shorter contexts extrapolate cleanly out to far longer prompts. In Phi-4-mini, the pretraining phase used shorter contexts and the long context behavior was extended in a post training stage.
Phi-4-mini was pretrained on roughly 5 trillion tokens, larger and, per Microsoft, of higher quality than the dataset used for Phi-3.5-mini. The training corpus mixes three sources: filtered high quality web data, code from public repositories, and a substantial volume of synthetic data generated by larger models in the Phi family. The synthetic data emphasizes math, reasoning, and code, which is the part of the distribution Microsoft has historically pushed hardest.
NVIDIA's developer blog, which co published a deployment article about the model, reports that Phi-4-mini was trained on 1,024 NVIDIA A100 80GB GPUs for 14 days, with a data cutoff of June 2024 for publicly sourced material. The Hugging Face model card lists a slightly different figure (512 A100 80GB GPUs over 21 days), which likely reflects different points in the training pipeline; both numbers refer to A100 80GB hardware at Microsoft scale.
The post training stack is where Phi-4-mini gains most of its instruction following and function calling polish. Microsoft documents three post training stages:
The instruction tuned checkpoint, distributed as microsoft/Phi-4-mini-instruct, is what most users actually run. The base model is also available but receives less attention because the instruction tuned version already supports system prompts, tool calling, and chat formatting out of the box.
Function calling deserves a specific note because Phi-4-mini is one of the few models its size with first class tool use baked into the chat template. The model uses special <|tool|> and <|/tool|> tags to declare available tools and emit calls, and the post training set includes synthetic tool use trajectories. In practice this means Phi-4-mini can drive a local agent loop on a laptop without an external orchestration layer.
Although they share the same backbone, Phi-4-mini and Phi-4-multimodal are distinct releases with different intended uses.
Phi-4-mini-instruct is text only. It is the 3.8 billion parameter chat model described in the architecture section above, optimized for instruction following, function calling, reasoning, and multilingual chat. It is the default choice for developers who want a small dense language model to embed in an application, run on a consumer GPU, or expose through Ollama.
Phi-4-multimodal wraps the same frozen Phi-4-mini weights with vision and audio encoders, then attaches separate LoRA adapters for each modality. The total parameter count is roughly 5.6 billion. The architecture is what Microsoft calls a mixture of LoRAs:
Because the base Phi-4-mini weights stay frozen while the modality LoRAs are trained, Phi-4-multimodal preserves the text capabilities of Phi-4-mini while adding vision and speech understanding. The model can also combine modalities at inference time. The most cited result from the release is that Phi-4-multimodal climbed to the top of the Hugging Face OpenASR leaderboard for English automatic speech recognition with a word error rate of 6.14 percent, beating Whisper V3 and SeamlessM4T v2 Large on multiple speech benchmarks.
Microsoft also released a reasoning specialist variant later in the cycle. Phi-4-mini-reasoning, made available in April 2025, uses the same 3.8 billion parameter architecture but is fine tuned on roughly 150 billion tokens of synthetic math content distilled from DeepSeek R1. On AIME 2024 it scores 57.5, on MATH-500 it scores 94.6, and on GPQA Diamond it scores 52.0; the base Phi-4-mini scores 10.0, 71.8, and 36.9 on the same benchmarks. A separate Phi-4-mini-flash-reasoning variant, optimized for latency in reasoning workloads, followed soon after.
The Hugging Face model card for Phi-4-mini-instruct includes a head to head comparison against the previous generation Phi-3.5-mini, Llama 3.2 3B Instruct, Qwen 2.5 7B Instruct, and GPT-4o-mini. Phi-4-mini is the smallest model in the comparison; Qwen 2.5 7B is about twice its size. All numbers below are taken directly from Microsoft's published results.
| Benchmark | Phi-4-mini-instruct (3.8B) | Phi-3.5-mini-instruct (3.8B) | Llama 3.2 3B Instruct | Qwen 2.5 7B Instruct | GPT-4o-mini |
|---|---|---|---|---|---|
| MMLU (5-shot) | 67.3 | 65.5 | 61.8 | 72.6 | 77.2 |
| MMLU-Pro (0-shot, CoT) | 52.8 | 47.4 | 39.2 | 56.2 | 62.8 |
| GSM8K (8-shot, CoT) | 88.6 | 76.9 | 75.6 | 88.7 | 91.3 |
| MATH (0-shot, CoT) | 64.0 | 49.8 | 46.7 | 60.4 | 70.2 |
| BigBench Hard (0-shot, CoT) | 70.4 | 63.1 | 55.4 | 72.4 | 80.4 |
| ARC Challenge (10-shot) | 83.7 | 84.6 | 76.1 | 90.1 | 93.5 |
| HellaSwag (5-shot) | 69.1 | 72.2 | 77.2 | 80.0 | 88.7 |
| GPQA (0-shot, CoT) | 25.2 | 26.6 | 24.3 | 30.6 | 41.1 |
| Arena Hard | 32.8 | 34.4 | 17.0 | 55.5 | 53.7 |
| Multilingual MMLU (5-shot) | 49.3 | 51.8 | 48.1 | 64.4 | 72.9 |
| MGSM (0-shot, CoT) | 63.9 | 49.6 | 44.6 | 64.5 | 81.7 |
| Overall aggregate | 63.5 | 60.5 | 56.2 | 67.9 | 75.5 |
A few patterns jump out of the table. Phi-4-mini's gains over Phi-3.5-mini are concentrated in reasoning heavy tasks: MATH jumps from 49.8 to 64.0, GSM8K from 76.9 to 88.6, and BigBench Hard from 63.1 to 70.4. Those gains track the team's emphasis on synthetic reasoning data in pretraining and chain of thought training in post training. The model does not improve on every benchmark, though. HellaSwag, a commonsense benchmark, actually drops from 72.2 to 69.1, and ARC Challenge and GPQA slip slightly. The team appears to have made an explicit trade off in favor of reasoning over rote commonsense recall.
Against Llama 3.2 3B Instruct, the most direct same size competitor, Phi-4-mini leads on nearly every benchmark in the table, with the gap widest on math and reasoning (MATH 64.0 vs 46.7, MGSM 63.9 vs 44.6, BigBench Hard 70.4 vs 55.4). Llama wins on HellaSwag and is roughly tied on Multilingual MMLU. Against Qwen 2.5 7B, which has nearly twice the parameter count, Phi-4-mini is competitive on math (GSM8K 88.6 vs 88.7) but trails on most knowledge heavy benchmarks. Against GPT-4o-mini, a much larger closed model, Phi-4-mini predictably trails across the board, but the gap on GSM8K (88.6 vs 91.3) is narrower than the parameter count difference would suggest.
On the reasoning specialist front, Phi-4-mini-reasoning's 57.5 on AIME 2024 and 94.6 on MATH-500 put it ahead of DeepSeek-R1-Distill-Qwen-7B (53.3 / 91.4) and within reach of OpenAI's o1-mini (63.6 / 90.0) despite being roughly half the parameter count.
Microsoft released Phi-4-mini, Phi-4-mini-instruct, Phi-4-mini-reasoning, and Phi-4-multimodal under the MIT license. The MIT license is one of the most permissive licenses in widespread use. It allows commercial use, modification, redistribution, private use, and sublicensing, with the only requirement being that the original copyright and license notice be included in any substantial portion of the software.
This is a meaningful contrast with Llama 3.2, which ships under Meta's custom community license that imposes use case restrictions and a 700 million monthly active user threshold for commercial deployment. It is also more permissive than Gemma 2's Gemma terms of use, which include Google's prohibited use policy. Phi-4-mini's MIT license has no such restrictions, which has helped it spread quickly through the open weight ecosystem on Ollama, llama.cpp, vLLM, and downstream fine tunes such as Unsloth's GGUF quantizations.
The one footnote is that the model weights and license are governed by Microsoft, while the training data is not redistributed. Phi-4-mini is therefore an open weight model rather than a fully open source model in the sense used by some research groups. The architecture, code, and weights are open; the training corpus is not.
The table below collects published specifications and headline benchmark scores for the leading small open weight models in the 2 to 4 billion parameter range as of mid 2025. Numbers come from each model's official model card or technical report; the Phi-4-mini row reuses Microsoft's published figures.
| Model | Developer | Parameters | Context | License | MMLU | GSM8K | Released |
|---|---|---|---|---|---|---|---|
| Phi-4-mini-instruct | Microsoft | 3.8B | 128K | MIT | 67.3 | 88.6 | Feb 2025 |
| Phi-3.5-mini-instruct | Microsoft | 3.8B | 128K | MIT | 65.5 | 76.9 | Aug 2024 |
| Llama 3.2 3B Instruct | Meta | 3.2B | 128K | Llama 3.2 Community | 61.8 | 75.6 | Sep 2024 |
| Qwen 2.5 3B Instruct | Alibaba | 3.1B | 32K | Qwen Research License | 65.6 | 86.7 | Sep 2024 |
| Gemma 2 2B Instruct | 2.6B | 8K | Gemma Terms of Use | 51.3 | 30.3 | Jul 2024 | |
| Phi-4-multimodal | Microsoft | 5.6B | 128K | MIT | n/a | n/a | Feb 2025 |
The comparison points to a clear position. Phi-4-mini is the strongest 3B class model on reasoning and math benchmarks when restricted to permissively licensed weights. Qwen 2.5 3B comes closest on math but has a much shorter native context window and a more restrictive research license. Llama 3.2 3B matches the 128K context but trails on every reasoning benchmark. Gemma 2 2B is the smallest of the group and competes on speed rather than capability.
In practical deployment, Phi-4-mini's combination of MIT license, 128K context, GQA enabled small KV cache, and first class function calling support has made it a common default for on device assistants, retrieval augmented generation pipelines, and agent loops that run on consumer hardware. Microsoft's own Foundry Local runtime, NVIDIA's NIM microservices, and the Ollama community library all ship optimized builds.
Reception inside the open weight community was positive but measured. Reviewers on Hugging Face, the r/LocalLLaMA subreddit, and several developer blogs flagged the strong math performance and the function calling support as the most novel features. The model's 200,064 token vocabulary and its multilingual coverage were highlighted as notable for a 3.8 billion parameter model. The arXiv technical report (Microsoft, March 2025) became a frequent reference point for discussions about how far synthetic data scaling can carry a small model.
Criticism centered on a few familiar themes. The model's strength on reasoning benchmarks does not always transfer to open ended chat quality; Arena Hard scores of 32.8 trail Qwen 2.5 7B and other similarly sized peers, suggesting that human raters prefer the style of larger or more chat tuned models. The HellaSwag regression versus Phi-3.5-mini drew comments about whether the heavy emphasis on synthetic math content costs the model some breadth of world knowledge. And as with every Phi release, some researchers noted that Microsoft has never published the full training data composition, which makes it hard to independently reproduce or audit the model's behavior.
On the commercial side, Phi-4-mini and Phi-4-multimodal anchored Microsoft's small model strategy for 2025. Both became defaults in Azure AI Foundry's small model tier, and the Phi-4-mini family expanded over the following months to include the reasoning specialist and the flash reasoning variant. NVIDIA promoted Phi-4-multimodal heavily in its developer materials as a showcase for the NIM microservice deployment pattern.