Phi-4-mini

AI Models Large Language Models Open Source AI Small Language Models

19 min read

Updated Jun 25, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 25, 2026

Fact-checked

In review queue

Sources

11 citations

Revision

v2 · 3,878 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Phi-4-mini is a 3.8 billion parameter open weight small language model released by Microsoft on February 26, 2025, under the permissive MIT license.^[1]^[4] It is the compact, text only entry in the second wave of the Phi family, designed to run on laptops, phones, and other memory or latency constrained hardware while delivering strong math, reasoning, coding, and function calling for its size.^[1]^[4] Microsoft describes it as "a lightweight open model built upon synthetic data and filtered publicly available websites, with a focus on high-quality, reasoning dense data," trained on roughly 5 trillion tokens with a 128,000 token context window and a 200,064 entry vocabulary.^[1]

Phi-4-mini sits alongside the larger 14 billion parameter Phi-4 and the multimodal Phi-4-multimodal model, both refreshed in the same February 2025 release.^[4] It is a dense, decoder only Transformer trained on curated educational text, code, and synthetic data, with the vocabulary deliberately expanded for multilingual support across 22 languages.^[1]^[6] Microsoft released the weights under the MIT license through Hugging Face, the Azure AI Foundry Model Catalog, GitHub Models, Ollama, and the NVIDIA API Catalog, and later extended the line with two reasoning specialists: Phi-4-mini-reasoning (April 2025) and Phi-4-mini-flash-reasoning (July 2025).^[1]^[3]^[9]

The model continues the Phi series philosophy that began with Phi-1 and Phi-3: high quality data, especially synthetic textbook style training data, can produce a small model that competes with much larger systems on reasoning, math, and coding tasks.^[6] On standard small model benchmarks, Phi-4-mini outperforms Llama 3.2 3B Instruct and most other models in the 3B to 4B class.^[1] It is also the language backbone of Phi-4-multimodal, a 5.6 billion parameter model that bolts vision and speech encoders onto the same frozen Phi-4-mini weights using a mixture of LoRA adapters.^[2]^[6] Together, the two models are positioned for on device deployment, low latency inference, and privacy sensitive workloads on consumer hardware.^[4]

What is Phi-4-mini?

Phi-4-mini is Microsoft's 3.8 billion parameter small language model: a single dense Transformer optimized for three deployment profiles that Microsoft calls out explicitly: memory and compute constrained environments, latency bound scenarios, and strong reasoning, especially math and logic.^[1] In plain terms, it is a model small enough to run on a laptop or a phone class accelerator that still scores in the high 80s on grade school math (GSM8K) and supports prompts up to 128,000 tokens.^[1]

The "mini" name distinguishes it from the larger 14 billion parameter Phi-4. The most widely used checkpoint is the instruction tuned microsoft/Phi-4-mini-instruct, which ships with chat formatting, system prompts, and tool calling enabled out of the box.^[1] A separate base (pretrained, non instruction tuned) checkpoint is also published but receives far less use.

Background

The Phi project started inside Microsoft Research with a 2023 paper titled Textbooks Are All You Need, which argued that training a 1.3 billion parameter Python coding model on a small but high quality synthetic dataset could rival models several times its size. That experiment, Phi-1, kicked off a sequence of releases (Phi-1.5, Phi-2, and the Phi-3 line) that all kept the same recipe: relatively small dense Transformers, heavy investment in synthetic data, and a focus on tasks the team called textbook reasoning. Each generation pushed the parameter budget up modestly while adding to the small language model conversation that emerged through 2024.

Phi-4 itself debuted in December 2024 as a 14 billion parameter dense model that beat several larger systems on math benchmarks. When Microsoft refreshed the line on February 26, 2025, the company added two siblings rather than a single successor.^[4] Phi-4-mini takes the data recipe and architecture lessons from Phi-4 and ports them to a 3.8 billion parameter footprint that can run on a laptop or a phone class accelerator. Phi-4-multimodal, announced the same day, glues vision and speech adapters onto the Phi-4-mini base.^[2]^[4] Both releases were positioned in the Azure blog as part of Microsoft's effort to push capable models out of the data center and onto edge devices, where latency, memory, and privacy constraints rule out frontier scale systems.^[4]

The Phi family also sits inside a broader market shift. By early 2025, the gap between very small open models (1B to 4B parameters) and mid size open models (7B to 13B) had narrowed sharply because of better training data, instruction tuning, and reasoning distillation. Llama 3.2, Qwen 2.5, Gemma 2, and SmolLM had all released competitive small models in the prior six months. Phi-4-mini is Microsoft's answer in that segment, and Microsoft's marketing emphasizes its strength on math and reasoning rather than raw multilingual knowledge.^[4]

How is Phi-4-mini built?

Phi-4-mini is a dense decoder only Transformer. It is not a mixture of experts model, and it does not use any sparse attention or routing tricks at inference time. The architecture is designed to be straightforward enough that it runs cleanly through standard inference engines such as vLLM, llama.cpp, and ONNX Runtime, and to map well onto consumer GPUs, NPUs, and Apple Silicon.

The table below summarizes the model's headline parameters as documented in the technical report and the Hugging Face model card.^[1]^[6]

Attribute	Value
Parameters	3.8 billion
Transformer blocks	32
Hidden size	3,072
Query heads	24
Key/value heads	8
Attention mechanism	Grouped query attention (GQA)
Vocabulary	200,064 tokens (o200k_base tiktoken)
Embeddings	Shared input/output (tied)
Context length	128,000 tokens
Long context method	LongRoPE
Precision	bfloat16

Grouped query attention is one of the most important practical choices. Rather than giving every query head its own key and value projection, Phi-4-mini uses 24 query heads but only 8 key/value heads, so each KV head is shared by 3 query heads.^[6] The result is a much smaller KV cache during inference, which matters far more than parameter count when serving long contexts. For a 128K token prompt, the KV cache savings can be the difference between fitting on a single consumer GPU and not.

The shared input and output embedding is a second compactness trick borrowed from earlier Phi generations and from models like Gemma. Tying the embedding and the language modeling head means the same 200,064 by 3,072 matrix is used at both ends of the network.^[6] That alone saves roughly 615 million parameters that would otherwise live in a separate output projection, and frees up budget for the Transformer blocks themselves.

The 200,064 entry vocabulary is unusually large for a model in this class. Microsoft adopted OpenAI's o200k_base tokenizer (the same tokenizer used by GPT-4o) specifically to give better coverage for non English scripts.^[1]^[6] A larger vocabulary means fewer tokens per word in languages like Chinese, Japanese, Korean, Arabic, Hebrew, and Thai, which both reduces inference cost and improves quality on multilingual benchmarks. The trade off is that the embedding matrix itself eats a sizable fraction of the model's parameters, but the tied embedding partially offsets that cost.

Long context support up to 128,000 tokens is implemented with LongRoPE, the positional encoding scheme Microsoft introduced in 2024. LongRoPE rescales rotary position embeddings in a way that lets a model trained mostly on shorter contexts extrapolate cleanly out to far longer prompts. In Phi-4-mini, the pretraining phase used shorter contexts and the long context behavior was extended in a post training stage.

How was Phi-4-mini trained?

Phi-4-mini was pretrained on roughly 5 trillion tokens, larger and, per Microsoft, of higher quality than the dataset used for Phi-3.5-mini.^[1]^[6] The training corpus mixes three sources: filtered high quality web data, code from public repositories, and a substantial volume of synthetic data generated by larger models in the Phi family. The synthetic data emphasizes math, reasoning, and code, which is the part of the distribution Microsoft has historically pushed hardest.

The Hugging Face model card lists the training run as 512 NVIDIA A100 80GB GPUs over 21 days, with a data cutoff of June 2024 for publicly sourced material.^[1] NVIDIA's developer blog, which co published a deployment article about the model, cites a 1,024 A100 80GB GPU figure over 14 days; both numbers refer to A100 80GB hardware at Microsoft scale and likely reflect different points in the training pipeline.^[7]

The post training stack is where Phi-4-mini gains most of its instruction following and function calling polish. Microsoft documents three post training stages:^[6]

Supervised fine tuning (SFT) on instruction data, including substantial code completion sets and a curated function calling dataset.
Direct Preference Optimization (DPO) to align the model against human or model judged preference pairs.
Reasoning oriented training on chain of thought rollouts from frontier models, which lifts the model's behavior on math and step by step problems without requiring a separate reasoning specialist.

The instruction tuned checkpoint, distributed as microsoft/Phi-4-mini-instruct, is what most users actually run.^[1] The base model is also available but receives less attention because the instruction tuned version already supports system prompts, tool calling, and chat formatting out of the box.

Function calling deserves a specific note because Phi-4-mini is one of the few models its size with first class tool use baked into the chat template. The model uses special <|tool|> and <|/tool|> tags to declare available tools and emit calls, and the post training set includes synthetic tool use trajectories.^[1] In practice this means Phi-4-mini can drive a local agent loop on a laptop without an external orchestration layer.

How does Phi-4-mini-instruct differ from Phi-4-multimodal?

Although they share the same backbone, Phi-4-mini and Phi-4-multimodal are distinct releases with different intended uses.

Phi-4-mini-instruct is text only. It is the 3.8 billion parameter chat model described in the architecture section above, optimized for instruction following, function calling, reasoning, and multilingual chat.^[1] It is the default choice for developers who want a small dense language model to embed in an application, run on a consumer GPU, or expose through Ollama.

Phi-4-multimodal wraps the same frozen Phi-4-mini weights with vision and audio encoders, then attaches separate LoRA adapters for each modality. The total parameter count is roughly 5.6 billion.^[2] The architecture is what Microsoft calls a mixture of LoRAs:^[6]

The vision branch uses a SigLIP-400M encoder fine tuned with the LLM2CLIP method, a 2 layer MLP projector, and a 370 million parameter vision LoRA adapter. Image input runs at 448 by 448 resolution with a dynamic multi crop strategy for higher resolution content.
The speech branch uses a 460 million parameter audio encoder made of 3 convolution layers followed by 24 Conformer blocks, with attention dimension 1024, feed forward dimension 1536, and 16 attention heads. It produces tokens at 80 millisecond intervals, roughly 750 tokens per minute of audio, and feeds into the language model through a 2 layer MLP projector and a 460 million parameter speech LoRA adapter.

Because the base Phi-4-mini weights stay frozen while the modality LoRAs are trained, Phi-4-multimodal preserves the text capabilities of Phi-4-mini while adding vision and speech understanding.^[6] The model can also combine modalities at inference time. The most cited result from the release is that Phi-4-multimodal climbed to the top of the Hugging Face OpenASR leaderboard for English automatic speech recognition with a word error rate of 6.14 percent, beating Whisper V3 and SeamlessM4T v2 Large on multiple speech benchmarks.^[2]^[4]

What is Phi-4-mini-reasoning?

Phi-4-mini-reasoning is a math focused fine tune of Phi-4-mini that Microsoft released in April 2025.^[3] It uses the same 3.8 billion parameter architecture but is trained on roughly 150 billion tokens of synthetic math content distilled from DeepSeek R1, then refined through reinforcement learning to specialize in multi step chain of thought reasoning.^[3] The result is a small model that posts reasoning scores well above its weight class.

Benchmark	Phi-4-mini-reasoning	Base Phi-4-mini
AIME 2024	57.5	10.0
MATH-500	94.6	71.8
GPQA Diamond	52.0	36.9

Those AIME 2024 (57.5) and MATH-500 (94.6) scores put Phi-4-mini-reasoning ahead of DeepSeek-R1-Distill-Qwen-7B (53.3 / 91.4) and within reach of OpenAI's o1-mini (63.6 / 90.0) despite being roughly half the parameter count.^[3]

What is Phi-4-mini-flash-reasoning?

Phi-4-mini-flash-reasoning is a latency optimized reasoning variant that Microsoft released on July 9, 2025.^[9] It keeps the 3.8 billion parameter budget and the math reasoning focus of Phi-4-mini-reasoning but replaces the standard Transformer with a new decoder hybrid decoder architecture called SambaY, and it supports a 64,000 token context window.^[9] The headline claim is up to 10 times higher throughput and 2 to 3 times lower latency than the previous reasoning model on long generation workloads (a 2,000 token prompt producing a 32,000 token answer).^[9]

The central innovation in SambaY is the Gated Memory Unit (GMU), a cheap element-wise gating mechanism that reuses the hidden state from the final state space model layer to share representations across layers and avoid redundant computation.^[9] The self decoder combines Mamba (a state space model) with sliding window attention plus a single full attention layer, while the cross decoder interleaves cross attention layers with the efficient GMUs. On the Math500 benchmark, Phi-4-mini-flash-reasoning reaches a pass@1 accuracy of about 92.5 percent.^[9]

Variant	Released	Context	Architecture	Focus
Phi-4-mini-instruct	Feb 2025	128K	Dense Transformer (GQA)	General chat, tools, math
Phi-4-mini-reasoning	Apr 2025	128K	Dense Transformer (GQA)	Math reasoning
Phi-4-mini-flash-reasoning	Jul 2025	64K	SambaY (decoder hybrid decoder)	Low latency math reasoning

How does Phi-4-mini perform on benchmarks?

The Hugging Face model card for Phi-4-mini-instruct includes a head to head comparison against the previous generation Phi-3.5-mini, Llama 3.2 3B Instruct, Qwen 2.5 7B Instruct, and GPT-4o-mini.^[1] Phi-4-mini is the smallest model in the comparison; Qwen 2.5 7B is about twice its size. All numbers below are taken directly from Microsoft's published results.^[1]

Benchmark	Phi-4-mini-instruct (3.8B)	Phi-3.5-mini-instruct (3.8B)	Llama 3.2 3B Instruct	Qwen 2.5 7B Instruct	GPT-4o-mini
MMLU (5-shot)	67.3	65.5	61.8	72.6	77.2
MMLU-Pro (0-shot, CoT)	52.8	47.4	39.2	56.2	62.8
GSM8K (8-shot, CoT)	88.6	76.9	75.6	88.7	91.3
MATH (0-shot, CoT)	64.0	49.8	46.7	60.4	70.2
BigBench Hard (0-shot, CoT)	70.4	63.1	55.4	72.4	80.4
ARC Challenge (10-shot)	83.7	84.6	76.1	90.1	93.5
HellaSwag (5-shot)	69.1	72.2	77.2	80.0	88.7
GPQA (0-shot, CoT)	25.2	26.6	24.3	30.6	41.1
Arena Hard	32.8	34.4	17.0	55.5	53.7
Multilingual MMLU (5-shot)	49.3	51.8	48.1	64.4	72.9
MGSM (0-shot, CoT)	63.9	49.6	44.6	64.5	81.7
Overall aggregate	63.5	60.5	56.2	67.9	75.5

A few patterns jump out of the table. Phi-4-mini's gains over Phi-3.5-mini are concentrated in reasoning heavy tasks: MATH jumps from 49.8 to 64.0, GSM8K from 76.9 to 88.6, and BigBench Hard from 63.1 to 70.4. Those gains track the team's emphasis on synthetic reasoning data in pretraining and chain of thought training in post training. The model does not improve on every benchmark, though. HellaSwag, a commonsense benchmark, actually drops from 72.2 to 69.1, and ARC Challenge and GPQA slip slightly. The team appears to have made an explicit trade off in favor of reasoning over rote commonsense recall.

Against Llama 3.2 3B Instruct, the most direct same size competitor, Phi-4-mini leads on nearly every benchmark in the table, with the gap widest on math and reasoning (MATH 64.0 vs 46.7, MGSM 63.9 vs 44.6, BigBench Hard 70.4 vs 55.4). Llama wins on HellaSwag and is roughly tied on Multilingual MMLU. Against Qwen 2.5 7B, which has nearly twice the parameter count, Phi-4-mini is competitive on math (GSM8K 88.6 vs 88.7) but trails on most knowledge heavy benchmarks. Against GPT-4o-mini, a much larger closed model, Phi-4-mini predictably trails across the board, but the gap on GSM8K (88.6 vs 91.3) is narrower than the parameter count difference would suggest.

Is Phi-4-mini open source, and what license does it use?

Microsoft released Phi-4-mini, Phi-4-mini-instruct, Phi-4-mini-reasoning, and Phi-4-multimodal under the MIT license.^[1]^[3]^[4] The MIT license is one of the most permissive licenses in widespread use. It allows commercial use, modification, redistribution, private use, and sublicensing, with the only requirement being that the original copyright and license notice be included in any substantial portion of the software.

This is a meaningful contrast with Llama 3.2, which ships under Meta's custom community license that imposes use case restrictions and a 700 million monthly active user threshold for commercial deployment. It is also more permissive than Gemma 2's Gemma terms of use, which include Google's prohibited use policy. Phi-4-mini's MIT license has no such restrictions, which has helped it spread quickly through the open weight ecosystem on Ollama, llama.cpp, vLLM, and downstream fine tunes such as Unsloth's GGUF quantizations.^[10]

The one footnote is that the model weights and license are governed by Microsoft, while the training data is not redistributed. Phi-4-mini is therefore an open weight model rather than a fully open source model in the sense used by some research groups. The architecture, code, and weights are open; the training corpus is not.

How does Phi-4-mini compare to other small models?

The table below collects published specifications and headline benchmark scores for the leading small open weight models in the 2 to 4 billion parameter range as of mid 2025. Numbers come from each model's official model card or technical report; the Phi-4-mini row reuses Microsoft's published figures.^[1]

Model	Developer	Parameters	Context	License	MMLU	GSM8K	Released
Phi-4-mini-instruct	Microsoft	3.8B	128K	MIT	67.3	88.6	Feb 2025
Phi-3.5-mini-instruct	Microsoft	3.8B	128K	MIT	65.5	76.9	Aug 2024
Llama 3.2 3B Instruct	Meta	3.2B	128K	Llama 3.2 Community	61.8	75.6	Sep 2024
Qwen 2.5 3B Instruct	Alibaba	3.1B	32K	Qwen Research License	65.6	86.7	Sep 2024
Gemma 2 2B Instruct	Google	2.6B	8K	Gemma Terms of Use	51.3	30.3	Jul 2024
Phi-4-multimodal	Microsoft	5.6B	128K	MIT	n/a	n/a	Feb 2025

The comparison points to a clear position. Phi-4-mini is the strongest 3B class model on reasoning and math benchmarks when restricted to permissively licensed weights. Qwen 2.5 3B comes closest on math but has a much shorter native context window and a more restrictive research license. Llama 3.2 3B matches the 128K context but trails on every reasoning benchmark. Gemma 2 2B is the smallest of the group and competes on speed rather than capability.

In practical deployment, Phi-4-mini's combination of MIT license, 128K context, GQA enabled small KV cache, and first class function calling support has made it a common default for on device assistants, retrieval augmented generation pipelines, and agent loops that run on consumer hardware. Microsoft's own Foundry Local runtime, NVIDIA's NIM microservices, and the Ollama community library all ship optimized builds.^[7]^[10]

Reception

Reception inside the open weight community was positive but measured. Reviewers on Hugging Face, the r/LocalLLaMA subreddit, and several developer blogs flagged the strong math performance and the function calling support as the most novel features. The model's 200,064 token vocabulary and its multilingual coverage were highlighted as notable for a 3.8 billion parameter model. The arXiv technical report (Microsoft, March 2025) became a frequent reference point for discussions about how far synthetic data scaling can carry a small model.^[6]

Criticism centered on a few familiar themes. The model's strength on reasoning benchmarks does not always transfer to open ended chat quality; Arena Hard scores of 32.8 trail Qwen 2.5 7B and other similarly sized peers, suggesting that human raters prefer the style of larger or more chat tuned models.^[1] The HellaSwag regression versus Phi-3.5-mini drew comments about whether the heavy emphasis on synthetic math content costs the model some breadth of world knowledge. And as with every Phi release, some researchers noted that Microsoft has never published the full training data composition, which makes it hard to independently reproduce or audit the model's behavior.

On the commercial side, Phi-4-mini and Phi-4-multimodal anchored Microsoft's small model strategy for 2025. Both became defaults in Azure AI Foundry's small model tier, and the Phi-4-mini family expanded over the following months to include the reasoning specialist and the flash reasoning variant.^[3]^[9] NVIDIA promoted Phi-4-multimodal heavily in its developer materials as a showcase for the NIM microservice deployment pattern.^[7]

ELI5: Phi-4-mini explained simply

Imagine a very small but very well taught student. Most big AI models learn by reading huge messy piles of the internet. Phi-4-mini instead learned mostly from clean, carefully written "textbook" style lessons, a lot of which were written by bigger AI models on purpose to teach it.^[6] Because the lessons were so good, this small student (only 3.8 billion "brain cells," compared with hundreds of billions in the biggest models) can still do hard math and follow instructions surprisingly well.^[1] It is small enough to live inside a laptop or phone instead of a giant data center, it can read a very long document at once (about 128,000 words worth of text), and it can even press buttons for you by calling tools and functions. Microsoft gave it away for free under the MIT license, so anyone can use it, change it, or build a product with it.^[1]^[4]

References

Microsoft. "microsoft/Phi-4-mini-instruct." Hugging Face model card. https://huggingface.co/microsoft/Phi-4-mini-instruct ↩
Microsoft. "microsoft/Phi-4-multimodal-instruct." Hugging Face model card. https://huggingface.co/microsoft/Phi-4-multimodal-instruct ↩
Microsoft. "microsoft/Phi-4-mini-reasoning." Hugging Face model card. https://huggingface.co/microsoft/Phi-4-mini-reasoning ↩
Microsoft Azure. "Empowering innovation: The next generation of the Phi family." Azure Blog, February 26, 2025. https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/ ↩
Microsoft Tech Community. "Welcome to the new Phi-4 models, Microsoft Phi-4-mini and Phi-4-multimodal." February 26, 2025. https://techcommunity.microsoft.com/blog/educatordeveloperblog/welcome-to-the-new-phi-4-models---microsoft-phi-4-mini--phi-4-multimodal/4386037
Microsoft Research. "Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs." arXiv preprint 2503.01743, March 2025. https://arxiv.org/abs/2503.01743 ↩
NVIDIA Developer Blog. "Latest Multimodal Addition to Microsoft Phi SLMs Trained on NVIDIA GPUs." February 26, 2025. https://developer.nvidia.com/blog/latest-multimodal-addition-to-microsoft-phi-slms-trained-on-nvidia-gpus/ ↩
Microsoft Azure. "Phi Open Models product page." https://azure.microsoft.com/en-us/products/phi/
Microsoft Azure. "Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning." Azure Blog, July 9, 2025. https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/ ↩
Ollama. "phi4-mini model library entry." https://ollama.com/library/phi4-mini ↩
Microsoft Research. "Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math." arXiv preprint 2504.21233, April 2025. https://arxiv.org/abs/2504.21233

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

Best Local and On-Device LLMs Best Small Language Models Jamba Reasoning 3B Jet-Nemotron Microsoft Research Phi-3 Phi-4 Reasoning Phi-4-mini-flash-reasoning

What is Phi-4-mini?

Background

How is Phi-4-mini built?

How was Phi-4-mini trained?

How does Phi-4-mini-instruct differ from Phi-4-multimodal?

What is Phi-4-mini-reasoning?

What is Phi-4-mini-flash-reasoning?

How does Phi-4-mini perform on benchmarks?

Is Phi-4-mini open source, and what license does it use?

How does Phi-4-mini compare to other small models?

Reception

ELI5: Phi-4-mini explained simply

See also

References

Improve this article

Related Articles

Phi-3

Phi-4

Gemma 2

Gemma 3

Phi-4-mini-flash-reasoning

SmolLM 2

What links here

Related Articles

Phi-3

Phi-4

Gemma 2

Gemma 3

Phi-4-mini-flash-reasoning

SmolLM 2

What links here