Phi is a series of small language models (SLMs) developed by Microsoft Research, beginning with the release of Phi-1 in June 2023. The Phi series is built on a foundational insight that has reshaped how the AI research community thinks about training data: that the quality of training data matters far more than its quantity, and that carefully curated synthetic data can enable small models to match or exceed the performance of models many times their size. This principle, captured in the title of the original research paper "Textbooks Are All You Need," has guided every subsequent release in the series [1].
Across multiple generations, from the 1.3-billion-parameter Phi-1 to the 14-billion-parameter Phi-4 and its reasoning variants, Microsoft has consistently demonstrated that small, efficiently trained models can compete with much larger systems on reasoning, coding, and mathematical benchmarks. The Phi models are released under the MIT license, making them freely available for both commercial and research use, and they are designed to run on resource-constrained hardware including mobile phones, laptops, and edge devices [2]. As of March 2026, the Phi family spans text-only models, multimodal systems capable of processing images and speech, and dedicated reasoning models that rival systems dozens of times larger on STEM benchmarks.
The following table summarizes all major releases in the Phi model family.
| Model | Release Date | Parameters | Context Length | Training Tokens | Key Feature | License |
|---|---|---|---|---|---|---|
| Phi-1 | June 2023 | 1.3B | 2K | ~7B (6B web + 1B synthetic) | Code generation; "Textbooks Are All You Need" | MIT |
| Phi-1.5 | September 2023 | 1.3B | 2K | ~30B | Extended to common sense reasoning and NLU | MIT |
| Phi-2 | December 2023 | 2.7B | 2K | 1.4T | Outperformed 7B-13B models; knowledge transfer from Phi-1.5 | MIT |
| Phi-3-mini | April 2024 | 3.8B | 4K / 128K | 3.3T | Ran on mobile phones; first production-ready Phi | MIT |
| Phi-3-small | April 2024 | 7B | 8K / 128K | -- | Higher capacity variant of Phi-3 | MIT |
| Phi-3-medium | April 2024 | 14B | 4K / 128K | -- | Largest Phi-3 dense model | MIT |
| Phi-3.5-mini | August 2024 | 3.8B | 128K | 3.4T | Improved multilingual support (20+ languages) | MIT |
| Phi-3.5-MoE | August 2024 | 16x3.8B (6.6B active) | 128K | 4.9T | Mixture-of-experts architecture | MIT |
| Phi-3.5-vision | August 2024 | 4.2B | 128K | 500B | Image + text input; single and multi-image support | MIT |
| Phi-4 | December 2024 | 14B | 16K | 9.8T | Surpassed GPT-4o on STEM benchmarks; heavy synthetic data use | MIT |
| Phi-4-mini | February 2025 | 3.8B | 128K | 5T | 200K vocabulary; GQA; LongRoPE; function calling | MIT |
| Phi-4-multimodal | February 2025 | 5.6B | 128K | 5T text + 2.3M hrs speech + 1.1T vision | Text + vision + speech in single model; mixture-of-LoRAs | MIT |
| Phi-4-reasoning | April 2025 | 14B | 32K | 16B (fine-tuning) | STEM reasoning; 62.9% on AIME 2025 | MIT |
| Phi-4-reasoning-plus | April 2025 | 14B | 32K | 16B (fine-tuning) | Outcome-based RL for longer reasoning traces; 78.0% on AIME 2025 | MIT |
| Phi-4-mini-reasoning | April 2025 | 3.8B | 128K | 150B (fine-tuning) | Compact reasoning model; 94.6% on MATH-500 | MIT |
| Phi-4-mini-flash-reasoning | July 2025 | 3.8B | 64K | Synthetic math data | Hybrid Mamba-attention architecture; 10x throughput | MIT |
| Phi-4-reasoning-vision | March 2026 | 15B | 16K | ~200B multimodal | Multimodal reasoning with selective thinking; SigLIP-2 vision encoder | MIT |
Phi-1, released in June 2023, was the model that established the core thesis of the Phi series. It is a 1.3-billion-parameter transformer model trained specifically for Python code generation [1].
The key innovation behind Phi-1 was its training data curation. Rather than training on the largest available corpus of code scraped from the internet, the research team at Microsoft constructed a carefully filtered and augmented dataset with two components [1]:

- A filtered code corpus of roughly 6 billion tokens, selected from web sources such as The Stack and Stack Overflow using a quality classifier that favored clear, self-contained, educational code.
- A synthetic "textbook" corpus of roughly 1 billion tokens of Python-focused explanatory text and exercises, generated with GPT-3.5.
The total training dataset was roughly 7 billion tokens, a tiny fraction of the trillions of tokens used to train contemporary models. Training was completed in 4 days on 8 NVIDIA A100 GPUs [1].
Despite its small scale, Phi-1 achieved 50.6% pass@1 accuracy on HumanEval and 55.5% on MBPP, performance comparable to models 10 times larger that had been trained on 100 times more data. This result provided compelling evidence that data quality could substitute for data quantity and model scale in specific domains [1].
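Pass@1 is the standard functional-correctness metric used by HumanEval and MBPP: the probability that at least one of k sampled completions passes the problem's unit tests. The unbiased estimator from the paper that introduced HumanEval can be sketched in a few lines:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.
    n = samples generated per problem, c = samples that pass the tests.
    Returns the probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the fraction of
# problems solved; e.g. 83 of 164 HumanEval tasks -> ~50.6%.
print(pass_at_k(1, 1, 1))  # 1.0
print(83 / 164)            # ~0.506
```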
Released in September 2023, Phi-1.5 extended the Phi-1 approach from code generation to natural language understanding and common sense reasoning. The model retained the same 1.3 billion parameters but was trained on a larger dataset of approximately 30 billion tokens that included both the code-focused data from Phi-1 and new synthetic "textbook-quality" data covering topics in common sense, world knowledge, and logical reasoning [3].
Phi-1.5 demonstrated that the data-quality-over-quantity principle was not limited to code. The model performed competitively with much larger models on reasoning benchmarks, establishing a pattern that would continue throughout the Phi series.
Phi-2, released on December 12, 2023, scaled the approach to 2.7 billion parameters. It was trained on 1.4 trillion tokens over 14 days using 96 A100 GPUs [4].
Phi-2 built on the training insights from Phi-1 and Phi-1.5 with two key additions [4]:

- An expanded mixture of synthetic "textbook-quality" data designed to teach common sense reasoning and general knowledge, combined with web data carefully filtered for educational value.
- Scaled knowledge transfer: training started from the knowledge embedded in the 1.3-billion-parameter Phi-1.5, which accelerated convergence and boosted benchmark scores.
Notably, Phi-2 was released as a base model without instruction tuning or RLHF alignment. Its strong performance came entirely from pretraining data quality.
With only 2.7 billion parameters, Phi-2 surpassed Mistral 7B and LLaMA 2 models at both 7B and 13B parameter counts on aggregated benchmarks. On multi-step reasoning tasks in coding and mathematics, it outperformed the 70-billion-parameter LLaMA 2 model. It also matched or exceeded Google's Gemini Nano 2 despite being a smaller model [4].
| Benchmark Category | Phi-2 (2.7B) |
|---|---|
| BigBench-Hard | 59.2 |
| Commonsense Reasoning | 68.8 |
| Language Understanding | 62.0 |
| Math | 61.1 |
| Coding | 53.7 |
These results drew significant attention from the research community and demonstrated that the "textbooks" approach scaled to general-purpose language modeling, not just code generation.
The Phi-3 family, released in April 2024, represented the first time Microsoft positioned Phi models as practical production-ready systems rather than primarily research demonstrations [5].
Phi-3-mini is a 3.8-billion-parameter model trained on 3.3 trillion tokens. It was released in two context-length variants: a 4K token version for constrained environments and a 128K token version using LongRoPE for extended context applications. The model architecture consists of 32 layers with 3,072 hidden dimensions and 32 attention heads [5].
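LongRoPE's published method searches for non-uniform, per-dimension rescale factors, but the underlying idea of stretching rotary position embeddings so a short-context model can address a longer window can be illustrated with plain uniform interpolation. This is a simplification of the idea, not the published algorithm:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary position embedding angles. A `scale` > 1 compresses
    positions so that a model trained on short contexts maps longer
    positions back into its trained range (uniform interpolation;
    LongRoPE searches for non-uniform per-dimension factors)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions / scale, inv_freq)

# Position 131072 with scale 32 lands on the same rotary angles as
# position 4096 unscaled, so the trained frequency range is reused.
short = rope_angles(np.array([4096]), dim=64)
long_ = rope_angles(np.array([131072]), dim=64, scale=32.0)
assert np.allclose(short, long_)
```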
The training dataset was described as a scaled-up version of the Phi-2 data, composed of heavily filtered web data and synthetic data, followed by alignment training for safety and chat formatting.
Phi-3-mini achieved 68.8% on MMLU and 8.38 on MT-Bench, performance that rivaled the much larger Mixtral 8x7B (with 12.9 billion active parameters) and GPT-3.5 [5]. The model could be quantized to 4 bits, occupying approximately 1.8 GB of memory and running at over 12 tokens per second on an iPhone 14 with an A16 Bionic chip.
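The quoted on-device footprint follows directly from the arithmetic of 4-bit quantization. A quick back-of-the-envelope sketch, ignoring the KV cache and quantization metadata:

```python
def quantized_size_gb(params_billions: float, bits: int) -> float:
    """Approximate weight-memory footprint of a quantized model.
    Ignores KV cache, activations, and quantization metadata."""
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 2**30  # GiB

# 3.8B parameters at 4 bits per weight is roughly 1.8 GB, matching the
# reported footprint of Phi-3-mini running on a phone.
print(round(quantized_size_gb(3.8, 4), 2))
```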
Microsoft also released Phi-3-small (7 billion parameters) and Phi-3-medium (14 billion parameters) to provide higher-capacity options within the same architecture family. These models offered progressively better performance at the cost of higher resource requirements.
The accompanying technical report, "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" (arXiv:2404.14219), provided detailed documentation of training procedures, benchmark evaluations, and deployment optimization techniques [5].
Released in August 2024, the Phi-3.5 family introduced three variants that expanded the Phi lineup in new directions [6].
An updated version of Phi-3-mini with 3.8 billion parameters, trained on 3.4 trillion tokens using 512 H100 GPUs over 10 days. The model added support for over 20 languages including Arabic, Chinese, Japanese, Korean, Russian, and several European languages, significantly improving multilingual performance compared to Phi-3 [6].
The most architecturally distinct model in the Phi-3.5 release, Phi-3.5-MoE uses a mixture-of-experts architecture with 16 expert networks, selecting the top 2 experts per token. The total parameter count is 16 x 3.8 billion, but with only 6.6 billion parameters active during inference. It was trained on 4.9 trillion tokens using 512 H100 GPUs over 23 days [6].
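A top-2 mixture-of-experts layer of the kind described above can be sketched as follows. The function and variable names are illustrative, not Phi-3.5-MoE's actual implementation: a linear gate scores all experts, only the two highest-scoring experts run, and their outputs are mixed by renormalized gate weights, which is why only a fraction of the total parameters is active per token.

```python
import numpy as np

def top2_moe(x, gate_w, experts):
    """Minimal top-2 MoE layer: score all experts, run only the best
    two, and mix their outputs by softmax-renormalized gate weights."""
    logits = x @ gate_w                 # one score per expert
    top2 = np.argsort(logits)[-2:]      # indices of the best 2 experts
    w = np.exp(logits[top2])
    w /= w.sum()                        # softmax over the chosen 2
    return sum(wi * experts[i](x) for wi, i in zip(w, top2))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
gate = rng.standard_normal((8, 16))     # route among 16 experts
experts = [(lambda W: (lambda v: v @ W))(rng.standard_normal((8, 8)))
           for _ in range(16)]          # 16 toy linear "experts"
y = top2_moe(x, gate, experts)
print(y.shape)  # (8,)
```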
Despite its modest active parameter count, Phi-3.5-MoE achieved performance comparable to Gemini 1.5 Flash and GPT-4o mini on language reasoning, math, and coding tasks, while outperforming LLaMA 3.1 and Mixtral models of similar scale.
A 4.2-billion-parameter multimodal model capable of processing both single-image and multi-image inputs alongside text prompts. It was derived from Phi-3.5-mini and trained on 500 billion tokens using 256 A100 GPUs over 6 days. This was the first model in the Phi family to support visual understanding [6].
Phi-4, released on December 12, 2024, is a 14-billion-parameter model that represents the most ambitious application of the synthetic data training philosophy in the Phi series [7].
Phi-4 is a dense decoder-only transformer with 14 billion parameters and a default context length of 16,384 tokens. It was trained on 9.8 trillion tokens over 21 days using 1,920 NVIDIA H100 GPUs. The training data included approximately 400 billion high-quality synthetic tokens spread across more than 50 distinct synthetic datasets, each generated using different seed data and multi-stage prompting procedures [7].
The training data mixture consisted of five main categories:

- Synthetic data, generated with techniques such as multi-agent prompting, self-revision, and instruction reversal.
- Filtered public web data, selected for reasoning-dense, educational content.
- Web rewrites, in which web content was transformed into higher-quality synthetic variants.
- Code data, combining raw repositories with synthetic code exercises.
- Targeted acquired sources, such as academic books and question-answer datasets.
While previous Phi models relied heavily on distillation from GPT-4 to generate their synthetic training data, Phi-4 substantially surpassed GPT-4 on STEM-focused question-answering tasks, demonstrating that the training methodology had moved beyond simple distillation into genuine capability gains.
Phi-4 outperformed both GPT-4o and Meta's LLaMA 3.3 70B on the GPQA and MATH benchmarks, a remarkable result for a 14-billion-parameter model. Performance improvements over Phi-3 exceeded 20% on some benchmarks [7].
| Benchmark | Phi-4 (14B) | Description |
|---|---|---|
| MMLU | 84.8 | Multi-task language understanding |
| GPQA | 56.1 | Graduate-level reasoning |
| MATH | 80.4 | Mathematical problem solving |
| HumanEval | 82.6 | Code generation |
| MGSM | 80.6 | Multilingual math |
| DROP | 75.5 | Complex comprehension and reasoning |
In February 2025, Microsoft released two new models that extended the Phi-4 generation in complementary directions [8].
Phi-4-mini is a 3.8-billion-parameter dense decoder-only transformer trained on 5 trillion tokens using 512 A100-80G GPUs over 21 days. Compared to its predecessor Phi-3.5-mini, it introduced several architectural improvements [8]:

- An expanded vocabulary of 200,000 tokens for better multilingual coverage.
- Grouped-query attention (GQA), which shrinks the key-value cache and speeds up long-context inference.
- LongRoPE positional scaling to support the 128K-token context window.
- Built-in function-calling support for tool use and agentic workflows.
The training data combined publicly available documents filtered for quality, synthetic "textbook-like" data for math, coding, common sense reasoning, and general knowledge, plus high-quality chat format supervised data. Post-training included both supervised fine-tuning and direct preference optimization.
| Benchmark | Phi-4-mini (3.8B) | Phi-3.5-mini (3.8B) | Llama-3.2-3B | Qwen2.5-7B | GPT-4o-mini |
|---|---|---|---|---|---|
| MMLU (5-shot) | 67.3 | 65.5 | 61.8 | 72.6 | 77.2 |
| GSM8K (8-shot, CoT) | 88.6 | 76.9 | 75.6 | 88.7 | 91.3 |
| MATH (0-shot, CoT) | 64.0 | 49.8 | 46.7 | 60.4 | 70.2 |
| BigBench Hard | 70.4 | 63.1 | 55.4 | 72.4 | 80.4 |
| Arena Hard | 32.8 | 34.4 | 17.0 | 55.5 | 53.7 |
Phi-4-multimodal is a 5.6-billion-parameter model and the first in the Phi family to support text, audio, and vision inputs within a single unified architecture. Built on the Phi-4-mini backbone, it was trained on 512 A100-80G GPUs over 28 days using a combined dataset of 5 trillion text tokens, 2.3 million hours of speech data, and 1.1 trillion vision-language tokens [8].
Rather than using separate models for each modality, Phi-4-multimodal employs a mixture-of-LoRAs approach where speech, vision, and language processing share the same core model with modality-specific low-rank adapters. This design keeps the model compact while enabling strong performance across all three modalities.
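The mixture-of-LoRAs idea (one frozen backbone weight plus a per-modality low-rank update) can be sketched as follows; the names and shapes are illustrative rather than taken from the Phi-4-multimodal implementation:

```python
import numpy as np

def lora_forward(x, W, adapters, modality):
    """One frozen base weight W is shared by all modalities; each
    modality contributes only a low-rank update A @ B. Selecting the
    adapter switches behavior without duplicating the backbone."""
    A, B = adapters[modality]
    return x @ (W + A @ B)   # base path plus modality-specific delta

rng = np.random.default_rng(0)
d, r = 16, 2                 # hidden size 16, LoRA rank 2
W = rng.standard_normal((d, d))            # frozen backbone weight
adapters = {m: (rng.standard_normal((d, r)), rng.standard_normal((r, d)))
            for m in ("speech", "vision")}  # tiny per-modality adapters
x = rng.standard_normal(d)
print(lora_forward(x, W, adapters, "vision").shape)  # (16,)
```

The memory argument is visible in the shapes: each adapter adds only 2·d·r parameters versus d·d for a full extra weight matrix.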
Speech capabilities. The model achieved the top position on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, surpassing specialized models like WhisperV3 (6.5% WER) and SeamlessM4T-v2-Large. It is among the first open models to support speech summarization at performance levels comparable to GPT-4o [8].
Vision capabilities. Despite having only 5.6 billion parameters, Phi-4-multimodal demonstrated strong performance on mathematical and scientific visual reasoning. Select vision benchmark results are shown below.
| Vision Benchmark | Phi-4-multimodal (5.6B) | Phi-3.5-vision (4.2B) | Qwen 2.5-VL-7B | GPT-4o |
|---|---|---|---|---|
| MMMU | 55.1 | 43.0 | 51.8 | 61.7 |
| MMBench (dev-en) | 86.7 | 81.9 | 87.8 | 89.0 |
| ScienceQA Visual | 97.5 | -- | -- | 97.3 |
| DocVQA | 93.2 | -- | -- | 95.7 |
| ChartQA | 81.4 | -- | -- | 85.0 |
| OCRBench | 84.4 | -- | -- | 87.7 |
Phi-4-multimodal also excels in cross-modal tasks, combining speech and vision inputs simultaneously. On spoken document understanding tasks (asking questions about documents via voice), it outperformed both InternOmni-7B and Gemini 2.0 Flash [8].
Released on April 30, 2025, the Phi-4-reasoning family brought dedicated chain-of-thought reasoning capabilities to the Phi lineup [9].
Phi-4-reasoning is a 14-billion-parameter model fine-tuned from Phi-4 on approximately 16 billion tokens (roughly 8.3 billion unique tokens) of curated reasoning data. The training focused on 1.4 million high-quality STEM and coding prompts, many of which were enhanced using OpenAI o3-mini to generate detailed reasoning traces. Training was completed in just 2.5 days on 32 H100-80G GPUs [9].
The model produces outputs in two sections: a reasoning chain-of-thought block where it works through the problem step by step, followed by a summarization block with the final answer.
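Downstream code typically separates the two sections before showing only the final answer to users. A minimal parser, assuming the common `<think>...</think>` delimiter convention (check the model card for the exact tags a given checkpoint emits):

```python
import re

def split_reasoning(output: str):
    """Split a reasoning-model response into its chain-of-thought block
    and the final answer, assuming <think>...</think> delimiters."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", output, re.DOTALL)
    if m is None:
        return None, output.strip()   # no reasoning block found
    return m.group(1).strip(), m.group(2).strip()

chain, answer = split_reasoning(
    "<think>2+2: add the units digits.</think>The answer is 4.")
print(answer)  # The answer is 4.
```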
Three variants were released simultaneously:

- Phi-4-reasoning: the base 14-billion-parameter reasoning model, produced by supervised fine-tuning of Phi-4 on curated chain-of-thought data.
- Phi-4-reasoning-plus: the same 14B model further trained with outcome-based reinforcement learning, trading longer reasoning traces for higher accuracy.
- Phi-4-mini-reasoning: a 3.8-billion-parameter model fine-tuned on roughly 150 billion tokens of synthetic math data, aimed at resource-constrained deployments.
The Phi-4-reasoning models achieved results that challenged assumptions about what small models could accomplish on difficult reasoning tasks.
| Benchmark | Phi-4-reasoning (14B) | Phi-4-reasoning-plus (14B) | Phi-4-mini-reasoning (3.8B) |
|---|---|---|---|
| AIME 2025 | 62.9 | 78.0 | -- |
| AIME 2024 | 75.3 | 81.3 | 57.5 |
| MATH-500 | -- | -- | 94.6 |
| GPQA Diamond | 65.8 | 68.9 | 52.0 |
| HumanEvalPlus | 92.9 | 92.3 | -- |
| LiveCodeBench (8/24-2/25) | 53.8 | 53.1 | -- |
| OmniMath | 76.6 | 81.9 | -- |
| MMLU-Pro | 74.3 | 76.0 | -- |
| ArenaHard | 73.3 | 79.0 | -- |
Phi-4-reasoning outperformed OpenAI o1-mini and DeepSeek-R1-Distill-Llama-70B on most evaluated benchmarks, and Phi-4-reasoning-plus achieved performance comparable to the full DeepSeek R1 model (671 billion parameters) on AIME 2025. This result was particularly striking because Phi-4-reasoning-plus is roughly 48 times smaller than DeepSeek R1 [9].
Released on July 9, 2025, Phi-4-mini-flash-reasoning is a 3.8-billion-parameter model designed for scenarios where compute, memory, and latency are tightly constrained [10].
The model introduced a novel hybrid architecture called SambaY, which represents a departure from the pure transformer design used in all previous Phi models. SambaY combines three components:

- Mamba state-space layers, which process sequences in linear rather than quadratic time.
- Sliding-window attention layers, which capture local context at low cost.
- Gated Memory Units (GMUs), which share representations across layers and cut the cost of decoding long reasoning traces.
This architecture achieves up to 10 times higher throughput and 2 to 3 times lower average latency compared to standard transformer-based reasoning models of similar size, making it viable for real-time applications on edge hardware [10].
The model was trained exclusively on synthetic mathematical content generated by DeepSeek R1. On MATH-500, it achieved 92.45% pass@1 accuracy, outperforming the standard Phi-4-mini-reasoning (91.2%) and surpassing other open models in its size class including Qwen-1.5B and Bespoke-Stratos-7B. The model supports a 64K token context length [10].
Released on March 4, 2026, Phi-4-reasoning-vision-15B is the newest addition to the Phi family and the first Phi model to combine multimodal understanding with chain-of-thought reasoning [11].
The model has 15 billion parameters and uses a mid-fusion architecture that combines the Phi-4-reasoning language model backbone with a SigLIP-2 vision encoder. It supports a 16,384 token context length and can process up to 3,600 visual tokens per image through a dynamic resolution vision encoder. A distinctive architectural feature is the use of bidirectional attention within image tokens, which improves spatial reasoning about visual content [11].
Phi-4-reasoning-vision introduced a selective thinking mechanism that distinguishes it from prior reasoning models. The model can operate in two modes:

- A thinking mode that emits explicit chain-of-thought (`<think>...</think>` blocks) for complex mathematical, scientific, or logical tasks.
- A non-thinking mode (signaled by `<nothink>`) for perception-focused tasks like image captioning or OCR where extended reasoning is unnecessary.

The model automatically selects the appropriate mode based on task complexity, reducing wasted computation on simple queries while maintaining strong performance on difficult problems [11].
The model was trained on approximately 200 billion tokens of multimodal data using 240 NVIDIA B200 GPUs over 4 days. This training data volume is significantly smaller than what competitors typically use (often exceeding 1 trillion tokens) [11].
| Benchmark | Phi-4-reasoning-vision (15B) | Description |
|---|---|---|
| ScreenSpot-V2 | 88.2 | GUI grounding |
| AI2D | 84.8 | Diagram understanding |
| ChartQA | 83.3 | Chart reasoning |
| OCRBench | 76.0 | Optical character recognition |
| MathVista | 75.2 | Visual math reasoning |
| MMMU | 54.3 | Multimodal understanding |
The model is particularly strong at computer-use agent tasks, interpreting graphical user interfaces and localizing interactive elements on screen. It also handles scientific diagram analysis, handwritten equation parsing, and document extraction [11].
Phi Silica is a specialized variant of the Phi family optimized specifically for the neural processing units (NPUs) found in Windows Copilot+ PCs. Announced in December 2024 and made available to developers starting in January 2025, Phi Silica has 3.3 billion parameters and is integrated directly into the Windows operating system as part of the Windows AI Foundry platform [12].
Phi Silica is designed for extreme efficiency on NPU hardware. On Copilot+ PC devices equipped with Qualcomm Snapdragon X Elite processors, the model processes prompts at approximately 650 tokens per second while consuming only about 1.5 watts of power. Context processing on the NPU consumes 4.8 milliwatt-hours of energy, a 56% reduction in energy consumption compared to running the same model on the CPU [12].
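The reported figures imply a CPU baseline. Assuming the 56% figure means the NPU uses 56% less energy than the CPU for the same context pass, the implied CPU cost is easy to back out:

```python
npu_mwh = 4.8          # reported NPU energy per context pass (mWh)
improvement = 0.56     # NPU uses 56% less energy than the CPU

# If NPU = CPU * (1 - improvement), then CPU = NPU / (1 - improvement).
cpu_mwh = npu_mwh / (1 - improvement)
print(round(cpu_mwh, 1))  # ~10.9 mWh implied for the same work on CPU
```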
The model is delivered as an OS-managed component that can be preloaded in memory, enabling near-instant response times. It powers several built-in Windows features, including the "Click to Do" functionality, and is available as a developer API through the Windows App SDK.
Throughout 2025, Microsoft expanded Phi Silica support beyond Qualcomm-based devices to include Intel and AMD silicon, delivering updates through Windows component packages across the 24H2, 25H2, and 26H1 branches. In May 2025, the Phi-4-reasoning and Phi-4-mini-reasoning models, optimized using ONNX Runtime, became available on Snapdragon-powered Copilot+ PCs [12].
The defining contribution of the Phi series to the broader AI field is the empirical demonstration that training data quality can substitute for model scale to a far greater degree than was previously believed. Several specific principles have emerged from the Phi research program [1][4][7]:

- Carefully filtered, "textbook-quality" data lets a small model match models trained on orders of magnitude more tokens.
- Synthetic data works best when generated from diverse seed material using multi-stage prompting, rather than as simple distillation of a single teacher model.
- Knowledge transfer from a smaller predecessor model, as from Phi-1.5 to Phi-2, accelerates training at larger scale.
- Data quality gains compound with, rather than replace, post-training techniques such as supervised fine-tuning, preference optimization, and reinforcement learning.
These principles have influenced training methodology beyond Microsoft. The success of the Phi series contributed to broader industry interest in synthetic data for training, with companies like Google and Meta subsequently investing more heavily in synthetic data pipelines for their own models.
Microsoft has integrated Phi models across its product and developer ecosystem in several ways:

- Windows, where the NPU-optimized Phi Silica variant ships as an OS-managed component on Copilot+ PCs and is exposed to developers through the Windows App SDK.
- Azure AI Foundry, which hosts Phi models for cloud inference and fine-tuning.
- ONNX Runtime-optimized builds of the Phi-4 reasoning models for on-device use on Copilot+ hardware.
The Phi series competes in an increasingly crowded small language model space. As of early 2026, several major players offer competitive alternatives.
Google's Gemma family, released in its third generation (Gemma 3) in March 2025, offers models ranging from 270 million to 27 billion parameters. The Gemma 3 4B model supports multimodal input (images and text) with a 128K context window and scores 71.3% on HumanEval and 89.2% on GSM8K. Google also released Gemma 3n, a purpose-built mobile variant with a 3 GB memory footprint that was the first sub-10B model to surpass 1,300 Elo on LMArena. Gemma uses the Gemma Terms of Use license rather than MIT [13].
Alibaba's Qwen series released Qwen3 in April 2025 with models ranging from 600 million to 235 billion parameters, all under the Apache 2.0 license. The Qwen3-4B model rivals the performance of Qwen2.5-72B-Instruct (a model 18 times larger), while the Qwen3-30B-A3B MoE model (3 billion active parameters) outperforms QwQ-32B. In early 2026, Alibaba released the Qwen 3.5 small model series, with Qwen3.5-9B scoring 82.5 on MMLU-Pro and 81.7 on GPQA Diamond. Qwen models were trained on approximately 36 trillion tokens and support 119 languages [14].
Meta's LLaMA 3.2 (September 2024) introduced lightweight 1B and 3B text models alongside vision-capable 11B and 90B variants. The LLaMA 3.2 3B model supports 128K context and outperformed Phi-3.5-mini on instruction following and summarization tasks at the time of its release. However, LLaMA models use the more restrictive Llama Community License rather than MIT [15].
| Model Family | Developer | Smallest Size | Largest Small Model | License | Multimodal |
|---|---|---|---|---|---|
| Phi-4 | Microsoft | 3.3B (Silica) | 15B (reasoning-vision) | MIT | Text, vision, speech |
| Gemma 3 | 270M | 27B | Gemma Terms of Use | Text, vision | |
| Qwen3 | Alibaba | 600M | 32B (dense) | Apache 2.0 | Text, vision |
| LLaMA 3.2 | Meta | 1B | 3B (text-only) | Llama Community License | Text, vision (11B+) |
Phi's key differentiators remain its MIT license (the most permissive among major model families), its strong per-parameter efficiency rooted in the textbook training approach, and its uniquely deep integration with the Windows and Azure ecosystems.
All Phi models are released under the MIT license, one of the most permissive open-source licenses available. This means the models can be freely used, modified, and distributed for both commercial and non-commercial purposes with minimal restrictions. The MIT license distinguishes the Phi series from competitors like Meta's LLaMA models (released under the more restrictive Llama Community License) and makes Phi models particularly attractive for enterprises and startups that need full legal flexibility [2].
Phi models are available on Hugging Face, Azure AI Foundry, the NVIDIA API Catalog, GitHub Models, and through the Ollama framework for local deployment.
As of March 2026, the Phi model family has established Microsoft as a leading developer of small language models. Over the span of less than three years, the series has grown from a single 1.3-billion-parameter code generation model to a comprehensive family spanning text, vision, speech, and reasoning. The most recent release, Phi-4-reasoning-vision-15B, demonstrates that multimodal reasoning can be achieved at the 15-billion-parameter scale using selective thinking to balance performance and efficiency [11].
The Phi-4-reasoning models have been particularly influential. Matching or exceeding the performance of OpenAI o1-mini with a 14-billion-parameter model, and approaching the full DeepSeek R1 (671 billion parameters) on mathematical reasoning benchmarks, challenged widespread assumptions about the relationship between model size and reasoning capability [9].
Microsoft continues to push the boundaries of efficient architecture design, as demonstrated by the SambaY hybrid Mamba-attention architecture in Phi-4-mini-flash-reasoning, which achieves a 10-fold throughput improvement over standard transformers [10]. The integration of Phi models into Windows through Phi Silica, combined with ongoing NPU optimization for Intel, AMD, and Qualcomm hardware, positions the Phi family for growing adoption in on-device AI applications [12].
Looking ahead, the competition among small language models continues to intensify, with Google, Alibaba, and Meta all releasing increasingly capable small models. The Phi series' core thesis, that training data quality and curation can compensate for model scale, has been validated across five generations of releases and adopted as a guiding principle by the broader research community.