# Phi (language model)

> Source: https://aiwiki.ai/wiki/phi
> Updated: 2026-06-21
> Categories: Large Language Models, Microsoft, Open Source AI, Small Language Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Phi** is a family of open-weight small language models (SLMs) developed by [Microsoft](/wiki/microsoft) Research, beginning with Phi-1 in June 2023 and spanning thirteen-plus releases through Phi-4-reasoning-vision-15B in March 2026. Every Phi model is released under the permissive MIT license and is built on a single thesis: that the quality of training data matters far more than its quantity, and that carefully curated [synthetic data](/wiki/synthetic_data) lets a small model match or beat models many times its size on reasoning, coding, and math. The original 2023 paper that introduced the series, titled "Textbooks Are All You Need," reports that the 1.3-billion-parameter Phi-1 was "trained for 4 days on 8 A100s, using a selection of textbook quality data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens)," and that despite this small scale it "attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP" [1].

This principle, captured in the title of that paper, has guided every subsequent release in the series. Across multiple generations, from the 1.3-billion-parameter Phi-1 to the 14-billion-parameter Phi-4 and its reasoning variants, Microsoft has consistently demonstrated that small, efficiently trained models can compete with much larger systems on reasoning, coding, and mathematical benchmarks. The Phi models are released under the MIT license, making them freely available for both commercial and research use, and they are designed to run on resource-constrained hardware including mobile phones, laptops, and edge devices [2]. As of March 2026, the Phi family spans text-only models, multimodal systems capable of processing images and speech, and dedicated reasoning models that rival systems tens of times their size on STEM benchmarks.

## Model Overview

The following table summarizes all major releases in the Phi model family.

| Model | Release Date | Parameters | Context Length | Training Tokens | Key Feature | License |
|---|---|---|---|---|---|---|
| Phi-1 | June 2023 | 1.3B | 2K | ~7B (6B web + 1B synthetic) | Code generation; "Textbooks Are All You Need" | MIT |
| Phi-1.5 | September 2023 | 1.3B | 2K | ~30B | Extended to common sense reasoning and NLU | MIT |
| Phi-2 | December 2023 | 2.7B | 2K | 1.4T | Outperformed 7B-13B models; knowledge transfer from Phi-1.5 | MIT |
| Phi-3-mini | April 2024 | 3.8B | 4K / 128K | 3.3T | Ran on mobile phones; first production-ready Phi | MIT |
| Phi-3-small | April 2024 | 7B | 8K / 128K | 4.8T | Higher capacity variant; 75% MMLU | MIT |
| Phi-3-medium | April 2024 | 14B | 4K / 128K | 4.8T | Largest Phi-3 dense model; 78% MMLU | MIT |
| Phi-3.5-mini | August 2024 | 3.8B | 128K | 3.4T | Improved multilingual support (20+ languages) | MIT |
| Phi-3.5-MoE | August 2024 | 16x3.8B (6.6B active) | 128K | 4.9T | Mixture-of-experts architecture | MIT |
| Phi-3.5-vision | August 2024 | 4.2B | 128K | 500B | Image + text input; single and multi-image support | MIT |
| Phi-4 | December 2024 | 14B | 16K | 9.8T | Surpassed GPT-4o on STEM benchmarks; heavy synthetic data use | MIT |
| Phi-4-mini | February 2025 | 3.8B | 128K | 5T | 200K vocabulary; GQA; LongRoPE; function calling | MIT |
| Phi-4-multimodal | February 2025 | 5.6B | 128K | 5T text + 2.3M hrs speech + 1.1T vision | Text + vision + speech in single model; mixture-of-LoRAs | MIT |
| Phi-4-reasoning | April 2025 | 14B | 32K | 16B (fine-tuning) | STEM reasoning; 62.9% on AIME 2025 | MIT |
| Phi-4-reasoning-plus | April 2025 | 14B | 32K | 16B (fine-tuning) | Outcome-based RL for longer reasoning traces; 78.0% on AIME 2025 | MIT |
| Phi-4-mini-reasoning | April 2025 | 3.8B | 128K | 150B (fine-tuning) | Compact reasoning model; 94.6% on MATH-500 | MIT |
| Phi-4-mini-flash-reasoning | July 2025 | 3.8B | 64K | Synthetic math data | Hybrid Mamba-attention architecture; 10x throughput | MIT |
| Phi-4-reasoning-vision | March 2026 | 15B | 16K | ~200B multimodal | Multimodal reasoning with selective thinking; SigLIP-2 vision encoder | MIT |

## Phi-1: Textbooks Are All You Need

Phi-1, released in June 2023, was the model that established the core thesis of the Phi series. It is a 1.3-billion-parameter [transformer](/wiki/transformer) model trained specifically for Python code generation [1].

### Training Data Philosophy

The key innovation behind Phi-1 was its training data curation. Rather than training on the largest available corpus of code scraped from the internet, the research team at Microsoft constructed a carefully filtered and augmented dataset with two components [1]:

1. **Filtered web data (6 billion tokens):** Selected from publicly available code repositories (The Stack and StackOverflow), using a classifier trained to identify code samples with high educational value. The classifier distinguished between code that teaches clear programming concepts and code that is noisy, repetitive, or poorly structured.
2. **Synthetic textbook data (1 billion tokens):** Generated using [GPT-3.5](/wiki/gpt-3), consisting of synthetic Python textbooks and programming exercises designed to cover specific reasoning patterns, algorithms, and coding conventions in a structured, pedagogically coherent format.

The total training dataset was roughly 7 billion tokens, a tiny fraction of the trillions of tokens used to train contemporary models. Training was completed in 4 days on 8 NVIDIA A100 GPUs [1].

### Performance

Despite its small scale, Phi-1 achieved 50.6% pass@1 accuracy on [HumanEval](/wiki/humaneval) and 55.5% on [MBPP](/wiki/humaneval), performance comparable to models 10 times larger that had been trained on 100 times more data. This result provided compelling evidence that data quality could substitute for data quantity and model scale in specific domains [1].

## Phi-1.5

Released in September 2023, Phi-1.5 extended the Phi-1 approach from code generation to natural language understanding and common sense reasoning. The model retained the same 1.3 billion parameters but was trained on a larger dataset of approximately 30 billion tokens that included both the code-focused data from Phi-1 and new synthetic "textbook-quality" data covering topics in common sense, world knowledge, and logical reasoning [3].

Phi-1.5 demonstrated that the data-quality-over-quantity principle was not limited to code. The model performed competitively with much larger models on reasoning benchmarks, establishing a pattern that would continue throughout the Phi series.

## Phi-2

Phi-2, released on December 12, 2023, scaled the approach to 2.7 billion parameters. It was trained on 1.4 trillion tokens over 14 days using 96 A100 GPUs [4].

### Training Methodology

Phi-2 built on the training insights from Phi-1 and Phi-1.5 with two key additions [4]:

1. **Scaled synthetic data:** The training mixture combined synthetic datasets specifically designed to teach common sense reasoning and general knowledge with carefully filtered web data selected for educational quality.
2. **Knowledge transfer:** Phi-2 incorporated knowledge transfer from the smaller Phi-1.5 model, which accelerated training convergence and produced measurable improvements in benchmark scores.

Microsoft Research summarized the central lesson of the model directly: "training data quality plays a critical role in model performance" [4]. Notably, Phi-2 was released as a base model without [instruction tuning](/wiki/instruction_tuning) or [RLHF](/wiki/rlhf) alignment. Its strong performance came entirely from pretraining data quality.

### Benchmark Results

With only 2.7 billion parameters, Phi-2 surpassed [Mistral](/wiki/mistral) 7B and [LLaMA 2](/wiki/llama_2) models at both 7B and 13B parameter counts on aggregated benchmarks. On multi-step reasoning tasks in coding and mathematics, it outperformed the 70-billion-parameter [LLaMA](/wiki/llama) 2 model. It also matched or exceeded Google's Gemini Nano 2 despite being a smaller model [4].

| Benchmark Category | Phi-2 (2.7B) |
|---|---|
| BigBench-Hard | 59.2 |
| Commonsense Reasoning | 68.8 |
| Language Understanding | 62.0 |
| Math | 61.1 |
| Coding | 53.7 |

These results drew significant attention from the research community and demonstrated that the "textbooks" approach scaled to general-purpose language modeling, not just code generation.

## Phi-3

The Phi-3 family, released in April 2024, represented the first time Microsoft positioned Phi models as practical production-ready systems rather than primarily research demonstrations [5].

### Phi-3-mini

Phi-3-mini is a 3.8-billion-parameter model trained on 3.3 trillion tokens. It was released in two context-length variants: a 4K token version for constrained environments and a 128K token version using LongRoPE for extended context applications. The model architecture consists of 32 layers with 3,072 hidden dimensions and 32 attention heads [5].

The training dataset was described as a scaled-up version of the Phi-2 data, composed of heavily filtered web data and synthetic data, followed by alignment training for safety and chat formatting.

Phi-3-mini achieved 68.8% on [MMLU](/wiki/mmlu) and 8.38 on [MT-Bench](/wiki/mt_bench), performance that rivaled the much larger [Mixtral](/wiki/mixtral) 8x7B (with 12.9 billion active parameters) and [GPT-3](/wiki/gpt-3).5 [5]. The technical report describes a model "whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 ... despite being small enough to be deployed on a phone" [5]. The model could be quantized to 4 bits, occupying approximately 1.8 GB of memory and running at over 12 tokens per second on an iPhone 14 with an A16 Bionic chip.

### Phi-3-small and Phi-3-medium

Microsoft also released Phi-3-small (7 billion parameters) and Phi-3-medium (14 billion parameters) to provide higher-capacity options within the same architecture family. Both were trained on 4.8 trillion tokens and reached 75% and 78% on MMLU respectively, with MT-bench scores of 8.7 and 8.9, offering progressively better performance at the cost of higher resource requirements [5].

### Technical Paper

The accompanying technical report, "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" (arXiv:2404.14219), provided detailed documentation of training procedures, benchmark evaluations, and deployment optimization techniques [5].

## Phi-3.5

Released in August 2024, the Phi-3.5 family introduced three variants that expanded the Phi lineup in new directions [6].

### Phi-3.5-mini

An updated version of Phi-3-mini with 3.8 billion parameters, trained on 3.4 trillion tokens using 512 H100 GPUs over 10 days. The model added support for over 20 languages including Arabic, Chinese, Japanese, Korean, Russian, and several European languages, significantly improving multilingual performance compared to Phi-3 [6].

### Phi-3.5-MoE

The most architecturally distinct model in the Phi-3.5 release, Phi-3.5-MoE uses a [mixture-of-experts](/wiki/mixture_of_experts) architecture with 16 expert networks, selecting the top 2 experts per token. The total parameter count is 16 x 3.8 billion, but with only 6.6 billion parameters active during inference. It was trained on 4.9 trillion tokens using 512 H100 GPUs over 23 days [6].

Despite its modest active parameter count, Phi-3.5-MoE achieved performance comparable to [Gemini](/wiki/gemini) 1.5 Flash and [GPT-4o](/wiki/gpt-4) mini on language reasoning, math, and coding tasks, while outperforming [LLaMA 3](/wiki/llama_3).1 and Mixtral models of similar scale.

### Phi-3.5-vision

A 4.2-billion-parameter multimodal model capable of processing both single-image and multi-image inputs alongside text prompts. It was derived from Phi-3.5-mini and trained on 500 billion tokens using 256 A100 GPUs over 6 days. This was the first model in the Phi family to support visual understanding [6].

## Phi-4

Phi-4, released on December 12, 2024, is a 14-billion-parameter model that represents the most ambitious application of the synthetic data training philosophy in the Phi series [7].

### Architecture and Training

Phi-4 is a dense decoder-only [transformer](/wiki/transformer) with 14 billion parameters and a default context length of 16,384 tokens. It was trained on 9.8 trillion tokens over 21 days using 1,920 NVIDIA [H100](/wiki/nvidia) GPUs. The training data included approximately 400 billion high-quality synthetic tokens spread across more than 50 distinct synthetic datasets, each generated using different seed data and multi-stage prompting procedures [7].

The training data mixture consisted of five main categories:

1. **Synthetic data:** Over 50 types of synthetic datasets covering diverse topics, skills, and interaction styles.
2. **Web rewrites:** Web content reformulated into more structured, educational formats.
3. **Filtered web data:** Split into reasoning-heavy and knowledge-heavy portions, selected by quality classifiers.
4. **Targeted acquisitions:** Academic papers, books, and forum discussions.
5. **Code data:** Programming examples and repositories.

While previous Phi models relied heavily on distillation from [GPT-4](/wiki/gpt-4) to generate their synthetic training data, the Phi-4 technical report states that the model "substantially surpasses its teacher model on STEM-focused QA capabilities," demonstrating that the training methodology had moved beyond simple distillation into genuine capability gains [7].

### Benchmark Performance

Phi-4 outperformed both GPT-4o and Meta's [LLaMA 3.3](/wiki/llama_3) 70B on the [GPQA](/wiki/gpqa) and [MATH](/wiki/math) benchmarks, a remarkable result for a 14-billion-parameter model. Performance improvements over Phi-3 exceeded 20% on some benchmarks [7].

| Benchmark | Phi-4 (14B) | Description |
|---|---|---|
| MMLU | 84.8 | Multi-task language understanding |
| GPQA | 56.1 | Graduate-level reasoning |
| MATH | 80.4 | Mathematical problem solving |
| HumanEval | 82.6 | Code generation |
| MGSM | 80.6 | Multilingual math |
| DROP | 75.5 | Complex comprehension and reasoning |

## Phi-4-mini and Phi-4-multimodal

In February 2025, Microsoft released two new models that extended the Phi-4 generation in complementary directions [8].

### Phi-4-mini

Phi-4-mini is a 3.8-billion-parameter dense decoder-only transformer trained on 5 trillion tokens using 512 A100-80G GPUs over 21 days. Compared to its predecessor Phi-3.5-mini, it introduced several architectural improvements [8]:

- A significantly expanded vocabulary of 200,064 tokens, improving multilingual support across 24 languages.
- [Grouped-query attention](/wiki/grouped_query_attention) (GQA) for more efficient inference.
- Shared input and output embeddings to reduce memory footprint.
- LongRoPE positional encoding supporting 128K token context.
- Built-in function calling and improved instruction following.

The training data combined publicly available documents filtered for quality, synthetic "textbook-like" data for math, coding, common sense reasoning, and general knowledge, plus high-quality chat format supervised data. [Post-training](/wiki/post-training) included both supervised [fine-tuning](/wiki/fine_tuning) and direct preference optimization.

| Benchmark | Phi-4-mini (3.8B) | Phi-3.5-mini (3.8B) | Llama-3.2-3B | Qwen2.5-7B | GPT-4o-mini |
|---|---|---|---|---|---|
| MMLU (5-shot) | 67.3 | 65.5 | 61.8 | 72.6 | 77.2 |
| GSM8K (8-shot, CoT) | 88.6 | 76.9 | 75.6 | 88.7 | 91.3 |
| MATH (0-shot, CoT) | 64.0 | 49.8 | 46.7 | 60.4 | 70.2 |
| BigBench Hard | 70.4 | 63.1 | 55.4 | 72.4 | 80.4 |
| Arena Hard | 32.8 | 34.4 | 17.0 | 55.5 | 53.7 |

### Phi-4-multimodal

Phi-4-multimodal is a 5.6-billion-parameter model and the first in the Phi family to support text, audio, and vision inputs within a single unified architecture. Built on the Phi-4-mini backbone, it was trained on 512 A100-80G GPUs over 28 days using a combined dataset of 5 trillion text tokens, 2.3 million hours of speech data, and 1.1 trillion vision-language tokens [8].

Rather than using separate models for each modality, Phi-4-multimodal employs a mixture-of-LoRAs approach where speech, vision, and language processing share the same core model with modality-specific low-rank adapters. This design keeps the model compact while enabling strong performance across all three modalities.

**Speech capabilities.** The model achieved the top position on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, surpassing specialized models like WhisperV3 (6.5% WER) and SeamlessM4T-v2-Large. It is among the first open models to support speech summarization at performance levels comparable to GPT-4o [8].

**Vision capabilities.** Despite having only 5.6 billion parameters, Phi-4-multimodal demonstrated strong performance on mathematical and scientific visual reasoning. Select vision benchmark results are shown below.

| Vision Benchmark | Phi-4-multimodal (5.6B) | Phi-3.5-vision (4.2B) | Qwen 2.5-VL-7B | GPT-4o |
|---|---|---|---|---|
| MMMU | 55.1 | 43.0 | 51.8 | 61.7 |
| MMBench (dev-en) | 86.7 | 81.9 | 87.8 | 89.0 |
| ScienceQA Visual | 97.5 | -- | -- | 97.3 |
| DocVQA | 93.2 | -- | -- | 95.7 |
| ChartQA | 81.4 | -- | -- | 85.0 |
| OCRBench | 84.4 | -- | -- | 87.7 |

Phi-4-multimodal also excels in cross-modal tasks, combining speech and vision inputs simultaneously. On spoken document understanding tasks (asking questions about documents via voice), it outperformed both InternOmni-7B and Gemini 2.0 Flash [8].

## Phi-4-reasoning

Released on April 30, 2025, the Phi-4-reasoning family brought dedicated [chain-of-thought](/wiki/chain_of_thought) reasoning capabilities to the Phi lineup [9].

### Training Approach

Phi-4-reasoning is a 14-billion-parameter model fine-tuned from Phi-4 on approximately 16 billion tokens (roughly 8.3 billion unique tokens) of curated reasoning data. The training focused on 1.4 million high-quality STEM and coding prompts, many of which were enhanced using [OpenAI](/wiki/openai) o3-mini to generate detailed reasoning traces. Training was completed in just 2.5 days on 32 H100-80G GPUs [9].

The model produces outputs in two sections: a reasoning chain-of-thought block where it works through the problem step by step, followed by a summarization block with the final answer.

### Variants

Three variants were released simultaneously:

- **Phi-4-reasoning:** The base reasoning model with 14 billion parameters and 32K context length.
- **Phi-4-reasoning-plus:** Enhanced through a short phase of outcome-based [reinforcement learning](/wiki/reinforcement_learning) that encourages the model to generate longer, more thorough reasoning traces. This variant uses approximately 1.5 times more tokens than the base model on average, trading latency for improved accuracy.
- **Phi-4-mini-reasoning:** A compact 3.8-billion-parameter reasoning model with 128K context, fine-tuned on 150 billion tokens of synthetic mathematical content distilled from the [DeepSeek](/wiki/deepseek) R1 model. Training took 2 days on 128 H100-80G GPUs [9].

### Benchmark Performance

The Phi-4-reasoning models achieved results that challenged assumptions about what small models could accomplish on difficult reasoning tasks. The technical report states that the models "outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B model and approach the performance levels of full DeepSeek-R1 model" [9].

| Benchmark | Phi-4-reasoning (14B) | Phi-4-reasoning-plus (14B) | Phi-4-mini-reasoning (3.8B) |
|---|---|---|---|
| AIME 2025 | 62.9 | 78.0 | -- |
| AIME 2024 | 75.3 | 81.3 | 57.5 |
| MATH-500 | -- | -- | 94.6 |
| GPQA Diamond | 65.8 | 68.9 | 52.0 |
| HumanEvalPlus | 92.9 | 92.3 | -- |
| LiveCodeBench (8/24-2/25) | 53.8 | 53.1 | -- |
| OmniMath | 76.6 | 81.9 | -- |
| MMLU-Pro | 74.3 | 76.0 | -- |
| ArenaHard | 73.3 | 79.0 | -- |

Phi-4-reasoning outperformed [OpenAI o1](/wiki/o1)-mini and [DeepSeek-R1](/wiki/deepseek_r1)-Distill-Llama-70B on most evaluated benchmarks, and Phi-4-reasoning-plus achieved performance comparable to the full DeepSeek R1 model (671 billion parameters) on [AIME 2025](/wiki/aime_2025). This result was particularly striking because Phi-4-reasoning-plus is roughly 48 times smaller than DeepSeek R1 [9].

## Phi-4-mini-flash-reasoning

Released on July 9, 2025, Phi-4-mini-flash-reasoning is a 3.8-billion-parameter model designed for scenarios where compute, memory, and latency are tightly constrained [10].

### Architecture

The model introduced a novel hybrid architecture called SambaY, which represents a departure from the pure transformer design used in all previous Phi models. SambaY combines three components:

1. A **self-decoder** layer that integrates [Mamba](/wiki/mamba) (a [state space model](/wiki/state_space_model)) with sliding window attention for efficient sequence processing.
2. A **single layer of full attention** for global context integration.
3. A **decoder-hybrid-decoder** arrangement that balances computational efficiency with reasoning quality.

This architecture achieves up to 10 times higher throughput and 2 to 3 times lower average latency compared to standard transformer-based reasoning models of similar size, making it viable for real-time applications on edge hardware [10].

### Training and Performance

The model was trained exclusively on synthetic mathematical content generated by DeepSeek R1. On Math-500, it achieved 92.45% pass@1 accuracy, outperforming the standard Phi-4-mini-reasoning (91.2%) and surpassing other open models in its size class including Qwen-1.5B and Bespoke-Stratos-7B. The model supports a 64K token context length [10].

## Phi-4-reasoning-vision

Released on March 4, 2026, Phi-4-reasoning-vision-15B is the newest addition to the Phi family and the first Phi model to combine multimodal understanding with chain-of-thought reasoning [11].

### Architecture

The model has 15 billion parameters and uses a mid-fusion architecture that combines the Phi-4-reasoning language model backbone with a SigLIP-2 vision encoder. It supports a 16,384 token context length and can process up to 3,600 visual tokens per image through a dynamic resolution vision encoder. A distinctive architectural feature is the use of bidirectional attention within image tokens, which improves spatial reasoning about visual content [11].

### Selective Thinking

Phi-4-reasoning-vision introduced a selective thinking mechanism that distinguishes it from prior reasoning models. The model can operate in two modes:

- **Think mode:** Uses extended chain-of-thought reasoning (marked with `<think>...</think>` blocks) for complex mathematical, scientific, or logical tasks.
- **No-think mode:** Defaults to direct inference (marked with `<nothink>`) for perception-focused tasks like image captioning or OCR where extended reasoning is unnecessary.

The model automatically selects the appropriate mode based on task complexity, reducing wasted computation on simple queries while maintaining strong performance on difficult problems [11].

### Training and Performance

The model was trained on approximately 200 billion tokens of multimodal data using 240 NVIDIA B200 GPUs over 4 days. This training data volume is significantly smaller than what competitors typically use, often exceeding 1 trillion tokens, roughly 5 times more data [11]. Microsoft attributes the efficiency to meticulous data curation rather than brute-force scale, with the team manually reviewing datasets at a rate of 5 to 10 minutes per sample, regenerating incorrect answers using GPT-4o, and fixing formatting errors across widely used open-source benchmarks [11].

| Benchmark | Phi-4-reasoning-vision (15B) | Description |
|---|---|---|
| ScreenSpot-V2 | 88.2 | GUI grounding |
| AI2D | 84.8 | Diagram understanding |
| ChartQA | 83.3 | Chart reasoning |
| OCRBench | 76.0 | Optical character recognition |
| MathVista | 75.2 | Visual math reasoning |
| MMMU | 54.3 | Multimodal understanding |

The model is particularly strong at computer-use agent tasks, interpreting graphical user interfaces and localizing interactive elements on screen. It also handles scientific diagram analysis, handwritten equation parsing, and document extraction [11].

## Phi Silica: On-Device Deployment

Phi Silica is a specialized variant of the Phi family optimized specifically for the neural processing units (NPUs) found in [Windows](/wiki/microsoft) Copilot+ PCs. Announced in December 2024 and made available to developers starting in January 2025, Phi Silica has 3.3 billion parameters and is integrated directly into the Windows operating system as part of the Windows AI Foundry platform [12].

### Technical Specifications

Phi Silica is designed for extreme efficiency on NPU hardware. On Copilot+ PC devices equipped with Qualcomm Snapdragon X Elite processors, the model achieves a first-token latency of approximately 650 tokens per second while consuming only about 1.5 watts of power. Context processing on the NPU consumes 4.8 milliwatt-hours of energy, representing a 56% improvement in power consumption compared to running the same model on the CPU [12].

The model is delivered as an OS-managed component that can be preloaded in memory, enabling near-instant response times. It powers several built-in Windows features, including the "Click to Do" functionality, and is available as a developer API through the Windows App SDK.

### Platform Expansion

Throughout 2025, Microsoft expanded Phi Silica support beyond Qualcomm-based devices to include Intel and AMD silicon, delivering updates through Windows component packages across the 24H2, 25H2, and 26H1 branches. In May 2025, the Phi-4-reasoning and Phi-4-mini-reasoning models, optimized using [ONNX](/wiki/onnx) Runtime, became available on Snapdragon-powered Copilot+ PCs [12].

## Key Insight: Data Quality Over Quantity

The defining contribution of the Phi series to the broader AI field is the empirical demonstration that training data quality can substitute for model scale to a far greater degree than was previously believed. As Microsoft Research put it when introducing Phi-2, "training data quality plays a critical role in model performance" [4]. Several specific principles have emerged from the Phi research program [1][4][7]:

1. **Synthetic data as curriculum:** Rather than scraping the internet indiscriminately, the Phi team generates synthetic training examples that are structured like textbooks: organized around specific concepts, building from simple to complex reasoning, and free of the noise and redundancy found in web-scraped data.
2. **Quality filtering of web data:** When web data is used, it is aggressively filtered using classifiers that evaluate educational value rather than just content relevance.
3. **Knowledge transfer across model generations:** Each Phi model builds on insights and, in some cases, direct knowledge transfer from previous generations, creating a compounding effect.
4. **Iterative synthetic data improvement:** In later models like Phi-4, the synthetic data generation process itself became iterative, with AI systems generating answers, evaluating them, and improving them through multiple rounds.
5. **[Reasoning](/wiki/reasoning) distillation:** Starting with Phi-4-reasoning, the series has demonstrated that reasoning capabilities can be effectively distilled from larger models (such as o3-mini and DeepSeek R1) into much smaller architectures through carefully curated chain-of-thought fine-tuning data.

These principles have influenced training methodology beyond Microsoft. The success of the Phi series contributed to broader industry interest in synthetic data for training, with companies like [Google](/wiki/google) and [Meta](/wiki/meta) subsequently investing more heavily in synthetic data pipelines for their own models.

## Integration with the Microsoft Ecosystem

Microsoft has integrated Phi models across its product and developer ecosystem in several ways:

- **Azure AI Foundry:** All Phi models are available for deployment through the Azure AI Foundry Model Catalog, supporting both serverless API endpoints and managed compute deployments [2].
- **ONNX Runtime:** Phi models have been optimized for [ONNX](/wiki/onnx) Runtime with support for Windows DirectML, enabling cross-platform deployment across GPU, CPU, and mobile hardware. Phi-3-mini was among the first models optimized for this runtime [2].
- **Windows Copilot Runtime:** Phi Silica is built into the Windows Copilot Runtime, which includes over 40 machine learning models. The Windows AI Foundry platform provides developer APIs for integrating Phi-based capabilities into Windows applications [12].
- **NVIDIA:** Phi models are available through the NVIDIA API Catalog and NVIDIA NIM for optimized inference deployment [2].
- **[Ollama](/wiki/ollama) and Hugging Face:** For local development and experimentation, all Phi models are distributed through [Hugging Face](/wiki/hugging_face) and the Ollama framework [2].
- **GitHub Models:** Phi models are also accessible through GitHub Models for quick prototyping and evaluation [8].

## How does Phi compare to other small language models?

The Phi series competes in an increasingly crowded small language model space. As of early 2026, several major players offer competitive alternatives.

### Google Gemma

[Google](/wiki/google)'s [Gemma](/wiki/gemma) family, released in its third generation (Gemma 3) in March 2025, offers models ranging from 270 million to 27 billion parameters. The Gemma 3 4B model supports multimodal input (images and text) with a 128K context window and scores 71.3% on HumanEval and 89.2% on [GSM8K](/wiki/gsm8k). Google also released Gemma 3n, a purpose-built mobile variant with a 3 GB memory footprint that was the first sub-10B model to surpass 1,300 Elo on LMArena. Gemma uses the Gemma [Terms](/wiki/terms) of Use license rather than MIT [13].

### Alibaba Qwen

Alibaba's [Qwen](/wiki/qwen) series released Qwen3 in April 2025 with models ranging from 600 million to 235 billion parameters, all under the Apache 2.0 license. The Qwen3-4B model rivals the performance of Qwen2.5-72B-Instruct (a model 18 times larger), while the Qwen3-30B-A3B MoE model (3 billion active parameters) outperforms QwQ-32B. In early 2026, Alibaba released the Qwen 3.5 small model series, with Qwen3.5-9B scoring 82.5 on [MMLU-Pro](/wiki/mmlu-pro) and 81.7 on [GPQA Diamond](/wiki/gpqa_diamond). Qwen models were trained on approximately 36 trillion tokens and support 119 languages [14].

### Meta LLaMA

[Meta](/wiki/meta)'s [LLaMA 3.2](/wiki/llama_3) (September 2024) introduced lightweight 1B and 3B text models alongside vision-capable 11B and 90B variants. The LLaMA 3.2 3B model supports 128K context and outperformed Phi-3.5-mini on instruction following and summarization tasks at the time of its release. However, LLaMA models use the more restrictive Llama Community License rather than MIT [15].

### Comparative Summary

| Model Family | Developer | Smallest Size | Largest Small Model | License | Multimodal |
|---|---|---|---|---|---|
| Phi-4 | [Microsoft](/wiki/microsoft) | 3.3B (Silica) | 15B (reasoning-vision) | MIT | Text, vision, speech |
| Gemma 3 | [Google](/wiki/google) | 270M | 27B | Gemma Terms of Use | Text, vision |
| Qwen3 | [Alibaba](/wiki/alibaba_cloud) | 600M | 32B (dense) | Apache 2.0 | Text, vision |
| LLaMA 3.2 | [Meta](/wiki/meta) | 1B | 3B (text-only) | Llama Community License | Text, vision (11B+) |

Phi's key differentiators remain its MIT license (the most permissive among major model families), its strong per-parameter efficiency rooted in the textbook training approach, and its uniquely deep integration with the Windows and Azure ecosystems.

## Is Phi open source?

All Phi models are released under the MIT license, one of the most permissive [open-source](/wiki/open_source_ai) licenses available. This means the models can be freely used, modified, and distributed for both commercial and non-commercial purposes with minimal restrictions. The MIT license distinguishes the Phi series from competitors like Meta's LLaMA models (released under the more restrictive Llama Community License) and makes Phi models particularly attractive for enterprises and startups that need full legal flexibility [2].

Phi models are available on Hugging Face, Azure AI Foundry, the NVIDIA API Catalog, GitHub Models, and through the Ollama framework for local deployment.

## Current State (March 2026)

As of March 2026, the Phi model family has established Microsoft as a leading developer of small language models. Over the span of less than three years, the series has grown from a single 1.3-billion-parameter code generation model to a comprehensive family spanning text, vision, speech, and reasoning. The most recent release, Phi-4-reasoning-vision-15B, demonstrates that multimodal reasoning can be achieved at the 15-billion-parameter scale using selective thinking to balance performance and efficiency [11].

The Phi-4-reasoning models have been particularly influential. Matching or exceeding the performance of OpenAI o1-mini with a 14-billion-parameter model, and approaching the full DeepSeek R1 (671 billion parameters) on mathematical reasoning benchmarks, challenged widespread assumptions about the relationship between model size and reasoning capability [9].

Microsoft continues to push the boundaries of efficient architecture design, as demonstrated by the SambaY hybrid Mamba-attention architecture in Phi-4-mini-flash-reasoning, which achieves a 10-fold throughput improvement over standard transformers [10]. The integration of Phi models into Windows through Phi Silica, combined with ongoing NPU optimization for Intel, AMD, and Qualcomm hardware, positions the Phi family for growing adoption in on-device AI applications [12].

Looking ahead, the competition among small language models continues to intensify, with Google, Alibaba, and Meta all releasing increasingly capable small models. The Phi series' core thesis, that training data quality and curation can compensate for model scale, has been validated across five generations of releases and adopted as a guiding principle by the broader research community.

## See Also

- [Large language model](/wiki/large_language_model)
- [Synthetic data](/wiki/synthetic_data)
- [Mixture of experts](/wiki/mixture_of_experts)
- [Transformer](/wiki/transformer)
- [Chain-of-thought](/wiki/chain_of_thought)
- [ONNX](/wiki/onnx)
- [Gemma](/wiki/gemma)
- [Qwen](/wiki/qwen)

## References

1. [Textbooks Are All You Need](https://arxiv.org/abs/2306.11644) - Gunasekar et al., arXiv:2306.11644, June 2023
2. [Phi Open Models - Small Language Models](https://azure.microsoft.com/en-us/products/phi/) - Microsoft Azure
3. [Textbooks Are All You Need II: phi-1.5 technical report](https://www.microsoft.com/en-us/research/publication/textbooks-are-all-you-need-ii-phi-1-5-technical-report/) - Microsoft Research, September 2023
4. [Phi-2: The surprising power of small language models](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/) - Microsoft Research, December 2023
5. [Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone](https://arxiv.org/abs/2404.14219) - arXiv:2404.14219, April 2024
6. [Discover the new multi-lingual high-quality Phi-3.5 SLMs](https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/discover-the-new-multi-lingual-high-quality-phi-3-5-slms/4225280) - Microsoft Tech Community, August 2024
7. [Phi-4 Technical Report](https://arxiv.org/abs/2412.08905) - arXiv:2412.08905, December 2024
8. [Empowering innovation: The next generation of the Phi family](https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/) - Microsoft Azure Blog, February 2025
9. [Phi-4-reasoning Technical Report](https://arxiv.org/abs/2504.21318) - arXiv:2504.21318, April 2025
10. [Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning](https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/) - Microsoft Azure Blog, July 2025
11. [Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model](https://www.microsoft.com/en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/) - Microsoft Research, March 2026
12. [Phi Silica, small but mighty on-device SLM](https://blogs.windows.com/windowsexperience/2024/12/06/phi-silica-small-but-mighty-on-device-slm/) - Windows Experience Blog, December 2024
13. [Gemma 3 - Google DeepMind](https://deepmind.google/models/gemma/gemma-3/) - Google, March 2025
14. [Qwen3 Technical Report](https://arxiv.org/abs/2505.09388) - arXiv:2505.09388, May 2025
15. [Llama 3.2: Revolutionizing edge AI and vision with open, customizable models](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) - Meta AI, September 2024