Phi is a series of small language models (SLMs) developed by Microsoft Research, beginning with the release of Phi-1 in June 2023. The Phi series is built on a foundational insight that has reshaped how the AI research community thinks about training data: that the quality of training data matters far more than its quantity, and that carefully curated synthetic data can enable small models to match or exceed the performance of models many times their size. This principle, captured in the title of the original research paper "Textbooks Are All You Need," has guided every subsequent release in the series [1].
Across multiple generations, from the 1.3-billion-parameter Phi-1 to the 14-billion-parameter Phi-4 and its reasoning variants, Microsoft has consistently demonstrated that small, efficiently trained models can compete with much larger systems on reasoning, coding, and mathematical benchmarks. The Phi models are released under the MIT license, making them freely available for both commercial and research use, and they are designed to run on resource-constrained hardware including mobile phones, laptops, and edge devices [2]. As of March 2026, the Phi family spans text-only models, multimodal systems capable of processing images and speech, and dedicated reasoning models that rival systems dozens of times larger on STEM benchmarks.
The following table summarizes all major releases in the Phi model family.
| Model | Release Date | Parameters | Context Length | Training Tokens | Key Feature | License |
|---|---|---|---|---|---|---|
| Phi-1 | June 2023 | 1.3B | 2K | ~7B (6B web + 1B synthetic) | Code generation; "Textbooks Are All You Need" | MIT |
| Phi-1.5 | September 2023 | 1.3B | 2K | ~30B | Extended to common sense reasoning and NLU | MIT |
| Phi-2 | December 2023 | 2.7B | 2K | 1.4T | Outperformed 7B-13B models; knowledge transfer from Phi-1.5 | MIT |
| Phi-3-mini | April 2024 | 3.8B | 4K / 128K | 3.3T | Ran on mobile phones; first production-ready Phi | MIT |
| Phi-3-small | April 2024 | 7B | 8K / 128K | -- | Higher capacity variant of Phi-3 | MIT |
| Phi-3-medium | April 2024 | 14B | 4K / 128K | -- | Largest Phi-3 dense model | MIT |
| Phi-3.5-mini | August 2024 | 3.8B | 128K | 3.4T | Improved multilingual support (20+ languages) | MIT |
| Phi-3.5-MoE | August 2024 | 16x3.8B (6.6B active) | 128K | 4.9T | Mixture-of-experts architecture | MIT |
| Phi-3.5-vision | August 2024 | 4.2B | 128K | 500B | Image + text input; single and multi-image support | MIT |
| Phi-4 | December 2024 | 14B | 16K | 9.8T | Surpassed GPT-4o on STEM benchmarks; heavy synthetic data use | MIT |
| Phi-4-mini | February 2025 | 3.8B | 128K | 5T | 200K vocabulary; GQA; LongRoPE; function calling | MIT |
| Phi-4-multimodal | February 2025 | 5.6B | 128K | 5T text + 2.3M hrs speech + 1.1T vision | Text + vision + speech in single model; mixture-of-LoRAs | MIT |
| Phi-4-reasoning | April 2025 | 14B | 32K | 16B (fine-tuning) | STEM reasoning; 62.9% on AIME 2025 | MIT |
| Phi-4-reasoning-plus | April 2025 | 14B | 32K | 16B (fine-tuning) | Outcome-based RL for longer reasoning traces; 78.0% on AIME 2025 | MIT |
| Phi-4-mini-reasoning | April 2025 | 3.8B | 128K | 150B (fine-tuning) | Compact reasoning model; 94.6% on MATH-500 | MIT |
| Phi-4-mini-flash-reasoning | July 2025 | 3.8B | 64K | Synthetic math data | Hybrid Mamba-attention architecture; 10x throughput | MIT |
| Phi-4-reasoning-vision | March 2026 | 15B | 16K | ~200B multimodal | Multimodal reasoning with selective thinking; SigLIP-2 vision encoder | MIT |
Phi-1, released in June 2023, was the model that established the core thesis of the Phi series. It is a 1.3-billion-parameter transformer model trained specifically for Python code generation [1].
The key innovation behind Phi-1 was its training data curation. Rather than training on the largest available corpus of code scraped from the internet, the research team at Microsoft constructed a carefully filtered and augmented dataset with two components [1]:

- A filtered code corpus of roughly 6 billion tokens, selected from web sources such as The Stack and Stack Overflow using a quality classifier that favored clear, self-contained, educational code.
- A synthetic "textbook" corpus of roughly 1 billion tokens of Python-focused explanatory text and exercises, generated with GPT-3.5.
The total training dataset was roughly 7 billion tokens, a tiny fraction of the trillions of tokens used to train contemporary models. Training was completed in 4 days on 8 NVIDIA A100 GPUs [1].
Despite its small scale, Phi-1 achieved 50.6% pass@1 accuracy on HumanEval and 55.5% on MBPP, performance comparable to models 10 times larger that had been trained on 100 times more data. This result provided compelling evidence that data quality could substitute for data quantity and model scale in specific domains [1].
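Pass@1 is the standard functional-correctness metric used by HumanEval and MBPP: the probability that at least one of k sampled completions passes the problem's unit tests. The unbiased estimator from the paper that introduced HumanEval can be sketched in a few lines:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.
    n = samples generated per problem, c = samples that pass the tests.
    Returns the probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With one sample per problem, pass@1 reduces to the fraction of
# problems solved; e.g. 83 of 164 HumanEval tasks -> ~50.6%.
print(pass_at_k(1, 1, 1))  # 1.0
print(83 / 164)            # ~0.506
```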
Released in September 2023, Phi-1.5 extended the Phi-1 approach from code generation to natural language understanding and common sense reasoning. The model retained the same 1.3 billion parameters but was trained on a larger dataset of approximately 30 billion tokens that included both the code-focused data from Phi-1 and new synthetic "textbook-quality" data covering topics in common sense, world knowledge, and logical reasoning [3].
Phi-1.5 demonstrated that the data-quality-over-quantity principle was not limited to code. The model performed competitively with much larger models on reasoning benchmarks, establishing a pattern that would continue throughout the Phi series.
Phi-2, released on December 12, 2023, scaled the approach to 2.7 billion parameters. It was trained on 1.4 trillion tokens over 14 days using 96 A100 GPUs [4].
Phi-2 built on the training insights from Phi-1 and Phi-1.5 with two key additions [4]:

- An expanded mixture of synthetic "textbook-quality" data designed to teach common sense reasoning and general knowledge, combined with web data carefully filtered for educational value.
- Scaled knowledge transfer: training started from the knowledge embedded in the 1.3-billion-parameter Phi-1.5, which accelerated convergence and boosted benchmark scores.
Notably, Phi-2 was released as a base model without instruction tuning or RLHF alignment. Its strong performance came entirely from pretraining data quality.
With only 2.7 billion parameters, Phi-2 surpassed Mistral 7B and LLaMA 2 models at both 7B and 13B parameter counts on aggregated benchmarks. On multi-step reasoning tasks in coding and mathematics, it outperformed the 70-billion-parameter LLaMA 2 model. It also matched or exceeded Google's Gemini Nano 2 despite being a smaller model [4].
| Benchmark Category | Phi-2 (2.7B) |
|---|---|
| BigBench-Hard | 59.2 |
| Commonsense Reasoning | 68.8 |
| Language Understanding | 62.0 |
| Math | 61.1 |
| Coding | 53.7 |
These results drew significant attention from the research community and demonstrated that the "textbooks" approach scaled to general-purpose language modeling, not just code generation.
The Phi-3 family, released in April 2024, represented the first time Microsoft positioned Phi models as practical production-ready systems rather than primarily research demonstrations [5].
Phi-3-mini is a 3.8-billion-parameter model trained on 3.3 trillion tokens. It was released in two context-length variants: a 4K token version for constrained environments and a 128K token version using LongRoPE for extended context applications. The model architecture consists of 32 layers with 3,072 hidden dimensions and 32 attention heads [5].
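LongRoPE's published method searches for non-uniform, per-dimension rescale factors, but the underlying idea of stretching rotary position embeddings so a short-context model can address a longer window can be illustrated with plain uniform interpolation. This is a simplification of the idea, not the published algorithm:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary position embedding angles. A `scale` > 1 compresses
    positions so that a model trained on short contexts maps longer
    positions back into its trained range (uniform interpolation;
    LongRoPE searches for non-uniform per-dimension factors)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions / scale, inv_freq)

# Position 131072 with scale 32 lands on the same rotary angles as
# position 4096 unscaled, so the trained frequency range is reused.
short = rope_angles(np.array([4096]), dim=64)
long_ = rope_angles(np.array([131072]), dim=64, scale=32.0)
assert np.allclose(short, long_)
```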
The training dataset was described as a scaled-up version of the Phi-2 data, composed of heavily filtered web data and synthetic data, followed by alignment training for safety and chat formatting.
Phi-3-mini achieved 68.8% on MMLU and 8.38 on MT-Bench, performance that rivaled the much larger Mixtral 8x7B (with 12.9 billion active parameters) and GPT-3.5 [5]. The model could be quantized to 4 bits, occupying approximately 1.8 GB of memory and running at over 12 tokens per second on an iPhone 14 with an A16 Bionic chip.
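The quoted on-device footprint follows directly from the arithmetic of 4-bit quantization. A quick back-of-the-envelope sketch, ignoring the KV cache and quantization metadata:

```python
def quantized_size_gb(params_billions: float, bits: int) -> float:
    """Approximate weight-memory footprint of a quantized model.
    Ignores KV cache, activations, and quantization metadata."""
    bytes_total = params_billions * 1e9 * bits / 8
    return bytes_total / 2**30  # GiB

# 3.8B parameters at 4 bits per weight is roughly 1.8 GB, matching the
# reported footprint of Phi-3-mini running on a phone.
print(round(quantized_size_gb(3.8, 4), 2))
```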
Microsoft also released Phi-3-small (7 billion parameters) and Phi-3-medium (14 billion parameters) to provide higher-capacity options within the same architecture family. These models offered progressively better performance at the cost of higher resource requirements.
The accompanying technical report, "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone" (arXiv:2404.14219), provided detailed documentation of training procedures, benchmark evaluations, and deployment optimization techniques [5].
Released in August 2024, the Phi-3.5 family introduced three variants that expanded the Phi lineup in new directions [6].
An updated version of Phi-3-mini with 3.8 billion parameters, trained on 3.4 trillion tokens using 512 H100 GPUs over 10 days. The model added support for over 20 languages including Arabic, Chinese, Japanese, Korean, Russian, and several European languages, significantly improving multilingual performance compared to Phi-3 [6].
The most architecturally distinct model in the Phi-3.5 release, Phi-3.5-MoE uses a mixture-of-experts architecture with 16 expert networks, selecting the top 2 experts per token. The total parameter count is 16 x 3.8 billion, but with only 6.6 billion parameters active during inference. It was trained on 4.9 trillion tokens using 512 H100 GPUs over 23 days [6].
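A top-2 mixture-of-experts layer of the kind described above can be sketched as follows. The function and variable names are illustrative, not Phi-3.5-MoE's actual implementation: a linear gate scores all experts, only the two highest-scoring experts run, and their outputs are mixed by renormalized gate weights, which is why only a fraction of the total parameters is active per token.

```python
import numpy as np

def top2_moe(x, gate_w, experts):
    """Minimal top-2 MoE layer: score all experts, run only the best
    two, and mix their outputs by softmax-renormalized gate weights."""
    logits = x @ gate_w                 # one score per expert
    top2 = np.argsort(logits)[-2:]      # indices of the best 2 experts
    w = np.exp(logits[top2])
    w /= w.sum()                        # softmax over the chosen 2
    return sum(wi * experts[i](x) for wi, i in zip(w, top2))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
gate = rng.standard_normal((8, 16))     # route among 16 experts
experts = [(lambda W: (lambda v: v @ W))(rng.standard_normal((8, 8)))
           for _ in range(16)]          # 16 toy linear "experts"
y = top2_moe(x, gate, experts)
print(y.shape)  # (8,)
```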
Despite its modest active parameter count, Phi-3.5-MoE achieved performance comparable to Gemini 1.5 Flash and GPT-4o mini on language reasoning, math, and coding tasks, while outperforming LLaMA 3.1 and Mixtral models of similar scale.
A 4.2-billion-parameter multimodal model capable of processing both single-image and multi-image inputs alongside text prompts. It was derived from Phi-3.5-mini and trained on 500 billion tokens using 256 A100 GPUs over 6 days. This was the first model in the Phi family to support visual understanding [6].
Phi-4, released on December 12, 2024, is a 14-billion-parameter model that represents the most ambitious application of the synthetic data training philosophy in the Phi series [7].
Phi-4 is a dense decoder-only transformer with 14 billion parameters and a default context length of 16,384 tokens. It was trained on 9.8 trillion tokens over 21 days using 1,920 NVIDIA H100 GPUs. The training data included approximately 400 billion high-quality synthetic tokens spread across more than 50 distinct synthetic datasets, each generated using different seed data and multi-stage prompting procedures [7].
The training data mixture consisted of five main categories:

- Synthetic data, generated with techniques such as multi-agent prompting, self-revision, and instruction reversal.
- Filtered public web data, selected for reasoning-dense, educational content.
- Web rewrites, in which web content was transformed into higher-quality synthetic variants.
- Code data, combining raw repositories with synthetic code exercises.
- Targeted acquired sources, such as academic books and question-answer datasets.
While previous Phi models relied heavily on distillation from GPT-4 to generate their synthetic training data, Phi-4 substantially surpassed GPT-4 on STEM-focused question-answering tasks, demonstrating that the training methodology had moved beyond simple distillation into genuine capability gains.
Phi-4 outperformed both GPT-4o and Meta's LLaMA 3.3 70B on the GPQA and MATH benchmarks, a remarkable result for a 14-billion-parameter model. Performance improvements over Phi-3 exceeded 20% on some benchmarks [7].
| Benchmark | Phi-4 (14B) | Description |
|---|---|---|
| MMLU | 84.8 | Multi-task language understanding |
| GPQA | 56.1 | Graduate-level reasoning |
| MATH | 80.4 | Mathematical problem solving |
| HumanEval | 82.6 | Code generation |
| MGSM | 80.6 | Multilingual math |
| DROP | 75.5 | Complex comprehension and reasoning |
In February 2025, Microsoft released two new models that extended the Phi-4 generation in complementary directions [8].
Phi-4-mini is a 3.8-billion-parameter dense decoder-only transformer trained on 5 trillion tokens using 512 A100-80G GPUs over 21 days. Compared to its predecessor Phi-3.5-mini, it introduced several architectural improvements [8]:

- An expanded vocabulary of 200,000 tokens for better multilingual coverage.
- Grouped-query attention (GQA), which shrinks the key-value cache and speeds up long-context inference.
- LongRoPE positional scaling to support the 128K-token context window.
- Built-in function-calling support for tool use and agentic workflows.
The training data combined publicly available documents filtered for quality, synthetic "textbook-like" data for math, coding, common sense reasoning, and general knowledge, plus high-quality chat format supervised data. Post-training included both supervised fine-tuning and direct preference optimization.
| Benchmark | Phi-4-mini (3.8B) | Phi-3.5-mini (3.8B) | Llama-3.2-3B | Qwen2.5-7B | GPT-4o-mini |
|---|---|---|---|---|---|
| MMLU (5-shot) | 67.3 | 65.5 | 61.8 | 72.6 | 77.2 |
| GSM8K (8-shot, CoT) | 88.6 | 76.9 | 75.6 | 88.7 | 91.3 |
| MATH (0-shot, CoT) | 64.0 | 49.8 | 46.7 | 60.4 | 70.2 |
| BigBench Hard | 70.4 | 63.1 | 55.4 | 72.4 | 80.4 |
| Arena Hard | 32.8 | 34.4 | 17.0 | 55.5 | 53.7 |
Phi-4-multimodal is a 5.6-billion-parameter model and the first in the Phi family to support text, audio, and vision inputs within a single unified architecture. Built on the Phi-4-mini backbone, it was trained on 512 A100-80G GPUs over 28 days using a combined dataset of 5 trillion text tokens, 2.3 million hours of speech data, and 1.1 trillion vision-language tokens [8].
Rather than using separate models for each modality, Phi-4-multimodal employs a mixture-of-LoRAs approach where speech, vision, and language processing share the same core model with modality-specific low-rank adapters. This design keeps the model compact while enabling strong performance across all three modalities.
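The mixture-of-LoRAs idea (one frozen backbone weight plus a per-modality low-rank update) can be sketched as follows; the names and shapes are illustrative rather than taken from the Phi-4-multimodal implementation:

```python
import numpy as np

def lora_forward(x, W, adapters, modality):
    """One frozen base weight W is shared by all modalities; each
    modality contributes only a low-rank update A @ B. Selecting the
    adapter switches behavior without duplicating the backbone."""
    A, B = adapters[modality]
    return x @ (W + A @ B)   # base path plus modality-specific delta

rng = np.random.default_rng(0)
d, r = 16, 2                 # hidden size 16, LoRA rank 2
W = rng.standard_normal((d, d))            # frozen backbone weight
adapters = {m: (rng.standard_normal((d, r)), rng.standard_normal((r, d)))
            for m in ("speech", "vision")}  # tiny per-modality adapters
x = rng.standard_normal(d)
print(lora_forward(x, W, adapters, "vision").shape)  # (16,)
```

The memory argument is visible in the shapes: each adapter adds only 2·d·r parameters versus d·d for a full extra weight matrix.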
Speech capabilities. The model achieved the top position on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, surpassing specialized models like WhisperV3 (6.5% WER) and SeamlessM4T-v2-Large. It is among the first open models to support speech summarization at performance levels comparable to GPT-4o [8].
Vision capabilities. Despite having only 5.6 billion parameters, Phi-4-multimodal demonstrated strong performance on mathematical and scientific visual reasoning. Select vision benchmark results are shown below.
| Vision Benchmark | Phi-4-multimodal (5.6B) | Phi-3.5-vision (4.2B) | Qwen 2.5-VL-7B | GPT-4o |
|---|---|---|---|---|
| MMMU | 55.1 | 43.0 | 51.8 | 61.7 |
| MMBench (dev-en) | 86.7 | 81.9 | 87.8 | 89.0 |
| ScienceQA Visual | 97.5 | -- | -- | 97.3 |
| DocVQA | 93.2 | -- | -- | 95.7 |
| ChartQA | 81.4 | -- | -- | 85.0 |
| OCRBench | 84.4 | -- | -- | 87.7 |
Phi-4-multimodal also excels in cross-modal tasks, combining speech and vision inputs simultaneously. On spoken document understanding tasks (asking questions about documents via voice), it outperformed both InternOmni-7B and Gemini 2.0 Flash [8].
Released on April 30, 2025, the Phi-4-reasoning family brought dedicated chain-of-thought reasoning capabilities to the Phi lineup [9].
Phi-4-reasoning is a 14-billion-parameter model fine-tuned from Phi-4 on approximately 16 billion tokens (roughly 8.3 billion unique tokens) of curated reasoning data. The training focused on 1.4 million high-quality STEM and coding prompts, many of which were enhanced using OpenAI o3-mini to generate detailed reasoning traces. Training was completed in just 2.5 days on 32 H100-80G GPUs [9].
The model produces outputs in two sections: a reasoning chain-of-thought block where it works through the problem step by step, followed by a summarization block with the final answer.
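Downstream code typically separates the two sections before showing only the final answer to users. A minimal parser, assuming the common `<think>...</think>` delimiter convention (check the model card for the exact tags a given checkpoint emits):

```python
import re

def split_reasoning(output: str):
    """Split a reasoning-model response into its chain-of-thought block
    and the final answer, assuming <think>...</think> delimiters."""
    m = re.search(r"<think>(.*?)</think>\s*(.*)", output, re.DOTALL)
    if m is None:
        return None, output.strip()   # no reasoning block found
    return m.group(1).strip(), m.group(2).strip()

chain, answer = split_reasoning(
    "<think>2+2: add the units digits.</think>The answer is 4.")
print(answer)  # The answer is 4.
```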
Three variants were released simultaneously:

- Phi-4-reasoning: the base 14-billion-parameter reasoning model, produced by supervised fine-tuning of Phi-4 on curated chain-of-thought data.
- Phi-4-reasoning-plus: the same 14B model further trained with outcome-based reinforcement learning, trading longer reasoning traces for higher accuracy.
- Phi-4-mini-reasoning: a 3.8-billion-parameter model fine-tuned on roughly 150 billion tokens of synthetic math data, aimed at resource-constrained deployments.
The Phi-4-reasoning models achieved results that challenged assumptions about what small models could accomplish on difficult reasoning tasks.
| Benchmark | Phi-4-reasoning (14B) | Phi-4-reasoning-plus (14B) | Phi-4-mini-reasoning (3.8B) |
|---|---|---|---|
| AIME 2025 | 62.9 | 78.0 | -- |
| AIME 2024 | 75.3 | 81.3 | 57.5 |
| MATH-500 | -- | -- | 94.6 |
| GPQA Diamond | 65.8 | 68.9 | 52.0 |
| HumanEvalPlus | 92.9 | 92.3 | -- |
| LiveCodeBench (8/24-2/25) | 53.8 | 53.1 | -- |
| OmniMath | 76.6 | 81.9 | -- |
| MMLU-Pro | 74.3 | 76.0 | -- |
| ArenaHard | 73.3 | 79.0 | -- |
Phi-4-reasoning outperformed OpenAI o1-mini and DeepSeek-R1-Distill-Llama-70B on most evaluated benchmarks, and Phi-4-reasoning-plus achieved performance comparable to the full DeepSeek R1 model (671 billion parameters) on AIME 2025. This result was particularly striking because Phi-4-reasoning-plus is roughly 48 times smaller than DeepSeek R1 [9].
Released on July 9, 2025, Phi-4-mini-flash-reasoning is a 3.8-billion-parameter model designed for scenarios where compute, memory, and latency are tightly constrained [10].
The model introduced a novel hybrid architecture called SambaY, which represents a departure from the pure transformer design used in all previous Phi models. SambaY combines three components:

- Mamba state-space layers, which process sequences in linear rather than quadratic time.
- Sliding-window attention layers, which capture local context at low cost.
- Gated Memory Units (GMUs), which share representations across layers and cut the cost of decoding long reasoning traces.
This architecture achieves up to 10 times higher throughput and 2 to 3 times lower average latency compared to standard transformer-based reasoning models of similar size, making it viable for real-time applications on edge hardware [10].
The model was trained exclusively on synthetic mathematical content generated by DeepSeek R1. On MATH-500, it achieved 92.45% pass@1 accuracy, outperforming the standard Phi-4-mini-reasoning (91.2%) and surpassing other open models in its size class including Qwen-1.5B and Bespoke-Stratos-7B. The model supports a 64K token context length [10].
Released on March 4, 2026, Phi-4-reasoning-vision-15B is the newest addition to the Phi family and the first Phi model to combine multimodal understanding with chain-of-thought reasoning [11].
The model has 15 billion parameters and uses a mid-fusion architecture that combines the Phi-4-reasoning language model backbone with a SigLIP-2 vision encoder. It supports a 16,384 token context length and can process up to 3,600 visual tokens per image through a dynamic resolution vision encoder. A distinctive architectural feature is the use of bidirectional attention within image tokens, which improves spatial reasoning about visual content [11].
Phi-4-reasoning-vision introduced a selective thinking mechanism that distinguishes it from prior reasoning models. The model can operate in two modes:

- A thinking mode that emits explicit chain-of-thought (`<think>...</think>` blocks) for complex mathematical, scientific, or logical tasks.
- A non-thinking mode (signaled by `<nothink>`) for perception-focused tasks like image captioning or OCR where extended reasoning is unnecessary.

The model automatically selects the appropriate mode based on task complexity, reducing wasted computation on simple queries while maintaining strong performance on difficult problems [11].
The model was trained on approximately 200 billion tokens of multimodal data using 240 NVIDIA B200 GPUs over 4 days. This training data volume is significantly smaller than what competitors typically use (often exceeding 1 trillion tokens) [11].
| Benchmark | Phi-4-reasoning-vision (15B) | Description |
|---|---|---|
| ScreenSpot-V2 | 88.2 | GUI grounding |
| AI2D | 84.8 | Diagram understanding |
| ChartQA | 83.3 | Chart reasoning |
| OCRBench | 76.0 | Optical character recognition |
| MathVista | 75.2 | Visual math reasoning |
| MMMU | 54.3 | Multimodal understanding |
The model is particularly strong at computer-use agent tasks, interpreting graphical user interfaces and localizing interactive elements on screen. It also handles scientific diagram analysis, handwritten equation parsing, and document extraction [11].
Phi Silica is a specialized variant of the Phi family optimized specifically for the neural processing units (NPUs) found in Windows Copilot+ PCs. Announced in December 2024 and made available to developers starting in January 2025, Phi Silica has 3.3 billion parameters and is integrated directly into the Windows operating system as part of the Windows AI Foundry platform [12].
Phi Silica is designed for extreme efficiency on NPU hardware. On Copilot+ PC devices equipped with Qualcomm Snapdragon X Elite processors, the model processes prompts at approximately 650 tokens per second while consuming only about 1.5 watts of power. Context processing on the NPU consumes 4.8 milliwatt-hours of energy, a 56% reduction in energy consumption compared to running the same model on the CPU [12].
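The reported figures imply a CPU baseline. Assuming the 56% figure means the NPU uses 56% less energy than the CPU for the same context pass, the implied CPU cost is easy to back out:

```python
npu_mwh = 4.8          # reported NPU energy per context pass (mWh)
improvement = 0.56     # NPU uses 56% less energy than the CPU

# If NPU = CPU * (1 - improvement), then CPU = NPU / (1 - improvement).
cpu_mwh = npu_mwh / (1 - improvement)
print(round(cpu_mwh, 1))  # ~10.9 mWh implied for the same work on CPU
```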
The model is delivered as an OS-managed component that can be preloaded in memory, enabling near-instant response times. It powers several built-in Windows features, including the "Click to Do" functionality, and is available as a developer API through the Windows App SDK.
Throughout 2025, Microsoft expanded Phi Silica support beyond Qualcomm-based devices to include Intel and AMD silicon, delivering updates through Windows component packages across the 24H2, 25H2, and 26H1 branches. In May 2025, the Phi-4-reasoning and Phi-4-mini-reasoning models, optimized using ONNX Runtime, became available on Snapdragon-powered Copilot+ PCs [12].
The defining contribution of the Phi series to the broader AI field is the empirical demonstration that training data quality can substitute for model scale to a far greater degree than was previously believed. Several specific principles have emerged from the Phi research program [1][4][7]:

- Carefully filtered, "textbook-quality" data lets a small model match models trained on orders of magnitude more tokens.
- Synthetic data works best when generated from diverse seed material using multi-stage prompting, rather than as simple distillation of a single teacher model.
- Knowledge transfer from a smaller predecessor model, as from Phi-1.5 to Phi-2, accelerates training at larger scale.
- Data quality gains compound with, rather than replace, post-training techniques such as supervised fine-tuning, preference optimization, and reinforcement learning.
These principles have influenced training methodology beyond Microsoft. The success of the Phi series contributed to broader industry interest in synthetic data for training, with companies like Google and Meta subsequently investing more heavily in synthetic data pipelines for their own models.
Microsoft has integrated Phi models across its product and developer ecosystem in several ways:

- Windows, where the NPU-optimized Phi Silica variant ships as an OS-managed component on Copilot+ PCs and is exposed to developers through the Windows App SDK.
- Azure AI Foundry, which hosts Phi models for cloud inference and fine-tuning.
- ONNX Runtime-optimized builds of the Phi-4 reasoning models for on-device use on Copilot+ hardware.
The Phi series competes in an increasingly crowded small language model space. As of early 2026, several major players offer competitive alternatives.
Google's Gemma family, released in its third generation (Gemma 3) in March 2025, offers models ranging from 270 million to 27 billion parameters. The Gemma 3 4B model supports multimodal input (images and text) with a 128K context window and scores 71.3% on HumanEval and 89.2% on GSM8K. Google also released Gemma 3n, a purpose-built mobile variant with a 3 GB memory footprint that was the first sub-10B model to surpass 1,300 Elo on LMArena. Gemma uses the Gemma Terms of Use license rather than MIT [13].
Alibaba's Qwen series released Qwen3 in April 2025 with models ranging from 600 million to 235 billion parameters, all under the Apache 2.0 license. The Qwen3-4B model rivals the performance of Qwen2.5-72B-Instruct (a model 18 times larger), while the Qwen3-30B-A3B MoE model (3 billion active parameters) outperforms QwQ-32B. In early 2026, Alibaba released the Qwen 3.5 small model series, with Qwen3.5-9B scoring 82.5 on MMLU-Pro and 81.7 on GPQA Diamond. Qwen models were trained on approximately 36 trillion tokens and support 119 languages [14].
Meta's LLaMA 3.2 (September 2024) introduced lightweight 1B and 3B text models alongside vision-capable 11B and 90B variants. The LLaMA 3.2 3B model supports 128K context and outperformed Phi-3.5-mini on instruction following and summarization tasks at the time of its release. However, LLaMA models use the more restrictive Llama Community License rather than MIT [15].
| Model Family | Developer | Smallest Size | Largest Small Model | License | Multimodal |
|---|---|---|---|---|---|
| Phi-4 | Microsoft | 3.3B (Silica) | 15B (reasoning-vision) | MIT | Text, vision, speech |
| Gemma 3 | 270M | 27B | Gemma Terms of Use | Text, vision | |
| Qwen3 | Alibaba | 600M | 32B (dense) | Apache 2.0 | Text, vision |
| LLaMA 3.2 | Meta | 1B | 3B (text-only) | Llama Community License | Text, vision (11B+) |
Phi's key differentiators remain its MIT license (the most permissive among major model families), its strong per-parameter efficiency rooted in the textbook training approach, and its uniquely deep integration with the Windows and Azure ecosystems.
All Phi models are released under the MIT license, one of the most permissive open-source licenses available. This means the models can be freely used, modified, and distributed for both commercial and non-commercial purposes with minimal restrictions. The MIT license distinguishes the Phi series from competitors like Meta's LLaMA models (released under the more restrictive Llama Community License) and makes Phi models particularly attractive for enterprises and startups that need full legal flexibility [2].
Phi models are available on Hugging Face, Azure AI Foundry, the NVIDIA API Catalog, GitHub Models, and through the Ollama framework for local deployment.
As of March 2026, the Phi model family has established Microsoft as a leading developer of small language models. Over the span of less than three years, the series has grown from a single 1.3-billion-parameter code generation model to a comprehensive family spanning text, vision, speech, and reasoning. The most recent release, Phi-4-reasoning-vision-15B, demonstrates that multimodal reasoning can be achieved at the 15-billion-parameter scale using selective thinking to balance performance and efficiency [11].
The Phi-4-reasoning models have been particularly influential. Matching or exceeding the performance of OpenAI o1-mini with a 14-billion-parameter model, and approaching the full DeepSeek R1 (671 billion parameters) on mathematical reasoning benchmarks, challenged widespread assumptions about the relationship between model size and reasoning capability [9].
Microsoft continues to push the boundaries of efficient architecture design, as demonstrated by the SambaY hybrid Mamba-attention architecture in Phi-4-mini-flash-reasoning, which achieves a 10-fold throughput improvement over standard transformers [10]. The integration of Phi models into Windows through Phi Silica, combined with ongoing NPU optimization for Intel, AMD, and Qualcomm hardware, positions the Phi family for growing adoption in on-device AI applications [12].
Looking ahead, the competition among small language models continues to intensify, with Google, Alibaba, and Meta all releasing increasingly capable small models. The Phi series' core thesis, that training data quality and curation can compensate for model scale, has been validated across five generations of releases and adopted as a guiding principle by the broader research community.