Phi-3
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 4,153 words
Phi-3 is a family of small, efficient open-weight language models developed by [[microsoft|Microsoft]] and first released on April 23, 2024. The family was designed to demonstrate that aggressive data-quality curation, combined with a continuation of Microsoft Research's "textbooks are all you need" philosophy, could produce small language models (SLMs) competitive with much larger systems and capable of running locally on consumer hardware, including smartphones.[^1][^2] The initial release centered on Phi-3 Mini, a 3.8-billion-parameter dense Transformer offered in 4K and 128K context variants. Subsequent releases in May 2024 added Phi-3 Small (7B), Phi-3 Medium (14B), and the multimodal Phi-3 Vision (4.2B).[^3] An updated Phi-3.5 sub-family followed in August 2024, comprising Phi-3.5-Mini, the first Mixture-of-Experts model in the lineage (Phi-3.5-MoE), and a refreshed Phi-3.5-Vision.[^4] All Phi-3 and Phi-3.5 models ship under the permissive [[mit_license|MIT License]] with open weights distributed through [[hugging_face|Hugging Face]] and the [[azure_ai|Azure AI]] Model Catalog.[^5]
The accompanying paper, Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone (arXiv:2404.14219), reports that a 4-bit quantized Phi-3 Mini occupies approximately 1.8 GB and runs at over 12 tokens per second on an iPhone 14 with an Apple A16 Bionic, marking one of the first widely publicized demonstrations of a frontier-quality SLM operating entirely on-device.[^1] The Phi-3 family has since been superseded by [[phi_4|Phi-4]] (14B, December 2024), Phi-4-Mini (3.8B, February 2025), and Phi-4-Multimodal (5.6B, February 2025), but remains widely deployed for [[on_device_ai|on-device AI]] and cost-sensitive inference workloads.[^6][^7]
Phi-3 is the fourth named generation in a research lineage that began at [[microsoft_research|Microsoft Research]] in 2023. The earlier models established the methodological premise that would carry through to Phi-3: that small models trained on tightly curated, "textbook-quality" data can outperform substantially larger models trained on raw web text.
Phi-1, released in mid-2023, was a 1.3-billion-parameter model focused on Python coding. Its accompanying paper, "[[textbooks_are_all_you_need|Textbooks Are All You Need]]" (Gunasekar et al., 2023, arXiv:2306.11644), argued that the composition and clarity of training data — not raw volume — was the dominant factor in determining capability per parameter for code generation.[^8] Phi-1 achieved competitive HumanEval scores against models several times its size.
Phi-1.5 followed later in 2023, also at 1.3 billion parameters, extending the approach to common-sense reasoning and natural-language tasks.[^9] [[phi_2|Phi-2]], released in December 2023 at 2.7 billion parameters, demonstrated that the textbook-data approach scaled and that knowledge distillation from a smaller checkpoint (Phi-1.5) could be combined with synthetic data generation to produce a model competitive on reasoning benchmarks with systems up to 25 times larger.[^10]
Phi-3 represented both a scaling and a productization of this line of work. Where Phi-1 through Phi-2 had been principally research artifacts demonstrating the data-quality thesis, Phi-3 was conceived as a deployable family spanning multiple sizes, context lengths, and modalities. It was released on day one through [[azure_ai|Azure AI]] Studio, [[hugging_face|Hugging Face]], NVIDIA NIM microservices, and Ollama, with optimized ONNX-runtime variants for cross-platform on-device deployment.[^2][^11]
The Phi research lineage originated within Microsoft Research's Machine Learning Foundations group, with Sebastien Bubeck among its leading researchers. The work formed part of a broader strategic bet by Microsoft that small, efficient models (deployable on consumer hardware, fine-tunable for narrow domains, and cheap to serve at scale) would constitute an important complementary tier to frontier-scale systems such as those produced by [[openai|OpenAI]], with which Microsoft maintains a substantial commercial and infrastructure partnership.[^2] Where frontier models target maximum capability per query, the Phi line targets maximum capability per parameter and per watt, and is intended for production scenarios where latency, cost, and privacy constraints make cloud-only frontier inference impractical.
Phi-3 Mini was the founding model of the Phi-3 family and the only variant available at the April 23, 2024 announcement. It is a dense, decoder-only Transformer with 3.8 billion parameters arranged in 32 layers, with 32 attention heads and a hidden dimension of 3,072.[^1] Its vocabulary contains 32,064 tokens and uses a tokenizer compatible with the [[llama_2|Llama-2]] format, allowing weights to be loaded by existing Llama-2 tooling.[^12]
Phi-3 Mini was released in two context-length variants from the start: a 4K-token variant and a 128K-token variant.
The base model was trained on 3.3 trillion tokens drawn from a heavily filtered web corpus and a large body of synthetic data — a budget characterized in the technical report as "data-optimal" rather than [[chinchilla|compute-optimal]], emphasizing per-token quality over total volume.[^1] Post-training combined supervised fine-tuning (SFT) on high-quality instruction and chat data with [[dpo|Direct Preference Optimization]] (DPO) for alignment.[^1] The data cutoff for the base model is October 2023.
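To make the "data-optimal" framing concrete, the short sketch below compares Phi-3 Mini's token budget with the roughly 20 training tokens per parameter that is often quoted as a compute-optimal rule of thumb. That 20:1 ratio is an outside approximation and not a figure from the Phi-3 report, and the function name is purely illustrative.

```python
# Rough tokens-per-parameter comparison (illustrative arithmetic only).
# Assumption: ~20 tokens per parameter as a compute-optimal rule of thumb;
# this constant does not come from the Phi-3 technical report.
COMPUTE_OPTIMAL_TOKENS_PER_PARAM = 20


def tokens_per_param(train_tokens: float, params: float) -> float:
    """Return the number of training tokens seen per model parameter."""
    return train_tokens / params


# 3.3 trillion tokens and 3.8 billion parameters, as quoted for Phi-3 Mini.
phi3_mini_ratio = tokens_per_param(3.3e12, 3.8e9)
overtrain_factor = phi3_mini_ratio / COMPUTE_OPTIMAL_TOKENS_PER_PARAM

print(f"Phi-3 Mini: about {phi3_mini_ratio:.0f} tokens per parameter")
print(f"Roughly {overtrain_factor:.0f}x the assumed compute-optimal ratio")
```

Under these assumptions the model sees several hundred tokens per parameter, far more than the rule of thumb would allocate, which is consistent with the report's emphasis on token quality for a fixed small model size.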
The Phi-3 Mini paper reports MMLU (5-shot) of approximately 69%, MT-bench of 8.38, and HumanEval (0-shot) of 60.4, all measured against contemporaneous open models including [[llama_3|Llama 3 8B]], Mistral 7B, and Gemma 7B.[^1]
Phi-3 Mini's defining demonstration was on-device inference: a 4-bit quantized version using Activation-aware Weight Quantization (AWQ) occupies approximately 1.8 GB of storage and runs at over 12 tokens per second on an iPhone 14 with the Apple A16 Bionic, fully offline.[^1][^14]
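The 1.8 GB figure follows almost directly from the parameter count. The sketch below is a back-of-the-envelope estimate only: it ignores quantization scales, zero points, and any tensors kept in higher precision, and the helper name is hypothetical.

```python
def quantized_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, ignoring quantization metadata."""
    return n_params * bits_per_weight / 8


size_bytes = quantized_weight_bytes(3.8e9, 4)  # 3.8B parameters at 4 bits each
size_gb = size_bytes / 1e9        # decimal gigabytes
size_gib = size_bytes / 2**30     # binary gibibytes

print(f"{size_gb:.2f} GB ({size_gib:.2f} GiB)")
```

The estimate lands at 1.9 decimal gigabytes, or about 1.77 GiB, which is in line with the approximate 1.8 GB reported.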
Two larger Phi-3 variants were released on May 21, 2024, expanding the family upward while retaining the same data philosophy and the option of 128K context windows.[^3]
Phi-3 Small is a 7-billion-parameter dense Transformer with several architectural changes relative to Phi-3 Mini. It uses the tiktoken tokenizer with a 100,352-token vocabulary, providing substantially better coverage of non-English scripts and improving tokenization efficiency for multilingual content.[^15] Its attention mechanism uses [[gqa|grouped-query attention]] (GQA) with four query heads sharing each key-value head, reducing the KV-cache memory footprint at inference.
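The effect of grouped-query attention on KV-cache size can be estimated with simple arithmetic. Phi-3 Small's layer and head counts are not given here, so the sketch below borrows the Phi-3 Mini shape (32 layers, 32 heads, hidden size 3,072) purely as a stand-in, and assumes 16-bit cache entries and a 4,096-token sequence.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both a key and a value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Stand-in shape (assumption): Phi-3 Mini dimensions, not Phi-3 Small's.
N_LAYERS, N_HEADS, HIDDEN = 32, 32, 3072
HEAD_DIM = HIDDEN // N_HEADS      # 96
SEQ_LEN = 4096
QUERIES_PER_KV_HEAD = 4           # grouping factor described in the text

mha_bytes = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
gqa_bytes = kv_cache_bytes(N_LAYERS, N_HEADS // QUERIES_PER_KV_HEAD, HEAD_DIM, SEQ_LEN)

print(f"full multi-head cache: {mha_bytes / 2**30:.3f} GiB")
print(f"grouped-query cache:   {gqa_bytes / 2**30:.3f} GiB")
```

Because the cache scales linearly with the number of key-value heads, sharing each one across four query heads cuts the cache to a quarter regardless of the other dimensions.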
A second notable architectural choice in Phi-3 Small is its alternating dense-and-blocksparse attention pattern. Layers alternate between standard full-context attention and a novel block-sparse mechanism in which each attention head enforces a distinct sparsity pattern over the KV cache. This ensures that across the set of heads, every token position is attended to, while substantially lowering memory and compute relative to a fully dense implementation at long sequence lengths.[^1]
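The idea that each head uses its own sparsity pattern while the heads jointly cover every position can be illustrated with a toy mask. The pattern below (a local band plus head-specific strided columns) is a hypothetical choice for illustration and is not claimed to be the pattern used in Phi-3 Small.

```python
import numpy as np


def head_block_mask(n_blocks: int, head: int, stride: int, local: int = 1) -> np.ndarray:
    """Boolean [query_block, key_block] mask for one head (hypothetical pattern)."""
    q = np.arange(n_blocks)[:, None]
    k = np.arange(n_blocks)[None, :]
    causal = k <= q                            # never attend to future blocks
    local_band = (q - k) < local               # the most recent `local` blocks
    strided = (k % stride) == (head % stride)  # columns owned by this head
    return causal & (local_band | strided)


n_blocks, n_heads = 16, 4
masks = np.stack([head_block_mask(n_blocks, h, stride=n_heads) for h in range(n_heads)])

causal_full = np.tril(np.ones((n_blocks, n_blocks), dtype=bool))
union = masks.any(axis=0)                             # blocks seen by at least one head
density = masks.sum(axis=(1, 2)) / causal_full.sum()  # fraction of dense work per head

print("every causal block covered:", bool((union == causal_full).all()))
print("per-head density:", np.round(density, 2))
```

In this toy setup each head touches well under half of the causal blocks, yet the union over heads equals the full causal mask, which is the property the text describes.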
Phi-3 Small was trained on 4.8 trillion tokens over 18 days using 1,024 H100-80GB GPUs, with roughly 10% of the corpus drawn from multilingual sources.[^15] Both 8K and 128K context variants are released.
Phi-3 Medium is a 14-billion-parameter dense Transformer with 40 layers, 40 attention heads, and an embedding dimension of 5,120.[^16] It shares the same 32,064-token vocabulary and tokenizer format as Phi-3 Mini. Training ran from February to April 2024 on 512 H100-80GB GPUs over 42 days, consuming 4.8 trillion tokens from the same curated corpus as Phi-3 Small.[^16] The base model has an October 2023 data cutoff.
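The reported GPU counts and durations for Phi-3 Small and Phi-3 Medium can be turned into rough GPU-day and throughput figures. This is a simplified estimate that assumes every GPU was busy for the whole stated period, which the model cards do not claim; the function name is illustrative.

```python
SECONDS_PER_DAY = 86_400


def tokens_per_gpu_second(train_tokens: float, n_gpus: int, days: float) -> float:
    """Average tokens processed per GPU per second, assuming full utilisation."""
    return train_tokens / (n_gpus * days * SECONDS_PER_DAY)


# Figures as quoted in the text: (tokens, GPUs, days).
small_gpu_days = 1024 * 18
medium_gpu_days = 512 * 42
small_rate = tokens_per_gpu_second(4.8e12, 1024, 18)
medium_rate = tokens_per_gpu_second(4.8e12, 512, 42)

print(f"Phi-3 Small:  {small_gpu_days} GPU-days, ~{small_rate:.0f} tokens/s per GPU")
print(f"Phi-3 Medium: {medium_gpu_days} GPU-days, ~{medium_rate:.0f} tokens/s per GPU")
```

Under these assumptions the two runs consumed a similar number of GPU-days despite the twofold difference in parameter count.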
Phi-3 Medium reports MMLU (5-shot) of 78.0% in the technical report (76.6% on the model card protocol), GSM8K (8-shot CoT) of 87.5%, and an MT-bench score of 8.9 — the highest in the initial Phi-3 release.[^1][^16] It is available in both 4K and 128K context variants. As with the other variants, post-training used SFT followed by DPO.
Phi-3 Vision, also released on May 21, 2024, was the first multimodal model in the Phi family.[^3][^17] It has 4.2 billion parameters and combines two components: an image encoder (based on CLIP ViT-L/14) and the Phi-3-mini-128K language model, connected via a trainable projection that maps image embeddings into the language model's input space.[^17]
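A minimal sketch of the projection step is shown below, assuming a single linear layer and random weights. The actual Phi-3 Vision projector may be more elaborate; the 1,024-wide patch features reflect the CLIP ViT-L/14 encoder width, the 3,072 target width is the Phi-3 Mini hidden size quoted earlier, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

VISION_DIM = 1024  # CLIP ViT-L/14 patch feature width
LM_DIM = 3072      # Phi-3 Mini hidden size

# Trainable projection parameters (randomly initialised here for illustration).
W_proj = rng.standard_normal((VISION_DIM, LM_DIM)).astype(np.float32) * 0.02
b_proj = np.zeros(LM_DIM, dtype=np.float32)


def project_image_features(patch_feats: np.ndarray) -> np.ndarray:
    """Map [n_patches, VISION_DIM] image features into the LM embedding space."""
    return patch_feats @ W_proj + b_proj


patch_feats = rng.standard_normal((4, VISION_DIM)).astype(np.float32)  # 4 toy patches
text_embeds = rng.standard_normal((3, LM_DIM)).astype(np.float32)      # 3 toy text tokens

image_tokens = project_image_features(patch_feats)
# The language model then consumes image and text embeddings as one sequence.
fused = np.concatenate([image_tokens, text_embeds], axis=0)
print(fused.shape)
```

Once projected, image patches are indistinguishable in shape from text token embeddings, which is what allows interleaved image and text input to share one context window.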
The model accepts interleaved text and image inputs and supports the full 128,000-token context window of its language backbone, making it suitable for long documents with embedded images. Training used 500 billion vision-and-text tokens over approximately 1.5 days on 512 H100-80GB GPUs between February and April 2024, with a data cutoff of March 15, 2024.[^17]
Phi-3 Vision's reported benchmark scores include ScienceQA at 90.8%, ChartQA at 81.4%, MMBench at 80.5%, TextVQA at 70.9%, and MMMU at 40.4%.[^17] On chart and table understanding in particular, the model performs strongly relative to other open multimodal models of similar or larger size, and Microsoft positioned it for enterprise document processing involving structured imagery (financial reports, scientific figures, scanned forms).
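The training corpus for the Phi-3 family draws from three categories of data, broadly described in the technical report and model cards: heavily filtered publicly available web data, newly created synthetic "textbook-like" data, and high-quality chat-format supervised data.[^1]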
Training proceeded in two phases. Phase one covered broad general knowledge across all data sources. Phase two emphasized more heavily filtered web data targeting logical reasoning and specialized skills, increasing the share of reasoning-dense and synthetic textbook material.[^1] The report describes this regime as "data-optimal" — meaning the training-token budget was deliberately allocated toward curating the best possible tokens, rather than maximizing token count at fixed compute. Sebastien Bubeck, a Microsoft Research lead on the program, characterized the approach by asking, "Instead of training on just raw web data, why don't you look for data which is of extremely high quality?"[^2]
Post-training used a two-stage pipeline of SFT followed by DPO for Phi-3 Mini, Small, and Medium. Phi-3.5 models later moved to a three-stage SFT / [[ppo|PPO]] / DPO pipeline.[^4][^18]
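For readers unfamiliar with DPO, the per-example objective can be written in a few lines. The sketch below is the standard DPO loss for a single preference pair, with summed log-probabilities passed in as plain floats; the `beta` value is a hypothetical choice and nothing here reflects Microsoft's actual training configuration.

```python
import math


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))


# Policy identical to the reference: no preference learned yet.
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen answer's log-probability: loss falls.
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
print(baseline, improved)
```

The loss starts at log 2 when the policy matches the reference and decreases as the policy widens the gap between preferred and rejected responses, which is why no separate reward model is needed at this stage.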
The 128K context variants of every Phi-3 and Phi-3.5 model use [[longrope|LongRoPE]], a position-embedding rescaling technique developed by Microsoft Research and published at ICML 2024 (Ding et al., arXiv:2402.13753).[^19] Standard rotary position embeddings (RoPE) lose effectiveness when sequences extend beyond the lengths seen during training because the embedding's frequency components are calibrated to the training context. LongRoPE applies non-uniform rescaling factors per RoPE dimension and per position range, identified by an evolutionary search algorithm. After long-context fine-tuning, a final short-context re-adjustment at 8K preserves performance on short sequences.[^19] On the RULER benchmark, Phi-3 Mini 128K averages 84.6 across context lengths from 4K to 128K, with 65.6 at the full 128K — substantially above the pre-LongRoPE baseline.[^1]
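The core mechanism, rescaling each RoPE frequency by its own factor, can be sketched as follows. The factors used here are a hypothetical linear ramp rather than the values LongRoPE finds by evolutionary search, and the sketch omits the per-position-range treatment described above.

```python
import numpy as np


def rope_angles(positions, head_dim: int, base: float = 10000.0, rescale=None) -> np.ndarray:
    """Rotation angles of shape [len(positions), head_dim // 2]."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    if rescale is not None:
        # A factor above 1 stretches that dimension's wavelength.
        inv_freq = inv_freq / np.asarray(rescale, dtype=float)
    return np.outer(positions, inv_freq)


head_dim = 96
positions = np.arange(8)

# Hypothetical non-uniform factors: high-frequency dimensions are left almost
# untouched, low-frequency dimensions are stretched the most.
factors = np.linspace(1.0, 32.0, head_dim // 2)

plain = rope_angles(positions, head_dim)
scaled = rope_angles(positions, head_dim, rescale=factors)

# With one uniform factor the scheme reduces to plain position interpolation.
uniform = rope_angles(positions, head_dim, rescale=np.full(head_dim // 2, 4.0))
interpolated = rope_angles(positions / 4.0, head_dim)
print(np.allclose(uniform, interpolated))
```

The final check shows why non-uniform factors are a generalisation: a single shared factor is equivalent to compressing positions, whereas per-dimension factors let different frequency bands be stretched by different amounts.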
Phi-3 Mini, Phi-3 Medium, and Phi-3.5-Mini share a standard autoregressive Transformer decoder block with pre-normalization using RMSNorm, rotary positional embeddings, and a SwiGLU activation in the feedforward layer, a configuration broadly consistent with the Llama-2 block.[^1] Phi-3 Mini and Phi-3 Medium share the same 32,064-token vocabulary, allowing tooling reuse across the family.
Phi-3 Small diverges in three respects: (1) the tiktoken tokenizer with a 100,352-token vocabulary for stronger multilingual coverage; (2) [[gqa|grouped-query attention]] with 4 query heads per key-value head, reducing KV-cache memory; and (3) the alternating dense + blocksparse attention scheme, where each blocksparse head enforces a distinct sparsity pattern such that across heads every token is covered while per-layer compute remains tractable at 128K context.[^1]
Phi-3.5-MoE replaces the dense feedforward sublayers with a [[mixture_of_experts|Mixture-of-Experts]] module. A learned top-k gating function routes each token to two of sixteen GLU-feedforward experts; attention layers remain dense. Each expert is parameterized at the scale of Phi-3 Mini's feedforward block (3.8B nominal), with total parameters of approximately 42 billion and 6.6 billion active per token.[^18]
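A toy version of top-2 routing over sixteen gated feedforward experts is sketched below. It uses conventional softmax gating over the selected experts, SiLU as the gate activation, and tiny random weights; these are assumptions for illustration, and the production router and its training procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; only N_EXPERTS = 16 and TOP_K = 2 come from the text.
D_MODEL, D_FF, N_EXPERTS, TOP_K = 8, 16, 16, 2

W_router = rng.standard_normal((D_MODEL, N_EXPERTS))
W_gate = rng.standard_normal((N_EXPERTS, D_MODEL, D_FF)) * 0.1
W_up = rng.standard_normal((N_EXPERTS, D_MODEL, D_FF)) * 0.1
W_down = rng.standard_normal((N_EXPERTS, D_FF, D_MODEL)) * 0.1


def silu(z: np.ndarray) -> np.ndarray:
    return z / (1.0 + np.exp(-z))


def moe_forward(x: np.ndarray):
    """Route one token vector x of shape [D_MODEL] through its top-2 experts."""
    logits = x @ W_router
    top = np.argsort(logits)[-TOP_K:]                  # indices of the two best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                           # softmax over selected experts only
    out = np.zeros(D_MODEL)
    for w, e in zip(weights, top):
        hidden = silu(x @ W_gate[e]) * (x @ W_up[e])   # gated (GLU-style) feedforward
        out += w * (hidden @ W_down[e])
    return out, top, weights


token = rng.standard_normal(D_MODEL)
out, chosen, weights = moe_forward(token)
print("experts used:", sorted(chosen.tolist()), "of", N_EXPERTS)
```

Only two of the sixteen expert blocks run for any given token, which is how the total parameter count can be several times larger than the number of parameters active per token.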
Across all variants, 128K context is implemented via LongRoPE, as described above.
The table below summarizes key benchmark results for the Phi-3 and Phi-3.5 language models, including the Mixture-of-Experts variant. Scores are drawn from the Phi-3 Technical Report and the official Hugging Face model cards; protocols match each benchmark's standard configuration (5-shot for MMLU, 8-shot CoT for GSM8K, 0-shot for HumanEval).[^1][^12][^15][^16][^18][^20]
| Benchmark | Phi-3 Mini (3.8B) | Phi-3 Small (7B) | Phi-3 Medium (14B) | Phi-3.5 Mini (3.8B) | Phi-3.5 MoE (6.6B active) |
|---|---|---|---|---|---|
| MMLU (5-shot) | 69.7 | 75.5 | 78.0 | 69.0 | 78.9 |
| GSM8K (8-shot CoT) | 85.3 | 87.3 | 87.5 | 86.2 | 88.7 |
| HumanEval (0-shot) | 60.4 | 59.1 | 58.5 | 62.8 | 70.7 |
| MT-bench | 8.38 | 8.70 | 8.90 | -- | -- |
| ARC Challenge (10-shot) | 85.5 | 90.8 | 91.0 | -- | -- |
| BigBench Hard (3-shot) | -- | 77.6 | 77.9 | -- | 79.1 |
| MATH (0-shot CoT) | -- | -- | -- | -- | 59.5 |
| Multilingual MMLU | 51.08 | 62.6 | -- | 55.4 | 69.9 |
| RULER avg (to 128K) | 84.6 | -- | -- | 84.1 | 87.1 |
For comparison against contemporaneous open models, the Phi-3 Mini MMLU score of 69.7 exceeds Llama 3 8B (66.5) and Mistral 7B (61.7) despite the model having roughly half the parameters of either. On math reasoning, Phi-3 Mini's 85.3% on GSM8K is 38.9 points above Mistral 7B (46.4%) and 7.9 points above Llama 3 8B (77.4%).[^1][^12]
Phi-3 Vision benchmarks (ScienceQA 90.8, ChartQA 81.4, MMBench 80.5, TextVQA 70.9, MMMU 40.4) place it competitively against larger multimodal systems such as Claude 3 Haiku and Gemini 1.0 Pro on chart and document understanding tasks.[^17]
A noted weakness across the family is factual recall (e.g., TriviaQA), reflecting the trade-off inherent in training small models on synthetic data rather than raw web crawls; the technical report explicitly recommends [[rag|retrieval-augmented generation]] for applications requiring broad factual knowledge.[^1]
While Phi-3 models scored well on academic benchmarks, independent human-preference evaluations such as Chatbot Arena initially placed Phi-3 Mini somewhat below where its benchmark scores would suggest. Phi-3 Mini received an Elo rating in the range of competitive 7B models (broadly comparable to Mistral 7B Instruct but below Mixtral 8x7B and frontier-class systems), leading parts of the research community to argue that synthetic-data-heavy training optimized for academic benchmarks does not always transfer fully to conversational quality as judged by real users.[^1] Microsoft's three-stage post-training pipeline introduced with Phi-3.5 (adding PPO between SFT and DPO) was motivated in part by these gaps, and the Phi-3.5 family showed measurable improvements on multi-turn conversation quality and longer-context retrieval tasks.[^20]
On August 21, 2024, Microsoft released a second generation of models under the Phi-3.5 label, addressing several limitations of the original release — narrow English focus, single-image vision only, and absence of a Mixture-of-Experts variant.[^4][^21] All three Phi-3.5 models are released under the MIT License and support a 128K context window.
Phi-3.5-Mini is an updated 3.8-billion-parameter dense Transformer that retains the same architecture as Phi-3 Mini but is trained on a refreshed corpus with substantially expanded multilingual coverage.[^20] Training ran from June to August 2024 on 512 H100-80GB GPUs over approximately 10 days, consuming 3.4 trillion tokens. The model explicitly supports 23 languages: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, and Ukrainian.[^20]
Post-training switched to a three-stage SFT / PPO / DPO pipeline. Multilingual MMLU improved to 55.4 versus 51.08 for the original Phi-3 Mini.[^20]
Phi-3.5-MoE is the first Mixture-of-Experts model in the Phi lineage and the most capable model in the Phi-3.5 family. It consists of 16 expert networks (each roughly the scale of a Phi-3 Mini feedforward block) for a total of approximately 42 billion parameters, with a learned top-2 router selecting 2 experts per token, resulting in 6.6 billion active parameters per forward pass.[^18]
Training used 4.9 trillion tokens (10% multilingual) and ran from April to August 2024 on 512 H100-80GB GPUs over 23 days. Like Phi-3.5-Mini, post-training used the three-stage SFT / PPO / DPO pipeline and the model supports the same 23 languages.[^18] On Microsoft's internal aggregated evaluation across 80 benchmarks, Phi-3.5-MoE scored 69.2, above Gemini 1.5 Flash (68.5), Llama 3.1 8B (61.0), Gemma 2 9B (63.3), and Mistral-Nemo-12B (61.3), while remaining below GPT-4o-mini (74.9).[^4]
Phi-3.5-Vision extends multimodal capability from single-image inputs to multi-frame and short-video understanding.[^4] It targets use cases including detailed image comparison, multi-image summarization, and short-clip video summarization, retaining the 128K context window of its language backbone. Phi-3.5-Vision uses the MIT License and was made available simultaneously on Hugging Face and Azure AI.
All Phi-3 and Phi-3.5 models are released under the [[mit_license|MIT License]], one of the most permissive open-source licenses.[^5] The MIT License allows use, modification, distribution, and commercial deployment without royalty obligation, subject only to preservation of the copyright notice. Notably, MIT imposes no use-case restrictions, distinguishing Phi-3 from models released under bespoke "community" licenses such as the Meta Llama 3 Community License, which carries an acceptable-use policy and restrictions for organizations with very large user bases.[^22]
This licensing choice is consistent with Microsoft's positioning of Phi-3 as a deployable family suitable for downstream fine-tuning, on-device packaging, and commercial integration. Open weights are distributed via Hugging Face and the Azure AI Model Catalog.[^2]
A central design objective for the Phi-3 family, and Phi-3 Mini in particular, was [[on_device_ai|on-device deployment]] without cloud connectivity, requiring that models fit within the memory and compute envelope of consumer hardware while preserving useful capability.
Microsoft worked with the [[onnx_runtime|ONNX Runtime]] team to produce optimized ONNX-format builds of Phi-3 models. These ONNX builds support multiple quantization formats, most notably INT4 block quantization using [[awq|Activation-aware Weight Quantization]] (AWQ), which uses activation statistics to identify the roughly 1% of weights most critical to model accuracy and protects them through per-channel scaling before the weights are quantized to 4-bit.[^11] This brings the on-disk size of Phi-3 Mini to approximately 1.8 GB.
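The block-quantization part of this scheme can be illustrated with a small round-to-nearest example. The sketch below is plain symmetric per-block INT4 quantization and deliberately omits AWQ's activation-aware scaling; the block size of 32 and the function names are hypothetical and are not taken from the ONNX Runtime tooling.

```python
import numpy as np


def quantize_int4_blocks(w: np.ndarray, block_size: int = 32):
    """Symmetric per-block 4-bit quantization of a 1-D weight vector."""
    blocks = w.reshape(-1, block_size)
    # One scale per block, mapping the largest magnitude in the block to 7.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                 # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales


def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from 4-bit codes and block scales."""
    return (q.astype(np.float32) * scales).reshape(-1)


rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

q, scales = quantize_int4_blocks(w)
w_hat = dequantize(q, scales)
max_err = float(np.abs(w - w_hat).max())
print(f"max abs reconstruction error: {max_err:.4f}")
```

Each weight is stored as a 4-bit code plus a shared per-block scale, so the rounding error for any weight is bounded by half of its block's scale. Methods such as AWQ aim to reduce the impact of that error on the weights that matter most.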
ONNX Runtime Mobile supports deployment across CPU, GPU, and neural processing unit (NPU) backends. On iOS, CoreML can be used; on Android, the NNAPI backend provides hardware acceleration. Microsoft published reference chatbot applications for both iOS and Android demonstrating offline inference, streaming token generation, and automatic model retrieval from Hugging Face. On an iPhone 14 with an A16 Bionic, the INT4 Phi-3 Mini runs at over 12 tokens per second fully on-device.[^1][^11]
For Windows deployments, ONNX Runtime's DirectML backend enables GPU acceleration on consumer hardware, including gaming-class GPUs not traditionally used for ML inference. Phi-3 Small and Phi-3 Medium were subsequently added to this pipeline. The combination of MIT licensing, small footprints, and official ONNX support has made the Phi-3 family one of the most production-ready open SLM families for edge scenarios.
Microsoft positioned the Phi-3 family for several distinct deployment scenarios based on size, latency profile, and capability:[^14]
Phi-3 has since been succeeded by the [[phi_4|Phi-4]] generation, which retained the data-quality philosophy while increasing scale and adding multimodality.
The Phi-4 line therefore represents a continuation rather than a break: same MIT licensing, same Hugging Face / Azure AI distribution model, same on-device positioning, but with refined post-training (DPO, RLHF), larger and more diverse synthetic data, and explicit multimodality.
Microsoft's technical report and the model cards acknowledge several limitations of the Phi-3 family:[^1]
Factual knowledge capacity. Because the models are trained on fewer tokens than frontier systems, and because the training data prioritizes reasoning structure over raw factual breadth, Phi-3 models store less world knowledge per parameter than comparably sized models trained on diverse web crawls. TriviaQA performance is correspondingly lower than for models such as Mixtral 8×7B. Microsoft explicitly recommends [[rag|retrieval-augmented generation]] for applications requiring broad factual recall.[^1]
Language coverage. The original Phi-3 models were primarily English-trained, and performance on non-English languages was substantially weaker than on English tasks. Phi-3.5 partially addressed this with dedicated multilingual data and explicit support for 23 languages, but English remained the strongest-supported language.[^20]
Hallucination. As with all language models, Phi-3 can produce plausible-sounding but factually incorrect output, an effect amplified by the reduced factual capacity noted above.
Safety and adversarial robustness. Internal evaluations showed that despite safety post-training, Phi-3 Mini could be induced to produce harmful content under adversarial prompting, with a measurable residual jailbreak rate.[^1]
Code generation breadth. Fine-tuning data for code generation was concentrated on Python; performance on other programming languages is weaker.[^1]
The Phi-3 family was, at release, the most prominent public demonstration that small models trained on aggressively curated and synthetic data could compete with much larger systems on standard benchmarks while remaining deployable on consumer hardware. Phi-3 Mini's on-device demonstrations on smartphones, and the MIT licensing of all variants, made it a reference point for the [[on_device_ai|on-device AI]] discourse through 2024 and into 2025.[^1][^2]
The family also reinforced a broader research trend toward [[synthetic_data|synthetic training data]]. By demonstrating that the "Textbooks Are All You Need" approach scaled from Phi-1's narrow code domain to Phi-3's broad general-purpose family, the work fed into industry-wide adoption of synthetic-data pipelines, ultimately informing the design of Phi-4 and influencing approaches taken by other labs.[^6]
As of 2026, Phi-3 itself is considered superseded by Phi-4 and Phi-4-Mini for new deployments, but the older models remain widely used in production where compatibility, fine-tuning ecosystem maturity, or simpler runtime requirements favor the prior generation. Hugging Face download statistics and community fine-tunes of Phi-3 variants remain among the most active for small open models.[^12]