# Phi-3

> Source: https://aiwiki.ai/wiki/phi_3
> Updated: 2026-06-21
> Categories: AI Models, Large Language Models, Microsoft, Open Source AI, Small Language Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

<p><b>Phi-3</b> is a family of small, efficient open-weight language models developed by [Microsoft](/wiki/microsoft) and first released on April 23, 2024, under the permissive [MIT License](/wiki/mit_license). Its founding model, <b>Phi-3 Mini</b>, packs 3.8 billion parameters and scores about 69% on [MMLU](/wiki/mmlu), matching far larger systems while fitting in roughly 1.8 GB when 4-bit quantized and running at over 12 tokens per second fully offline on an iPhone 14.[^1][^2] The family was designed to demonstrate that aggressive data-quality curation, combined with a continuation of [Microsoft Research](/wiki/microsoft_research)'s "textbooks are all you need" philosophy, could produce [small language models](/wiki/small_language_model) (SLMs) competitive with much larger systems and capable of running locally on consumer hardware, including smartphones.[^1][^2] The initial release centered on Phi-3 Mini, a dense [Transformer](/wiki/transformer) offered in 4K and 128K context variants. Subsequent releases in May 2024 added <b>Phi-3 Small</b> (7B), <b>Phi-3 Medium</b> (14B), and the multimodal <b>Phi-3 Vision</b> (4.2B).[^3] An updated <b>Phi-3.5</b> sub-family followed in August 2024, comprising Phi-3.5-Mini, the first [Mixture-of-Experts](/wiki/mixture_of_experts) model in the lineage (Phi-3.5-MoE), and a refreshed Phi-3.5-Vision.[^4] All Phi-3 and Phi-3.5 models ship under the MIT License with open weights distributed through [Hugging Face](/wiki/hugging_face) and the Azure AI Model Catalog.[^5]</p>

<p>The accompanying paper, <i>Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone</i> (arXiv:2404.14219), opens by stating that Phi-3 Mini "rivals that of models such as Mixtral 8x7B and GPT-3.5" despite being "small enough to be deployed on a phone," and reports that a 4-bit quantized Phi-3 Mini occupies approximately 1.8 GB and runs at over 12 tokens per second on an iPhone 14 with an Apple A16 Bionic, marking one of the first widely-publicized demonstrations of a frontier-quality SLM operating entirely on-device.[^1] The Phi-3 family has since been succeeded by [Phi-4](/wiki/phi_4) (14B, December 2024), [Phi-4-mini](/wiki/phi_4_mini) (3.8B, February 2025), and Phi-4-Multimodal (5.6B, February 2025), but remains widely deployed for [edge AI](/wiki/edge_ai) and cost-sensitive inference workloads.[^6][^7]</p>

## What is the background of Phi-3?

<p>Phi-3 is the fourth named generation in a research lineage that began at [Microsoft Research](/wiki/microsoft_research) in 2023. The earlier models established the methodological premise that would carry through to Phi-3: that small models trained on tightly curated, "textbook-quality" data can outperform substantially larger models trained on raw web text.</p>

<p><b>Phi-1</b>, released in mid-2023, was a 1.3-billion-parameter model focused on Python coding. Its accompanying paper, "[Textbooks Are All You Need](/wiki/textbooks_are_all_you_need)" (Gunasekar et al., 2023, arXiv:2306.11644), argued that the composition and clarity of training data, not raw volume, was the dominant factor in determining capability per parameter for code generation.[^8] Phi-1 achieved competitive [HumanEval](/wiki/humaneval) scores against models several times its size, reporting approximately 50.6% pass@1 on HumanEval at 1.3 billion parameters trained on roughly 7 billion tokens of filtered web data and synthetic textbook content.[^8]</p>

<p><b>Phi-1.5</b> followed later in 2023, also at 1.3 billion parameters, extending the approach to common-sense reasoning and natural-language tasks.[^9] The Phi-1.5 paper, "Textbooks Are All You Need II" (Li et al., arXiv:2309.05463), reported that the model matched or exceeded models five times its size on benchmarks measuring common sense reasoning and basic world knowledge.[^9] <b>[Phi-2](/wiki/phi_2)</b>, released in December 2023 at 2.7 billion parameters, demonstrated that the textbook-data approach scaled and that knowledge distillation from a smaller checkpoint (Phi-1.5) could be combined with synthetic data generation to produce a model competitive on reasoning benchmarks with systems up to 25 times larger.[^10]</p>

<p>The intellectual origin of the program was significantly shaped by Ronen Eldan, a Microsoft Research mathematician who, by his own account, was inspired by watching how his young daughter learned language from a relatively narrow but high-quality vocabulary rather than from arbitrary text exposure.[^2] Eldan and colleagues created TinyStories, a dataset of millions of short children's narratives synthesized using only a vocabulary of roughly 3,000 foundational words, demonstrating that very small models (under 10 million parameters) could nonetheless produce coherent multi-paragraph English. The TinyStories result informed the broader thesis that curated data composition was the dominant lever for small-model capability, which became the foundation of the Phi line.[^2]</p>

<p>Phi-3 represented both a scaling and a productization of this line of work. Where Phi-1 through Phi-2 had been principally research artifacts demonstrating the data-quality thesis, Phi-3 was conceived as a deployable family spanning multiple sizes, context lengths, and modalities. It was released on day one through Azure AI Studio, [Hugging Face](/wiki/hugging_face), [NVIDIA NIM](/wiki/nvidia_nim) microservices, and [Ollama](/wiki/ollama), with optimized [ONNX](/wiki/onnx)-runtime variants for cross-platform on-device deployment.[^2][^11]</p>

<p>The Phi research lineage originated within Microsoft Research's Machine Learning Foundations group, with Sebastien Bubeck leading the program as Vice President of Generative AI Research at Microsoft. Bubeck had spent roughly a decade at Microsoft Research before the Phi work, where he was known for theoretical contributions to convex optimization and bandit problems prior to pivoting toward generative AI.[^12] In October 2024, Bubeck announced his departure from Microsoft to join [OpenAI](/wiki/openai), where he was expected to continue work on efficient and small-model methods.[^12] The Phi program continued at Microsoft after his departure, producing the Phi-3.5 family in August 2024 and the [Phi-4](/wiki/phi_4) line beginning in December 2024.</p>

<p>The work formed part of a broader strategic bet by [Microsoft](/wiki/microsoft) that small, efficient models, deployable on consumer hardware, finetuneable for narrow domains, and cheap to serve at scale, would constitute an important complementary tier to frontier-scale systems such as those produced by [OpenAI](/wiki/openai), with which Microsoft maintains a substantial commercial and infrastructure partnership.[^2] Where frontier models target maximum capability per query, the Phi line targets maximum capability per parameter and per watt, intended for production scenarios where latency, cost, and privacy constraints make cloud-only frontier inference impractical.</p>

## What is Phi-3 Mini (3.8B, April 2024)?

<p>Phi-3 Mini was the founding model of the Phi-3 family and the only variant available at the April 23, 2024 announcement. It is a dense, decoder-only [Transformer](/wiki/transformer) with 3.8 billion parameters arranged in 32 layers, with 32 attention heads and a hidden dimension of 3,072.[^1] Its vocabulary contains 32,064 tokens and uses a tokenizer compatible with the [Llama 2](/wiki/llama_2) format, allowing weights to be loaded by existing Llama-2 tooling.[^13]</p>

<p>Phi-3 Mini was released in two context-length variants from the start:</p>

<ul>
<li><b>Phi-3-mini-4k-instruct:</b> Native 4,096-token context window. Suitable for short-form interactive tasks, on-device assistants, and latency-sensitive applications.[^13]</li>
<li><b>Phi-3-mini-128k-instruct:</b> Extended 128,000-token context window achieved via the [LongRoPE](/wiki/longrope) rescaling technique. Suitable for long-document analysis, agent traces, and retrieval-augmented generation workflows.[^14]</li>
</ul>

<p>The base model was trained on 3.3 trillion tokens drawn from a heavily filtered web corpus and a large body of synthetic data, a budget characterized in the technical report as "data-optimal" rather than [compute-optimal](/wiki/chinchilla_scaling), emphasizing per-token quality over total volume.[^1] Post-training combined [supervised fine-tuning](/wiki/supervised_fine-tuning) (SFT) on high-quality instruction and chat data with [Direct Preference Optimization](/wiki/direct_preference_optimization_dpo) (DPO) for alignment.[^1] The data cutoff for the base model is October 2023. A June 2024 update to the model card noted substantial gains on instruction following and structured output through additional post-training data, with metrics such as JSON-structure-output rising from 11.5 to 52.3 on Microsoft's internal evaluation.[^13]</p>

<p>The Phi-3 Mini paper reports [MMLU](/wiki/mmlu) (5-shot) of approximately 69%, [MT-Bench](/wiki/mt_bench) of 8.38, and HumanEval (0-shot) of 60.4, all measured against contemporaneous open models including [Llama 3 8B](/wiki/llama_3), [Mistral 7B](/wiki/mistral_7b), and [Gemma 7B](/wiki/gemma).[^1] Updated Hugging Face card numbers list MMLU at 70.9 (5-shot) and GSM8K chain-of-thought at 85.7 (8-shot), with the model nominally trailing GPT-3.5 by roughly 2.8 points on an aggregate of 21 benchmarks (67.6 vs 70.4).[^13]</p>

<p>Phi-3 Mini's defining demonstration was on-device inference: a 4-bit quantized version using [Activation-aware Weight Quantization (AWQ)](/wiki/awq) occupies approximately 1.8 GB of storage and runs at over 12 tokens per second on an iPhone 14 with the [Apple A16 Bionic](/wiki/apple_silicon), fully offline.[^1][^15]</p>

## Phi-3 Small (7B) and Phi-3 Medium (14B)

<p>Two larger Phi-3 variants were released on May 21, 2024, expanding the family upward while retaining the same data philosophy and the option of 128K context windows.[^3]</p>

### Phi-3 Small (7B)

<p>Phi-3 Small is a 7-billion-parameter dense [Transformer](/wiki/transformer) with several architectural changes relative to Phi-3 Mini. It uses the tiktoken tokenizer with a 100,352-token vocabulary, providing substantially better coverage of non-English scripts and improving tokenization efficiency for multilingual content.[^16] Its attention mechanism uses [grouped-query attention](/wiki/gqa) (GQA) with four query heads sharing each key-value head, reducing the [KV-cache](/wiki/kv_cache) memory footprint at inference.</p>

<p>A second notable architectural choice in Phi-3 Small is its alternating dense-and-blocksparse attention pattern. Layers alternate between standard full-context attention and a novel block-sparse mechanism in which each attention head enforces a distinct sparsity pattern over the KV cache. This ensures that across the set of heads, every token position is attended to, while substantially lowering memory and compute relative to a fully dense implementation at long sequence lengths.[^1] The architecture comprises 32 layers, 32 attention heads, and a hidden dimension of 4,096.[^1]</p>

<p>Phi-3 Small was trained on 4.8 trillion tokens over 18 days using 1,024 H100-80GB GPUs, with roughly 10% of the corpus drawn from multilingual sources.[^16] Both 8K and 128K context variants are released. The model reports MMLU of 75.5% and MT-Bench of 8.70, placing it between Mistral 7B and frontier-scale systems on standard benchmarks.[^1]</p>

### Phi-3 Medium (14B)

<p>Phi-3 Medium is a 14-billion-parameter dense [Transformer](/wiki/transformer) with 40 layers, 40 attention heads, and an embedding dimension of 5,120.[^17] It shares the same 32,064-token vocabulary and tokenizer format as Phi-3 Mini. Training ran from February to April 2024 on 512 H100-80GB GPUs over 42 days, consuming 4.8 trillion tokens from the same curated corpus as Phi-3 Small.[^17] The base model has an October 2023 data cutoff.</p>

<p>Phi-3 Medium reports MMLU (5-shot) of 78.0% in the technical report (76.6% on the model card protocol), [GSM8K](/wiki/gsm8k) (8-shot CoT) of 87.5%, MBPP (3-shot) of 73.8%, and an MT-Bench score of 8.9, the highest in the initial Phi-3 release.[^1][^17] On Microsoft's average of 21 benchmarks the model scores 77.3%, with category-level breakdowns of 83.2% on reasoning, 75.3% on language understanding, 64.2% on code, 52.9% on math, and 47.5% on factual knowledge.[^17] It is available in both 4K and 128K context variants. As with the other variants, post-training used SFT followed by DPO.</p>

## What is Phi-3 Vision (4.2B multimodal)?

<p>Phi-3 Vision, also released on May 21, 2024, was the first multimodal model in the Phi family.[^3][^18] It has 4.2 billion parameters and combines two components: an image encoder based on the [CLIP](/wiki/clip) ViT-L/14 model and the Phi-3-mini-128K language model, connected via a trainable projection (a multi-layer perceptron) that maps image embeddings into the language model's input space.[^18]</p>

<p>The model accepts interleaved text and image inputs and supports the full 128,000-token context window of its language backbone, making it suitable for long documents with embedded images. Training used 500 billion vision-and-text tokens over approximately 1.5 days on 512 H100-80GB GPUs between February and April 2024, with a data cutoff of March 15, 2024.[^18] Training data composition was reported to include publicly available documents, high-quality educational data and code, interleaved image-text data, synthetic "textbook-like" content, newly created image data covering charts, tables, diagrams, and slides, and high-quality chat-format supervised data.[^18]</p>

<p>Phi-3 Vision's reported benchmark scores include ScienceQA at 90.8%, ChartQA at 81.4%, MMBench at 80.5%, TextVQA at 70.9%, and [MMMU](/wiki/mmmu) at 40.4%.[^18] On chart and table understanding in particular, the model performs strongly relative to other open multimodal models of similar or larger size, and Microsoft positioned it for enterprise document processing involving structured imagery such as financial reports, scientific figures, and scanned forms.[^18]</p>

## How was Phi-3 trained?

<p>The training corpus for the Phi-3 family draws from three categories of data, broadly described in the technical report and model cards:[^1]</p>

<ol>
<li><b>Heavily filtered public web data.</b> Documents from web crawls were evaluated for "educational level," logical structure, factual density, and clarity of exposition, with the vast majority discarded as below threshold. The retained subset emphasizes content an informed reader would find genuinely instructive.[^1]</li>
<li><b>LLM-generated synthetic data.</b> Continuing the "Textbooks Are All You Need" tradition, large amounts of [synthetic data](/wiki/synthetic_data) were generated to teach mathematical reasoning, algorithmic thinking, programming, general science, and structured world knowledge. Microsoft's team manually reviewed batches of this synthetic output and filtered for coherence before inclusion.[^1]</li>
<li><b>Supervised instruction and chat data</b> covering a broad range of conversational topics, used during post-training to convert the base model into an instruction-following assistant.[^1]</li>
</ol>

<p>Training proceeded in two phases. Phase one covered broad general knowledge across all data sources. Phase two emphasized more heavily filtered web data targeting logical reasoning and specialized skills, increasing the share of reasoning-dense and synthetic textbook material.[^1] The report describes this regime as "data-optimal," meaning the training-token budget was deliberately allocated toward curating the best possible tokens, rather than maximizing token count at fixed compute. Sebastien Bubeck, a Microsoft Research lead on the program, characterized the approach by asking, "Instead of training on just raw web data, why don't you look for data which is of extremely high quality?"[^2]</p>

<p>The team has described examples of the synthetic-data pipeline: a frontier model is prompted to generate, for instance, a large set of multiplication problems with worked solutions; a smaller verification model (or a calculator-style checker) discards items whose answers are wrong; and the surviving filtered set, often a small fraction of the initial generation, is added to the corpus.[^2] The same logic was extended to mathematical reasoning, code synthesis, and structured-knowledge instruction. Bubeck noted that ChatGPT's known weaknesses at exact arithmetic did not prevent it from producing useful textbook-style math exercises once the outputs were checked, because the model's role in the pipeline was content generation rather than ground-truth provision.[^2]</p>

<p>Post-training used a two-stage pipeline of SFT followed by DPO for Phi-3 Mini, Small, and Medium. Phi-3.5 models later moved to a three-stage SFT/[PPO](/wiki/ppo)/DPO pipeline.[^4][^19]</p>

<p>The report explicitly notes that the training data composition emphasized reasoning over breadth of factual knowledge for small models. In one example given in the report, the result of a Premier League football match on a particular day might be valuable for frontier models but was excluded from Phi-3 Mini's corpus to leave more model capacity for general reasoning ability. The team chose more data from the Phase-2 corpus, dense in reasoning-relevant material, than from Phase-1.[^1]</p>

### How does Phi-3 reach 128K context with LongRoPE?

<p>The 128K context variants of every Phi-3 and Phi-3.5 model use [LongRoPE](/wiki/longrope), a position-embedding rescaling technique developed by Microsoft Research and posted to arXiv on February 21, 2024 (Ding et al., arXiv:2402.13753).[^20] Standard [rotary position embeddings](/wiki/rotary_position_embedding) (RoPE) lose effectiveness when sequences extend beyond the lengths seen during training because the embedding's frequency components are calibrated to the training context. LongRoPE applies non-uniform rescaling factors per RoPE dimension and per position range, identified by an evolutionary search algorithm. After long-context fine-tuning, a final short-context re-adjustment at 8K preserves performance on short sequences.[^20]</p>

<p>The progressive extension strategy in LongRoPE first fine-tunes to 256K context length, then applies secondary positional interpolation to reach lengths as long as 2,048K (2 million) tokens; the method requires only on the order of 1,000 fine-tuning steps within 256K training lengths.[^20] On the [RULER](/wiki/ruler_benchmark) benchmark, Phi-3 Mini 128K averages 84.6 across context lengths from 4K to 128K, with 65.6 at the full 128K, substantially above the pre-LongRoPE baseline.[^1] Phi-3.5-MoE achieves a RULER average of 87.1 with 64.2 at 128K context.[^19]</p>

## What is the architecture of Phi-3?

<p>All dense Phi-3 models (Mini, Medium, Phi-3.5-Mini) share a standard autoregressive [Transformer](/wiki/transformer) decoder block with pre-normalization using [RMSNorm](/wiki/rmsnorm), rotary positional embeddings, and a [SwiGLU](/wiki/swiglu) activation in the feedforward layer, a configuration broadly consistent with the [Llama 2](/wiki/llama_2) block.[^1] Phi-3 Mini and Phi-3 Medium share the same 32,064-token vocabulary, allowing tooling reuse across the family.</p>

<p>Phi-3 Small diverges in three respects: (1) the tiktoken tokenizer with a 100,352-token vocabulary for stronger multilingual coverage; (2) [grouped-query attention](/wiki/gqa) with 4 query heads per key-value head, reducing KV-cache memory; and (3) the alternating dense plus blocksparse attention scheme, where each blocksparse head enforces a distinct sparsity pattern such that across heads every token is covered while per-layer compute remains tractable at 128K context.[^1]</p>

<p>Phi-3.5-MoE replaces the dense feedforward sublayers with a [Mixture-of-Experts](/wiki/mixture_of_experts) module. A learned top-k gating function routes each token to two of sixteen GLU-feedforward experts; attention layers remain dense. Each expert is parameterized at the scale of Phi-3 Mini's feedforward block (3.8B nominal), with total parameters of approximately 42 billion and 6.6 billion active per token.[^19]</p>

<p>Across all variants, 128K context is implemented via LongRoPE, as described above.</p>

### Architectural comparison table

<table>
<thead>
<tr>
<th>Model</th>
<th>Params</th>
<th>Layers</th>
<th>Heads</th>
<th>Hidden dim</th>
<th>Vocab</th>
<th>Attention</th>
<th>Context</th>
</tr>
</thead>
<tbody>
<tr><td>Phi-3 Mini</td><td>3.8B</td><td>32</td><td>32</td><td>3,072</td><td>32,064 (Llama-2)</td><td>Dense MHA</td><td>4K / 128K</td></tr>
<tr><td>Phi-3 Small</td><td>7B</td><td>32</td><td>32</td><td>4,096</td><td>100,352 (tiktoken)</td><td>GQA + blocksparse</td><td>8K / 128K</td></tr>
<tr><td>Phi-3 Medium</td><td>14B</td><td>40</td><td>40</td><td>5,120</td><td>32,064 (Llama-2)</td><td>Dense MHA</td><td>4K / 128K</td></tr>
<tr><td>Phi-3 Vision</td><td>4.2B</td><td>32 (LM)</td><td>32 (LM)</td><td>3,072 (LM)</td><td>32,064</td><td>Dense + CLIP ViT-L/14</td><td>128K</td></tr>
<tr><td>Phi-3.5-Mini</td><td>3.8B</td><td>32</td><td>32</td><td>3,072</td><td>32,064</td><td>Dense MHA</td><td>128K</td></tr>
<tr><td>Phi-3.5-MoE</td><td>42B (6.6B active)</td><td>32</td><td>32</td><td>3,072</td><td>32,064</td><td>Dense + 16 MoE experts, top-2</td><td>128K</td></tr>
<tr><td>Phi-3.5-Vision</td><td>4.2B</td><td>32 (LM)</td><td>32 (LM)</td><td>3,072 (LM)</td><td>32,064</td><td>Dense + CLIP ViT-L/14</td><td>128K</td></tr>
</tbody>
</table>

## How does Phi-3 perform on benchmarks?

<p>The table below summarizes key benchmark results for the dense Phi-3 and Phi-3.5 language models. Scores are drawn from the Phi-3 Technical Report and the official Hugging Face model cards; protocols match each benchmark's standard configuration (5-shot for MMLU, 8-shot CoT for GSM8K, 0-shot for HumanEval).[^1][^13][^16][^17][^19][^21]</p>

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Phi-3 Mini (3.8B)</th>
<th>Phi-3 Small (7B)</th>
<th>Phi-3 Medium (14B)</th>
<th>Phi-3.5 Mini (3.8B)</th>
<th>Phi-3.5 MoE (6.6B active)</th>
</tr>
</thead>
<tbody>
<tr><td>MMLU (5-shot)</td><td>69.7</td><td>75.5</td><td>78.0</td><td>69.0</td><td>78.9</td></tr>
<tr><td>GSM8K (8-shot CoT)</td><td>85.3</td><td>87.3</td><td>87.5</td><td>86.2</td><td>88.7</td></tr>
<tr><td>HumanEval (0-shot)</td><td>60.4</td><td>59.1</td><td>58.5</td><td>62.8</td><td>70.7</td></tr>
<tr><td>MT-Bench</td><td>8.38</td><td>8.70</td><td>8.90</td><td>--</td><td>--</td></tr>
<tr><td>ARC Challenge (10-shot)</td><td>85.5</td><td>90.8</td><td>91.0</td><td>84.6</td><td>91.0</td></tr>
<tr><td>BigBench Hard (3-shot)</td><td>73.5</td><td>77.6</td><td>77.9</td><td>69.0</td><td>79.1</td></tr>
<tr><td>MBPP (3-shot)</td><td>71.7</td><td>69.6</td><td>73.8</td><td>69.6</td><td>80.8</td></tr>
<tr><td>Multilingual MMLU</td><td>51.08</td><td>62.6</td><td>--</td><td>55.4</td><td>69.9</td></tr>
<tr><td>RULER avg (to 128K)</td><td>84.6</td><td>--</td><td>--</td><td>84.1</td><td>87.1</td></tr>
</tbody>
</table>

<p>For comparison against contemporaneous open models, the Phi-3 Mini MMLU score of 69.7 exceeds [Llama 3](/wiki/llama_3) 8B (66.5) and [Mistral 7B](/wiki/mistral_7b) (61.7) despite being roughly half the size of either. On math reasoning, Phi-3 Mini at 85.3% GSM8K is dramatically above Mistral 7B (46.4%) and notably above Llama 3 8B (77.4%).[^1][^13]</p>

<p>Phi-3 Vision benchmarks (ScienceQA 90.8, ChartQA 81.4, MMBench 80.5, TextVQA 70.9, MMMU 40.4) place it competitively against larger multimodal systems such as Claude 3 Haiku and Gemini 1.0 Pro on chart and document understanding tasks.[^18]</p>

<p>A noted weakness across the family is factual recall (e.g., [TriviaQA](/wiki/triviaqa)), reflecting the trade-off inherent in training small models on synthetic data rather than raw web crawls; the technical report explicitly recommends [retrieval-augmented generation](/wiki/retrieval_augmented_generation_rag) for applications requiring broad factual knowledge.[^1]</p>

### Independent and arena-style evaluation

<p>While Phi-3 models scored well on academic benchmarks, independent human-preference evaluations such as Chatbot Arena initially placed Phi-3 Mini somewhat below its benchmark scores would suggest. Phi-3 Mini received an Elo rating in the range of competitive 7B models, broadly comparable to Mistral 7B Instruct but below [Mixtral](/wiki/mixtral) 8x7B and frontier-class systems, leading parts of the research community to argue that synthetic-data-heavy training optimized for academic benchmarks does not always fully transfer to conversational quality as judged by real users.[^1] Apple's WWDC 2024 disclosure also reported that its approximately 3-billion-parameter on-device Apple Foundation Model outperformed Phi-3 Mini, Mistral 7B, Gemma 7B, and Llama 3 8B on Apple's internal human-preference evaluations, though Apple did not publish full benchmark protocols.[^22]</p>

<p>Microsoft's three-stage post-training pipeline introduced with Phi-3.5 (adding PPO between SFT and DPO) was motivated in part by these gaps, and the Phi-3.5 family showed measurable improvements on multi-turn conversation quality and longer-context retrieval tasks.[^21] On Microsoft's internal aggregated evaluation across 80 benchmarks, Phi-3.5-MoE scored 69.2, above Gemini 1.5 Flash (68.5), Llama 3.1 8B (61.0), Gemma 2 9B (63.3), and Mistral-Nemo-12B (61.3), while remaining below GPT-4o-mini (74.9).[^4][^19]</p>

<p>Community discussion on Hugging Face raised separate concerns about how the model card initially presented benchmark comparisons against unspecified baseline configurations; Microsoft updated the model card with corrected charts after community feedback.[^23] These episodes were widely interpreted as an inevitable consequence of releasing a model family on a rapid cycle into a contested benchmark landscape rather than as substantive failures of the underlying training methodology.</p>

### How does Phi-3 compare with other small models?

<p>The table below compares Phi-3 variants with the three other open-weight model families most directly competitive in mid-2024.</p>

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>Phi-3 Mini 3.8B</th>
<th>Phi-3 Small 7B</th>
<th>Llama 3 8B Instruct</th>
<th>Mistral 7B Instruct</th>
<th>Gemma 7B Instruct</th>
<th>Mixtral 8x7B</th>
<th>GPT-3.5 Turbo</th>
</tr>
</thead>
<tbody>
<tr><td>MMLU (5-shot)</td><td>69.7</td><td>75.5</td><td>66.5</td><td>61.7</td><td>63.6</td><td>70.5</td><td>71.4</td></tr>
<tr><td>GSM8K (8-shot CoT)</td><td>85.3</td><td>87.3</td><td>77.4</td><td>46.4</td><td>59.8</td><td>64.7</td><td>78.1</td></tr>
<tr><td>HumanEval (0-shot)</td><td>60.4</td><td>59.1</td><td>60.4</td><td>28.0</td><td>34.1</td><td>37.8</td><td>62.2</td></tr>
<tr><td>BigBench Hard (3-shot)</td><td>73.5</td><td>77.6</td><td>51.5</td><td>57.3</td><td>59.6</td><td>69.7</td><td>68.3</td></tr>
<tr><td>Average</td><td>67.6</td><td>72.4</td><td>65.5</td><td>56.4</td><td>56.0</td><td>62.0</td><td>70.4</td></tr>
</tbody>
</table>

<p>Numbers reflect each model's reported scores under matching protocols where available; comparisons across distinct model cards always carry the risk of slightly different evaluation setups. The pattern is consistent across multiple independent reproductions: Phi-3 Mini at 3.8B competes with or exceeds 7B and 8B models on most benchmarks except factual-knowledge and certain conversational evaluations.[^1][^13]</p>

## What is the Phi-3.5 family (August 2024)?

<p>On August 21, 2024, Microsoft released a second generation of models under the Phi-3.5 label, addressing several limitations of the original release: narrow English focus, single-image vision only, and absence of a Mixture-of-Experts variant.[^4][^24] All three Phi-3.5 models are released under the MIT License and support a 128K context window.</p>

### Phi-3.5-Mini-Instruct (3.8B)

<p>Phi-3.5-Mini is an updated 3.8-billion-parameter dense [Transformer](/wiki/transformer) that retains the same architecture as Phi-3 Mini but is trained on a refreshed corpus with substantially expanded multilingual coverage.[^21] Training ran from June to August 2024 on 512 H100-80GB GPUs over approximately 10 days, consuming 3.4 trillion tokens. The model explicitly supports 22 languages: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, and Ukrainian.[^21]</p>

<p>Post-training switched to a three-stage SFT/PPO/DPO pipeline. Multilingual MMLU improved to 55.4 versus 51.08 for the original Phi-3 Mini. Arena-Hard scores rose from below 30 to 37, indicating clear gains on adversarial conversational evaluation.[^21] On reasoning-focused tasks, Phi-3.5-Mini reached GPQA at 30.4 (0-shot CoT), outperforming Llama 3.1 8B at 26.3 and Mistral 7B at 15.6 in the same configuration.[^21]</p>

### Phi-3.5-MoE-Instruct (16x3.8B)

<p>Phi-3.5-MoE is the first [Mixture-of-Experts](/wiki/mixture_of_experts) model in the Phi lineage and the most capable model in the Phi-3.5 family. It consists of 16 expert networks (each roughly the scale of a Phi-3 Mini feedforward block) for a total of approximately 42 billion parameters, with a learned top-2 router selecting 2 experts per token, resulting in 6.6 billion active parameters per forward pass.[^19]</p>

<p>Training used 4.9 trillion tokens (10% multilingual) and ran from April to August 2024 on 512 H100-80GB GPUs over 23 days. Like Phi-3.5-Mini, post-training used the three-stage SFT/PPO/DPO pipeline and the model supports the same 22 languages.[^19] On Microsoft's internal aggregated evaluation across 80 benchmarks, Phi-3.5-MoE scored 69.2, above Gemini 1.5 Flash (68.5), Llama 3.1 8B (61.0), Gemma 2 9B (63.3), and Mistral-Nemo-12B (61.3), while remaining below GPT-4o-mini (74.9).[^4]</p>

<p>Notable per-task scores for Phi-3.5-MoE include MATH (0-shot CoT) at 59.5, GSM8K at 88.7, HumanEval at 70.7, MBPP at 80.8, and multilingual MMLU at 69.9. On long-context RULER tasks, the model averages 87.1 across 4K to 128K context, with 64.2 at the full 128K, the highest among the Phi-3.5 family.[^19]</p>

### Phi-3.5-Vision-Instruct (4.2B)

<p>Phi-3.5-Vision extends multimodal capability from single-image inputs to multi-frame and short-video understanding.[^4][^25] It targets use cases including detailed image comparison, multi-image summarization, and short-clip video summarization, retaining the 128K context window of its language backbone. Phi-3.5-Vision uses the MIT License and was made available simultaneously on Hugging Face and Azure AI.</p>

## Is Phi-3 open source?

<p>Yes. All Phi-3 and Phi-3.5 models are released under the [MIT License](/wiki/mit_license), one of the most permissive open-source licenses.[^5] The MIT License allows use, modification, distribution, and commercial deployment without royalty obligation, subject only to preservation of the copyright notice. Notably, MIT imposes no use-case restrictions, distinguishing Phi-3 from models released under bespoke "community" licenses such as the Meta [Llama 3](/wiki/llama_3) Community License, which carries an acceptable-use policy and restrictions for organizations with very large user bases.[^26]</p>

<p>This licensing choice is consistent with Microsoft's positioning of Phi-3 as a deployable family suitable for downstream fine-tuning, on-device packaging, and commercial integration. Open weights are distributed via Hugging Face and the Azure AI Model Catalog.[^2] Microsoft also lists Phi-3 models in NVIDIA NIM microservices, which package an [ONNX](/wiki/onnx) or TensorRT-LLM engine plus an OpenAI-compatible HTTP server in a standard container image for deployment on any system with NVIDIA hardware.[^27]</p>

## Can Phi-3 run on-device and at the edge?

<p>A central design objective for the Phi-3 family, and Phi-3 Mini in particular, was [edge AI](/wiki/edge_ai) and on-device deployment without cloud connectivity, requiring that models fit within the memory and compute envelope of consumer hardware while preserving useful capability.</p>

<p>Microsoft worked with the [ONNX](/wiki/onnx) Runtime team to produce optimized ONNX-format builds of Phi-3 models. These ONNX builds support multiple quantization formats, most notably INT4 block quantization using [Activation-aware Weight Quantization](/wiki/awq) (AWQ), which preserves the 1% of weights most critical to model accuracy in higher precision while quantizing the rest to 4-bit.[^11] This brings the on-disk size of Phi-3 Mini to approximately 1.8 GB.</p>

<p>ONNX Runtime Mobile supports deployment across CPU, GPU, and neural processing unit (NPU) backends. Microsoft published reference chatbot applications for both iOS and Android demonstrating offline inference, streaming token generation, and automatic model retrieval from Hugging Face. On an iPhone 14 with an [Apple A16 Bionic](/wiki/apple_silicon), the INT4 Phi-3 Mini runs at over 12 tokens per second fully on-device.[^1][^11] On an Android Samsung Galaxy S21, Phi-3 Mini using RTN INT4 quantization runs at a moderate speed suitable for interactive assistant use.[^11]</p>

<p>For Windows deployments, ONNX Runtime's DirectML backend enables GPU acceleration on consumer hardware, including gaming-class GPUs not traditionally used for ML inference. On an NVIDIA RTX 4090, Phi-3 Mini achieves throughput between 217 and 308 tokens per second at batch size 1, depending on prompt length.[^11] CUDA-only deployment on server hardware achieves up to 5x speedup over PyTorch in FP16 and up to 10x speedup in INT4 for the 4K variant; the 128K variant shows up to 9x INT4 speedup.[^11] Phi-3 Small and Phi-3 Medium were subsequently added to this pipeline. The combination of MIT licensing, small footprints, and official ONNX support has made the Phi-3 family one of the most production-ready open SLM families for edge scenarios.</p>

### What is Phi-3 used for?

<p>Microsoft positioned the Phi-3 family for several distinct deployment scenarios based on size, latency profile, and capability:[^15]</p>

<ul>
<li><b>On-device and mobile AI:</b> Phi-3 Mini's ability to run on smartphone hardware enables applications that require local inference for privacy, offline availability, or latency reasons. Examples include on-device summarization, local code completion, and personal assistants that do not transmit user data to remote servers.</li>
<li><b>Latency-sensitive applications:</b> Smaller models produce tokens faster and need fewer computational resources than frontier-scale systems, making Phi-3 Mini and Phi-3 Small well-suited to real-time interactive tools, customer-facing chat systems, and embedded workflows where round-trip to a cloud API is unacceptable.</li>
<li><b>Cost-efficient cloud inference:</b> Deploying Phi-3 Medium or Phi-3.5-MoE on cloud infrastructure offers substantially lower per-token cost than frontier models while remaining competitive on reasoning, coding, and math tasks for enterprise applications that do not require frontier capability.</li>
<li><b>Retrieval-augmented generation:</b> The technical report explicitly recommends pairing Phi-3 with search or retrieval systems to compensate for the limited factual knowledge capacity of small synthetic-data-trained models, leveraging strong reasoning while providing external factual grounding.</li>
<li><b>Multimodal document understanding:</b> Phi-3 Vision and Phi-3.5 Vision target enterprise document processing involving charts, tables, scanned forms, and mixed image-text content, including financial, medical, and scientific reporting.</li>
<li><b>Fine-tuning base models:</b> MIT licensing and open weights make Phi-3 variants practical candidates for domain-specific fine-tuning. The compact size of Phi-3 Mini reduces the compute cost of fine-tuning relative to 7B or 13B models, enabling teams with modest GPU budgets to adapt the model for specialized applications.</li>
</ul>

## How safe is Phi-3?

<p>Microsoft's safety program for Phi-3 follows a "break-fix" cycle described in the paper "Phi-3 Safety Post-Training: Aligning Language Models with a 'Break-Fix' Cycle" (Haider et al., July 2024, arXiv:2407.13833).[^28] The cycle consists of five iterated stages: (1) curating safety-relevant training data; (2) safety-focused post-training combining SFT and DPO; (3) standardized internal safety evaluation; (4) red-team probing by Microsoft's AI Red Team (AIRT); and (5) targeted fixes informed by red-team findings, which feed back into the next round of curation and post-training.[^28]</p>

<p>The Microsoft AI Red Team probed Phi-3 release candidates using both single-turn and multi-turn conversational attacks, with adversary personas ranging from "low-skilled" attackers using only direct prompts to "intermediate" attackers using basic encodings and known jailbreak templates. The team reported that several iterations of the break-fix cycle cut the harmful-output rate for Phi-3 by roughly 75% relative to the pre-aligned baseline.[^28] Microsoft also reported substantial reductions in "ungroundedness" scores: Phi-3 Medium achieved an internal ungroundedness score of 0.213 versus 1.481 for Phi-2 on the same evaluation suite.[^1]</p>

<p>For the multilingual Phi-3.5 models, AIRT evaluated safety in Chinese, Spanish, Dutch, and English, finding that refusal behaviors and jailbreak robustness transferred well to non-English languages even though safety post-training was conducted predominantly in English.[^28] Multimodal safety for Phi-3.5-Vision was evaluated using benchmarks including RTVLM and VLGuard alongside internal harm-category measurements, with post-training improving safety scores across nearly all harm categories.[^25]</p>

<p>The technical report acknowledges that despite safety post-training, Phi-3 Mini retained a measurable residual jailbreak rate under adversarial prompting, and recommended that production deployments combine the model's built-in safety with application-layer content filtering, output classification, and user-context controls.[^1]</p>

## What replaced Phi-3? Phi-4, Phi-4-mini, and Phi-4-Multimodal

<p>Phi-3 has since been succeeded by the [Phi-4](/wiki/phi_4) generation, which retained the data-quality philosophy while increasing scale and adding multimodality.</p>

<ul>
<li><b>[Phi-4](/wiki/phi_4) (14B)</b>, released December 12, 2024 under the MIT License, is a 14-billion-parameter dense Transformer trained on 9.8 trillion tokens with a 16K context window. It introduced an even more synthetic-data-centric training recipe, with multi-agent prompting, self-revision workflows, and instruction reversal used during dataset generation. Phi-4 reports MMLU of 84.8%, GPQA of 56.1%, and MATH of 80.4%, with Microsoft claiming it matches or exceeds GPT-4o on specific reasoning benchmarks.[^6][^7]</li>
<li><b>[Phi-4-mini](/wiki/phi_4_mini) (3.8B)</b>, released February 26, 2025, is a 3.8-billion-parameter dense decoder-only Transformer with grouped-query attention, a 200,000-token vocabulary, shared input-output embeddings, and a 128K context window. It introduces function-calling support and improves multilingual coverage relative to Phi-3.5-Mini.[^29][^30]</li>
<li><b>Phi-4-Multimodal (5.6B)</b>, also released February 26, 2025, is built on the Phi-4-mini backbone and uses a Mixture-of-LoRAs (Low-Rank Adapters) approach to integrate vision, audio, and text within a single representation space. It supports speech input in eight languages, vision in English, and text in 23 languages, with a 128K context window.[^29][^30]</li>
</ul>

<p>The Phi-4 line therefore represents a continuation rather than a break: same MIT licensing, same Hugging Face and Azure AI distribution model, same on-device positioning, but with refined post-training (DPO, RLHF), larger and more diverse synthetic data, and explicit multimodality. Phi-4 in turn spawned reasoning-specialized variants such as Phi-4-Reasoning (released April 2025) and Phi-4-mini-flash-reasoning, demonstrating that the Phi pipeline could be retargeted toward chain-of-thought reasoning competitive with much larger systems.[^7]</p>

## What are the limitations of Phi-3?

<p>Microsoft's technical report and the model cards acknowledge several limitations of the Phi-3 family:[^1]</p>

<p><b>Factual knowledge capacity.</b> Because the models are trained on relatively fewer tokens than frontier systems, and because the training data prioritizes reasoning structure over raw factual breadth, Phi-3 models store less world knowledge per parameter than comparably-sized models trained on diverse web crawls. [TriviaQA](/wiki/triviaqa) performance is correspondingly lower than for models such as Mixtral 8x7B. The Phi-3 Mini Hugging Face card reports the model's factual-knowledge category score at 38.4%, the lowest of any benchmark category.[^13] Microsoft explicitly recommends [retrieval-augmented generation](/wiki/retrieval_augmented_generation_rag) for applications requiring broad factual recall.[^1]</p>

<p><b>Language coverage.</b> The original Phi-3 models were primarily English-trained. Performance on non-English languages was substantially weaker than on English tasks. Phi-3.5 partially addressed this with dedicated multilingual data and explicit support for 22 languages, but English remained the strongest-supported language, and even Phi-3.5 trails Gemma 2 9B on some multilingual benchmarks (e.g., Multilingual MMLU 55.4 versus 63.8 for Gemma 2 9B; MGSM 47.9 versus 76.4 for Gemma 2 9B).[^21]</p>

<p><b>Hallucination.</b> As with all language models, Phi-3 can produce plausible-sounding but factually incorrect output, an effect amplified by the reduced factual capacity noted above. The technical report notes hallucination as a residual risk despite SFT and DPO alignment.[^1]</p>

<p><b>Safety and adversarial robustness.</b> Internal evaluations showed that despite safety post-training, Phi-3 Mini could be induced to produce harmful content under adversarial prompting, with a measurable residual jailbreak rate.[^1][^28]</p>

<p><b>Code generation breadth.</b> Fine-tuning data for code generation was concentrated on Python; performance on other programming languages is weaker. Microsoft suggests that production deployments targeting non-Python code should consider additional language-specific fine-tuning.[^1]</p>

<p><b>Benchmark-versus-arena gap.</b> As noted above, Phi-3 Mini's strong academic benchmark scores did not always translate to proportionate Chatbot Arena Elo gains, an outcome attributed by external commentators to synthetic data optimized for benchmark formats. Microsoft's three-stage Phi-3.5 post-training partially closed this gap but did not eliminate it.[^21]</p>

<p><b>Benchmark contamination risk.</b> Researchers studying [MMLU](/wiki/mmlu) contamination have shown that simple paraphrasing of test items can defeat string-matching decontamination, leaving open the question of how much synthetic-data pipelines may inadvertently include rephrased benchmark content; the Phi-3 technical report describes decontamination procedures but does not publish a full audit.[^1]</p>

## Legacy and current status

<p>The Phi-3 family was, at release, the most public demonstration that small models trained on aggressively curated and synthetic data could compete with much larger systems on standard benchmarks while remaining deployable on consumer hardware. Phi-3 Mini's on-device demonstrations on smartphones, and the MIT licensing of all variants, made it a reference point for the on-device and edge AI discourse through 2024 and into 2025.[^1][^2]</p>

<p>The family also reinforced a broader research trend toward [synthetic training data](/wiki/synthetic_data). By demonstrating that the "Textbooks Are All You Need" approach scaled from Phi-1's narrow code domain to Phi-3's broad general-purpose family, the work fed into industry-wide adoption of synthetic-data pipelines, ultimately informing the design of [Phi-4](/wiki/phi_4) and influencing approaches taken by other labs.[^6] OpenAI's open-weight releases starting in 2025 were widely interpreted as reflecting Bubeck's influence after his October 2024 move from Microsoft.[^12]</p>

<p>As of 2026, Phi-3 itself is considered superseded by Phi-4 and Phi-4-mini for new deployments, but the older models remain widely used in production where compatibility, fine-tuning ecosystem maturity, or simpler runtime requirements favor the prior generation. Hugging Face download statistics and community fine-tunes of Phi-3 variants remain among the most active for small open models, with derivatives such as LLaVA-Phi-3 (a multimodal fine-tune merging the Phi-3 backbone with LLaVA's visual reasoning pipeline) extending the family's reach into community research.[^31] The Phi-3-mini-instruct base has also been used as the language component in a number of academic vision-language papers studying the small-model limit of multimodal alignment.[^31]</p>

## See also

<ul>
<li>[Phi-2](/wiki/phi_2)</li>
<li>[Phi-4](/wiki/phi_4)</li>
<li>[Phi-4-mini](/wiki/phi_4_mini)</li>
<li>[Microsoft](/wiki/microsoft)</li>
<li>[Microsoft Research](/wiki/microsoft_research)</li>
<li>[Textbooks Are All You Need](/wiki/textbooks_are_all_you_need)</li>
<li>[Small language model](/wiki/small_language_model)</li>
<li>[LongRoPE](/wiki/longrope)</li>
<li>[Synthetic data](/wiki/synthetic_data)</li>
<li>[Edge AI](/wiki/edge_ai)</li>
<li>[Hugging Face](/wiki/hugging_face)</li>
<li>[MIT License](/wiki/mit_license)</li>
<li>[Llama 3](/wiki/llama_3)</li>
<li>[Mistral 7B](/wiki/mistral_7b)</li>
<li>[Mixtral](/wiki/mixtral)</li>
<li>[Gemma](/wiki/gemma)</li>
<li>[Ollama](/wiki/ollama)</li>
<li>[ONNX](/wiki/onnx)</li>
<li>[NVIDIA NIM](/wiki/nvidia_nim)</li>
<li>[Grouped-Query Attention](/wiki/gqa)</li>
<li>[SwiGLU](/wiki/swiglu)</li>
<li>[RMSNorm](/wiki/rmsnorm)</li>
<li>[Mixture of Experts](/wiki/mixture_of_experts)</li>
<li>[CLIP](/wiki/clip)</li>
<li>[MMLU](/wiki/mmlu)</li>
<li>[GSM8K](/wiki/gsm8k)</li>
<li>[HumanEval](/wiki/humaneval)</li>
<li>[MT-Bench](/wiki/mt_bench)</li>
<li>[RULER](/wiki/ruler_benchmark)</li>
<li>[TriviaQA](/wiki/triviaqa)</li>
<li>[Direct Preference Optimization](/wiki/direct_preference_optimization_dpo)</li>
<li>[Supervised fine-tuning](/wiki/supervised_fine-tuning)</li>
<li>[Proximal Policy Optimization](/wiki/ppo)</li>
<li>[Retrieval-Augmented Generation](/wiki/retrieval_augmented_generation_rag)</li>
<li>[Apple Silicon](/wiki/apple_silicon)</li>
<li>[NVIDIA H100](/wiki/nvidia_h100)</li>
</ul>

## References

[^1]: Abdin, M. et al. "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone", arXiv, 2024-04-22 (revised through v4). https://arxiv.org/abs/2404.14219. Accessed 2026-05-24.

[^2]: Microsoft Source, "Tiny but mighty: The Phi-3 small language models with big potential", Microsoft News, 2024-04-23. https://news.microsoft.com/source/features/ai/the-phi-3-small-language-models-with-big-potential/. Accessed 2026-05-24.

[^3]: Misha Bilenko, "New models added to the Phi-3 family, available on Microsoft Azure", Azure Blog, 2024-05-21. https://azure.microsoft.com/en-us/blog/new-models-added-to-the-phi-3-family-available-on-microsoft-azure/. Accessed 2026-05-24.

[^4]: Weizhu Chen, "Discover the New Multi-Lingual, High-Quality Phi-3.5 SLMs", Microsoft Tech Community / Azure AI Foundry Blog, 2024-08-21. https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/discover-the-new-multi-lingual-high-quality-phi-3-5-slms/4225280. Accessed 2026-05-24.

[^5]: Open Source Initiative, "The MIT License", OSI Approved Licenses. https://opensource.org/license/mit. Accessed 2026-05-24.

[^6]: Abdin, M. et al. "Phi-4 Technical Report", Microsoft Research / arXiv:2412.08905, 2024-12-12. https://www.microsoft.com/en-us/research/publication/phi-4-technical-report/. Accessed 2026-05-24.

[^7]: Hugging Face, "microsoft/phi-4 model card", Hugging Face Hub. https://huggingface.co/microsoft/phi-4. Accessed 2026-05-24.

[^8]: Gunasekar, S. et al. "Textbooks Are All You Need", arXiv:2306.11644, 2023-06-20. https://arxiv.org/abs/2306.11644. Accessed 2026-05-24.

[^9]: Li, Y. et al. "Textbooks Are All You Need II: phi-1.5 technical report", arXiv:2309.05463, 2023-09-11. https://arxiv.org/abs/2309.05463. Accessed 2026-05-24.

[^10]: Mojan Javaheripi and Sebastien Bubeck, "Phi-2: The surprising power of small language models", Microsoft Research Blog, 2023-12-12. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/. Accessed 2026-05-24.

[^11]: ONNX Runtime Team, "ONNX Runtime supports Phi-3 mini models across platforms and devices", ONNX Runtime Blog, 2024-04-23. https://onnxruntime.ai/blogs/accelerating-phi-3. Accessed 2026-05-24.

[^12]: Kyle Wiggers, "OpenAI snatches up Microsoft generative AI research lead", TechCrunch, 2024-10-14. https://techcrunch.com/2024/10/14/openai-snatches-up-microsoft-generative-ai-research-lead/. Accessed 2026-05-24.

[^13]: Microsoft, "Phi-3-mini-4k-instruct model card", Hugging Face. https://huggingface.co/microsoft/Phi-3-mini-4k-instruct. Accessed 2026-05-24.

[^14]: Microsoft, "Phi-3-mini-128k-instruct model card", Hugging Face. https://huggingface.co/microsoft/Phi-3-mini-128k-instruct. Accessed 2026-05-24.

[^15]: Misha Bilenko, "Introducing Phi-3: Redefining what's possible with SLMs", Azure Blog, 2024-04-23. https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/. Accessed 2026-05-24.

[^16]: Microsoft, "Phi-3-small-128k-instruct model card", Hugging Face. https://huggingface.co/microsoft/Phi-3-small-128k-instruct. Accessed 2026-05-24.

[^17]: Microsoft, "Phi-3-medium-128k-instruct model card", Hugging Face. https://huggingface.co/microsoft/Phi-3-medium-128k-instruct. Accessed 2026-05-24.

[^18]: Microsoft, "Phi-3-vision-128k-instruct model card", Hugging Face. https://huggingface.co/microsoft/Phi-3-vision-128k-instruct. Accessed 2026-05-24.

[^19]: Microsoft, "Phi-3.5-MoE-instruct model card", Hugging Face. https://huggingface.co/microsoft/Phi-3.5-MoE-instruct. Accessed 2026-05-24.

[^20]: Ding, Y. et al. "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens", arXiv:2402.13753, 2024-02-21 (ICML 2024). https://arxiv.org/abs/2402.13753. Accessed 2026-05-24.

[^21]: Microsoft, "Phi-3.5-mini-instruct model card", Hugging Face. https://huggingface.co/microsoft/Phi-3.5-mini-instruct. Accessed 2026-05-24.

[^22]: Apple Machine Learning Research, "Introducing Apple's On-Device and Server Foundation Models", Apple, 2024-06-10. https://machinelearning.apple.com/research/introducing-apple-foundation-models. Accessed 2026-05-24.

[^23]: Hugging Face community discussion, "microsoft/Phi-3-mini-128k-instruct discussions on benchmark accuracy", Hugging Face. https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/20. Accessed 2026-05-24.

[^24]: MarkTechPost, "Microsoft AI Releases Phi 3.5 mini, MoE and Vision with 128K context, Multilingual and MIT License", MarkTechPost, 2024-08-21. https://www.marktechpost.com/2024/08/21/microsoft-ai-releases-phi-3-5-mini-moe-and-vision-with-128k-context-multilingual-and-mit-license/. Accessed 2026-05-24.

[^25]: Microsoft, "Phi-3.5-vision-instruct model card", Hugging Face. https://huggingface.co/microsoft/Phi-3.5-vision-instruct. Accessed 2026-05-24.

[^26]: Meta, "Llama 3 Community License Agreement", Meta Llama. https://llama.meta.com/llama3/license/. Accessed 2026-05-24.

[^27]: NVIDIA, "Phi-3 on NVIDIA NIM microservices catalog", NVIDIA Developer / build.nvidia.com. https://build.nvidia.com/microsoft. Accessed 2026-05-24.

[^28]: Haider, E. et al. "Phi-3 Safety Post-Training: Aligning Language Models with a 'Break-Fix' Cycle", arXiv:2407.13833, 2024-07-18. https://arxiv.org/abs/2407.13833. Accessed 2026-05-24.

[^29]: Microsoft Azure Blog, "Empowering innovation: The next generation of the Phi family", Azure Blog, 2025-02-26. https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/. Accessed 2026-05-24.

[^30]: Microsoft, "Phi-4-multimodal-instruct model card", Hugging Face. https://huggingface.co/microsoft/Phi-4-multimodal-instruct. Accessed 2026-05-24.

[^31]: Hugging Face, "Phi-3 family pipeline and derivative models documentation", Hugging Face Transformers documentation. https://huggingface.co/docs/transformers/en/model_doc/phi3. Accessed 2026-05-24.

[^32]: Continuum Labs, "Phi-3 Technical Report annotated walkthrough", Continuum Labs, 2024-04. https://training.continuumlabs.ai/models/foundation-models/phi-3-technical-report. Accessed 2026-05-24.

[^33]: Microsoft Tech Community, "Phi-3 Vision: catalyzing multimodal innovation", Azure AI Foundry Blog, 2024-05. https://techcommunity.microsoft.com/blog/azure-ai-foundry-blog/phi-3-vision-%E2%80%93-catalyzing-multimodal-innovation/4170251. Accessed 2026-05-24.

[^34]: Microsoft Tech Community, "A better Phi family is coming: multi-language support, better vision, intelligent MoEs", Microsoft Tech Community, 2024-08. https://techcommunity.microsoft.com/blog/educatordeveloperblog/a-better-phi-family-is-coming---multi-language-support-better-vision-intelligenc/4224181. Accessed 2026-05-24.

[^35]: Microsoft Security Blog, "AI jailbreaks: What they are and how they can be mitigated", Microsoft Security Blog, 2024-06-04. https://www.microsoft.com/en-us/security/blog/2024/06/04/ai-jailbreaks-what-they-are-and-how-they-can-be-mitigated/. Accessed 2026-05-24.

[^36]: TechTarget Editorial, "Microsoft's new Phi-3-mini AI language model runs on iPhone", TechTarget, 2024-04-24. https://www.techtarget.com/searchenterpriseai/news/366582218/Microsofts-new-Phi-3-mini-AI-language-model-runs-on-iPhone. Accessed 2026-05-24.

[^37]: Maginative, "Microsoft Launches Phi-3 Mini: A Lightweight AI Model Packing a Punch", Maginative, 2024-04-23. https://www.maginative.com/article/microsoft-launches-phi-3-mini-a-lightweight-ai-model-packing-a-punch/. Accessed 2026-05-24.

[^38]: Stephanie Palazzolo, "Microsoft AI Vice President Sebastien Bubeck to Join OpenAI", Bloomberg, 2024-10-14. https://www.bloomberg.com/news/articles/2024-10-14/microsoft-artificial-intelligence-vp-bubeck-to-join-openai. Accessed 2026-05-24.

[^39]: Sebastien Bubeck and Eldan, "TinyStories: How Small Can Language Models Be and Still Speak Coherent English?", arXiv:2305.07759, 2023-05-12. https://arxiv.org/abs/2305.07759. Accessed 2026-05-24.

[^40]: Ritvik Rastogi, "Papers Explained 130: Phi-3", Medium, 2024-04. https://ritvik19.medium.com/papers-explained-130-phi-3-0dfc951dc404. Accessed 2026-05-24.
Model	Params	Layers	Heads	Hidden dim	Vocab	Attention	Context
Phi-3 Mini	3.8B	32	32	3,072	32,064 (Llama-2)	Dense MHA	4K / 128K
Phi-3 Small	7B	32	32	4,096	100,352 (tiktoken)	GQA + blocksparse	8K / 128K
Phi-3 Medium	14B	40	40	5,120	32,064 (Llama-2)	Dense MHA	4K / 128K
Phi-3 Vision	4.2B	32 (LM)	32 (LM)	3,072 (LM)	32,064	Dense + CLIP ViT-L/14	128K
Phi-3.5-Mini	3.8B	32	32	3,072	32,064	Dense MHA	128K
Phi-3.5-MoE	42B (6.6B active)	32	32	3,072	32,064	Dense + 16 MoE experts, top-2	128K
Phi-3.5-Vision	4.2B	32 (LM)	32 (LM)	3,072 (LM)	32,064	Dense + CLIP ViT-L/14	128K
Benchmark	Phi-3 Mini (3.8B)	Phi-3 Small (7B)	Phi-3 Medium (14B)	Phi-3.5 Mini (3.8B)	Phi-3.5 MoE (6.6B active)
MMLU (5-shot)	69.7	75.5	78.0	69.0	78.9
GSM8K (8-shot CoT)	85.3	87.3	87.5	86.2	88.7
HumanEval (0-shot)	60.4	59.1	58.5	62.8	70.7
MT-Bench	8.38	8.70	8.90	--	--
ARC Challenge (10-shot)	85.5	90.8	91.0	84.6	91.0
BigBench Hard (3-shot)	73.5	77.6	77.9	69.0	79.1
MBPP (3-shot)	71.7	69.6	73.8	69.6	80.8
Multilingual MMLU	51.08	62.6	--	55.4	69.9
RULER avg (to 128K)	84.6	--	--	84.1	87.1