Phi-4
Last reviewed
May 31, 2026
Sources
13 citations
Review status
Source-backed
Revision
v4 · 3,658 words
Phi-4 is a 14-billion-parameter small language model developed by Microsoft Research and released in December 2024. It is the fourth major generation of Microsoft's Phi family and the clearest expression to date of the family's guiding idea: that careful data curation, and in particular the heavy use of synthetic data, can produce a compact model that rivals or beats systems several times its size on reasoning-heavy tasks. Phi-4 is a dense decoder-only transformer trained on roughly 9.8 trillion tokens, with a deliberate emphasis on mathematics, science, and coding. Microsoft published the accompanying "Phi-4 Technical Report" (arXiv:2412.08905) on December 12, 2024, and released the open model weights under the permissive MIT License on Hugging Face in January 2025. [1][2][3]
The most cited result from the launch is that Phi-4 outscores the much larger GPT-4o on graduate-level science questions (GPQA) and on competition mathematics (MATH), despite using a small fraction of the parameters and compute of frontier proprietary systems. The result is notable because parts of Phi-4's training data were generated by GPT-4-class models. The technical report frames it as evidence that the team's data-generation and post-training methods go beyond simple knowledge distillation, since the student surpasses the teacher on the targeted STEM skills. [1][3]
Phi-4 belongs to the category of models that Microsoft and the broader field call small language models, a loose label for systems in roughly the 1-billion to 15-billion-parameter range that are designed to run cheaply and, increasingly, to run on a single GPU or even a laptop. The defining contrast is with the large language model class, where capability has historically been pursued by scaling parameters and raw training compute. Phi-4 pursues capability along a different axis. Rather than adding parameters, the Phi team invests in the quality and structure of the training data, on the premise that a model learns reasoning more efficiently from text that was itself produced through coherent reasoning. [1][4]
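At a high level, Phi-4 is a 14-billion-parameter, dense decoder-only transformer with a 16,000-token context, trained on roughly 9.8 trillion tokens with an emphasis on mathematics, science, and coding, and released with open weights under the MIT License.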
Microsoft positioned Phi-4 not as a replacement for frontier models but as an efficient option for reasoning tasks where cost, latency, and the ability to self-host matter. Over the course of 2025, the company expanded the single flagship model into a full family, adding smaller, multimodal, and explicitly reasoning-tuned siblings. [2][5]
The Phi line began as a research bet inside Microsoft Research, associated in its early years with the researcher Sebastien Bubeck and the provocative claim that "textbooks are all you need." The argument, set out in a 2023 paper of that name, was that training a model on textbook-quality, educational, reasoning-dense content could yield capabilities far out of proportion to the model's parameter count. Each generation of Phi has extended that thesis to a broader set of skills. [4][6]
| Model | Parameters | Released | What it established |
|---|---|---|---|
| Phi-1 | 1.3B | June 2023 | A coding-focused model trained largely on synthetic Python "textbooks" and exercises; strong HumanEval and MBPP scores for its size. [6] |
| Phi-1.5 | 1.3B | September 2023 | Extended the data-quality approach to common-sense reasoning and general language, performing like models several times larger. [6] |
| Phi-2 | 2.7B | December 2023 | Added distillation signal and broader reasoning, matching or beating models up to roughly 25 times its size on several benchmarks. [6] |
| Phi-3 | 3.8B to 14B | April 2024 | First Phi generation released as a public product family, with a 128,000-token context in some variants and the move to an MIT License. |
| Phi-4 | 14B | December 2024 | Made synthetic data the primary driver of pretraining and introduced new post-training techniques such as Pivotal Token Search. [1] |
Phi-1 (June 2023) was a 1.3-billion-parameter model trained mostly on synthetic Python tutorials and exercises plus filtered code, and it set new marks for small coding models on HumanEval and MBPP. Phi-1.5, also 1.3 billion parameters, broadened the recipe toward common-sense reasoning and language understanding. Phi-2 (December 2023, 2.7 billion parameters) pushed further into general reasoning and became a widely used research base. Phi-3 (April 2024) was the turning point from research artifact to product: it shipped in multiple sizes from 3.8 billion to 14 billion parameters, introduced long-context variants, and adopted the MIT License that Phi-4 would inherit. [4][6]
Phi-4 builds directly on the architectural foundation of Phi-3 while rethinking the data strategy. Where Phi-3 had already leaned on synthetic data, Phi-4 made it the centerpiece of pretraining. The model also arrived during a period of change for the team. In October 2024, Sebastien Bubeck, a longtime lead on the series and Microsoft's vice president of generative AI research, left to join OpenAI. Microsoft said the bulk of his Phi team would remain, and Bubeck is still listed among the authors of the December 2024 Phi-4 technical report. [1][7]
Phi-4's architecture is intentionally conservative. The team kept structural changes from Phi-3-medium to a minimum and spent its research effort on data quality and post-training rather than on novel model components. The result is a fairly standard dense, decoder-only transformer whose performance comes from what it was trained on rather than from architectural tricks. [1]
| Specification | Value |
|---|---|
| Parameters | 14 billion (about 14.7B with embeddings) |
| Architecture | Dense decoder-only transformer |
| Context length | 16,000 tokens (extended from a 4,000-token pretraining context) |
| Tokenizer | tiktoken-based, vocabulary of 100,352 tokens |
| Training data | Approximately 9.8 trillion tokens across all stages |
| Training hardware | 1,920 NVIDIA H100-80GB GPUs |
| Training duration | About 21 days |
| Training period | October to November 2024 |
| Knowledge cutoff | June 2024 |
| License | MIT |
Source: Phi-4 technical report and the microsoft/phi-4 model card. [1][3]
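The hardware and duration rows above imply a rough compute budget. The following is a back-of-the-envelope sketch, not a figure from the report: the GPU-hour count follows directly from the table, while the FLOP estimate relies on the common 6 × parameters × tokens rule of thumb for dense transformers, which is an assumption introduced here for illustration.

```python
# Figures taken from the specification table above.
NUM_GPUS = 1_920
TRAINING_DAYS = 21   # the table says "about 21 days", so results are approximate
PARAMS = 14e9        # headline parameter count
TOKENS = 9.8e12      # approximate tokens across all stages

# GPU-hours follow directly from the table.
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24

# Assumption: the widely used 6 * N * D rule of thumb for the training
# FLOPs of a dense transformer. The report does not state this number.
approx_train_flops = 6 * PARAMS * TOKENS

print(f"GPU-hours: {gpu_hours:,}")
print(f"Approximate training FLOPs: {approx_train_flops:.2e}")
```

Under these assumptions the run comes to a little under one million H100 GPU-hours.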
The distinctive feature of Phi-4 is its pretraining mixture. Microsoft assembled roughly 50 broad types of synthetic datasets totaling around 400 billion tokens, the largest synthetic corpus in the Phi series up to that point. In the final training mixture, the technical report describes a composition of approximately 40 percent synthetic data, 15 percent rewritten web content, 15 percent filtered web data, 20 percent code, and 10 percent acquired sources such as academic books and question-answer datasets. Roughly 8 percent of the overall data is multilingual. [1][3]
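As a minimal sketch of what such a mixture means in practice, the snippet below draws source labels in proportion to the shares quoted above. The source names and the sampling scheme are illustrative assumptions. The article does not describe how Microsoft's data loader actually interleaves sources.

```python
import random

# Approximate shares of the final training mixture, as quoted above.
# The key names are hypothetical labels chosen for this example.
MIXTURE = {
    "synthetic": 0.40,
    "web_rewrites": 0.15,
    "filtered_web": 0.15,
    "code": 0.20,
    "acquired_sources": 0.10,
}

def sample_sources(n, seed=0):
    """Draw n source labels in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

# The quoted shares should account for the whole mixture.
assert abs(sum(MIXTURE.values()) - 1.0) < 1e-9

draws = sample_sources(100_000)
observed = {name: draws.count(name) / len(draws) for name in MIXTURE}
for name, share in observed.items():
    print(f"{name:17s} target={MIXTURE[name]:.2f} observed={share:.3f}")
```

With enough draws the observed shares settle close to the targets, which is all the percentages in the report promise: they describe the mix a model sees, not the raw size of each corpus.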
The synthetic data is not produced by a single naive prompt. The technical report describes several distinct generation techniques, including multi-agent prompting, self-revision workflows, and instruction reversal.
The underlying intuition, as the report puts it, is that synthetic tokens are "by definition predicted by the preceding tokens": each token follows from a context that was itself produced according to a coherent pattern, which makes the resulting reasoning structures easier for the model to absorb than the noisier statistics of raw web text. Microsoft reports that, in ablation studies, extra passes over synthetic data produced larger capability gains than adding equivalent volumes of fresh web tokens. [1][4]
Phi-4 is pretrained with a 4,000-token context and then put through a dedicated mid-training stage that lengthens its effective context to 16,000 tokens. This stage uses roughly 250 billion additional tokens at the longer length, blending newly curated long-context documents with recall tokens drawn from the main pretraining corpus so that earlier capabilities are preserved. To represent the longer range, the base frequency of the rotary position embedding was raised to 250,000. Ablation work reported in the paper found that training on genuinely long documents produced better long-context behavior than simply padding short sequences to the target length. [1]
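To see why a larger rotary base helps with longer contexts, the sketch below computes the wavelength of each rotary frequency pair using the standard rotary-embedding formula. The head dimension of 128 and the comparison base of 10,000 are assumptions for illustration, not values stated in this article. Only the 250,000 base comes from the report.

```python
import math

def rope_wavelengths(base, head_dim):
    """Wavelength, in tokens, of each rotary frequency pair.

    Pair i rotates by theta_i = base ** (-2 * i / head_dim) radians per
    token, so it completes a full turn every 2 * pi / theta_i tokens.
    """
    return [
        2 * math.pi * base ** (2 * i / head_dim)
        for i in range(head_dim // 2)
    ]

HEAD_DIM = 128  # assumption for illustration only

short = rope_wavelengths(10_000, HEAD_DIM)   # a commonly used default base (assumption)
long = rope_wavelengths(250_000, HEAD_DIM)   # base reported for Phi-4 mid-training

print(f"longest wavelength, base 10,000:  {short[-1]:,.0f} tokens")
print(f"longest wavelength, base 250,000: {long[-1]:,.0f} tokens")
```

Raising the base stretches every wavelength except the first, so positions thousands of tokens apart still map to distinguishable rotation angles.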
After pretraining and mid-training, Phi-4 goes through a post-training pipeline that moves from broad supervised signal toward increasingly targeted preference optimization. The technical report describes three stages. [1]
The first stage is supervised fine-tuning on about 8 billion tokens of high-quality chat-formatted data spanning mathematics, coding, science, and general question answering. This establishes instruction-following behavior and output format. [1]
The second stage introduces a technique the team calls Pivotal Token Search (PTS). The idea is that within a generated response, a small number of individual tokens have an outsized effect on whether the answer ultimately turns out correct, and these pivotal tokens are not always at the obvious decision points. PTS finds them by sampling multiple continuations from candidate positions and watching how the probability of reaching a correct answer shifts; the positions where it shifts most sharply are treated as pivotal, and token-level preference pairs are built there for direct preference optimization. This gives a more precise training signal than response-level preference optimization, which treats the whole response as a single unit, and the report finds it especially helpful on reasoning-heavy benchmarks where individual steps determine the outcome. [1]
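The search idea can be sketched as a recursive subdivision over a response. This is a simplified illustration, not the report's implementation: `p_success` is a hypothetical stand-in for sampling continuations from a real model and measuring how often they reach a correct answer, and the threshold, the tokens, and the toy probabilities are all invented for the example.

```python
from typing import Callable, List, Sequence, Tuple

def pivotal_tokens(
    tokens: Sequence[str],
    p_success: Callable[[Sequence[str]], float],
    threshold: float = 0.2,
) -> List[Tuple[int, str, float]]:
    """Return (index, token, delta) for tokens that shift p(success) sharply.

    p_success(prefix) stands in for "sample several continuations from this
    prefix and report the fraction that reach a correct answer".
    """
    found: List[Tuple[int, str, float]] = []

    def search(lo: int, hi: int) -> None:
        # Change in success probability contributed by tokens[lo:hi].
        delta = p_success(tokens[:hi]) - p_success(tokens[:lo])
        if abs(delta) < threshold:
            return  # no sharp net shift inside this span
        if hi - lo == 1:
            found.append((lo, tokens[lo], delta))
            return
        mid = (lo + hi) // 2
        search(lo, mid)
        search(mid, hi)

    search(0, len(tokens))
    return found

# Toy estimator (hypothetical): success probability indexed by prefix length.
TOY_PROBS = [0.30, 0.30, 0.35, 0.80, 0.80, 0.55, 0.55]
toy_tokens = ["So", "we", "factor", "and", "guess", "x"]

def toy_p_success(prefix: Sequence[str]) -> float:
    return TOY_PROBS[len(prefix)]

result = pivotal_tokens(toy_tokens, toy_p_success)
for index, token, delta in result:
    print(f"token {index} ({token!r}) shifts p(success) by {delta:+.2f}")
```

In this toy run one token raises the chance of success and another lowers it, and each would become the site of a token-level preference pair. One limitation of this simplified version is that a span whose positive and negative shifts cancel out is skipped without being split further.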
The third stage is a judge-guided round of direct preference optimization, in which GPT-4o acts as a preference judge over roughly 850,000 pairs generated from the model's own outputs. This provides broad coverage that complements the narrow, surgical signal from PTS, and the report notes it was particularly useful for conversational quality benchmarks that themselves rely on a model judge. [1]
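Both preference stages rely on direct preference optimization. As a rough sketch, the function below computes the standard DPO loss for a single preference pair from log-probabilities. The `beta` value and the example numbers are assumptions for illustration, and the report's exact hyperparameters and batching are not reproduced here.

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value, not taken from the report
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a completion under the
    policy being trained or under the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) computed as a numerically stable softplus(-logits).
    x = -logits
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

# No preference learned yet: policy and reference agree, loss is log(2).
neutral = dpo_loss(-6.0, -6.0, -6.0, -6.0)

# Policy now favors the chosen completion more than the reference does.
improved = dpo_loss(-5.0, -9.0, -6.0, -6.0)

print(f"neutral loss:  {neutral:.4f}")
print(f"improved loss: {improved:.4f}")
```

For PTS-derived pairs the "completion" is a single token at a pivotal position, whereas the judge-guided stage compares full responses; the loss has the same form in both cases.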
Across 2025, Microsoft turned Phi-4 from a single flagship into a family. The original 14B model is the subject of most of this article, but the siblings are an important part of the story.
| Variant | Parameters | Released | Distinguishing feature |
|---|---|---|---|
| Phi-4 (flagship) | 14B dense | Dec 12, 2024 (preview); Jan 8, 2025 (open weights) | Synthetic-data pretraining, 16K context, MIT license [1][3] |
| Phi-4-mini | 3.8B dense | February 2025 | 128K context, expanded ~200K-token vocabulary, multilingual [5][8] |
| Phi-4-multimodal | 5.6B | February 2025 | Text, vision, and audio inputs via a Mixture-of-LoRAs design [5][9] |
| Phi-4-reasoning | 14B dense | April 30, 2025 | Fine-tuned on o3-mini reasoning traces, 32K context [10][11] |
| Phi-4-reasoning-plus | 14B dense | April 30, 2025 | Adds reinforcement learning on top of Phi-4-reasoning [10][11] |
| Phi-4-mini-reasoning | 3.8B dense | April 2025 | Distilled from DeepSeek-R1 reasoning traces, math-focused [11] |
Phi-4-mini is a 3.8-billion-parameter dense model aimed at memory- and latency-constrained settings. Compared with its Phi-3.5 predecessor it expands the vocabulary to roughly 200,000 tokens for better multilingual coverage, uses grouped-query attention for efficient long-sequence generation, and offers a 128,000-token context window, far longer than the flagship's 16,000. [5][8]
Phi-4-multimodal is a roughly 5.6-billion-parameter model that accepts text, images, and audio and produces text. It is built around a Mixture-of-LoRAs approach, attaching modality-specific low-rank adapters to a shared language backbone so that vision and speech capabilities can be added without disturbing the core text model. It supports a 128,000-token context and is documented in a separate Phi-4-Mini technical report (arXiv:2503.01743). [5][9]
Phi-4-reasoning and Phi-4-reasoning-plus, released April 30, 2025, take the original 14B base and specialize it for step-by-step reasoning. Phi-4-reasoning is supervised-fine-tuned on around 8.3 billion tokens of synthetic chain-of-thought traces generated by OpenAI's o3-mini, covering STEM, coding, and logic, and it uses a 32,000-token context with explicit reasoning delimited by <think> tags. Phi-4-reasoning-plus adds a short reinforcement-learning phase on top. Microsoft reports that Phi-4-reasoning scores about 75.3 percent on the AIME 2024 competition and Phi-4-reasoning-plus about 81.3 percent, ahead of the much larger DeepSeek-R1-Distill-Llama-70B at roughly 69.3 percent. A smaller Phi-4-mini-reasoning (3.8B), distilled from DeepSeek-R1 traces and focused on mathematics, was released alongside them. [10][11]
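Applications that consume output from the reasoning variants usually need to separate the reasoning trace from the final answer. The helper below is a minimal sketch that assumes a single `<think>...</think>` block followed by the answer. The function name and the sample response are invented for illustration, and real outputs should be checked against the model card's chat format.

```python
import re

THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (reasoning, answer) for a response with one <think> block.

    If no block is present, the whole text is treated as the answer.
    """
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

# Invented sample response, not real model output.
sample = "<think>2 + 2 is 4, then double it.</think>\nThe result is 8."
reasoning, answer = split_reasoning(sample)

print("reasoning:", reasoning)
print("answer:", answer)
```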
Phi-4's benchmark results were first published in the technical report released alongside the model on December 12, 2024. The headline comparison set the 14B model against GPT-4o, GPT-4o-mini, Llama-3.3-70B, Qwen-2.5-14B-Instruct, and Qwen-2.5-72B across general knowledge, science, mathematics, coding, and multilingual reasoning. The table below reproduces the core numbers from the model card and report for Phi-4, its predecessor Phi-3 (14B), and all of these comparison models except Qwen-2.5-72B. As with any single-vendor benchmark suite, the figures should be read as one snapshot rather than a definitive ranking. [1][3]
| Benchmark | Phi-4 (14B) | Phi-3 (14B) | Qwen-2.5-14B | GPT-4o-mini | Llama-3.3-70B | GPT-4o |
|---|---|---|---|---|---|---|
| MMLU (general knowledge) | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 88.1 |
| GPQA (graduate science) | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 50.6 |
| MATH (competition math) | 80.4 | 44.6 | 75.6 | 73.0 | 66.3 | 74.6 |
| HumanEval (coding) | 82.6 | 67.8 | 72.1 | 86.2 | 78.9 | 90.6 |
| MGSM (multilingual math) | 80.6 | 53.5 | 79.6 | 86.5 | 89.1 | 90.4 |
| DROP (reading comprehension) | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 80.9 |
| SimpleQA (factual recall) | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 39.4 |
Source: microsoft/phi-4 model card and the Phi-4 technical report. [1][3]
Several patterns stand out. On GPQA, Phi-4's 56.1 exceeds GPT-4o's 50.6, meaning a 14-billion-parameter open model beat a far larger proprietary system on graduate-level science questions. On MATH, Phi-4's 80.4 likewise tops GPT-4o's 74.6, placing it among the strongest sub-20B models on competition mathematics at release. These are exactly the STEM skills the synthetic curriculum and PTS post-training were designed to sharpen, which is why the report treats them as evidence that the method goes beyond distillation. [1][3]
The model is weaker in other areas, and Microsoft is candid about this. On DROP, a reading-comprehension benchmark that rewards multi-hop extraction over long passages, Phi-4 trails both Llama-3.3-70B and Qwen-2.5-14B. On MGSM, a multilingual math benchmark, it sits below the GPT-4o models and Llama-3.3-70B, reflecting its English-centric training. Its SimpleQA score of 3.0 is low, which the model card attributes to the limited factual storage of a 14B model rather than a flaw in reasoning. The report notes that the PTS-based preference training substantially reduced hallucination on the team's internal evaluations, but raw recall of obscure facts remains a structural limitation of a model this size. [1][3]
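The gaps described above can be read straight off the benchmark table. The snippet below copies the Phi-4 and GPT-4o columns and computes the margin on each benchmark. It adds no new data and is only a convenience for checking the arithmetic.

```python
# (Phi-4, GPT-4o) scores copied from the benchmark table.
SCORES = {
    "MMLU": (84.8, 88.1),
    "GPQA": (56.1, 50.6),
    "MATH": (80.4, 74.6),
    "HumanEval": (82.6, 90.6),
    "MGSM": (80.6, 90.4),
    "DROP": (75.5, 80.9),
    "SimpleQA": (3.0, 39.4),
}

# Positive margin means Phi-4 scored higher than GPT-4o.
margins = {
    name: round(phi4 - gpt4o, 1)
    for name, (phi4, gpt4o) in SCORES.items()
}
phi4_wins = [name for name, delta in margins.items() if delta > 0]

for name, delta in margins.items():
    print(f"{name:10s} {delta:+.1f}")
print("Phi-4 ahead on:", ", ".join(phi4_wins))
```

Phi-4 leads on exactly the two STEM benchmarks the article highlights and trails on the rest, with the widest gap on factual recall.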
Microsoft placed particular weight on a contamination-resistant test. The technical report evaluates Phi-4 on the November 2024 AMC-10 and AMC-12 mathematics competitions, which took place after the training-data cutoff and so could not have leaked into pretraining. Phi-4 performed strongly on this genuinely held-out set, which the authors present as cleaner evidence of mathematical reasoning than scores on public benchmarks that may be partially memorized. [1]
All members of the Phi-4 family are released under the MIT License, one of the most permissive open-source licenses in wide use. It allows users to use, copy, modify, merge, publish, distribute, sublicense, and sell the software and model weights, subject only to including the license and copyright notice. This is more permissive than the custom community licenses attached to several competing open-weight models, which can restrict commercial use or constrain derivative model development. Microsoft first adopted MIT licensing for the Phi family with Phi-3 and continued it for Phi-4, signaling an intent to position the Phi models as infrastructure-grade components that organizations can adopt with minimal legal friction. [2][3]
The rollout happened in two stages. Microsoft first made the 14B model available on December 12, 2024 as a research preview through Azure AI Foundry, accompanied by the technical report but without public weights. On January 8, 2025 it published the official open weights on Hugging Face at the repository microsoft/phi-4 under the MIT License. In the gap between the two dates, community members extracted and re-uploaded weights from the preview, which drew attention from commentators such as Simon Willison and made the official open release one of the more anticipated small-model launches of early 2025. [2][12][13]
Today the family is distributed through Hugging Face (microsoft/phi-4, microsoft/Phi-4-mini-instruct, microsoft/Phi-4-multimodal-instruct, and microsoft/Phi-4-reasoning), through Azure AI Foundry, through GitHub Models, and via the NVIDIA API Catalog. The models also run through common open inference stacks including llama.cpp, Ollama, vLLM, and ONNX Runtime, and the 14B size loads comfortably on a single high-memory data-center GPU in 16-bit precision or on consumer hardware in quantized form. [3][5][8][12]
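The hardware claim follows from simple arithmetic on the weight count. The sketch below estimates the memory needed for the weights alone at several precisions, using the approximate 14.7-billion-parameter figure from the specification table. It ignores the KV cache, activations, and the per-block overhead that real quantization formats add, so actual requirements will be somewhat higher.

```python
PARAMS = 14.7e9  # approximate count including embeddings, from the spec table

def weight_memory_gib(params, bits_per_weight):
    """Memory needed for the weights alone, in GiB."""
    return params * bits_per_weight / 8 / 2**30

# Idealized precisions; real quantized formats carry extra metadata.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(PARAMS, bits):.1f} GiB")
```

At 16 bits the weights need a little over 27 GiB, which fits an 80 GB data-center GPU with room for the cache, while a 4-bit version comes in under 8 GiB and is within reach of consumer cards.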
Microsoft and third-party deployers describe several broad uses for Phi-4 and its siblings. [2][3][5]
Mathematical and scientific assistants. The model's strength on mathematics and science benchmarks suits it to tutoring tools, homework help, and step-by-step problem solving, and the MIT license makes it attractive to education-technology builders who want to self-host. [1][2]
Coding assistance. With a HumanEval score of 82.6, Phi-4 is competitive among models of its size for code generation and has been used to back code completion, review, and programming-tutor tools where a smaller model lowers latency and cost. Its coding training is concentrated on Python, so other languages are less reliable without additional fine-tuning. [1][3]
On-device and edge AI. The smaller Phi-4-mini, at 3.8 billion parameters, can run on laptops, high-end phones, and embedded systems where connectivity or cloud latency is a problem, making it suitable for offline field-service tools and bandwidth-limited environments. [5][8]
Multimodal document intelligence. Phi-4-multimodal combines vision and audio in one compact model, which fits document-understanding workflows that mix printed text and spoken queries, as well as accessibility and inspection applications. [5][9]
Reasoning workloads and agents. The base Phi-4 is a strong non-reasoning option for production tasks where speed and cost matter, while the Phi-4-reasoning siblings are intended for harder problems that benefit from explicit chain-of-thought. The model's instruction following is adequate for well-defined steps in agentic pipelines, though Microsoft acknowledges that it is less reliable than the model's reasoning ability. [1][10]
Privacy-sensitive and self-hosted deployments. Because the weights are open and MIT-licensed, organizations in regulated fields can run Phi-4 inside their own security perimeter rather than sending data to an external API, which has driven interest in finance, customer-support triage, and other settings where data control is a hard requirement. [2][3]
Microsoft documents Phi-4's limitations directly on the model card, and most of them follow from its size and training focus. [1][3]
Factual knowledge and hallucination. A 14-billion-parameter model stores far less factual detail than a frontier model, which shows up in its low SimpleQA score and in a tendency to produce plausible but incorrect details for obscure people, events, or citations. The report recommends pairing the model with a retrieval system for fact-heavy applications and notes that hallucination cannot be eliminated entirely. The June 2024 knowledge cutoff also means the base model is unaware of later events. [1][3]
Multilingual performance. Phi-4 is optimized for English, with only about 8 percent multilingual training data, and its scores on multilingual benchmarks such as MGSM trail several competitors. Quality degrades noticeably in non-English use, especially in languages beyond those emphasized in Phi-4-mini. [1][3]
Context length. The flagship's 16,000-token window is shorter than the 128,000-token windows of contemporaries such as Llama-3.3-70B and Qwen-2.5-14B, which makes it less suitable for very long documents or large retrieval contexts. Phi-4-mini offers 128,000 tokens at reduced reasoning strength, and Phi-4-reasoning extends the window to 32,000. [1][5]
Instruction following and formatting. The model is less reliable at obeying precise formatting instructions, such as strict table layouts or exact-length outputs, which Microsoft attributes to a training emphasis on reasoning quality over format compliance. [1][3]
Coding scope. Coding ability is concentrated on Python and common libraries; other languages and niche packages are less dependable and may need verification or fine-tuning. [3]
High-risk domains. Microsoft explicitly advises against deploying any Phi-4 variant without additional safeguards in high-stakes settings such as legal, medical, or financial decision-making, and notes that safety training reduces but does not remove harmful output under adversarial prompting. [3]