| DeepSeek-R1 | |
|---|---|
| Developer | DeepSeek |
| Release date | January 20, 2025 |
| Type | Reasoning-focused large language model |
| Architecture | Mixture of Experts transformer |
| Parameters | 671 billion total (37 billion activated per token) |
| Context length | 128,000 tokens |
| Training method | Multi-stage supervised fine-tuning and reinforcement learning (GRPO) |
| License | MIT |
| Updated version | R1-0528 (May 28, 2025) |
DeepSeek-R1 is an open-source reasoning-focused large language model developed by DeepSeek, a Chinese artificial intelligence company. Released on January 20, 2025, under the MIT license, R1 was one of the first open-weight models to match the reasoning performance of OpenAI's proprietary o1 model across mathematics, coding, and scientific reasoning tasks. The model uses a Mixture of Experts architecture with 671 billion total parameters, of which 37 billion are activated per forward pass, keeping computational costs manageable during inference.[1][2]
DeepSeek-R1's release triggered what became known as the "DeepSeek shock," a market event on January 27, 2025 that erased over $1 trillion from U.S. technology stocks in a single trading session. Nvidia alone lost approximately $589 billion in market capitalization, the largest single-day loss for any company in stock market history. The shock stemmed from the revelation that a small Chinese startup with roughly 160 employees had trained a reasoning model competitive with the world's most expensive AI systems, using an estimated $5.6 million in compute, a fraction of the hundreds of millions typically spent by Western labs.[3][4]
Beyond its market impact, DeepSeek-R1 was scientifically significant for demonstrating that complex reasoning behaviors could emerge from pure reinforcement learning without supervised fine-tuning. The companion model DeepSeek-R1-Zero, trained entirely through RL, developed chain-of-thought reasoning, self-reflection, and error correction spontaneously during training, a finding that challenged assumptions about how reasoning capabilities must be instilled in language models.[1][5]
DeepSeek had been building toward R1 throughout 2024. The company released DeepSeek-V2 in May 2024 and DeepSeek-V3 in December 2024, both using Mixture of Experts architectures that prioritized computational efficiency. V3 served as the base model for R1's training, providing a strong foundation of general language capabilities and world knowledge.[2][6]
The broader context for R1's development was the emergence of inference-time reasoning as a new paradigm in AI. OpenAI's o1, released in September 2024, had demonstrated that training models with reinforcement learning to "think before answering" could dramatically improve performance on difficult tasks. DeepSeek's contribution was to show that this approach could be replicated with open-source models at a fraction of the cost, and that the reasoning behaviors could emerge more naturally than previously assumed.[1]
DeepSeek-R1 is built on top of DeepSeek-V3, which uses a Mixture of Experts transformer architecture. The key architectural features include:[2][6]

- A Mixture of Experts (MoE) design with 671 billion total parameters, of which 37 billion are activated per token
- Multi-head Latent Attention (MLA), which compresses the key-value cache to reduce memory use during inference
- FP8 mixed-precision training, reducing compute and memory costs during pre-training
The MoE architecture is central to R1's efficiency story. By activating only 37 billion of its 671 billion parameters for each token, the model achieves inference costs comparable to a much smaller dense model while maintaining the knowledge capacity of its full parameter count.
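The sparse-activation idea can be illustrated with a toy top-k gating function. This is a hedged sketch, not DeepSeek's actual router (which adds shared experts and auxiliary-loss-free load balancing); all names, shapes, and values are illustrative:

```python
import numpy as np

def moe_route(x, gate_w, k=8):
    """Toy top-k routing: only the k selected experts' parameters are used
    for this token, the mechanism behind activating ~37B of 671B params."""
    logits = x @ gate_w                        # one gating score per expert
    topk = np.argsort(logits)[-k:]             # indices of the k best experts
    w = np.exp(logits[topk] - logits[topk].max())
    return topk, w / w.sum()                   # softmax over selected experts only

rng = np.random.default_rng(0)
d, n_experts = 16, 64
experts, w = moe_route(rng.normal(size=d), rng.normal(size=(d, n_experts)))
print(len(experts), round(w.sum(), 6))  # 8 experts active out of 64; weights sum to 1.0
```

Because the unselected experts contribute nothing to the forward pass, per-token compute scales with the active parameter count, not the total.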
DeepSeek-R1's training followed a four-stage pipeline that combined supervised learning and reinforcement learning:[1][5]
Stage 1: Cold Start. The DeepSeek-V3 base model was fine-tuned on a small set of curated chain-of-thought reasoning examples. This "cold start" data provided the model with initial examples of structured reasoning, addressing issues like repetitive loops and poor readability that occurred when applying RL directly to the base model.
Stage 2: Reasoning-Oriented Reinforcement Learning. Large-scale RL was applied using Group Relative Policy Optimization (GRPO), focused on tasks with verifiable answers (mathematics, coding, logic problems). The model learned to generate extended chains of thought and was rewarded based solely on the correctness of its final answers.
Stage 3: Rejection Sampling and Supervised Fine-Tuning. The RL-trained model generated a large set of reasoning traces. High-quality traces were selected through rejection sampling and combined with non-reasoning data (general conversation, writing, etc.) for an additional round of supervised fine-tuning.
Stage 4: Reinforcement Learning for All Scenarios. A final round of RL was applied across both reasoning and general tasks, optimizing for helpfulness and harmlessness using a combination of rule-based and model-based reward signals.
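The four stages can be sketched as a simple orchestration loop. Every function below is an illustrative stub that merely records which stage ran; none of the names correspond to DeepSeek's actual training code:

```python
# Toy sketch of the four-stage R1 pipeline; all functions are stubs.

def sft(model, data, tag):
    return model + [f"sft:{tag}"]              # supervised fine-tuning stub

def grpo_rl(model, tag):
    return model + [f"rl:{tag}"]               # GRPO reinforcement learning stub

def rejection_sample(model):
    return "high-quality reasoning traces"     # Stage 3 filtering stub

def train_r1(base):
    m = sft(base, "curated CoT examples", "cold_start")   # Stage 1: cold start
    m = grpo_rl(m, "verifiable reasoning tasks")          # Stage 2: reasoning RL
    traces = rejection_sample(m)                          # Stage 3: filter traces,
    m = sft(m, traces, "reasoning+general")               #   then SFT with general data
    m = grpo_rl(m, "all scenarios")                       # Stage 4: RL for all scenarios
    return m

stages = train_r1(["deepseek-v3-base"])
print(stages)
```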
GRPO is the reinforcement learning algorithm used to train both R1-Zero and R1. Originally proposed by DeepSeek for their earlier DeepSeek-Math model in a February 2024 paper, GRPO simplifies the RL training process compared to Proximal Policy Optimization (PPO), which had been the standard approach for language model RL training (as used in RLHF).[1][5][16]
The key innovation of GRPO is eliminating the need for a separate critic (value) model. In standard PPO-based RLHF, two models must be maintained during training: the policy model being optimized and a value model that estimates expected returns. The value model alone can be as large as the policy model, effectively doubling the computational requirements of training. GRPO removes this requirement by using a simpler baseline.[1][16]
The algorithm works as follows:[1][5][16]

1. For each prompt, sample a group of G candidate outputs from the current policy.
2. Score each output with the reward function (for example, correctness of the final answer).
3. Compute each output's advantage as its reward minus the group mean, divided by the group's standard deviation.
4. Update the policy with a PPO-style clipped objective, using these group-relative advantages in place of value-model estimates, together with a KL penalty against a reference model.
This group-relative approach has several advantages. By normalizing rewards within each group, GRPO reduces the impact of reward scale differences across different problem types. The elimination of the value model cuts training memory requirements roughly in half, allowing the same hardware to train larger models. The algorithm is also simpler to implement and tune than PPO with a learned value function.[16]
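The group-relative baseline is simple enough to state in a few lines. A minimal sketch of GRPO's advantage computation, assuming a binary correctness reward over one group of sampled completions:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: the group's own mean and standard deviation
    serve as the baseline, so no learned value model is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four sampled completions for one prompt, scored 1.0 if the answer is correct
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # correct completions get positive advantage, incorrect negative
```

Because the advantages are normalized within each group, a hard problem where only one sample succeeds produces the same scale of learning signal as an easy one where most samples succeed.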
Since R1's release, GRPO has become widely adopted in the language model community. Hugging Face's TRL library added native GRPO support, and numerous research groups have used it to train their own reasoning models. The algorithm's combination of simplicity, efficiency, and effectiveness made it particularly attractive for smaller teams and academic researchers.[16]
The reward signals used in R1's training were deliberately simple. For math problems, the reward was based on whether the final numerical answer matched the ground truth. For coding tasks, it was based on whether the generated code passed test cases. This simplicity was part of the research insight: complex reasoning behaviors could emerge from optimizing for straightforward correctness metrics, without needing elaborate reward shaping.
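A hedged sketch of what such rule-based rewards look like in practice. The regex-based answer extraction and in-process `exec` below are simplifications of my own; real verifiers typically parse structured answer formats and run generated code in a sandbox:

```python
import re

def math_reward(completion, ground_truth):
    """1.0 if the last number in the completion matches the ground truth."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if nums and float(nums[-1]) == float(ground_truth) else 0.0

def code_reward(program_src, tests):
    """1.0 if the generated program runs and passes every test case."""
    ns = {}
    try:
        exec(program_src, ns)                  # caution: no sandbox in this sketch
        return 1.0 if all(t(ns) for t in tests) else 0.0
    except Exception:
        return 0.0

print(math_reward("... so the answer is 42", "42"))                   # 1.0
print(code_reward("def add(a, b):\n    return a + b",
                  [lambda ns: ns["add"](2, 3) == 5]))                 # 1.0
```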
Before training R1, DeepSeek conducted an experiment called R1-Zero that became one of the most discussed results in the paper. R1-Zero was trained by applying reinforcement learning directly to the DeepSeek-V3 base model, without any supervised fine-tuning or curated reasoning examples. The model was simply given problems and rewarded for producing correct answers.[1][5]
Despite receiving no explicit training on how to reason, R1-Zero spontaneously developed several sophisticated reasoning behaviors during RL training:[1][5]

- Extended chain-of-thought reasoning, breaking problems into intermediate steps
- Self-reflection, in which the model revisits and questions its earlier reasoning
- Self-verification and error correction, re-checking answers before finalizing them
- Exploration of alternative solution strategies when an initial approach fails
Researchers tracked the emergence of reflective reasoning behaviors across training by measuring the frequency of specific terms in the model's outputs. The results showed a clear phase transition:[1][5][17]
| Training Stage | Reflective Term Frequency | Behavior |
|---|---|---|
| Steps 0-4,000 | Virtually absent | Model generates linear, non-reflective solutions |
| Steps 4,000-7,000 | Sporadic appearance | Occasional use of "wait," "but," "however" |
| Steps 8,000+ | Marked increase | Systematic self-monitoring and error correction |
Specific reflective terms tracked included "wait," "mistake," "however," "but," "retry," "error," "verify," "wrong," "evaluate," and "check." These terms were virtually absent in the early stages of training, appeared sporadically in the middle stages, and showed a marked increase after step 8,000, suggesting the emergence of systematic self-monitoring and error-correction behavior.[17]
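A metric of this kind is straightforward to compute. The sketch below counts the listed reflective markers as a fraction of word tokens; the tokenization and exact counting rule are my assumptions, not the study's methodology:

```python
import re

REFLECTIVE_TERMS = {"wait", "mistake", "however", "but", "retry",
                    "error", "verify", "wrong", "evaluate", "check"}

def reflective_frequency(text):
    """Fraction of word tokens that are reflective markers."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for t in tokens if t in REFLECTIVE_TERMS)
    return hits / max(len(tokens), 1)

early = "the answer is 12 because 3 times 4 equals 12"
late = "3 times 4 is 12 but wait let me verify that again to check for a mistake"
print(reflective_frequency(early) < reflective_frequency(late))  # True
```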
The model also showed a clear increase in the length of its reasoning chains over the course of training. Early in RL training, the model generated short, direct answers. As training progressed, the average response length grew steadily, with the model learning to allocate more "thinking time" to harder problems. This adaptive allocation of compute was not explicitly trained but emerged naturally from the RL optimization process.[1][5]
DeepSeek's paper highlighted what they called an "aha moment" during R1-Zero's training. At a certain point in RL training, the model showed a sudden increase in the use of reflective language (particularly the word "wait") during its reasoning chains. This marked a qualitative shift in the model's reasoning patterns, where it began systematically re-evaluating and correcting its own work rather than simply proceeding linearly through a solution.[1][5]
The aha moment became widely discussed in the AI research community. DeepSeek described it as evidence of "the self-evolution process" of the model, suggesting that reinforcement learning could induce genuinely emergent cognitive strategies. However, subsequent research by other groups has debated whether these behaviors were truly emergent or whether traces of reflective reasoning were already present in the base model's pre-training data. A study by Sea AI Lab titled "There May Not be Aha Moment in R1-Zero-like Training" argued that the observed behaviors could be attributed to pre-existing patterns in the training data rather than genuine emergence.[5][7]
Despite its impressive emergent behaviors, R1-Zero had practical limitations that motivated the development of the full R1 model. Its outputs often suffered from poor readability, with reasoning chains that mixed languages, repeated phrases endlessly, or failed to clearly delineate the final answer. The model also struggled with tasks outside of mathematics and coding, where the lack of supervised fine-tuning left it without appropriate response formats. These issues were addressed in R1 through the cold-start data and multi-stage training pipeline.[1]
DeepSeek-R1 achieved performance competitive with OpenAI's o1 across major reasoning benchmarks.
| Benchmark | DeepSeek-R1 | OpenAI o1 | GPT-4o | Description |
|---|---|---|---|---|
| AIME 2024 | 79.8% (pass@1) | 79.2% (pass@1) | 13.4% | American Invitational Mathematics Exam |
| MATH-500 | 97.3% | 96.4% | 60.3% | Mathematical problem solving |
| GPQA Diamond | 71.5% | 75.7% | 53.6% | Graduate-level science questions |
| Codeforces Elo | 2,029 | 2,061 | - | Competitive programming rating |
| MMLU | 90.8% | 91.8% | 87.2% | Multitask language understanding |
| LiveCodeBench | 65.9% | - | - | Real-world coding tasks |
| SWE-bench Verified | 49.2% | 48.9% | 33.2% | Software engineering tasks |
| HumanEval | 85.4% | 92.4% | 90.2% | Code generation |
The results showed that R1 matched or exceeded o1 on most mathematical and coding benchmarks while trailing slightly on general knowledge and code generation tasks. The fact that an open-source model could achieve these results, trained at a fraction of the cost, was the central claim that drove both the scientific interest and the market reaction.
Alongside R1, DeepSeek released six smaller "distilled" models created through knowledge distillation, where R1's reasoning capabilities were transferred to smaller, more efficient base models. DeepSeek used R1 as a teacher model to generate approximately 800,000 high-quality reasoning traces, which were then used to fine-tune smaller models from the Qwen2.5 and Llama 3 families.[1][2]
| Distilled Model | Base Model | Parameters | AIME 2024 | MATH-500 | GPQA Diamond |
|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-1.5B | 1.5B | 28.9% | 83.9% | - |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-7B | 7B | 55.5% | 92.8% | 49.1% |
| DeepSeek-R1-Distill-Llama-8B | Llama 3.1-8B | 8B | 50.4% | 89.1% | 49.0% |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | 69.7% | 93.9% | 59.1% |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | 72.6% | 94.3% | 62.1% |
| DeepSeek-R1-Distill-Llama-70B | Llama 3.3-70B | 70B | 70.0% | 94.5% | 65.2% |
The distilled models were a major part of R1's impact. The smallest model, Qwen-1.5B, outperformed GPT-4o and Claude 3.5 Sonnet on math benchmarks despite being small enough to run on consumer hardware. The 32B and 70B distilled models set new state-of-the-art results among dense (non-MoE) open-source models on reasoning benchmarks, outperforming the contemporaneous QwQ-32B-Preview by substantial margins.[1]
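The distillation recipe (generate traces with the teacher, keep the verified ones, fine-tune the student on them) can be sketched with toy stand-ins. Every function and value below is illustrative; R1's actual recipe used roughly 800,000 curated traces and full SFT runs:

```python
import random

def distill(teacher, is_correct, finetune, student, problems, per_problem=4):
    """Keep only teacher traces whose final answer verifies, then fine-tune
    the student on them via plain SFT (no RL applied to the student)."""
    dataset = [(p, t)
               for p in problems
               for t in (teacher(p) for _ in range(per_problem))
               if is_correct(t, p)]
    return finetune(student, dataset)

# Toy stand-ins: problems are (a, b) pairs; the "teacher" sometimes errs.
random.seed(0)
teacher = lambda p: p[0] + p[1] + random.choice((0, 0, 0, 1))
verified = lambda t, p: t == p[0] + p[1]
finetune = lambda s, ds: {**s, "examples": len(ds)}

student = distill(teacher, verified, finetune,
                  {"name": "student-7B"}, [(1, 2), (3, 4)])
print(student)  # only traces that passed verification become training data
```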
The distilled models enabled local deployment on consumer hardware, which became a major driver of community adoption. The hardware requirements for running different distilled models are as follows:[18]
| Distilled Model | Minimum VRAM | Recommended GPU | Performance Notes |
|---|---|---|---|
| 1.5B / 7B / 8B | 8 GB | NVIDIA RTX 3060 12GB | Runs efficiently at standard quantization |
| 14B | 12-16 GB | NVIDIA RTX 4070 Ti 16GB | Fits in VRAM at 4-bit quantization |
| 32B | 20-24 GB | NVIDIA RTX 3090/4090 24GB | Smooth performance at 4-bit quantization |
| 70B | 40-48 GB | 2x NVIDIA RTX 3090 or A100 | Requires multi-GPU or offloading |
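The VRAM figures in the table are consistent with a simple back-of-envelope estimate: quantized weight size plus some headroom for KV cache and activations. The 20% overhead factor here is an assumption for illustration, not a measured value:

```python
def vram_estimate_gb(params_billions, bits=4, overhead_frac=0.2):
    """Rough VRAM estimate: quantized weights plus ~20% overhead for
    KV cache and activations. Illustrative only; real usage varies."""
    weights_gb = params_billions * bits / 8    # 1B params at 8-bit = 1 GB
    return weights_gb * (1 + overhead_frac)

for p in (7, 14, 32, 70):
    print(f"{p}B at 4-bit: ~{vram_estimate_gb(p):.1f} GB")
```

By this estimate a 32B model at 4-bit needs roughly 19 GB, which explains why it just fits a 24 GB consumer card while the 70B model does not.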
The 32B distilled model hit a particularly attractive sweet spot, offering performance comparable to OpenAI's o1-mini on several benchmarks while running on a single consumer-grade RTX 4090 GPU. Running locally eliminated API costs, kept data private, removed rate limits, and provided offline access to explicit chain-of-thought reasoning.[18]
The distilled models proved especially popular with the open-source community. Within weeks of release, hundreds of derivative models were created on Hugging Face, fine-tuned for specific use cases ranging from medical reasoning to financial analysis.
DeepSeek released R1, R1-Zero, and all six distilled models under the MIT license, one of the most permissive open-source licenses available. This meant that anyone could download, modify, redistribute, and commercially deploy the models without restriction. The full model weights, training code, and technical paper were all made publicly available on GitHub and Hugging Face.[1][2]
The open-source strategy was a deliberate choice that amplified R1's impact far beyond what a proprietary release would have achieved. Within a month of launch, over 700 community-built models derived from R1 appeared on Hugging Face, collectively downloaded more than 5 million times. Major cloud providers including Microsoft Azure, Amazon Web Services, and Nvidia's inference platforms quickly added support for R1, making it accessible through familiar enterprise interfaces.[2][9]
DeepSeek-R1 became the most-liked model on Hugging Face among nearly 1.5 million models on the platform, surpassing 10,000 likes. The variant versions collectively exceeded 10 million downloads. R1's release also catalyzed a broader shift in the open-source AI ecosystem: the number of competitive Chinese organizations releasing models increased dramatically, with Baidu going from zero releases on Hugging Face in 2024 to over 100 in 2025, and ByteDance and Tencent each increasing releases by eight to nine times.[19]
The MIT license also enabled a wave of academic research building on R1's approach. Researchers at universities and smaller labs could study the model's reasoning traces, replicate the RL training methodology, and test hypotheses about emergent reasoning that would have been impossible with a proprietary model.
The market reaction to R1's release became a defining financial event of early 2025. On January 27, 2025, one week after R1's public release, U.S. technology stocks experienced their steepest single-day decline in history.[3][4]
The sell-off was triggered by a sudden reassessment of the AI investment thesis. For years, the market had priced technology companies, especially chipmakers and cloud providers, on the assumption that building frontier AI required massive and growing capital expenditures. DeepSeek's demonstration that a 160-person Chinese startup could produce competitive results for $5.6 million in compute undermined that assumption.[3][4]
Nvidia's stock fell nearly 17% in a single session, losing approximately $589 billion in market capitalization. This was the largest single-day market value loss for any company in history. Other semiconductor companies including Broadcom, Marvell, Micron, and TSMC also fell sharply. The Nasdaq composite lost roughly $1 trillion in value by the end of the day. Meta and Alphabet (Google's parent company) also declined significantly.[3][4]
Marc Andreessen, the prominent technology investor, described the event as "AI's Sputnik moment," drawing a parallel to the 1957 Soviet satellite launch that shocked the United States into accelerating its space program. The comparison captured the sense that a competitor working with far fewer resources had achieved something that the established players, with their billions in investment, had assumed only they could do.[4][10]
The market impact extended beyond the immediate sell-off. Chinese AI companies entered an aggressive price war, with some cutting API prices by up to 97% in the weeks following R1's release. In the United States, the event forced a public debate about whether the hundreds of billions being invested in AI data centers and chip manufacturing were truly necessary, or whether architectural innovation could substitute for raw compute.[3][4]
Stanford HAI faculty noted that DeepSeek's open releases represented "a significant step in democratizing AI," enabling smaller companies and individual developers to build on frontier-capable models without massive compute budgets.[9]
On May 28, 2025, DeepSeek released a major update to R1 designated R1-0528. Despite being described as a "minor upgrade" in official communications, the update delivered substantial improvements across all major benchmarks.[11][12]
| Benchmark | R1 (Jan 2025) | R1-0528 (May 2025) | Change |
|---|---|---|---|
| AIME 2024 | 79.8% | 91.4% | +11.6 pp |
| AIME 2025 | 70.0% | 87.5% | +17.5 pp |
| Codeforces (Div. 1) Rating | ~1,530 | ~1,930 | +400 points |
The AIME 2025 improvement from 70% to 87.5% was particularly notable, bringing R1-0528 into competitive range with OpenAI's o3 (88.9% on AIME 2025). The Codeforces rating jump of approximately 400 points reflected dramatically improved code generation and problem-solving ability.[11]
R1-0528 demonstrated deeper chain-of-thought reasoning than its predecessor. On challenging problems, the model averaged approximately 23,000 thinking tokens per query, compared to roughly 12,000 for the original R1. This near-doubling of reasoning depth, enabled by additional algorithmic optimization during post-training, contributed to the accuracy improvements.[12]
DeepSeek also reported that the rate of hallucinations (false or misleading outputs) was reduced by approximately 45-50% in scenarios such as rewriting and summarization.[12]
The update added several capabilities requested by the developer community:[12]

- Support for system prompts, removing the need for special prompt templates
- JSON output mode for structured responses
- Function calling (tool use) support
DeepSeek also released a distilled model from R1-0528: DeepSeek-R1-0528-Qwen3-8B, which achieved state-of-the-art performance among open-source 8B models on AIME 2024, surpassing the base Qwen3 8B by 10 percentage points.[12]
DeepSeek offered R1 through its API at prices dramatically lower than competing reasoning models.
| Model | Input (per 1M tokens, cache miss) | Input (per 1M tokens, cache hit) | Output (per 1M tokens) |
|---|---|---|---|
| DeepSeek-R1 | $0.55 | $0.14 | $2.19 |
| OpenAI o1 | $15.00 | $7.50 | $60.00 |
| OpenAI o3 | $2.00 | $0.50 | $8.00 |
The pricing differential was stark: R1 was roughly 27 times cheaper than o1 for both input and output tokens. Even after OpenAI's June 2025 price cuts brought o3 down to $2/$8, R1 remained approximately 3-4 times cheaper. Combined with the MIT license allowing self-hosting (eliminating API costs entirely for organizations with their own compute), R1's economics were a core part of its appeal.
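The cost ratios can be checked with a few lines of arithmetic using the table's cache-miss prices. The workload sizes are arbitrary examples:

```python
# Prices in USD per 1M tokens, taken from the comparison table (cache miss)
prices = {
    "DeepSeek-R1": {"in": 0.55, "out": 2.19},
    "OpenAI o1":   {"in": 15.00, "out": 60.00},
    "OpenAI o3":   {"in": 2.00,  "out": 8.00},
}

def cost(model, input_mtok, output_mtok):
    """Total cost for a workload measured in millions of tokens."""
    p = prices[model]
    return p["in"] * input_mtok + p["out"] * output_mtok

# Example workload: 10M input tokens, 2M output tokens
r1, o1 = cost("DeepSeek-R1", 10, 2), cost("OpenAI o1", 10, 2)
print(f"R1: ${r1:.2f}  o1: ${o1:.2f}  ratio: {o1 / r1:.1f}x")  # ratio ~27x
```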
DeepSeek-R1's impact extended well beyond its benchmark scores and market disruption. The model fundamentally changed several assumptions about AI development.
Before R1, the prevailing assumption was that training frontier reasoning models required resources available only to a handful of well-funded Western labs. DeepSeek showed that a combination of architectural innovation (MoE, MLA), efficient training algorithms (GRPO, FP8 mixed precision), and clever engineering could produce competitive results at dramatically lower cost. This finding had practical consequences: smaller companies and research institutions began building on R1's open weights rather than training models from scratch.[1][9]
R1 demonstrated that open-source AI models could match proprietary frontier models in at least one important capability dimension (reasoning). This intensified the debate within the AI industry about the relative merits of open and closed development approaches. Labs that had been reluctant to open-source their models faced renewed pressure to justify keeping weights proprietary, while the open-source community gained a powerful new proof point for their approach.[9]
R1-Zero's spontaneous development of reasoning strategies through pure RL, without being shown any examples of reasoning, contributed to the ongoing scientific debate about emergent abilities in language models. The result suggested that reasoning was not something that needed to be explicitly taught through supervised learning but could arise naturally from optimization pressure on task performance. This finding influenced subsequent research across multiple labs exploring RL-based training for reasoning.[1][5]
R1's release accelerated development timelines across the industry. In the months following R1, multiple labs released improved reasoning models: OpenAI shipped o3-mini in January 2025 and o3 in April 2025; Google released Gemini 2.5 Pro with extended thinking; and Anthropic enhanced Claude's reasoning capabilities. The competitive dynamic R1 created pushed the entire field forward at a faster pace than might otherwise have occurred.
As a model developed by a Chinese company, DeepSeek-R1 faced regulatory scrutiny in multiple Western countries. Concerns centered on data privacy (DeepSeek's servers are located in China, subject to Chinese data laws), potential content alignment with Chinese government positions, and national security implications of a widely deployed Chinese AI model.[15]
The US government response to DeepSeek was swift and multi-pronged. On February 6, 2025, Representatives Josh Gottheimer and Darin LaHood introduced the bipartisan "No DeepSeek on Government Devices Act," which specifically targeted the DeepSeek mobile application and API for prohibition on federal government devices. The bill passed in August 2025, banning federal employees from using the app.[15][20]
Additional legislative efforts included Representative Mark Green's China Technology Transfer Control Act and Senator Josh Hawley's "Decoupling America's Artificial Intelligence Capabilities from China Act," introduced on January 29, 2025. Gottheimer and LaHood also wrote to all 50 US governors urging them to implement similar bans at the state level.[15][20]
Multiple government agencies independently restricted or banned use of DeepSeek products:[15]
| Agency / Government | Action | Date |
|---|---|---|
| U.S. Navy | Issued warnings against use | Late January 2025 |
| NASA | Reinforced security concerns | January 31, 2025 |
| Texas | First state ban on government systems | February 2025 |
| Virginia | Banned on government systems | February 2025 |
| New York | Banned on government systems | February 2025 |
| U.S. Congress | Restricted on congressional devices | February 2025 |
| Pentagon | Restricted usage | February 2025 |
| Australia | Government agency restrictions | February 2025 |
| South Korea | Government restrictions | February 2025 |
| Taiwan | Government restrictions | February 2025 |
| India | Government restrictions | February 2025 |
The Fiscal Year 2026 National Defense Authorization Act, signed in December 2025, included provisions restricting DeepSeek usage within the Department of Defense and Intelligence Community.[15]
Technical security analyses identified several concerns with DeepSeek's infrastructure, including reports of hidden code linking to China Mobile servers, the collection of keystroke data, data storage on Chinese servers subject to Chinese government access, and multiple cybersecurity test failures. Content analysis also documented instances of censorship related to politically sensitive topics, including Tiananmen Square and Taiwan, with DeepSeek consistently avoiding inquiries related to these subjects.[15][20]
These restrictions applied primarily to government use. Commercial and individual use of R1's open weights remained unrestricted in most jurisdictions, and the model continued to be widely deployed through cloud providers and self-hosted infrastructure. The open-source nature of the model weights meant that security concerns about DeepSeek's servers could be entirely mitigated by self-hosting.
As of March 2026, DeepSeek-R1 and its derivatives remain among the most widely used open-source reasoning models. The R1-0528 update brought the model closer to the performance of proprietary alternatives like OpenAI's o3, while maintaining its cost and accessibility advantages.
DeepSeek has continued releasing new models beyond R1, including DeepSeek-V3.1 (August 2025), DeepSeek-V3.2-Exp (September 2025), and DeepSeek-OCR (October 2025). The company's anticipated DeepSeek-V4 release, expected in April 2026, is likely to include a next-generation reasoning component that builds on R1's approach.
The model's legacy is perhaps best measured by its influence on the field. R1 proved that reasoning-capable language models could be built openly and cheaply, that reinforcement learning could induce genuine reasoning behaviors without supervised examples, and that a small team with limited resources could compete with the largest AI labs in the world. These demonstrations reshaped the economics, strategy, and research direction of the AI industry in ways that continue to play out.