| DeepSeek-R1 | |
|---|---|
| Developer | DeepSeek |
| Release date | January 20, 2025 |
| Type | Large language model (reasoning) |
| Architecture | Mixture of Experts (MoE), Transformer |
| Parameters | 671 billion total; 37 billion active per token |
| Context length | 128,000 tokens |
| Training method | Reinforcement learning with GRPO |
| Training compute | 512 Nvidia H800 GPUs, ~80 hours |
| Reported training cost | $294,000 (R1 RL stage only) |
| License | MIT |
| Updated version | DeepSeek-R1-0528 (May 28, 2025) |
| Successor lineage | DeepSeek-V3.1 (Aug 2025), V3.2 (Dec 2025) |
| Paper | arXiv:2501.12948; Nature 645, 633-638 (Sept 18, 2025) |
DeepSeek-R1 is an open-source reasoning-focused large language model developed by DeepSeek, a Chinese artificial intelligence company spun out of the High-Flyer quantitative hedge fund. Released on January 20, 2025 under the MIT license, R1 was the first open-weight model to match the reasoning performance of OpenAI's proprietary o1 across mathematics, coding, and scientific reasoning tasks. The model uses a Mixture of Experts architecture with 671 billion total parameters, of which 37 billion are activated per forward pass, keeping computational costs manageable during inference.[1][2]
DeepSeek-R1's release triggered what became known as the "DeepSeek shock," a market event on January 27, 2025 that erased over $1 trillion from U.S. technology stocks in a single trading session. Nvidia alone lost approximately $589 billion in market capitalization, the largest single-day loss for any company in stock market history. The shock stemmed from the revelation that a small Chinese startup with roughly 160 employees had trained a reasoning model competitive with the world's most expensive AI systems at a fraction of the hundreds of millions typically spent by Western labs: the V3 base model had been trained for a reported $5.6 million, and the later Nature paper put R1's RL stage at just $294,000.[3][4][22]
Beyond its market impact, DeepSeek-R1 was scientifically significant for demonstrating that complex reasoning behaviors could emerge from pure reinforcement learning without supervised fine-tuning. The companion model DeepSeek-R1-Zero, trained entirely through RL with rule-based rewards, developed chain-of-thought reasoning, self-reflection, and error correction spontaneously during training, a finding that challenged assumptions about how reasoning capabilities must be instilled in language models. The accompanying paper became the first major open-weight LLM to pass independent peer review, appearing on the cover of Nature on September 18, 2025.[1][5][22]
DeepSeek's path to R1 began with a hedge fund, not an AI lab. High-Flyer, a Chinese quantitative trading firm co-founded in February 2016 by Liang Wenfeng and two of his classmates from Zhejiang University, accumulated tens of thousands of Nvidia GPUs over the late 2010s for stock-prediction and high-frequency-trading workloads. By 2020 the firm had built one of the largest private AI training clusters in China. Liang spun the AI research arm into an independent company, DeepSeek, in May 2023, and seeded it with engineers experienced in squeezing performance out of large GPU pools. Unlike most well-funded AI startups, DeepSeek was bootstrapped from hedge fund profits and had no external investors at the time of R1's release.[6][23]
DeepSeek had been building toward R1 throughout 2024. The company released DeepSeek-V2 in May 2024 and DeepSeek-V3 in December 2024, both using Mixture of Experts architectures that prioritized computational efficiency. V3 served as the base model for R1's training, providing a strong foundation of general language capabilities and world knowledge. V3 itself was trained on roughly 14.8 trillion tokens for an estimated $5.576 million in GPU rental cost, a figure that would later feature prominently in debates over R1's true total cost.[2][6][22]
The broader context for R1's development was the emergence of inference-time reasoning as a new paradigm in AI. OpenAI's o1, released in September 2024, had demonstrated that training models with reinforcement learning to "think before answering" could dramatically improve performance on difficult tasks. OpenAI did not publish technical details about how o1 was trained, leaving the broader research community to guess at the recipe. DeepSeek's contribution was to show that this approach could be replicated with open-source models at a fraction of the cost, and that the reasoning behaviors could emerge more naturally than previously assumed. The paper made the recipe public.[1]
DeepSeek-R1 is built on top of DeepSeek-V3, which uses a Mixture of Experts transformer architecture. The key architectural features include:[2][6]
- Mixture of Experts (MoE) feed-forward layers: 671 billion total parameters, of which only 37 billion are activated per token by a learned router
- Multi-head Latent Attention (MLA): a low-rank compression of the key-value cache that reduces attention memory during inference
- A context window of 128,000 tokens

The MoE architecture is central to R1's efficiency story. By activating only 37 billion of its 671 billion parameters for each token, the model achieves inference costs comparable to a much smaller dense model while maintaining the knowledge capacity of its full parameter count. MLA further reduces memory pressure during long-context inference, which matters for reasoning models that emit thousands of intermediate tokens before answering.
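To make the routing idea concrete, the following is a minimal sketch of generic top-k expert gating in Python. It illustrates why per-token compute scales with the number of *selected* experts rather than the total, but it is not DeepSeek's actual router (DeepSeekMoE adds shared experts and its own gating scheme); all names here are illustrative.

```python
import numpy as np

def top_k_route(token_hidden: np.ndarray, gate_weights: np.ndarray, k: int = 8):
    """Generic top-k MoE gating: pick the k experts with the highest scores.

    token_hidden: (d,) hidden state for one token.
    gate_weights: (n_experts, d) router matrix.
    Only the k selected expert FFNs run for this token, so per-token compute
    scales with k, not with n_experts -- the source of MoE's efficiency.
    """
    scores = gate_weights @ token_hidden              # one score per expert
    chosen = np.argsort(scores)[-k:]                  # indices of the top-k experts
    probs = np.exp(scores[chosen] - scores[chosen].max())
    return chosen, probs / probs.sum()                # experts plus mixing weights
```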
DeepSeek-R1's training followed a four-stage pipeline that combined supervised learning and reinforcement learning:[1][5]
Stage 1: Cold start. The DeepSeek-V3 base model was fine-tuned on a small set of curated chain-of-thought reasoning examples (a few thousand long-CoT samples). This cold-start data provided the model with initial examples of structured reasoning, addressing issues like repetitive loops and poor readability that occurred when applying RL directly to the base model in the R1-Zero experiment.
Stage 2: Reasoning-oriented reinforcement learning. Large-scale RL was applied using Group Relative Policy Optimization (GRPO), focused on tasks with verifiable answers (mathematics, coding, logic problems). The model learned to generate extended chains of thought and was rewarded based solely on the correctness of its final answers. A language-consistency reward was also added to discourage the language mixing seen in R1-Zero.
Stage 3: Rejection sampling and supervised fine-tuning. The RL-trained model generated a large set of reasoning traces. High-quality traces were selected through rejection sampling and combined with non-reasoning data to produce a curated dataset of approximately 800,000 samples (about 600,000 reasoning samples and 200,000 general samples covering writing, factual QA, self-cognition, and translation). DeepSeek-V3-Base was then fine-tuned on this dataset for two epochs.[1][24]
Stage 4: Reinforcement learning for all scenarios. A final round of RL was applied across both reasoning and general tasks, optimizing for helpfulness and harmlessness using a combination of rule-based and model-based reward signals. This stage tightened the model's behavior on conversational tasks where rule-based rewards were not available.
The rejection-sampling step in Stage 3 also fed the recipe used to train all six distilled variants, since the same 800K-sample dataset was used to fine-tune smaller open-source base models.[1]
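As a rough illustration of the Stage 3 selection loop, here is a minimal rejection-sampling sketch. The `generate` and `verify` callables are placeholders standing in for the RL-trained model and a rule-based answer checker; the paper's actual filtering was richer (it also discarded traces with mixed languages and poor readability).

```python
def rejection_sample(prompts, generate, verify, k=16):
    """Collect reasoning traces whose final answer passes verification.

    generate(prompt, k) -> list of k sampled completions (placeholder).
    verify(prompt, completion) -> bool, e.g. exact match on a math answer
    or a passing test suite for code (placeholder).
    """
    dataset = []
    for prompt in prompts:
        for completion in generate(prompt, k):
            if verify(prompt, completion):
                dataset.append({"prompt": prompt, "completion": completion})
    return dataset  # becomes SFT data, mixed with non-reasoning samples
```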
In the Nature publication of September 2025, DeepSeek disclosed that the reinforcement learning portion of R1's training used 512 Nvidia H800 GPUs for approximately 80 hours, at an estimated rental cost of $294,000 (assuming a then-current $2 per GPU-hour rental rate). Supplementary materials acknowledged for the first time that DeepSeek also owned A100 GPUs and used them for preparatory experiments at smaller scale.[22]
This $294,000 figure refers narrowly to the RL stage that turned DeepSeek-V3 into R1. It does not include the cost of training V3 itself (around $5.6 million for the base model), the cost of generating cold-start data, the cost of distillation, salaries, or the depreciated cost of the GPU cluster. Several outlets, including The Register and CNN Business, noted that the true end-to-end cost of producing R1 was roughly an order of magnitude higher than the headline number, though still dramatically below the budgets of comparable Western reasoning models.[22][25]
GRPO is the reinforcement learning algorithm used to train both R1-Zero and R1. Originally proposed by DeepSeek for their earlier DeepSeekMath model in a February 2024 paper, GRPO simplifies the RL training process compared to Proximal Policy Optimization (PPO), which had been the standard approach for language model RL training (as used in RLHF).[1][5][16]
The key innovation of GRPO is eliminating the need for a separate critic (value) model. In standard PPO-based RLHF, two models must be maintained during training: the policy model being optimized and a value model that estimates expected returns. The value model alone can be as large as the policy model, effectively doubling the computational requirements of training. GRPO removes this requirement by using a simpler baseline derived from group statistics.[1][16]
The algorithm works as follows:[1][5][16]
1. For each training prompt, the current policy samples a group of G candidate outputs.
2. Each output receives a scalar reward from a rule-based verifier. For mathematics, the reward reflects whether the final answer (extracted from a \boxed{...} directive) matches the ground truth. For code, it means whether the generated program passes a hidden test suite.
3. Each output's advantage is computed relative to its group: the group's mean reward is subtracted and the result is divided by the group's standard deviation, so no learned value model is needed.
4. The policy is updated with a PPO-style clipped objective using these group-relative advantages, with a KL penalty keeping it close to a reference policy.

This group-relative approach has several advantages. By normalizing rewards within each group, GRPO reduces the impact of reward scale differences across different problem types. The elimination of the value model cuts training memory requirements roughly in half, allowing the same hardware to train larger models. The algorithm is also simpler to implement and tune than PPO with a learned value function.[16]
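A minimal sketch of the group-relative advantage computation, assuming binary rewards; variable names are illustrative, and the full GRPO update applies a clipped policy-gradient objective and a KL penalty on top of these advantages.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO advantage for one group of outputs sampled from the same prompt.

    The group mean serves as the baseline that a learned value model would
    otherwise provide; dividing by the group std normalizes reward scale.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 sampled solutions to one problem, 3 of them correct.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))
# Correct samples receive a positive advantage, incorrect ones a negative one.
```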
DeepSeek used a deliberately simple reward design: an accuracy reward (right or wrong on a verifiable answer) plus a format reward (the model is required to enclose its reasoning in <think>...</think> tags and its final answer in a designated location). No reward model based on human preference data was used during the reasoning-oriented stages, which sidestepped one of the most expensive components of conventional RLHF and one of the harder failure modes (reward hacking against a learned reward model).[1][5]
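A sketch of what such a rule-based reward can look like in practice, assuming a math task with a single \boxed{...} answer. The exact reward weighting and answer-matching logic DeepSeek used are not published in this form, so treat the details below as illustrative.

```python
import re

THINK_RE = re.compile(r"^<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def format_reward(completion: str) -> float:
    """1.0 if the reasoning is wrapped in <think>...</think> at the start."""
    return 1.0 if THINK_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the last \\boxed{...} expression matches the reference answer."""
    answers = BOXED_RE.findall(completion)
    return 1.0 if answers and answers[-1].strip() == ground_truth.strip() else 0.0

def reward(completion: str, ground_truth: str) -> float:
    return accuracy_reward(completion, ground_truth) + format_reward(completion)
```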
Since R1's release, GRPO has become widely adopted in the language model community. Hugging Face's TRL library added native GRPO support, and numerous research groups have used it to train their own reasoning models. The algorithm's combination of simplicity, efficiency, and effectiveness made it particularly attractive for smaller teams and academic researchers who could not afford the memory overhead of full PPO. By the end of 2025, GRPO and its derivatives (DAPO, GRPO+ from Qwen, REINFORCE++ variants) had become the de facto standard for training open-source reasoning models, largely displacing PPO as the preferred recipe for post-training reasoning behaviors.[16]
Before training R1, DeepSeek conducted an experiment called R1-Zero that became one of the most discussed results in the paper. R1-Zero was trained by applying reinforcement learning directly to the DeepSeek-V3 base model, without any supervised fine-tuning or curated reasoning examples. The model was simply given problems and rewarded for producing correct answers.[1][5]
R1-Zero used a deliberately minimal prompt template that asked the base model to enclose its reasoning inside <think>...</think> tags and its final answer inside <answer>...</answer> tags. No examples of reasoning were provided. The base model began by emitting essentially random text inside the think tags, but the GRPO training loop pushed it toward producing reasoning content that actually helped it answer correctly. Over tens of thousands of training steps, that pressure produced increasingly structured reasoning behaviors.[1]
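Expressed as a Python string, the template looked roughly like the following; this is a close paraphrase of the paper's wording, so treat the exact phrasing as approximate.

```python
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The Assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the "
    "answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively.\n"
    "User: {question}\nAssistant:"
)

prompt = R1_ZERO_TEMPLATE.format(question="Solve x^2 - 5x + 6 = 0.")
```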
Despite receiving no explicit training on how to reason, R1-Zero spontaneously developed several sophisticated reasoning behaviors during RL training:[1][5]
- Self-verification: checking intermediate steps before committing to an answer
- Reflection: revisiting and re-evaluating earlier reasoning, often signaled by words like "wait"
- Error correction and backtracking to alternative solution strategies
- Increasingly long chains of thought, with more thinking tokens allocated to harder problems
The paper reported a striking accuracy trajectory on the AIME 2024 mathematics olympiad. R1-Zero's pass@1 score climbed from 15.6% at the start of RL training to 71.0% by the end of training. With self-consistency (majority voting over 64 samples) the score reached 86.7%. By the end of the run, R1-Zero on its own had matched o1-preview-level scores on AIME using nothing more than RL on a base model that had never seen a reasoning example.[1][22]
Researchers tracked the emergence of reflective reasoning behaviors across training by measuring the frequency of specific terms in the model's outputs. The results showed a clear phase transition:[1][5][17]
| Training stage | Reflective term frequency | Behavior |
|---|---|---|
| Steps 0-4,000 | Virtually absent | Model generates linear, non-reflective solutions |
| Steps 4,000-7,000 | Sporadic appearance | Occasional use of "wait," "but," "however" |
| Steps 8,000+ | Marked increase | Systematic self-monitoring and error correction |
Specific reflective terms tracked included "wait," "mistake," "however," "but," "retry," "error," "verify," "wrong," "evaluate," and "check." These terms were virtually absent in the early stages of training, appeared sporadically in the middle stages, and showed a marked increase after step 8,000, suggesting the emergence of temporal reasoning or self-monitoring behavior.[17]
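A simple recreation of that measurement, assuming whitespace tokenization; the cited study's exact counting methodology may differ.

```python
REFLECTIVE_TERMS = {"wait", "mistake", "however", "but", "retry",
                    "error", "verify", "wrong", "evaluate", "check"}

def reflective_term_rate(outputs: list[str]) -> float:
    """Average count of reflective terms per model output."""
    total = sum(
        sum(1 for word in text.lower().split()
            if word.strip(".,!?") in REFLECTIVE_TERMS)
        for text in outputs
    )
    return total / max(len(outputs), 1)

# Tracked across training checkpoints, this rate was near zero before
# step ~4,000 and rose sharply after step ~8,000 in the study cited above.
```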
The model also showed a clear increase in the length of its reasoning chains over the course of training. Early in RL training, the model generated short, direct answers. As training progressed, the average response length grew steadily from a few hundred tokens to several thousand, with the model learning to allocate more thinking time to harder problems. This adaptive allocation of test-time compute was not explicitly trained but emerged naturally from the optimization process.[1][5]
DeepSeek's paper highlighted what they called an "aha moment" during R1-Zero's training. At a certain point in RL training, the model showed a sudden increase in the use of reflective language (particularly the word "wait") during its reasoning chains. The paper printed an excerpt of one such moment, where the model interrupts itself in the middle of a math problem with the phrase "Wait, wait. Wait. That's an aha moment I can flag here." before backtracking to a different approach and getting the right answer. This marked a qualitative shift in the model's reasoning patterns, where it began systematically re-evaluating and correcting its own work rather than simply proceeding linearly through a solution.[1][5]
The aha moment became widely discussed in the AI research community. DeepSeek described it as evidence of "the self-evolution process" of the model, suggesting that reinforcement learning could induce genuinely emergent cognitive strategies. However, subsequent research by other groups has debated whether these behaviors were truly emergent or whether traces of reflective reasoning were already present in the base model's pre-training data. A study by Sea AI Lab titled "There May Not be Aha Moment in R1-Zero-like Training" argued that the observed behaviors could be attributed to pre-existing patterns in the training data rather than genuine emergence, and replicated similar trajectories starting from base models that had been pre-trained on web text containing reasoning-style writing.[5][7]
Despite its impressive emergent behaviors, R1-Zero had practical limitations that motivated the development of the full R1 model. Its outputs often suffered from poor readability, with reasoning chains that mixed Chinese and English mid-sentence, repeated phrases endlessly, or failed to clearly delineate the final answer. The model also struggled with tasks outside of mathematics and coding, where the lack of supervised fine-tuning left it without appropriate response formats. These issues were addressed in R1 through the cold-start data, the language-consistency reward in Stage 2, and the multi-stage training pipeline.[1]
Even with these flaws, R1-Zero on its own represented a notable scientific result: pure RL on a strong base model produced a competitive reasoning model. R1-Zero was released alongside R1 under the same MIT license so that researchers could study the unfiltered behavior of an RL-only reasoning model.
DeepSeek-R1 achieved performance competitive with OpenAI's o1 across major reasoning benchmarks.
| Benchmark | DeepSeek-R1 | OpenAI o1 (Dec 2024) | GPT-4o | Description |
|---|---|---|---|---|
| AIME 2024 (pass@1) | 79.8% | 79.2% | 13.4% | American Invitational Mathematics Exam |
| MATH-500 | 97.3% | 96.4% | 60.3% | Mathematical problem solving |
| GPQA Diamond | 71.5% | 75.7% | 53.6% | Graduate-level science questions |
| Codeforces (Elo / percentile) | 2,029 / 96.3 | 2,061 / 96.6 | n/a | Competitive programming rating |
| MMLU | 90.8% | 91.8% | 87.2% | Multitask language understanding |
| MMLU-Pro | 84.0% | 81.9% | 73.3% | Harder MMLU variant |
| LiveCodeBench (CoT) | 65.9% | 63.4% | 33.4% | Real-world coding tasks |
| SWE-bench Verified | 49.2% | 48.9% | 33.2% | Software engineering tasks |
| HumanEval | 85.4% | 92.4% | 90.2% | Code generation |
| AlpacaEval 2.0 (LC) | 87.6% | n/a | 51.1% | Open-ended instruction following |
| ArenaHard | 92.3% | n/a | 80.4% | Adversarial chat eval |
The results showed that R1 matched or exceeded o1 on most mathematical and coding benchmarks while trailing slightly on graduate-level science (GPQA Diamond) and short-form code generation (HumanEval). The fact that an open-source model could achieve these results, trained at a fraction of the cost, was the central claim that drove both the scientific interest and the market reaction. Importantly, R1's chat-style benchmarks (AlpacaEval, ArenaHard) showed that the multi-stage training preserved instruction-following quality even as it added reasoning capability.
Alongside R1, DeepSeek released six smaller distilled models created through knowledge distillation, where R1's reasoning capabilities were transferred to smaller, more efficient base models. DeepSeek reused the 800,000-sample dataset curated in Stage 3 (roughly 600,000 reasoning traces generated by R1 plus 200,000 general samples) as supervised fine-tuning data for smaller bases from the Qwen2.5 and Llama 3 families. No additional RL was applied to the distilled models in this initial release.[1][2]
| Distilled model | Base model | Parameters | License | AIME 2024 | MATH-500 | GPQA Diamond |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1.5B | Apache 2.0 + MIT | 28.9% | 83.9% | 33.8% |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 7B | Apache 2.0 + MIT | 55.5% | 92.8% | 49.1% |
| DeepSeek-R1-Distill-Llama-8B | Llama 3.1-8B | 8B | Llama 3 + MIT | 50.4% | 89.1% | 49.0% |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | Apache 2.0 + MIT | 69.7% | 93.9% | 59.1% |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | Apache 2.0 + MIT | 72.6% | 94.3% | 62.1% |
| DeepSeek-R1-Distill-Llama-70B | Llama 3.3-70B-Instruct | 70B | Llama 3 + MIT | 70.0% | 94.5% | 65.2% |
The distilled models were a major part of R1's impact. The smallest model, Qwen-1.5B, outperformed GPT-4o and Claude 3.5 Sonnet on math benchmarks despite being small enough to run on consumer hardware. The 32B and 70B distilled models set new state-of-the-art results among dense (non-MoE) open-source models on reasoning benchmarks, outperforming the contemporaneous QwQ-32B-Preview by substantial margins. Notably, the 32B distillation reached 72.6% on AIME 2024, beating OpenAI's o1-mini (63.6%) by nine points.[1]
Distilled models inherited the licenses of their base checkpoints. Qwen-based distills are released under Apache 2.0 (with the fine-tuning weights themselves under MIT), while Llama-based distills are governed by their respective Meta Llama community licenses with the fine-tuning weights under MIT.[2]
The distilled models enabled local deployment on consumer hardware, which became a major driver of community adoption. The hardware requirements for running different distilled models are as follows:[18]
| Distilled model | Minimum VRAM | Recommended GPU | Performance notes |
|---|---|---|---|
| 1.5B / 7B / 8B | 8 GB | NVIDIA RTX 3060 12GB | Runs efficiently at standard quantization |
| 14B | 12-16 GB | NVIDIA RTX 4070 Ti 16GB | Fits in VRAM at 4-bit quantization |
| 32B | 20-24 GB | NVIDIA RTX 3090/4090 24GB | Smooth performance at 4-bit quantization |
| 70B | 40-48 GB | 2x NVIDIA RTX 3090 or A100 | Requires multi-GPU or offloading |
The 32B distilled model hit a particularly attractive sweet spot, offering performance comparable to OpenAI's o1-mini on several benchmarks while running on a single consumer-grade RTX 4090 GPU. Running locally eliminated API costs, kept data private, removed rate limits, and provided offline access to explicit chain-of-thought reasoning.[18]
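As an illustration, a distilled checkpoint can be run locally with Hugging Face transformers in a few lines. This is a minimal sketch: the model ID below is the published Hugging Face identifier for the 7B distill, the temperature follows DeepSeek's recommended 0.6, and quantized llama.cpp builds are the lighter-weight alternative for the VRAM budgets in the table above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model emits its chain of thought in <think>...</think> before answering.
output = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```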
The distilled models proved especially popular with the open-source community. Within weeks of release, hundreds of derivative models were created on Hugging Face, fine-tuned for specific use cases ranging from medical reasoning to financial analysis. Hobbyists ran the 8B and 14B distills on Apple Silicon laptops via llama.cpp and MLX, providing the first time many users had access to a fully local reasoning model that could explain its work.
DeepSeek included an ablation in the paper comparing two ways of giving small models reasoning capabilities: distilling traces from a strong reasoning teacher (R1) versus running GRPO directly on a small base model. Distillation won decisively. The 32B Qwen base distilled from R1 outperformed the same base trained with R1's RL recipe directly. The takeaway was that the cheap path for small reasoning models is to distill from a large reasoning teacher rather than run RL from scratch, a finding that influenced subsequent open-source training recipes.[1]
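A sketch of what one distillation training example looks like under these recipes, assuming the <think>-tagged output format the distilled models emit; the field names are illustrative.

```python
def to_sft_example(question: str, reasoning: str, final_answer: str) -> dict:
    """Package one verified teacher trace as a supervised fine-tuning pair.

    The student base model is then fine-tuned with ordinary next-token
    prediction on the completion -- no reinforcement learning involved.
    """
    return {
        "prompt": question,
        "completion": f"<think>\n{reasoning}\n</think>\n\n{final_answer}",
    }
```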
DeepSeek released R1, R1-Zero, and all six distilled models under the MIT license, one of the most permissive open-source licenses available. The license explicitly permits commercial use, modification, redistribution, model distillation, and the use of API outputs to train other models. The full model weights, training code references, and technical paper were all made publicly available on GitHub and Hugging Face.[1][2]
This explicit grant of distillation rights was unusual. Most proprietary AI providers either prohibit using their outputs to train competing models or leave the question ambiguous. DeepSeek's terms made it clear that researchers could legally train students on R1's outputs, which removed legal friction from the wave of follow-on work.
The open-source strategy was a deliberate choice that amplified R1's impact far beyond what a proprietary release would have achieved. Within a month of launch, over 700 community-built models derived from R1 appeared on Hugging Face, collectively downloaded more than 5 million times. Major cloud providers including Microsoft Azure, Amazon Web Services, and Nvidia's inference platforms quickly added support for R1, making it accessible through familiar enterprise interfaces.[2][9]
DeepSeek-R1 became the most-liked model on Hugging Face among nearly 1.5 million models on the platform, surpassing 10,000 likes within weeks of release. The variant versions collectively exceeded 10 million downloads. R1's release also catalyzed a broader shift in the open-source AI ecosystem: the number of competitive Chinese organizations releasing models increased dramatically, with Baidu going from zero releases on Hugging Face in 2024 to over 100 in 2025, and ByteDance and Tencent each increasing releases by eight to nine times.[19]
The MIT license also enabled a wave of academic research building on R1's approach. Researchers at universities and smaller labs could study the model's reasoning traces, replicate the RL training methodology, and test hypotheses about emergent reasoning that would have been impossible with a proprietary model. Several groups, including Hugging Face's Open-R1 project, Berkeley's NovaSky team (Sky-T1), and the Together AI / Stanford collaboration on TinyZero, attempted full replications of the R1 training pipeline using only public data and open base models.
The market reaction to R1's release became a defining financial event of early 2025. On January 27, 2025, one week after R1's public release, U.S. technology stocks experienced their steepest single-day decline in history.[3][4]
The sell-off was triggered by a sudden reassessment of the AI investment thesis. For years, the market had priced technology companies, especially chipmakers and cloud providers, on the assumption that building frontier AI required massive and growing capital expenditures. DeepSeek's demonstration that a 160-person Chinese startup could produce competitive results undermined that assumption.[3][4]
Nvidia's stock fell nearly 17% in a single session, closing at $118.58 and losing approximately $589 billion in market capitalization. This was the largest single-day market value loss for any company in history. Other semiconductor companies including Broadcom, Marvell, Micron, and TSMC also fell sharply. The Nasdaq composite lost roughly $1 trillion in value by the end of the day. Meta and Alphabet (Google's parent company) also declined significantly. Apple briefly retook the title of world's most valuable company as Nvidia fell to roughly $2.8 trillion in market cap.[3][4]
The DeepSeek mobile app reached number one on the Apple App Store in the United States on January 27, displacing ChatGPT. That ranking became part of the news cycle around the stock drop, with retail investors and journalists pointing to the consumer ranking as a tangible sign that something had changed.
Marc Andreessen, the prominent technology investor, described the event as "AI's Sputnik moment," drawing a parallel to the 1957 Soviet satellite launch that shocked the United States into accelerating its space program. The comparison captured the sense that a competitor working with far fewer resources had achieved something that the established players, with their billions in investment, had assumed only they could do.[4][10] President Donald Trump, speaking at a Republican retreat the same week, called R1 a "wake-up call for our industries that we need to be laser-focused on competing to win."
The market impact extended beyond the immediate sell-off. Chinese AI companies entered an aggressive price war, with some cutting API prices by up to 97% in the weeks following R1's release. In the United States, the event forced a public debate about whether the hundreds of billions being invested in AI data centers and chip manufacturing were truly necessary, or whether architectural innovation could substitute for raw compute.[3][4]
Stanford HAI faculty noted that DeepSeek's open releases represented "a significant step in democratizing AI," enabling smaller companies and individual developers to build on frontier-capable models without massive compute budgets.[9] Within days, multiple Western labs publicly accelerated their own reasoning-model roadmaps. OpenAI shipped o3-mini on January 31, 2025, and Anthropic added an extended thinking mode to Claude 3.7 Sonnet in February.
On May 28, 2025, DeepSeek released a major update to R1 designated R1-0528. Despite being described as a "minor upgrade" in official communications, the update delivered substantial improvements across all major benchmarks.[11][12]
| Benchmark | R1 (Jan 2025) | R1-0528 (May 2025) | Change |
|---|---|---|---|
| AIME 2024 | 79.8% | 91.4% | +11.6 |
| AIME 2025 | 70.0% | 87.5% | +17.5 |
| HMMT 2025 | 41.7% | 79.4% | +37.7 |
| CNMO 2024 | 78.8% | 86.9% | +8.1 |
| LiveCodeBench (2408-2505) | 63.5% | 73.3% | +9.8 |
| Codeforces-Div1 rating | ~1,530 | ~1,930 | +400 |
| SWE-bench Verified | 49.2% | 57.6% | +8.4 |
| Aider-Polyglot | 53.3% | 71.6% | +18.3 |
| MMLU-Redux | 92.9% | 93.4% | +0.5 |
| MMLU-Pro | 84.0% | 85.0% | +1.0 |
| GPQA Diamond | 71.5% | 81.0% | +9.5 |
| Humanity's Last Exam | 8.5% | 17.7% | +9.2 |
| FRAMES | 82.5% | 83.0% | +0.5 |
The AIME 2025 improvement from 70% to 87.5% was particularly notable, bringing R1-0528 into competitive range with OpenAI's o3 (88.9% on AIME 2025). The Codeforces rating jump of approximately 400 points reflected dramatically improved code generation and problem-solving ability. The HMMT 2025 improvement of nearly 38 points was the single largest gain, reflecting deeper engagement with multi-step competition mathematics.[11][26]
R1-0528 also added two tool-use benchmarks where the original R1 had not reported numbers: BFCL_v3 (Berkeley Function Calling Leaderboard, multi-turn) at 37.0% and Tau-Bench (airline 53.5%, retail 63.9%), reflecting the new function-calling capability.[26]
R1-0528 demonstrated deeper chain-of-thought reasoning than its predecessor. On AIME 2025 problems, the model averaged approximately 23,000 thinking tokens per query, compared to roughly 12,000 for the original R1, a roughly 92% increase in reasoning depth. This near-doubling, enabled by additional algorithmic optimization during post-training, contributed to the accuracy improvements.[12][26]
DeepSeek also reported that the rate of hallucinations (false or misleading outputs) was reduced by approximately 45 to 50% in scenarios such as rewriting and summarization.[12]
The update added several capabilities requested by the developer community:[12]
- System prompt support (the original R1 release recommended placing all instructions in the user message)
- Function calling and tool use
- JSON output mode
- No longer requiring a leading <think>\n formatting to trigger the thinking mode

DeepSeek also released a distilled model from R1-0528: DeepSeek-R1-0528-Qwen3-8B, which achieved state-of-the-art performance among open-source 8B models on AIME 2024 at 86.0%, surpassing the base Qwen3-8B by 10 percentage points and matching the performance of the much larger Qwen3-235B-Thinking on the same benchmark.[12][26]
DeepSeek offered R1 through its API, under the model identifier deepseek-reasoner, at prices dramatically lower than competing reasoning models.
| Model | Input (per 1M tokens, cache miss) | Input (per 1M tokens, cache hit) | Output (per 1M tokens) |
|---|---|---|---|
| DeepSeek-R1 | $0.55 | $0.14 | $2.19 |
| OpenAI o1 | $15.00 | $7.50 | $60.00 |
| OpenAI o3 (June 2025 cut) | $2.00 | $0.50 | $8.00 |
| Anthropic Claude 3.7 Sonnet (extended thinking) | $3.00 | $0.30 | $15.00 |
The pricing differential was stark: R1 was roughly 27 times cheaper than o1 for both input and output tokens. Even after OpenAI's June 2025 price cuts brought o3 down to $2/$8, R1 remained approximately 3 to 4 times cheaper. Combined with the MIT license allowing self-hosting (eliminating API costs entirely for organizations with their own compute), R1's economics were a core part of its appeal.
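DeepSeek's API is OpenAI-compatible, so the standard openai client works against its endpoint. A minimal sketch, with the endpoint and field names (notably reasoning_content for the exposed chain of thought) as documented by DeepSeek at the time of writing:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY",
                base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",  # R1, later transparently upgraded to R1-0528
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

message = response.choices[0].message
print(message.reasoning_content)  # the chain of thought, returned separately
print(message.content)            # the final answer
```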
Third-party API providers including Together AI, Fireworks AI, Groq, OpenRouter, Hyperbolic, and Lambda all offered hosted endpoints for R1 within days of release, often at competitive prices and sometimes with faster inference than DeepSeek's own API. Groq in particular advertised R1-Distill-Llama-70B running on its LPU hardware at over 200 tokens per second, several times faster than typical GPU-based deployments.
On September 18, 2025, the DeepSeek-R1 paper appeared on the cover of Nature (volume 645, issue 8081, pages 633 to 638), becoming the first major open-weight large language model to be the subject of a peer-reviewed Nature paper. The corresponding author was Liang Wenfeng, with 199 co-authors from DeepSeek-AI listed.[5][22]
The peer-reviewed version added information that had not appeared in the January arXiv preprint:[22]
- The compute and cost figures for the RL stage (512 Nvidia H800 GPUs for roughly 80 hours, about $294,000 at assumed rental rates)
- The acknowledgment that DeepSeek also owned Nvidia A100 GPUs and used them for smaller-scale preparatory experiments
- A statement addressing contamination and distillation concerns, acknowledging that web-scraped training data inevitably contains LLM-generated text
The peer review was unusually public for an AI paper. Nature published the reviewer comments and DeepSeek's responses alongside the article, an editorial choice that was widely welcomed in the AI research community as a step toward more rigorous publication practices for industrial AI work.[5]
DeepSeek-R1's impact extended well beyond its benchmark scores and market disruption. The model fundamentally changed several assumptions about AI development.
Before R1, the prevailing assumption was that training frontier reasoning models required resources available only to a handful of well-funded Western labs. DeepSeek showed that a combination of architectural innovation (MoE, MLA), efficient training algorithms (GRPO, FP8 mixed precision), and clever engineering could produce competitive results at dramatically lower cost. This finding had practical consequences: smaller companies and research institutions began building on R1's open weights rather than training models from scratch.[1][9]
R1 demonstrated that open-source AI models could match proprietary frontier models in at least one important capability dimension (reasoning). This intensified the debate within the AI industry about the relative merits of open and closed development approaches. Labs that had been reluctant to open-source their models faced renewed pressure to justify keeping weights proprietary, while the open-source community gained a powerful new proof point for their approach. Meta's Yann LeCun, a vocal advocate of open-source AI, repeatedly cited R1 as evidence that open weights would catch closed weights in capability.[9]
R1-Zero's spontaneous development of reasoning strategies through pure RL contributed to the ongoing scientific debate about emergent abilities in language models. The result suggested that reasoning was not something that needed to be explicitly taught through supervised learning but could arise naturally from optimization pressure on task performance. This finding influenced subsequent research across multiple labs exploring RL-based training for reasoning.[1][5]
R1's release accelerated development timelines across the industry. Several months after R1, multiple labs released improved reasoning models: OpenAI shipped o3-mini in January 2025, o3 in April 2025, and o4-mini in mid-2025; Google released Gemini 2.5 Pro with extended thinking in March 2025; Anthropic added extended thinking to Claude 3.7 Sonnet in February 2025 and Claude Opus 4 in May 2025; Alibaba released Qwen3 with native hybrid reasoning in April 2025 and the open-source QwQ-32B reasoning model. The competitive dynamic R1 created pushed the entire field forward at a faster pace than might otherwise have occurred.
The distillation recipe in particular was widely copied. Microsoft's Phi-4-Reasoning, Berkeley's Sky-T1-32B, Hugging Face's Open-R1, NVIDIA's OpenReasoning-Nemotron, and dozens of community models all used variants of R1's rejection-sampling-then-SFT recipe to bootstrap reasoning capabilities into smaller bases.
As a model developed by a Chinese company, DeepSeek-R1 faced regulatory scrutiny in multiple Western countries. Concerns centered on data privacy (DeepSeek's servers are located in China, subject to Chinese data laws), potential content alignment with Chinese government positions, and national security implications of a widely deployed Chinese AI model.[15]
The US government response to DeepSeek was swift and multi-pronged. On February 6, 2025, Representatives Josh Gottheimer and Darin LaHood introduced the bipartisan "No DeepSeek on Government Devices Act," which specifically targeted the DeepSeek mobile application and API for prohibition on federal government devices. The bill passed in August 2025, banning federal employees from using the app on government-issued devices.[15][20]
Additional legislative efforts included Representative Mark Green's China Technology Transfer Control Act and Senator Josh Hawley's "Decoupling America's Artificial Intelligence Capabilities from China Act," introduced on January 29, 2025. Gottheimer and LaHood also wrote to all 50 US governors urging them to implement similar bans at the state level.[15][20]
Multiple government agencies independently restricted or banned use of DeepSeek products:[15][27]
| Agency / government | Action | Date |
|---|---|---|
| U.S. Navy | Issued warnings against use | Late January 2025 |
| NASA | Reinforced security concerns | January 31, 2025 |
| Italy (Garante) | Ordered limitation on processing of Italian users' data; app removed from Apple/Google stores | January 30, 2025 |
| Texas | First U.S. state ban on government systems | February 2025 |
| Virginia | Banned on government systems | February 2025 |
| New York | Banned on government systems | February 2025 |
| U.S. Congress | Restricted on congressional devices | February 2025 |
| Pentagon | Restricted usage | February 2025 |
| Australia | Government agency restrictions; Home Affairs Minister Tony Burke cited national security | February 4, 2025 |
| South Korea | Government restrictions; later removed from app stores | February 2025 |
| Taiwan (MODA) | Banned on government devices and critical infrastructure | February 2, 2025 |
| India (Finance Ministry) | Restricted on government devices | February 2025 |
The Fiscal Year 2026 National Defense Authorization Act, signed in December 2025, included provisions restricting DeepSeek usage within the Department of Defense and Intelligence Community.[15]
Separately, OpenAI accused DeepSeek of improperly distilling from OpenAI models. The accusation surfaced publicly within days of R1's release, with OpenAI claiming it had "some evidence" that DeepSeek used outputs from OpenAI APIs to train R1, in violation of OpenAI's terms of service. In a February 2026 Bloomberg report, OpenAI escalated the claim in a memo to U.S. lawmakers, alleging that DeepSeek had developed methods to circumvent OpenAI's access restrictions through obfuscated third-party routers and other means. DeepSeek did not formally admit to using distillation in training R1's reasoning capabilities. The peer-reviewed Nature paper acknowledged that R1's training data, scraped from the open web, would inevitably contain text generated by other LLMs, while denying targeted distillation of OpenAI's reasoning traces.[22][28][29]
The accusation raised broader legal and ethical questions about "distillation" as a practice. Many in the open-source community pointed out that OpenAI itself had been sued for training on copyrighted material scraped from the web, leaving it in an awkward position. The episode also illustrated a structural issue in modern LLM development: as long as proprietary models are accessible via API, their outputs can be used to train competitors, and detection is technically difficult.
Content analysis documented instances of refusal or evasion related to politically sensitive topics, including the 1989 Tiananmen Square massacre, the political status of Taiwan, the treatment of Uyghurs in Xinjiang, and the comparison of Xi Jinping to other figures. DeepSeek consistently avoided substantive engagement with these subjects when accessed through chat.deepseek.com or the official API. Behavior on the open weights was more nuanced: when the same questions were posed to a self-hosted instance of R1 through bare-metal inference, the model often produced fuller answers, suggesting that some of the refusal behavior was implemented as a server-side filter rather than baked into the weights themselves.[15][20][30]
A May 2025 academic paper titled "R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model" mapped the topology of refusals across hundreds of prompts. Most Type 2 refusals (full silence rather than alignment-style explanations) clustered around Tiananmen Square, suggesting that this topic remained particularly sensitive across both the chat product and the model weights.[30]
Technical security analyses identified several concerns with DeepSeek's infrastructure. Reports cited hidden code linking to China Mobile servers in the mobile app, the collection of keystroke timing data, data storage on Chinese servers subject to Chinese government access requests, and several cybersecurity test failures, including jailbreak resistance below industry baselines. The government restrictions described above applied primarily to official use, however: commercial and individual use of R1's open weights remained unrestricted in most jurisdictions, and the model continued to be widely deployed through cloud providers and self-hosted infrastructure. Because the weights were open, security concerns about DeepSeek's servers could be mitigated by self-hosting, though the underlying questions about training data and possible weight-level alignment with Chinese government positions remained.[15][20]
R1 has well-documented limitations that practitioners learned to work around:
- Prompt sensitivity: few-shot prompting consistently degraded performance, and DeepSeek recommended zero-shot prompts that state the problem directly
- No support in the January release for system prompts, function calling, JSON output, or multi-turn conversation patterns (addressed in R1-0528)
- Language mixing when handling queries in languages other than English or Chinese
- Verbose "overthinking" on simple queries, spending thousands of reasoning tokens on trivial questions
- Final answers that sometimes omitted the expected \boxed{...} markers, requiring downstream parsing logic that handled missing or malformed markers

R1 was the first reasoning model in what became a steady release cadence from DeepSeek through 2025 and into 2026.
| Release | Date | Notes |
|---|---|---|
| DeepSeek-R1 | Jan 20, 2025 | Initial release; companion R1-Zero; six distilled variants |
| DeepSeek-R1-0528 | May 28, 2025 | Major update; deeper thinking; function calling; JSON; system prompts; R1-0528-Qwen3-8B distill |
| DeepSeek-V3.1 | Aug 19, 2025 | First hybrid model: chat and reasoning in one set of weights with a thinking-mode toggle |
| DeepSeek-V3.2-Exp | Sep 29, 2025 | Experimental release introducing DeepSeek Sparse Attention |
| DeepSeek-OCR | Oct 2025 | Vision-language OCR model |
| DeepSeek-V3.2 | Dec 1, 2025 | Production hybrid model with thinking integrated into tool use |
| DeepSeek-V3.2-Speciale | Q1 2026 | High-compute variant; gold-medal results on IMO 2025, IOI 2025, ICPC World Finals |
| DeepSeek-V4 (anticipated) | April 2026 (rumored) | Next-generation flagship; expected to fold reasoning, tool-use, and multimodal into single weights |
The V3.1 hybrid release in August 2025 effectively absorbed R1's role: a single set of weights could now serve both as a fast chat model and (with a thinking-mode toggle) as a reasoning model. V3.1's deep-thinking mode achieved roughly 90 to 95% of R1-0528's performance on reasoning benchmarks while sharing weights with a normal chat model, removing the need to load a separate reasoning model in production. By the time V3.2 launched in December 2025, R1 was no longer DeepSeek's recommended model for new applications, though it remained widely cited and deployed because of its open-source release and well-understood behavior.[31]
As of April 2026, DeepSeek-R1 and its derivatives remain among the most widely studied open-source reasoning models, even though DeepSeek's own product line has moved on to the V3.x hybrid family and the anticipated V4. R1-0528 continues to be available through the DeepSeek API at the original prices and through every major third-party inference provider. The 32B and 70B distilled models remain popular as locally hostable reasoning baselines.
The model's legacy is best measured by its influence on the field. R1 proved that reasoning-capable language models could be built openly and cheaply, that reinforcement learning could induce genuine reasoning behaviors without supervised examples, and that a small team with limited resources could compete with the largest AI labs in the world. The recipe it published, GRPO with rule-based rewards on verifiable tasks, has become the dominant approach for training reasoning models across both open-source and commercial labs. Most reasoning models released in 2025 and 2026, from Qwen's QwQ family to Microsoft's Phi-Reasoning to community-trained models on Hugging Face, used some variant of the R1 recipe.
R1 also reset expectations for what a model release should look like. The combination of a permissive license, a detailed training recipe, peer-reviewed publication, six pre-distilled variants, and aggressive API pricing became a de facto template that other open-source labs were measured against. When subsequent releases (Qwen3, Llama 4 Reasoning, Mistral Magistral) were perceived as stinting on documentation or imposing restrictive licenses, the comparison was usually to R1.