DeepSeek-R1
Last reviewed
May 8, 2026
Sources
45 citations
Review status
Source-backed
Revision
v8 ยท 11,108 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 8, 2026
Sources
45 citations
Review status
Source-backed
Revision
v8 ยท 11,108 words
Add missing citations, update stale details, or suggest a clearer explanation.
| DeepSeek-R1 | |
|---|---|
| Developer | DeepSeek |
| Release date | January 20, 2025 |
| Type | Large language model (reasoning) |
| Architecture | Mixture of Experts (MoE), Transformer |
| Parameters | 671 billion total; 37 billion active per token |
| Context length | 128,000 tokens |
| Training method | Reinforcement learning with GRPO |
| Training compute | 512 Nvidia H800 GPUs, ~80 hours |
| Reported training cost | $294,000 (R1 RL stage only) |
| License | MIT |
| Updated version | DeepSeek-R1-0528 (May 28, 2025) |
| Successor lineage | DeepSeek-V3.1 (Aug 2025), V3.2 (Dec 2025) |
| Paper | arXiv:2501.12948; Nature 645, 633-638 (Sept 18, 2025) |
DeepSeek-R1 is an open-source reasoning-focused large language model developed by DeepSeek, a Chinese artificial intelligence company spun out of the High-Flyer quantitative hedge fund. Released on January 20, 2025 under the MIT license, R1 was the first open-weight model to match the reasoning performance of OpenAI's proprietary o1 across mathematics, coding, and scientific reasoning tasks. The model uses a Mixture of Experts architecture with 671 billion total parameters, of which 37 billion are activated per forward pass, keeping computational costs manageable during inference.[1][2]
DeepSeek-R1's release triggered what became known as the "DeepSeek shock," a market event on January 27, 2025 that erased over $1 trillion from U.S. technology stocks in a single trading session. Nvidia alone lost approximately $589 billion in market capitalization, the largest single-day loss for any company in stock market history. The shock stemmed from the revelation that a small Chinese startup with roughly 160 employees had trained a reasoning model competitive with the world's most expensive AI systems while reporting an RL-stage training cost of only $294,000, a fraction of the hundreds of millions typically spent by Western labs.[3][4][22]
Beyond its market impact, DeepSeek-R1 was scientifically significant for demonstrating that complex reasoning behaviors could emerge from pure reinforcement learning without supervised fine-tuning. The companion model DeepSeek-R1-Zero, trained entirely through RL with rule-based rewards, developed chain-of-thought reasoning, self-reflection, and error correction spontaneously during training, a finding that challenged assumptions about how reasoning capabilities must be instilled in language models. The accompanying paper became the first major open-weight LLM to pass independent peer review, appearing on the cover of Nature on September 18, 2025.[1][5][22]
DeepSeek's path to R1 began with a hedge fund, not an AI lab. High-Flyer, a Chinese quantitative trading firm co-founded in February 2016 by Liang Wenfeng and two of his classmates from Zhejiang University, accumulated tens of thousands of Nvidia GPUs over the late 2010s for stock-prediction and high-frequency-trading workloads. By 2020 the firm had built one of the largest private AI training clusters in China. Liang spun the AI research arm into an independent company, DeepSeek, in May 2023, and seeded it with engineers experienced in squeezing performance out of large GPU pools. Unlike most well-funded AI startups, DeepSeek was bootstrapped from hedge fund profits and had no external investors at the time of R1's release.[6][23]
DeepSeek had been building toward R1 throughout 2024. The company released DeepSeek-V2 in May 2024 and DeepSeek-V3 in December 2024, both using Mixture of Experts architectures that prioritized computational efficiency. V3 served as the base model for R1's training, providing a strong foundation of general language capabilities and world knowledge. V3 itself was trained on roughly 14.8 trillion tokens for an estimated $5.576 million in GPU rental cost, a figure that would later be added to discussions of R1's true total cost.[2][6][22]
The broader context for R1's development was the emergence of inference-time reasoning as a new paradigm in AI. OpenAI's o1, released in September 2024, had demonstrated that training models with reinforcement learning to "think before answering" could dramatically improve performance on difficult tasks. OpenAI did not publish technical details about how o1 was trained, leaving the broader research community to guess at the recipe. DeepSeek's contribution was to show that this approach could be replicated with open-source models at a fraction of the cost, and that the reasoning behaviors could emerge more naturally than previously assumed. The paper made the recipe public.[1]
DeepSeek-R1 is built on top of DeepSeek-V3, which uses a Mixture of Experts transformer architecture. The key architectural features include:[2][6]
The MoE architecture is central to R1's efficiency story. By activating only 37 billion of its 671 billion parameters for each token, the model achieves inference costs comparable to a much smaller dense model while maintaining the knowledge capacity of its full parameter count. MLA further reduces memory pressure during long-context inference, which matters for reasoning models that emit thousands of intermediate tokens before answering.
DeepSeek-R1's training followed a four-stage pipeline that combined supervised learning and reinforcement learning:[1][5]
Stage 1: Cold start. The DeepSeek-V3 base model was fine-tuned on a small set of curated chain-of-thought reasoning examples (a few thousand long-CoT samples). This cold-start data provided the model with initial examples of structured reasoning, addressing issues like repetitive loops and poor readability that occurred when applying RL directly to the base model in the R1-Zero experiment.
Stage 2: Reasoning-oriented reinforcement learning. Large-scale RL was applied using Group Relative Policy Optimization (GRPO), focused on tasks with verifiable answers (mathematics, coding, logic problems). The model learned to generate extended chains of thought and was rewarded based solely on the correctness of its final answers. A language-consistency reward was also added to discourage the language mixing seen in R1-Zero.
Stage 3: Rejection sampling and supervised fine-tuning. The RL-trained model generated a large set of reasoning traces. High-quality traces were selected through rejection sampling and combined with non-reasoning data to produce a curated dataset of approximately 800,000 samples (about 600,000 reasoning samples and 200,000 general samples covering writing, factual QA, self-cognition, and translation). DeepSeek-V3-Base was then fine-tuned on this dataset for two epochs.[1][24]
Stage 4: Reinforcement learning for all scenarios. A final round of RL was applied across both reasoning and general tasks, optimizing for helpfulness and harmlessness using a combination of rule-based and model-based reward signals. This stage tightened the model's behavior on conversational tasks where rule-based rewards were not available.
The rejection-sampling step in Stage 3 also fed the recipe used to train all six distilled variants, since the same 800K-sample dataset was used to fine-tune smaller open-source base models.[1]
In the Nature publication of September 2025, DeepSeek disclosed that the reinforcement learning portion of R1's training used 512 Nvidia H800 GPUs for approximately 80 hours, at an estimated rental cost of $294,000 (assuming a then-current $2 per GPU-hour rental rate). Supplementary materials acknowledged for the first time that DeepSeek also owned A100 GPUs and used them for preparatory experiments at smaller scale.[22]
This $294,000 figure refers narrowly to the RL stage that turned DeepSeek-V3 into R1. It does not include the cost of training V3 itself (around $5.6 million for the base model), the cost of generating cold-start data, the cost of distillation, salaries, or the depreciated cost of the GPU cluster. Several outlets, including The Register and CNN Business, noted that the true end-to-end cost of producing R1 was roughly an order of magnitude higher than the headline number, though still dramatically below the budgets of comparable Western reasoning models.[22][25]
The number that traveled fastest through press coverage in late January 2025 was not $294,000 but $5.576 million. That figure originally appeared in the DeepSeek-V3 technical report, where it was presented as the GPU-rental cost of training the V3 base model on roughly 14.8 trillion tokens (2.788 million H800 GPU-hours at $2 per hour). DeepSeek explicitly noted in V3's report that the figure excluded the cost of architecture research, ablations, salaries, and prior data work. When R1 launched five weeks later, many secondary outlets re-reported the $5.576 million number as if it were the total cost of building R1, which it was not.[2][22][25]
Three distinct cost figures circulated in early 2025 coverage, often used interchangeably:[22][25]
| Reported figure | What it covers | What it excludes |
|---|---|---|
| $294,000 | The final RL training run that converted V3 into R1 (512 H800 GPUs over roughly 80 hours) | Base model training, cold-start data, distillation, salaries, hardware depreciation |
| $5.576 million | GPU rental for training V3 from scratch on 14.8T tokens | Architecture research, prior models, salaries, the cluster itself |
| Estimates of $50 million to $1.6 billion+ | Various attempts to reconstruct DeepSeek's full development budget including the GPU cluster, headcount, and prior models | Speculative; DeepSeek does not disclose figures at this level |
SemiAnalysis published the most cited reconstruction of DeepSeek's full operating costs in late January 2025, estimating that the underlying GPU cluster alone (around 50,000 Hopper-class GPUs accumulated by High-Flyer over several years) would have cost roughly $1.6 billion if rented at commercial rates and that DeepSeek's annual operational expenditure was closer to $1.3 billion. SemiAnalysis emphasized that this did not contradict DeepSeek's narrower per-run cost figures; it simply reframed them. The DeepSeek narrative of efficiency held up at the level of marginal training cost; the framing of a $6 million startup competing with $100 million labs was harder to defend at the cluster-and-headcount level.[22][25][33]
The fairest reading of all three numbers is that R1's marginal training cost was very low (the published $294,000 plus the V3 base it built on), but the total cost of producing R1 included a significant prior investment in a GPU cluster, a research team, and the V2 and V3 model line. The technical achievement, that a small dedicated RL stage could produce a frontier reasoning model from a strong general base, remains intact even after the cost reframing.[22][25]
GRPO is the reinforcement learning algorithm used to train both R1-Zero and R1. Originally proposed by DeepSeek for their earlier DeepSeek-Math model in a February 2024 paper, GRPO simplifies the RL training process compared to Proximal Policy Optimization (PPO), which had been the standard approach for language model RL training (as used in RLHF).[1][5][16]
The key innovation of GRPO is eliminating the need for a separate critic (value) model. In standard PPO-based RLHF, two models must be maintained during training: the policy model being optimized and a value model that estimates expected returns. The value model alone can be as large as the policy model, effectively doubling the computational requirements of training. GRPO removes this requirement by using a simpler baseline derived from group statistics.[1][16]
The algorithm works as follows:[1][5][16]
\boxed{...} directive) matches the ground truth. For code, it means whether the generated program passes a hidden test suite.This group-relative approach has several advantages. By normalizing rewards within each group, GRPO reduces the impact of reward scale differences across different problem types. The elimination of the value model cuts training memory requirements roughly in half, allowing the same hardware to train larger models. The algorithm is also simpler to implement and tune than PPO with a learned value function.[16]
DeepSeek used a deliberately simple reward design: an accuracy reward (right or wrong on a verifiable answer) plus a format reward (the model is required to enclose its reasoning in <think>...</think> tags and its final answer in a designated location). No reward model based on human preference data was used during the reasoning-oriented stages, which sidestepped one of the most expensive components of conventional RLHF and one of the harder failure modes (reward hacking against a learned reward model).[1][5]
Since R1's release, GRPO has become widely adopted in the language model community. Hugging Face's TRL library added native GRPO support, and numerous research groups have used it to train their own reasoning models. The algorithm's combination of simplicity, efficiency, and effectiveness made it particularly attractive for smaller teams and academic researchers who could not afford the memory overhead of full PPO. By the end of 2025, GRPO and its derivatives (DAPO, GRPO+ from Qwen, REINFORCE++ variants) had become the de facto standard for training open-source reasoning models, displacing PPO with DPO as the preferred recipe for post-training reasoning behaviors.[16]
Before training R1, DeepSeek conducted an experiment called R1-Zero that became one of the most discussed results in the paper. R1-Zero was trained by applying reinforcement learning directly to the DeepSeek-V3 base model, without any supervised fine-tuning or curated reasoning examples. The model was simply given problems and rewarded for producing correct answers.[1][5]
R1-Zero used a deliberately minimal prompt template that asked the base model to enclose its reasoning inside <think>...</think> tags and its final answer inside <answer>...</answer> tags. No examples of reasoning were provided. The base model began by emitting essentially random text inside the think tags, but the GRPO training loop pushed it toward producing reasoning content that actually helped it answer correctly. Over tens of thousands of training steps, that pressure produced increasingly structured reasoning behaviors.[1]
Despite receiving no explicit training on how to reason, R1-Zero spontaneously developed several sophisticated reasoning behaviors during RL training:[1][5]
The paper reported a striking accuracy trajectory on the AIME 2024 mathematics olympiad. R1-Zero's pass@1 score climbed from 15.6% at the start of RL training to 71.0% by the end of training. With self-consistency (majority voting over 64 samples) the score reached 86.7%. By the end of the run, R1-Zero on its own had matched o1-preview-level scores on AIME using nothing more than RL on a base model that had never seen a reasoning example.[1][22]
Researchers tracked the emergence of reflective reasoning behaviors across training by measuring the frequency of specific terms in the model's outputs. The results showed a clear phase transition:[1][5][17]
| Training stage | Reflective term frequency | Behavior |
|---|---|---|
| Steps 0-4,000 | Virtually absent | Model generates linear, non-reflective solutions |
| Steps 4,000-7,000 | Sporadic appearance | Occasional use of "wait," "but," "however" |
| Steps 8,000+ | Marked increase | Systematic self-monitoring and error correction |
Specific reflective terms tracked included "wait," "mistake," "however," "but," "retry," "error," "verify," "wrong," "evaluate," and "check." These terms were virtually absent in the early stages of training, appeared sporadically in the middle stages, and showed a marked increase after step 8,000, suggesting the emergence of temporal reasoning or self-monitoring behavior.[17]
The model also showed a clear increase in the length of its reasoning chains over the course of training. Early in RL training, the model generated short, direct answers. As training progressed, the average response length grew steadily from a few hundred tokens to several thousand, with the model learning to allocate more thinking time to harder problems. This adaptive allocation of test-time compute was not explicitly trained but emerged naturally from the optimization process.[1][5]
DeepSeek's paper highlighted what they called an "aha moment" during R1-Zero's training. At a certain point in RL training, the model showed a sudden increase in the use of reflective language (particularly the word "wait") during its reasoning chains. The paper printed an excerpt of one such moment, where the model interrupts itself in the middle of a math problem with the phrase "Wait, wait. Wait. That's an aha moment I can flag here." before backtracking to a different approach and getting the right answer. This marked a qualitative shift in the model's reasoning patterns, where it began systematically re-evaluating and correcting its own work rather than simply proceeding linearly through a solution.[1][5]
The aha moment became widely discussed in the AI research community. DeepSeek described it as evidence of "the self-evolution process" of the model, suggesting that reinforcement learning could induce genuinely emergent cognitive strategies. However, subsequent research by other groups has debated whether these behaviors were truly emergent or whether traces of reflective reasoning were already present in the base model's pre-training data. A study by Sea AI Lab titled "There May Not be Aha Moment in R1-Zero-like Training" argued that the observed behaviors could be attributed to pre-existing patterns in the training data rather than genuine emergence, and replicated similar trajectories starting from base models that had been pre-trained on web text containing reasoning-style writing.[5][7]
Despite its impressive emergent behaviors, R1-Zero had practical limitations that motivated the development of the full R1 model. Its outputs often suffered from poor readability, with reasoning chains that mixed Chinese and English mid-sentence, repeated phrases endlessly, or failed to clearly delineate the final answer. The model also struggled with tasks outside of mathematics and coding, where the lack of supervised fine-tuning left it without appropriate response formats. These issues were addressed in R1 through the cold-start data, the language-consistency reward in Stage 2, and the multi-stage training pipeline.[1]
Even with these flaws, R1-Zero on its own represented a notable scientific result: pure RL on a strong base model produced a competitive reasoning model. R1-Zero was released alongside R1 under the same MIT license so that researchers could study the unfiltered behavior of an RL-only reasoning model.
DeepSeek-R1 achieved performance competitive with OpenAI's o1 across major reasoning benchmarks.
| Benchmark | DeepSeek-R1 | OpenAI o1 (Dec 2024) | GPT-4o | Description |
|---|---|---|---|---|
| AIME 2024 (pass@1) | 79.8% | 79.2% | 13.4% | American Invitational Mathematics Exam |
| MATH-500 | 97.3% | 96.4% | 60.3% | Mathematical problem solving |
| GPQA Diamond | 71.5% | 75.7% | 53.6% | Graduate-level science questions |
| Codeforces (Elo / percentile) | 2,029 / 96.3 | 1,891 / 93.4 | n/a | Competitive programming rating |
| MMLU | 90.8% | 91.8% | 87.2% | Multitask language understanding |
| MMLU-Pro | 84.0% | 81.9% | 73.3% | Harder MMLU variant |
| LiveCodeBench (CoT) | 65.9% | 63.4% | 33.4% | Real-world coding tasks |
| SWE-bench Verified | 49.2% | 48.9% | 33.2% | Software engineering tasks |
| HumanEval | 85.4% | 92.4% | 90.2% | Code generation |
| AlpacaEval 2.0 (LC) | 87.6% | n/a | 51.1% | Open-ended instruction following |
| ArenaHard | 92.3% | n/a | 80.4% | Adversarial chat eval |
The results showed that R1 matched or exceeded o1 on most mathematical and coding benchmarks while trailing slightly on graduate-level science (GPQA Diamond) and short-form code generation (HumanEval). The fact that an open-source model could achieve these results, trained at a fraction of the cost, was the central claim that drove both the scientific interest and the market reaction. Importantly, R1's chat-style benchmarks (AlpacaEval, ArenaHard) showed that the multi-stage training preserved instruction-following quality even as it added reasoning capability.
Alongside R1, DeepSeek released six smaller distilled models created through knowledge distillation, where R1's reasoning capabilities were transferred to smaller, more efficient base models. DeepSeek used R1 as a teacher model to generate the same 800,000 high-quality reasoning traces used to fine-tune R1 itself in Stage 3, then applied them as supervised fine-tuning data on smaller bases from the Qwen2.5 and Llama 3 families. No additional RL was applied to the distilled models in this initial release.[1][2]
| Distilled model | Base model | Parameters | License | AIME 2024 | MATH-500 | GPQA Diamond |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1.5B | Apache 2.0 + MIT | 28.9% | 83.9% | 33.8% |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 7B | Apache 2.0 + MIT | 55.5% | 92.8% | 49.1% |
| DeepSeek-R1-Distill-Llama-8B | Llama 3.1-8B | 8B | Llama 3 + MIT | 50.4% | 89.1% | 49.0% |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | Apache 2.0 + MIT | 69.7% | 93.9% | 59.1% |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | Apache 2.0 + MIT | 72.6% | 94.3% | 62.1% |
| DeepSeek-R1-Distill-Llama-70B | Llama 3.3-70B-Instruct | 70B | Llama 3 + MIT | 70.0% | 94.5% | 65.2% |
The distilled models were a major part of R1's impact. The smallest model, Qwen-1.5B, outperformed GPT-4o and Claude 3.5 Sonnet on math benchmarks despite being small enough to run on consumer hardware. The 32B and 70B distilled models set new state-of-the-art results among dense (non-MoE) open-source models on reasoning benchmarks, outperforming the contemporaneous QwQ-32B-Preview by substantial margins. Notably, the 32B distillation reached 72.6% on AIME 2024, beating OpenAI's o1-mini (63.6%) by nearly nine points.[1]
Distilled models inherited the licenses of their base checkpoints. Qwen-based distills are released under Apache 2.0 (with the fine-tuning weights themselves under MIT), while Llama-based distills are governed by their respective Meta Llama community licenses with the fine-tuning weights under MIT.[2]
The distilled models enabled local deployment on consumer hardware, which became a major driver of community adoption. The hardware requirements for running different distilled models are as follows:[18]
| Distilled model | Minimum VRAM | Recommended GPU | Performance notes |
|---|---|---|---|
| 1.5B / 7B / 8B | 8 GB | NVIDIA RTX 3060 12GB | Runs efficiently at standard quantization |
| 14B | 12-16 GB | NVIDIA RTX 4070 Ti 16GB | Fits in VRAM at 4-bit quantization |
| 32B | 20-24 GB | NVIDIA RTX 3090/4090 24GB | Smooth performance at 4-bit quantization |
| 70B | 40-48 GB | 2x NVIDIA RTX 3090 or A100 | Requires multi-GPU or offloading |
The 32B distilled model hit a particularly attractive sweet spot, offering performance comparable to OpenAI's o1-mini on several benchmarks while running on a single consumer-grade RTX 4090 GPU. Running locally eliminated API costs, kept data private, removed rate limits, and provided offline access to explicit chain-of-thought reasoning.[18]
The distilled models proved especially popular with the open-source community. Within weeks of release, hundreds of derivative models were created on Hugging Face, fine-tuned for specific use cases ranging from medical reasoning to financial analysis. Hobbyists ran the 8B and 14B distills on Apple Silicon laptops via llama.cpp and MLX, providing the first time many users had access to a fully local reasoning model that could explain its work.
DeepSeek included an ablation in the paper comparing two ways of giving small models reasoning capabilities: distilling traces from a strong reasoning teacher (R1) versus running GRPO directly on a small base model. Distillation won decisively. The 32B Qwen base distilled from R1 outperformed the same base trained with R1's RL recipe directly. The takeaway was that the cheap path for small reasoning models is to distill from a large reasoning teacher rather than run RL from scratch, a finding that influenced subsequent open-source training recipes.[1]
DeepSeek released R1, R1-Zero, and all six distilled models under the MIT license, one of the most permissive open-source licenses available. The license explicitly permits commercial use, modification, redistribution, model distillation, and the use of API outputs to train other models. The full model weights, training code references, and technical paper were all made publicly available on GitHub and Hugging Face.[1][2]
This explicit grant of distillation rights was unusual. Most proprietary AI providers either prohibit using their outputs to train competing models or leave the question ambiguous. DeepSeek's terms made it clear that researchers could legally train students on R1's outputs, which removed legal friction from the wave of follow-on work.
The open-source strategy was a deliberate choice that amplified R1's impact far beyond what a proprietary release would have achieved. Within a month of launch, over 700 community-built models derived from R1 appeared on Hugging Face, collectively downloaded more than 5 million times. Major cloud providers including Microsoft Azure, Amazon Web Services, and Nvidia's inference platforms quickly added support for R1, making it accessible through familiar enterprise interfaces.[2][9]
DeepSeek-R1 became the most-liked model on Hugging Face among nearly 1.5 million models on the platform, surpassing 10,000 likes within weeks of release. The variant versions collectively exceeded 10 million downloads. R1's release also catalyzed a broader shift in the open-source AI ecosystem: the number of competitive Chinese organizations releasing models increased dramatically, with Baidu going from zero releases on Hugging Face in 2024 to over 100 in 2025, and ByteDance and Tencent each increasing releases by eight to nine times.[19]
The MIT license also enabled a wave of academic research building on R1's approach. Researchers at universities and smaller labs could study the model's reasoning traces, replicate the RL training methodology, and test hypotheses about emergent reasoning that would have been impossible with a proprietary model. Several groups, including Hugging Face's Open-R1 project, Berkeley's NovaSky team (Sky-T1), and the Together AI / Stanford collaboration on TinyZero, attempted full replications of the R1 training pipeline using only public data and open base models.
The market reaction to R1's release became a defining financial event of early 2025. On January 27, 2025, one week after R1's public release, U.S. technology stocks experienced their steepest single-day decline in history.[3][4]
The sell-off was triggered by a sudden reassessment of the AI investment thesis. For years, the market had priced technology companies, especially chipmakers and cloud providers, on the assumption that building frontier AI required massive and growing capital expenditures. DeepSeek's demonstration that a 160-person Chinese startup could produce competitive results undermined that assumption.[3][4]
Nvidia's stock fell nearly 17% in a single session, closing at $118.58 and losing approximately $589 billion in market capitalization. This was the largest single-day market value loss for any company in history. Other semiconductor companies including Broadcom, Marvell, Micron, and TSMC also fell sharply. The Nasdaq composite lost roughly $1 trillion in value by the end of the day. Meta and Alphabet (Google's parent company) also declined significantly. Apple briefly retook the title of world's most valuable company as Nvidia fell to roughly $2.8 trillion in market cap.[3][4]
The DeepSeek mobile app reached number one on the Apple App Store in the United States on January 27, displacing ChatGPT. That ranking became part of the news cycle around the stock drop, with retail investors and journalists pointing to the consumer ranking as a tangible sign that something had changed.
Marc Andreessen, the prominent technology investor, described the event as "AI's Sputnik moment," drawing a parallel to the 1957 Soviet satellite launch that shocked the United States into accelerating its space program. The comparison captured the sense that a competitor working with far fewer resources had achieved something that the established players, with their billions in investment, had assumed only they could do.[4][10] President Donald Trump, speaking at a Republican retreat the same week, called R1 a "wake-up call for our industries that we need to be laser-focused on competing to win."
The market impact extended beyond the immediate sell-off. Chinese AI companies entered an aggressive price war, with some cutting API prices by up to 97% in the weeks following R1's release. In the United States, the event forced a public debate about whether the hundreds of billions being invested in AI data centers and chip manufacturing were truly necessary, or whether architectural innovation could substitute for raw compute.[3][4]
Stanford HAI faculty noted that DeepSeek's open releases represented "a significant step in democratizing AI," enabling smaller companies and individual developers to build on frontier-capable models without massive compute budgets.[9] Within days, multiple Western labs publicly accelerated their own reasoning-model roadmaps. OpenAI shipped o3-mini on January 31, 2025, and Anthropic added an extended thinking mode to Claude 3.7 Sonnet in February.
On May 28, 2025, DeepSeek released a major update to R1 designated R1-0528. Despite being described as a "minor upgrade" in official communications, the update delivered substantial improvements across all major benchmarks.[11][12]
| Benchmark | R1 (Jan 2025) | R1-0528 (May 2025) | Change |
|---|---|---|---|
| AIME 2024 | 79.8% | 91.4% | +11.6 |
| AIME 2025 | 70.0% | 87.5% | +17.5 |
| HMMT 2025 | 41.7% | 79.4% | +37.7 |
| CNMO 2024 | 78.8% | 86.9% | +8.1 |
| LiveCodeBench (2408-2505) | 63.5% | 73.3% | +9.8 |
| Codeforces-Div1 rating | ~1,530 | ~1,930 | +400 |
| SWE-bench Verified | 49.2% | 57.6% | +8.4 |
| Aider-Polyglot | 53.3% | 71.6% | +18.3 |
| MMLU-Redux | 92.9% | 93.4% | +0.5 |
| MMLU-Pro | 84.0% | 85.0% | +1.0 |
| GPQA Diamond | 71.5% | 81.0% | +9.5 |
| Humanity's Last Exam | 8.5% | 17.7% | +9.2 |
| FRAMES | 82.5% | 83.0% | +0.5 |
The AIME 2025 improvement from 70% to 87.5% was particularly notable, bringing R1-0528 into competitive range with OpenAI's o3 (88.9% on AIME 2025). The Codeforces rating jump of approximately 400 points reflected dramatically improved code generation and problem-solving ability. The HMMT 2025 improvement of nearly 38 points was the single largest gain, reflecting deeper engagement with multi-step competition mathematics.[11][26]
R1-0528 also added two tool-use benchmarks where the original R1 had not reported numbers: BFCL_v3 (Berkeley Function Calling Leaderboard, multi-turn) at 37.0% and Tau-Bench (airline 53.5%, retail 63.9%), reflecting the new function-calling capability.[26]
R1-0528 demonstrated deeper chain-of-thought reasoning than its predecessor. On AIME 2025 problems, the model averaged approximately 23,000 thinking tokens per query, compared to roughly 12,000 for the original R1, a roughly 92% increase in reasoning depth. This near-doubling, enabled by additional algorithmic optimization during post-training, contributed to the accuracy improvements.[12][26]
DeepSeek also reported that the rate of hallucinations (false or misleading outputs) was reduced by approximately 45 to 50% in scenarios such as rewriting and summarization.[12]
The update added several capabilities requested by the developer community:[12]
<think>\n formatting to trigger the thinking modeDeepSeek also released a distilled model from R1-0528: DeepSeek-R1-0528-Qwen3-8B, which achieved state-of-the-art performance among open-source 8B models on AIME 2024 at 86.0%, surpassing the base Qwen3-8B by 10 percentage points and matching the performance of the much larger Qwen3-235B-Thinking on the same benchmark.[12][26]
DeepSeek offered R1 through its API at prices dramatically lower than competing reasoning models, advertised as model=deepseek-reasoner.
| Model | Input (per 1M tokens, cache miss) | Input (per 1M tokens, cache hit) | Output (per 1M tokens) |
|---|---|---|---|
| DeepSeek-R1 | $0.55 | $0.14 | $2.19 |
| OpenAI o1 | $15.00 | $7.50 | $60.00 |
| OpenAI o3 (June 2025 cut) | $2.00 | $0.50 | $8.00 |
| Anthropic Claude 3.7 Sonnet (extended thinking) | $3.00 | $0.30 | $15.00 |
The pricing differential was stark: R1 was roughly 27 times cheaper than o1 for both input and output tokens. Even after OpenAI's June 2025 price cuts brought o3 down to $2/$8, R1 remained approximately 3 to 4 times cheaper. Combined with the MIT license allowing self-hosting (eliminating API costs entirely for organizations with their own compute), R1's economics were a core part of its appeal.
Third-party API providers including Together AI, Fireworks AI, Groq, OpenRouter, Hyperbolic, and Lambda all offered hosted endpoints for R1 within days of release, often at competitive prices and sometimes with faster inference than DeepSeek's own API. Groq in particular advertised R1-Distill-Llama-70B running on its LPU hardware at over 200 tokens per second, several times faster than typical GPU-based deployments.
On September 18, 2025, the DeepSeek-R1 paper appeared on the cover of Nature (volume 645, issue 8081, pages 633 to 638), becoming the first major open-weight large language model to be the subject of a peer-reviewed Nature paper. The corresponding author was Liang Wenfeng, with 199 co-authors from DeepSeek-AI listed.[5][22]
The peer-reviewed version added information that had not appeared in the January arXiv preprint:[22]
The peer review was unusually public for an AI paper. Nature published the reviewer comments and DeepSeek's responses alongside the article, an editorial choice that was widely welcomed in the AI research community as a step toward more rigorous publication practices for industrial AI work.[5]
DeepSeek-R1's impact extended well beyond its benchmark scores and market disruption. The model fundamentally changed several assumptions about AI development.
Before R1, the prevailing assumption was that training frontier reasoning models required resources available only to a handful of well-funded Western labs. DeepSeek showed that a combination of architectural innovation (MoE, MLA), efficient training algorithms (GRPO, FP8 mixed precision), and clever engineering could produce competitive results at dramatically lower cost. This finding had practical consequences: smaller companies and research institutions began building on R1's open weights rather than training models from scratch.[1][9]
R1 demonstrated that open-source AI models could match proprietary frontier models in at least one important capability dimension (reasoning). This intensified the debate within the AI industry about the relative merits of open and closed development approaches. Labs that had been reluctant to open-source their models faced renewed pressure to justify keeping weights proprietary, while the open-source community gained a powerful new proof point for their approach. Meta's Yann LeCun, a vocal advocate of open-source AI, repeatedly cited R1 as evidence that open weights would catch closed weights in capability.[9]
R1-Zero's spontaneous development of reasoning strategies through pure RL contributed to the ongoing scientific debate about emergent abilities in language models. The result suggested that reasoning was not something that needed to be explicitly taught through supervised learning but could arise naturally from optimization pressure on task performance. This finding influenced subsequent research across multiple labs exploring RL-based training for reasoning.[1][5]
R1's release accelerated development timelines across the industry. Several months after R1, multiple labs released improved reasoning models: OpenAI shipped o3-mini in January 2025, o3 in April 2025, and o4-mini in mid-2025; Google released Gemini 2.5 Pro with extended thinking in March 2025; Anthropic added extended thinking to Claude 3.7 Sonnet in February 2025 and Claude Opus 4 in May 2025; Alibaba released Qwen3 with native hybrid reasoning in April 2025 and the open-source QwQ-32B reasoning model. The competitive dynamic R1 created pushed the entire field forward at a faster pace than might otherwise have occurred.
The distillation recipe in particular was widely copied. Microsoft's Phi-4-Reasoning, Berkeley's Sky-T1-32B, Hugging Face's Open-R1, NVIDIA's OpenReasoning-Nemotron, and dozens of community models all used variants of R1's rejection-sampling-then-SFT recipe to bootstrap reasoning capabilities into smaller bases.
As a model developed by a Chinese company, DeepSeek-R1 faced regulatory scrutiny in multiple Western countries. Concerns centered on data privacy (DeepSeek's servers are located in China, subject to Chinese data laws), potential content alignment with Chinese government positions, and national security implications of a widely deployed Chinese AI model.[15]
The US government response to DeepSeek was swift and multi-pronged. On February 6, 2025, Representatives Josh Gottheimer and Darin LaHood introduced the bipartisan "No DeepSeek on Government Devices Act," which specifically targeted the DeepSeek mobile application and API for prohibition on federal government devices. The bill passed in August 2025, banning federal employees from using the app on government-issued devices.[15][20]
Additional legislative efforts included Representative Mark Green's China Technology Transfer Control Act and Senator Josh Hawley's "Decoupling America's Artificial Intelligence Capabilities from China Act," introduced on January 29, 2025. Gottheimer and LaHood also wrote to all 50 US governors urging them to implement similar bans at the state level.[15][20]
Multiple government agencies independently restricted or banned use of DeepSeek products:[15][27]
| Agency / government | Action | Date |
|---|---|---|
| U.S. Navy | Issued warnings against use | Late January 2025 |
| NASA | Reinforced security concerns | January 31, 2025 |
| Italy (Garante) | Ordered limitation on processing of Italian users' data; app removed from Apple/Google stores | January 30, 2025 |
| Texas | First U.S. state ban on government systems | February 2025 |
| Virginia | Banned on government systems | February 2025 |
| New York | Banned on government systems | February 2025 |
| U.S. Congress | Restricted on congressional devices | February 2025 |
| Pentagon | Restricted usage | February 2025 |
| Australia | Government agency restrictions; Home Affairs Minister Tony Burke cited national security | February 4, 2025 |
| South Korea | Government restrictions; later removed from app stores | February 2025 |
| Taiwan (MODA) | Banned on government devices and critical infrastructure | February 2, 2025 |
| India (Finance Ministry) | Restricted on government devices | February 2025 |
The Fiscal Year 2026 National Defense Authorization Act, signed in December 2025, included provisions restricting DeepSeek usage within the Department of Defense and Intelligence Community.[15]
Separately, OpenAI accused DeepSeek of improperly distilling from OpenAI models. The accusation surfaced publicly within days of R1's release, with OpenAI claiming it had "some evidence" that DeepSeek used outputs from OpenAI APIs to train R1, in violation of OpenAI's terms of service. In a February 2026 Bloomberg report, OpenAI escalated the claim in a memo to U.S. lawmakers, alleging that DeepSeek had developed methods to circumvent OpenAI's access restrictions through obfuscated third-party routers and other means. DeepSeek did not formally admit to using distillation in training R1's reasoning capabilities. The peer-reviewed Nature paper acknowledged that R1's training data, scraped from the open web, would inevitably contain text generated by other LLMs, while denying targeted distillation of OpenAI's reasoning traces.[22][28][29]
The accusation raised broader legal and ethical questions about "distillation" as a practice. Many in the open-source community pointed out that OpenAI itself had been sued for training on copyrighted material scraped from the web, making the position uncomfortable. The episode also illustrated a structural issue in modern LLM development: as long as proprietary models are accessible via API, their outputs can be used to train competitors, and detection is technically difficult.
Anthropic escalated the distillation issue further in February 2026 with a public blog post alleging that DeepSeek, Moonshot AI, and MiniMax had together used roughly 24,000 fake accounts to generate more than 16 million exchanges with Claude. Anthropic argued that the campaigns were systematic rather than incidental: scripted multi-turn sessions designed to elicit reasoning traces, code, and tool-use patterns that could be used as supervised fine-tuning data. The post specifically cited regional access restrictions and terms-of-service violations, and Anthropic stated that the accounts were taken down as soon as they were detected.[28][34]
Neither DeepSeek, Moonshot, nor MiniMax publicly responded to the specific allegations beyond general statements that all training data was obtained legally. Independent verification was difficult because Anthropic's technical evidence was not published in full. The episode sat alongside the OpenAI memo as one of the most prominent industry-level claims of cross-lab distillation, and it became part of the political case that several U.S. lawmakers used to argue for stronger restrictions on Chinese labs' API access to Western frontier models.[28][34]
The Trump administration's America's AI Action Plan, released on July 23, 2025, used R1's launch as a recurring framing device for the case that the United States was at risk of losing the AI race. The phrase "DeepSeek shock" appeared in the plan's preamble and in several supporting executive orders signed alongside it; the plan named open-source competition from China and the V3 / R1 release in particular as evidence that "innovation is no longer a US monopoly," and proposed accelerated permitting for AI data centers, expanded chip export restrictions, and new federal procurement preferences for American models. R1 thus became, indirectly, the policy event that justified the largest US AI infrastructure push of 2025.[35][36]
President Trump's January 27, 2025 reference to R1 as a "wake-up call" was reused throughout 2025 administration communications. By the time the International AI Safety Report was updated in early 2026 under Yoshua Bengio's chair, R1 was treated as the canonical example of "rapid catch-up dynamics" between leading and following labs, and the report explicitly used DeepSeek's recipe as evidence that frontier capabilities propagated faster than previous safety assumptions had expected.[36]
In February 2026, a senior US government official told reporters that DeepSeek had trained an upcoming model (then assumed to be a successor to V3.2) on Nvidia Blackwell chips, the company's most advanced AI accelerator. Blackwell falls under strict US export controls, and the chips were not legally available for sale in mainland China. Subsequent reporting by The New York Times and Bloomberg tied the allegation to a DeepSeek-affiliated data center in Inner Mongolia and to the longstanding pattern of restricted hardware reaching Chinese labs through third-country resellers. DeepSeek has not publicly commented on Blackwell access. The episode reignited the export-control debate that R1 had originally provoked: if a small lab could close most of the gap to OpenAI on H800-class hardware, the implications of Chinese labs gaining access to current-generation Nvidia chips were significant.[37][38]
Content analysis documented instances of refusal or evasion related to politically sensitive topics, including the 1989 Tiananmen Square massacre, the political status of Taiwan, the treatment of Uyghurs in Xinjiang, and the comparison of Xi Jinping to other figures. DeepSeek consistently avoided substantive engagement with these subjects when accessed through chat.deepseek.com or the official API. Behavior on the open weights was more nuanced: when the same questions were posed to a self-hosted instance of R1 through bare-metal inference, the model often produced fuller answers, suggesting that some of the refusal behavior was implemented as a server-side filter rather than baked into the weights themselves.[15][20][30]
A May 2025 academic paper titled "R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model" mapped the topology of refusals across hundreds of prompts. Most Type 2 refusals (full silence rather than alignment-style explanations) clustered around Tiananmen Square, suggesting that this topic remained particularly sensitive across both the chat product and the model weights.[30]
Technical security analyses identified several concerns with DeepSeek's infrastructure. Reports cited hidden code linking to China Mobile servers in the mobile app, the collection of keystroke timing data, data storage on Chinese servers subject to Chinese government access requests, and several cybersecurity test failures including jailbreak resistance below industry baselines. These restrictions applied primarily to government use. Commercial and individual use of R1's open weights remained unrestricted in most jurisdictions, and the model continued to be widely deployed through cloud providers and self-hosted infrastructure. The open-source nature of the model weights meant that security concerns about DeepSeek's servers could be mitigated by self-hosting, though the underlying questions about training data and possible weight-level alignment with Chinese government positions remained.[15][20]
R1 has well-documented limitations that practitioners learned to work around:
\boxed{...} markers, requiring downstream parsing logic that handled missing or malformed markers.R1 was the first reasoning model in what became a steady release cadence from DeepSeek through 2025 and into 2026.
| Release | Date | Notes |
|---|---|---|
| DeepSeek-R1 | Jan 20, 2025 | Initial release; companion R1-Zero; six distilled variants |
| DeepSeek-R1-0528 | May 28, 2025 | Major update; deeper thinking; function calling; JSON; system prompts; R1-0528-Qwen3-8B distill |
| DeepSeek-V3.1 | Aug 19, 2025 | First hybrid model: chat and reasoning in one set of weights with a thinking-mode toggle |
| DeepSeek-V3.2-Exp | Sep 29, 2025 | Experimental release introducing DeepSeek Sparse Attention |
| DeepSeek-OCR | Oct 2025 | Vision-language OCR model |
| DeepSeek-V3.2 | Dec 1, 2025 | Production hybrid model with thinking integrated into tool use |
| DeepSeek-V3.2-Speciale | Q1 2026 | High-compute variant; gold-medal results on IMO 2025, IOI 2025, ICPC World Finals |
| DeepSeek-V4 (anticipated) | April 2026 (rumored) | Next-generation flagship; expected to fold reasoning, tool-use, and multimodal into single weights |
The V3.1 hybrid release in August 2025 effectively absorbed R1's role: a single set of weights could now serve both as a fast chat model and (with a thinking-mode toggle) as a reasoning model. V3.1's deep-thinking mode achieved roughly 90 to 95% of R1-0528's performance on reasoning benchmarks while sharing weights with a normal chat model, removing the need to load a separate reasoning model in production. By the time V3.2 launched in December 2025, R1 was no longer DeepSeek's recommended model for new applications, though it remained widely cited and deployed because of its open-source release and well-understood behavior.[31]
DeepSeek did not release a model branded R2 in 2025 or in the first five months of 2026. Reuters reported in March 2025 that DeepSeek was racing to ship a successor to R1, and Chinese-language tech outlets carried multiple "R2 imminent" rumors throughout the spring and summer of 2025, citing anonymous sources and partial leaks of model weights. None of these matured into an actual release. Instead, the May 2025 update was branded R1-0528, and the August 2025 hybrid model was branded V3.1 rather than R2. As of May 2026, DeepSeek's reasoning capability lives inside the V3.x hybrid line; whether the next reasoning-focused branded release will be called R2, V4 with a reasoning toggle, or something else has not been announced.[32][39]
The closest thing to an "R2" in a research sense was the second-generation R1 distilled model, R1-0528-Qwen3-8B. It demonstrated that the same recipe applied to the newer Qwen3 base produced larger gains than the original R1 distills had achieved on Qwen2.5, but it was packaged as a distill rather than a new flagship.[12][26]
Liang Wenfeng, DeepSeek's founder and CEO, was unusually quiet for the founder of a model that had moved $1 trillion of US market capitalization. He gave no major Western media interview in the immediate aftermath of R1's release. The most cited public statements came from earlier 2024 interviews with Chinese outlet 36Kr and a small set of remarks reported through Fortune, Wired, and Reuters in late January 2025. Liang's framing emphasized scientific curiosity, an unwillingness to take outside investment that might compromise long-horizon research, and a belief that "innovation gaps are narrow but they take a long time to close." He did meet with Premier Li Qiang in Beijing on January 20, 2025, the same day as R1's release, an event that was widely covered as a tacit signal of state attention but that did not visibly change the company's research direction.[6][23][40]
In a rare 2025 essay published on the company's website, Liang argued that the most valuable contribution of R1 was not the benchmark numbers themselves but the published recipe; he wrote that the open release was intended to "convert proprietary research into shared infrastructure." That essay, combined with DeepSeek's continued open releases through V3.1 and V3.2, became part of the case that other Chinese labs (Alibaba's Qwen, Baidu, ByteDance, MiniMax) used to defend their own open-source strategies during 2025.[19][40]
DeepSeek's headcount remained small by frontier-lab standards. Reporting through 2025 placed the team at roughly 150 to 200 employees, most of them young researchers hired directly out of Chinese universities. The lab declined repeated approaches from outside investors during the year and continued to be funded entirely through High-Flyer's profits and an internal split with the parent quant fund. The combination of small headcount, no outside capital, and a steady release cadence became one of the more durable narratives around DeepSeek in the year after R1.[6][23]
Within China, R1 and its successors became the de facto open-source reasoning baseline for most of the largest technology platforms during 2025. Tencent integrated R1 into its Yuanbao consumer app within weeks of release, eventually offering both R1 and R1-0528 as user-selectable reasoning modes alongside Tencent's own Hunyuan models. Tencent also reported running R1 inside parts of the WeChat search experience, a deployment that significantly amplified the model's everyday user base.[41]
Alibaba Cloud added R1 to its Bailian model platform in late January 2025 and offered it as a managed endpoint at prices closely tracking DeepSeek's own. This was notable because Alibaba's own Qwen3 was the most direct domestic competitor to R1; Alibaba's willingness to host the rival model was treated as evidence that Chinese cloud customers explicitly demanded R1 access regardless of the host platform's own model line. Baidu, ByteDance, and several smaller Chinese clouds followed similar patterns through 2025.[41][42]
Enterprise adoption also moved into industries with limited prior AI exposure. Reports from Chinese state-affiliated outlets through 2025 covered R1 deployments in legal-document analysis, manufacturing process optimization, hospital triage chatbots, and government service portals. While many of these case studies were promotional, the volume of independent confirmation through hiring listings and procurement notices was substantial. By the second half of 2025, "DeepSeek-compatible" had become a recognizable procurement category in Chinese government IT bids.[41][42]
Chinese smartphone vendors integrated R1 distilled models directly on-device. Xiaomi added the 7B and 14B distills to the HyperOS 2 assistant in February 2025; Honor and Oppo followed with similar integrations during the spring. By mid-2025 several flagship Android handsets shipped a local reasoning mode powered by an R1-Distill checkpoint, the first time a frontier-grade reasoning model ran natively on consumer phones.[19][41]
R1's performance was tracked through 2025 and into 2026 across several independent benchmarking aggregators, which gave a clearer picture of how it held up against successive frontier releases.
| Aggregator | Methodology | R1 launch position | R1-0528 position (mid-2025) | Position by April 2026 |
|---|---|---|---|---|
| Vellum LLM Leaderboard | Composite score across reasoning, coding, instruction-following | Top 5 overall, leading open-source | Top 10 overall after o3 and Gemini 2.5 Pro launched | Outside top 15 after GPT-5.2 and Claude Opus 4.7 |
| Artificial Analysis Quality Index | Aggregate score over reasoning, math, coding | Score around 60 at launch, leading open-source | Score around 68 with R1-0528 | Score around 70, behind newer hybrid models including DeepSeek's own V3.2-Speciale |
| LMArena (Chatbot Arena) | Pairwise human vote ELO | Around 1320 at launch, top-five overall | Around 1370 in June 2025, top-three open-source | Around 1380, behind GPT-5-class and Claude Opus 4-class models |
| Aider polyglot coding bench | End-to-end coding edit pass rate | 53.3% at launch | 71.6% with R1-0528 | Surpassed by GPT-5.1 Codex variants and Claude Sonnet 4.5 |
The trajectory tells a consistent story: R1 was at or near the frontier at launch, R1-0528 closed most of the gap that had opened up with the spring 2025 reasoning releases, and then later 2025 and early 2026 frontier models (GPT-5, GPT-5.1, GPT-5.2, Claude Sonnet 4.5, Claude Opus 4.7, Gemini 2.5 Pro and Gemini 3) opened a new frontier that the original R1 line did not chase. R1 remained dominant only in the "best open-source reasoning model under MIT license" niche, and even that niche was contested through 2025 by Qwen3, Mistral Magistral, and OpenAI's own gpt-oss family.[19][43][44]
The table below summarizes R1 (and its successor R1-0528) against the other reasoning-capable frontier models that defined the field through 2025 and early 2026. All numbers reflect each model's position at the time it was the headline release.
| Model | Release | License | AIME 2024 (or closest) | SWE-bench Verified | Context length | Notable property |
|---|---|---|---|---|---|---|
| OpenAI o1 | Sep / Dec 2024 | Proprietary | 79.2% | 48.9% | 128K | First widely deployed reasoning model |
| DeepSeek-R1 | Jan 20, 2025 | MIT | 79.8% | 49.2% | 128K | First open-weight match for o1 |
| OpenAI o3 | Jan / Apr 2025 | Proprietary | 91.6% | 71.7% | 200K | First model to clear 90% on AIME 2024 |
| DeepSeek-R1-0528 | May 28, 2025 | MIT | 91.4% | 57.6% | 128K | Closed most of the o3 gap on math |
| Gemini 2.5 Pro | Mar / Jun 2025 | Proprietary | 88.0% | 63.2% | 1M | "Thinking by default" across the line |
| Claude Sonnet 4.5 | Sep 29, 2025 | Proprietary | 87.0% | 77.2% | 1M | New SWE-bench Verified leader at launch |
| GPT-5 | Aug 7, 2025 | Proprietary | 94.6% | 74.9% | 400K | Unified router architecture |
| GPT-5.1 | Nov 12, 2025 | Proprietary | 94.0%+ | 76.3% | 400K | Bifurcated Instant / Thinking variants |
| GPT-5.2 | Dec 11, 2025 | Proprietary | 96%+ | 78.0%+ | 400K | New ARC-AGI-2 leader (52.9%) |
| Claude Opus 4.7 | Apr 16, 2026 | Proprietary | High 90s | High 70s | 1M | Opus-tier flagship; paired with Project Glasswing |
| gpt-oss | Aug 5, 2025 | Apache 2.0 | High 80s | Mid 50s | 131K | OpenAI's first open-weight model since GPT-2 |
R1's position in this lineage is structural rather than purely technical. It established the bar for what "open-weight reasoning model" meant, and most subsequent open-weight reasoning models (Qwen3-Thinking, Llama 4 Reasoning, Magistral, gpt-oss) were measured against it. By 2026 the open-weight tier had largely caught up to R1's specific recipe, but the broader principle that frontier reasoning could ship under a permissive license remained, and that principle is generally credited to R1.[19][43]
R1's published recipe became one of the most-cited methodological papers of 2025 in the language-model literature. Three threads of follow-on research stand out.
The first is methodological work on GRPO itself. Variants including DAPO (Decoupled Advantage Policy Optimization), GRPO+ (used by Alibaba's Qwen team), and REINFORCE++ have appeared in 2025 papers, all of which build on the basic group-relative advantage trick R1 popularized. By the end of 2025, GRPO and its derivatives were the dominant RL recipe in published reasoning-model papers, displacing the earlier PPO-based RLHF pipeline that had been standard since InstructGPT. Hugging Face's TRL library and Allen AI's TRLX both shipped GRPO implementations within weeks of R1's release.[16][17]
The second thread is the "R1 replication" project. Hugging Face's Open-R1, Berkeley NovaSky's Sky-T1, the Together AI / Stanford TinyZero work, the SimpleRL-Reason project, and the BeReal-style "small base + GRPO" experiments all attempted to reproduce R1's training trajectory using only public data and open base models. None of these matched the original R1 on benchmarks, but several reproduced the qualitative emergence of reflection behaviors, including the "wait, let me reconsider" patterns described in the original paper. The Sea AI Lab "There May Not be Aha Moment" paper sat in this thread as a critical counterpoint, arguing that some of the apparent emergence was inherited from base model pre-training rather than from RL itself.[7][16][17]
The third thread is application-specific reasoning models distilled from R1. Medical reasoning (Med-R1, Apollo-R1), legal reasoning (Lawyer-R1), embodied agents (R1-Robot), and scientific-discovery agents (Sci-R1) all use R1 or one of its distillations as the teacher model. The MIT license made these projects possible without licensing negotiations, and by mid-2026 several of them had moved beyond research into commercial deployment.[19][45]
R1 is also one of the few language models to have a major IMO (International Mathematical Olympiad) problem set named after it: the 2025 community-organized "R1-IMO" prize, which evaluated several open and closed reasoning models on previously unseen olympiad problems. R1-0528 placed third in that evaluation behind o3-pro and Gemini 2.5 Pro.[44]
As of May 2026, DeepSeek-R1 and its derivatives remain among the most widely studied open-source reasoning models, even though DeepSeek's own product line has moved on to the V3.x hybrid family and the anticipated V4. R1-0528 continues to be available through the DeepSeek API at the original prices and through every major third-party inference provider. The 32B and 70B distilled models remain popular as locally hostable reasoning baselines, and the smaller distills (1.5B, 7B, 8B) are widely used as base models for further fine-tuning rather than as deployment endpoints.
The model's legacy is best measured by its influence on the field. R1 proved that reasoning-capable language models could be built openly and cheaply, that reinforcement learning could induce genuine reasoning behaviors without supervised examples, and that a small team with limited resources could compete with the largest AI labs in the world. The recipe it published, GRPO with rule-based rewards on verifiable tasks, has become the dominant approach for training reasoning models across both open-source and commercial labs. Most reasoning models released in 2025 and 2026, from Qwen's QwQ family to Microsoft's Phi-Reasoning to community-trained models on Hugging Face, used some variant of the R1 recipe.
R1 also reset expectations for what a model release should look like. The combination of a permissive license, a detailed training recipe, peer-reviewed publication, six pre-distilled variants, and aggressive API pricing became a de facto template that other open-source labs were measured against. When subsequent releases (Qwen3, Llama 4 Reasoning, Mistral Magistral) were perceived as stinting on documentation or imposing restrictive licenses, the comparison was usually to R1.
The financial and policy aftershocks lasted longer than the model itself. The "DeepSeek shock" of January 27, 2025 is now treated as the canonical market event of the AI boom, alongside ChatGPT's November 2022 launch and OpenAI's GPT-5 keynote in August 2025. It catalyzed the United States' AI Action Plan, accelerated US chip export controls, prompted the Anthropic and OpenAI public claims of cross-lab distillation, and put open-weight reasoning permanently inside the policy conversation. Even after the cost numbers were reframed, the directional finding (that frontier reasoning capability had become cheap enough for a focused team to reach) has held up.