DeepSeek-R1
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v10 · 6,041 words
| DeepSeek-R1 | |
|---|---|
| Developer | DeepSeek |
| Release date | January 20, 2025 |
| Type | Large language model (reasoning model) |
| Architecture | Mixture of Experts (MoE), Transformer with MLA |
| Base model | DeepSeek-V3-Base (671B / 37B MoE) |
| Parameters | 671 billion total; 37 billion active per token |
| Context length | 128,000 tokens |
| Training algorithm | Group Relative Policy Optimization (GRPO) |
| Reward signal | Rule-based (verifiable math, code) plus format reward |
| Reported RL compute | 512 Nvidia H800 GPUs for ~80 hours |
| Reported RL rental cost | ~$294,000 (R1 RL stage only, disclosed in Nature Sept 2025) |
| License | MIT (weights, distills, and derived outputs) |
| Companion model | DeepSeek-R1-Zero (RL-only from V3-Base, no SFT) |
| Distilled variants | 6 dense models: Qwen 1.5B / 7B / 14B / 32B, Llama 8B / 70B |
| Updated version | DeepSeek-R1-0528 (May 28, 2025) |
| Paper | "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", arXiv:2501.12948; published Nature 645, 633-638 (Sept 18, 2025) |
DeepSeek-R1 is an open-weight reasoning-focused large language model developed by DeepSeek, a Chinese artificial intelligence laboratory spun out of the High-Flyer quantitative hedge fund. Released on January 20, 2025 under the MIT license, R1 was the first major open-weight model to match the reasoning performance of OpenAI's proprietary o1 across mathematics, coding, and graduate-level science benchmarks. The model is built on the DeepSeek-V3-Base Mixture of Experts architecture, with 671 billion total parameters of which 37 billion are activated per token, inheriting V3's 128,000-token context window.[1][2]
DeepSeek shipped three related artifacts on the same day. DeepSeek-R1-Zero was trained by applying reinforcement learning directly to the V3 base model with no supervised fine-tuning, and demonstrated that chain-of-thought reasoning, self-reflection, and error correction could emerge spontaneously from rule-based rewards alone. DeepSeek-R1 itself used a multi-stage pipeline of cold-start supervised fine-tuning followed by reinforcement learning from verifiable rewards, producing a production-ready reasoning model. DeepSeek-R1-Distill consisted of six smaller dense models, distilled from R1's reasoning traces onto Qwen 2.5 and Llama 3 base checkpoints rather than DeepSeek's own backbone.[1][2]
R1's release triggered a market event widely called the "DeepSeek shock." On January 27, 2025, Nvidia's stock fell roughly 17%, losing approximately $589 billion in market capitalization in a single session, the largest single-day market value loss for any company in U.S. stock market history. The DeepSeek mobile app also briefly displaced ChatGPT atop the U.S. Apple App Store. The shock stemmed from the revelation that a small Chinese laboratory had produced a frontier reasoning model using a reported RL-stage rental cost of about $294,000 on export-restricted H800 GPUs, undermining the assumption that frontier AI required tens of billions in capital expenditure. The accompanying technical paper became the first major open-weight LLM paper to pass independent peer review, appearing on the cover of Nature on September 18, 2025.[3][4][5]
DeepSeek's path to R1 began inside a hedge fund. High-Flyer, a Chinese quantitative trading firm co-founded in 2016 by Liang Wenfeng and his Zhejiang University classmates, accumulated tens of thousands of Nvidia GPUs through the late 2010s for stock-prediction workloads. By 2020 it operated one of the largest private AI training clusters in China. Liang spun the AI research arm into an independent company, DeepSeek, in July 2023, seeded with engineers experienced in squeezing performance out of large GPU pools. DeepSeek was bootstrapped from hedge fund profits and took no outside investment before R1's release.[6][7]
DeepSeek built toward R1 throughout 2024. The company released DeepSeek-V2 in May 2024 and DeepSeek-V3 on December 26, 2024, both MoE models prioritizing computational efficiency. V3 served as the base model for R1, providing a strong general foundation. V3 itself was trained on roughly 14.8 trillion tokens at a reported GPU-rental cost of about $5.576 million, a figure that would later become entangled in cost debates around R1.[2][7]
The immediate scientific context was the emergence of inference-time reasoning. OpenAI's o1, previewed in September 2024 and released in December 2024, demonstrated that training models with reinforcement learning to "think before answering" could dramatically improve performance on difficult math, science, and coding tasks. OpenAI published no technical details on the recipe. DeepSeek's contribution was to show that the approach could be replicated with open weights at a small fraction of the apparent cost, and to publish the full training methodology.[1]
DeepSeek-R1 is best understood as three distinct but related releases that together formed the announcement of January 20, 2025.[1][2]
DeepSeek-R1-Zero was trained by applying large-scale reinforcement learning directly to the DeepSeek-V3-Base model, with no supervised fine-tuning and no curated reasoning examples. The model was simply given problems and rewarded for producing correct answers under a minimal template that required reasoning inside <think>...</think> tags and the final answer inside <answer>...</answer> tags.[1]
Despite never seeing a reasoning demonstration, R1-Zero spontaneously developed several reasoning behaviors during training: multi-step chain-of-thought decomposition, self-reflection ("wait, let me reconsider"), error detection and correction, alternative-strategy exploration, and adaptive allocation of thinking time on harder problems. The paper reported a striking trajectory on AIME 2024, where R1-Zero's pass@1 accuracy rose from 15.6% at the start of RL training to 71.0% by the end, reaching 86.7% under majority voting over 64 samples, matching OpenAI o1-0912's performance using only RL on a base model.[1][5]
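The difference between pass@1 and majority-vote accuracy can be illustrated with a small sketch. The function names and toy samples below are hypothetical and this is not DeepSeek's evaluation harness: pass@1 is estimated as the mean correctness over sampled answers, while majority voting scores only the most frequent answer.

```python
from collections import Counter

def pass_at_1(samples, reference):
    """Estimate pass@1 as the fraction of sampled answers that are correct."""
    return sum(s == reference for s in samples) / len(samples)

def majority_vote_correct(samples, reference):
    """Return True when the most frequent sampled answer equals the reference."""
    most_common_answer, _ = Counter(samples).most_common(1)[0]
    return most_common_answer == reference

# Toy data (hypothetical): five sampled final answers to one problem
# whose reference answer is "204".
samples = ["204", "204", "17", "204", "96"]

print(pass_at_1(samples, "204"))              # 0.6
print(majority_vote_correct(samples, "204"))  # True
```

A problem can therefore count as solved under majority voting even when a large share of individual samples are wrong, which is why the voted score sits above pass@1.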
DeepSeek highlighted what it called an "aha moment" during training, when the model began interrupting itself with phrases like "Wait, wait. Wait. That's an aha moment I can flag here" before backtracking. The paper interpreted this as evidence of a self-evolution process induced by optimization pressure. Subsequent work, including a Sea AI Lab study titled "There May Not be Aha Moment in R1-Zero-like Training," argued that some of these behaviors may have been inherited from reflective patterns already present in the base model's pre-training data, and the Nature version of the R1 paper engaged with these critiques directly.[5][8]
R1-Zero's outputs suffered from poor readability, language mixing between Chinese and English, and inconsistent answer formatting. These limitations motivated the multi-stage training pipeline used for R1 proper. R1-Zero was released alongside R1 under the same MIT license so that researchers could study the unfiltered behavior of an RL-only reasoning model.[1]
DeepSeek-R1 itself was produced by a four-stage pipeline built on DeepSeek-V3-Base.[1][5]
Stage 1: Cold start. A small set of curated long chain-of-thought examples (a few thousand samples) was used to supervised-fine-tune V3-Base, addressing the readability and language-mixing failures observed in R1-Zero.
Stage 2: Reasoning-oriented RL. Large-scale GRPO training was applied on verifiable tasks (mathematics, coding, logic) using rule-based rewards. A language-consistency reward was added to suppress mid-response language switching.
Stage 3: Rejection sampling and SFT. The RL-trained model generated a large pool of reasoning traces. High-quality traces were selected by rejection sampling and combined with non-reasoning data to produce roughly 800,000 samples (about 600,000 reasoning and 200,000 general). V3-Base was then fine-tuned on this dataset for two epochs.
Stage 4: All-scenario RL. A final RL stage covered both reasoning and general tasks, combining rule-based rewards with model-based rewards for helpfulness and harmlessness.
The same 800,000-sample dataset created in Stage 3 was reused to fine-tune all six distilled variants.[1]
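The rejection-sampling step in Stage 3 can be sketched as a simple filter over generated traces. This is a minimal illustration that assumes correctness of the final answer is the only acceptance criterion; the function name, the data layout, and the `max_keep` parameter are hypothetical rather than taken from the paper.

```python
def rejection_sample(candidates, reference, max_keep=1):
    """Keep up to max_keep reasoning traces whose final answer is correct.

    candidates: list of (trace_text, final_answer) pairs sampled for one prompt.
    reference:  the verified correct answer for that prompt.
    """
    kept = [trace for trace, answer in candidates if answer == reference]
    return kept[:max_keep]

# Toy candidates (hypothetical) for the prompt "What is 2 + 2?"
candidates = [
    ("<think>2+2=5</think>", "5"),
    ("<think>2+2=4</think>", "4"),
    ("<think>two plus two is four</think>", "4"),
]

# Only traces that reach the reference answer survive as SFT data.
sft_traces = rejection_sample(candidates, reference="4", max_keep=1)
print(sft_traces)  # ['<think>2+2=4</think>']
```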
DeepSeek-R1-Distill is a family of six dense (non-MoE) models created by supervised fine-tuning open-source base checkpoints on R1's 800K reasoning trace dataset. No additional reinforcement learning was applied in the initial release. The distillation targets used base models from outside DeepSeek's own line: four from Alibaba's Qwen 2.5 family and two from Meta's Llama 3 family.[1][2]
| Distilled model | Base model | Parameters | License (weights) |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1.5B | Apache 2.0 / MIT |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 7B | Apache 2.0 / MIT |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 8B | Llama 3 / MIT |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | Apache 2.0 / MIT |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | Apache 2.0 / MIT |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 70B | Llama 3 / MIT |
The smallest distill (1.5B) outperformed GPT-4o on mathematical benchmarks while being small enough to run on consumer hardware. The 32B Qwen distill became the most widely deployed open-source reasoning model of 2025, fitting on a single 24 GB consumer GPU at 4-bit quantization and reaching 72.6% on AIME 2024, beating OpenAI o1-mini (63.6%). DeepSeek included an ablation in the paper showing that distilling from R1 outperformed running GRPO directly on the same small base, a finding that shaped subsequent open-source training recipes.[1][2]
These three models are commonly confused in coverage. R1-Zero is the scientific demonstration (no SFT, pure RL). R1 is the production reasoning model (SFT plus RL with cold start). R1-Distill is the small-model family, built on Qwen and Llama bases rather than on DeepSeek's own backbone.[1]
DeepSeek-R1 inherits its architecture from DeepSeek-V3-Base unchanged.[2][7]
The MoE design is central to R1's serving economics. Activating only 37 billion of its 671 billion parameters keeps per-token inference costs comparable to a much smaller dense model while retaining the knowledge capacity of the full parameter count. MLA further reduces the memory pressure that would otherwise make long reasoning chains expensive to serve, which matters for a model that routinely emits thousands of intermediate tokens before answering. None of these architectural choices were novel to R1; the contribution was post-training, not architecture.[2][7]
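The share of the model that is active for any one token follows directly from the two parameter counts. A short worked calculation, using only the figures quoted above (variable names are illustrative):

```python
TOTAL_PARAMS_B = 671   # total parameters, in billions
ACTIVE_PARAMS_B = 37   # parameters activated per token, in billions

# Fraction of the full parameter count that participates in each token's
# computation under the MoE routing described above.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"{active_fraction:.1%}")  # 5.5%
```

Per-token compute therefore scales with roughly one-eighteenth of the total parameter count, although all 671 billion parameters must still be held in memory to serve the model.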
The distilled models are dense (non-MoE) transformers using the architectures of their respective Qwen 2.5 and Llama 3 base checkpoints, which is why their licenses inherit from the bases rather than being purely MIT.
GRPO is the reinforcement learning algorithm used to train both R1-Zero and R1. GRPO was originally introduced by DeepSeek in the February 2024 DeepSeekMath paper (arXiv:2402.03300) and refined for use in R1. It differs from Proximal Policy Optimization (PPO), the algorithm used in classical RLHF pipelines, in one critical way: GRPO eliminates the separate value (critic) model.[9][10]
In PPO-based RLHF, two models must be maintained during training: the policy being optimized and a value model that estimates expected returns. The value model can be as large as the policy itself, effectively doubling memory requirements. GRPO replaces the value model with a baseline computed from group statistics.[9][10]
The algorithm works as follows: for each prompt, the policy samples a group of responses, each response receives a reward, and the advantage of response i is its reward normalized by the mean and standard deviation of the rewards within that group:[1][9][10]
advantage_i = (reward_i - mean) / std
The group-relative approach normalizes rewards within each problem, reduces the impact of reward scale differences across problem types, cuts training memory roughly in half compared to PPO, and is simpler to implement and tune. Since R1's release, GRPO has become the de facto standard for training open-source reasoning models, displacing PPO and DPO as the preferred recipe. Hugging Face's TRL library, Allen AI's TRLX, and several other RL libraries shipped native GRPO support within weeks of R1's release.[9][10]
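The advantage formula can be sketched in a few lines. This is a minimal illustration, not DeepSeek's training code: the function name is hypothetical, the use of the population standard deviation is an assumption, and the small `eps` term is added here only to avoid division by zero when every response in a group earns the same reward.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own group.

    rewards: rewards for all responses sampled for ONE prompt.
    eps:     assumed stabilizer for groups with zero variance.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt: two correct (1.0), two wrong (0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # approximately [1, -1, -1, 1]
```

Because the baseline is just the group mean, no separate value network has to be trained or held in memory, which is the saving relative to PPO described above.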
DeepSeek used a deliberately simple reward design: an accuracy reward (right or wrong on a verifiable answer) plus a format reward (the model is required to enclose its reasoning in <think>...</think> tags). No human-preference reward model was used during the reasoning-oriented stages, sidestepping both the cost of preference annotation and the failure mode of reward hacking against a learned reward signal. This pattern, RL on verifiable rewards rather than learned reward models, is now commonly called reinforcement learning from verifiable rewards (RLVR) and is a direct legacy of R1.[1][5]
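A rule-based reward of this kind can be sketched with a regular expression and an exact-match check. The sketch below is illustrative only: the function names, the exact tag pattern, the string-equality answer check, and the equal weighting of the two rewards are all assumptions, not details confirmed by the paper.

```python
import re

# Assumed template: reasoning in <think> tags, then the answer in <answer> tags.
TEMPLATE = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)

def format_reward(completion):
    """1.0 when the completion follows the tag template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0

def accuracy_reward(completion, reference):
    """1.0 when the extracted answer exactly matches the reference, else 0.0."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0

def total_reward(completion, reference):
    """Sum of the two rule-based rewards (equal weighting is an assumption)."""
    return accuracy_reward(completion, reference) + format_reward(completion)

print(total_reward("<think>2+2=4</think> <answer>4</answer>", "4"))  # 2.0
print(total_reward("<think>2+2=5</think> <answer>5</answer>", "4"))  # 1.0
print(total_reward("The answer is 4", "4"))                          # 0.0
```

Both checks are deterministic, so there is no learned reward model for the policy to exploit; real math and code verifiers would replace the string comparison with answer normalization or test execution.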
For R1 (as distinct from R1-Zero), the first SFT stage was deliberately small: a few thousand long-CoT samples written or curated to demonstrate clean reasoning structure and consistent language usage. This cold start gave the RL stage a more readable starting point than V3-Base would have provided. Subsequent stages stacked: reasoning-oriented RL, then rejection-sampling SFT on the resulting traces, then a final all-scenario RL pass that broadened behavior to non-reasoning tasks.[1][5]
The same 800K-sample dataset assembled in R1's Stage 3 was used directly to fine-tune the six distilled variants. The choice to use Qwen and Llama bases rather than DeepSeek's own architectures meant the distills could ride on widely-deployed open-weight ecosystems with mature tooling (vLLM, SGLang, llama.cpp, MLX). DeepSeek noted in the paper that running RL directly on these small bases produced worse results than distillation, a finding subsequent open-source projects have largely replicated.[1]
In the Nature publication of September 2025, DeepSeek disclosed that the reinforcement learning portion of R1's training used 512 Nvidia H800 GPUs for approximately 80 hours, at an estimated rental cost of about $294,000 assuming $2 per GPU-hour. The supplementary materials acknowledged for the first time that DeepSeek also owned A100 GPUs and used them for preparatory experiments at smaller scale.[5][11]
The $294,000 figure refers only to the RL stage that converted V3-Base into R1. It excludes the cost of training V3 itself (about $5.576 million in rented compute), the cost of generating cold-start data, the cost of distillation, salaries, and depreciation of the underlying GPU cluster. The Register and CNN Business both noted that the end-to-end cost of producing R1 was roughly an order of magnitude larger than the headline figure, though still dramatically below comparable Western reasoning-model budgets. SemiAnalysis's January 2025 reconstruction estimated DeepSeek's underlying cluster (around 50,000 Hopper-class GPUs accumulated by High-Flyer) at roughly $1.6 billion in retail value, with annual operational expenditure closer to $1.3 billion. The narrower per-run cost figures held up; the framing of a "$6 million startup" did not.[11][12][13]
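The figures quoted in this section can be cross-checked with simple arithmetic. The sketch below uses only numbers stated above and the assumed $2 per GPU-hour rate; variable names are illustrative. Note that 512 GPUs running for 80 hours comes to about 41,000 GPU-hours (roughly $82,000 at that rate), so the $294,000 headline corresponds to about 147,000 GPU-hours and presumably spans more than a single 80-hour run; the text here does not give the breakdown.

```python
RATE_USD_PER_GPU_HOUR = 2.0   # rental rate assumed in the text

# GPU-hours and cost implied by the quoted hardware and duration.
gpu_hours_quoted_run = 512 * 80
cost_quoted_run = gpu_hours_quoted_run * RATE_USD_PER_GPU_HOUR

# GPU-hours implied by the $294,000 headline at the same rate.
gpu_hours_headline = 294_000 / RATE_USD_PER_GPU_HOUR

# Headline figure plus the reported V3 pre-training rental cost.
combined_usd = 294_000 + 5_576_000
ratio_to_headline = combined_usd / 294_000

print(gpu_hours_quoted_run)          # 40960
print(cost_quoted_run)               # 81920.0
print(gpu_hours_headline)            # 147000.0
print(combined_usd)                  # 5870000
print(round(ratio_to_headline, 1))   # 20.0
```

Adding V3's pre-training cost alone puts the total at roughly 20 times the headline number, before data generation, salaries, or hardware depreciation are counted.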
DeepSeek-R1 reported performance competitive with OpenAI o1 across the major reasoning benchmarks of January 2025.[1][2]
| Benchmark | DeepSeek-R1 | OpenAI o1 (Dec 2024) | GPT-4o |
|---|---|---|---|
| AIME 2024 (pass@1) | 79.8% | 79.2% | 13.4% |
| MATH-500 | 97.3% | 96.4% | 60.3% |
| GPQA Diamond | 71.5% | 75.7% | 53.6% |
| Codeforces (rating / percentile) | 2,029 / 96.3 | 2,061 / 96.6 | n/a |
| MMLU | 90.8% | 91.8% | 87.2% |
| MMLU-Pro | 84.0% | 81.9% | 73.3% |
| LiveCodeBench (CoT) | 65.9% | 63.4% | 33.4% |
| SWE-bench Verified | 49.2% | 48.9% | 33.2% |
| AlpacaEval 2.0 (LC) | 87.6% | n/a | 51.1% |
| ArenaHard | 92.3% | n/a | 80.4% |
R1 matched or exceeded o1 on most math and coding benchmarks while trailing slightly on graduate-level science (GPQA Diamond), competitive programming (Codeforces), and MMLU. The combination of those benchmark numbers with an open-weight, MIT-licensed release was the central technical claim that drove both scientific interest and the market reaction.[1][2]
The distilled models posted their own state-of-the-art scores for dense open-source models. The 32B Qwen distill reached 72.6% on AIME 2024 and 94.3% on MATH-500, beating OpenAI's o1-mini (63.6% AIME) by nine points. The 70B Llama distill reached 70.0% on AIME 2024 and 94.5% on MATH-500. The 1.5B Qwen-Math distill, despite being small enough to run on a laptop, outperformed GPT-4o and Claude 3.5 Sonnet on math benchmarks.[1][2]
DeepSeek released R1, R1-Zero, and all six distilled models under the MIT License with one important addition: the license explicitly permits using API outputs to train other models, that is, distillation is expressly allowed. Most proprietary AI providers either prohibit using their outputs to train competing models or leave the question ambiguous; DeepSeek's terms removed legal friction from the wave of follow-on work.[1][2]
The distilled models inherit the upstream base licenses. Qwen-based distills are governed by Apache 2.0 on the base weights with the fine-tuning delta released under MIT; Llama-based distills are governed by the Llama community license on the base weights with the delta under MIT. In practice, this distinction rarely matters for research use but matters for production deployments that may need to comply with the Llama community license's monthly-active-user thresholds and use-policy restrictions.[2]
DeepSeek made all weights publicly available on Hugging Face (deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero, and the six DeepSeek-R1-Distill checkpoints), with model cards, configuration files, and reference inference code. The DeepSeek API offered R1 as deepseek-reasoner at pricing dramatically lower than competing reasoning models.[1][2][14]
| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| DeepSeek-R1 | $0.55 / 1M | $0.14 / 1M | $2.19 / 1M |
| OpenAI o1 | $15.00 / 1M | $7.50 / 1M | $60.00 / 1M |
| Anthropic Claude 3.7 Sonnet (thinking) | $3.00 / 1M | $0.30 / 1M | $15.00 / 1M |
R1 was roughly 27 times cheaper than o1 on a per-token basis for both uncached input and output. Within days of release, third-party inference providers including Together AI, Fireworks AI, Groq, OpenRouter, Hyperbolic, Lambda, and SambaNova all offered hosted endpoints for R1 or the distilled models, frequently at competitive prices and sometimes with significantly faster throughput than DeepSeek's own infrastructure. Microsoft Azure and Amazon Web Services, along with Nvidia's NIM inference platform, added R1 within weeks.[14][15]
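The price ratio can be reproduced from the table above. A short worked calculation using the listed prices in USD per million tokens; the dictionary layout and key names are illustrative.

```python
# Prices in USD per 1M tokens, copied from the pricing table.
prices = {
    "DeepSeek-R1": {"input": 0.55, "cached_input": 0.14, "output": 2.19},
    "OpenAI o1":   {"input": 15.00, "cached_input": 7.50, "output": 60.00},
}

# How many times more expensive o1 is than R1 for each token category.
ratios = {
    kind: prices["OpenAI o1"][kind] / prices["DeepSeek-R1"][kind]
    for kind in ("input", "cached_input", "output")
}

for kind, ratio in ratios.items():
    print(f"{kind}: {ratio:.1f}x")
# input: 27.3x
# cached_input: 53.6x
# output: 27.4x
```

The gap is wider still for cache-hit input tokens, where the listed prices differ by more than a factor of 50.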
The market reaction to R1's release became a defining financial event of early 2025. On January 27, 2025, one week after R1's release and after attention to the model had gone viral, U.S. technology stocks experienced one of their sharpest single-day sell-offs. The sell-off was triggered by a sudden reassessment of the AI investment thesis. For years, markets had priced semiconductor and cloud companies on the assumption that frontier AI required massive and growing capital expenditure. DeepSeek's demonstration that a roughly 160-person Chinese laboratory could produce a competitive reasoning model undermined that assumption.[3][4][16]
Nvidia's stock fell about 17% in a single session, closing at $118.58 and losing approximately $589 billion in market capitalization. This was the largest single-day market value loss for any company in U.S. stock market history; rounded press coverage frequently quoted it as "nearly $600 billion." Other semiconductor companies including Broadcom, Marvell, Micron, and TSMC also fell sharply. The Nasdaq Composite lost roughly $1 trillion in value over the session. Apple briefly retook the title of world's most valuable company as Nvidia's market cap dropped to roughly $2.8 trillion.[3][4][16]
The DeepSeek mobile app reached number one on the Apple App Store in the United States on January 27, displacing ChatGPT. The consumer ranking became part of the news cycle around the stock drop, and journalists pointed to it as a tangible signal that something had changed. By the end of January 2025, DeepSeek-R1's open weights had been downloaded more than 5 million times across Hugging Face mirrors.[3][4]
Marc Andreessen described the event as "AI's Sputnik moment," a comparison to the 1957 Soviet satellite launch that became the canonical framing in subsequent coverage. President Donald Trump, speaking at a Republican retreat the same week, called R1 a "wake-up call for our industries that we need to be laser-focused on competing to win." Chinese AI providers entered an aggressive price war in the weeks that followed, with some cutting API prices by up to 97%.[4][17]
A central element of R1's narrative was that it had been trained under U.S. export controls. The Biden administration's October 2022 chip export rules, tightened in October 2023, prohibited the sale of Nvidia's flagship A100 and H100 GPUs to China. Nvidia responded by creating export-compliant variants, the A800 and H800, that matched the flagship chips in raw compute but were bandwidth-limited to fall under the export thresholds. The H800 used by DeepSeek for R1's RL stage was such an export-compliant variant.[11][16][18]
The implication that a small lab could close most of the reasoning gap to OpenAI on bandwidth-restricted hardware became the dominant policy narrative around R1. It both reinvigorated calls for tighter export controls (since the H800 had clearly not been restrictive enough) and provided ammunition to skeptics who argued that the controls had not slowed Chinese AI progress at all. The Trump administration's America's AI Action Plan, released July 23, 2025, repeatedly cited R1 as the policy event justifying expanded chip export restrictions and accelerated federal permitting for AI data centers.[16][18]
R1's reception spanned scientific, commercial, and political dimensions in ways unusual for a single model release.
Within days, multiple Western labs publicly accelerated their reasoning-model roadmaps. OpenAI shipped o3-mini on January 31, 2025; Anthropic added an extended thinking mode to Claude 3.7 Sonnet in February 2025; Google released Gemini 2.5 Pro with thinking-by-default in March 2025; Alibaba released Qwen3 with native hybrid reasoning in April 2025. The competitive dynamic R1 created pushed the entire field forward at a faster pace than was widely anticipated.[4][17][19]
Sam Altman publicly acknowledged the result, posting in late January 2025 that R1 was "an impressive model, particularly around what they're able to deliver for the price" while adding that "we will obviously deliver much better models." Yann LeCun cited R1 repeatedly as evidence that "open-source models are surpassing proprietary ones." Stanford HAI faculty described DeepSeek's open releases as "a significant step in democratizing AI," enabling smaller laboratories and individual developers to build on frontier-capable models without massive compute budgets.[17][20]
Within China, R1 was integrated within weeks into Tencent's Yuanbao consumer app, Alibaba Cloud's Bailian platform, and Baidu's deployment stack. Chinese smartphone vendors including Xiaomi, Honor, and Oppo added R1-Distill checkpoints (typically the 7B or 14B variants) to on-device AI assistants through 2025. By the second half of 2025, "DeepSeek-compatible" had become a recognizable procurement category in Chinese government IT bids.[19][21]
The MIT-licensed weights and the published recipe combined to produce one of the largest single-event impacts on the open-source AI ecosystem since the original Llama leak in 2023.
Within a month of launch, over 700 community-built models derived from R1 appeared on Hugging Face, collectively downloaded more than 5 million times. DeepSeek-R1 became the most-liked model on Hugging Face among more than 1.5 million models on the platform, surpassing 10,000 likes within weeks. The variant tree of R1-Distill checkpoints, fine-tuned for medical reasoning, legal analysis, embodied agents, scientific discovery, and dozens of other vertical applications, exceeded 10 million cumulative downloads by mid-2025.[15][17]
DeepSeek-R1-Distill-Qwen-32B became the default open-source reasoning baseline of 2025. It fit on a single 24 GB consumer GPU at 4-bit quantization, ran at usable speeds on a Mac Studio M2 Ultra via MLX or llama.cpp, and offered o1-mini-comparable accuracy on math and code with no API costs and no data egress. The model became one of the most-fine-tuned bases on Hugging Face throughout 2025 and was the teacher of choice for dozens of small reasoning models trained by university labs and independent researchers.[1][15]
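The 24 GB claim can be sanity-checked with back-of-the-envelope arithmetic. A rough sketch, assuming a nominal 32 billion parameters and counting weights only (KV cache, activations, and quantization metadata are ignored); the helper name is hypothetical.

```python
def weight_memory_gb(n_params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

fp16_gb = weight_memory_gb(32, 16)  # 16-bit weights
int4_gb = weight_memory_gb(32, 4)   # 4-bit quantized weights

print(fp16_gb)  # 64.0
print(int4_gb)  # 16.0
```

At 4 bits per weight the parameters occupy about 16 GB, which leaves several gigabytes of a 24 GB card for the KV cache that long reasoning chains require; at 16 bits the same weights would not fit.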
Several formal replication projects attempted to reproduce R1's training trajectory using only public data and open base models: Hugging Face's Open-R1, Berkeley NovaSky's Sky-T1, the Together AI / Stanford TinyZero work, the SimpleRL-Reason project, and Allen AI's Tülu 3 follow-up. None matched the original R1 on absolute benchmarks, but several reproduced the qualitative emergence of reflective reasoning behaviors. Microsoft's Phi-4-Reasoning, NVIDIA's OpenReasoning-Nemotron, and dozens of community models used variants of R1's rejection-sampling-then-SFT recipe to bootstrap reasoning capabilities into smaller bases.[10][22]
DeepSeek-R1-0528 was released on May 28, 2025. DeepSeek labeled it a "minor upgrade," but it delivered substantial improvements across all major benchmarks. The model was a refresh rather than a new architecture, applying additional post-training to the same V3-based backbone.[23][24]
| Benchmark | R1 (Jan 2025) | R1-0528 (May 2025) |
|---|---|---|
| AIME 2024 | 79.8% | 91.4% |
| AIME 2025 | 70.0% | 87.5% |
| HMMT 2025 | 41.7% | 79.4% |
| LiveCodeBench (2408-2505) | 63.5% | 73.3% |
| Codeforces-Div1 rating | ~1,530 | ~1,930 |
| SWE-bench Verified | 49.2% | 57.6% |
| Aider-Polyglot | 53.3% | 71.6% |
| GPQA Diamond | 71.5% | 81.0% |
| Humanity's Last Exam | 8.5% | 17.7% |
R1-0528 also added function calling, JSON output, and system-prompt support that the original R1 had lacked. The model averaged roughly 23,000 thinking tokens per query on AIME 2025, up from about 12,000 for the original R1, with the deeper reasoning correlating with the accuracy gains. DeepSeek reported a 45-50% reduction in hallucination rates on rewriting and summarization tasks. A companion distill, DeepSeek-R1-0528-Qwen3-8B, achieved 86.0% on AIME 2024, surpassing the base Qwen3-8B by 10 percentage points and matching the much larger Qwen3-235B-Thinking on the same benchmark.[23][24]
DeepSeek-R2 was widely rumored throughout 2025 but never released as a model under that brand name. Reuters reported in March 2025 that DeepSeek was racing to ship a successor to R1, and Chinese-language tech outlets carried multiple "R2 imminent" rumors through the spring and summer, citing anonymous sources and partial leaks. None matured into an actual release. Instead, the May 2025 refresh was branded R1-0528, and the August 2025 successor was branded V3.1 rather than R2, folding R1's reasoning capability into a hybrid model that could toggle thinking mode on or off within a single set of weights.[25][26]
As of May 2026, DeepSeek's reasoning capability lives inside the V3.x and V4 hybrid line. The DeepSeek-V4 Preview released April 24, 2026 ships V4-Pro (1.6T total / 49B active) and V4-Flash (284B / 13B) with native thinking-mode toggles. Whether the next reasoning-focused release will be branded R2, V5, or absorbed entirely into the hybrid family has not been announced.[25][27]
| Release | Date | Notes |
|---|---|---|
| DeepSeek-R1 | Jan 20, 2025 | Initial release with R1-Zero and six distilled variants |
| DeepSeek-R1-0528 | May 28, 2025 | Major update; deeper thinking; function calling; JSON; Qwen3-8B distill |
| DeepSeek-V3.1 | Aug 19, 2025 | First hybrid model: chat and reasoning in one set of weights with thinking-mode toggle |
| DeepSeek-V3.2-Exp | Sep 29, 2025 | Experimental release introducing DeepSeek Sparse Attention |
| DeepSeek-OCR | Oct 20, 2025 | Vision-language OCR model |
| DeepSeek-V3.2 | Dec 1, 2025 | Production hybrid; thinking integrated into tool use |
| DeepSeek-V4 Preview | Apr 24, 2026 | V4-Pro and V4-Flash; 1M context; native hybrid reasoning |
V3.1 effectively absorbed R1's role: a single set of weights served as both a fast chat model and (with a thinking-mode toggle) as a reasoning model, reaching roughly 90-95% of R1-0528's performance on reasoning benchmarks while sharing weights with a normal chat model. By V4's April 2026 launch, R1 was no longer DeepSeek's recommended model for new applications, though it remained widely cited and deployed because of its open-source release and well-understood behavior.[26][27]
On September 18, 2025, the DeepSeek-R1 paper appeared on the cover of Nature (volume 645, issue 8081, pages 633-638), becoming the first major open-weight large language model to be the subject of a peer-reviewed Nature paper. The corresponding author was Liang Wenfeng, with 199 co-authors from DeepSeek-AI.[5][11]
The peer-reviewed version added several disclosures absent from the January arXiv preprint: the $294,000 RL-stage training cost on 512 H800 GPUs over roughly 80 hours; an acknowledgment that DeepSeek owned A100 GPUs used for preliminary experiments; expanded ablation studies including a direct response to the "There May Not be Aha Moment" critique; a more detailed quantitative AIME accuracy trajectory; and a response to OpenAI's distillation accusations stating that R1's training data was scraped from the open web (which inevitably included LLM-generated text) but that it had not specifically distilled from OpenAI APIs for the reasoning capability itself. Nature published the reviewer comments and DeepSeek's responses alongside the article, an unusual choice for an AI paper that was widely welcomed in the research community.[5][11]
As a model from a Chinese laboratory, DeepSeek-R1 attracted regulatory scrutiny across multiple Western countries. The hosted DeepSeek API and the consumer chat app stored data on Chinese servers subject to Chinese data laws and reportedly applied server-side filters around politically sensitive topics including Tiananmen Square, the status of Taiwan, and the treatment of Uyghurs in Xinjiang. Behavior on self-hosted instances of the open weights was more nuanced; many refusals were implemented at the server-filter level rather than baked into the weights themselves, though a May 2025 academic paper titled "R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model" found that certain refusal patterns (particularly around Tiananmen) remained in the weights.[28]
On February 6, 2025, U.S. Representatives Josh Gottheimer and Darin LaHood introduced the bipartisan "No DeepSeek on Government Devices Act," which passed in August 2025, banning federal employees from using the app on government-issued devices. Texas, Virginia, and New York banned DeepSeek on government systems in February 2025; the U.S. Navy, NASA, and the Pentagon issued internal restrictions; Italy's Garante ordered the app removed from Apple and Google stores on January 30, 2025; Taiwan, South Korea, Australia, and India followed with varying restrictions. The Fiscal Year 2026 National Defense Authorization Act, signed in December 2025, included provisions restricting DeepSeek usage within the Department of Defense and Intelligence Community.[17][28][29]
OpenAI accused DeepSeek of improperly distilling from OpenAI models within days of R1's release, claiming "some evidence" that DeepSeek had used outputs from OpenAI APIs to train R1 in violation of OpenAI's terms of service. A February 2026 Bloomberg report quoted an OpenAI memo to U.S. lawmakers alleging DeepSeek had developed methods to circumvent access restrictions through obfuscated third-party routers. Anthropic escalated the issue in February 2026 with a public blog post alleging that DeepSeek, Moonshot AI, and MiniMax had together used roughly 24,000 fake accounts to generate more than 16 million exchanges with Claude. DeepSeek did not publicly admit to using distillation in training R1's reasoning capability; the Nature paper acknowledged that web-scraped training data would inevitably contain text generated by other LLMs but denied targeted distillation of OpenAI's reasoning traces.[11][30][31]
As of May 2026, DeepSeek-R1 and its derivatives remain among the most widely studied open-source reasoning models even though DeepSeek's own product line has moved on to the V3.x and V4 hybrid families. R1-0528 continues to be available through the DeepSeek API at the original prices and through every major third-party inference provider. The 32B and 70B distilled models remain popular as locally hostable reasoning baselines; the smaller distills (1.5B, 7B, 8B) are widely used as base models for further fine-tuning rather than as deployment endpoints. Legacy aliases deepseek-reasoner and deepseek-chat are scheduled for deprecation on July 24, 2026.[14][27]
The model's legacy is best measured by its influence on the field. R1 proved that reasoning-capable language models could be built openly and cheaply, that reinforcement learning could induce genuine reasoning behaviors without supervised examples, and that a small team with limited resources could compete with the largest AI labs in the world. The recipe it published, GRPO with rule-based rewards on verifiable tasks, became the dominant approach for training reasoning models across both open-source and commercial labs. Most reasoning models released through 2025 and 2026 (Qwen QwQ, Microsoft Phi-Reasoning, Mistral Magistral, OpenAI gpt-oss, Nvidia OpenReasoning-Nemotron) used some variant of the R1 recipe.[1][10][22]
R1 also reset expectations for what a model release should look like. The combination of a permissive MIT license, a detailed published recipe, peer-reviewed publication, six pre-distilled variants, and aggressive API pricing became a de facto template against which other open-source releases were measured. When subsequent releases were perceived as stinting on documentation or imposing restrictive licenses, the comparison was usually to R1.
The financial and policy aftershocks lasted longer than the model itself. The "DeepSeek shock" of January 27, 2025 is now treated as the canonical market event of the AI boom, alongside ChatGPT's November 2022 launch. It catalyzed the United States' America's AI Action Plan, accelerated U.S. chip export controls, prompted the OpenAI and Anthropic public claims of cross-lab distillation, and put open-weight reasoning permanently inside the policy conversation. Even after the cost numbers were reframed, the directional finding, that frontier reasoning capability had become cheap enough for a focused team to reach, has held up.[16][32]