DeepSeek-R1

Updated Jun 20, 2026

DeepSeek-R1
Developer	DeepSeek
Release date	January 20, 2025
Type	Large language model (reasoning model)
Architecture	Mixture of Experts (MoE), Transformer with MLA
Base model	DeepSeek-V3-Base (671B / 37B MoE)
Parameters	671 billion total; 37 billion active per token
Context length	128,000 tokens
Training algorithm	Group Relative Policy Optimization (GRPO)
Reward signal	Rule-based (verifiable math, code) plus format reward
Reported RL compute	512 Nvidia H800 GPUs for ~80 hours
Reported RL rental cost	~$294,000 (R1 RL stage only, disclosed in Nature Sept 2025)
License	MIT (weights, distills, and derived outputs)
Companion model	DeepSeek-R1-Zero (RL-only from V3-Base, no SFT)
Distilled variants	6 dense models: Qwen 1.5B / 7B / 14B / 32B, Llama 8B / 70B
Updated version	DeepSeek-R1-0528 (May 28, 2025)
Paper	"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning", arXiv:2501.12948; published Nature 645, 633-638 (Sept 18, 2025)

DeepSeek-R1 is an open-weight, reasoning-focused large language model released on January 20, 2025 by DeepSeek, a Chinese artificial intelligence laboratory spun out of the High-Flyer quantitative hedge fund. It was the first major open-weight model to match the reasoning performance of OpenAI's proprietary o1, scoring 79.8% on AIME 2024 versus o1's 79.2% and 97.3% on MATH-500 versus 96.4%, and it was released under the permissive MIT license.^[1]^[2] DeepSeek built the model on the DeepSeek-V3-Base Mixture of Experts architecture, with 671 billion total parameters of which 37 billion are activated per forward pass, inheriting V3's 128,000-token context window. The peer-reviewed Nature version of the paper later disclosed that the reinforcement-learning stage that produced R1 cost approximately $294,000, using 512 Nvidia H800 GPUs for about 80 hours.^[1]^[2]^[5] The paper states plainly: "Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories."^[1]

DeepSeek shipped three related artifacts on the same day. DeepSeek-R1-Zero was trained by applying reinforcement learning directly to the V3 base model with no supervised fine-tuning, and demonstrated that chain-of-thought reasoning, self-reflection, and error correction could emerge spontaneously from rule-based rewards alone. As the paper put it, "it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT."^[1] DeepSeek-R1 itself used a multi-stage pipeline of cold-start supervised fine-tuning followed by reinforcement learning from verifiable rewards, producing a production-ready reasoning model. DeepSeek-R1-Distill consisted of six smaller dense models, distilled from R1's reasoning traces onto Qwen 2.5 and Llama 3 base checkpoints rather than DeepSeek's own backbone.^[1]^[2]

R1's release triggered a market event widely called the "DeepSeek shock." On January 27, 2025, Nvidia's stock fell roughly 17% (closing down 16.86%), losing approximately $589 billion in market capitalization in a single session, the largest single-day market value loss for any company in U.S. stock market history.^[3]^[16] The DeepSeek mobile app also briefly displaced ChatGPT atop the U.S. Apple App Store. The shock stemmed from the revelation that a small Chinese laboratory had produced a frontier reasoning model using a reported RL-stage rental cost of about $294,000 on export-restricted H800 GPUs, undermining the assumption that frontier AI required tens of billions in capital expenditure. The accompanying technical paper became the first major open-weight LLM paper to pass independent peer review, appearing on the cover of Nature on September 18, 2025.^[3]^[4]^[5]

What is DeepSeek-R1?

DeepSeek-R1 is a reasoning model: a large language model trained to generate a long internal chain-of-thought before producing a final answer, trading extra inference-time computation for higher accuracy on math, code, and science problems. Unlike OpenAI's contemporaneous o1, which shipped as a closed API with an undisclosed training recipe, R1 was released with open weights, a permissive MIT license, and a fully published methodology, making it the first reasoning model that the wider research community could download, inspect, fine-tune, and replicate. Its headline claim was that reasoning behavior could be induced by reinforcement learning against automatically verifiable rewards (correct math answers, passing code tests) rather than by expensive human demonstration, and that this could be done at a small fraction of the apparent cost of Western frontier labs.^[1]^[2]

Background

From hedge fund to frontier lab

DeepSeek's path to R1 began inside a hedge fund. High-Flyer, a Chinese quantitative trading firm co-founded in 2016 by Liang Wenfeng and his Zhejiang University classmates, accumulated tens of thousands of Nvidia GPUs through the late 2010s for stock-prediction workloads. By 2020 it operated one of the largest private AI training clusters in China. Liang spun the AI research arm into an independent company, DeepSeek, in July 2023, seeded with engineers experienced in squeezing performance out of large GPU pools. DeepSeek was bootstrapped from hedge fund profits and took no outside investment before R1's release.^[6]^[7]

DeepSeek built toward R1 throughout 2024. The company released DeepSeek-V2 in May 2024 and DeepSeek-V3 on December 26, 2024, both MoE models prioritizing computational efficiency. V3 served as the base model for R1, providing a strong general foundation. V3 itself was trained on roughly 14.8 trillion tokens at a reported GPU-rental cost of about $5.576 million, a figure that would later become entangled in cost debates around R1.^[2]^[7]

The reasoning paradigm

The immediate scientific context was the emergence of inference-time reasoning. OpenAI's o1, previewed in September 2024 and released in December 2024, demonstrated that training models with reinforcement learning to "think before answering" could dramatically improve performance on difficult math, science, and coding tasks. OpenAI published no technical details on the recipe. DeepSeek's contribution was to show that the approach could be replicated with open weights at a small fraction of the apparent cost, and to publish the full training methodology.^[1]

The three models

DeepSeek-R1 is best understood as three distinct but related releases that together formed the announcement of January 20, 2025.^[1]^[2]

DeepSeek-R1-Zero (RL-only from V3-Base)

DeepSeek-R1-Zero was trained by applying large-scale reinforcement learning directly to the DeepSeek-V3-Base model, with no supervised fine-tuning and no curated reasoning examples. The model was simply given problems and rewarded for producing correct answers under a minimal template that required reasoning inside <think>...</think> tags and the final answer inside <answer>...</answer> tags.^[1]

Despite never seeing a reasoning demonstration, R1-Zero spontaneously developed several reasoning behaviors during training: multi-step chain-of-thought decomposition, self-reflection ("wait, let me reconsider"), error detection and correction, alternative-strategy exploration, and adaptive allocation of thinking time on harder problems. The paper reported a striking trajectory on AIME 2024, where R1-Zero's pass@1 accuracy rose from 15.6% at the start of RL training to 71.0% by the end, reaching 86.7% under majority voting over 64 samples, matching OpenAI o1-0912's performance using only RL on a base model.^[1]^[5]
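The difference between the pass@1 figure and the majority-voting figure comes down to how sampled answers are aggregated. The following is a minimal sketch, assuming each sample has already been reduced to a final-answer string; the function names and the toy data are illustrative and do not come from the paper.

```python
from collections import Counter

def pass_at_1(samples, truth):
    """Fraction of individual samples that are correct (an estimate of pass@1)."""
    return sum(s == truth for s in samples) / len(samples)

def majority_vote(samples):
    """Most frequent final answer among the samples."""
    return Counter(samples).most_common(1)[0][0]

# Toy problem (hypothetical data): 8 sampled answers, 4 of them correct
samples = ["204", "204", "17", "204", "96", "204", "17", "55"]
truth = "204"

p1 = pass_at_1(samples, truth)    # half of the individual samples are right
voted = majority_vote(samples)    # the vote still lands on the right answer
```

In this toy case only half of the individual samples are correct, yet the voted answer is right, which is why a majority-vote score can sit well above pass@1 on the same model.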

DeepSeek highlighted what it called an "aha moment" during training, when the model began interrupting itself with phrases like "Wait, wait. Wait. That's an aha moment I can flag here" before backtracking. The paper interpreted this as evidence of a self-evolution process induced by optimization pressure. Subsequent work, including a Sea AI Lab study titled "There May Not be Aha Moment in R1-Zero-like Training," argued that some of these behaviors may have been inherited from reflective patterns already present in the base model's pre-training data, and the Nature version of the R1 paper engaged with these critiques directly.^[5]^[8]

R1-Zero's outputs suffered from poor readability, language mixing between Chinese and English, and inconsistent answer formatting. These limitations motivated the multi-stage training pipeline used for R1 proper. R1-Zero was released alongside R1 under the same MIT license so that researchers could study the unfiltered behavior of an RL-only reasoning model.^[1]

DeepSeek-R1 (SFT + RL with cold start)

DeepSeek-R1 itself was produced by a four-stage pipeline built on DeepSeek-V3-Base.^[1]^[5]

Stage 1: Cold start. A small set of curated long chain-of-thought examples (a few thousand samples) was used to supervised-fine-tune V3-Base, addressing the readability and language-mixing failures observed in R1-Zero.

Stage 2: Reasoning-oriented RL. Large-scale GRPO training was applied on verifiable tasks (mathematics, coding, logic) using rule-based rewards. A language-consistency reward was added to suppress mid-response language switching.

Stage 3: Rejection sampling and SFT. The RL-trained model generated a large pool of reasoning traces. High-quality traces were selected by rejection sampling and combined with non-reasoning data to produce roughly 800,000 samples (about 600,000 reasoning and 200,000 general). V3-Base was then fine-tuned on this dataset for two epochs.

Stage 4: All-scenario RL. A final RL stage covered both reasoning and general tasks, combining rule-based rewards with model-based rewards for helpfulness and harmlessness.

The same 800,000-sample dataset created in Stage 3 was reused to fine-tune all six distilled variants.^[1]
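Stage 3's rejection sampling can be pictured as a filter over generated traces. The sketch below is an illustration under stated assumptions, not DeepSeek's pipeline: it treats "high-quality" as "final answer matches the reference," and the per-prompt cap, the record fields, and the function name are all hypothetical.

```python
def rejection_sample(candidates, references, max_per_prompt=2):
    """Keep traces whose final answer matches the reference answer.

    Assumption: correctness is the only acceptance test, and at most
    max_per_prompt accepted traces are kept for each prompt.
    """
    kept = []
    accepted = {}  # prompt_id -> number of traces already kept
    for cand in candidates:
        pid = cand["prompt_id"]
        is_correct = cand["answer"] == references[pid]
        if is_correct and accepted.get(pid, 0) < max_per_prompt:
            kept.append(cand)
            accepted[pid] = accepted.get(pid, 0) + 1
    return kept

# Hypothetical pool of generated traces
candidates = [
    {"prompt_id": "p1", "trace": "...", "answer": "4"},
    {"prompt_id": "p1", "trace": "...", "answer": "5"},
    {"prompt_id": "p1", "trace": "...", "answer": "4"},
    {"prompt_id": "p1", "trace": "...", "answer": "4"},
    {"prompt_id": "p2", "trace": "...", "answer": "x"},
]
references = {"p1": "4", "p2": "y"}

kept = rejection_sample(candidates, references)
```

A prompt for which no sampled trace is correct (p2 here) contributes nothing to the SFT set, which is one reason a large generation pool is needed.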

DeepSeek-R1-Distill (smaller dense models)

DeepSeek-R1-Distill is a family of six dense (non-MoE) models created by supervised fine-tuning open-source base checkpoints on R1's 800K reasoning trace dataset. No additional reinforcement learning was applied in the initial release. The distillation targets used base models from outside DeepSeek's own line: four from Alibaba's Qwen 2.5 family and two from Meta's Llama 3 family.^[1]^[2]

Distilled model	Base model	Parameters	License (weights)
DeepSeek-R1-Distill-Qwen-1.5B	Qwen2.5-Math-1.5B	1.5B	Apache 2.0 / MIT
DeepSeek-R1-Distill-Qwen-7B	Qwen2.5-Math-7B	7B	Apache 2.0 / MIT
DeepSeek-R1-Distill-Llama-8B	Llama-3.1-8B	8B	Llama 3 / MIT
DeepSeek-R1-Distill-Qwen-14B	Qwen2.5-14B	14B	Apache 2.0 / MIT
DeepSeek-R1-Distill-Qwen-32B	Qwen2.5-32B	32B	Apache 2.0 / MIT
DeepSeek-R1-Distill-Llama-70B	Llama-3.3-70B-Instruct	70B	Llama 3 / MIT

The smallest distill (1.5B) outperformed GPT-4o on mathematical benchmarks while being small enough to run on consumer hardware. The 32B Qwen distill became the most widely deployed open-source reasoning model of 2025, fitting on a single 24 GB consumer GPU at 4-bit quantization and reaching 72.6% on AIME 2024, beating OpenAI o1-mini (63.6%). DeepSeek included an ablation in the paper showing that distilling from R1 outperformed running GRPO directly on the same small base, a finding that shaped subsequent open-source training recipes.^[1]^[2]

These three models are commonly confused in coverage. R1-Zero is the scientific demonstration (no SFT, pure RL). R1 is the production reasoning model (SFT plus RL with cold start). R1-Distill is the small-model family, built on Qwen and Llama bases rather than on DeepSeek's own backbone.^[1]

Architecture

DeepSeek-R1 inherits its architecture from DeepSeek-V3-Base unchanged.^[2]^[7]

671 billion total parameters across all experts
37 billion active parameters per forward pass (only a subset of experts activates per token)
256 routed experts per layer plus one shared expert that is always active
Multi-head Latent Attention (MLA), which compresses the key-value cache to roughly 5-13% of standard attention, enabling efficient long-context inference
128,000-token context window
Auxiliary-loss-free load balancing across experts
FP8 mixed-precision training kernels

The MoE design is central to R1's serving economics. Activating only 37 billion of its 671 billion parameters keeps per-token inference costs comparable to a much smaller dense model while retaining the knowledge capacity of the full parameter count. MLA further reduces the memory pressure that would otherwise make long reasoning chains expensive to serve, which matters for a model that routinely emits thousands of intermediate tokens before answering. None of these architectural choices were novel to R1; the contribution was post-training, not architecture.^[2]^[7]
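To make the "only a subset of experts activates per token" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. It is an illustration only: the layer size, the value of k, and the affinity scores are toy assumptions, and the always-active shared expert is left out.

```python
def route(scores, k):
    """Select the k highest-scoring routed experts and renormalize their gates.

    scores: one non-negative affinity score per routed expert (toy values here).
    Returns {expert_index: gate_weight}, with gate weights summing to 1.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}

# Toy layer: 8 routed experts, 2 chosen per token (illustrative sizes only)
affinity = [0.10, 0.70, 0.20, 0.90, 0.05, 0.30, 0.15, 0.40]
gates = route(affinity, k=2)   # experts 3 and 1 are selected

# Share of parameters touched per token, using the counts quoted above
active_fraction = 37 / 671     # roughly 5.5% of the total
```

Only the selected experts run for a given token, so compute scales with the active parameter count rather than the total.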

The distilled models are dense (non-MoE) transformers using the architectures of their respective Qwen 2.5 and Llama 3 base checkpoints. Because their weights are fine-tuned from those checkpoints, their licenses inherit from the bases rather than being purely MIT.

Training methodology

Group Relative Policy Optimization

GRPO is the reinforcement learning algorithm used to train both R1-Zero and R1. GRPO was originally introduced by DeepSeek in the February 2024 DeepSeekMath paper (arXiv:2402.03300) and refined for use in R1. It differs from Proximal Policy Optimization (PPO), the algorithm used in classical RLHF pipelines, in one critical way: GRPO eliminates the separate value (critic) model.^[9]^[10]

In PPO-based RLHF, two models must be maintained during training: the policy being optimized and a value model that estimates expected returns. The value model can be as large as the policy itself, effectively doubling memory requirements. GRPO replaces the value model with a baseline computed from group statistics.^[9]^[10]

The algorithm works as follows:^[1]^[9]^[10]

For each prompt, the current policy samples a group of G responses (typically G = 16 to 64).
Each response is scored by a reward function. For math: whether the final boxed answer matches ground truth. For code: whether the program passes a hidden test suite.
The mean and standard deviation of rewards within the group are computed.
Each response's advantage is its normalized reward: advantage_i = (reward_i - mean) / std.
The policy is updated to increase the probability of high-advantage responses and decrease the probability of low-advantage ones, subject to a KL divergence constraint against a frozen reference model.
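The group-relative advantage in the steps above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's training code: the use of the population standard deviation and the small eps term (which avoids division by zero when every response in a group gets the same reward) are assumptions.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group: (r - mean) / std.

    eps is an assumed stabilizer for groups where all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, a group of 4 sampled responses, binary correctness rewards
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response is correct carries no learning signal
flat = group_advantages([1.0, 1.0, 1.0])
```

Because the baseline is the group mean, no separate value model is needed; a group in which every response scores the same produces zero advantage for all of them.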

The group-relative approach normalizes rewards within each problem, reduces the impact of reward scale differences across problem types, cuts training memory roughly in half compared to PPO, and is simpler to implement and tune. Since R1's release, GRPO has become the de facto standard for training open-source reasoning models, displacing PPO and DPO as the preferred recipe. Hugging Face's TRL library and several other open-source RL libraries shipped native GRPO support within weeks of R1's release.^[9]^[10]

Rule-based rewards

DeepSeek used a deliberately simple reward design: an accuracy reward (right or wrong on a verifiable answer) plus a format reward (the model is required to enclose its reasoning in <think>...</think> tags). No human-preference reward model was used during the reasoning-oriented stages, sidestepping both the cost of preference annotation and the failure mode of reward hacking against a learned reward signal. This pattern, RL on verifiable rewards rather than learned reward models, is now commonly called reinforcement learning from verifiable rewards (RLVR), and its widespread adoption is largely a legacy of R1.^[1]^[5]
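A rule-based reward of this kind needs no learned model, only string checks. The sketch below is a hedged illustration for a math task: the regular expressions, the equal weighting of the two rewards, and the exact-match comparison on a \boxed{...} answer are assumptions, not the paper's implementation.

```python
import re

# Reasoning must come first inside <think> tags; the final answer follows.
THINK_RE = re.compile(r"^<think>(.+?)</think>\s*(.*)$", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def format_reward(completion):
    """1.0 if the completion opens with a non-empty <think>...</think> block."""
    return 1.0 if THINK_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion, truth):
    """1.0 if the last boxed answer after the reasoning equals the ground truth."""
    m = THINK_RE.match(completion.strip())
    final = m.group(2) if m else completion
    boxed = BOXED_RE.findall(final)
    return 1.0 if boxed and boxed[-1].strip() == truth.strip() else 0.0

def total_reward(completion, truth):
    """Assumed combination: unweighted sum of accuracy and format rewards."""
    return accuracy_reward(completion, truth) + format_reward(completion)

good = r"<think>2 + 2 is 4</think> The answer is \boxed{4}"
no_tags = r"The answer is \boxed{4}"
wrong = r"<think>maybe 5?</think> \boxed{5}"
```

A real checker would also normalize equivalent answers (for example 0.5 versus 1/2); exact string matching is used here only to keep the sketch short.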

Cold-start data and the RL stages

For R1 (as distinct from R1-Zero), the first SFT stage was deliberately small: a few thousand long-CoT samples written or curated to demonstrate clean reasoning structure and consistent language usage. This cold start gave the RL stage a more readable starting point than V3-Base would have provided. Subsequent stages stacked: reasoning-oriented RL, then rejection-sampling SFT on the resulting traces, then a final all-scenario RL pass that broadened behavior to non-reasoning tasks.^[1]^[5]

Distillation to smaller bases

The same 800K-sample dataset assembled in R1's Stage 3 was used directly to fine-tune the six distilled variants. The choice to use Qwen and Llama bases rather than DeepSeek's own architectures meant the distills could ride on widely-deployed open-weight ecosystems with mature tooling (vLLM, SGLang, llama.cpp, MLX). DeepSeek noted in the paper that running RL directly on these small bases produced worse results than distillation, a finding subsequent open-source projects have largely replicated.^[1]

How much did DeepSeek-R1 cost to train?

In the Nature publication of September 2025, DeepSeek disclosed that the reinforcement learning portion of R1's training used 512 Nvidia H800 GPUs for approximately 80 hours, and put the estimated rental cost of the RL training at about $294,000 assuming $2 per GPU-hour (a total that, at that rate, corresponds to about 147,000 GPU-hours, more than the roughly 41,000 GPU-hours of the 80-hour run alone). The supplementary materials acknowledged for the first time that DeepSeek also owned A100 GPUs and used them for preparatory experiments at smaller scale.^[5]^[11]

The $294,000 figure refers only to the RL stage that converted V3-Base into R1. It excludes the cost of training V3 itself (about $5.576 million in rented compute), the cost of generating cold-start data, the cost of distillation, salaries, and depreciation of the underlying GPU cluster. The Register and CNN Business both noted that the end-to-end cost of producing R1 was roughly an order of magnitude larger than the headline figure, though still dramatically below comparable Western reasoning-model budgets. SemiAnalysis's January 2025 reconstruction estimated DeepSeek's underlying cluster (around 50,000 Hopper-class GPUs accumulated by High-Flyer) at roughly $1.6 billion in retail value, with annual operational expenditure closer to $1.3 billion. The narrower per-run cost figures held up; the framing of a "$6 million startup" did not.^[11]^[12]^[13]
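The quoted figures can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in this section; the variable and function names are illustrative. It shows that 512 GPUs for 80 hours at $2 per GPU-hour accounts for only part of the $294,000 total, and how the headline figure compares with the rented compute once V3's training run is included. What the remainder of the $294,000 covers is not stated in this section.

```python
RATE = 2.0  # USD per H800 GPU-hour, the rental rate assumed in the disclosure

def rental_cost(gpus, hours, rate=RATE):
    """Rental cost in USD for a run of `gpus` GPUs over `hours` hours."""
    return gpus * hours * rate

run_gpu_hours = 512 * 80                 # GPU-hours in the quoted 80-hour run
run_cost = rental_cost(512, 80)          # cost of that run alone at $2/GPU-hour
implied_gpu_hours = 294_000 / RATE       # GPU-hours implied by the $294,000 total

v3_plus_rl = 5_576_000 + 294_000         # V3 training plus the RL figure, in USD
ratio = v3_plus_rl / 294_000             # how far the headline understates it
```

Counting rented compute for V3 and the RL stage together gives a figure roughly twenty times the headline number, before data generation, distillation, salaries, or hardware depreciation are added.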

How does DeepSeek-R1 compare to OpenAI o1?

DeepSeek-R1 reported performance competitive with OpenAI o1 across the major reasoning benchmarks of January 2025.^[1]^[2]

Benchmark	DeepSeek-R1	OpenAI o1 (Dec 2024)	GPT-4o
AIME 2024 (pass@1)	79.8%	79.2%	13.4%
MATH-500	97.3%	96.4%	60.3%
GPQA Diamond	71.5%	75.7%	53.6%
Codeforces (rating / percentile)	2,029 / 96.3	2,061 / 96.6	n/a
MMLU	90.8%	91.8%	87.2%
MMLU-Pro	84.0%	81.9%	73.3%
LiveCodeBench (CoT)	65.9%	63.4%	33.4%
SWE-bench Verified	49.2%	48.9%	33.2%
AlpacaEval 2.0 (LC)	87.6%	n/a	51.1%
ArenaHard	92.3%	n/a	80.4%

R1 matched or exceeded o1 on AIME 2024, MATH-500, MMLU-Pro, LiveCodeBench, and SWE-bench Verified, while trailing slightly on graduate-level science (GPQA Diamond), MMLU, and Codeforces rating. On AIME 2024, R1's 79.8% pass@1 edged out OpenAI o1-1217's 79.2%, and on MATH-500 R1's 97.3% beat o1's 96.4%.^[1] The combination of those benchmark numbers with an open-weight, MIT-licensed release was the central technical claim that drove both scientific interest and the market reaction.^[1]^[2]

The distilled models posted their own state-of-the-art scores for dense open-source models. The 32B Qwen distill reached 72.6% on AIME 2024 and 94.3% on MATH-500, beating OpenAI's o1-mini (63.6% AIME) by nine points. The 70B Llama distill reached 70.0% AIME and 94.5% MATH-500. The 1.5B Qwen-Math distill, despite being small enough to run on a laptop, outperformed GPT-4o and Claude 3.5 Sonnet on math benchmarks.^[1]^[2]

Is DeepSeek-R1 open source?

DeepSeek released R1, R1-Zero, and all six distilled models under the MIT License with one important addition: the license explicitly permits using API outputs to train other models, so distillation is expressly allowed. Most proprietary AI providers either prohibit using their outputs to train competing models or leave the question ambiguous; DeepSeek's terms removed legal friction from the wave of follow-on work.^[1]^[2]

The distilled models inherit the upstream base licenses. Qwen-based distills are governed by Apache 2.0 on the base weights with the fine-tuning delta released under MIT; Llama-based distills are governed by the Llama community license on the base weights with the delta under MIT. In practice, this distinction rarely matters for research use but matters for production deployments that may need to comply with the Llama community license's monthly-active-user thresholds and use-policy restrictions.^[2]

Availability

DeepSeek made all weights publicly available on Hugging Face (deepseek-ai/DeepSeek-R1, deepseek-ai/DeepSeek-R1-Zero, and the six DeepSeek-R1-Distill checkpoints), with model cards, configuration files, and reference inference code. The DeepSeek API offered R1 as deepseek-reasoner at pricing dramatically lower than competing reasoning models.^[1]^[2]^[14]

Model	Input (cache miss)	Input (cache hit)	Output
DeepSeek-R1	$0.55 / 1M	$0.14 / 1M	$2.19 / 1M
OpenAI o1	$15.00 / 1M	$7.50 / 1M	$60.00 / 1M
Anthropic Claude 3.7 Sonnet (thinking)	$3.00 / 1M	$0.30 / 1M	$15.00 / 1M

R1 was roughly 27 times cheaper than o1 on a per-token basis. Within days of release, third-party inference providers including Together AI, Fireworks AI, Groq, OpenRouter, Hyperbolic, Lambda, and SambaNova all offered hosted endpoints for R1 or the distilled models, frequently at competitive prices and sometimes with significantly faster throughput than DeepSeek's own infrastructure. Major cloud providers, including Microsoft Azure, Amazon Web Services, and Nvidia's NIM inference platform, added R1 within weeks.^[14]^[15]
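The "27 times cheaper" figure follows from the price table. A small sketch, using the listed prices and a hypothetical reasoning-heavy query (2,000 input tokens, 10,000 output tokens including the chain of thought); the function name and token counts are illustrative.

```python
# USD per 1M tokens, copied from the pricing table above
PRICES = {
    "DeepSeek-R1": {"input": 0.55, "cached": 0.14, "output": 2.19},
    "OpenAI o1": {"input": 15.00, "cached": 7.50, "output": 60.00},
}

def query_cost(model, input_tokens, output_tokens, cached_tokens=0):
    """Cost in USD of one request; cached_tokens are billed at the cache-hit rate."""
    p = PRICES[model]
    total = (input_tokens * p["input"]
             + cached_tokens * p["cached"]
             + output_tokens * p["output"])
    return total / 1_000_000

r1_cost = query_cost("DeepSeek-R1", 2_000, 10_000)
o1_cost = query_cost("OpenAI o1", 2_000, 10_000)
ratio = o1_cost / r1_cost   # lands between 27 and 28 for this token mix
```

Because both the input and output price ratios are close to 27, the overall ratio stays near that value regardless of the input/output mix when no tokens are served from cache.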

Why did DeepSeek-R1 crash Nvidia's stock?

The market reaction to R1's release became a defining financial event of early 2025. On January 27, 2025, the Monday one week after R1's release and after a weekend in which the model went viral, U.S. technology stocks sold off sharply. The sell-off was triggered by a sudden reassessment of the AI investment thesis. For years, markets had priced semiconductor and cloud companies on the assumption that frontier AI required massive and growing capital expenditure. DeepSeek's demonstration that a roughly 160-person Chinese laboratory could produce a competitive reasoning model undermined that assumption.^[3]^[4]^[16]

Nvidia's stock fell about 17% in a single session, closing down 16.86% at $118.58 and losing approximately $589 billion in market capitalization. This was the largest single-day market value loss for any company in U.S. stock market history, eclipsing the prior record of roughly $279 billion; rounded press coverage frequently quoted it as "nearly $600 billion."^[3]^[16] Other semiconductor companies including Broadcom, Marvell, Micron, and TSMC also fell sharply. The Nasdaq Composite lost roughly $1 trillion in value over the session. Apple briefly retook the title of world's most valuable company as Nvidia's market cap dropped to roughly $2.9 trillion.^[3]^[4]^[16]

The DeepSeek mobile app reached number one on the Apple App Store in the United States on January 27, displacing ChatGPT. The consumer ranking became part of the news cycle around the stock drop, and journalists pointed to it as a tangible signal that something had changed. By the end of January 2025, DeepSeek-R1's open weights had been downloaded more than 5 million times across Hugging Face mirrors.^[3]^[4]

Marc Andreessen described the event as "AI's Sputnik moment," a comparison to the 1957 Soviet satellite launch that became the canonical framing in subsequent coverage. President Donald Trump, speaking at a Republican retreat the same week, called R1 a "wake-up call for our industries that we need to be laser-focused on competing to win." Chinese AI providers entered an aggressive price war in the weeks that followed, with some cutting API prices by up to 97%.^[4]^[17]

US chip export sanctions and the H800

A central element of R1's narrative was that it had been trained under U.S. export controls. The Biden administration's October 2022 chip export rules, tightened in October 2023, prohibited the sale of Nvidia's flagship A100 and H100 GPUs to China. Nvidia responded by creating export-compliant variants, the A800 and H800, that matched the flagship chips in raw compute but were bandwidth-limited to fall under the export thresholds. The H800 used by DeepSeek for R1's RL stage was such an export-compliant variant.^[11]^[16]^[18]

The implication that a small lab could close most of the reasoning gap to OpenAI on bandwidth-restricted hardware became the dominant policy narrative around R1. It both reinvigorated calls for tighter export controls (since the H800 had clearly not been restrictive enough) and provided ammunition to skeptics who argued that the controls had not slowed Chinese AI progress at all. The Trump administration's America's AI Action Plan, released July 23, 2025, repeatedly cited R1 as the policy event justifying expanded chip export restrictions and accelerated federal permitting for AI data centers.^[16]^[18]

Reception and impact

R1's reception spanned scientific, commercial, and political dimensions in ways unusual for a single model release.

Within days, multiple labs publicly accelerated their reasoning-model roadmaps, and a run of releases followed over the next three months. OpenAI shipped o3-mini on January 31, 2025; Anthropic added an extended thinking mode to Claude 3.7 Sonnet in February 2025; Google released Gemini 2.5 Pro with thinking-by-default in March 2025; Alibaba released Qwen3 with native hybrid reasoning in April 2025. The competitive dynamic R1 created pushed the entire field forward at a faster pace than was widely anticipated.^[4]^[17]^[19]

Sam Altman publicly acknowledged the result, posting in late January 2025 that R1 was "an impressive model, particularly around what they're able to deliver for the price" and adding that "we will obviously deliver much better models." Yann LeCun cited R1 repeatedly as evidence that "open-source models are surpassing proprietary ones." Stanford HAI faculty described DeepSeek's open releases as "a significant step in democratizing AI," enabling smaller laboratories and individual developers to build on frontier-capable models without massive compute budgets.^[17]^[20]

Within China, R1 was integrated within weeks into Tencent's Yuanbao consumer app, Alibaba Cloud's Bailian platform, and Baidu's deployment stack. Chinese smartphone vendors including Xiaomi, Honor, and Oppo added R1-Distill checkpoints (typically the 7B or 14B variants) to on-device AI assistants through 2025. By the second half of 2025, "DeepSeek-compatible" had become a recognizable procurement category in Chinese government IT bids.^[19]^[21]

Open-weight ecosystem impact

The MIT-licensed weights and the published recipe combined to produce one of the largest single-event impacts on the open-source AI ecosystem since the original Llama leak in 2023.

Within a month of launch, over 700 community-built models derived from R1 appeared on Hugging Face, collectively downloaded more than 5 million times. DeepSeek-R1 became the most-liked model on Hugging Face among more than 1.5 million models on the platform, surpassing 10,000 likes within weeks. The variant tree of R1-Distill checkpoints, fine-tuned for medical reasoning, legal analysis, embodied agents, scientific discovery, and dozens of other vertical applications, exceeded 10 million cumulative downloads by mid-2025.^[15]^[17]

DeepSeek-R1-Distill-Qwen-32B became the default open-source reasoning baseline of 2025. It fit on a single 24 GB consumer GPU at 4-bit quantization, ran at usable speeds on a Mac Studio M2 Ultra via MLX or llama.cpp, and offered o1-mini-comparable accuracy on math and code with no API costs and no data egress. The model became one of the most-fine-tuned bases on Hugging Face throughout 2025 and was the teacher of choice for dozens of small reasoning models trained by university labs and independent researchers.^[1]^[15]
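The 24 GB claim can be sanity-checked with back-of-the-envelope arithmetic on weight storage. This is a rough sketch under simplifying assumptions: it uses a nominal 32 billion parameters, counts weights only, and ignores quantization metadata, activations, and the KV cache for long reasoning chains.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8

fp16_gb = weight_memory_gb(32, 16)   # 16-bit weights
int4_gb = weight_memory_gb(32, 4)    # 4-bit quantized weights
headroom_gb = 24 - int4_gb           # left over on a 24 GB card for everything else
```

At 16 bits per weight the model does not fit on a single 24 GB card, while at 4 bits the weights leave several gigabytes for the KV cache and runtime overhead.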

Several formal replication projects attempted to reproduce R1's training trajectory using only public data and open base models: Hugging Face's Open-R1, Berkeley NovaSky's Sky-T1, the TinyZero project from UC Berkeley researchers, the SimpleRL-Reason project, and Allen AI's Tülu 3 follow-up. None matched the original R1 on absolute benchmarks, but several reproduced the qualitative emergence of reflective reasoning behaviors. Microsoft's Phi-4-Reasoning, NVIDIA's OpenReasoning-Nemotron, and dozens of community models used variants of R1's rejection-sampling-then-SFT recipe to bootstrap reasoning capabilities into smaller bases.^[10]^[22]

Successors

DeepSeek-R1-0528 (May 28, 2025)

DeepSeek-R1-0528 was released on May 28, 2025. DeepSeek labeled it a "minor upgrade," but it delivered substantial improvements across all major benchmarks. The model was a refresh rather than a new architecture, applying additional post-training to the same V3-based backbone.^[23]^[24]

Benchmark	R1 (Jan 2025)	R1-0528 (May 2025)
AIME 2024	79.8%	91.4%
AIME 2025	70.0%	87.5%
HMMT 2025	41.7%	79.4%
LiveCodeBench (2408-2505)	63.5%	73.3%
Codeforces-Div1 rating	~1,530	~1,930
SWE-bench Verified	49.2%	57.6%
Aider-Polyglot	53.3%	71.6%
GPQA Diamond	71.5%	81.0%
Humanity's Last Exam	8.5%	17.7%

R1-0528 also added function calling, JSON output, and system-prompt support that the original R1 had lacked. The model averaged roughly 23,000 thinking tokens per query on AIME 2025, up from about 12,000 for the original R1, with the deeper reasoning correlating with the accuracy gains. DeepSeek reported a 45-50% reduction in hallucination rates on rewriting and summarization tasks. A companion distill, DeepSeek-R1-0528-Qwen3-8B, achieved 86.0% on AIME 2024, surpassing the base Qwen3-8B by 10 percentage points and matching the much larger Qwen3-235B-Thinking on the same benchmark.^[23]^[24]

What happened to "DeepSeek-R2"

DeepSeek-R2 was widely rumored throughout 2025 but never released as a model under that brand name. Reuters reported in March 2025 that DeepSeek was racing to ship a successor to R1, and Chinese-language tech outlets carried multiple "R2 imminent" rumors through the spring and summer, citing anonymous sources and partial leaks. None matured into an actual release. Instead, the May 2025 refresh was branded R1-0528, and the August 2025 successor was branded V3.1 rather than R2, folding R1's reasoning capability into a hybrid model that could toggle thinking mode on or off within a single set of weights.^[25]^[26]

As of May 2026, DeepSeek's reasoning capability lives inside the V3.x and V4 hybrid line. The DeepSeek-V4 Preview released April 24, 2026 ships V4-Pro (1.6T total / 49B active) and V4-Flash (284B / 13B) with native thinking-mode toggles. Whether the next reasoning-focused release will be branded R2, V5, or absorbed entirely into the hybrid family has not been announced.^[25]^[27]

Later DeepSeek releases

Release	Date	Notes
DeepSeek-R1	Jan 20, 2025	Initial release with R1-Zero and six distilled variants
DeepSeek-R1-0528	May 28, 2025	Major update; deeper thinking; function calling; JSON; Qwen3-8B distill
DeepSeek-V3.1	Aug 19, 2025	First hybrid model: chat and reasoning in one set of weights with thinking-mode toggle
DeepSeek-V3.2-Exp	Sep 29, 2025	Experimental release introducing DeepSeek Sparse Attention
DeepSeek-OCR	Oct 20, 2025	Vision-language OCR model
DeepSeek-V3.2	Dec 1, 2025	Production hybrid; thinking integrated into tool use
DeepSeek-V4 Preview	Apr 24, 2026	V4-Pro and V4-Flash; 1M context; native hybrid reasoning

V3.1 effectively absorbed R1's role: a single set of weights served as both a fast chat model and (with a thinking-mode toggle) as a reasoning model, reaching roughly 90-95% of R1-0528's performance on reasoning benchmarks while sharing weights with a normal chat model. By V4's April 2026 launch, R1 was no longer DeepSeek's recommended model for new applications, though it remained widely cited and deployed because of its open-source release and well-understood behavior.^[26]^[27]

Nature publication and peer review

On September 18, 2025, the DeepSeek-R1 paper appeared on the cover of Nature (volume 645, issue 8081, pages 633-638), making R1 the first major open-weight large language model to be the subject of a peer-reviewed Nature paper. The corresponding author was Liang Wenfeng, with 199 co-authors from DeepSeek-AI.^[5]^[11]

The peer-reviewed version added several disclosures absent from the January arXiv preprint: the $294,000 RL-stage training cost on 512 H800 GPUs over roughly 80 hours; an acknowledgment that DeepSeek owned A100 GPUs used for preliminary experiments; expanded ablation studies, including a direct response to the "There May Not be Aha Moment" critique; a more detailed quantitative AIME accuracy trajectory; and a response to OpenAI's distillation accusations. That response stated that R1's training data was scraped from the open web, which inevitably included LLM-generated text, but that DeepSeek had not specifically distilled from OpenAI APIs for the reasoning capability itself. Nature published the reviewer comments and DeepSeek's responses alongside the article, an unusual choice for an AI paper that was widely welcomed in the research community.^[5]^[11]

Security, regulatory, and political concerns

As a model from a Chinese laboratory, DeepSeek-R1 attracted regulatory scrutiny in multiple countries. The hosted DeepSeek API and the consumer chat app stored data on Chinese servers subject to Chinese data laws and reportedly applied server-side filters around politically sensitive topics, including Tiananmen Square, the status of Taiwan, and the treatment of Uyghurs in Xinjiang. Self-hosted instances of the open weights behaved differently: many refusals were implemented at the server-filter level rather than baked into the weights themselves, though a May 2025 academic paper titled "R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model" found that certain refusal patterns (particularly around Tiananmen) remained in the weights.^[28]

On February 6, 2025, U.S. Representatives Josh Gottheimer and Darin LaHood introduced the bipartisan "No DeepSeek on Government Devices Act," which passed in August 2025, banning federal employees from using the app on government-issued devices. Texas, Virginia, and New York banned DeepSeek on government systems in February 2025, and the U.S. Navy, NASA, and the Pentagon issued internal restrictions. Outside the United States, Italy's Garante ordered the app removed from Apple and Google stores on January 30, 2025, and Taiwan, South Korea, Australia, and India followed with varying restrictions. The Fiscal Year 2026 National Defense Authorization Act, signed in December 2025, included provisions restricting DeepSeek usage within the Department of Defense and Intelligence Community.^[17]^[28]^[29]

Within days of R1's release, OpenAI accused DeepSeek of improperly distilling from OpenAI models, claiming "some evidence" that DeepSeek had used outputs from OpenAI APIs to train R1 in violation of OpenAI's terms of service. A February 2026 Bloomberg report quoted an OpenAI memo to U.S. lawmakers alleging that DeepSeek had developed methods to circumvent access restrictions through obfuscated third-party routers. Anthropic escalated the issue in February 2026 with a public blog post alleging that DeepSeek, Moonshot AI, and MiniMax had together used roughly 24,000 fake accounts to generate more than 16 million exchanges with Claude. DeepSeek did not publicly admit to using distillation in training R1's reasoning capability; the Nature paper acknowledged that web-scraped training data would inevitably contain text generated by other LLMs but denied targeted distillation of OpenAI's reasoning traces.^[11]^[30]^[31]

Legacy and current status

As of May 2026, DeepSeek-R1 and its derivatives remain among the most widely studied open-source reasoning models even though DeepSeek's own product line has moved on to the V3.x and V4 hybrid families. R1-0528 continues to be available through the DeepSeek API at the original prices and through every major third-party inference provider. The 32B and 70B distilled models remain popular as locally hostable reasoning baselines; the smaller distills (1.5B, 7B, 8B) are widely used as base models for further fine-tuning rather than as deployment endpoints. Legacy aliases deepseek-reasoner and deepseek-chat are scheduled for deprecation on July 24, 2026.^[14]^[27]

The model's legacy is best measured by its influence on the field. R1 proved that reasoning-capable language models could be built openly and cheaply, that reinforcement learning could induce genuine reasoning behaviors without supervised examples, and that a small team with limited resources could compete with the largest AI labs in the world. The recipe it published, GRPO with rule-based rewards on verifiable tasks, became the dominant approach for training reasoning models across both open-source and commercial labs. Most reasoning models released through 2025 and 2026 (Qwen QwQ, Microsoft Phi-Reasoning, Mistral Magistral, OpenAI gpt-oss, Nvidia OpenReasoning-Nemotron) used some variant of the R1 recipe.^[1]^[10]^[22]
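The two ingredients named above, rule-based rewards and group-relative scoring, can be illustrated with a minimal Python sketch. It is not DeepSeek's implementation: the `<think>` tag pattern, the 0.5 format weight, the exact-match answer check, and the function names are all assumptions chosen for illustration, and the policy-gradient update that would consume the advantages is omitted. The sketch only shows how a group of sampled completions for one prompt can be scored by fixed rules and then normalized against the group's own mean and standard deviation, rather than against a learned value model.

```python
import re
import statistics

# Hypothetical format rule: reasoning inside <think>...</think>, then the answer.
THINK_PATTERN = re.compile(r"^<think>.*?</think>\s*(.+)$", re.DOTALL)


def rule_based_reward(completion: str, reference_answer: str) -> float:
    """Score one completion with fixed rules (no learned reward model)."""
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # format rule violated: no reward at all in this sketch
    format_reward = 0.5  # assumed weight, not taken from the paper
    final_answer = match.group(1).strip()
    # Assumed verifier: exact string match against a known reference answer.
    accuracy_reward = 1.0 if final_answer == reference_answer.strip() else 0.0
    return format_reward + accuracy_reward


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Every completion scored the same, so the group carries no signal.
        return [0.0 for _ in rewards]
    return [(reward - mean) / std for reward in rewards]


# One prompt ("What is 2 + 2?"), four sampled completions forming a group.
group = [
    "<think>2 + 2 is 4</think> 4",
    "<think>maybe 5?</think> 5",
    "4",
    "<think>two plus two</think> 4",
]
rewards = [rule_based_reward(completion, "4") for completion in group]
advantages = group_relative_advantages(rewards)

for completion, reward, advantage in zip(group, rewards, advantages):
    print(f"{reward:.1f}  {advantage:+.2f}  {completion!r}")
```

In this toy group, the correct and well-formatted completions receive positive advantages, while the wrong answer and the unformatted answer receive negative ones, so an update built on these values would push the policy toward the former.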

R1 also reset expectations for what a model release should look like. The combination of a permissive MIT license, a detailed published recipe, peer-reviewed publication, six pre-distilled variants, and aggressive API pricing became a de facto template against which other open-source releases were measured. When subsequent releases were perceived as stinting on documentation or imposing restrictive licenses, the comparison was usually to R1.

The financial and policy aftershocks lasted longer than the model itself. The "DeepSeek shock" of January 27, 2025 is now treated as the canonical market event of the AI boom, alongside ChatGPT's November 2022 launch. It catalyzed the U.S. government's "America's AI Action Plan," accelerated U.S. chip export controls, prompted OpenAI's and Anthropic's public claims of cross-lab distillation, and put open-weight reasoning permanently inside the policy conversation. Even after the cost numbers were reframed, the directional finding has held up: frontier reasoning capability had become cheap enough for a focused team to reach.^[16]^[32]

References

1. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." DeepSeek-AI, arXiv:2501.12948, January 22, 2025. https://arxiv.org/abs/2501.12948
2. "DeepSeek-R1 model card." DeepSeek-AI, Hugging Face. https://huggingface.co/deepseek-ai/DeepSeek-R1
3. "Nvidia sheds almost $600 billion in market cap, biggest drop ever." CNBC, January 27, 2025. https://www.cnbc.com/2025/01/27/nvidia-sheds-almost-600-billion-in-market-cap-biggest-drop-ever.html
4. "A shocking Chinese AI advancement called DeepSeek is sending US stocks plunging." CNN Business, January 27, 2025. https://www.cnn.com/2025/01/27/tech/deepseek-stocks-ai-china
5. "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning." DeepSeek-AI et al., Nature 645, 633-638, September 18, 2025. https://www.nature.com/articles/s41586-025-09422-z
6. "Meet DeepSeek founder Liang Wenfeng, a hedge fund manager." Fortune, January 27, 2025. https://fortune.com/2025/01/27/deepseek-founder-liang-wenfeng-hedge-fund-manager-high-flyer-quant-trading/
7. "DeepSeek-V3 Technical Report." DeepSeek-AI, arXiv:2412.19437, December 27, 2024. https://arxiv.org/abs/2412.19437
8. "There May Not be Aha Moment in R1-Zero-like Training: A Pilot Study." Sea AI Lab, 2025. https://sail.sea.com/blog/articles/62
9. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." DeepSeek-AI, arXiv:2402.03300, February 2024. https://arxiv.org/abs/2402.03300
10. "Group Relative Policy Optimization (GRPO)." Cameron R. Wolfe, Substack, 2025. https://cameronrwolfe.substack.com/p/grpo
11. "DeepSeek didn't really train its flagship model for $294,000." The Register, September 19, 2025. https://www.theregister.com/2025/09/19/deepseek_cost_train/
12. "China's DeepSeek shook the tech world. Its developer just revealed the cost of training the AI model." CNN Business, September 19, 2025. https://www.cnn.com/2025/09/19/business/deepseek-ai-training-cost-china-intl
13. "DeepSeek Debates: Chinese Leadership On Cost, True Training Cost, Closed Model Margin Impacts." SemiAnalysis, January 31, 2025. https://semianalysis.com/2025/01/31/deepseek-debates/
14. "Models and pricing." DeepSeek API Docs. https://api-docs.deepseek.com/quick_start/pricing
15. "State of Open Source on Hugging Face: Spring 2026." Hugging Face, 2026. https://huggingface.co/blog/huggingface/state-of-os-hf-spring-2026
16. "Nvidia loses $589 billion as DeepSeek batters stock." Bloomberg, January 27, 2025. https://www.bloomberg.com/news/newsletters/2025-01-27/nvidia-loses-589-billion-as-deepseek-batters-stock-evening-briefing-americas
17. "How disruptive is DeepSeek? Stanford HAI faculty discuss." Stanford Report, February 2025. https://hai.stanford.edu/news/how-disruptive-is-deepseek-stanford-hai-faculty-discuss-chinas-new-model
18. "Winning the Race: America's AI Action Plan." White House Office of Science and Technology Policy, July 23, 2025. https://www.whitehouse.gov/wp-content/uploads/2025/07/Americas-AI-Action-Plan.pdf
19. "How Tencent, Alibaba and Baidu rushed to integrate DeepSeek." Caixin Global, February 2025. https://www.caixinglobal.com/2025-02-10/how-chinas-tech-giants-are-racing-to-integrate-deepseek-r1-102287345.html
20. "OpenAI's Sam Altman on DeepSeek-R1." X (Twitter), January 28, 2025. https://x.com/sama/status/1884361876710736356
21. "Chinese smartphone vendors integrate DeepSeek-R1 distills on-device." Various sources, February-July 2025.
22. "Open-R1: a fully open reproduction of DeepSeek-R1." Hugging Face Open-R1 team, 2025. https://huggingface.co/blog/open-r1
23. "DeepSeek-R1-0528 release." DeepSeek API Docs, May 28, 2025. https://api-docs.deepseek.com/news/news250528
24. "DeepSeek-R1-0528 model card." DeepSeek-AI, Hugging Face. https://huggingface.co/deepseek-ai/DeepSeek-R1-0528
25. "China's DeepSeek racing to launch successor to viral R1 model, sources say." Reuters, March 25, 2025. https://www.reuters.com/technology/artificial-intelligence/chinas-deepseek-racing-launch-successor-viral-r1-model-sources-say-2025-03-25/
26. "A Technical Tour of the DeepSeek Models from V3 to V3.2." Sebastian Raschka, 2026. https://magazine.sebastianraschka.com/p/technical-deepseek
27. "DeepSeek roadmap and confirmed releases through 2026." Chat-Deep.ai roadmap tracker, 2026. https://chat-deep.ai/guide/deepseek-roadmap-rumors/
28. "R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model." arXiv:2505.12625, May 2025. https://arxiv.org/abs/2505.12625
29. "Which countries have banned DeepSeek and why?" Al Jazeera, February 6, 2025. https://www.aljazeera.com/news/2025/2/6/which-countries-have-banned-deepseek-and-why
30. "OpenAI Accuses China's DeepSeek of Distilling US AI Models to Gain an Edge." Bloomberg, February 12, 2026. https://www.bloomberg.com/news/articles/2026-02-12/openai-accuses-deepseek-of-distilling-us-models-to-gain-an-edge
31. "Disrupting state-sponsored uses of AI." Anthropic, February 2026. https://www.anthropic.com/news/disrupting-AI
32. "International AI Safety Report 2026 update." Bengio et al., 2026. https://www.gov.uk/government/publications/international-ai-safety-report-2026

