| DeepSeek-R1 | |
|---|---|
| Developer | DeepSeek |
| Release date | January 20, 2025 |
| Type | Large language model (reasoning) |
| Architecture | Mixture of Experts (MoE), Transformer |
| Parameters | 671 billion total; 37 billion active per token |
| Context length | 128,000 tokens |
| Training method | Reinforcement learning with GRPO |
| Training compute | 512 Nvidia H800 GPUs, ~80 hours |
| Reported training cost | $294,000 (R1 RL stage only) |
| License | MIT |
| Updated version | DeepSeek-R1-0528 (May 28, 2025) |
| Successor lineage | DeepSeek-V3.1 (Aug 2025), V3.2 (Dec 2025) |
| Paper | arXiv:2501.12948; Nature 645, 633-638 (Sept 18, 2025) |
DeepSeek-R1 is an open-source reasoning-focused large language model developed by DeepSeek, a Chinese artificial intelligence company spun out of the High-Flyer quantitative hedge fund. Released on January 20, 2025 under the MIT license, R1 was the first open-weight model to match the reasoning performance of OpenAI's proprietary o1 across mathematics, coding, and scientific reasoning tasks. The model uses a Mixture of Experts architecture with 671 billion total parameters, of which 37 billion are activated per forward pass, keeping computational costs manageable during inference.[1][2]
DeepSeek-R1's release triggered what became known as the "DeepSeek shock," a market event on January 27, 2025 that erased over $1 trillion from U.S. technology stocks in a single trading session. Nvidia alone lost approximately $589 billion in market capitalization, the largest single-day loss for any company in stock market history. The shock stemmed from the revelation that a small Chinese startup with roughly 160 employees had trained a reasoning model competitive with the world's most expensive AI systems at a fraction of the hundreds of millions typically spent by Western labs: the V3 base model had been trained for a reported $5.6 million, and the later Nature paper put R1's RL stage at just $294,000.[3][4][22]
Beyond its market impact, DeepSeek-R1 was scientifically significant for demonstrating that complex reasoning behaviors could emerge from pure reinforcement learning without supervised fine-tuning. The companion model DeepSeek-R1-Zero, trained entirely through RL with rule-based rewards, developed chain-of-thought reasoning, self-reflection, and error correction spontaneously during training, a finding that challenged assumptions about how reasoning capabilities must be instilled in language models. The accompanying paper became the first major open-weight LLM to pass independent peer review, appearing on the cover of Nature on September 18, 2025.[1][5][22]
DeepSeek's path to R1 began with a hedge fund, not an AI lab. High-Flyer, a Chinese quantitative trading firm co-founded in February 2016 by Liang Wenfeng and two of his classmates from Zhejiang University, accumulated tens of thousands of Nvidia GPUs over the late 2010s for stock-prediction and high-frequency-trading workloads. By 2020 the firm had built one of the largest private AI training clusters in China. Liang spun the AI research arm into an independent company, DeepSeek, in May 2023, and seeded it with engineers experienced in squeezing performance out of large GPU pools. Unlike most well-funded AI startups, DeepSeek was bootstrapped from hedge fund profits and had no external investors at the time of R1's release.[6][23]
DeepSeek had been building toward R1 throughout 2024. The company released DeepSeek-V2 in May 2024 and DeepSeek-V3 in December 2024, both using Mixture of Experts architectures that prioritized computational efficiency. V3 served as the base model for R1's training, providing a strong foundation of general language capabilities and world knowledge. V3 itself was trained on roughly 14.8 trillion tokens for an estimated $5.576 million in GPU rental cost, a figure that would later feature prominently in debates over R1's true total cost.[2][6][22]
The broader context for R1's development was the emergence of inference-time reasoning as a new paradigm in AI. OpenAI's o1, released in September 2024, had demonstrated that training models with reinforcement learning to "think before answering" could dramatically improve performance on difficult tasks. OpenAI did not publish technical details about how o1 was trained, leaving the broader research community to guess at the recipe. DeepSeek's contribution was to show that this approach could be replicated with open-source models at a fraction of the cost, and that the reasoning behaviors could emerge more naturally than previously assumed. The paper made the recipe public.[1]
DeepSeek-R1 is built on top of DeepSeek-V3, which uses a Mixture of Experts transformer architecture. The key architectural features include:[2][6]
- Mixture of Experts (MoE) feed-forward layers: 671 billion total parameters, of which only 37 billion are activated per token by a learned router
- Multi-head Latent Attention (MLA): a low-rank compression of the key-value cache that reduces attention memory during inference
- A context window of 128,000 tokens

The MoE architecture is central to R1's efficiency story. By activating only 37 billion of its 671 billion parameters for each token, the model achieves inference costs comparable to a much smaller dense model while maintaining the knowledge capacity of its full parameter count. MLA further reduces memory pressure during long-context inference, which matters for reasoning models that emit thousands of intermediate tokens before answering.
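To make the routing idea concrete, the following is a minimal sketch of generic top-k expert gating in Python. It illustrates why per-token compute scales with the number of *selected* experts rather than the total, but it is not DeepSeek's actual router (DeepSeekMoE adds shared experts and its own gating scheme); all names here are illustrative.

```python
import numpy as np

def top_k_route(token_hidden: np.ndarray, gate_weights: np.ndarray, k: int = 8):
    """Generic top-k MoE gating: pick the k experts with the highest scores.

    token_hidden: (d,) hidden state for one token.
    gate_weights: (n_experts, d) router matrix.
    Only the k selected expert FFNs run for this token, so per-token compute
    scales with k, not with n_experts -- the source of MoE's efficiency.
    """
    scores = gate_weights @ token_hidden              # one score per expert
    chosen = np.argsort(scores)[-k:]                  # indices of the top-k experts
    probs = np.exp(scores[chosen] - scores[chosen].max())
    return chosen, probs / probs.sum()                # experts plus mixing weights
```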
DeepSeek-R1's training followed a four-stage pipeline that combined supervised learning and reinforcement learning:[1][5]
Stage 1: Cold start. The DeepSeek-V3 base model was fine-tuned on a small set of curated chain-of-thought reasoning examples (a few thousand long-CoT samples). This cold-start data provided the model with initial examples of structured reasoning, addressing issues like repetitive loops and poor readability that occurred when applying RL directly to the base model in the R1-Zero experiment.
Stage 2: Reasoning-oriented reinforcement learning. Large-scale RL was applied using Group Relative Policy Optimization (GRPO), focused on tasks with verifiable answers (mathematics, coding, logic problems). The model learned to generate extended chains of thought and was rewarded based solely on the correctness of its final answers. A language-consistency reward was also added to discourage the language mixing seen in R1-Zero.
Stage 3: Rejection sampling and supervised fine-tuning. The RL-trained model generated a large set of reasoning traces. High-quality traces were selected through rejection sampling and combined with non-reasoning data to produce a curated dataset of approximately 800,000 samples (about 600,000 reasoning samples and 200,000 general samples covering writing, factual QA, self-cognition, and translation). DeepSeek-V3-Base was then fine-tuned on this dataset for two epochs.[1][24]
Stage 4: Reinforcement learning for all scenarios. A final round of RL was applied across both reasoning and general tasks, optimizing for helpfulness and harmlessness using a combination of rule-based and model-based reward signals. This stage tightened the model's behavior on conversational tasks where rule-based rewards were not available.
The rejection-sampling step in Stage 3 also fed the recipe used to train all six distilled variants, since the same 800K-sample dataset was used to fine-tune smaller open-source base models.[1]
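As a rough illustration of the Stage 3 selection loop, here is a minimal rejection-sampling sketch. The `generate` and `verify` callables are placeholders standing in for the RL-trained model and a rule-based answer checker; the paper's actual filtering was richer (it also discarded traces with mixed languages and poor readability).

```python
def rejection_sample(prompts, generate, verify, k=16):
    """Collect reasoning traces whose final answer passes verification.

    generate(prompt, k) -> list of k sampled completions (placeholder).
    verify(prompt, completion) -> bool, e.g. exact match on a math answer
    or a passing test suite for code (placeholder).
    """
    dataset = []
    for prompt in prompts:
        for completion in generate(prompt, k):
            if verify(prompt, completion):
                dataset.append({"prompt": prompt, "completion": completion})
    return dataset  # becomes SFT data, mixed with non-reasoning samples
```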
In the Nature publication of September 2025, DeepSeek disclosed that the reinforcement learning portion of R1's training used 512 Nvidia H800 GPUs for approximately 80 hours, at an estimated rental cost of $294,000 (assuming a then-current $2 per GPU-hour rental rate). Supplementary materials acknowledged for the first time that DeepSeek also owned A100 GPUs and used them for preparatory experiments at smaller scale.[22]
This $294,000 figure refers narrowly to the RL stage that turned DeepSeek-V3 into R1. It does not include the cost of training V3 itself (around $5.6 million for the base model), the cost of generating cold-start data, the cost of distillation, salaries, or the depreciated cost of the GPU cluster. Several outlets, including The Register and CNN Business, noted that the true end-to-end cost of producing R1 was roughly an order of magnitude higher than the headline number, though still dramatically below the budgets of comparable Western reasoning models.[22][25]
GRPO is the reinforcement learning algorithm used to train both R1-Zero and R1. Originally proposed by DeepSeek for their earlier DeepSeekMath model in a February 2024 paper, GRPO simplifies the RL training process compared to Proximal Policy Optimization (PPO), which had been the standard approach for language model RL training (as used in RLHF).[1][5][16]
The key innovation of GRPO is eliminating the need for a separate critic (value) model. In standard PPO-based RLHF, two models must be maintained during training: the policy model being optimized and a value model that estimates expected returns. The value model alone can be as large as the policy model, effectively doubling the computational requirements of training. GRPO removes this requirement by using a simpler baseline derived from group statistics.[1][16]
The algorithm works as follows:[1][5][16]
1. For each training prompt, the current policy samples a group of G candidate outputs.
2. Each output receives a scalar reward from a rule-based verifier. For mathematics, the reward reflects whether the final answer (extracted from a \boxed{...} directive) matches the ground truth. For code, it means whether the generated program passes a hidden test suite.
3. Each output's advantage is computed relative to its group: the group's mean reward is subtracted and the result is divided by the group's standard deviation, so no learned value model is needed.
4. The policy is updated with a PPO-style clipped objective using these group-relative advantages, with a KL penalty keeping it close to a reference policy.

This group-relative approach has several advantages. By normalizing rewards within each group, GRPO reduces the impact of reward scale differences across different problem types. The elimination of the value model cuts training memory requirements roughly in half, allowing the same hardware to train larger models. The algorithm is also simpler to implement and tune than PPO with a learned value function.[16]
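A minimal sketch of the group-relative advantage computation, assuming binary rewards; variable names are illustrative, and the full GRPO update applies a clipped policy-gradient objective and a KL penalty on top of these advantages.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO advantage for one group of outputs sampled from the same prompt.

    The group mean serves as the baseline that a learned value model would
    otherwise provide; dividing by the group std normalizes reward scale.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 sampled solutions to one problem, 3 of them correct.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))
# Correct samples receive a positive advantage, incorrect ones a negative one.
```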
DeepSeek used a deliberately simple reward design: an accuracy reward (right or wrong on a verifiable answer) plus a format reward (the model is required to enclose its reasoning in <think>...</think> tags and its final answer in a designated location). No reward model based on human preference data was used during the reasoning-oriented stages, which sidestepped one of the most expensive components of conventional RLHF and one of the harder failure modes (reward hacking against a learned reward model).[1][5]
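A sketch of what such a rule-based reward can look like in practice, assuming a math task with a single \boxed{...} answer. The exact reward weighting and answer-matching logic DeepSeek used are not published in this form, so treat the details below as illustrative.

```python
import re

THINK_RE = re.compile(r"^<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def format_reward(completion: str) -> float:
    """1.0 if the reasoning is wrapped in <think>...</think> at the start."""
    return 1.0 if THINK_RE.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the last \\boxed{...} expression matches the reference answer."""
    answers = BOXED_RE.findall(completion)
    return 1.0 if answers and answers[-1].strip() == ground_truth.strip() else 0.0

def reward(completion: str, ground_truth: str) -> float:
    return accuracy_reward(completion, ground_truth) + format_reward(completion)
```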
Since R1's release, GRPO has become widely adopted in the language model community. Hugging Face's TRL library added native GRPO support, and numerous research groups have used it to train their own reasoning models. The algorithm's combination of simplicity, efficiency, and effectiveness made it particularly attractive for smaller teams and academic researchers who could not afford the memory overhead of full PPO. By the end of 2025, GRPO and its derivatives (DAPO, GRPO+ from Qwen, REINFORCE++ variants) had become the de facto standard for training open-source reasoning models, largely displacing PPO as the preferred recipe for post-training reasoning behaviors.[16]
Before training R1, DeepSeek conducted an experiment called R1-Zero that became one of the most discussed results in the paper. R1-Zero was trained by applying reinforcement learning directly to the DeepSeek-V3 base model, without any supervised fine-tuning or curated reasoning examples. The model was simply given problems and rewarded for producing correct answers.[1][5]
R1-Zero used a deliberately minimal prompt template that asked the base model to enclose its reasoning inside <think>...</think> tags and its final answer inside <answer>...</answer> tags. No examples of reasoning were provided. The base model began by emitting essentially random text inside the think tags, but the GRPO training loop pushed it toward producing reasoning content that actually helped it answer correctly. Over tens of thousands of training steps, that pressure produced increasingly structured reasoning behaviors.[1]
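Expressed as a Python string, the template looked roughly like the following; this is a close paraphrase of the paper's wording, so treat the exact phrasing as approximate.

```python
R1_ZERO_TEMPLATE = (
    "A conversation between User and Assistant. The user asks a question, "
    "and the Assistant solves it. The Assistant first thinks about the "
    "reasoning process in the mind and then provides the user with the "
    "answer. The reasoning process and answer are enclosed within "
    "<think> </think> and <answer> </answer> tags, respectively.\n"
    "User: {question}\nAssistant:"
)

prompt = R1_ZERO_TEMPLATE.format(question="Solve x^2 - 5x + 6 = 0.")
```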
Despite receiving no explicit training on how to reason, R1-Zero spontaneously developed several sophisticated reasoning behaviors during RL training:[1][5]
- Self-verification: checking intermediate steps before committing to an answer
- Reflection: revisiting and re-evaluating earlier reasoning, often signaled by words like "wait"
- Error correction and backtracking to alternative solution strategies
- Increasingly long chains of thought, with more thinking tokens allocated to harder problems
The paper reported a striking accuracy trajectory on the AIME 2024 mathematics olympiad. R1-Zero's pass@1 score climbed from 15.6% at the start of RL training to 71.0% by the end of training. With self-consistency (majority voting over 64 samples) the score reached 86.7%. By the end of the run, R1-Zero on its own had matched o1-preview-level scores on AIME using nothing more than RL on a base model that had never seen a reasoning example.[1][22]
Researchers tracked the emergence of reflective reasoning behaviors across training by measuring the frequency of specific terms in the model's outputs. The results showed a clear phase transition:[1][5][17]
| Training stage | Reflective term frequency | Behavior |
|---|---|---|
| Steps 0-4,000 | Virtually absent | Model generates linear, non-reflective solutions |
| Steps 4,000-7,000 | Sporadic appearance | Occasional use of "wait," "but," "however" |
| Steps 8,000+ | Marked increase | Systematic self-monitoring and error correction |
Specific reflective terms tracked included "wait," "mistake," "however," "but," "retry," "error," "verify," "wrong," "evaluate," and "check." These terms were virtually absent in the early stages of training, appeared sporadically in the middle stages, and showed a marked increase after step 8,000, suggesting the emergence of temporal reasoning or self-monitoring behavior.[17]
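A simple recreation of that measurement, assuming whitespace tokenization; the cited study's exact counting methodology may differ.

```python
REFLECTIVE_TERMS = {"wait", "mistake", "however", "but", "retry",
                    "error", "verify", "wrong", "evaluate", "check"}

def reflective_term_rate(outputs: list[str]) -> float:
    """Average count of reflective terms per model output."""
    total = sum(
        sum(1 for word in text.lower().split()
            if word.strip(".,!?") in REFLECTIVE_TERMS)
        for text in outputs
    )
    return total / max(len(outputs), 1)

# Tracked across training checkpoints, this rate was near zero before
# step ~4,000 and rose sharply after step ~8,000 in the study cited above.
```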
The model also showed a clear increase in the length of its reasoning chains over the course of training. Early in RL training, the model generated short, direct answers. As training progressed, the average response length grew steadily from a few hundred tokens to several thousand, with the model learning to allocate more thinking time to harder problems. This adaptive allocation of test-time compute was not explicitly trained but emerged naturally from the optimization process.[1][5]
DeepSeek's paper highlighted what they called an "aha moment" during R1-Zero's training. At a certain point in RL training, the model showed a sudden increase in the use of reflective language (particularly the word "wait") during its reasoning chains. The paper printed an excerpt of one such moment, where the model interrupts itself in the middle of a math problem with the phrase "Wait, wait. Wait. That's an aha moment I can flag here." before backtracking to a different approach and getting the right answer. This marked a qualitative shift in the model's reasoning patterns, where it began systematically re-evaluating and correcting its own work rather than simply proceeding linearly through a solution.[1][5]
The aha moment became widely discussed in the AI research community. DeepSeek described it as evidence of "the self-evolution process" of the model, suggesting that reinforcement learning could induce genuinely emergent cognitive strategies. However, subsequent research by other groups has debated whether these behaviors were truly emergent or whether traces of reflective reasoning were already present in the base model's pre-training data. A study by Sea AI Lab titled "There May Not be Aha Moment in R1-Zero-like Training" argued that the observed behaviors could be attributed to pre-existing patterns in the training data rather than genuine emergence, and replicated similar trajectories starting from base models that had been pre-trained on web text containing reasoning-style writing.[5][7]
Despite its impressive emergent behaviors, R1-Zero had practical limitations that motivated the development of the full R1 model. Its outputs often suffered from poor readability, with reasoning chains that mixed Chinese and English mid-sentence, repeated phrases endlessly, or failed to clearly delineate the final answer. The model also struggled with tasks outside of mathematics and coding, where the lack of supervised fine-tuning left it without appropriate response formats. These issues were addressed in R1 through the cold-start data, the language-consistency reward in Stage 2, and the multi-stage training pipeline.[1]
Even with these flaws, R1-Zero on its own represented a notable scientific result: pure RL on a strong base model produced a competitive reasoning model. R1-Zero was released alongside R1 under the same MIT license so that researchers could study the unfiltered behavior of an RL-only reasoning model.
DeepSeek-R1 achieved performance competitive with OpenAI's o1 across major reasoning benchmarks.
| Benchmark | DeepSeek-R1 | OpenAI o1 (Dec 2024) | GPT-4o | Description |
|---|---|---|---|---|
| AIME 2024 (pass@1) | 79.8% | 79.2% | 13.4% | American Invitational Mathematics Exam |
| MATH-500 | 97.3% | 96.4% | 60.3% | Mathematical problem solving |
| GPQA Diamond | 71.5% | 75.7% | 53.6% | Graduate-level science questions |
| Codeforces (Elo / percentile) | 2,029 / 96.3 | 2,061 / 96.6 | n/a | Competitive programming rating |
| MMLU | 90.8% | 91.8% | 87.2% | Multitask language understanding |
| MMLU-Pro | 84.0% | 81.9% | 73.3% | Harder MMLU variant |
| LiveCodeBench (CoT) | 65.9% | 63.4% | 33.4% | Real-world coding tasks |
| SWE-bench Verified | 49.2% | 48.9% | 33.2% | Software engineering tasks |
| HumanEval | 85.4% | 92.4% | 90.2% | Code generation |
| AlpacaEval 2.0 (LC) | 87.6% | n/a | 51.1% | Open-ended instruction following |
| ArenaHard | 92.3% | n/a | 80.4% | Adversarial chat eval |
The results showed that R1 matched or exceeded o1 on most mathematical and coding benchmarks while trailing slightly on graduate-level science (GPQA Diamond) and short-form code generation (HumanEval). The fact that an open-source model could achieve these results, trained at a fraction of the cost, was the central claim that drove both the scientific interest and the market reaction. Importantly, R1's chat-style benchmarks (AlpacaEval, ArenaHard) showed that the multi-stage training preserved instruction-following quality even as it added reasoning capability.
Alongside R1, DeepSeek released six smaller distilled models created through knowledge distillation, where R1's reasoning capabilities were transferred to smaller, more efficient base models. DeepSeek reused the 800,000-sample dataset curated in Stage 3 (roughly 600,000 reasoning traces generated by R1 plus 200,000 general samples) as supervised fine-tuning data for smaller bases from the Qwen2.5 and Llama 3 families. No additional RL was applied to the distilled models in this initial release.[1][2]
| Distilled model | Base model | Parameters | License | AIME 2024 | MATH-500 | GPQA Diamond |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1.5B | Apache 2.0 + MIT | 28.9% | 83.9% | 33.8% |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 7B | Apache 2.0 + MIT | 55.5% | 92.8% | 49.1% |
| DeepSeek-R1-Distill-Llama-8B | Llama 3.1-8B | 8B | Llama 3 + MIT | 50.4% | 89.1% | 49.0% |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | Apache 2.0 + MIT | 69.7% | 93.9% | 59.1% |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | Apache 2.0 + MIT | 72.6% | 94.3% | 62.1% |
| DeepSeek-R1-Distill-Llama-70B | Llama 3.3-70B-Instruct | 70B | Llama 3 + MIT | 70.0% | 94.5% | 65.2% |
The distilled models were a major part of R1's impact. The smallest model, Qwen-1.5B, outperformed GPT-4o and Claude 3.5 Sonnet on math benchmarks despite being small enough to run on consumer hardware. The 32B and 70B distilled models set new state-of-the-art results among dense (non-MoE) open-source models on reasoning benchmarks, outperforming the contemporaneous QwQ-32B-Preview by substantial margins. Notably, the 32B distillation reached 72.6% on AIME 2024, beating OpenAI's o1-mini (63.6%) by nine points.[1]
Distilled models inherited the licenses of their base checkpoints. Qwen-based distills are released under Apache 2.0 (with the fine-tuning weights themselves under MIT), while Llama-based distills are governed by their respective Meta Llama community licenses with the fine-tuning weights under MIT.[2]
The distilled models enabled local deployment on consumer hardware, which became a major driver of community adoption. The hardware requirements for running different distilled models are as follows:[18]
| Distilled model | Minimum VRAM | Recommended GPU | Performance notes |
|---|---|---|---|
| 1.5B / 7B / 8B | 8 GB | NVIDIA RTX 3060 12GB | Runs efficiently at standard quantization |
| 14B | 12-16 GB | NVIDIA RTX 4070 Ti 16GB | Fits in VRAM at 4-bit quantization |
| 32B | 20-24 GB | NVIDIA RTX 3090/4090 24GB | Smooth performance at 4-bit quantization |
| 70B | 40-48 GB | 2x NVIDIA RTX 3090 or A100 | Requires multi-GPU or offloading |
The 32B distilled model hit a particularly attractive sweet spot, offering performance comparable to OpenAI's o1-mini on several benchmarks while running on a single consumer-grade RTX 4090 GPU. Running locally eliminated API costs, kept data private, removed rate limits, and provided offline access to explicit chain-of-thought reasoning.[18]
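As an illustration, a distilled checkpoint can be run locally with Hugging Face transformers in a few lines. This is a minimal sketch: the model ID below is the published Hugging Face identifier for the 7B distill, the temperature follows DeepSeek's recommended 0.6, and quantized llama.cpp builds are the lighter-weight alternative for the VRAM budgets in the table above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The model emits its chain of thought in <think>...</think> before answering.
output = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```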
The distilled models proved especially popular with the open-source community. Within weeks of release, hundreds of derivative models were created on Hugging Face, fine-tuned for specific use cases ranging from medical reasoning to financial analysis. Hobbyists ran the 8B and 14B distills on Apple Silicon laptops via llama.cpp and MLX, providing the first time many users had access to a fully local reasoning model that could explain its work.
DeepSeek included an ablation in the paper comparing two ways of giving small models reasoning capabilities: distilling traces from a strong reasoning teacher (R1) versus running GRPO directly on a small base model. Distillation won decisively. The 32B Qwen base distilled from R1 outperformed the same base trained with R1's RL recipe directly. The takeaway was that the cheap path for small reasoning models is to distill from a large reasoning teacher rather than run RL from scratch, a finding that influenced subsequent open-source training recipes.[1]
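A sketch of what one distillation training example looks like under these recipes, assuming the <think>-tagged output format the distilled models emit; the field names are illustrative.

```python
def to_sft_example(question: str, reasoning: str, final_answer: str) -> dict:
    """Package one verified teacher trace as a supervised fine-tuning pair.

    The student base model is then fine-tuned with ordinary next-token
    prediction on the completion -- no reinforcement learning involved.
    """
    return {
        "prompt": question,
        "completion": f"<think>\n{reasoning}\n</think>\n\n{final_answer}",
    }
```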
DeepSeek released R1, R1-Zero, and all six distilled models under the MIT license, one of the most permissive open-source licenses available. The license explicitly permits commercial use, modification, redistribution, model distillation, and the use of API outputs to train other models. The full model weights, training code references, and technical paper were all made publicly available on GitHub and Hugging Face.[1][2]
This explicit grant of distillation rights was unusual. Most proprietary AI providers either prohibit using their outputs to train competing models or leave the question ambiguous. DeepSeek's terms made it clear that researchers could legally train students on R1's outputs, which removed legal friction from the wave of follow-on work.
The open-source strategy was a deliberate choice that amplified R1's impact far beyond what a proprietary release would have achieved. Within a month of launch, over 700 community-built models derived from R1 appeared on Hugging Face, collectively downloaded more than 5 million times. Major cloud providers including Microsoft Azure, Amazon Web Services, and Nvidia's inference platforms quickly added support for R1, making it accessible through familiar enterprise interfaces.[2][9]
DeepSeek-R1 became the most-liked model on Hugging Face among nearly 1.5 million models on the platform, surpassing 10,000 likes within weeks of release. The variant versions collectively exceeded 10 million downloads. R1's release also catalyzed a broader shift in the open-source AI ecosystem: the number of competitive Chinese organizations releasing models increased dramatically, with Baidu going from zero releases on Hugging Face in 2024 to over 100 in 2025, and ByteDance and Tencent each increasing releases by eight to nine times.[19]
The MIT license also enabled a wave of academic research building on R1's approach. Researchers at universities and smaller labs could study the model's reasoning traces, replicate the RL training methodology, and test hypotheses about emergent reasoning that would have been impossible with a proprietary model. Several groups, including Hugging Face's Open-R1 project, Berkeley's NovaSky team (Sky-T1), and the Together AI / Stanford collaboration on TinyZero, attempted full replications of the R1 training pipeline using only public data and open base models.
The market reaction to R1's release became a defining financial event of early 2025. On January 27, 2025, one week after R1's public release, U.S. technology stocks experienced their steepest single-day decline in history.[3][4]
The sell-off was triggered by a sudden reassessment of the AI investment thesis. For years, the market had priced technology companies, especially chipmakers and cloud providers, on the assumption that building frontier AI required massive and growing capital expenditures. DeepSeek's demonstration that a 160-person Chinese startup could produce competitive results undermined that assumption.[3][4]
Nvidia's stock fell nearly 17% in a single session, closing at $118.58 and losing approximately $589 billion in market capitalization. This was the largest single-day market value loss for any company in history. Other semiconductor companies including Broadcom, Marvell, Micron, and TSMC also fell sharply. The Nasdaq composite lost roughly $1 trillion in value by the end of the day. Meta and Alphabet (Google's parent company) also declined significantly. Apple briefly retook the title of world's most valuable company as Nvidia fell to roughly $2.8 trillion in market cap.[3][4]
The DeepSeek mobile app reached number one on the Apple App Store in the United States on January 27, displacing ChatGPT. That ranking became part of the news cycle around the stock drop, with retail investors and journalists pointing to the consumer ranking as a tangible sign that something had changed.
Marc Andreessen, the prominent technology investor, described the event as "AI's Sputnik moment," drawing a parallel to the 1957 Soviet satellite launch that shocked the United States into accelerating its space program. The comparison captured the sense that a competitor working with far fewer resources had achieved something that the established players, with their billions in investment, had assumed only they could do.[4][10] President Donald Trump, speaking at a Republican retreat the same week, called R1 a "wake-up call for our industries that we need to be laser-focused on competing to win."
The market impact extended beyond the immediate sell-off. Chinese AI companies entered an aggressive price war, with some cutting API prices by up to 97% in the weeks following R1's release. In the United States, the event forced a public debate about whether the hundreds of billions being invested in AI data centers and chip manufacturing were truly necessary, or whether architectural innovation could substitute for raw compute.[3][4]
Stanford HAI faculty noted that DeepSeek's open releases represented "a significant step in democratizing AI," enabling smaller companies and individual developers to build on frontier-capable models without massive compute budgets.[9] Within days, multiple Western labs publicly accelerated their own reasoning-model roadmaps. OpenAI shipped o3-mini on January 31, 2025, and Anthropic added an extended thinking mode to Claude 3.7 Sonnet in February.
On May 28, 2025, DeepSeek released a major update to R1 designated R1-0528. Despite being described as a "minor upgrade" in official communications, the update delivered substantial improvements across all major benchmarks.[11][12]
| Benchmark | R1 (Jan 2025) | R1-0528 (May 2025) | Change |
|---|---|---|---|
| AIME 2024 | 79.8% | 91.4% | +11.6 |
| AIME 2025 | 70.0% | 87.5% | +17.5 |
| HMMT 2025 | 41.7% | 79.4% | +37.7 |
| CNMO 2024 | 78.8% | 86.9% | +8.1 |
| LiveCodeBench (2408-2505) | 63.5% | 73.3% | +9.8 |
| Codeforces-Div1 rating | ~1,530 | ~1,930 | +400 |
| SWE-bench Verified | 49.2% | 57.6% | +8.4 |
| Aider-Polyglot | 53.3% | 71.6% | +18.3 |
| MMLU-Redux | 92.9% | 93.4% | +0.5 |
| MMLU-Pro | 84.0% | 85.0% | +1.0 |
| GPQA Diamond | 71.5% | 81.0% | +9.5 |
| Humanity's Last Exam | 8.5% | 17.7% | +9.2 |
| FRAMES | 82.5% | 83.0% | +0.5 |
The AIME 2025 improvement from 70% to 87.5% was particularly notable, bringing R1-0528 into competitive range with OpenAI's o3 (88.9% on AIME 2025). The Codeforces rating jump of approximately 400 points reflected dramatically improved code generation and problem-solving ability. The HMMT 2025 improvement of nearly 38 points was the single largest gain, reflecting deeper engagement with multi-step competition mathematics.[11][26]
R1-0528 also added two tool-use benchmarks where the original R1 had not reported numbers: BFCL_v3 (Berkeley Function Calling Leaderboard, multi-turn) at 37.0% and Tau-Bench (airline 53.5%, retail 63.9%), reflecting the new function-calling capability.[26]
R1-0528 demonstrated deeper chain-of-thought reasoning than its predecessor. On AIME 2025 problems, the model averaged approximately 23,000 thinking tokens per query, compared to roughly 12,000 for the original R1, a roughly 92% increase in reasoning depth. This near-doubling, enabled by additional algorithmic optimization during post-training, contributed to the accuracy improvements.[12][26]
DeepSeek also reported that the rate of hallucinations (false or misleading outputs) was reduced by approximately 45 to 50% in scenarios such as rewriting and summarization.[12]
The update added several capabilities requested by the developer community:[12]
- System prompt support (the original R1 release recommended placing all instructions in the user message)
- Function calling and tool use
- JSON output mode
- No longer requiring a leading <think>\n formatting to trigger the thinking mode

DeepSeek also released a distilled model from R1-0528: DeepSeek-R1-0528-Qwen3-8B, which achieved state-of-the-art performance among open-source 8B models on AIME 2024 at 86.0%, surpassing the base Qwen3-8B by 10 percentage points and matching the performance of the much larger Qwen3-235B-Thinking on the same benchmark.[12][26]
DeepSeek offered R1 through its API, under the model identifier deepseek-reasoner, at prices dramatically lower than competing reasoning models.
| Model | Input (per 1M tokens, cache miss) | Input (per 1M tokens, cache hit) | Output (per 1M tokens) |
|---|---|---|---|
| DeepSeek-R1 | $0.55 | $0.14 | $2.19 |
| OpenAI o1 | $15.00 | $7.50 | $60.00 |
| OpenAI o3 (June 2025 cut) | $2.00 | $0.50 | $8.00 |
| Anthropic Claude 3.7 Sonnet (extended thinking) | $3.00 | $0.30 | $15.00 |
The pricing differential was stark: R1 was roughly 27 times cheaper than o1 for both input and output tokens. Even after OpenAI's June 2025 price cuts brought o3 down to $2/$8, R1 remained approximately 3 to 4 times cheaper. Combined with the MIT license allowing self-hosting (eliminating API costs entirely for organizations with their own compute), R1's economics were a core part of its appeal.
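DeepSeek's API is OpenAI-compatible, so the standard openai client works against its endpoint. A minimal sketch, with the endpoint and field names (notably reasoning_content for the exposed chain of thought) as documented by DeepSeek at the time of writing:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY",
                base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-reasoner",  # R1, later transparently upgraded to R1-0528
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)

message = response.choices[0].message
print(message.reasoning_content)  # the chain of thought, returned separately
print(message.content)            # the final answer
```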
Third-party API providers including Together AI, Fireworks AI, Groq, OpenRouter, Hyperbolic, and Lambda all offered hosted endpoints for R1 within days of release, often at competitive prices and sometimes with faster inference than DeepSeek's own API. Groq in particular advertised R1-Distill-Llama-70B running on its LPU hardware at over 200 tokens per second, several times faster than typical GPU-based deployments.
On September 18, 2025, the DeepSeek-R1 paper appeared on the cover of Nature (volume 645, issue 8081, pages 633 to 638), becoming the first major open-weight large language model to be the subject of a peer-reviewed Nature paper. The corresponding author was Liang Wenfeng, with 199 co-authors from DeepSeek-AI listed.[5][22]
The peer-reviewed version added information that had not appeared in the January arXiv preprint:[22]
- The compute and cost figures for the RL stage (512 Nvidia H800 GPUs for roughly 80 hours, about $294,000 at assumed rental rates)
- The acknowledgment that DeepSeek also owned Nvidia A100 GPUs and used them for smaller-scale preparatory experiments
- A statement addressing contamination and distillation concerns, acknowledging that web-scraped training data inevitably contains LLM-generated text
The peer review was unusually public for an AI paper. Nature published the reviewer comments and DeepSeek's responses alongside the article, an editorial choice that was widely welcomed in the AI research community as a step toward more rigorous publication practices for industrial AI work.[5]
DeepSeek-R1's impact extended well beyond its benchmark scores and market disruption. The model fundamentally changed several assumptions about AI development.
Before R1, the prevailing assumption was that training frontier reasoning models required resources available only to a handful of well-funded Western labs. DeepSeek showed that a combination of architectural innovation (MoE, MLA), efficient training algorithms (GRPO, FP8 mixed precision), and clever engineering could produce competitive results at dramatically lower cost. This finding had practical consequences: smaller companies and research institutions began building on R1's open weights rather than training models from scratch.[1][9]
R1 demonstrated that open-source AI models could match proprietary frontier models in at least one important capability dimension (reasoning). This intensified the debate within the AI industry about the relative merits of open and closed development approaches. Labs that had been reluctant to open-source their models faced renewed pressure to justify keeping weights proprietary, while the open-source community gained a powerful new proof point for their approach. Meta's Yann LeCun, a vocal advocate of open-source AI, repeatedly cited R1 as evidence that open weights would catch closed weights in capability.[9]
R1-Zero's spontaneous development of reasoning strategies through pure RL contributed to the ongoing scientific debate about emergent abilities in language models. The result suggested that reasoning was not something that needed to be explicitly taught through supervised learning but could arise naturally from optimization pressure on task performance. This finding influenced subsequent research across multiple labs exploring RL-based training for reasoning.[1][5]
R1's release accelerated development timelines across the industry. Several months after R1, multiple labs released improved reasoning models: OpenAI shipped o3-mini in January 2025, o3 in April 2025, and o4-mini in mid-2025; Google released Gemini 2.5 Pro with extended thinking in March 2025; Anthropic added extended thinking to Claude 3.7 Sonnet in February 2025 and Claude Opus 4 in May 2025; Alibaba released Qwen3 with native hybrid reasoning in April 2025 and the open-source QwQ-32B reasoning model. The competitive dynamic R1 created pushed the entire field forward at a faster pace than might otherwise have occurred.
The distillation recipe in particular was widely copied. Microsoft's Phi-4-Reasoning, Berkeley's Sky-T1-32B, Hugging Face's Open-R1, NVIDIA's OpenReasoning-Nemotron, and dozens of community models all used variants of R1's rejection-sampling-then-SFT recipe to bootstrap reasoning capabilities into smaller bases.
As a model developed by a Chinese company, DeepSeek-R1 faced regulatory scrutiny in multiple Western countries. Concerns centered on data privacy (DeepSeek's servers are located in China, subject to Chinese data laws), potential content alignment with Chinese government positions, and national security implications of a widely deployed Chinese AI model.[15]
The US government response to DeepSeek was swift and multi-pronged. On February 6, 2025, Representatives Josh Gottheimer and Darin LaHood introduced the bipartisan "No DeepSeek on Government Devices Act," which specifically targeted the DeepSeek mobile application and API for prohibition on federal government devices. The bill passed in August 2025, banning federal employees from using the app on government-issued devices.[15][20]
Additional legislative efforts included Representative Mark Green's China Technology Transfer Control Act and Senator Josh Hawley's "Decoupling America's Artificial Intelligence Capabilities from China Act," introduced on January 29, 2025. Gottheimer and LaHood also wrote to all 50 US governors urging them to implement similar bans at the state level.[15][20]
Multiple government agencies independently restricted or banned use of DeepSeek products:[15][27]
| Agency / government | Action | Date |
|---|---|---|
| U.S. Navy | Issued warnings against use | Late January 2025 |
| NASA | Reinforced security concerns | January 31, 2025 |
| Italy (Garante) | Ordered limitation on processing of Italian users' data; app removed from Apple/Google stores | January 30, 2025 |
| Texas | First U.S. state ban on government systems | February 2025 |
| Virginia | Banned on government systems | February 2025 |
| New York | Banned on government systems | February 2025 |
| U.S. Congress | Restricted on congressional devices | February 2025 |
| Pentagon | Restricted usage | February 2025 |
| Australia | Government agency restrictions; Home Affairs Minister Tony Burke cited national security | February 4, 2025 |
| South Korea | Government restrictions; later removed from app stores | February 2025 |
| Taiwan (MODA) | Banned on government devices and critical infrastructure | February 2, 2025 |
| India (Finance Ministry) | Restricted on government devices | February 2025 |
The Fiscal Year 2026 National Defense Authorization Act, signed in December 2025, included provisions restricting DeepSeek usage within the Department of Defense and Intelligence Community.[15]
Separately, OpenAI accused DeepSeek of improperly distilling from OpenAI models. The accusation surfaced publicly within days of R1's release, with OpenAI claiming it had "some evidence" that DeepSeek used outputs from OpenAI APIs to train R1, in violation of OpenAI's terms of service. In a February 2026 Bloomberg report, OpenAI escalated the claim in a memo to U.S. lawmakers, alleging that DeepSeek had developed methods to circumvent OpenAI's access restrictions through obfuscated third-party routers and other means. DeepSeek did not formally admit to using distillation in training R1's reasoning capabilities. The peer-reviewed Nature paper acknowledged that R1's training data, scraped from the open web, would inevitably contain text generated by other LLMs, while denying targeted distillation of OpenAI's reasoning traces.[22][28][29]
The accusation raised broader legal and ethical questions about "distillation" as a practice. Many in the open-source community pointed out that OpenAI itself had been sued for training on copyrighted material scraped from the web, leaving it in an awkward position. The episode also illustrated a structural issue in modern LLM development: as long as proprietary models are accessible via API, their outputs can be used to train competitors, and detection is technically difficult.
Content analysis documented instances of refusal or evasion related to politically sensitive topics, including the 1989 Tiananmen Square massacre, the political status of Taiwan, the treatment of Uyghurs in Xinjiang, and the comparison of Xi Jinping to other figures. DeepSeek consistently avoided substantive engagement with these subjects when accessed through chat.deepseek.com or the official API. Behavior on the open weights was more nuanced: when the same questions were posed to a self-hosted instance of R1 through bare-metal inference, the model often produced fuller answers, suggesting that some of the refusal behavior was implemented as a server-side filter rather than baked into the weights themselves.[15][20][30]
A May 2025 academic paper titled "R1dacted: Investigating Local Censorship in DeepSeek's R1 Language Model" mapped the topology of refusals across hundreds of prompts. Most Type 2 refusals (full silence rather than alignment-style explanations) clustered around Tiananmen Square, suggesting that this topic remained particularly sensitive across both the chat product and the model weights.[30]
Technical security analyses identified several concerns with DeepSeek's infrastructure. Reports cited hidden code linking to China Mobile servers in the mobile app, the collection of keystroke timing data, data storage on Chinese servers subject to Chinese government access requests, and several cybersecurity test failures, including jailbreak resistance below industry baselines. The government restrictions described above applied primarily to official use, however: commercial and individual use of R1's open weights remained unrestricted in most jurisdictions, and the model continued to be widely deployed through cloud providers and self-hosted infrastructure. Because the weights were open, security concerns about DeepSeek's servers could be mitigated by self-hosting, though the underlying questions about training data and possible weight-level alignment with Chinese government positions remained.[15][20]
R1 has well-documented limitations that practitioners learned to work around:
- Prompt sensitivity: few-shot prompting consistently degraded performance, and DeepSeek recommended zero-shot prompts that state the problem directly
- No support in the January release for system prompts, function calling, JSON output, or multi-turn conversation patterns (addressed in R1-0528)
- Language mixing when handling queries in languages other than English or Chinese
- Verbose "overthinking" on simple queries, spending thousands of reasoning tokens on trivial questions
- Final answers that sometimes omitted the expected \boxed{...} markers, requiring downstream parsing logic that handled missing or malformed markers

R1 was the first reasoning model in what became a steady release cadence from DeepSeek through 2025 and into 2026.
| Release | Date | Notes |
|---|---|---|
| DeepSeek-R1 | Jan 20, 2025 | Initial release; companion R1-Zero; six distilled variants |
| DeepSeek-R1-0528 | May 28, 2025 | Major update; deeper thinking; function calling; JSON; system prompts; R1-0528-Qwen3-8B distill |
| DeepSeek-V3.1 | Aug 19, 2025 | First hybrid model: chat and reasoning in one set of weights with a thinking-mode toggle |
| DeepSeek-V3.2-Exp | Sep 29, 2025 | Experimental release introducing DeepSeek Sparse Attention |
| DeepSeek-OCR | Oct 2025 | Vision-language OCR model |
| DeepSeek-V3.2 | Dec 1, 2025 | Production hybrid model with thinking integrated into tool use |
| DeepSeek-V3.2-Speciale | Q1 2026 | High-compute variant; gold-medal results on IMO 2025, IOI 2025, ICPC World Finals |
| DeepSeek-V4 (anticipated) | April 2026 (rumored) | Next-generation flagship; expected to fold reasoning, tool-use, and multimodal into single weights |
The V3.1 hybrid release in August 2025 effectively absorbed R1's role: a single set of weights could now serve both as a fast chat model and (with a thinking-mode toggle) as a reasoning model. V3.1's deep-thinking mode achieved roughly 90 to 95% of R1-0528's performance on reasoning benchmarks while sharing weights with a normal chat model, removing the need to load a separate reasoning model in production. By the time V3.2 launched in December 2025, R1 was no longer DeepSeek's recommended model for new applications, though it remained widely cited and deployed because of its open-source release and well-understood behavior.[31]
As of April 2026, DeepSeek-R1 and its derivatives remain among the most widely studied open-source reasoning models, even though DeepSeek's own product line has moved on to the V3.x hybrid family and the anticipated V4. R1-0528 continues to be available through the DeepSeek API at the original prices and through every major third-party inference provider. The 32B and 70B distilled models remain popular as locally hostable reasoning baselines.
The model's legacy is best measured by its influence on the field. R1 proved that reasoning-capable language models could be built openly and cheaply, that reinforcement learning could induce genuine reasoning behaviors without supervised examples, and that a small team with limited resources could compete with the largest AI labs in the world. The recipe it published, GRPO with rule-based rewards on verifiable tasks, has become the dominant approach for training reasoning models across both open-source and commercial labs. Most reasoning models released in 2025 and 2026, from Qwen's QwQ family to Microsoft's Phi-Reasoning to community-trained models on Hugging Face, used some variant of the R1 recipe.
R1 also reset expectations for what a model release should look like. The combination of a permissive license, a detailed training recipe, peer-reviewed publication, six pre-distilled variants, and aggressive API pricing became a de facto template that other open-source labs were measured against. When subsequent releases (Qwen3, Llama 4 Reasoning, Mistral Magistral) were perceived as stinting on documentation or imposing restrictive licenses, the comparison was usually to R1.