DeepSeek-R1-Distill is a family of six open-weight language models released by DeepSeek in January 2025, alongside the flagship DeepSeek-R1 reasoning model. The distilled models transfer the chain-of-thought reasoning capabilities of DeepSeek-R1 into smaller, dense architectures derived from Qwen and Llama base models. They range from 1.5 billion to 70 billion parameters and were created through supervised fine-tuning on approximately 800,000 samples curated with DeepSeek-R1, without any additional reinforcement learning stage. The family is notable for achieving reasoning benchmark scores that significantly exceed those of their base models and, in some cases, those of much larger general-purpose models, while remaining deployable on consumer hardware.
DeepSeek-R1 was published on January 20, 2025, alongside the technical paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (arXiv:2501.12948). The full model is a 671-billion parameter mixture-of-experts architecture trained with large-scale Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that teaches the model to produce extended chain-of-thought traces before giving a final answer. DeepSeek-R1 matched OpenAI o1 on several mathematics and science reasoning benchmarks, a significant result for open research since its weights and training details were disclosed publicly.
The DeepSeek-R1 paper made two contributions in parallel: it released the full 671B reasoning model, and it also released six distilled versions derived from smaller open-source base models. The distilled variants were intended to demonstrate that the reasoning patterns learned by the large RL-trained model could be transferred to much smaller architectures through straightforward supervised fine-tuning, making strong reasoning accessible without proprietary API access or high-end server hardware.
DeepSeek was founded in 2023 as an AI research lab affiliated with the quantitative hedge fund High-Flyer Capital Management, based in Hangzhou, China. Prior to the R1 release, the lab had released DeepSeek-V3, a 671-billion-parameter mixture-of-experts language model noted for its competitive performance and efficient training cost. DeepSeek-R1 built on the V3 architecture and added an RL-based reasoning training pipeline.
The full DeepSeek-R1 model requires substantial GPU memory (multiple A100s or H100s for full-precision inference), placing it outside the reach of most individual researchers and small organizations. The distilled models address this by targeting the 1.5B to 70B range that fits on consumer and workstation GPUs. At the same time, the distillation project served a research purpose: the authors of the paper wanted to test whether the reasoning behaviors that emerged from reinforcement learning on a large model could be reproduced in a small model purely through data-driven fine-tuning, without rerunning the expensive RL procedure.
This question had practical implications. Training small models with RL directly had been attempted but produced poor results compared with applying RL to large models. The paper showed that distilling RL-trained behavior into small models via supervised fine-tuning substantially outperforms training small models with RL from scratch, a finding that influenced subsequent work on open reasoning models.
The DeepSeek-R1 paper also released DeepSeek-R1-Zero, a model trained purely with GRPO on DeepSeek-V3-Base without any supervised fine-tuning phase. R1-Zero spontaneously developed reasoning behaviors including self-reflection and search, but also exhibited readability problems and language mixing. The full DeepSeek-R1 pipeline added a cold-start supervised fine-tuning phase before RL to address those issues. The distilled models do not involve RL at all; they receive only the supervised fine-tuning step, applied to pre-existing base architectures rather than DeepSeek-V3-Base.
The DeepSeek-R1-Distill models are produced through a form of knowledge distillation that operates at the output level rather than the logit level. Instead of minimizing a KL divergence between teacher and student output distributions during training, the method uses the teacher model to generate a large dataset of reasoning traces and then fine-tunes the student models on those traces using standard supervised fine-tuning (SFT). This approach is sometimes called black-box distillation or behavioral cloning from a teacher model.
The dataset used for fine-tuning contains approximately 800,000 samples. These split into two broad categories:

- Roughly 600,000 reasoning samples (mathematics, code, and logic problems), generated by DeepSeek-R1 and filtered through rejection sampling so that only traces with verified correct final answers were retained.
- Roughly 200,000 non-reasoning samples (writing, factual question answering, self-cognition, and translation), drawn in part from the supervised fine-tuning data used for DeepSeek-V3.
The training procedure for each distilled model fine-tunes the base architecture on this combined dataset for a small number of epochs. No RL stage, reward modeling, or preference optimization is applied after the SFT phase. The authors explicitly noted that they withheld the RL stage to keep the contribution focused on demonstrating distillation effectiveness, and acknowledged that adding RL on top of the distilled checkpoints would likely improve performance further.
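Because the method is plain SFT on teacher traces, the whole recipe can be summarized in a short script. The sketch below assumes HuggingFace's trl library and a hypothetical JSONL file of teacher traces in prompt/completion format; the file name, hyperparameters, and sequence length are illustrative, since DeepSeek did not publish its exact training configuration.

```python
# Black-box distillation reduces to ordinary SFT on teacher-generated traces.
# Assumptions: "r1_traces.jsonl" holds {"prompt": ..., "completion": ...}
# records where the completion contains the <think>...</think> trace plus
# the final answer. Hyperparameters are illustrative, not DeepSeek's.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="r1_traces.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Math-7B",        # student base model
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="distill-qwen-7b",
        max_seq_length=16384,            # reasoning traces are long
        num_train_epochs=2,              # "a small number of epochs"
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()                          # no RL stage follows
```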
A central empirical finding in the DeepSeek-R1 paper is that applying RL directly to a small base model (such as Qwen2.5-7B trained with GRPO) produces weaker reasoning than distilling the same base model from a large RL-trained teacher. The authors attribute this to the fact that small models have limited capacity to discover sophisticated reasoning strategies through trial-and-error reward optimization. The patterns they converge to under RL tend to be shallow. By contrast, when those models are shown thousands of long, verified reasoning chains generated by a much larger model, they can learn to mimic the structure of extended deliberation without needing to rediscover it independently.
The paper presented a direct comparison: DeepSeek-R1-Distill-Qwen-32B, produced by distilling from DeepSeek-R1 into a Qwen2.5-32B base, outperformed a version of Qwen2.5-32B trained directly with GRPO under the same compute budget. This suggested that the most efficient path to capable small reasoning models is to first develop reasoning in a large model through RL and then transfer it downward through distillation, rather than running RL at every scale.
This insight contributed to a broader shift in how the open-source community approached reasoning model development in 2025. Rather than attempting to reproduce the DeepSeek-R1 RL pipeline at smaller scale, many researchers and organizations used the distilled checkpoints as starting points for further fine-tuning on domain-specific reasoning data.
The quality and diversity of the reasoning traces used for distillation materially affect the resulting model. Traces were selected through rejection sampling: DeepSeek-R1 generated multiple candidate solutions for each problem, and only those producing verified correct final answers were retained for the training dataset. This filtering step ensures that the student model learns from correct reasoning chains rather than from plausible-sounding but incorrect ones.
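A sketch of that filter is below, where `generate` stands in for a call to the teacher model and `extract_answer` for an answer-parsing routine; both are hypothetical placeholders, not functions from the DeepSeek release.

```python
# Rejection sampling: draw several candidate solutions per problem from the
# teacher, keep only traces whose final answer matches a verified reference.
def build_distillation_set(problems, generate, extract_answer, n_candidates=16):
    kept = []
    for problem in problems:
        for _ in range(n_candidates):
            trace = generate(problem["question"], temperature=0.7)
            if extract_answer(trace) == problem["reference_answer"]:
                kept.append({"prompt": problem["question"], "completion": trace})
    return kept
```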
Subsequent research from the community found that the characteristics of the training traces matter beyond correctness. A 2025 study showed that using more difficult problems, or generating traces from a teacher with more adaptive and diverse reasoning patterns, could produce student models that outperform the standard DeepSeek-R1-Distill checkpoints on hard mathematics benchmarks. This suggests the 800K dataset is sufficient but not necessarily optimal, and that re-distillation with better-curated data is a viable path to improving on the released models.
DeepSeek released six distilled checkpoints, spanning two base model families and four parameter scales in the Qwen line.
| Model | Base model | Parameters | HuggingFace |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1.5B | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 8B | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B |
The two smallest models (1.5B and 7B) use Qwen2.5-Math as their base rather than the general-purpose Qwen2.5 series. Qwen2.5-Math is a variant of the Qwen2.5 architecture pre-trained by Alibaba with heavy emphasis on mathematical data, giving it a stronger prior for reasoning distillation at small parameter counts. The 14B and 32B models use general-purpose Qwen2.5 bases, which have broader training across text types.
For the Llama family, the 8B model uses Llama-3.1-8B-Base. The 70B model uses Llama-3.3-70B-Instruct rather than Llama-3.1-70B-Base. The paper noted this choice was made because the 3.3 instruction-tuned model showed somewhat stronger baseline reasoning capability than its predecessor. Using an instruction-tuned base for distillation is unconventional, but the SFT training on reasoning traces overrides most of the instruction-following behavior with the chain-of-thought format.
All six models are standard dense transformer architectures, inheriting the layer counts, attention head configurations, and embedding dimensions of their respective bases. They do not use mixture-of-experts routing, which distinguishes them from the full DeepSeek-R1 model. The context length for generation is set to 32,768 tokens across all variants. The models use a chat template that wraps the chain-of-thought reasoning in `<think>` and `</think>` tags, with the final answer appearing after the closing tag.
The recommended sampling temperature is 0.6. The models are sensitive to this setting: values below 0.5 can cause reasoning traces to loop or collapse, while values above 0.7 may produce incoherent outputs on hard problems. The January 2025 models are designed to be used without a system prompt, since system prompt support in that release was limited.
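A minimal inference sketch with these settings, using HuggingFace transformers; the prompt and generation budget are illustrative.

```python
# Run a distilled checkpoint with the recommended settings: temperature 0.6,
# no system prompt. The chat template bundled with the checkpoint handles
# the <think> scaffolding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is 17^2 - 13^2?"}]  # no system prompt
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    inputs,
    max_new_tokens=8192,  # leave room for a long reasoning trace
    do_sample=True,
    temperature=0.6,      # recommended; lower risks loops, higher risks noise
    top_p=0.95,
)
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```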
The table below shows the performance of all six distilled models on the main benchmarks reported in the DeepSeek-R1 paper, using pass@1 (single-sample accuracy) unless otherwise noted.
| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 | GPQA Diamond | LiveCodeBench | CodeForces rating |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9% | 52.7% | 83.9% | 33.8% | 16.9% | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5% | 83.3% | 92.8% | 49.1% | 37.6% | 1,189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7% | 80.0% | 93.9% | 59.1% | 53.1% | 1,481 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6% | 83.3% | 94.3% | 62.1% | 57.2% | 1,691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4% | 80.0% | 89.1% | 49.0% | 39.6% | 1,205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0% | 86.7% | 94.5% | 65.2% | 57.5% | 1,633 |
Benchmark descriptions:

- AIME 2024: problems from the American Invitational Mathematics Examination, a high-school mathematics competition; pass@1 is single-sample accuracy, while cons@64 is majority-vote (self-consistency) accuracy over 64 samples.
- MATH-500: a 500-problem subset of the MATH competition mathematics dataset.
- GPQA Diamond: graduate-level multiple-choice science questions written to be resistant to web search.
- LiveCodeBench: code-generation problems drawn from recent programming contests to limit training-data contamination.
- CodeForces rating: an Elo-style rating estimated from performance on Codeforces competitive programming problems.
The cons@64 metric is informative because it shows how much performance can be recovered through majority voting over repeated samples. The 1.5B model improves from 28.9% to 52.7% with 64 samples, a large gain that reflects the model's ability to reach the correct answer on some fraction of attempts even when it does not do so consistently.
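The metric itself is straightforward to compute; a sketch is below, with `solve_once` as a hypothetical single-sample call that returns the model's final answer for one attempt.

```python
# cons@k: sample k answers per problem and score the majority answer.
from collections import Counter

def cons_at_k(problems, solve_once, k=64):
    correct = 0
    for problem in problems:
        answers = [solve_once(problem["question"]) for _ in range(k)]
        majority, _ = Counter(answers).most_common(1)[0]
        correct += majority == problem["reference_answer"]
    return correct / len(problems)
```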
The paper compared the distilled models against several external reference points at the time of the January 2025 publication:
| Model | AIME 2024 | MATH-500 | GPQA Diamond |
|---|---|---|---|
| OpenAI o1-mini | 63.6% | 90.0% | 60.0% |
| DeepSeek-R1-Distill-Qwen-32B | 72.6% | 94.3% | 62.1% |
| DeepSeek-R1-Distill-Llama-70B | 70.0% | 94.5% | 65.2% |
| QwQ-32B-Preview | 50.0% | 90.6% | 54.5% |
DeepSeek-R1-Distill-Qwen-32B exceeded OpenAI o1-mini on all three benchmarks despite being a smaller, open-weight dense model. The 14B model surpassed QwQ-32B-Preview (a 32B model with its own chain-of-thought training released by Alibaba in late 2024) on all reported metrics, demonstrating that distillation from a stronger teacher can compensate for a large parameter count gap.
One of the clearest illustrations of what the distillation process accomplishes is comparing the distilled models directly to their base architectures before fine-tuning. The base models scored considerably lower on competition mathematics tasks because they lack explicit chain-of-thought training.
| Model | AIME 2024 pass@1 | MATH-500 | Notes |
|---|---|---|---|
| Qwen2.5-Math-7B (base) | ~16-18% | ~70-75% | No chain-of-thought training |
| DeepSeek-R1-Distill-Qwen-7B | 55.5% | 92.8% | Roughly +38pp AIME, +20pp MATH |
| Qwen2.5-32B (base) | ~35-40% | ~83-85% | No chain-of-thought training |
| DeepSeek-R1-Distill-Qwen-32B | 72.6% | 94.3% | Roughly +33pp AIME, +10pp MATH |
| Llama-3.1-8B (base) | <10% | ~50-55% | No chain-of-thought training |
| DeepSeek-R1-Distill-Llama-8B | 50.4% | 89.1% | Very large gains from distillation |
The gains on AIME are particularly large because competition mathematics problems require multi-step planning and self-correction that base models without chain-of-thought training rarely exhibit. The distillation process teaches the model to generate a reasoning trace before committing to an answer, which allows it to catch errors and backtrack. This behavior is not spontaneous in the base models; it has to be instilled through training on traces that demonstrate it.
The Llama-3.1-8B base shows the most dramatic improvement in absolute terms. Its starting AIME score is very low because the Llama-3.1 base was trained primarily as a general-purpose language model with no particular emphasis on mathematical reasoning. The distillation adds over 40 percentage points of AIME accuracy by teaching it to reason step by step.
On May 28, 2025, DeepSeek released an updated version of its reasoning model called DeepSeek-R1-0528, along with a new distilled variant called DeepSeek-R1-0528-Qwen3-8B. This model followed the same distillation paradigm as the January 2025 family but used Qwen3-8B-Base rather than Qwen2.5-Math-7B as the starting point, reflecting the release of Alibaba's Qwen3 model series in April 2025.
DeepSeek-R1-0528 itself represented a meaningful improvement over the original DeepSeek-R1. On AIME 2025, the full R1-0528 model scored 87.5%, up from 70.0% for the original R1. The increased capability of the teacher model flowed through to the distilled variant.
The Qwen3-8B distilled model was produced by post-training Qwen3-8B-Base on chain-of-thought traces generated by DeepSeek-R1-0528. The methodology matched the January 2025 approach: supervised fine-tuning on teacher-generated reasoning traces, without an additional RL stage. The model shares the same tokenizer configuration as DeepSeek-R1-0528. Its architecture is identical to Qwen3-8B, with 8.19 billion parameters and BF16 weights.
DeepSeek-R1-0528-Qwen3-8B achieved state-of-the-art performance among open-source models in its size class at the time of release.
| Benchmark | R1-0528-Qwen3-8B | Qwen3-8B | Qwen3-235B-A22B | o3-mini (medium) |
|---|---|---|---|---|
| AIME 2024 | 86.0% | 76.0% | 85.7% | 79.6% |
| AIME 2025 | 76.3% | 67.3% | 81.5% | 76.7% |
| HMMT Feb 2025 | 61.5% | n/a | 62.5% | 53.3% |
| GPQA Diamond | 61.1% | 62.0% | 71.1% | 76.8% |
| LiveCodeBench | 60.5% | n/a | 66.5% | 62.3% |
The model outperforms the base Qwen3-8B by 10 percentage points on AIME 2024 and matches the 235B mixture-of-experts Qwen3-235B-A22B on that benchmark despite having only 8 billion parameters. On HMMT February 2025, a harder competition mathematics test than AIME, the 8B distilled model (61.5%) comes within a point of Qwen3-235B-A22B (62.5%) and substantially exceeds o3-mini medium (53.3%). These results illustrate how much reasoning capability can be transferred through distillation when the teacher model is strong.
The May 2025 model introduced several practical usability improvements compared to the original January family:

- System prompt support: the January 2025 distills were designed to run without a system prompt, whereas the Qwen3-8B distill supports one.
- No forced thinking prefix: the January models required appending `<think>\n` to the prompt to trigger chain-of-thought mode. The Qwen3-8B distill activates reasoning through normal conversation formatting without a forced prefix.

The recommended sampling temperature remains 0.6. The model is released under the MIT license.
One measurable shift from the January 2025 distillation to the May 2025 round is reasoning depth. On difficult mathematics problems, the Qwen3-8B distilled model uses an average of around 23,000 tokens of internal reasoning before producing its final answer, compared to roughly 12,000 tokens for the January distilled models on similar problems. This near-doubling of thinking depth corresponds to improvements in accuracy on multi-step problems, and reflects the stronger reasoning behavior in the R1-0528 teacher, which itself benefits from improved RL training and more computational resources applied during post-training.
The amount of GPU memory needed depends on the model size and the numerical precision used for inference. The table below shows approximate requirements for full-precision (BF16) and 4-bit quantized weights.
| Model | BF16 VRAM | Q4 VRAM (approx.) | Practical consumer GPU |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | ~3 GB | ~1.5 GB | Any modern GPU or CPU |
| R1-Distill-Qwen-7B | ~14 GB | ~4-5 GB | RTX 3060 (12 GB) at Q4 |
| R1-Distill-Llama-8B | ~16 GB | ~5-6 GB | RTX 3060 (12 GB) at Q4 |
| R1-Distill-Qwen-14B | ~28 GB | ~8-10 GB | RTX 3090 or 4090 at Q4 |
| R1-Distill-Qwen-32B | ~66 GB | ~18-20 GB | RTX 4090 (24 GB) at Q4 |
| R1-Distill-Llama-70B | ~140 GB | ~40 GB | Multi-GPU or Mac Studio 192 GB |
For CPU-only inference, the models can run on systems with 48 GB or more of RAM at reduced throughput (typically under 2 tokens per second for the 14B and larger variants on current consumer hardware).
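The table's figures follow from a back-of-envelope rule: weight memory is roughly parameter count times bytes per parameter, and real usage adds KV cache and activation overhead on top. A small sketch of that arithmetic:

```python
# Approximate weight-only VRAM; real usage adds KV cache and activation
# overhead, which is why the table's figures run somewhat higher.
def weight_vram_gib(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * (bits_per_param / 8) / 2**30

print(f"{weight_vram_gib(32, 16):.0f} GiB")  # BF16 32B -> ~60 GiB of weights
print(f"{weight_vram_gib(32, 5):.0f} GiB")   # Q4_K_M-style quants average ~5 bits/param -> ~19 GiB
```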
The Unsloth team released GGUF-format quantized versions of the distilled models shortly after the January 2025 release, including `Q4_K_M`, `Q6_K`, and `Q8_0` variants. GGUF is the standard format used by llama.cpp and Ollama, making the models accessible on both NVIDIA GPUs (via CUDA) and Apple Silicon (via Metal). Ollama added the distilled models to its library under the `deepseek-r1` tag with size suffixes. Running `ollama run deepseek-r1:7b` downloads and runs the Qwen-7B distill, and `ollama run deepseek-r1:8b` fetches the Llama-8B variant.
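For programmatic access to those same tags, the official Ollama Python client can be used; a sketch, assuming the `ollama` package is installed and a local daemon is running with the model pulled:

```python
# Query the Qwen-7B distill through a local Ollama daemon (pip install ollama).
import ollama

response = ollama.chat(
    model="deepseek-r1:7b",  # the Qwen-7B distill in Ollama's library
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)
print(response["message"]["content"])  # output includes the <think> trace
```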
For server-side deployment in production, the models are compatible with vLLM and SGLang. A vLLM launch for the 32B model typically uses tensor parallelism across two GPUs with the `--max-model-len 32768` and `--enforce-eager` flags. SGLang achieves lower latency on batch inference workloads through its RadixAttention cache management.
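A sketch of the equivalent offline deployment through vLLM's Python API, mirroring those flags (recent vLLM versions expose a `chat()` helper; the model choice and sampling budget are illustrative):

```python
# Two-way tensor-parallel deployment of the 32B distill with a 32K context.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    tensor_parallel_size=2,  # split weights across two GPUs
    max_model_len=32768,     # full 32K context window
    enforce_eager=True,      # equivalent of --enforce-eager
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=8192)
outputs = llm.chat(
    [{"role": "user", "content": "Prove that sqrt(2) is irrational."}], params
)
print(outputs[0].outputs[0].text)
```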
Unsloth also released bitsandbytes-quantized versions of the distilled models (4-bit NF4 format) for users who prefer to run fine-tuning and inference within a Python environment without converting to GGUF.
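A sketch of loading one of the checkpoints in 4-bit NF4 directly through transformers and bitsandbytes (a CUDA GPU is required; the model choice is illustrative):

```python
# Load a distill in 4-bit NF4 without converting to GGUF.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 format mentioned above
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16 for stability
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    quantization_config=bnb_config,
    device_map="auto",
)
```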
Apple Silicon Macs with unified memory are well-suited for the smaller distilled models because the CPU, GPU, and neural engine share the same memory pool, eliminating PCIe bandwidth bottlenecks. The 7B and 8B models run comfortably on M2 and M3 MacBook Pro configurations with 16 GB of unified memory at Q4 quantization, producing interactive-speed output. The 14B and 32B variants require Mac Studio or Mac Pro configurations with at least 64 GB of unified memory for reasonable throughput.
The licensing of the DeepSeek-R1-Distill models depends on which variant is used, since each inherits the license of its base model architecture.
| Model group | License |
|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B, 7B, 14B, 32B | Apache 2.0 (from Qwen2.5 base) |
| DeepSeek-R1-Distill-Llama-8B | Meta Llama 3.1 Community License |
| DeepSeek-R1-Distill-Llama-70B | Meta Llama 3.3 Community License |
| DeepSeek-R1-0528-Qwen3-8B | MIT License |
The DeepSeek-R1 model card and the GitHub repository state that the model series supports commercial use and allows derivative works including further distillation for training other language models. The Qwen variants, governed by Apache 2.0, are broadly permissive for commercial and research use. The Llama variants require compliance with Meta's community license agreements, which permit commercial use for organizations below a certain user count threshold.
The R1-0528-Qwen3-8B model, released under MIT, is the most permissive of the family and places essentially no restrictions on use or redistribution.
All model weights are publicly available on HuggingFace under the `deepseek-ai` organization page.
The primary intended use case for the distilled models is reasoning-heavy tasks: mathematics competition problems, physics and chemistry calculations, and multi-step logical deduction. The AIME and GPQA benchmark scores show that even the 7B and 8B variants substantially exceed what general-purpose models of similar size achieve on these tasks. Researchers and developers who need strong quantitative reasoning without server infrastructure have adopted the 7B, 8B, and 14B models as a practical alternative to calling large-model APIs.
The chain-of-thought format also makes the reasoning process inspectable. Users can read the model's trace to verify that it approached a problem correctly or to identify where it made an error, which is useful in educational and research workflows where the derivation matters as much as the answer.
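Separating the trace from the answer is a simple parsing step; a sketch, assuming the single `<think>...</think>` span the models emit:

```python
# Split a raw completion into (reasoning trace, final answer).
import re

def split_trace(raw_output: str):
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    if match is None:
        return None, raw_output.strip()  # no trace found; treat all as answer
    return match.group(1).strip(), raw_output[match.end():].strip()
```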
Because the distilled models already encode extended chain-of-thought behavior, they serve as effective starting points for further domain-specific fine-tuning, and the community has produced a number of derivative checkpoints.
Unsloth and Alibaba Cloud both provide tooling for LoRA and full fine-tuning of the distilled checkpoints. Alibaba's platform specifically documented a one-click fine-tuning workflow for all six January 2025 distilled models.
Organizations and individuals who cannot or do not want to send data to external APIs use the distilled models for on-premise or local inference. Healthcare providers working with patient records, law firms processing confidential documents, and government agencies with data residency requirements all represent sectors where sending data to a cloud API is impractical or prohibited. The 1.5B model runs on any modern GPU or CPU with modest memory requirements, making it accessible for edge devices and embedded applications. The 7B and 14B models cover the range where a single consumer GPU can handle interactive inference speeds for most reasoning tasks.
In cloud deployments where inference cost per token is a constraint, the distilled models allow operators to run reasoning tasks at a fraction of the cost of the full 671B model. The 32B and 70B distilled models capture a large share of the reasoning performance of DeepSeek-R1 on standard benchmarks while requiring far fewer GPU-hours per call. This cost profile has made them attractive for applications where reasoning is needed at scale, such as automated code review, large-scale document analysis, or educational platforms generating mathematics feedback.
The distilled models drew large download numbers on HuggingFace in the weeks following the January 2025 release. The 7B and 14B models were particularly widely downloaded because they fit within the hardware available to most developers. By May 2025, the DeepSeek-R1-0528-Qwen3-8B model was receiving over 258,000 downloads per month on HuggingFace. Ollama reported that the deepseek-r1 tag, which covers the distilled family, became one of the most-pulled model families in its library during February and March 2025.
Several API providers added distilled DeepSeek-R1 models to their hosted offerings. OpenRouter listed multiple size variants with per-token pricing. Fireworks.ai and Together AI hosted the 7B, 14B, and 70B variants. SiliconFlow, a Chinese API provider, listed the 14B model on its platform. These offerings gave developers access to the distilled models without running local infrastructure.
LM Studio, an application for running language models on consumer hardware, added support for the GGUF-quantized distilled models through its model browser. Mozilla AI released a llamafile-packaged version of the 14B model under the identifier `mozilla-ai/DeepSeek-R1-Distill-Qwen-14B-llamafile`, which packages the weights and inference runtime into a single executable file that runs on multiple operating systems without a separate installation step.
Inference backends including vLLM, SGLang, and Ollama all added explicit support for the distilled models, including configuration recommendations in their documentation.
Academic papers published in the first half of 2025 used the distilled models as baselines and fine-tuning starting points across areas including biomedical text processing, financial analytics, and mathematical problem generation. The Dropbox engineering team published a technical blog post on re-distillation experiments, finding that using more challenging reasoning traces as training data produced student models that outperformed the standard DeepSeek-R1-Distill checkpoints on hard mathematics benchmarks. This work contributed to understanding of how the quality of the teacher's traces influences student model capability.
A research group published findings (arXiv:2505.13792) examining the disconnect between trace interpretability and student model outcomes in trace-based knowledge distillation, using the DeepSeek-R1-Distill family as a test case. The paper found that even when traces are interpretable to humans, students can learn unexpected generalizations that diverge from the surface reasoning in the traces.
The distilled models can exhibit language mixing in their internal reasoning traces, particularly when prompted in a language other than Chinese or English. The model may begin a reasoning trace in English and switch to Chinese mid-trace, or mix Chinese mathematical terminology with English variable names. This was an observed failure mode in the DeepSeek-R1-Zero training run and persists to a lesser degree in the distilled models. The problem typically appears in the `<think>` section rather than in the final answer, and does not usually affect answer correctness, but it is distracting and can impede debugging of reasoning chains.
The maximum generation length for the January 2025 distilled models is 32,768 tokens. This includes both the chain-of-thought reasoning trace and the final answer. For difficult problems that require very long reasoning traces, the model may run out of generation budget before completing the solution. The 32K limit is sufficient for most standard benchmark problems but can be a constraint for novel hard problems or long-form analytical tasks.
The smaller distilled models (1.5B and 7B) show reduced performance on open-ended writing, creative tasks, and conversational instruction following compared to general-purpose models of similar parameter count. The 800K training dataset is concentrated on reasoning and question answering, so capabilities that depend on exposure to diverse text types are less developed. Users who need a balance between reasoning and general language quality typically find the 14B or 32B variants more suitable.
The authors noted explicitly that the distilled models were released without an RL stage on top of the SFT. Community experiments have confirmed that applying RL post-training to the distilled checkpoints improves reasoning accuracy on hard benchmarks, sometimes by several percentage points on AIME. The released models represent the SFT-only checkpoint, not the theoretical ceiling achievable with additional training.
Early deployment reports from the developer community noted that the January 2025 distilled variants were less reliable than the full DeepSeek-R1 model on tool calling and structured JSON output tasks. The internal reasoning mode and function-calling modes were described as partially conflicting in some inference server configurations. This limitation was less pronounced in the May 2025 Qwen3-8B distill, which introduced better separation of reasoning and structured-output behavior.
The distilled models inherit refusal behaviors from their training that reflect content moderation policies applied during the DeepSeek-R1 training process. Independent evaluations have observed refusal patterns on topics that are politically sensitive in China. This has no effect on the models' mathematics or science reasoning performance but is a practical consideration for organizations deploying the models in open-domain conversational settings where a broad range of user queries is expected.