DeepSeek-R1-Distill is a family of six open-weight language models released by DeepSeek in January 2025, alongside the flagship DeepSeek-R1 reasoning model. The distilled models transfer the chain-of-thought reasoning capabilities of DeepSeek-R1 into smaller, dense architectures derived from Qwen and Llama base models. They range from 1.5 billion to 70 billion parameters and were created through supervised fine-tuning on approximately 800,000 samples curated with DeepSeek-R1, without any additional reinforcement learning stage. The family is notable for achieving reasoning benchmark scores that significantly exceed those of their base models and, in some cases, those of much larger general-purpose models, while remaining deployable on consumer hardware.
DeepSeek-R1 was published on January 20, 2025, alongside the technical paper "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning" (arXiv:2501.12948). The full model is a 671-billion parameter mixture-of-experts architecture trained with large-scale Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that teaches the model to produce extended chain-of-thought traces before giving a final answer. DeepSeek-R1 matched OpenAI o1 on several mathematics and science reasoning benchmarks, a significant result for open research since its weights and training details were disclosed publicly.
The DeepSeek-R1 paper made two contributions in parallel: it released the full 671B reasoning model, and it also released six distilled versions derived from smaller open-source base models. The distilled variants were intended to demonstrate that the reasoning patterns learned by the large RL-trained model could be transferred to much smaller architectures through straightforward supervised fine-tuning, making strong reasoning accessible without proprietary API access or high-end server hardware.
DeepSeek was founded in 2023 as an AI research lab affiliated with the quantitative hedge fund High-Flyer Capital Management, based in Hangzhou, China. Prior to the R1 release, the lab had released DeepSeek-V3, a 671-billion-parameter mixture-of-experts language model noted for its competitive performance and efficient training cost. DeepSeek-R1 built on the V3 architecture and added an RL-based reasoning training pipeline.
The full DeepSeek-R1 model requires substantial GPU memory (multiple A100s or H100s for full-precision inference), placing it outside the reach of most individual researchers and small organizations. The distilled models address this by targeting the 1.5B to 70B range that fits on consumer and workstation GPUs. At the same time, the distillation project served a research purpose: the authors of the paper wanted to test whether the reasoning behaviors that emerged from reinforcement learning on a large model could be reproduced in a small model purely through data-driven fine-tuning, without rerunning the expensive RL procedure.
This question had practical implications. Training small models with RL directly had been attempted but produced poor results compared with applying RL to large models. The paper showed that distilling RL-trained behavior into small models via supervised fine-tuning substantially outperforms training small models with RL from scratch, a finding that influenced subsequent work on open reasoning models.
The DeepSeek-R1 paper also released DeepSeek-R1-Zero, a model trained purely with GRPO on DeepSeek-V3-Base without any supervised fine-tuning phase. R1-Zero spontaneously developed reasoning behaviors including self-reflection and search, but also exhibited readability problems and language mixing. The full DeepSeek-R1 pipeline added a cold-start supervised fine-tuning phase before RL to address those issues. The distilled models do not involve RL at all; they receive only the supervised fine-tuning step, applied to pre-existing base architectures rather than DeepSeek-V3-Base.
The DeepSeek-R1-Distill models are produced through a form of knowledge distillation that operates at the output level rather than the logit level. Instead of minimizing a KL divergence between teacher and student output distributions during training, the method uses the teacher model to generate a large dataset of reasoning traces and then fine-tunes the student models on those traces using standard supervised fine-tuning (SFT). This approach is sometimes called black-box distillation or behavioral cloning from a teacher model.
The dataset used for fine-tuning contains approximately 800,000 samples. These split into two broad categories:

- Roughly 600,000 reasoning samples (mathematics, code, and logic problems), generated by DeepSeek-R1 and filtered through rejection sampling so that only traces with verified correct final answers were retained.
- Roughly 200,000 non-reasoning samples (writing, factual question answering, self-cognition, and translation), drawn in part from the supervised fine-tuning data used for DeepSeek-V3.
The training procedure for each distilled model fine-tunes the base architecture on this combined dataset for a small number of epochs. No RL stage, reward modeling, or preference optimization is applied after the SFT phase. The authors explicitly noted that they withheld the RL stage to keep the contribution focused on demonstrating distillation effectiveness, and acknowledged that adding RL on top of the distilled checkpoints would likely improve performance further.
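Because the method is plain SFT on teacher traces, the whole recipe can be summarized in a short script. The sketch below assumes HuggingFace's trl library and a hypothetical JSONL file of teacher traces in prompt/completion format; the file name, hyperparameters, and sequence length are illustrative, since DeepSeek did not publish its exact training configuration.

```python
# Black-box distillation reduces to ordinary SFT on teacher-generated traces.
# Assumptions: "r1_traces.jsonl" holds {"prompt": ..., "completion": ...}
# records where the completion contains the <think>...</think> trace plus
# the final answer. Hyperparameters are illustrative, not DeepSeek's.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="r1_traces.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Math-7B",        # student base model
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="distill-qwen-7b",
        max_seq_length=16384,            # reasoning traces are long
        num_train_epochs=2,              # "a small number of epochs"
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()                          # no RL stage follows
```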
A central empirical finding in the DeepSeek-R1 paper is that applying RL directly to a small base model (such as Qwen2.5-7B trained with GRPO) produces weaker reasoning than distilling the same base model from a large RL-trained teacher. The authors attribute this to the fact that small models have limited capacity to discover sophisticated reasoning strategies through trial-and-error reward optimization. The patterns they converge to under RL tend to be shallow. By contrast, when those models are shown thousands of long, verified reasoning chains generated by a much larger model, they can learn to mimic the structure of extended deliberation without needing to rediscover it independently.
The paper presented a direct comparison: DeepSeek-R1-Distill-Qwen-32B, produced by distilling from DeepSeek-R1 into a Qwen2.5-32B base, outperformed a version of Qwen2.5-32B trained directly with GRPO under the same compute budget. This suggested that the most efficient path to capable small reasoning models is to first develop reasoning in a large model through RL and then transfer it downward through distillation, rather than running RL at every scale.
This insight contributed to a broader shift in how the open-source community approached reasoning model development in 2025. Rather than attempting to reproduce the DeepSeek-R1 RL pipeline at smaller scale, many researchers and organizations used the distilled checkpoints as starting points for further fine-tuning on domain-specific reasoning data.
The quality and diversity of the reasoning traces used for distillation materially affect the resulting model. Traces were selected through rejection sampling: DeepSeek-R1 generated multiple candidate solutions for each problem, and only those producing verified correct final answers were retained for the training dataset. This filtering step ensures that the student model learns from correct reasoning chains rather than from plausible-sounding but incorrect ones.
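A sketch of that filter is below, where `generate` stands in for a call to the teacher model and `extract_answer` for an answer-parsing routine; both are hypothetical placeholders, not functions from the DeepSeek release.

```python
# Rejection sampling: draw several candidate solutions per problem from the
# teacher, keep only traces whose final answer matches a verified reference.
def build_distillation_set(problems, generate, extract_answer, n_candidates=16):
    kept = []
    for problem in problems:
        for _ in range(n_candidates):
            trace = generate(problem["question"], temperature=0.7)
            if extract_answer(trace) == problem["reference_answer"]:
                kept.append({"prompt": problem["question"], "completion": trace})
    return kept
```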
Subsequent research from the community found that the characteristics of the training traces matter beyond correctness. A 2025 study showed that using more difficult problems, or generating traces from a teacher with more adaptive and diverse reasoning patterns, could produce student models that outperform the standard DeepSeek-R1-Distill checkpoints on hard mathematics benchmarks. This suggests the 800K dataset is sufficient but not necessarily optimal, and that re-distillation with better-curated data is a viable path to improving on the released models.
DeepSeek released six distilled checkpoints, spanning two base model families and four parameter scales in the Qwen line.
| Model | Base model | Parameters | HuggingFace |
|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | Qwen2.5-Math-1.5B | 1.5B | deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B |
| DeepSeek-R1-Distill-Qwen-7B | Qwen2.5-Math-7B | 7B | deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
| DeepSeek-R1-Distill-Qwen-14B | Qwen2.5-14B | 14B | deepseek-ai/DeepSeek-R1-Distill-Qwen-14B |
| DeepSeek-R1-Distill-Qwen-32B | Qwen2.5-32B | 32B | deepseek-ai/DeepSeek-R1-Distill-Qwen-32B |
| DeepSeek-R1-Distill-Llama-8B | Llama-3.1-8B | 8B | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
| DeepSeek-R1-Distill-Llama-70B | Llama-3.3-70B-Instruct | 70B | deepseek-ai/DeepSeek-R1-Distill-Llama-70B |
The two smallest models (1.5B and 7B) use Qwen2.5-Math as their base rather than the general-purpose Qwen2.5 series. Qwen2.5-Math is a variant of the Qwen2.5 architecture pre-trained by Alibaba with heavy emphasis on mathematical data, giving it a stronger prior for reasoning distillation at small parameter counts. The 14B and 32B models use general-purpose Qwen2.5 bases, which have broader training across text types.
For the Llama family, the 8B model uses Llama-3.1-8B-Base. The 70B model uses Llama-3.3-70B-Instruct rather than Llama-3.1-70B-Base. The paper noted this choice was made because the 3.3 instruction-tuned model showed somewhat stronger baseline reasoning capability than its predecessor. Using an instruction-tuned base for distillation is unconventional, but the SFT training on reasoning traces overrides most of the instruction-following behavior with the chain-of-thought format.
All six models are standard dense transformer architectures, inheriting the layer counts, attention head configurations, and embedding dimensions of their respective bases. They do not use mixture-of-experts routing, which distinguishes them from the full DeepSeek-R1 model. The context length for generation is set to 32,768 tokens across all variants. The models use a chat template that wraps the chain-of-thought reasoning in `<think>` and `</think>` tags, with the final answer appearing after the closing tag.
The recommended sampling temperature is 0.6. The models are sensitive to this setting: values below 0.5 can cause reasoning traces to loop or collapse, while values above 0.7 may produce incoherent outputs on hard problems. The January 2025 models are designed to be used without a system prompt, since system prompt support in that release was limited.
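A minimal inference sketch with these settings, using HuggingFace transformers; the prompt and generation budget are illustrative.

```python
# Run a distilled checkpoint with the recommended settings: temperature 0.6,
# no system prompt. The chat template bundled with the checkpoint handles
# the <think> scaffolding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is 17^2 - 13^2?"}]  # no system prompt
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    inputs,
    max_new_tokens=8192,  # leave room for a long reasoning trace
    do_sample=True,
    temperature=0.6,      # recommended; lower risks loops, higher risks noise
    top_p=0.95,
)
print(tokenizer.decode(output[0][inputs.shape[1]:], skip_special_tokens=True))
```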
The table below shows the performance of all six distilled models on the main benchmarks reported in the DeepSeek-R1 paper, using pass@1 (single-sample accuracy) unless otherwise noted.
| Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 | GPQA Diamond | LiveCodeBench | CodeForces rating |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 28.9% | 52.7% | 83.9% | 33.8% | 16.9% | 954 |
| DeepSeek-R1-Distill-Qwen-7B | 55.5% | 83.3% | 92.8% | 49.1% | 37.6% | 1,189 |
| DeepSeek-R1-Distill-Qwen-14B | 69.7% | 80.0% | 93.9% | 59.1% | 53.1% | 1,481 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6% | 83.3% | 94.3% | 62.1% | 57.2% | 1,691 |
| DeepSeek-R1-Distill-Llama-8B | 50.4% | 80.0% | 89.1% | 49.0% | 39.6% | 1,205 |
| DeepSeek-R1-Distill-Llama-70B | 70.0% | 86.7% | 94.5% | 65.2% | 57.5% | 1,633 |
Benchmark descriptions:

- AIME 2024: problems from the American Invitational Mathematics Examination, a high-school mathematics competition; pass@1 is single-sample accuracy, while cons@64 is majority-vote (self-consistency) accuracy over 64 samples.
- MATH-500: a 500-problem subset of the MATH competition mathematics dataset.
- GPQA Diamond: graduate-level multiple-choice science questions written to be resistant to web search.
- LiveCodeBench: code-generation problems drawn from recent programming contests to limit training-data contamination.
- CodeForces rating: an Elo-style rating estimated from performance on Codeforces competitive programming problems.
The cons@64 metric is informative because it shows how much performance can be recovered through majority voting over repeated samples. The 1.5B model improves from 28.9% to 52.7% with 64 samples, a large gain that reflects the model's ability to reach the correct answer on some fraction of attempts even when it does not do so consistently.
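The metric itself is straightforward to compute; a sketch is below, with `solve_once` as a hypothetical single-sample call that returns the model's final answer for one attempt.

```python
# cons@k: sample k answers per problem and score the majority answer.
from collections import Counter

def cons_at_k(problems, solve_once, k=64):
    correct = 0
    for problem in problems:
        answers = [solve_once(problem["question"]) for _ in range(k)]
        majority, _ = Counter(answers).most_common(1)[0]
        correct += majority == problem["reference_answer"]
    return correct / len(problems)
```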
The paper compared the distilled models against several external reference points at the time of the January 2025 publication:
| Model | AIME 2024 | MATH-500 | GPQA Diamond |
|---|---|---|---|
| OpenAI o1-mini | 63.6% | 90.0% | 60.0% |
| DeepSeek-R1-Distill-Qwen-32B | 72.6% | 94.3% | 62.1% |
| DeepSeek-R1-Distill-Llama-70B | 70.0% | 94.5% | 65.2% |
| QwQ-32B-Preview | 50.0% | 90.6% | 54.5% |
DeepSeek-R1-Distill-Qwen-32B exceeded OpenAI o1-mini on all three benchmarks despite being a smaller, open-weight dense model. The 14B model surpassed QwQ-32B-Preview (a 32B model with its own chain-of-thought training released by Alibaba in late 2024) on all reported metrics, demonstrating that distillation from a stronger teacher can compensate for a large parameter count gap.
One of the clearest illustrations of what the distillation process accomplishes is comparing the distilled models directly to their base architectures before fine-tuning. The base models scored considerably lower on competition mathematics tasks because they lack explicit chain-of-thought training.
| Model | AIME 2024 pass@1 | MATH-500 | Notes |
|---|---|---|---|
| Qwen2.5-Math-7B (base) | ~16-18% | ~70-75% | No chain-of-thought training |
| DeepSeek-R1-Distill-Qwen-7B | 55.5% | 92.8% | Roughly +38pp AIME, +20pp MATH |
| Qwen2.5-32B (base) | ~35-40% | ~83-85% | No chain-of-thought training |
| DeepSeek-R1-Distill-Qwen-32B | 72.6% | 94.3% | Roughly +33pp AIME, +10pp MATH |
| Llama-3.1-8B (base) | <10% | ~50-55% | No chain-of-thought training |
| DeepSeek-R1-Distill-Llama-8B | 50.4% | 89.1% | Very large gains from distillation |
The gains on AIME are particularly large because competition mathematics problems require multi-step planning and self-correction that base models without chain-of-thought training rarely exhibit. The distillation process teaches the model to generate a reasoning trace before committing to an answer, which allows it to catch errors and backtrack. This behavior is not spontaneous in the base models; it has to be instilled through training on traces that demonstrate it.
The Llama-3.1-8B base shows the most dramatic improvement in absolute terms. Its starting AIME score is very low because the Llama-3.1 base was trained primarily as a general-purpose language model with no particular emphasis on mathematical reasoning. The distillation adds over 40 percentage points of AIME accuracy by teaching it to reason step by step.
On May 28, 2025, DeepSeek released an updated version of its reasoning model called DeepSeek-R1-0528, along with a new distilled variant called DeepSeek-R1-0528-Qwen3-8B. This model followed the same distillation paradigm as the January 2025 family but used Qwen3-8B-Base rather than Qwen2.5-Math-7B as the starting point, reflecting the release of Alibaba's Qwen3 model series in April 2025.
DeepSeek-R1-0528 itself represented a meaningful improvement over the original DeepSeek-R1. On AIME 2025, the full R1-0528 model scored 87.5%, up from 70.0% for the original R1. The increased capability of the teacher model flowed through to the distilled variant.
The Qwen3-8B distilled model was produced by post-training Qwen3-8B-Base on chain-of-thought traces generated by DeepSeek-R1-0528. The methodology matched the January 2025 approach: supervised fine-tuning on teacher-generated reasoning traces, without an additional RL stage. The model shares the same tokenizer configuration as DeepSeek-R1-0528. Its architecture is identical to Qwen3-8B, with 8.19 billion parameters and BF16 weights.
DeepSeek-R1-0528-Qwen3-8B achieved state-of-the-art performance among open-source models in its size class at the time of release.
| Benchmark | R1-0528-Qwen3-8B | Qwen3-8B | Qwen3-235B-A22B | o3-mini (medium) |
|---|---|---|---|---|
| AIME 2024 | 86.0% | 76.0% | 85.7% | 79.6% |
| AIME 2025 | 76.3% | 67.3% | 81.5% | 76.7% |
| HMMT Feb 2025 | 61.5% | n/a | 62.5% | 53.3% |
| GPQA Diamond | 61.1% | 62.0% | 71.1% | 76.8% |
| LiveCodeBench | 60.5% | n/a | 66.5% | 62.3% |
The model outperforms the base Qwen3-8B by 10 percentage points on AIME 2024 and matches the 235B mixture-of-experts Qwen3-235B-A22B on that benchmark despite having only 8 billion parameters. On HMMT February 2025, a harder competition mathematics test than AIME, the 8B distilled model (61.5%) comes within a point of Qwen3-235B-A22B (62.5%) and substantially exceeds o3-mini medium (53.3%). These results illustrate how much reasoning capability can be transferred through distillation when the teacher model is strong.
The May 2025 model introduced several practical usability improvements compared to the original January family:

- System prompt support: the January 2025 distills were designed to run without a system prompt, whereas the Qwen3-8B distill supports one.
- No forced thinking prefix: the January models required appending `<think>\n` to the prompt to trigger chain-of-thought mode. The Qwen3-8B distill activates reasoning through normal conversation formatting without a forced prefix.

The recommended sampling temperature remains 0.6. The model is released under the MIT license.
One measurable shift from the January 2025 distillation to the May 2025 round is reasoning depth. On difficult mathematics problems, the Qwen3-8B distilled model uses an average of around 23,000 tokens of internal reasoning before producing its final answer, compared to roughly 12,000 tokens for the January distilled models on similar problems. This near-doubling of thinking depth corresponds to improvements in accuracy on multi-step problems, and reflects the stronger reasoning behavior in the R1-0528 teacher, which itself benefits from improved RL training and more computational resources applied during post-training.
The amount of GPU memory needed depends on the model size and the numerical precision used for inference. The table below shows approximate requirements for full-precision (BF16) and 4-bit quantized weights.
| Model | BF16 VRAM | Q4 VRAM (approx.) | Practical consumer GPU |
|---|---|---|---|
| R1-Distill-Qwen-1.5B | ~3 GB | ~1.5 GB | Any modern GPU or CPU |
| R1-Distill-Qwen-7B | ~14 GB | ~4-5 GB | RTX 3060 (12 GB) at Q4 |
| R1-Distill-Llama-8B | ~16 GB | ~5-6 GB | RTX 3060 (12 GB) at Q4 |
| R1-Distill-Qwen-14B | ~28 GB | ~8-10 GB | RTX 3090 or 4090 at Q4 |
| R1-Distill-Qwen-32B | ~66 GB | ~18-20 GB | RTX 4090 (24 GB) at Q4 |
| R1-Distill-Llama-70B | ~140 GB | ~40 GB | Multi-GPU or Mac Studio 192 GB |
For CPU-only inference, the models can run on systems with 48 GB or more of RAM at reduced throughput (typically under 2 tokens per second for the 14B and larger variants on current consumer hardware).
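The table's figures follow from a back-of-envelope rule: weight memory is roughly parameter count times bytes per parameter, and real usage adds KV cache and activation overhead on top. A small sketch of that arithmetic:

```python
# Approximate weight-only VRAM; real usage adds KV cache and activation
# overhead, which is why the table's figures run somewhat higher.
def weight_vram_gib(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * (bits_per_param / 8) / 2**30

print(f"{weight_vram_gib(32, 16):.0f} GiB")  # BF16 32B -> ~60 GiB of weights
print(f"{weight_vram_gib(32, 5):.0f} GiB")   # Q4_K_M-style quants average ~5 bits/param -> ~19 GiB
```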
The Unsloth team released GGUF-format quantized versions of the distilled models shortly after the January 2025 release, including `Q4_K_M`, `Q6_K`, and `Q8_0` variants. GGUF is the standard format used by llama.cpp and Ollama, making the models accessible on both NVIDIA GPUs (via CUDA) and Apple Silicon (via Metal). Ollama added the distilled models to its library under the `deepseek-r1` tag with size suffixes. Running `ollama run deepseek-r1:7b` downloads and runs the Qwen-7B distill, and `ollama run deepseek-r1:8b` fetches the Llama-8B variant.
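For programmatic access to those same tags, the official Ollama Python client can be used; a sketch, assuming the `ollama` package is installed and a local daemon is running with the model pulled:

```python
# Query the Qwen-7B distill through a local Ollama daemon (pip install ollama).
import ollama

response = ollama.chat(
    model="deepseek-r1:7b",  # the Qwen-7B distill in Ollama's library
    messages=[{"role": "user", "content": "How many primes are below 100?"}],
)
print(response["message"]["content"])  # output includes the <think> trace
```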
For server-side deployment in production, the models are compatible with vLLM and SGLang. A vLLM launch for the 32B model typically uses tensor parallelism across two GPUs with the `--max-model-len 32768` and `--enforce-eager` flags. SGLang achieves lower latency on batch inference workloads through its RadixAttention cache management.
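A sketch of the equivalent offline deployment through vLLM's Python API, mirroring those flags (recent vLLM versions expose a `chat()` helper; the model choice and sampling budget are illustrative):

```python
# Two-way tensor-parallel deployment of the 32B distill with a 32K context.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    tensor_parallel_size=2,  # split weights across two GPUs
    max_model_len=32768,     # full 32K context window
    enforce_eager=True,      # equivalent of --enforce-eager
)
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=8192)
outputs = llm.chat(
    [{"role": "user", "content": "Prove that sqrt(2) is irrational."}], params
)
print(outputs[0].outputs[0].text)
```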
Unsloth also released bitsandbytes-quantized versions of the distilled models (4-bit NF4 format) for users who prefer to run fine-tuning and inference within a Python environment without converting to GGUF.
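A sketch of loading one of the checkpoints in 4-bit NF4 directly through transformers and bitsandbytes (a CUDA GPU is required; the model choice is illustrative):

```python
# Load a distill in 4-bit NF4 without converting to GGUF.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the NF4 format mentioned above
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16 for stability
)
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    quantization_config=bnb_config,
    device_map="auto",
)
```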
Apple Silicon Macs with unified memory are well-suited for the smaller distilled models because the CPU, GPU, and neural engine share the same memory pool, eliminating PCIe bandwidth bottlenecks. The 7B and 8B models run comfortably on M2 and M3 MacBook Pro configurations with 16 GB of unified memory at Q4 quantization, producing interactive-speed output. The 14B and 32B variants require Mac Studio or Mac Pro configurations with at least 64 GB of unified memory for reasonable throughput.
The licensing of the DeepSeek-R1-Distill models depends on which variant is used, since each inherits the license of its base model architecture.
| Model group | License |
|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B, 7B, 14B, 32B | Apache 2.0 (from Qwen2.5 base) |
| DeepSeek-R1-Distill-Llama-8B | Meta Llama 3.1 Community License |
| DeepSeek-R1-Distill-Llama-70B | Meta Llama 3.3 Community License |
| DeepSeek-R1-0528-Qwen3-8B | MIT License |
The DeepSeek-R1 model card and the GitHub repository state that the model series supports commercial use and allows derivative works including further distillation for training other language models. The Qwen variants, governed by Apache 2.0, are broadly permissive for commercial and research use. The Llama variants require compliance with Meta's community license agreements, which permit commercial use for organizations below a certain user count threshold.
The R1-0528-Qwen3-8B model, released under MIT, is the most permissive of the family and places essentially no restrictions on use or redistribution.
All model weights are publicly available on HuggingFace under the `deepseek-ai` organization page.
The primary intended use case for the distilled models is reasoning-heavy tasks: mathematics competition problems, physics and chemistry calculations, and multi-step logical deduction. The AIME and GPQA benchmark scores show that even the 7B and 8B variants substantially exceed what general-purpose models of similar size achieve on these tasks. Researchers and developers who need strong quantitative reasoning without server infrastructure have adopted the 7B, 8B, and 14B models as a practical alternative to calling large-model APIs.
The chain-of-thought format also makes the reasoning process inspectable. Users can read the model's trace to verify that it approached a problem correctly or to identify where it made an error, which is useful in educational and research workflows where the derivation matters as much as the answer.
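Separating the trace from the answer is a simple parsing step; a sketch, assuming the single `<think>...</think>` span the models emit:

```python
# Split a raw completion into (reasoning trace, final answer).
import re

def split_trace(raw_output: str):
    match = re.search(r"<think>(.*?)</think>", raw_output, flags=re.DOTALL)
    if match is None:
        return None, raw_output.strip()  # no trace found; treat all as answer
    return match.group(1).strip(), raw_output[match.end():].strip()
```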
Because the distilled models already encode extended chain-of-thought behavior, they serve as effective starting points for further domain-specific fine-tuning, and the community has produced a number of derivative checkpoints.
Unsloth and Alibaba Cloud both provide tooling for LoRA and full fine-tuning of the distilled checkpoints. Alibaba's platform specifically documented a one-click fine-tuning workflow for all six January 2025 distilled models.
Organizations and individuals who cannot or do not want to send data to external APIs use the distilled models for on-premise or local inference. Healthcare providers working with patient records, law firms processing confidential documents, and government agencies with data residency requirements all represent sectors where sending data to a cloud API is impractical or prohibited. The 1.5B model runs on any modern GPU or CPU with modest memory requirements, making it accessible for edge devices and embedded applications. The 7B and 14B models cover the range where a single consumer GPU can handle interactive inference speeds for most reasoning tasks.
In cloud deployments where inference cost per token is a constraint, the distilled models allow operators to run reasoning tasks at a fraction of the cost of the full 671B model. The 32B and 70B distilled models capture a large share of the reasoning performance of DeepSeek-R1 on standard benchmarks while requiring far fewer GPU-hours per call. This cost profile has made them attractive for applications where reasoning is needed at scale, such as automated code review, large-scale document analysis, or educational platforms generating mathematics feedback.
The distilled models drew large download numbers on HuggingFace in the weeks following the January 2025 release. The 7B and 14B models were particularly widely downloaded because they fit within the hardware available to most developers. By May 2025, the DeepSeek-R1-0528-Qwen3-8B model was receiving over 258,000 downloads per month on HuggingFace. Ollama reported that the deepseek-r1 tag, which covers the distilled family, became one of the most-pulled model families in its library during February and March 2025.
Several API providers added distilled DeepSeek-R1 models to their hosted offerings. OpenRouter listed multiple size variants with per-token pricing. Fireworks.ai and Together AI hosted the 7B, 14B, and 70B variants. SiliconFlow, a Chinese API provider, listed the 14B model on its platform. These offerings gave developers access to the distilled models without running local infrastructure.
LM Studio, an application for running language models on consumer hardware, added support for the GGUF-quantized distilled models through its model browser. Mozilla AI released a llamafile-packaged version of the 14B model under the identifier `mozilla-ai/DeepSeek-R1-Distill-Qwen-14B-llamafile`, which packages the weights and inference runtime into a single executable file that runs on multiple operating systems without a separate installation step.
Inference backends including vLLM, SGLang, and Ollama all added explicit support for the distilled models, including configuration recommendations in their documentation.
Academic papers published in the first half of 2025 used the distilled models as baselines and fine-tuning starting points across areas including biomedical text processing, financial analytics, and mathematical problem generation. The Dropbox engineering team published a technical blog post on re-distillation experiments, finding that using more challenging reasoning traces as training data produced student models that outperformed the standard DeepSeek-R1-Distill checkpoints on hard mathematics benchmarks. This work contributed to understanding of how the quality of the teacher's traces influences student model capability.
A research group published findings (arXiv:2505.13792) examining the disconnect between trace interpretability and student model outcomes in trace-based knowledge distillation, using the DeepSeek-R1-Distill family as a test case. The paper found that even when traces are interpretable to humans, students can learn unexpected generalizations that diverge from the surface reasoning in the traces.
The distilled models can exhibit language mixing in their internal reasoning traces, particularly when prompted in a language other than Chinese or English. The model may begin a reasoning trace in English and switch to Chinese mid-trace, or mix Chinese mathematical terminology with English variable names. This was an observed failure mode in the DeepSeek-R1-Zero training run and persists to a lesser degree in the distilled models. The problem typically appears in the `<think>` section rather than in the final answer, and does not usually affect answer correctness, but it is distracting and can impede debugging of reasoning chains.
The maximum generation length for the January 2025 distilled models is 32,768 tokens. This includes both the chain-of-thought reasoning trace and the final answer. For difficult problems that require very long reasoning traces, the model may run out of generation budget before completing the solution. The 32K limit is sufficient for most standard benchmark problems but can be a constraint for novel hard problems or long-form analytical tasks.
The smaller distilled models (1.5B and 7B) show reduced performance on open-ended writing, creative tasks, and conversational instruction following compared to general-purpose models of similar parameter count. The 800K training dataset is concentrated on reasoning and question answering, so capabilities that depend on exposure to diverse text types are less developed. Users who need a balance between reasoning and general language quality typically find the 14B or 32B variants more suitable.
The authors noted explicitly that the distilled models were released without an RL stage on top of the SFT. Community experiments have confirmed that applying RL post-training to the distilled checkpoints improves reasoning accuracy on hard benchmarks, sometimes by several percentage points on AIME. The released models represent the SFT-only checkpoint, not the theoretical ceiling achievable with additional training.
Early deployment reports from the developer community noted that the January 2025 distilled variants were less reliable than the full DeepSeek-R1 model on tool calling and structured JSON output tasks. The internal reasoning mode and function-calling modes were described as partially conflicting in some inference server configurations. This limitation was less pronounced in the May 2025 Qwen3-8B distill, which introduced better separation of reasoning and structured-output behavior.
The distilled models inherit refusal behaviors from their training that reflect content moderation policies applied during the DeepSeek-R1 training process. Independent evaluations have observed refusal patterns on topics that are politically sensitive in China. This has no effect on the models' mathematics or science reasoning performance but is a practical consideration for organizations deploying the models in open-domain conversational settings where a broad range of user queries is expected.