DeepSeek V3.1 is a large language model developed by DeepSeek, released on August 19, 2025. It is a hybrid reasoning model that integrates both thinking and non-thinking inference modes into a single 671-billion-parameter Mixture of Experts architecture. The model extends its predecessor DeepSeek V3's 64K context window to 128K tokens and introduces explicit chain-of-thought reasoning controlled via chat template tokens rather than requiring a separate dedicated reasoning model. DeepSeek positioned V3.1 as the company's first step toward what it described as "the agent era," citing substantial gains in code agent and search agent benchmarks over both V3-0324 and DeepSeek-R1.
The model was released under the MIT license, continuing DeepSeek's open-weight strategy. An incremental follow-up, DeepSeek-V3.1-Terminus, was released on September 22, 2025, with fixes for language consistency issues and further improvements to agentic tool use.
DeepSeek V3 was released in December 2024 as a 671-billion-parameter MoE model trained on 14.8 trillion tokens. It achieved strong performance on general-purpose benchmarks at a fraction of the training cost of comparable Western frontier models, attracting significant attention in the research community. A subsequent update, V3-0324, released in March 2025, improved coding performance further but remained a non-reasoning model, meaning it could not produce extended chain-of-thought traces.
DeepSeek-R1 was released in January 2025 as DeepSeek's dedicated reasoning model. Trained via reinforcement learning using Group Relative Policy Optimization (GRPO), R1 demonstrated strong performance on mathematics and competitive programming. However, it operated in a single inference mode (chain-of-thought only), lacked native structured tool-use support, and produced long reasoning traces that increased latency significantly compared to non-reasoning models. The time to first token for R1 on complex tasks was measured at over 100 seconds in some configurations, making it impractical for latency-sensitive applications.
In May 2025, DeepSeek released R1-0528, an updated reasoning model with improved performance across mathematics, coding, and science benchmarks. Despite this, R1-0528 remained a reasoning-only model and could not perform structured tool calls or operate in non-thinking mode. At the same time, V3-0324, while fast and versatile, could not perform deep chain-of-thought reasoning. The community identified a clear gap: no single DeepSeek model could switch between fast general-purpose responses and deep reasoning on demand.
V3.1 was built to address this gap directly.
DeepSeek V3.1 was announced on August 19, 2025. The release was characteristically low-key for DeepSeek: the model appeared on Hugging Face under the repository deepseek-ai/DeepSeek-V3.1 before any formal blog post, with the initial announcement made through DeepSeek's official WeChat group rather than a press release. This contrasted with the more prominent launch strategies of Western AI laboratories.
The model was made available simultaneously via DeepSeek's API (where deepseek-chat and deepseek-reasoner were both upgraded to V3.1), through Hugging Face model repositories, through ModelScope, and via the official DeepSeek web and mobile applications. Third-party inference providers including Together AI, OpenRouter, SambaNova, and Google Cloud's Vertex AI Model Garden subsequently added V3.1 to their offerings within days.
The base model, DeepSeek-V3.1-Base, was also released separately on Hugging Face for researchers who wished to apply their own fine-tuning or post-training procedures.
The defining feature of V3.1 is its hybrid inference architecture, which allows a single model to operate in either thinking mode (extended chain-of-thought reasoning) or non-thinking mode (direct response) depending on a parameter set in the chat template.
The mode selection is implemented through special tokens added to the assistant turn prefix. In thinking mode, the assistant turn begins with <think>, signaling the model to produce a reasoning trace before the final answer. In non-thinking mode, the assistant turn begins with </think> (a bare closing tag, marking the reasoning block as already complete), signaling the model to skip chain-of-thought generation and respond directly.
This design means the two modes share a single set of model weights and are differentiated purely at the tokenization level. The model was trained to recognize these token patterns and behave accordingly. In the Hugging Face tokenizer, the boolean thinking parameter of apply_chat_template selects the correct prefix:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.1")
messages = [{"role": "user", "content": "What is 17 * 24?"}]

# Thinking mode: prefixes the assistant turn with <think>
tokenizer.apply_chat_template(messages, thinking=True, add_generation_prompt=True)
# Non-thinking mode: prefixes the assistant turn with </think>
tokenizer.apply_chat_template(messages, thinking=False, add_generation_prompt=True)
```
On the API side, the same toggle is exposed through model selection rather than a request parameter: the deepseek-chat endpoint serves V3.1 in non-thinking mode, while deepseek-reasoner serves the same weights in thinking mode, preserving the endpoint names already familiar to API users.
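A minimal usage sketch through the OpenAI-compatible SDK (the prompts are illustrative, and a real API key must replace the placeholder):

```python
# Mode selection via DeepSeek's OpenAI-compatible API (illustrative sketch).
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="<YOUR_API_KEY>")

# Non-thinking mode: V3.1 served behind the deepseek-chat alias.
fast = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Summarize this paragraph."}],
)

# Thinking mode: the same weights served behind deepseek-reasoner.
deep = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "Is 2**31 - 1 prime? Reason carefully."}],
)
print(fast.choices[0].message.content)
```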
DeepSeek's approach with V3.1 differs from the conventional two-model paradigm (a fast general model plus a slow reasoning model). A single hybrid model offers several practical advantages. Operators do not need to maintain separate API endpoints or routing logic. Users can switch modes within the same conversation context without starting over. The model can, in principle, calibrate reasoning depth to task difficulty rather than committing entirely to one mode per request.
This approach had precedent in the broader industry. Qwen3, released by Alibaba's Tongyi team in April 2025, used a comparable thinking/non-thinking toggle via <think> and </think> tags. However, DeepSeek V3.1's scale (671B total parameters versus Qwen3's various sizes) and its agent-specific post-training made it distinct in scope.
DeepSeek stated that V3.1 in thinking mode achieves comparable answer quality to R1-0528 while responding more quickly, due to shorter reasoning traces. Internal testing cited a 20 to 50 percent reduction in output tokens during chain-of-thought generation compared to R1-0528 on equivalent problems, attributed to training optimizations that produce more concise reasoning. On the AIME 2024 benchmark, V3.1 thinking mode scored 93.1% Pass@1, slightly above R1-0528's score, while V3.1 non-thinking mode scored 66.3% on the same benchmark.
DeepSeek V3.1 uses the same underlying architecture as DeepSeek V3. There were no structural changes to the transformer design between versions; the differences come from the base model's extended context training, the post-training stage that introduced hybrid reasoning, and the additional agent-focused fine-tuning.
V3.1 uses the DeepSeekMoE architecture, which routes each token to a small subset of expert feed-forward networks rather than activating all parameters. The model has 671 billion total parameters but activates approximately 37 billion per token (roughly 5.5% of the total), keeping inference compute and memory bandwidth requirements substantially lower than a dense model of equivalent capacity. DeepSeek's custom routing mechanism reduces load imbalance across experts, which is a known challenge with MoE designs at this scale.
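The routing step can be sketched in a few lines; the dimensions, the value of k, and the function itself are illustrative, and the real DeepSeekMoE router additionally uses shared experts and load-balancing corrections:

```python
# Toy top-k MoE routing sketch (not DeepSeek's implementation).
import torch

def route_tokens(hidden: torch.Tensor, gate_weight: torch.Tensor, k: int = 8):
    # hidden: [num_tokens, d_model]; gate_weight: [num_experts, d_model]
    scores = hidden @ gate_weight.T                 # token-to-expert affinities
    topk_scores, topk_ids = scores.topk(k, dim=-1)  # keep only the k best experts
    gates = torch.softmax(topk_scores, dim=-1)      # mixing weights over those k
    return gates, topk_ids                          # each token visits k experts only

gates, ids = route_tokens(torch.randn(4, 64), torch.randn(16, 64), k=2)
```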
V3.1 retains Multi-head Latent Attention (MLA), a key-value cache compression technique DeepSeek introduced in DeepSeek-V2. MLA compresses the key and value tensors into a lower-dimensional latent space before caching them, dramatically reducing KV cache memory consumption at the cost of a small projection step during inference. This makes 128K-context inference feasible on practical hardware configurations.
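The core idea can be illustrated with a short sketch (dimensions and module names are hypothetical; the actual MLA design also splits the re-expanded tensors into per-head keys and values and routes rotary position embeddings through a separate path):

```python
# Minimal sketch of latent KV compression; sizes are illustrative only.
import torch
import torch.nn as nn

d_model, d_latent = 4096, 512                      # assumed dimensions
down = nn.Linear(d_model, d_latent, bias=False)    # compress before caching
up_k = nn.Linear(d_latent, d_model, bias=False)    # re-expand keys at attention time
up_v = nn.Linear(d_latent, d_model, bias=False)    # re-expand values

x = torch.randn(1, 1024, d_model)                  # hidden states for 1,024 tokens
kv_cache = down(x)                                 # only [1, 1024, 512] is cached
k, v = up_k(kv_cache), up_v(kv_cache)              # small projection step at inference
```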
V3.1 uses the UE8M0 FP8 scale format for both weights and activations. The base model weights are stored in FP8, and inference is performed in FP8 where supported, reducing memory footprint and improving throughput on compatible hardware. A known issue in the initial V3.1-Terminus release noted that self_attn.o_proj parameters did not conform to the UE8M0 scale format; DeepSeek acknowledged this and indicated it would be addressed in a future release.
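UE8M0 denotes an unsigned encoding with eight exponent bits and zero mantissa bits, so every representable scale is an exact power of two. A toy codec for the idea, assuming the IEEE-style exponent bias of 127:

```python
# Toy UE8M0 scale codec: 8 exponent bits, no mantissa, no sign.
# The bias of 127 is an assumption borrowed from IEEE-style exponent encodings.
import math

def encode_ue8m0(scale: float, bias: int = 127) -> int:
    exp = round(math.log2(scale)) + bias   # snap to the nearest power of two
    return max(0, min(255, exp))           # clamp into the 8-bit range

def decode_ue8m0(code: int, bias: int = 127) -> float:
    return 2.0 ** (code - bias)

assert decode_ue8m0(encode_ue8m0(0.25)) == 0.25   # powers of two round-trip exactly
```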
DeepSeek V3's context window was 64K tokens. V3.1 extends this to 128K tokens through a two-phase long-context training procedure applied to V3.1-Base before the hybrid reasoning post-training stage.
The first phase extended the context from the original training length to 32K tokens using 630 billion tokens of continued pretraining. This represents a ten-fold increase in training data compared to V3-0324's long-context extension phase, indicating substantially more investment in long-context capability.
The second phase further extended the context to 128K tokens using 209 billion additional training tokens, a 3.3-fold increase in training data over the corresponding phase of V3's training. The model was evaluated on the Needle In A Haystack (NIAH) test suite to verify retrieval accuracy across the full 128K range.
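The structure of such a test is straightforward to sketch: a known fact (the needle) is planted at a controlled depth inside long filler text (the haystack), and the model is scored on whether it retrieves the fact. A toy probe, with invented filler and needle:

```python
# Toy needle-in-a-haystack probe (illustrative harness, not DeepSeek's suite).
filler = "The meeting ran long and nothing was decided. " * 10000  # long haystack
needle = "The access code for the archive is 7421."                # planted fact
depth = 0.5                                                        # halfway into the context

pos = int(len(filler) * depth)
prompt = (
    filler[:pos] + needle + filler[pos:]
    + "\n\nQuestion: What is the access code for the archive?"
)
# Retrieval succeeds if the model's answer contains "7421".
```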
Google Cloud's Vertex AI deployment of V3.1 reported a context window of 163,840 tokens for its managed API offering, a figure that matches the max_position_embeddings value in the model's configuration file rather than the rounded 128K figure.
V3.1 received substantial post-training focused on agentic workflows, which DeepSeek identified as a strategic priority. The model's agent-related performance improvements over V3-0324 are among the largest gains in the release.
On SWE-bench Verified, V3.1 scored 66.0%, compared to 45.4% for V3-0324. This benchmark tests whether a model can autonomously identify and fix real-world software bugs in GitHub repositories, making it a measure of practical software engineering capability rather than isolated code generation. The 45% relative improvement is substantially larger than gains on general benchmarks, reflecting the targeted post-training.
On BrowseComp, a benchmark that evaluates web browsing and research tasks, V3.1 scored 30.0% versus 8.9% for R1-0528. This roughly 237% relative improvement reflects V3.1's specialized training for search-augmented workflows. The model uses dedicated control tokens (<search_begin> and <search_end>) to format web search calls within the thinking mode trace, enabling structured retrieval during reasoning.
On Terminal-bench, which evaluates autonomous task execution in a terminal environment, V3.1 scored 31.3% compared to 13.3% for V3-0324.
V3.1 introduced native structured tool calling with a standardized JSON-based format. Tool definitions are specified in the system prompt or via the API's tools parameter, and the model returns tool calls in a structured format that downstream systems can parse and execute. This capability was not available in the original R1 release.
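The request shape can be sketched through the OpenAI-compatible interface; the get_weather tool, its schema, and the prompt below are invented for illustration:

```python
# Structured tool calling sketch; the get_weather tool is hypothetical.
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="<YOUR_API_KEY>")
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-chat",   # shown here in non-thinking mode
    messages=[{"role": "user", "content": "What's the weather in Hangzhou?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # structured calls for the caller to execute
```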
On September 22, 2025, DeepSeek released DeepSeek-V3.1-Terminus as an incremental update to V3.1. The update was motivated by user-reported issues in production deployments rather than research-driven improvements.
The primary fix in V3.1-Terminus addressed language consistency: users had observed that the model occasionally produced outputs mixing Chinese and English when responding in an English context, and sometimes generated rare or anomalous characters. V3.1-Terminus reduced the frequency of these issues substantially.
Agent capabilities also received additional optimization. The search agent template and tool set were updated (documented in assets/search_tool_trajectory.html in the repository), and both code agent and search agent performance improved measurably on benchmarks.
DeepSeek described the overall result as "more stable and reliable outputs across benchmarks compared to the previous version."
The following improvements were reported on the reasoning mode benchmarks (without tool use):
| Benchmark | V3.1 | V3.1-Terminus | Change |
|---|---|---|---|
| MMLU-Pro | 84.8% | 85.0% | +0.2% |
| GPQA-Diamond | 80.1% | 80.7% | +0.6% |
| Humanity's Last Exam | 15.9% | 21.7% | +5.8% |
| LiveCodeBench | 74.8% | 74.9% | +0.1% |
| Codeforces Rating | 2091 | 2046 | -45 |
Agentic tool use showed larger improvements:
| Benchmark | V3.1 | V3.1-Terminus | Change |
|---|---|---|---|
| BrowseComp | 30.0% | 38.5% | +8.5% |
| SimpleQA | 93.4% | 96.8% | +3.4% |
| SWE-bench Verified | 66.0% | 68.4% | +2.4% |
| SWE-bench Multilingual | 54.5% | 57.8% | +3.3% |
| Terminal-bench | 31.3% | 36.7% | +5.4% |
DeepSeek-V3.1-Terminus weights were released on Hugging Face under the MIT license. The model replaced V3.1 on DeepSeek's official API, web application, and mobile application on September 22, 2025. Third-party providers including SambaNova, OpenRouter, and DeepInfra added support for V3.1-Terminus within the following weeks.
The following table compares DeepSeek V3.1 (both modes), V3-0324, and R1-0528 across major benchmark categories.
| Benchmark | V3.1 (non-thinking) | V3.1 (thinking) | V3-0324 | R1-0528 |
|---|---|---|---|---|
| MMLU-Redux | 91.8% | 93.7% | 90.5% | 93.4% |
| MMLU-Pro | 83.7% | 84.8% | 81.2% | 85.0% |
| GPQA-Diamond | 74.9% | 80.1% | 68.4% | 81.0% |
| LiveCodeBench Pass@1 | 56.4% | 74.8% | 43.0% | 73.3% |
| Aider-Polyglot | 62.1% | 76.3% | 52.6% | 74.9% |
| AIME 2024 Pass@1 | 66.3% | 93.1% | 59.4% | 91.4% |
| AIME 2025 Pass@1 | 49.8% | 88.4% | 47.2% | 87.1% |
| HMMT 2025 | 33.5% | 84.2% | 29.3% | 82.9% |
| SWE-bench Verified | 66.0% | n/a | 45.4% | n/a |
| BrowseComp | 30.0% | n/a | n/a | 8.9% |
| Terminal-bench | 31.3% | n/a | 13.3% | n/a |
| SimpleQA | 33.5% | 93.4% | 31.2% | 92.7% |
Benchmark notes: SWE-bench Verified results for V3.1 use non-thinking mode with tool calling enabled. BrowseComp results use thinking mode with search tool access. Codeforces ratings are not directly comparable to percentage pass scores and are listed separately. V3.1 thinking mode scored a Codeforces rating of approximately 2091, above R1-0528's 1930.
DeepSeek V3.1 and V3.1-Terminus are both released under the MIT License, the same terms applied to earlier DeepSeek models including V3 and R1. The MIT License permits commercial use, modification, distribution, and private use, with the only requirement being preservation of the copyright notice and license text in source distributions.
The base model weights (DeepSeek-V3.1-Base and DeepSeek-V3.1-Terminus-Base) are available for download from Hugging Face and ModelScope. Users deploying the full 671B model require approximately 1,400 to 1,500 GB of GPU memory for full-precision loading, which in practice requires multi-node configurations or quantized variants. Community-produced GGUF quantizations reduce memory requirements substantially while accepting some performance degradation.
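The quoted memory figure follows from simple arithmetic on the parameter count; the estimate below is a back-of-envelope sketch, with runtime overheads accounting for the remainder:

```python
# Back-of-envelope weight-memory estimate for a 671B-parameter model.
params = 671e9
print(f"BF16: {params * 2 / 1e9:.0f} GB")   # ~1,342 GB at 2 bytes per parameter
print(f"FP8:  {params * 1 / 1e9:.0f} GB")   # ~671 GB at 1 byte per parameter
# KV cache, activations, and fragmentation push the practical
# full-precision requirement toward the quoted 1,400-1,500 GB.
```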
The following table shows API pricing for DeepSeek V3.1 and related models on DeepSeek's official API at time of release. DeepSeek applies prompt caching with reduced rates for cache-hit tokens.
| Model | Input (cache miss) | Input (cache hit) | Output |
|---|---|---|---|
| DeepSeek V3.1 | $0.15 / 1M tokens | $0.015 / 1M tokens | $0.75 / 1M tokens |
| DeepSeek V3.1-Terminus | $0.15 / 1M tokens | $0.015 / 1M tokens | $0.75 / 1M tokens |
| DeepSeek V3-0324 (reference) | $0.14 / 1M tokens | $0.014 / 1M tokens | $0.28 / 1M tokens |
Third-party providers set their own rates. OpenRouter listed V3.1 at $0.15 per million input tokens and $0.75 per million output tokens at launch. These prices positioned V3.1 substantially below frontier Western models with comparable capabilities; OpenAI's o3 and Anthropic's Claude Opus 4, which offer broadly comparable reasoning performance, carried output prices an order of magnitude higher during the same period.
Note that DeepSeek's API pricing was superseded by later updates. With the release of V3.2 in December 2025 and the V4 series in April 2026, V3.1 and V3.1-Terminus were scheduled for deprecation, with deepseek-chat and deepseek-reasoner endpoint aliases retiring on July 24, 2026.
DeepSeek V3 (including the V3-0324 update) is a non-thinking model. It cannot produce chain-of-thought reasoning and scored 59.4% on AIME 2024 compared to V3.1's 93.1% in thinking mode. V3.1's non-thinking mode also outperforms V3-0324 on most benchmarks, suggesting that the post-training improvements contributed to general capability gains independent of the reasoning mode addition.
The context window grew from 64K (V3-0324) to 128K (V3.1), doubling the amount of material the model can consider in a single request.
DeepSeek-R1 (and its May 2025 update R1-0528) is a dedicated reasoning model with no non-thinking mode. R1-0528 produces long chain-of-thought traces before every answer, resulting in substantially higher latency than V3.1's non-thinking mode (over 100 seconds to first token for complex tasks, versus approximately 2.9 seconds for V3.1 non-thinking). V3.1 thinking mode achieves comparable benchmark scores to R1-0528 on math and coding while producing shorter reasoning traces.
The most practically significant difference is tool calling. The original R1 did not natively support structured tool use, making it unsuitable for direct integration into agentic pipelines without additional scaffolding. V3.1 includes native tool calling in both modes.
DeepSeek-V3.2-Exp was released on September 29, 2025, one week after V3.1-Terminus. It was built on V3.1-Terminus as its base and introduced DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism designed to improve inference efficiency for long-context workloads. V3.2-Exp performs roughly on par with V3.1-Terminus on most benchmarks, with the primary motivation being architectural efficiency research rather than raw capability improvement. DeepSeek positioned V3.2-Exp as a research preview rather than a production-ready replacement.
The following table summarizes the comparison across the V3 lineage and R1:
| Feature | V3-0324 | V3.1 | V3.1-Terminus | V3.2-Exp | R1-0528 |
|---|---|---|---|---|---|
| Parameters (total) | 671B | 671B | 671B | 671B | 671B |
| Parameters (active) | ~37B | ~37B | ~37B | ~37B | ~37B |
| Context window | 64K | 128K | 128K | 128K | 128K |
| Thinking mode | No | Yes | Yes | Yes | Yes (only) |
| Non-thinking mode | Yes | Yes | Yes | Yes | No |
| Native tool calling | No | Yes | Yes | Yes | No |
| SWE-bench Verified | 45.4% | 66.0% | 68.4% | ~68% | n/a |
| AIME 2024 (thinking) | 59.4% | 93.1% | ~93% | ~93% | 91.4% |
| License | MIT | MIT | MIT | MIT | MIT |
| Release date | March 2025 | Aug 2025 | Sep 2025 | Sep 2025 | May 2025 |
DeepSeek V3.1's combination of hybrid reasoning, expanded context, native tool calling, and open weights supports a range of deployment scenarios.
The 45% relative improvement on SWE-bench Verified over V3-0324 made V3.1 one of the strongest available models for automated software engineering at its release. Developers using it through code agent frameworks (such as Aider, OpenHands, or SWE-agent) benefit from its ability to read and modify code across large codebases within the 128K context window, call external tools such as test runners, and apply chain-of-thought reasoning when debugging complex failures.
The 128K context window allows V3.1 to ingest full-length technical documents, legal contracts, research papers, or codebases in a single request. In non-thinking mode, the model can respond quickly to document-level queries with minimal latency. In thinking mode, it can perform multi-step reasoning over document content before producing a structured analysis.
V3.1's search agent capabilities, demonstrated by the BrowseComp score of 30.0%, allow it to operate in pipelines that call external search APIs during the reasoning trace. This makes it suitable for applications that require the model to retrieve and synthesize current information rather than relying entirely on pretraining knowledge.
At $0.75 per million output tokens on DeepSeek's official API, V3.1 offered reasoning-capable performance at a cost substantially below Western frontier models with comparable benchmark results. Developers and organizations running high-volume inference found the cost differential meaningful for production use cases where reasoning quality matters but budget is constrained.
As an MIT-licensed open-weight model, V3.1 is available for on-premises deployment. Organizations with data residency requirements or those preferring not to send data to third-party APIs can self-host the model. The full 671B model requires large-scale GPU infrastructure, but community quantizations in GGUF and other formats allow partial capability with reduced hardware requirements.
DeepSeek V3.1 received positive attention from the developer community following its August 2025 release. The model became one of the most downloaded models on Hugging Face within days of launch, reaching a top-five position in overall download volume.
Developers testing V3.1 in code agent configurations reported that its SWE-bench improvements translated to real-world usability gains, with the model successfully resolving issues that V3-0324 had failed on. The hybrid mode was widely noted as a practical improvement: the ability to use the same model for both quick responses and deep analysis reduced infrastructure complexity for teams running multi-agent pipelines.
The low-key announcement style drew comment. Several observers noted that DeepSeek released a model with industry-leading agent benchmarks with less ceremony than competitors typically apply to incremental updates. This was consistent with DeepSeek's pattern of prioritizing Hugging Face and WeChat announcements over press-release-style launches.
The September 2025 V3.1-Terminus update was covered by VentureBeat and other technology publications as a quick turnaround on user-reported quality issues, with the language mixing fix being highlighted as particularly responsive to production feedback.
Some researchers noted that the simultaneous availability of V3.1 and V3.2-Exp within the same month (September 2025) created ambiguity about which model was the recommended production version. DeepSeek's API kept V3.1-Terminus as the default behind deepseek-chat and deepseek-reasoner while V3.2-Exp was offered as a separate endpoint, which largely resolved the practical question for API users.
Despite strong benchmark performance, several limitations apply to V3.1 in practice.
The full 671B model requires approximately 1,400 to 1,500 GB of GPU memory for full-weight loading. This places full-precision local deployment out of reach for most individual researchers and many organizations without large GPU clusters. Quantized versions trade some performance for reduced memory requirements.
The knowledge cutoff for V3.1 is March 31, 2025. Events or developments after this date are not reflected in the model's pretraining knowledge, and the model may produce outdated or incorrect information about topics that evolved after the cutoff.
The language consistency issues that V3.1-Terminus addressed were a real limitation of the original V3.1 release in production. Users deploying V3.1 for English-language applications encountered occasional Chinese-language intrusions in outputs, which required additional filtering or prompted migration to V3.1-Terminus.
While V3.1 in thinking mode approaches R1-0528 on most reasoning benchmarks, R1-0528 retains a slight advantage on extended humanities reasoning and some creative tasks that benefit from prolonged deliberation. The compression of thinking traces that makes V3.1 faster also means it invests less computation in some complex tasks than a dedicated reasoning model would.
V3.1's training data composition and post-training details are not fully documented in any publicly available technical report, unlike the original DeepSeek V3, which had an accompanying paper (arXiv 2412.19437). V3.1 shipped without a separate publication, citing only the original V3 paper, which limits the degree to which the post-training procedure can be analyzed or reproduced by external researchers.