DeepSeek V3.1

DeepSeek V3.1 is a large language model developed by DeepSeek, released on August 19, 2025 and made broadly available via the official API on August 21, 2025.[^8][^1] It is a hybrid reasoning model that integrates both thinking and non-thinking inference modes into a single 671-billion-parameter Mixture of Experts architecture (37 billion active parameters per token).[^2] The model extends its predecessor DeepSeek V3's 64K context window to 128K tokens via a two-phase long-context continued pretraining stage totaling 840 billion additional tokens, and introduces explicit chain-of-thought reasoning controlled via chat template tokens rather than requiring a separate dedicated reasoning model.[^2][^7] DeepSeek positioned V3.1 as the company's first step toward what it described as "the agent era," citing substantial gains in code agent and search agent benchmarks over both V3-0324 and DeepSeek-R1-0528.[^7][^14]

The model was released under the MIT License, continuing DeepSeek's open-weight strategy.[^2] An incremental follow-up, DeepSeek-V3.1-Terminus, was released on September 22, 2025 with fixes for language consistency issues and further improvements to agentic tool use.[^3][^4] It was succeeded one week later by the experimental DeepSeek-V3.2-Exp, which introduced DeepSeek Sparse Attention.[^15] On the official DeepSeek API, the legacy deepseek-chat and deepseek-reasoner endpoint aliases, which had been upgraded to V3.1 at launch, are scheduled for full retirement on July 24, 2026 at 15:59 UTC, in favor of the explicit deepseek-v4-flash and deepseek-v4-pro model IDs introduced with DeepSeek V4 on April 24, 2026.[^1][^18]

Background

DeepSeek V3

DeepSeek V3 was released in December 2024 as a 671-billion-parameter MoE model trained on 14.8 trillion tokens.[^2] It achieved strong performance on general-purpose benchmarks at a fraction of the training cost of comparable Western frontier models, attracting significant attention in the research community. A subsequent update, V3-0324, released in March 2025, improved coding performance further but remained a non-reasoning model, meaning it could not produce extended chain-of-thought traces.[^5]

DeepSeek-R1

DeepSeek-R1 was released in January 2025 as DeepSeek's dedicated reasoning model. Trained via reinforcement learning using Group Relative Policy Optimization (GRPO), R1 demonstrated strong performance on mathematics and competitive programming. However, it operated in a single inference mode (chain-of-thought only), lacked native structured tool-use support, and produced long reasoning traces that increased latency significantly compared to non-reasoning models. The time to first token for R1 on complex tasks was measured at over 100 seconds in some configurations, making it impractical for latency-sensitive applications.[^5]

R1-0528 and the gap between V and R lines

In May 2025, DeepSeek released R1-0528, an updated reasoning model with improved performance across mathematics, coding, and science benchmarks. Despite this, R1-0528 remained a reasoning-only model and could not perform structured tool calls or operate in non-thinking mode. At the same time, V3-0324, while fast and versatile, could not perform deep chain-of-thought reasoning. The community identified a clear gap: no single DeepSeek model could switch between fast general-purpose responses and deep reasoning on demand.[^12]

V3.1 was built to address this gap directly, and Sebastian Raschka subsequently characterized it as the "transition checkpoint" between V3's instruct-only lineage and V3.2's sparse-attention efficiency research, noting that V3.1, V3.1-Base, and V3 all share the same underlying architecture.[^14]

Release

DeepSeek V3.1 was announced on August 19, 2025. The release was characteristically low-key for DeepSeek: the model appeared on Hugging Face under the repository deepseek-ai/DeepSeek-V3.1 before any formal blog post, with the initial announcement made through DeepSeek's official WeChat group rather than a press release.[^8][^10] This contrasted with the more prominent launch strategies of Western AI laboratories. The formal API change-log entry followed on August 21, 2025, with both deepseek-chat and deepseek-reasoner endpoints simultaneously upgraded to V3.1 and tagged with a new "hybrid reasoning architecture" description.[^1]

The model was made available simultaneously via DeepSeek's API, through Hugging Face model repositories, through ModelScope, and via the official DeepSeek web and mobile applications.[^1][^2] Third-party inference providers including Together AI,[^6] OpenRouter,[^11] SambaNova,[^16] Fireworks AI, and Google Cloud's Vertex AI Model Garden subsequently added V3.1 to their offerings within days.[^10]

The base model, DeepSeek-V3.1-Base, was also released separately on Hugging Face for researchers who wished to apply their own fine-tuning or post-training procedures.[^2] On Hugging Face, the base repository lists 685 billion parameters in total because the publicly distributed checkpoint includes both the 671B main model and an additional ~14B Multi-Token Prediction (MTP) module inherited from V3; the inference-time main model remains 671B with 37B active per token, as in V3.[^21][^14]

Alongside the August 21 announcement, DeepSeek also added beta support for "Strict Function Calling" and Anthropic-format API compatibility, broadening the set of agent frameworks that could target DeepSeek's endpoints without a translation layer.[^1]

Hybrid thinking mode

The defining feature of V3.1 is its hybrid inference architecture, which allows a single model to operate in either thinking mode (extended chain-of-thought reasoning) or non-thinking mode (direct response) depending on a parameter set in the chat template.[^2]

How it works

The mode selection is implemented through special tokens added to the assistant turn prefix. In thinking mode, the assistant turn begins with <think>, signaling the model to produce a reasoning trace before the final answer. In non-thinking mode, the assistant turn begins with </think> (an immediately closed think tag), signaling the model to skip chain-of-thought generation and respond directly.[^2]

The exact prefixes documented in the official model card are:

# Thinking mode
<｜begin▁of▁sentence｜>{system}<｜User｜>{query}<｜Assistant｜><think>

# Non-thinking mode
<｜begin▁of▁sentence｜>{system}<｜User｜>{query}<｜Assistant｜></think>

This design means the two modes share a single set of model weights and are differentiated purely at the tokenization level.[^2][^14] The model was trained to recognize these token patterns and generate behavior accordingly. Downstream, the thinking boolean parameter in the apply_chat_template call handles the correct prefix:

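from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.1")
messages = [{"role": "user", "content": "Who are you?"}]

# Thinking mode: the rendered prompt ends with <think>
tokenizer.apply_chat_template(messages, thinking=True, add_generation_prompt=True)

# Non-thinking mode: the rendered prompt ends with </think>
tokenizer.apply_chat_template(messages, thinking=False, add_generation_prompt=True)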
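The prefix layout quoted from the model card can also be reproduced by hand, which is useful for checking what a chat template emits. The following is a minimal sketch covering a single-turn conversation only; the helper name build_prompt is hypothetical and is not part of any DeepSeek library.

```python
# Special tokens copied from the prefixes shown in the model card excerpt.
BOS = "<｜begin▁of▁sentence｜>"
USER = "<｜User｜>"
ASSISTANT = "<｜Assistant｜>"


def build_prompt(system: str, query: str, thinking: bool) -> str:
    """Render a single-turn prompt that ends in the mode-selecting tag.

    Hypothetical helper: only the final tag differs between the two modes.
    """
    mode_tag = "<think>" if thinking else "</think>"
    return f"{BOS}{system}{USER}{query}{ASSISTANT}{mode_tag}"


thinking_prompt = build_prompt("You are helpful.", "What is 2 + 2?", thinking=True)
direct_prompt = build_prompt("You are helpful.", "What is 2 + 2?", thinking=False)
print(thinking_prompt)
print(direct_prompt)
```

Multi-turn conversations and tool definitions add further tokens, so the tokenizer's own chat template remains the safer choice for real use.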

On the API side, the same toggle is available via the thinking (or reasoning_enabled) parameter in chat completion requests, mirroring the interface that was already familiar to users of the separate deepseek-reasoner endpoint.[^1][^11]
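When the model is run locally in thinking mode, the raw completion has to be separated into the reasoning trace and the final answer. The sketch below assumes that the completion contains the trace followed by a closing </think> tag and then the answer, which follows from the prompt already ending in <think>; the function name split_reasoning is hypothetical.

```python
def split_reasoning(completion: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer).

    Assumes the prompt already ended with <think>, so the completion holds
    the trace, then a closing </think>, then the answer. In non-thinking
    mode no closing tag is expected and the whole text is the answer.
    """
    reasoning, tag, answer = completion.partition("</think>")
    if not tag:
        return "", completion.strip()
    return reasoning.strip(), answer.strip()


trace, answer = split_reasoning("2 + 2 is 4.</think>The answer is 4.")
print(trace)
print(answer)
```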

Why one model instead of two

DeepSeek's approach with V3.1 differs from the conventional two-model paradigm (a fast general model plus a slow reasoning model). A single hybrid model offers several practical advantages. Operators do not need to maintain separate API endpoints or routing logic. Users can switch modes within the same conversation context without starting over. The model can, in principle, calibrate reasoning depth to task difficulty rather than committing entirely to one mode per request.[^7]

This approach had precedent in the broader industry. Qwen3, released by Alibaba's Tongyi team in April 2025, used a comparable thinking/non-thinking toggle via <think> and </think> tags. However, DeepSeek V3.1's scale (671B total parameters versus Qwen3's various sizes) and its agent-specific post-training made it distinct in scope.[^14]

Thinking mode performance versus R1-0528

DeepSeek stated that V3.1 in thinking mode achieves comparable answer quality to R1-0528 while responding more quickly, due to shorter reasoning traces.[^2] Internal testing cited a 20 to 50 percent reduction in output tokens during chain-of-thought generation compared to R1-0528 on equivalent problems, attributed to training optimizations that produce more concise reasoning.[^5][^7] On the AIME 2024 benchmark, V3.1 thinking mode scored 93.1% Pass@1, slightly above R1-0528's 91.4%, while V3.1 non-thinking mode scored 66.3% on the same benchmark.[^2]

Architecture

DeepSeek V3.1 uses the same underlying transformer architecture as DeepSeek V3.[^14] There were no structural changes to the transformer design between versions; the differences come from the base model's extended context training, the post-training stage that introduced hybrid reasoning, and the additional agent-focused fine-tuning.[^2][^14]

Mixture of Experts

V3.1 uses the DeepSeekMoE architecture, which routes each token to a small subset of expert feed-forward networks rather than activating all parameters.[^21] The model has 671 billion total parameters but activates approximately 37 billion per token (roughly 5.5% of the total), keeping inference compute and memory bandwidth requirements substantially lower than a dense model of equivalent capacity.[^2] DeepSeek's custom routing mechanism reduces load imbalance across experts, which is a known challenge with MoE designs at this scale.[^21]
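The idea of routing each token to a few experts can be illustrated with a toy top-k router. This is a generic sketch, not DeepSeek's actual routing mechanism: the softmax gating, the four experts, and k = 2 are all illustrative assumptions. The last line reproduces the active-parameter fraction quoted above.

```python
import math


def route_token(logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalise their weights.

    Generic top-k softmax gating, used here purely for illustration.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: weight / total for i, weight in exps.items()}


# Hypothetical router scores for one token over four experts.
gates = route_token([0.1, 2.0, -1.0, 1.5], k=2)
print(gates)

# Share of parameters active per token, from the figures in the text.
active_fraction = 37 / 671
print(f"{active_fraction:.1%}")
```

Only the selected experts run their feed-forward computation for that token, which is why compute scales with the active parameters rather than the total.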

Multi-head Latent Attention

V3.1 retains Multi-head Latent Attention (MLA), a KV cache compression technique DeepSeek introduced in DeepSeek-V2. MLA compresses the key and value tensors into a lower-dimensional latent space before caching them, dramatically reducing KV cache memory consumption at the cost of a small projection step during inference.[^14] This makes 128K-context window inference feasible on practical hardware configurations.
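The effect of caching a compressed latent instead of full keys and values can be estimated with simple arithmetic. The sketch below is a back-of-envelope calculation, not a measurement. The dimensions (128 heads of width 128, a 512-wide latent, a 64-wide positional key, 61 layers, 16-bit cache entries) are assumptions chosen for illustration and are not stated in this article.

```python
def mha_elements_per_token(n_heads: int, head_dim: int) -> int:
    """Standard multi-head attention caches one key and one value per head."""
    return 2 * n_heads * head_dim


def mla_elements_per_token(latent_dim: int, rope_dim: int) -> int:
    """MLA is assumed to cache one shared latent plus a small positional key."""
    return latent_dim + rope_dim


def cache_gib(elements_per_token: int, n_layers: int, n_tokens: int,
              bytes_per_element: int = 2) -> float:
    """Total cache size in GiB for one sequence."""
    return elements_per_token * n_layers * n_tokens * bytes_per_element / 2**30


# Illustrative (assumed) dimensions, evaluated at a 128K-token context.
context = 128 * 1024
mha_gib = cache_gib(mha_elements_per_token(128, 128), n_layers=61, n_tokens=context)
mla_gib = cache_gib(mla_elements_per_token(512, 64), n_layers=61, n_tokens=context)
print(mha_gib, mla_gib, mha_gib / mla_gib)
```

Under these assumed dimensions the uncompressed cache would be hundreds of GiB per sequence while the latent cache stays under 10 GiB, which illustrates why compression matters at long context lengths.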

Multi-Token Prediction

V3.1-Base inherits the Multi-Token Prediction (MTP) auxiliary module introduced in V3, which predicts the next two tokens jointly during training as an auxiliary objective and adds approximately 14 billion parameters to the checkpoint.[^21] At inference time, the MTP head can optionally be used for speculative decoding to increase throughput; without it, the model behaves as a standard 671B-parameter MoE.[^14] The combined 685B count reported in some sources (such as the Hugging Face base repository tag) reflects the 671B main model plus the MTP module.[^21]

FP8 precision

Like its predecessor, V3.1 uses the UE8M0 FP8 format for both weights and activations, supported alongside BF16 and FP32 tensor types in the published checkpoint.[^2][^21] The base model weights are stored in FP8, and inference is performed in FP8 where supported, reducing memory footprint and improving throughput on compatible hardware. A known issue noted in the V3.1-Terminus release card stated that self_attn.o_proj parameters did not conform to the UE8M0 scale format in that checkpoint; DeepSeek acknowledged this and indicated it would be corrected in a future release.[^3] The mlp.gate.e_score_correction_bias parameters must additionally be loaded and computed in FP32 precision for correct routing behavior.[^2]
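UE8M0 denotes a scale format with eight unsigned exponent bits and no mantissa bits, so every scale is a power of two. The sketch below shows one way such a scale could be encoded. It assumes an exponent bias of 127 with the byte 255 left unused, and it rounds scales upward; these conventions are assumptions for illustration, not details confirmed by the model card.

```python
import math

BIAS = 127  # assumed exponent bias


def encode_ue8m0(scale: float) -> int:
    """Round a positive scale up to a power of two and return its exponent byte."""
    if not math.isfinite(scale) or scale <= 0:
        raise ValueError("scale must be positive and finite")
    exponent = math.ceil(math.log2(scale))
    return max(0, min(254, exponent + BIAS))  # clamp into the assumed valid range


def decode_ue8m0(byte: int) -> float:
    """Recover the power-of-two scale from its exponent byte."""
    return 2.0 ** (byte - BIAS)


for raw_scale in (1.0, 0.3, 3.0):
    byte = encode_ue8m0(raw_scale)
    print(raw_scale, byte, decode_ue8m0(byte))
```

Rounding upward is a design choice in this sketch: a scale that is never smaller than the original keeps the scaled values inside the FP8 range at the cost of slightly coarser resolution.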

In November 2025, NVIDIA published an NVFP4-quantized variant of V3.1 (and parallel versions for V3.1-Terminus and V3.2-Exp) targeting Blackwell B200 GPUs via TensorRT-LLM. NVFP4 reduces per-parameter storage from 8 to 4 bits, yielding roughly 1.6x lower disk and GPU memory footprint while preserving accuracy within reported margins; deployment requires 8 B200 GPUs and a TensorRT-LLM build from source.[^19]

Context window extension

DeepSeek V3's context window was 64K tokens. V3.1 extends this to 128K tokens through a two-phase long-context training procedure applied to V3.1-Base before the hybrid reasoning post-training stage. The two phases combine for approximately 840 billion tokens of continued pretraining over the V3 checkpoint, which is the principal investment in V3.1-Base relative to V3.[^2][^7]

Phase 1: 32K extension

The first phase extended the context from the original training length to 32K tokens using 630 billion tokens of continued pretraining. DeepSeek described this as a ten-fold increase in training tokens over the corresponding 32K extension phase used for V3, indicating substantially more investment in long-context capability.[^2]

Phase 2: 128K extension

The second phase further extended the context to 128K tokens using 209 billion additional training tokens, which DeepSeek described as a 3.3-fold increase over the corresponding 128K extension phase used for V3.[^2] The model was evaluated on the Needle in a Haystack (NIAH) test suite to verify retrieval accuracy across the full 128K range.[^14]

Google Cloud's Vertex AI deployment of V3.1 reported a context window of 163,840 tokens for their managed API offering,[^10] and OpenRouter likewise advertises a 163,840-token window;[^11] this reflects infrastructure-level padding applied by hosting providers rather than a separate base context length.
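The figures in this section can be cross-checked with a few lines of arithmetic. The sketch below only uses numbers quoted above; reading "128K" as 128 × 1024 tokens is an assumption.

```python
# Continued-pretraining budget per phase, in billions of tokens.
PHASE_TOKENS_B = {"32K extension": 630, "128K extension": 209}

total_b = sum(PHASE_TOKENS_B.values())
shares = {name: tokens / total_b for name, tokens in PHASE_TOKENS_B.items()}
print(total_b, shares)

# Context windows in tokens; 128K is assumed to mean 128 * 1024.
native_window = 128 * 1024
hosted_window = 163_840
print(native_window, hosted_window / native_window)
```

The two phases sum to 839 billion tokens, which matches the "approximately 840 billion" total, and the 163,840-token window advertised by hosting providers works out to exactly 160 × 1024 tokens.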

Agent capabilities

V3.1 received substantial post-training focused on agentic workflows, which DeepSeek identified as a strategic priority. The model's agent-related performance improvements over V3-0324 are among the largest gains in the release.[^7]

Code agents

On SWE-bench Verified, V3.1 scored 66.0%, compared to 45.4% for V3-0324.[^2][^5] This benchmark tests whether a model can autonomously identify and fix real-world software bugs in GitHub repositories, making it a measure of practical software engineering capability rather than isolated code generation. The 45% relative improvement is substantially larger than gains on general benchmarks, reflecting the targeted post-training. V3.1 also scored 54.5% on SWE-bench Multilingual, compared with R1-0528's 30.5% on the same benchmark.[^2]

Search agents

On BrowseComp, a benchmark that evaluates web browsing and research tasks, V3.1 scored 30.0% versus 8.9% for R1-0528.[^2] This roughly 237% relative improvement reflects V3.1's specialized training for search-augmented workflows. On the Chinese-language variant, BrowseComp-zh, V3.1 scored 49.2% versus 35.7% for R1-0528.[^2] The model uses dedicated control tokens (<search_begin> and <search_end>) to format web search calls within the thinking mode trace, enabling structured retrieval during reasoning.[^7] V3.1-Terminus subsequently shipped a revised search-agent template documented in assets/search_tool_trajectory.html in the model repository, which improved BrowseComp by another 8.5 percentage points.[^3]

Terminal tasks

On Terminal-bench, which evaluates autonomous task execution in a terminal environment, V3.1 scored 31.3% compared to 13.3% for V3-0324 and only 5.7% for R1-0528.[^2]
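The relative gains quoted for the agent benchmarks follow from the raw scores. A minimal sketch of that arithmetic, using only the scores reported in this section:

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100


# (new score, baseline score), both in percent, as reported above.
SCORES = {
    "SWE-bench Verified (vs V3-0324)": (66.0, 45.4),
    "BrowseComp (vs R1-0528)": (30.0, 8.9),
    "Terminal-bench (vs V3-0324)": (31.3, 13.3),
}

for name, (new, old) in SCORES.items():
    print(f"{name}: +{new - old:.1f} points, {relative_gain(new, old):.0f}% relative")
```

Relative gains look largest where the baseline was low, so the absolute difference in percentage points is worth reading alongside them.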

Tool calling

V3.1 introduced native structured tool calling with a standardized JSON-based format, using dedicated <｜tool▁calls▁begin｜> and corresponding end markers in the chat template.[^2] Tool definitions are specified in the system prompt or via the API's tools parameter, and the model returns tool calls in a structured format that downstream systems can parse and execute. This capability was not available in the original R1 release.[^5][^7] Alongside V3.1, DeepSeek also opened a beta of "Strict Function Calling," which constrains the model to produce JSON that strictly validates against the supplied tool schema, reducing post-processing for production agent pipelines.[^1] Anthropic-format API support shipped at the same time, allowing Claude-targeted clients to call V3.1 with minimal changes.[^1]
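The kind of check that schema-constrained function calling makes unnecessary can be sketched as a small validator. The tool definition below follows the widely used OpenAI-style tools layout, which is an assumption about the request format; the get_weather tool and the validate_arguments helper are hypothetical, and only a small subset of JSON Schema is handled.

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" layout (assumed).
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}, "days": {"type": "integer"}},
            "required": ["city"],
        },
    },
}

_JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_arguments(tool: dict, raw_arguments: str) -> dict:
    """Parse a tool call's JSON arguments and check them against the schema."""
    schema = tool["function"]["parameters"]
    args = json.loads(raw_arguments)
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    missing = [key for key in schema.get("required", []) if key not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            raise ValueError(f"unexpected argument: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        wrong_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if wrong_bool or not isinstance(value, _JSON_TYPES[spec["type"]]):
            raise ValueError(f"{key} should be of type {spec['type']}")
    return args


print(validate_arguments(WEATHER_TOOL, '{"city": "Paris", "days": 3}'))
```

A pipeline without strict mode would typically run a check like this before executing the tool and ask the model to retry on failure.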

V3.1-Terminus

On September 22, 2025, DeepSeek released DeepSeek-V3.1-Terminus as an incremental update to V3.1.[^4] The update was motivated by user-reported issues in production deployments rather than research-driven improvements.[^17]

Changes from V3.1

The primary fix in V3.1-Terminus addressed language consistency: users had observed that the model occasionally produced outputs mixing Chinese and English when responding in an English context, and sometimes generated rare or anomalous characters. V3.1-Terminus reduced the frequency of these issues substantially.[^3][^4]

Agent capabilities also received additional optimization. The search agent template and tool set were updated (documented in assets/search_tool_trajectory.html in the repository), and both code agent and search agent performance improved measurably on benchmarks.[^3]

DeepSeek described the overall result as "more stable and reliable outputs across benchmarks compared to the previous version."[^4]

Benchmark improvements from V3.1 to V3.1-Terminus

The following changes were reported on reasoning-mode benchmarks (without tool use):[^3]

Benchmark	V3.1	V3.1-Terminus	Change
MMLU-Pro	84.8%	85.0%	+0.2%
GPQA-Diamond	80.1%	80.7%	+0.6%
Humanity's Last Exam	15.9%	21.7%	+5.8%
LiveCodeBench	74.8%	74.9%	+0.1%
Codeforces Rating	2091	2046	-45

Agentic tool-use benchmarks showed larger changes, most of them improvements:[^3]

Benchmark	V3.1	V3.1-Terminus	Change
BrowseComp	30.0%	38.5%	+8.5%
BrowseComp-zh	49.2%	45.0%	-4.2%
SimpleQA	93.4%	96.8%	+3.4%
SWE-bench Verified	66.0%	68.4%	+2.4%
SWE-bench Multilingual	54.5%	57.8%	+3.3%
Terminal-bench	31.3%	36.7%	+5.4%

The negative deltas on Codeforces (-45 rating points) and BrowseComp-zh (-4.2 points) were noted by community reviewers, who interpreted the Codeforces drop as evidence that the Terminus post-training prioritized agent and tool-use stability over the kind of single-shot competitive-programming reasoning that drives the rating system upward.[^20] DeepSeek did not publicly comment on the regressions.

Availability

DeepSeek-V3.1-Terminus weights were released on Hugging Face under the MIT License.[^3] The model replaced V3.1 on DeepSeek's official API, web application, and mobile application on September 22, 2025.[^4] Third-party providers including SambaNova,[^16] OpenRouter, and DeepInfra added support for V3.1-Terminus within the following weeks.[^17]

Benchmarks

The following table compares DeepSeek V3.1 (both modes), V3-0324, and R1-0528 across major benchmark categories. All numbers other than the Codeforces rating are reported as Pass@1 or accuracy percentages.[^2][^5]

Benchmark	V3.1 (non-thinking)	V3.1 (thinking)	V3-0324	R1-0528
MMLU-Redux	91.8%	93.7%	90.5%	93.4%
MMLU-Pro	83.7%	84.8%	81.2%	85.0%
GPQA-Diamond	74.9%	80.1%	68.4%	81.0%
LiveCodeBench Pass@1	56.4%	74.8%	43.0%	73.3%
Aider-Polyglot	62.1%	76.3%	52.6%	74.9%
AIME 2024 Pass@1	66.3%	93.1%	59.4%	91.4%
AIME 2025 Pass@1	49.8%	88.4%	47.2%	87.5%
HMMT 2025	33.5%	84.2%	29.3%	82.9%
SWE-bench Verified	66.0%	n/a	45.4%	44.6%
SWE-bench Multilingual	54.5%	n/a	n/a	30.5%
BrowseComp	n/a	30.0%	n/a	8.9%
BrowseComp-zh	n/a	49.2%	n/a	35.7%
Terminal-bench	31.3%	n/a	13.3%	5.7%
SimpleQA	33.5%	93.4%	31.2%	92.7%

Benchmark notes: SWE-bench Verified results for V3.1 use non-thinking mode with tool calling enabled. BrowseComp results use thinking mode with search tool access. Codeforces ratings are not directly comparable to percentage pass scores and are listed separately. V3.1 thinking mode scored a Codeforces-Div1 rating of approximately 2091, above R1-0528's 1930.[^2]

License

DeepSeek V3.1 and V3.1-Terminus are both released under the MIT License, the same terms applied to earlier DeepSeek models including V3 and R1.[^2][^3] The MIT License permits commercial use, modification, distribution, and private use, with the only requirement being that the copyright notice and license text are preserved in all copies or substantial portions of the software. No DeepSeek-specific acceptable-use restrictions or use-case limitations are attached to V3.1 weights.[^17]

The base model weights (DeepSeek-V3.1-Base and DeepSeek-V3.1-Terminus-Base) are available for download from Hugging Face and ModelScope.[^2][^3] Users deploying the full 671B model require approximately 1,400 to 1,500 GB of GPU memory for full-precision loading, which in practice requires multi-node configurations or quantized variants. Community-produced GGUF quantizations reduce memory requirements substantially while accepting some performance degradation, and the NVIDIA NVFP4 quantization released on Hugging Face reduces the disk and memory footprint by roughly 1.6x for Blackwell-class deployments.[^19]
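The memory figure above can be sanity-checked from the parameter count. The sketch below estimates raw weight storage only and assumes that "full precision" here means 16-bit weights; activations, KV cache, and framework overhead are ignored, which is one plausible reason the quoted range sits somewhat above the raw number.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Raw weight storage in decimal gigabytes (weights only)."""
    return params_billion * bits_per_param / 8


# Parameter counts from the article: 671B main model, 685B with the MTP module.
for label, params in (("main model", 671), ("checkpoint with MTP", 685)):
    print(label, weights_gb(params, 16), "GB at 16-bit,", weights_gb(params, 8), "GB at FP8")
```

At 16 bits per parameter the full checkpoint needs roughly 1,370 GB for weights alone, while the native FP8 weights need about half of that.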

Pricing

DeepSeek's V3.1 launch announcement on August 21, 2025 also confirmed a pricing reset that took effect on September 5, 2025 at 16:00 UTC, ending the off-peak discount tier that had applied to V3 and R1.[^1][^22] Under the post-reset schedule, both deepseek-chat (non-thinking) and deepseek-reasoner (thinking) follow the same per-million-token rates:[^22]

Tier	Rate
Input (cache hit)	$0.07 / 1M tokens
Input (cache miss)	$0.56 / 1M tokens
Output	$1.68 / 1M tokens

This represented an increase from V3-0324's off-peak pricing (cache-miss input rose from $0.27 to $0.56 per 1M tokens, and output rose from $1.10 to $1.68 per 1M tokens) but unified the rates across both modes for the first time.[^22] V3.1-Terminus inherited the same schedule when it replaced V3.1 on the official API on September 22, 2025.[^4]
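The effect of the cache-hit rate on a request's cost can be worked out directly from the post-reset schedule. The sketch below uses the three rates from the table; the function name and the example token counts are hypothetical.

```python
# Post-reset rates in USD per one million tokens, from the table above.
RATES_USD_PER_M = {"input_cache_hit": 0.07, "input_cache_miss": 0.56, "output": 1.68}


def request_cost_usd(cached_input: int, uncached_input: int, output: int) -> float:
    """Cost of one request given its token counts in each billing tier."""
    total = (
        cached_input * RATES_USD_PER_M["input_cache_hit"]
        + uncached_input * RATES_USD_PER_M["input_cache_miss"]
        + output * RATES_USD_PER_M["output"]
    )
    return total / 1_000_000


# Hypothetical agent step: 20,000 cached prompt tokens, 500 new input tokens, 800 output tokens.
with_cache = request_cost_usd(20_000, 500, 800)
without_cache = request_cost_usd(0, 20_500, 800)
print(f"{with_cache:.6f} USD with cache, {without_cache:.6f} USD without")
```

For prompts dominated by a repeated system prompt and tool definitions, most of the input is billed at the cache-hit rate, so the output tokens become the main cost.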

A further price change, this time a reduction, followed one week after the Terminus release: on September 29, 2025, DeepSeek announced that with the release of V3.2-Exp, API prices on both endpoints would drop by 50% or more "effective immediately," reflecting the inference-efficiency gains from DeepSeek Sparse Attention.[^15]

Third-party providers set their own rates. OpenRouter lists V3.1 at $0.21 per million input tokens and $0.79 per million output tokens with a 163,840-token context, routing requests across multiple providers and aggregating upwards of 60 billion tokens per week of V3.1 traffic in early 2026.[^11] In independent measurements, Together AI was the fastest non-thinking provider for V3.1 at approximately 254 output tokens per second, followed by SambaNova at 181 tokens per second and Amazon at 175 tokens per second.[^23]

These prices positioned V3.1 substantially below frontier Western models with comparable capabilities; OpenAI's o3 and Anthropic's Claude Opus 4, which offer broadly comparable reasoning performance, carried output prices an order of magnitude higher at the time.[^9]

DeepSeek's V3.1 API pricing was superseded by later updates. With the release of V3.2 in December 2025 and the V4 series in April 2026, V3.1 and V3.1-Terminus were scheduled for deprecation, with deepseek-chat and deepseek-reasoner endpoint aliases retiring at 15:59 UTC on July 24, 2026; during the transition the aliases route to V4-Flash's non-thinking and thinking modes respectively.[^1][^18]

Comparison with related models

DeepSeek V3.1 versus DeepSeek V3

DeepSeek V3 (including the V3-0324 update) is a non-thinking model. It cannot produce chain-of-thought reasoning and scored 59.4% on AIME 2024 compared to V3.1's 93.1% in thinking mode. V3.1's non-thinking mode also outperforms V3-0324 on most benchmarks, suggesting that the post-training improvements contributed to general capability gains independent of the reasoning mode addition.[^2][^5]

The context window grew from 64K (V3-0324) to 128K (V3.1), doubling the amount of material the model can consider in a single request,[^2] and the 840 billion additional tokens of long-context continued pretraining was the largest single training investment that distinguished V3.1-Base from V3.[^7]

DeepSeek V3.1 versus DeepSeek-R1

DeepSeek-R1 (and its May 2025 update R1-0528) is a dedicated reasoning model with no non-thinking mode. R1-0528 produces long chain-of-thought traces before every answer, resulting in substantially higher latency than V3.1's non-thinking mode (over 100 seconds to first token for complex tasks, versus approximately 2.9 seconds for V3.1 non-thinking).[^5] V3.1 thinking mode achieves comparable benchmark scores to R1-0528 on math and coding while producing shorter reasoning traces.[^2]

The most practically significant difference is tool calling. R1-0528 did not natively support structured tool use when it was released, making it unsuitable for direct integration into agentic pipelines without additional scaffolding. V3.1 includes native tool calling in both modes.[^7]

DeepSeek V3.1 versus DeepSeek V3.2-Exp

DeepSeek-V3.2-Exp was released on September 29, 2025, one week after V3.1-Terminus. It was built on V3.1-Terminus as its base and introduced DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism designed to improve inference efficiency for long-context workloads.[^15] DSA pairs a lightweight "lightning indexer" with a token selector to reduce the attention operation's complexity from quadratic O(L²) to linear O(Lk), where k is the small number of selected key tokens; DeepSeek reported 2-3x faster long-text processing and 30-40% lower memory use with negligible quality regression, alongside the simultaneous API price cut of 50% or more.[^15] V3.2-Exp performs roughly on par with V3.1-Terminus on most benchmarks, with the primary motivation being architectural efficiency research rather than raw capability improvement.[^14][^15] DeepSeek positioned V3.2-Exp as a research preview rather than a production-ready replacement, and the V3.2 stable release followed on December 1, 2025.[^1]
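The select-then-attend pattern behind sparse attention can be shown with a toy example for a single query. This is a generic top-k sketch and not DeepSeek's implementation: in DSA the index scores come from a learned lightning indexer, whereas here they are simply passed in as hypothetical numbers.

```python
import math


def sparse_attention(query, keys, values, index_scores, k):
    """Attend from one query to only the k keys with the highest index scores."""
    # Token selection: keep the k positions the indexer rates highest.
    selected = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]
    # Scaled dot-product attention restricted to the selected positions.
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * x for q, x in zip(query, keys[i])) for i in selected]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [
        sum(w / total * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return selected, output


# Hypothetical inputs: four cached tokens with two-dimensional keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
selected, output = sparse_attention([0.2, 0.7], keys, values, [0.1, 0.9, 0.2, 0.8], k=2)
print(selected, output)
```

Each query computes attention logits for only k positions instead of all L, which is where the O(Lk) cost of the main attention step comes from.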

The following table summarizes the comparison across the V3 lineage and R1:

Feature	V3-0324	V3.1	V3.1-Terminus	V3.2-Exp	V3.2	R1-0528
Parameters (total, checkpoint)	685B	685B (671B + 14B MTP)	685B	685B	685B	685B
Parameters (active)	~37B	37B	37B	37B	37B	~37B
Context window	128K	128K	128K	128K	128K	128K
Thinking mode	No	Yes	Yes	Yes	Yes	Yes (only)
Non-thinking mode	Yes	Yes	Yes	Yes	Yes	No
Native tool calling	No	Yes	Yes	Yes	Yes	No
Sparse attention	No	No	No	Yes (DSA)	Yes (DSA)	No
SWE-bench Verified	45.4%	66.0%	68.4%	~68%	~68%	44.6%
AIME 2024 (thinking mode where available)	59.4%	93.1%	~93%	~93%	~93%	91.4%
License	MIT	MIT	MIT	MIT	MIT	MIT
Release date	March 2025	Aug 2025	Sep 2025	Sep 2025	Dec 2025	May 2025

Use cases

DeepSeek V3.1's combination of hybrid reasoning, expanded context, native tool calling, and open weights supports a range of deployment scenarios.

Software engineering automation

The 45% relative improvement on SWE-bench Verified over V3-0324 made V3.1 one of the strongest available models for automated software engineering at its release.[^2][^5] Developers using it through code agent frameworks (such as Aider, OpenHands, or SWE-agent) benefit from its ability to read and modify code across large codebases within the 128K context window, call external tools such as test runners, and apply chain-of-thought reasoning when debugging complex failures.

Document analysis and retrieval

The 128K context window allows V3.1 to ingest full-length technical documents, legal contracts, research papers, or codebases in a single request.[^7] In non-thinking mode, the model can respond quickly to document-level queries with minimal latency. In thinking mode, it can perform multi-step reasoning over document content before producing a structured analysis.

Research agents

V3.1's search agent capabilities, demonstrated by the BrowseComp score of 30.0% (and 38.5% in V3.1-Terminus),[^2][^3] allow it to operate in pipelines that call external search APIs during the reasoning trace via the dedicated <search_begin> and <search_end> control tokens.[^7] This makes it suitable for applications that require the model to retrieve and synthesize current information rather than relying entirely on pretraining knowledge.

Cost-sensitive production deployments

At $1.68 per million output tokens on DeepSeek's official API after the September 5, 2025 reset, V3.1 offered reasoning-capable performance at a cost substantially below Western frontier models with comparable benchmark results.[^22] Developers and organizations running high-volume inference found the cost differential meaningful for production use cases where reasoning quality matters but budget is constrained, and the cache-hit input rate of $0.07 per 1M tokens further reduced cost for workloads with repeated context (such as agents replaying the same system prompt and tool definitions across many requests).

Local and on-premises deployment

As an MIT-licensed open-weight model, V3.1 is available for on-premises deployment.[^2] Organizations with data residency requirements or those preferring not to send data to third-party APIs can self-host the model. The full 671B model requires large-scale GPU infrastructure, but community quantizations in GGUF and other formats trade some output quality for lower hardware requirements, and NVIDIA's NVFP4 build runs the full model on 8 B200 GPUs in a single node.[^19]

Reception

DeepSeek V3.1 received positive attention from the developer community following its August 2025 release. The model became one of the most downloaded models on Hugging Face within days of launch, reaching a top-five position in overall download volume.[^13]

Developers testing V3.1 in code agent configurations reported that its SWE-bench improvements translated to real-world usability gains, with the model successfully resolving issues that V3-0324 had failed on.[^5] The hybrid mode was widely noted as a practical improvement: the ability to use the same model for both quick responses and deep analysis reduced infrastructure complexity for teams running multi-agent pipelines.[^7]

The low-key announcement style drew comment. Several observers noted that DeepSeek released a model with industry-leading agent benchmarks with less ceremony than competitors typically apply to incremental updates.[^8] This was consistent with DeepSeek's pattern of prioritizing Hugging Face and WeChat announcements over press-release-style launches.

The September 5, 2025 pricing reset was less well received than the model itself: external commentary noted that cache-miss input pricing more than doubled relative to the off-peak V3 rate, and output token prices rose by over 50%, eroding some of the cost advantage that had defined DeepSeek's earlier positioning.[^22] The subsequent V3.2-Exp price cut of 50%+ a few weeks later partly reversed that move.[^15]

The September 2025 V3.1-Terminus update was covered by VentureBeat and other technology publications as a quick turnaround on user-reported quality issues, with the language mixing fix being highlighted as particularly responsive to production feedback.[^17]

Some researchers noted that the simultaneous availability of V3.1 and V3.2-Exp within the same month (September 2025) created ambiguity about which model was the recommended production version. DeepSeek's API kept V3.1-Terminus as the default behind deepseek-chat and deepseek-reasoner while V3.2-Exp was offered as a separate endpoint. This largely resolved the practical question for API users until the V3.2 stable release in December 2025 superseded V3.1-Terminus on the legacy aliases.[^1]

Limitations

Despite strong benchmark performance, several limitations apply to V3.1 in practice.

The full 671B model requires approximately 1,400 to 1,500 GB of GPU memory for full-weight loading.[^2] This places full-precision local deployment out of reach for most individual researchers and many organizations without large GPU clusters. Quantized versions trade some performance for reduced memory requirements.

DeepSeek did not publish an explicit knowledge cutoff date for V3.1 in the model card or release announcement. V3.1-Base inherits pretraining data from V3, whose training data is reported by DeepSeek staff to extend through mid-2024; the additional 840 billion long-context tokens used in V3.1-Base extend coverage somewhat further, but no specific cutoff month was confirmed by DeepSeek for V3.1.[^2][^7] Events or developments after this period are not reflected in the model's pretraining knowledge, and the model may produce outdated or incorrect information about topics that evolved after the cutoff; in practice deployments use retrieval-augmented generation or the search agent control tokens to compensate.

The language consistency issues that V3.1-Terminus addressed were a real limitation of the original V3.1 release in production. Users deploying V3.1 for English-language applications encountered occasional Chinese-language intrusions in outputs, which required additional filtering or prompted migration to V3.1-Terminus.[^4][^17]

While V3.1 in thinking mode approaches R1-0528 on most reasoning benchmarks, R1-0528 retains a slight advantage on extended humanities reasoning and some creative tasks that benefit from prolonged deliberation.[^5][^14] The compression of thinking traces that makes V3.1 faster also means it invests less computation in some complex tasks than a dedicated reasoning model would. V3.1-Terminus's Codeforces rating regression of 45 points further illustrated that even within the V3.1 line, optimization for agent and tool-use stability can come at the cost of single-shot competitive-programming performance.[^20]

V3.1's training data composition and post-training details are not fully documented in any publicly available technical report, unlike the original DeepSeek V3, which had an accompanying paper (arXiv 2412.19437).[^21] The V3.1 release reused the same arXiv paper as documentation for architectural elements without a separate publication, limiting the degree to which the post-training procedure can be analyzed or reproduced by external researchers. A formal DeepSeek-V3.2 paper published in December 2025 (arXiv 2512.02556) covered DSA but did not retrospectively document V3.1's hybrid-reasoning training in detail.[^15]

V3.1 is also approaching end-of-life on the official API. The deepseek-chat and deepseek-reasoner endpoint aliases, which were upgraded to V3.1 at its launch, will be fully retired at 15:59 UTC on July 24, 2026, and users must migrate to the deepseek-v4-flash or deepseek-v4-pro model IDs to continue using DeepSeek's hosted inference. The V3.1 and V3.1-Terminus open weights remain freely usable under the MIT License.[^18]

References
