DeepSeek V4 is a family of open-weight Mixture of Experts large language models developed by DeepSeek, a Hangzhou-based AI research lab. Released in preview on April 24, 2026, V4 comprises two variants: DeepSeek-V4-Pro, with 1.6 trillion total parameters (49 billion active per token), and DeepSeek-V4-Flash, with 284 billion total parameters (13 billion active per token). Both models support a one-million-token context window and are available as open weights under the MIT license. The release introduced a hybrid attention architecture that dramatically reduces inference costs at long contexts, set new low price points for frontier-class models, and was the first DeepSeek release validated for inference on Huawei Ascend processor infrastructure.
DeepSeek was founded in 2023 as a research arm of the Chinese quantitative hedge fund High-Flyer Capital Management. The lab became globally known after releasing DeepSeek-V3 in December 2024, a 671-billion-parameter MoE model that outperformed or matched several leading Western models at a fraction of the training cost. Its companion reasoning model, DeepSeek-R1, released in January 2025, demonstrated that strong chain-of-thought reasoning could be elicited through reinforcement learning without relying solely on supervised distillation from proprietary models. The two releases together caused a brief shock in global financial markets, sent Nvidia stock down sharply on January 27, 2025, and intensified discussions about AI export controls.
Subsequent updates extended the V3 line through 2025. DeepSeek-V3-0324 arrived in March 2025 with improvements on AIME (+19.8 points) and GPQA (+9.3 points). DeepSeek-R1-0528 in May 2025 pushed reasoning further through increased post-training compute. DeepSeek-V3.1 in August 2025 introduced a hybrid architecture supporting both thinking and non-thinking modes within a single model and lifted SWE-bench Verified scores to 66.0. DeepSeek-V3.1-Terminus followed in September 2025 with agent capability improvements. DeepSeek-V3.2-Exp, also released in September 2025, introduced DeepSeek Sparse Attention as an experimental long-context optimization. The official V3.2, billed as the successor to V3.2-Exp, launched on December 1, 2025, with 685 billion total parameters; a parallel release, V3.2-Speciale, attained gold-medal-level results on IMO, CMO, ICPC World Finals, and IOI 2025.
V4 represents the first ground-up architectural redesign since V3. The total parameter count more than doubles that of V3.2, active parameters move from 37 billion to 49 billion (Pro) and 13 billion (Flash), and the context window quadruples from 256K to 1M tokens. The underlying attention mechanism is entirely new, replacing the Multi-head Latent Attention that had defined the V3 line.
The release came after a publicly reported delay. Chinese tech journalist Zhou Xinyu, writing for 36Kr, attributed the postponement to a mid-2025 attempt to migrate the training framework from Nvidia GPUs to Huawei Ascend NPUs that ran into significant instability. ChinaTalk also reported on internal funding decisions that pushed multimodal capability to a future generation, and on departures of senior staff to Tencent, ByteDance, Xiaomi, and other Chinese tech firms during 2025.
DeepSeek V4 ships in two sizes:
| Variant | Total parameters | Active parameters | Context | Precision | Size on disk |
|---|---|---|---|---|---|
| DeepSeek-V4-Flash | 284B | 13B | 1M tokens | FP8 + FP4 mixed | 160 GB |
| DeepSeek-V4-Pro | 1.6T | 49B | 1M tokens | FP8 + FP4 mixed | 862 GB |
Both models also ship in base (pre-training checkpoint) variants. The instruct variants support three reasoning modes: Non-Think (fast, no extended reasoning), Think High (deliberate logical analysis), and Think Max (maximum reasoning depth). DeepSeek recommends a minimum 384K-token context window when using Think Max mode so that long chains of thought are not truncated.
The central architectural innovation in V4 is a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). Both are departures from the Multi-head Latent Attention (MLA) used in V3 and from the DeepSeek Sparse Attention introduced in V3.2-Exp.
In CSA, a learned token-level compressor consolidates every m tokens along the sequence dimension into a single key-value entry through softmax-gated pooling with a learned positional bias. Queries then attend over these compressed KV representations using DeepSeek Sparse Attention, which selects only the top-k most relevant compressed blocks via a Lightning Indexer component implemented in FP4 precision. A complementary sliding window layer covers the most recent n_win tokens for local context. The net effect is a 4x compression of the KV cache along the sequence axis.
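The compression-and-selection step can be pictured with a minimal sketch, assuming softmax-gated pooling over blocks of m tokens and a dot-product relevance score for block selection; the gating parameterization, FP4 Lightning Indexer internals, and the sliding-window path are simplified, so this is illustrative rather than the published kernels.

```python
import torch
import torch.nn.functional as F

def compress_kv(keys, values, gate_proj, pos_bias, m=4):
    """Softmax-gated pooling: collapse every m tokens into one KV entry.

    keys, values: [seq, d]; gate_proj: [d] learned gating vector (assumed form);
    pos_bias: [m] learned positional bias on the gate logits.
    """
    seq, d = keys.shape
    n_blocks = seq // m                                   # drop any ragged tail for simplicity
    k_blk = keys[: n_blocks * m].reshape(n_blocks, m, d)
    v_blk = values[: n_blocks * m].reshape(n_blocks, m, d)
    gates = F.softmax(k_blk @ gate_proj + pos_bias, dim=-1).unsqueeze(-1)
    return (gates * k_blk).sum(dim=1), (gates * v_blk).sum(dim=1)

def select_top_blocks(query, compressed_keys, top_k=64):
    """Sparse selection: keep only the top-k most relevant compressed blocks."""
    scores = compressed_keys @ query                      # [n_blocks] relevance scores
    return torch.topk(scores, min(top_k, scores.numel())).indices

# 1024 tokens compressed 4x to 256 KV entries, then 32 blocks selected per query
keys, values = torch.randn(1024, 64), torch.randn(1024, 64)
k_c, v_c = compress_kv(keys, values, torch.randn(64), torch.randn(4))
blocks = select_top_blocks(torch.randn(64), k_c, top_k=32)
```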
HCA is more aggressive. It consolidates every m' tokens (where m' is considerably larger than m, with a 128x compression ratio cited in the technical report) into a single KV entry and applies dense attention over those consolidated entries. The high compression ratio alone delivers efficiency gains, without requiring the sparse selection step that CSA uses.
Layers in V4-Pro's 61-layer stack use these mechanisms in a specific pattern: layers 0 and 1 apply HCA only, layers 2 through 60 alternate between CSA and HCA, and a final multi-token-prediction (MTP) block uses sliding-window attention only. This layered scheme avoids capacity waste by giving different layers different attention patterns suited to local and global context retrieval.
The result at 1M-token context is dramatic. According to the V4 technical report, DeepSeek-V4-Pro requires only 27% of the per-token inference FLOPs and 10% of the KV cache memory that V3.2 would need for the same context length. DeepSeek-V4-Flash is even leaner at 10% of FLOPs and 7% of KV cache relative to V3.2. From V3.2 to V4-Pro, KV cache memory is reduced by approximately 9.5x to 13.7x across context lengths, according to figures cited by The Register. Compared with a standard Grouped-Query Attention baseline (GQA-8), V4's KV cache is roughly 2% of the equivalent GQA-8 footprint.
V4 replaces standard residual connections with Manifold-Constrained Hyper-Connections (mHC). The residual stream width is expanded by a factor of four (n_hc = 4 in both model variants). The residual mapping matrices are constrained to the Birkhoff polytope, the set of doubly stochastic matrices, using the Sinkhorn-Knopp algorithm with a maximum of 20 iterations during training. This constraint bounds the spectral norm of the mapping matrices at 1, which prevents signal amplification through the network and improves training stability without sacrificing model expressivity.
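A small sketch of the Sinkhorn-Knopp step, assuming an exponential parameterization of the mixing matrix; how gradients flow through the projection during training is not specified here.

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Project an unconstrained square matrix toward the Birkhoff polytope.

    Exponentiation gives positive entries; alternating row and column
    normalization drives the matrix toward doubly stochastic form, whose
    spectral norm is bounded by 1.
    """
    mat = torch.exp(logits)
    for _ in range(n_iters):
        mat = mat / mat.sum(dim=1, keepdim=True)   # rows sum to 1
        mat = mat / mat.sum(dim=0, keepdim=True)   # columns sum to 1
    return mat

# Example: a 4x4 mixing matrix for the n_hc = 4 expanded residual stream
mixing = sinkhorn_knopp(torch.randn(4, 4))
print(mixing.sum(dim=0), mixing.sum(dim=1))        # both approximately ones
```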
DeepSeek-V4 adopts the Muon optimizer for the majority of its parameters. Muon orthogonalizes gradient updates using Newton-Schulz iterations. V4 uses a hybrid schedule: 8 convergence iterations followed by 2 stabilization iterations. Embeddings, prediction heads, biases, and RMSNorm weights use AdamW. According to DeepSeek's technical report, Muon speeds convergence and improves training stability at trillion-parameter scale, and the hybrid Muon plus AdamW configuration was chosen after ablation studies on smaller model sizes.
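The orthogonalization idea behind Muon can be sketched as follows; this uses the classic cubic Newton-Schulz iteration for clarity rather than DeepSeek's hybrid 8-plus-2 schedule or the tuned coefficients of production Muon implementations, and the momentum and learning-rate handling is illustrative.

```python
import torch

def newton_schulz_orthogonalize(grad: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Approximate the orthogonal polar factor of a gradient matrix.

    Classic cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X, started from
    the Frobenius-normalized gradient so the iteration stays in its convergence
    region. Muon-style optimizers apply such an orthogonalized update to 2D
    weight matrices in place of the raw gradient.
    """
    x = grad / (grad.norm() + 1e-7)
    for _ in range(n_iters):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)
    return x

def muon_style_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon-style update: momentum, orthogonalize, apply."""
    momentum_buf.mul_(beta).add_(grad)
    weight.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
    return weight, momentum_buf
```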
Expert weights in the MoE layers use FP4 precision with quantization-aware training, roughly halving the weight storage footprint versus FP8. Non-expert parameters use FP8. The KV cache stores most entries in FP8, with BF16 reserved only for the rotary positional embedding (RoPE) dimensions where higher precision matters. Base model checkpoints use FP8 throughout, while instruct models combine FP4 (experts) with FP8 (everything else). FP4 adoption at the expert level is a key reason V4-Pro's 862 GB download size is manageable despite 1.6 trillion nominal parameters; an equivalent FP8-only model would be roughly 60% larger.
The Register specifically noted V4's adoption of MXFP4 (microscaling FP4) as a step away from Nvidia-specific FP8 formats, framing it as a deliberate move toward hardware portability across accelerator vendors.
Two techniques address instability specific to large-scale MoE training. Anticipatory Routing computes token assignments using router parameters from earlier in training (θ_{t−Δt}), decoupling backbone and router gradient updates so they do not interfere. SwiGLU Clamping constrains the linear components of SwiGLU activations to the range [-10, 10] and caps gate values at 10, preventing gradient explosions in expert layers when individual activations occasionally produce outlier values during training.
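A minimal sketch of the clamping, assuming the limits are applied to the pre-activation gate and linear ("up") projections; the exact placement of the clamps in DeepSeek's kernels is not specified in the material summarized here.

```python
import torch
import torch.nn.functional as F

def clamped_swiglu(x, w_gate, w_up, w_down, limit=10.0):
    """SwiGLU feed-forward with outlier clamping.

    The linear ("up") component is clamped to [-limit, limit] and the gate
    pre-activation is capped at +limit before SiLU, so a single outlier
    activation cannot blow up an expert's output. Weight shapes are illustrative.
    """
    gate = (x @ w_gate).clamp(max=limit)
    up = (x @ w_up).clamp(min=-limit, max=limit)
    return (F.silu(gate) * up) @ w_down

# Example shapes: d_model = 1024, d_ff = 2816
x = torch.randn(4, 1024)
out = clamped_swiglu(x, torch.randn(1024, 2816), torch.randn(1024, 2816), torch.randn(2816, 1024))
```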
DeepSeek-V4-Pro was pre-trained on a corpus of more than 32 trillion tokens described as diverse and high-quality. DeepSeek-V4-Flash was trained on approximately 33 trillion tokens, slightly more than Pro. Both corpora include long-document data to support the 1M-token context objective. The specific composition of the training data, beyond these summary figures, is not disclosed in the technical report.
The V4 paper emphasizes that long-context training data is curated rather than synthesized, with emphasis on whole-document examples (codebases, books, legal corpora) rather than artificially concatenated short documents. The ratio of long-context to short-context examples is gradually increased during training.
V4's post-training pipeline is a two-stage specialist-then-distill approach.
In the first stage, separate specialist models are trained for distinct domains: mathematics, coding, agentic tasks, and instruction following, among others. Each specialist starts from the shared pre-trained base and undergoes domain-specific Supervised Fine-Tuning (SFT) followed by Reinforcement Learning using Group Relative Policy Optimization (GRPO) with domain-tailored reward signals. This produces a set of more than ten domain-expert teacher models, each strong in its area.
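For context, the group-relative advantage at the heart of GRPO can be sketched in a few lines; the domain-specific reward models and the surrounding policy-gradient loss are omitted.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages: each rollout for the same prompt is scored
    against the mean and spread of its group, so no learned critic is needed."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-6)

# e.g. verifier rewards for 8 rollouts of one math prompt
print(grpo_advantages(torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0])))
```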
In the second stage, a single unified student model is distilled from all teachers simultaneously through On-Policy Distillation (OPD). The student generates its own outputs, then minimizes the reverse KL divergence against the logit distribution of whichever teacher is most relevant to the current task. Full-vocabulary logit distillation, rather than top-k, is used for stable gradient estimates. The result is a single inference model that combines the strengths of all domain specialists.
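The reverse-KL objective can be sketched directly; this assumes the task-relevant teacher has already been selected for each example and omits the on-policy rollout machinery.

```python
import torch
import torch.nn.functional as F

def reverse_kl_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor) -> torch.Tensor:
    """Full-vocabulary reverse KL, KL(student || teacher), averaged over positions.

    Both logit tensors are [batch, seq, vocab] for the same student-generated
    (on-policy) tokens; the teacher is treated as a fixed target.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1).detach()
    return (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1).mean()
```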
This approach contrasts with V3's more uniform reinforcement learning across domains and allows V4 to optimize more precisely for different task types without maintaining multiple inference models in production.
A significant infrastructure component disclosed in the V4 technical report is DeepSeek Elastic Compute (DSEc), a Rust-based platform that exposes four execution substrates for reinforcement learning rollouts: function calls, containers, microVMs (Firecracker), and full VMs (QEMU). DSEc is designed to run hundreds of thousands of concurrent sandboxes during training, with fast image loading via layered 3FS storage so containers do not incur cold-start delays during rollouts, and preemption-safe trajectory replay so an interrupted rollout can resume without re-running tool calls.
DSEc enabled V4's agent-focused post-training to use real tool environments at scale rather than simulated stubs. The interleaved thinking pattern (preserving chain-of-thought across tool calls when tools are present, discarding it across user messages when they are absent) was specifically tuned through DSEc rollouts.
V4 introduces a |DSML| special token that wraps an XML-based tool-call format. This was a deliberate move away from JSON tool-call formats (used by most Western models) because the XML form reduces escaping failures around quoted strings and structured parameters. Parameters are marked with string="true" or string="false" to distinguish quoted strings from numbers and booleans, which the technical report claims reduces parsing errors during multi-turn agent loops.
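The report does not reproduce the full grammar, but the shape can be sketched; the tag names and the token framing below are hypothetical illustrations, with only the string="true"/"false" convention taken from the description above.

```python
# Hypothetical rendering of a DSML-style tool call. The tag names and the
# <|DSML|> framing are illustrative assumptions, not the published grammar.
def render_tool_call(name: str, params: dict) -> str:
    lines = [f'<invoke name="{name}">']
    for key, value in params.items():
        flag = "true" if isinstance(value, str) else "false"
        lines.append(f'  <param name="{key}" string="{flag}">{value}</param>')
    lines.append("</invoke>")
    return "<|DSML|>\n" + "\n".join(lines) + "\n<|/DSML|>"

print(render_tool_call("search_files", {"query": "attention kernel", "max_results": 5}))
```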
The following scores are for DeepSeek-V4-Pro-Max (the highest-effort inference setting using Think Max mode) unless otherwise noted, taken from DeepSeek's technical report and Hugging Face model card.
| Benchmark | V4-Pro-Max | GPT-5.5 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| LiveCodeBench Pass@1 | 93.5 | -- | -- | -- |
| Codeforces Rating | 3206 | 3168 (GPT-5.4-xHigh) | -- | 3052 |
| SWE-bench Verified | 80.6% | 88.7% | 80.8% | 80.6% |
| SWE-bench Pro | 55.4% | 58.6% | 64.3% | -- |
| Terminal-Bench 2.0 | 67.9% | 82.7% | 69.4% | 68.5% |
| MCPAtlas Public | 73.6 | -- | 73.8 | -- |
| Toolathlon | 51.8 | -- | -- | -- |
V4-Pro-Max's Codeforces rating of 3206 was the highest score achieved by any AI model at the time of release, surpassing GPT-5.4-xHigh's 3168 and Gemini's 3052. Estimates put the score at roughly the equivalent of rank 23 on Codeforces globally. On SWE-bench Verified, V4-Pro is at near parity with Claude Opus 4.6 (80.8%) and Gemini 3.1 Pro (80.6%) but trails GPT-5.5 (88.7%). V4-Pro-Max leads all models on LiveCodeBench at 93.5.
| Benchmark | V4-Pro-Max | GPT-5.5 | Claude Opus 4.6 | Gemini 3.1 Pro |
|---|---|---|---|---|
| GPQA Diamond | 90.1% | -- | ~91% | 94.3% |
| HMMT 2026 | 95.2% | -- | 96.2% | -- |
| Putnam 2025 | 120/120 | -- | -- | -- |
| AIME 2026 | 96.4% | -- | -- | -- |
| GSM8K (base) | 92.6% | -- | -- | -- |
| MATH (base) | 64.5% | -- | -- | -- |
| IMOAnswerBench | 89.8% | -- | -- | -- |
V4 achieves a perfect score on Putnam 2025, the undergraduate mathematics competition dataset, and 96.4% on AIME 2026. On GPQA Diamond, a PhD-level science benchmark, V4 scores 90.1% versus Gemini 3.1 Pro's 94.3%.
| Benchmark | V4-Pro-Max | Gemini 3.1 Pro | GPT-5.2 | Claude Opus 4.6 |
|---|---|---|---|---|
| MMLU-Pro | 87.5% | 91.0% | 91.4% | 90.5% |
| SimpleQA Verified | 57.9% | -- | -- | 46.2% |
| Chinese SimpleQA | 84.4% | -- | -- | -- |
| HLE | 37.7% | -- | -- | 40.0% |
On MMLU-Pro, V4-Pro trails Gemini 3.1 Pro and GPT-5.2. On SimpleQA Verified (a factual accuracy benchmark), V4 scores 57.9%, well ahead of Claude Opus 4.6's 46.2%. DeepSeek's own technical paper acknowledges that V4 "trails state-of-the-art frontier models by approximately three to six months," placing it roughly at parity with mid-2025 frontier models such as GPT-5.2, Gemini 3.0 Pro, and Claude Opus 4.5.
| Benchmark | V4-Pro-Max | Claude Opus 4.6 |
|---|---|---|
| MRCR 1M (8-needle) | 83.5 MMR | 92.9 MMR |
| CorpusQA 1M | 62.0% | -- |
| LongBench-V2 (base) | 51.5% | -- |
| BrowseComp | 83.4% | -- |
At 1M-token retrieval (MRCR), V4-Pro scores 83.5 MMR, trailing Claude Opus 4.6's 92.9 but ahead of other open models. The Hugging Face blog reports that V4-Pro stays above 0.82 accuracy through 256K tokens on MRCR with 8 needles and holds at 0.59 at 1M tokens.
DeepSeek charges per token on its official API. At launch, V4-Pro was offered at a 75% introductory discount, extended through May 31, 2026 (15:59 UTC). Standard (full) rates and discounted launch rates are listed separately below.
| Model | Input (cache miss) | Input (cache hit) | Output | Max output |
|---|---|---|---|---|
| V4-Flash | $0.14 / 1M | $0.0028 / 1M | $0.28 / 1M | 384K tokens |
| V4-Pro (standard) | $1.74 / 1M | $0.003625 / 1M | $3.48 / 1M | 384K tokens |
| V4-Pro (launch discount, until 2026-05-31) | $0.435 / 1M | $0.003625 / 1M | $0.87 / 1M | 384K tokens |
Cache-hit pricing represents automatic context caching. DeepSeek applies caching transparently without requiring developers to declare cache keys or set TTLs. On April 26, 2026 at 12:15 UTC, DeepSeek reduced cache-hit prices to one-tenth of their initial launch rates across the entire API. Combined with the V4-Pro 75% discount, the effective input cost for cache-heavy agentic workloads was reported in some technical analyses to be the lowest of any frontier-class model on the market in May 2026.
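As a back-of-the-envelope illustration using the launch-discount rates from the table above (the 90% cache-hit ratio is an assumed workload profile, not a DeepSeek figure):

```python
# Effective cost of a cache-heavy V4-Pro agent run at the launch-discount rates
# listed above; the cache-hit ratio is an assumed workload profile.
MISS_PER_M, HIT_PER_M, OUT_PER_M = 0.435, 0.003625, 0.87   # USD per 1M tokens

def run_cost(input_tokens: int, output_tokens: int, cache_hit_ratio: float = 0.9) -> float:
    hit = input_tokens * cache_hit_ratio
    miss = input_tokens - hit
    return (miss * MISS_PER_M + hit * HIT_PER_M + output_tokens * OUT_PER_M) / 1e6

# e.g. 5M input tokens (mostly reused agent context) and 200K output tokens
print(f"${run_cost(5_000_000, 200_000):.2f}")   # about $0.41 under these assumptions
```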
| Model | Input ($/1M) | Output ($/1M) | Context |
|---|---|---|---|
| DeepSeek V4-Flash | $0.14 | $0.28 | 1M tokens |
| DeepSeek V4-Pro (standard) | $1.74 | $3.48 | 1M tokens |
| Gemini 3.1 Pro | ~$3.50 | ~$10.50 | 1M tokens |
| GPT-5.4 | ~$15.00 | ~$60.00 | 128K tokens |
| GPT-5.5 | $5.00 | $30.00 | 1M tokens |
| Claude Opus 4.6 | $15.00 | $75.00 | 1M tokens |
| Claude Opus 4.7 | $5.00 | $25.00 | 1M tokens |
At standard rates, V4-Pro output costs $3.48 per million tokens versus $25 for Claude Opus 4.7 and $30 for GPT-5.5, roughly a 7x to 9x price gap in DeepSeek's favor. VentureBeat described V4 as offering "near state-of-the-art intelligence at 1/6th the cost" of top Western models. At launch discount rates, V4-Pro output drops to $0.87, roughly a 30x gap to Claude Opus 4.7 output.
V4-Pro is the larger of the two variants, with 1.6 trillion total parameters and 49 billion active per token. It uses FP4 for expert weights and FP8 for other layers, resulting in an 862 GB checkpoint. It is aimed at demanding tasks: complex multi-step reasoning, advanced coding, scientific analysis, and long-document comprehension. On agentic benchmarks it rivals or approaches the performance of Claude Opus 4.6 and GPT-5.4. The V4-Pro Hugging Face page lists more than 1.06 million downloads in the first month after release.
V4-Flash has 284 billion total parameters with 13 billion active per token, weighing 160 GB. According to DeepSeek, "reasoning capabilities closely approach V4-Pro" despite the smaller scale, and it runs at substantially lower cost and latency. V4-Flash is positioned as a drop-in replacement for the legacy deepseek-chat and deepseek-reasoner API endpoints, which route to V4-Flash's non-thinking and thinking modes respectively. Legacy endpoints are scheduled for retirement on July 24, 2026.
Both Pro and Flash ship with base (pre-training checkpoint) variants in addition to instruction-tuned versions. Base models are available as DeepSeek-V4-Flash-Base and DeepSeek-V4-Pro-Base and are intended for downstream fine-tuning. Both base checkpoints use FP8 precision throughout, without the FP4 expert quantization applied to instruct models.
DeepSeek released V4 weights under the MIT License, one of the most permissive open source licenses. This allows commercial use, redistribution, and modification without royalties or restrictions. The models are available for download on Hugging Face under the deepseek-ai organization. V4-Pro accumulated more than one million downloads in its first month, and the V4 collection was the most-downloaded large language model collection on Hugging Face in late April and early May 2026.
The technical report is available as a PDF on the Hugging Face model card, titled "DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence."
DeepSeek provides inference examples for multiple deployment frameworks including Hugging Face Transformers, vLLM, SGLang, and Docker Model Runner. Quantized versions compatible with llama.cpp, Ollama, LM Studio, and Jan are available from the community. Unsloth published a fine-tuning fork on Hugging Face within days of release, and Nvidia published an optimized variant for its NIM inference platform.
Deployment at the 862 GB scale of V4-Pro requires significant hardware. The checkpoint alone exceeds the 640 GB of combined memory in a standard 8-GPU H100 80GB node, so self-hosting at FP4 plus FP8 mixed precision typically requires an 8-GPU H200 141GB node (1,128 GB) or a multi-node cluster. V4-Flash at 160 GB is more accessible for organizations with modest compute, fitting within a single 8-GPU H100 or H200 node at full precision and on smaller hardware after quantization.
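A rough sizing sketch, using the checkpoint sizes from the variants table and nominal per-GPU memory; real deployments also need headroom for KV cache, activations, and framework overhead.

```python
import math

# Nominal per-GPU memory; the 20% overhead allowance is an assumption.
GPU_MEM_GB = {"H100-80GB": 80, "H200-141GB": 141}

def min_gpus(checkpoint_gb: float, gpu: str, overhead: float = 1.2) -> int:
    """Minimum GPU count to hold the weights plus a modest overhead allowance."""
    return math.ceil(checkpoint_gb * overhead / GPU_MEM_GB[gpu])

print(min_gpus(160, "H100-80GB"))   # V4-Flash: 3 GPUs for weights alone
print(min_gpus(862, "H200-141GB"))  # V4-Pro: 8 GPUs, i.e. a full H200 node
```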
V4-Pro showed particular strength on agentic coding tasks. On SWE-bench Verified, it resolves 80.6% of GitHub issues, near parity with Claude Opus 4.6's 80.8% and Gemini 3.1 Pro's 80.6%. On BrowseComp (web browsing and research), it scores 83.4%. V4-Pro-Max's Codeforces rating of 3206 represents the highest competitive programming score by an AI model at the time of release. Developer community testing, as reported by MindStudio and independent developers, put V4-Pro in the top two or three choices for coding use cases in the months following release. A MindStudio survey reported by ghost.codersera found that 52% of surveyed DeepSeek developers were ready to replace their primary coding model with V4-Pro, and a further 39% said they were leaning toward doing so.
The technical report also disclosed an internal R&D coding evaluation in which V4-Pro scored 67% pass rate, between Claude Sonnet 4.5 (47%) and Claude Opus 4.5 (70%).
The 1M-token context window is fully operational at launch, rather than a theoretical maximum. On the MRCR 1M needle-in-a-haystack benchmark, V4-Pro scores 83.5 MMR, below Claude Opus 4.6 but ahead of other open models. At 1M tokens, V4-Pro's FLOPs requirement is 27% of V3.2's, making long-context inference economically viable. The Hugging Face blog article specifically frames V4 as "a million-token context that agents can actually use," noting that prior 1M-context releases tended to be theoretical maxima with degraded retrieval at high token counts.
Both V4 variants support three inference modes. Non-Think mode produces fast responses without extended reasoning chains, suitable for routine queries. Think High mode engages deliberate analysis with a moderate token budget for thinking. Think Max mode pushes reasoning to its full extent, requiring a minimum 384K-token context window to accommodate extended chain-of-thought traces. The choice of mode is exposed as a parameter in the API, with reasoning_effort set to off, high, or max.
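A minimal call sketch against the OpenAI-compatible endpoint; the model identifier and the extra_body plumbing for reasoning_effort are assumptions based on the description above, not confirmed API documentation.

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-chat",                        # legacy endpoint routed to V4-Flash
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"reasoning_effort": "high"},      # "off", "high", or "max"
)
print(response.choices[0].message.content)
```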
The interleaved-thinking design preserves reasoning content across tool calls (within a single agent task) but discards it across user message boundaries (in conversational use without tools). This avoids context pollution in chat-style use while supporting long-horizon agent loops.
V4 shows competitive multilingual performance, scoring 84.4% on Chinese SimpleQA and performing well on Chinese-language tasks generally. On SimpleQA Verified (English factual accuracy), V4-Pro scores 57.9% against Claude Opus 4.6's 46.2%. Reviewer testing flagged that V4-Pro is less reliable than GPT-5.5 and Claude Opus 4.6 on prompts with many simultaneous structural constraints (precise output schemas, exact word counts, multi-format documents), where the larger Western models maintain higher consistency.
DeepSeek's training infrastructure for V4 remains primarily Nvidia-based, according to reporting from ChinaTalk and The Register. Earlier DeepSeek models were trained on Nvidia A100 clusters obtained before export controls, and subsequent models used A800 chips. V4's training used Nvidia GPUs, though DeepSeek has not publicly specified which generation. U.S. government officials separately alleged that DeepSeek obtained banned Nvidia Blackwell chips through third-party intermediaries, though this has not been independently confirmed.
A significant V4 development is its validated support for Huawei Ascend 950PR processors for inference. DeepSeek's technical report describes the MegaMoE fused kernel as successfully running on Huawei Ascend hardware, and the report mentions validating a fine-grained Expert Parallel scheme on both Nvidia GPU and Ascend NPU platforms. Huawei announced full Ascend platform support for V4 models at launch. DeepSeek's use of the TileLang domain-specific language reduces CUDA dependency on the inference path, enabling portability to non-Nvidia accelerators.
The Register notes that V4 adopts MXFP4 (microscaling FP4) for post-training and inference, reducing dependence on Nvidia-specific FP8 formats. This was described by technical analysts as a deliberate step toward hardware portability.
DeepSeek reportedly experienced significant training failures during a mid-2025 attempt to migrate the training framework from Nvidia to Huawei Ascend, according to Chinese tech journalist Zhou Xinyu of 36Kr. The migration contributed to V4's delayed release relative to earlier expectations. For now, the Huawei integration covers inference, not training, with training still primarily on Nvidia hardware.
If V4's Huawei inference optimization proves reliable at scale, it could provide evidence that competitive frontier models can run on non-Nvidia infrastructure, with implications for the effectiveness of U.S. semiconductor export controls. Nvidia CEO Jensen Huang called the prospect of DeepSeek running on Huawei chips "a horrible outcome for America," in a statement reported by The Next Web.
| Model | Org | Open weights | Params (active) | Input $/1M | Output $/1M | SWE-bench Verified | GPQA Diamond |
|---|---|---|---|---|---|---|---|
| DeepSeek V4-Pro | DeepSeek | Yes (MIT) | 49B (1.6T total) | $1.74 | $3.48 | 80.6% | 90.1% |
| DeepSeek V4-Flash | DeepSeek | Yes (MIT) | 13B (284B total) | $0.14 | $0.28 | -- | -- |
| Claude Opus 4.6 | Anthropic | No | undisclosed | $15.00 | $75.00 | 80.8% | ~91% |
| Claude Opus 4.7 | Anthropic | No | undisclosed | $5.00 | $25.00 | 64.3% | -- |
| GPT-5.5 | OpenAI | No | undisclosed | $5.00 | $30.00 | 88.7% | -- |
| Gemini 3.1 Pro | Google | No | undisclosed | ~$3.50 | ~$10.50 | 80.6% | 94.3% |
| Kimi K2.6 | Moonshot AI | Partial | undisclosed (1.1T total) | -- | -- | -- | -- |
| GLM-5.1 | Zhipu AI | Partial | undisclosed (754B total) | -- | -- | -- | -- |
V4-Pro is the largest open-weight model available as of its release date, surpassing Moonshot AI's Kimi K2.6 at 1.1T total parameters and Zhipu AI's GLM-5.1 at 754B. Among Chinese open models, V4-Pro leads across math, coding, and STEM benchmarks.
DeepSeek V4 landed on the same day Reuters reported the U.S. State Department had sent a diplomatic cable to embassies worldwide instructing staff to warn foreign governments about alleged IP theft by DeepSeek and other Chinese AI companies. The concurrent timing, whether coincidental or deliberate, brought significant geopolitical attention to the release.
The Council on Foreign Relations published an analysis that same day arguing V4 "signals a new phase in the U.S.-China AI rivalry," with the competition shifting from raw frontier capability toward economic adoption and global influence, particularly in the Global South.
CFR Senior Fellow Michael C. Horowitz emphasized the adoption race angle: "Success will not just be about having the best-performing models," he wrote, but about having good-enough solutions that deploy cheaply at scale. He noted that Chinese open models already had more downloads on Hugging Face than U.S. equivalents, a dynamic V4's open release was likely to amplify.
CFR Senior Fellow Jessica Brandt raised concerns that V4's capabilities "reflect, at least in part, access to illicitly obtained U.S. intellectual property," citing alleged large-scale model distillation attacks through fake accounts. Anthropic and OpenAI separately alleged that DeepSeek-affiliated actors had created tens of thousands of fake accounts conducting tens of millions of interactions to extract capabilities from frontier U.S. models. DeepSeek has not publicly responded to these allegations.
CFR Senior Fellow Chris McGuire offered a more measured technical assessment, noting that V4 "is not competitive with frontier U.S. models" and that DeepSeek remains significantly dependent on U.S. semiconductor technology. He flagged that DeepSeek itself had admitted compute shortages limit V4 deployment at scale, with the V4-Pro model unavailable to most API customers in the days after launch.
Following the V4 release, shares of SMIC (China's leading chip foundry) rose roughly 9% in Hong Kong trading, with Hua Hong Semiconductor up about 15%, reflecting investor expectations that Huawei Ascend chip demand would increase. Competing Chinese AI startups MiniMax (HKG: 0100) and Knowledge Atlas (Zhipu, HKG: 2513) saw shares fall, with MiniMax sliding around 9-10% in the days after release and Zhipu down 3.4% in Monday trading following V4's launch. The pattern reflected investor rotation out of model developers facing pricing pressure into chip suppliers benefiting from compute demand.
V4's preview release coincided with DeepSeek's first-ever external financing round. According to reporting from Bloomberg, The Information, and CnTechPost, DeepSeek had been in advanced talks since mid-April 2026 to raise at least $300 million at an initial $10 billion target valuation. Within days of the V4 announcement, interest from investors including Tencent and Alibaba pushed the valuation discussion above $20 billion. The financing round marks a sharp pivot for a company that had spent its first two and a half years rejecting venture capital offers in favor of a research-first culture funded by High-Flyer Capital Management.
MIT Technology Review identified three reasons V4 matters beyond raw benchmark scores: the compressed attention architecture reduces inference costs to levels that make 1M-token context economically practical for the first time; the open MIT license enables commercial use without licensing negotiations; and the pricing sets a new benchmark for cost pressure on proprietary model providers. Over 90% of developers surveyed by MindStudio included V4-Pro among their top coding model choices in the weeks following release. The legacy API endpoint transition (deepseek-chat to V4-Flash, deepseek-reasoner to V4-Flash thinking mode) signals that V4 is intended as a production replacement, not an experimental release.
V4 was rapidly integrated into agent frameworks and coding tools. The DeepSeek API release notes specifically highlighted compatibility with Claude Code, OpenClaw, and OpenCode, alongside support for both OpenAI ChatCompletions and Anthropic-style API formats. This dual-format support reduced switching costs for developers already using Western model APIs.
Technical coverage was broadly positive but included specific criticisms. Simon Willison, a widely-read developer blogger, described V4 as "almost on the frontier, a fraction of the price," noting that the efficiency improvements at long contexts are genuine and reproducible. He directly quoted DeepSeek's own admission that performance "falls marginally short of GPT-5.4 and Gemini-3.1-Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately 3 to 6 months." Willison also ran his pelican-on-bicycle SVG generation test, finding that V4-Flash produced a competent rendering while V4-Pro's pelican anatomy was somewhat off.
The Register covered the architecture in depth, describing the KV cache reductions as the most significant technical contribution and emphasizing the MXFP4 adoption as a potential break from Nvidia hardware lock-in. MIT Technology Review framed V4 as the moment when 1M-token context became economically practical for production use. Bloomberg called V4 "DeepSeek's newest flagship a year after [its] AI breakthrough," referencing the V3 and R1 releases of late 2024 and early 2025. CNBC and CNN both ran prominent international coverage on launch day.
Developer community reaction on platforms including Reddit (r/DeepSeek, r/singularity, r/LocalLLaMA) and Hugging Face was enthusiastic about the context window, pricing, and Codeforces score. The 3206 Codeforces rating was immediately flagged as a landmark for competitive programming AI.
Criticisms concentrated on a few areas. At launch, some chat instances reportedly identified themselves as V3, suggesting incomplete deployment. Coverage of third-party benchmark results was incomplete in the first week. Developers testing frontend code generation found V4-Pro's output functionally correct but less visually polished than GPT-5.5 on UI tasks. The API initially had reliability problems for V4-Pro under load, with many users encountering rate limits and queuing during peak hours.
The 36Kr report and ChinaTalk analysis added context that V4's development was delayed by training migration failures, talent departures to Tencent, ByteDance, Xiaomi, and other Chinese tech firms, and internal funding decisions that postponed multimodal capability to a later release. ChinaTalk specifically suggested that V4 is the "singularity of the Cambrian explosion of AI applications in China," framing it as a foundational platform for downstream Chinese AI products rather than a frontier-capability moonshot.
Several limitations were noted at launch or in subsequent testing.
Performance gap with closed models. DeepSeek's own technical report states V4 trails state-of-the-art frontier models by approximately three to six months. On MMLU-Pro (87.5% versus Gemini 3.1 Pro at 91.0%) and HLE (37.7% versus Claude Opus 4.6 at 40.0%), the gap to the leading models is measurable. On long-context retrieval (MRCR 1M: 83.5 versus Claude Opus 4.6's 92.9), V4 is competitive but not leading.
No multimodal capability. V4 is text-only at release. Internal funding and compute constraints reportedly pushed multimodal features to a future release. This is a notable gap for teams building agents that require visual validation of outputs, document understanding with figures and tables, or screen-reading workflows.
Complex multi-constraint instruction following. Reviewer testing found that V4-Pro, while strong on structured tasks, shows degraded reliability on complex prompts with many simultaneous constraints, where GPT-5.5 and Claude Opus 4.6 perform more consistently.
Long-horizon agentic reliability. On Terminal-Bench 2.0, V4-Pro scores 67.9% versus GPT-5.5's 82.7%, indicating a real gap in multi-step tool-use reliability over extended autonomous task runs.
Scale deployment constraints. CFR analysis notes that compute shortages at DeepSeek itself limit V4 deployment at scale. The full 862 GB V4-Pro model is demanding to self-host, and DeepSeek's own API was reported to be unable to serve V4-Pro to most customers in the days immediately after launch.
IP concerns. Multiple U.S. government officials and analysts have alleged that DeepSeek conducted large-scale distillation from proprietary U.S. frontier models during training. If accurate, this raises questions about the provenance of some capabilities. DeepSeek has not addressed these allegations publicly.