DeepSeek V3
Last reviewed
May 17, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 5,675 words
DeepSeek V3 is an open-weights Mixture of Experts large language model developed by DeepSeek and released on December 26, 2024.[1][2] With 671 billion total parameters and 37 billion activated per token, it was the most capable openly available language model at the time of its release. It matched or exceeded several leading proprietary systems on standard benchmarks, and its final training run cost a reported $5.576 million in GPU rental time, a fraction of the publicly estimated training budgets of comparable Western frontier systems.[1][3]
The model drew immediate international attention not just for its technical performance but for what it implied about the state of Chinese AI development under US semiconductor export controls. When the DeepSeek-R1 reasoning model followed roughly four weeks later on January 20, 2025, the combination triggered a sharp sell-off in AI-related equities, including a single-day loss of approximately $589 billion in Nvidia's market capitalisation on January 27, 2025, the largest one-day stock loss in US market history to that point.[4][5] The episode entered the AI industry's vocabulary as the "DeepSeek moment."
DeepSeek V3 is not a reasoning model. It is a general-purpose base language model and instruction-tuned chat model, analogous in positioning to GPT-4o or Claude 3.5 Sonnet. The reasoning counterpart, built on top of the V3-Base checkpoint, is DeepSeek-R1.[6] After the original December 2024 release, DeepSeek issued an updated checkpoint in March 2025 (V3-0324), a substantially retrained version called DeepSeek-V3.1 in August 2025, an experimental sparse-attention variant V3.2-Exp in September 2025, a stable V3.2 in December 2025, and the successor DeepSeek V4 preview in April 2026.[7]
The model is sometimes referred to colloquially as "DeepSeek 3.0," though the official designation in DeepSeek's technical report and Hugging Face repositories is "DeepSeek-V3."[2]
DeepSeek was established in July 2023 as a research subsidiary of High-Flyer, a Hangzhou-based quantitative hedge fund co-founded by Liang Wenfeng.[8][9] High-Flyer had been building GPU clusters for its own algorithmic trading research since at least 2021, accumulating a substantial inventory of Nvidia A100s before US export controls tightened in October 2022. The firm's "Fire-Flyer 2" cluster, completed in 2022, contained roughly 5,000 PCIe A100 GPUs across 625 nodes of eight GPUs each.[8] When DeepSeek spun out as a dedicated AI lab, it inherited both the hardware infrastructure and a team of researchers motivated primarily by scientific curiosity rather than near-term commercial return, which influenced how it approached the cost and openness tradeoffs that would later distinguish its releases.
DeepSeek's first major public model was DeepSeek-V1, a dense transformer released in late 2023. The more consequential architectural step came in May 2024 with DeepSeek-V2, which introduced two techniques that became defining features of the entire V-series lineage: Multi-head Latent Attention (MLA) for memory-efficient key-value caching, and the DeepSeekMoE design for expert routing.[10] V2 had 236 billion total parameters with 21 billion active per token and demonstrated that MoE models could be trained competitively at significantly lower per-token compute than equivalently capable dense models.
V3 scaled V2's approach substantially while introducing several additional technical improvements that collectively reduced training cost and improved final quality. The result was a model that outperformed open-source alternatives by a meaningful margin across most standard evaluations and stood roughly level with GPT-4o and Claude 3.5 Sonnet on several tasks.[1]
When V3 was first released in December 2024, the weights were distributed under a custom "DeepSeek Model License Agreement" that was modelled on OpenRAIL and permitted commercial use but imposed restrictions: derivative models must credit DeepSeek, output from DeepSeek models could not be used to train models intended to compete with DeepSeek products, and the license included standard safety clauses. The accompanying code under the GitHub repository used the MIT License.[11] The weights were thus open in the sense that anyone could download and run them, but the original license was not an OSI-approved open source license.
When V3-0324 was released in March 2025, DeepSeek migrated the model weights to MIT License as well, removing the custom restrictions and aligning the license with the simultaneously distributed DeepSeek-R1 weights.[12][13] This made V3-0324 and subsequent V3-series checkpoints fully permissive under a standard OSI-approved license.
DeepSeek V3 follows the transformer decoder architecture with two major structural specialisations: MLA for attention and DeepSeekMoE for the feed-forward layers.[1] The full configuration comprises 61 transformer layers with a hidden dimension of 7,168 and 128 attention heads (each of dimension 128).[14]
Total parameters are 671 billion in the main model. The Hugging Face release ships with an additional Multi-Token Prediction (MTP) module that adds roughly 14 billion parameters, bringing the total uploaded weight count to approximately 685 billion.[15] References to "DeepSeek V3 has 685B parameters" generally include the MTP module; the 671B figure refers to the main model.
In a standard dense transformer, every token passes through every neuron in every feed-forward layer, making compute scale linearly with parameter count. A Mixture of Experts architecture instead uses a router to direct each token to a subset of specialised sub-networks ("experts"), so the active compute per token is far lower than the total parameter count would suggest.
DeepSeek V3 uses the DeepSeekMoE configuration, which divides the feed-forward component of each layer into 256 routed experts plus 1 shared expert (257 experts per layer in total).[1][14] For each token, the router selects 8 of the 256 routed experts (top-8), which are then invoked alongside the mandatory shared expert. Each expert has an intermediate hidden dimension of 2,048. This gives a total active parameter count of approximately 37 billion per token against a full model size of 671 billion.
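The routing step described above can be sketched in a few lines. This is a toy illustration rather than DeepSeek's implementation: the hidden size of 16, the `make_expert` stand-in networks, the sigmoid gate, and the normalisation of gate weights over the selected experts are all assumptions made for the example. Only the counts (256 routed experts, top-8 selection, one shared expert) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_ROUTED, TOP_K = 16, 256, 8   # HIDDEN is a toy size, not 7,168

def make_expert():
    """Hypothetical stand-in for an expert feed-forward network."""
    w = rng.standard_normal((HIDDEN, HIDDEN)) * 0.1
    return lambda v: np.tanh(w @ v)

routed_experts = [make_expert() for _ in range(N_ROUTED)]
shared_expert = make_expert()
gate_w = rng.standard_normal((N_ROUTED, HIDDEN))

def moe_forward(x):
    """Send one token through the shared expert plus its top-8 routed experts."""
    # Token-to-expert affinity; the sigmoid gate is an assumption of this sketch.
    scores = 1.0 / (1.0 + np.exp(-(gate_w @ x)))
    chosen = np.argsort(scores)[-TOP_K:]              # indices of the 8 best experts
    weights = scores[chosen] / scores[chosen].sum()   # normalise over the chosen set
    routed = sum(w * routed_experts[i](x) for w, i in zip(weights, chosen))
    return shared_expert(x) + routed, chosen

token = rng.standard_normal(HIDDEN)
output, chosen = moe_forward(token)
print(f"routed experts used: {len(chosen)} of {N_ROUTED} ({TOP_K / N_ROUTED:.1%})")
print(f"active share of parameters: {37 / 671:.1%}")
```

Only the 8 selected experts and the shared expert run for a given token, which is why the active parameter count stays far below the total.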
Compared with V2, which had 160 routed experts per layer, V3 increased that count by 60 percent. Finer-grained expert specialisation allows the model to develop more distinct competencies across the expert pool, at the cost of making routing harder to balance.[1]
Routing collapse, in which tokens pile onto a small set of popular experts, is addressed by the load-balancing mechanism described below. Separately, to keep communication cost bounded across the 256 expert slots, DeepSeek V3 applies a node-limited routing constraint: no token may be sent to experts on more than 4 nodes of the multi-node cluster. This caps the all-to-all communication in expert parallelism at a predictable upper bound, which was important for the pipeline design described below.[1]
Standard multi-head attention caches one key and one value vector per attention head per token in the context window. With 128 heads of dimension 128 and a 128K context window, that comes to roughly 8 GB per layer for a single sequence in BF16, or several hundred gigabytes across all 61 layers, imposing heavy memory pressure during inference.
MLA, introduced in DeepSeek-V2 and retained in V3, compresses the KV cache by projecting keys and values into a shared low-rank latent representation before caching.[10][1] During attention computation, the full keys and values are reconstructed from the latent vectors. The KV compression dimension is 512, versus the full 7,168-dimensional hidden state. The query compression dimension is separately set to 1,536.[14]
For the components that interact with rotary positional embeddings (RoPE), which require access to the actual position-dependent representations rather than the compressed latent, MLA uses a separate decoupled key of dimension 64 that is shared across heads. This decoupled component is cached separately but is small.[14]
The practical effect is that V3 can serve long contexts with substantially lower memory than an equivalent dense model with standard grouped-query attention, which directly lowers inference cost and enables competitive throughput on a smaller GPU footprint.
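A back-of-the-envelope calculation shows the scale of the saving. The sketch below uses the dimensions quoted in this article (61 layers, 128 heads of dimension 128, a 512-dimensional KV latent, a 64-dimensional RoPE key) and assumes a single 131,072-token sequence, BF16 cache entries (2 bytes each), and one RoPE key cached per token rather than per head. The comparison baseline is plain multi-head attention; grouped-query attention would sit somewhere in between.

```python
def kv_cache_gib(tokens, layers, elems_per_token_per_layer, bytes_per_elem=2):
    """Cache size in GiB for one sequence (BF16 by default)."""
    return tokens * layers * elems_per_token_per_layer * bytes_per_elem / 2**30

TOKENS, LAYERS = 131_072, 61
HEADS, HEAD_DIM = 128, 128
KV_LATENT, ROPE_KEY = 512, 64

# Plain multi-head attention: a full key and a full value per head.
mha_elems = 2 * HEADS * HEAD_DIM
# MLA: one shared latent plus one decoupled RoPE key (assumed shared by all heads).
mla_elems = KV_LATENT + ROPE_KEY

mha_gib = kv_cache_gib(TOKENS, LAYERS, mha_elems)
mla_gib = kv_cache_gib(TOKENS, LAYERS, mla_elems)

print(f"MHA cache: {mha_gib:.1f} GiB")
print(f"MLA cache: {mla_gib:.1f} GiB")
print(f"reduction: {mha_elems / mla_elems:.1f}x")
```

Under these assumptions the per-token cache shrinks from 32,768 elements per layer to 576, which is where the lower serving memory comes from.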
Earlier MoE models, including DeepSeek-V2, used auxiliary loss terms added to the training objective to penalise uneven routing. The idea was to steer the router toward balanced expert utilisation. In practice, auxiliary losses create a tension: the routing objective and the language modelling objective are simultaneously optimised, and they can work against each other, resulting in slightly degraded downstream quality.[16]
DeepSeek V3 drops the auxiliary loss as its main load-balancing tool. Instead, it adds a per-expert bias term to each expert's routing score, and this bias is adjusted by a simple rule outside the gradient pass rather than learned by backpropagation. When an expert is overloaded in a training step, its bias is decremented by a small fixed amount (gamma = 0.001). When it is underloaded, the bias is incremented. The bias shifts routing toward underused experts without interfering with the gradient signal for the language modelling objective.[1][16]
A small sequence-wise balance loss is applied alongside the bias mechanism to prevent extreme imbalance within individual sequences, but it carries a very low weight so that it has negligible effect on the primary language modelling objective.
The technical report notes that while load balance is measurably worse than under the auxiliary-loss approach, the tradeoff produces better final model quality, because the routing objective no longer corrupts gradient updates for the primary task.[1]
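The bias-adjustment rule can be illustrated with a small simulation. This is a sketch under stated assumptions, not the training code: the expert count (16), top-2 selection, batch size, and the artificial `skew` that makes some experts more popular are all hypothetical, and using the bias only for expert selection while computing gate weights from the raw scores is an assumption of the sketch. The step size gamma = 0.001 is the value quoted above.

```python
import numpy as np

def select_experts(scores, bias, top_k):
    """Pick top_k experts per token using biased scores; gate on raw scores."""
    chosen = np.argsort(scores + bias, axis=-1)[:, -top_k:]
    raw = np.take_along_axis(scores, chosen, axis=-1)
    return chosen, raw / raw.sum(axis=-1, keepdims=True)

def update_bias(bias, chosen, n_experts, gamma=0.001):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    load = np.bincount(chosen.ravel(), minlength=n_experts)
    return bias - gamma * np.sign(load - load.mean())

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, TOKENS = 16, 2, 512
skew = np.linspace(0.3, 0.0, N_EXPERTS)   # hypothetical built-in popularity gap

def imbalance(bias, batches=20):
    """Ratio of the busiest expert's load to the mean load."""
    load = np.zeros(N_EXPERTS)
    for _ in range(batches):
        scores = rng.random((TOKENS, N_EXPERTS)) + skew
        chosen, _ = select_experts(scores, bias, TOP_K)
        load += np.bincount(chosen.ravel(), minlength=N_EXPERTS)
    return load.max() / load.mean()

bias = np.zeros(N_EXPERTS)
before = imbalance(bias)
for _ in range(1000):                      # simulated training steps
    scores = rng.random((TOKENS, N_EXPERTS)) + skew
    chosen, _ = select_experts(scores, bias, TOP_K)
    bias = update_bias(bias, chosen, N_EXPERTS)
after = imbalance(bias)

print(f"max/mean load before: {before:.2f}, after: {after:.2f}")
```

No gradient ever flows through `bias` in this sketch; it is moved only by the sign of each expert's load error, which is the property the text describes.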
Standard causal language modelling trains a model to predict the next single token given all preceding tokens. DeepSeek V3 adds a secondary multi-token prediction (MTP) module that, at each position, also predicts the token two steps ahead. The MTP depth is set to 1, meaning exactly one additional future token is predicted.[1]
The MTP module uses sequential chaining: the first additional head predicts token n+2 given a representation of the output up to n+1, and these representations maintain causal structure at each depth. Output head and embedding weights are shared with the main model.
MTP provides two benefits. During training, predicting multiple positions from the same context passes more gradient signal per forward pass and appears to improve the quality of learned representations. During inference, MTP supports speculative decoding: the MTP predictions can be used as draft tokens that the main model verifies in parallel, potentially increasing throughput when the MTP predictions are accurate.[1]
The model is pre-trained on sequences up to 4,096 tokens. After pre-training, DeepSeek applies a two-stage YaRN-based context extension: first to 32,768 tokens (with 630 billion additional tokens of data), then to 131,072 tokens (128K, with 209 billion additional tokens).[1][17] The longer context training uses the full DeepSeekMoE and MLA stack with no architectural changes; only the positional embedding scaling and training data composition differ. In testing, DeepSeek V3 maintains robust retrieval performance on inputs up to 128K tokens, including the "Needle in a Haystack" evaluation suite.
DeepSeek V3 was pre-trained on 14.8 trillion tokens drawn from a multilingual corpus weighted toward English and Chinese, with enhanced ratios of mathematical and programming content compared with V2.[1] The vocabulary uses 128,000 tokens with a byte-level fallback. Approximately 10 percent of training sequences use the Fill-in-Middle (FIM) strategy, which adds variety to the training objective by asking the model to reconstruct masked middle spans rather than only predicting the next token.
Key pre-training hyperparameters included:
| Hyperparameter | Value |
|---|---|
| Sequence length (pre-training) | 4,096 tokens |
| Optimiser | AdamW (beta1 = 0.9, beta2 = 0.95, weight decay 0.1) |
| Peak learning rate | 2.2e-4 (cosine decay to 7.3e-6) |
| Batch size (sequences) | 3,072 ramping up to 15,360 |
| Gradient clipping norm | 1.0 |
DeepSeek V3 was the first large-scale model to validate FP8 mixed-precision training at this parameter count.[1][18] FP8 stores each value in 8 bits rather than the 16 bits used by BF16, roughly halving memory bandwidth and storage for matrix operations.
The implementation uses the E4M3 format (4 bits exponent, 3 bits mantissa) across most tensor types. Rather than applying a single global scaling factor per tensor, DeepSeek uses fine-grained quantisation: activations use 1x128 quantisation tiles, and weights use 128x128 quantisation blocks. This finer granularity reduces the quantisation error that would otherwise accumulate across long training runs. Compute-intensive matrix multiplications run in FP8 with higher-precision accumulation in BF16 or FP32 to maintain numerical stability.[1]
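The benefit of fine-grained scaling is easiest to see with an outlier. The sketch below compares one scale per tensor against one scale per 1x128 tile on a toy activation matrix. It rests on several assumptions: `fake_e4m3` is a crude NumPy emulation of E4M3 rounding (3 mantissa bits, minimum exponent of -6, maximum magnitude 448), not a real FP8 kernel, and the matrix shape and the 1e5 outlier are made up for illustration.

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite E4M3 magnitude

def fake_e4m3(x):
    """Crude E4M3 emulation: clip, then round to 3 mantissa bits."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    # Exponent of each value, floored at -6 so tiny values use the subnormal step.
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0**-6)))
    step = 2.0 ** (exp - 3)
    return np.round(x / step) * step

def quantise(x, tile):
    """Quantise and dequantise a 2-D matrix with one scale per 1 x tile group."""
    rows, cols = x.shape
    groups = x.reshape(rows, cols // tile, tile)
    amax = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, amax / E4M3_MAX, 1.0)
    return (fake_e4m3(groups / scale) * scale).reshape(rows, cols)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 512))
x[0, 0] = 1e5                               # a single hypothetical outlier

per_tensor = quantise(x.reshape(1, -1), x.size).reshape(x.shape)
per_tile = quantise(x, 128)

err_tensor = np.abs(x - per_tensor).mean()
err_tile = np.abs(x - per_tile).mean()
print(f"mean abs error, one scale per tensor: {err_tensor:.4f}")
print(f"mean abs error, one scale per 1x128 tile: {err_tile:.4f}")
```

With a single global scale the outlier pushes every other value toward the bottom of the FP8 range; with per-tile scales only the tile that contains the outlier pays that cost.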
Certain sensitive operations, including embedding layers, output projections, attention score computations, and the gating routing in MoE layers, are kept in BF16 or FP32 throughout. Master weights (for the optimiser) are stored in FP32, and optimiser states (first- and second-moment terms in AdamW) are kept in BF16 to reduce memory footprint. Activation checkpoints cached for the backward pass are stored in FP8.
DeepSeek reports that this framework produced training loss curves with at most 0.25 percent relative loss error compared with BF16 baselines at scales up to 671 billion parameters, validating FP8 as a viable training format for frontier-scale models.[1][18]
Training ran on 2,048 Nvidia H800 GPUs.[1] The H800 is the export-compliant variant of the H100 that Nvidia sold to Chinese customers; it has the same nominal compute capacity per chip but lower NVLink interconnect bandwidth (a nominal 400 GB/s bidirectional versus 900 GB/s on the H100). For the training cluster itself, the technical report cites NVLink at 160 GB/s within each node and InfiniBand at 50 GB/s across nodes.
The parallelism strategy combined three layers:
| Parallelism layer | Configuration | Role |
|---|---|---|
| Pipeline parallelism (PP16) | 16-way, with the DualPipe scheduler | Splits transformer layers across GPUs and overlaps forward/backward phases with communication |
| Expert parallelism (EP64) | 64-way across 8 nodes | Distributes the 256 routed experts and overlaps cross-node all-to-all dispatch with compute |
| Data parallelism (ZeRO-1) | Optimiser-state sharding | Reduces redundant memory used to store optimiser state |
To work around the lower cross-node bandwidth of H800s, DeepSeek engineered a communication-compute overlap scheme. The DualPipe pipeline scheduler reduces "pipeline bubbles" (idle GPU time between micro-batch stages) and overlaps forward-pass compute for one microbatch with backward-pass compute and all-to-all communication for another, yielding what DeepSeek described as near-zero pipeline overhead.[1]
Custom dispatch and combine kernels for the all-to-all step adapt to both InfiniBand and NVLink bandwidth and limit GPU streaming-multiprocessor (SM) usage to roughly 20 SMs per device, leaving the rest of the GPU available for matrix-multiplication compute. A recomputation strategy further reduces memory pressure by recomputing certain layers (for example RMSNorm) during the backward pass rather than storing their activations.
The technical report states that the training ran to completion with no irrecoverable loss spikes and no rollbacks, which is noteworthy given that large training runs at this scale sometimes require intervention when numerical instability causes gradient explosions.[1]
The full training consumed 2.788 million H800 GPU hours, broken down as follows:[1]
| Stage | GPU hours |
|---|---|
| Pre-training (14.8T tokens) | 2,664,000 |
| Context length extension | 119,000 |
| Post-training (SFT + RL) | 5,000 |
| Total | 2,788,000 |
At an assumed rental price of $2 per H800 GPU-hour, the total works out to approximately $5.576 million.[1][19] This figure was widely reported in media coverage as "$5.5 million" or "$6 million" and became a focal point of the January 2025 market reaction.
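The headline figure can be reproduced directly from the table. The sketch below uses only numbers stated in this article; the wall-clock estimate additionally assumes that all 2,048 GPUs were busy for every stage, which the source does not state.

```python
GPU_HOURS = {
    "pre-training": 2_664_000,
    "context extension": 119_000,
    "post-training": 5_000,
}
PRICE_PER_GPU_HOUR = 2.00          # assumed rental price in USD, as in the report
CLUSTER_GPUS = 2_048
PRETRAIN_TOKENS_TRILLIONS = 14.8

total_hours = sum(GPU_HOURS.values())
total_cost = total_hours * PRICE_PER_GPU_HOUR
# Assumption: every stage ran on the full cluster with no idle time.
wall_clock_days = total_hours / CLUSTER_GPUS / 24
hours_per_trillion = GPU_HOURS["pre-training"] / PRETRAIN_TOKENS_TRILLIONS

print(f"total GPU hours: {total_hours:,}")
print(f"rental-equivalent cost: ${total_cost:,.0f}")
print(f"wall-clock estimate: {wall_clock_days:.1f} days")
print(f"pre-training GPU hours per trillion tokens: {hours_per_trillion:,.0f}")
```

Because the cost is a straight multiplication, the result is only as meaningful as the $2 rental assumption; a different hourly rate scales it linearly.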
The number is precise about what it measures and what it does not. The DeepSeek technical report explicitly states that "the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data."[1] It also does not include the capital cost of acquiring the 2,048 H800 GPUs (industry estimates of the cluster hardware run into the hundreds of millions of dollars), data acquisition and preparation, salaries, electricity, or the years of prior research on V1 and V2 that produced the techniques used in V3.[19][20] The figure is accurate as stated (rental-equivalent cost of running the final training job) but does not represent total R&D expenditure on the V3 program.
After pre-training and context extension, DeepSeek applies supervised fine-tuning followed by reinforcement learning. The post-training stage is responsible for transforming the base model into the instruction-tuned chat model that DeepSeek released as the primary checkpoint.[1]
The SFT phase used a curated instruction dataset of approximately 1.5 million examples spanning code, mathematics, role-play, factual question answering, and general knowledge. An important element of SFT is knowledge distillation from DeepSeek-R1: reasoning traces generated by an in-development R1 model are included in the SFT data, allowing V3 to internalise chain-of-thought reasoning patterns without going through a full RL-based reasoning training cycle. This is one reason the V3 chat model performs substantially better on math and coding benchmarks than the base model's raw pre-training would predict.[1]
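The RL phase combined two reward modelling approaches: rule-based rewards for questions whose answers can be checked mechanically, such as maths problems with a fixed final answer or code that can be run against test cases, and model-based rewards from a trained reward model for free-form responses.[1]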
DeepSeek used Group Relative Policy Optimisation (GRPO) for the RL update step. GRPO replaces the large critic network used in standard PPO with a group-based sampling procedure that estimates advantage values by comparing multiple sampled completions for the same prompt. This reduces memory requirements during RL and was first introduced by DeepSeek for DeepSeek-Math (arXiv:2402.03300) in early 2024.[21]
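The group-based advantage estimate at the heart of GRPO is short enough to show directly. The sketch below covers only that step, normalising each completion's reward against the other completions sampled for the same prompt; the clipped policy-ratio objective and KL penalty that surround it are omitted, and the example rewards are made up.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion relative to its own sampling group."""
    rewards = np.asarray(rewards, dtype=float)
    # No critic network: the group's mean and spread act as the baseline.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four hypothetical completions for one prompt, scored 1 if correct else 0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)

# If every completion gets the same reward, there is nothing to learn from.
print(group_relative_advantages([0.5, 0.5, 0.5]))
```

Completions that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model has to be trained or held in memory.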
The following table compares DeepSeek-V3 (chat, unless noted) with major contemporaneous models on standard benchmarks. Scores are from the DeepSeek technical report and third-party evaluations as of late 2024 and early 2025.[1]
| Benchmark | DeepSeek-V3 | GPT-4o | Claude 3.5 Sonnet | Llama 3.1 405B | Qwen 2.5 72B |
|---|---|---|---|---|---|
| MMLU (5-shot) | 88.5 | 87.2 | 88.3 | 88.6 | 86.1 |
| MMLU-Pro | 75.9 | 74.4 | 78.0 | 73.3 | 71.1 |
| GPQA Diamond | 59.1 | 53.6 | 65.0 | 51.1 | 49.0 |
| MATH-500 | 90.2 | 76.6 | 78.3 | 73.8 | 80.0 |
| AIME 2024 (pass@1) | 39.2% | 9.3% | 16.0% | n/a | n/a |
| HumanEval (pass@1) | 82.6% | 90.2% | 92.0% | 89.0% | 86.6% |
| LiveCodeBench | 29.2% | 32.9% | 36.3% | 29.9% | 31.1% |
| SWE-Bench Verified | 42.0% | 46.0% | 49.0% | n/a | n/a |
| BBH (3-shot) | 87.5 | 83.1 | 88.0 | 82.9 | 79.8 |
| Arena-Hard | 85.5 | 80.4 | 85.2 | n/a | n/a |
| AlpacaEval 2.0 | 70.0 | 57.5 | 52.0 | n/a | n/a |
| C-Eval (Chinese) | 90.1 | 76.0 | n/a | n/a | 86.6 |
DeepSeek-V3 led the open-weights field at launch on most of the benchmarks above, particularly on math (MATH-500, AIME) and Chinese-language tasks. On the coding benchmarks listed in the table (HumanEval, LiveCodeBench, SWE-Bench Verified), it trailed GPT-4o and Claude 3.5 Sonnet, reflecting the different strengths of each model's training mix. The math performance is notable because V3 achieves it without explicit "long chain-of-thought" prompting; the reasoning ability is instilled in the chat model through SFT distillation from R1.[1]
Generative speed was approximately 60 tokens per second on DeepSeek's own infrastructure, roughly three times faster than DeepSeek-V2, an improvement enabled by the MLA architecture's lower memory bandwidth demand during decoding.[22]
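DeepSeek V3 has been distributed under two licenses across its various releases, and the arrangement has changed over time:
| Release | Weights license | Code license |
|---|---|---|
| V3 (December 2024) | DeepSeek Model License Agreement (custom; commercial use permitted with restrictions) | MIT |
| V3-0324 (March 2025) and later V3-series checkpoints | MIT | MIT |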
The original December 2024 licensing arrangement placed the weights in the category sometimes called "open weights" rather than "open source." The March 2025 move to MIT made V3-0324 fully open source by OSI criteria. In practice, even the more restrictive original license did not prevent wide commercial adoption. DeepSeek V3 weights are accessible via Hugging Face (deepseek-ai/DeepSeek-V3) and the official GitHub repository, and they have been integrated into major inference platforms including vLLM, SGLang, LMDeploy, and TensorRT-LLM.[2]
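DeepSeek V3 has been distributed through multiple channels since its December 2024 release, including downloadable weights on Hugging Face, the official GitHub repository, and DeepSeek's own hosted API and chat interface.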
The sequence of events that triggered the "DeepSeek moment" began with the December 26, 2024 release of V3, accelerated with the January 20, 2025 release of DeepSeek-R1, and culminated on January 27, 2025 when US equity markets reopened after the weekend.[4][5][24] That morning, investors processed the implication that a Chinese lab had produced a model competitive with US frontier systems at a tiny fraction of the reported training cost, using hardware that was export-controlled to disadvantage Chinese AI development.
Nvidia's stock fell approximately 17 percent that day, erasing roughly $589 billion in market capitalisation, the largest single-session market cap loss in US stock market history at that point.[4][5] Broadcom fell about 17 percent, Micron about 12 percent, and AMD about 6 percent. The selling reflected a concern that if capable AI models could be trained and run on less hardware than the prevailing investment thesis assumed, demand for high-end AI chips might be lower and arrive slower than projected.
The DeepSeek chatbot also reached number 1 on the Apple App Store free apps chart in the United States by January 27, displacing ChatGPT, a milestone widely cited as evidence that Chinese AI had reached consumer-grade competitiveness.[25]
Nvidia's stock recovered substantially in the months that followed. Most analysts ultimately concluded the sell-off had overestimated the near-term hardware demand reduction, partly because DeepSeek's efficiency gains tended to increase total utilisation rather than substitute for it (Jevons paradox), and partly because frontier training continued to require massive GPU clusters regardless of per-token improvements.[26]
DeepSeek V3's release came against the backdrop of intensifying US-China competition in AI. The US government had tightened export controls on high-end Nvidia GPUs (A100 and H100) to Chinese buyers in October 2022, with further restrictions added in October 2023. DeepSeek trained V3 on H800s, the lower-bandwidth China-market variant that was permitted under the October 2022 rules and restricted by the October 2023 update.[1] The fact that a model trained on this deliberately constrained hardware matched US frontier systems trained on the latest H100s and H200s produced considerable debate about the effectiveness of chip export controls as a policy tool.
Some analysts argued DeepSeek's results showed the controls were too narrow (focusing on compute rather than other bottlenecks like algorithms and data). Others argued the results were partly enabled by stockpiling of chips before controls took effect, and that future tightening would be more consequential.[27] The US government moved to tighten controls further in early 2025, with the "AI Diffusion Rule" adding restrictions on exports to many additional countries, though the new administration in Washington partially rolled this back later in 2025.
A separate but related concern was whether DeepSeek had access to chips beyond H800s. SemiAnalysis and other industry observers estimated that High-Flyer and DeepSeek collectively had access to a much larger GPU pool than the 2,048-H800 cluster used for the V3 training run, including older A100s acquired before sanctions and possibly some H100s acquired through informal channels.[27] DeepSeek has not commented publicly on the broader inventory.
On March 24, 2025, DeepSeek released an updated checkpoint designated V3-0324, with improved post-training that borrowed reinforcement learning techniques validated in DeepSeek-R1.[7][13] The update improved performance on reasoning-heavy tasks without changing the underlying architecture. The release also migrated the model weights from the custom DeepSeek Model License to the MIT License, making V3-0324 fully open source under an OSI-approved license. Reported MATH-500 accuracy increased from 90.2 to roughly 94 percent depending on the evaluation harness; LiveCodeBench accuracy rose from 29.2 to 34.4 percent.[28] At the time of its release, independent evaluators placed V3-0324 as the leading non-reasoning open-weights model.
DeepSeek-V3.1 was released on August 21, 2025, representing a more substantial post-training improvement over V3-0324.[7] The base architecture (671B total, 37B active, 128K context) remained unchanged, but the model was retrained with an extended context expansion phase and a post-training pipeline that added "hybrid thinking": a single model checkpoint that can operate in both standard completion mode and a chain-of-thought reasoning mode activated via a chat template flag. V3.1 added strict function calling support and compatibility with the Anthropic Messages API format, expanding the infrastructure that could route to it without modification. Its thinking mode performed comparably to DeepSeek-R1 on reasoning tasks while generating responses faster on many queries.
On September 22, 2025, DeepSeek released DeepSeek-V3.1-Terminus, a refinement of V3.1 that improved coding agent and search agent behaviour, increased the consistency of output language for multilingual prompts, and refined the thinking mode's stopping behaviour to avoid runaway chain-of-thought.[7] Architectures and parameter counts remained unchanged from V3.1.
On September 29, 2025, DeepSeek released DeepSeek-V3.2-Exp, an experimental model centred on a new attention mechanism called DeepSeek Sparse Attention (DSA).[7][29] Standard self-attention has quadratic computational complexity in sequence length (O(L^2) where L is the number of tokens), which makes very long contexts expensive to process. DSA reduces this to approximately linear complexity (O(Lk) where k is the number of selected tokens and k is much smaller than L) by using a two-stage selection process: a "lightning indexer" first identifies relevant segments of the context, then a fine-grained token selection module within those segments retrieves the specific tokens to attend over.
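The core idea, attending only to a selected subset of tokens, can be sketched for a single query. This is a simplified illustration, not DSA itself: the two-stage selection is collapsed into one top-k step, `index_scores` is a stand-in for the output of the lightning indexer, causal masking is ignored, and the sequence length, head dimension, and both values of k are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparse_attention(q, keys, values, index_scores, k):
    """Attend to only the k keys that score highest under the indexer."""
    top = np.argpartition(index_scores, -k)[-k:]      # unordered top-k indices
    weights = softmax(keys[top] @ q / np.sqrt(q.shape[0]))
    return weights @ values[top], top

rng = np.random.default_rng(0)
L, d, k = 4096, 64, 256
q = rng.standard_normal(d)
keys = rng.standard_normal((L, d))
values = rng.standard_normal((L, d))
index_scores = keys @ q        # stand-in for a cheap learned indexer

out, top = sparse_attention(q, keys, values, index_scores, k)

# Query-key pairs scored by the main attention over a full context.
full_len, selected = 131_072, 2_048     # selected is a hypothetical k
print(f"dense pairs:  {full_len * full_len:,}")
print(f"sparse pairs: {full_len * selected:,} ({full_len // selected}x fewer)")
```

The count above covers only the main attention; the indexer still has to score every token for every query, so the overall saving depends on how cheap that scoring pass is.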
The practical effects included substantially lower compute cost for cache-miss requests and reduced inference latency on long documents. API pricing dropped by over 50 percent simultaneously with the V3.2-Exp release. The V3.2-Exp model weights, technical report, and GPU kernels (written in TileLang and CUDA) were published on Hugging Face and GitHub.
On December 1, 2025, DeepSeek promoted the experimental V3.2-Exp to a stable release, DeepSeek-V3.2, and simultaneously released DeepSeek-V3.2-Speciale, a reasoning-focused API-only variant.[30] V3.2 was the first DeepSeek model to integrate thinking directly into tool-use, supporting tool calls in both thinking and non-thinking modes. DeepSeek introduced a "massive agent training data synthesis method covering 1,800+ environments and 85,000+ complex instructions" for the post-training stage. V3.2-Speciale was positioned as a maximal-reasoning configuration that DeepSeek described as rivalling Gemini 3.0 Pro on reasoning benchmarks.
DeepSeek-R1, released on January 20, 2025, is built on the V3-Base checkpoint with a reasoning-focused RL pipeline that uses primarily rule-based rewards.[6] R1 inherits V3's MoE architecture, MLA attention, and 671B/37B parameter profile, but its post-training emphasises long chain-of-thought generation. The two models share infrastructure but serve different use cases: V3 is the general-purpose chat model and R1 is the reasoning model. The two were released close enough in time that they were often discussed together in the January 2025 reaction; the technical lineage runs V3-Base then forks into V3 (chat) and R1 (reasoning).
On April 24, 2026, DeepSeek released the preview of DeepSeek V4, available as two model variants: DeepSeek-V4-Pro and DeepSeek-V4-Flash.[7] The release came as both open-weights models on Hugging Face and via the DeepSeek API, with new model IDs deepseek-v4-pro and deepseek-v4-flash. DeepSeek labelled V4 as a preview release pending further refinement. The V4 family marks the end of incremental V3-series updates and the beginning of a new architectural generation, though DeepSeek has continued to maintain V3.2 endpoints alongside V4 during the preview phase.
DeepSeek V3 accelerated a trend toward open-weights frontier models. Before its release, the most capable openly available models (Llama 3.1 405B, Qwen 2.5 72B) trailed proprietary systems by a meaningful margin on many benchmarks. V3 closed much of that gap while remaining deployable by any organisation with enough aggregate GPU memory to hold the weights. This prompted increased attention from US AI labs to open-weights releases, and Meta's Llama 4 and subsequent open-weights releases from US, European, and Chinese labs were launched into a more competitive open-weights environment than their predecessors had faced.
The $5.5 million training cost figure became a reference point in debates about compute requirements for frontier AI, prompting reassessment of some "compute threshold" proposals in AI governance and export control policy, which had assumed that frontier capability required far higher expenditure.[31] At the same time, more careful analysis emphasised the distinction between final-run cost and total R&D cost, which the original figure does not capture.[19][20]
Technically, the auxiliary-loss-free load balancing mechanism, the fine-grained FP8 quantisation scheme, and the MTP training objective have all been adopted or studied by other groups. The DeepSeek-V3 technical report has accumulated thousands of citations within months of release and is among the most-cited LLM papers of 2024 and 2025.[1]
As of 2026, DeepSeek V3 is no longer the primary served checkpoint; the deepseek-chat API endpoint has been successively upgraded to V3-0324, V3.1, V3.1-Terminus, V3.2-Exp, V3.2, and (in preview) V4 over the course of 2025 and 2026. The original V3 weights from December 2024 remain available for download on Hugging Face for research reproducibility and historical interest.[2]
DeepSeek V3 exhibits limitations common to large language models as well as some specific to its design and origin.
Hallucination remains an issue across all tasks, particularly for factual claims about specific entities, dates, or statistics. DeepSeek's own evaluations and third-party red-teaming have found that V3 generates plausible but incorrect information with roughly the frequency expected of models in its capability class.
Content restrictions in the hosted API and chat interface are aligned with Chinese regulatory requirements. The hosted model refuses to answer a high proportion of questions about politically sensitive topics including Tiananmen Square, Taiwan's political status, Xinjiang, and criticism of Chinese leadership.[32] These restrictions apply to the hosted service; models run locally from downloaded weights may be fine-tuned to modify this behaviour, though users are legally and contractually responsible for such modifications.
Safety alignment against harmful outputs was weaker than comparable models from US labs at the time of original release. Research from Cisco and the University of Pennsylvania found that jailbreak attempts succeeded at higher rates on DeepSeek models than on GPT-4o and Claude 3.5 Sonnet. Subsequent post-training in V3.1 and later improved this to some degree.
Data privacy in the hosted service falls under Chinese legal jurisdiction. DeepSeek's servers are located in mainland China, and the company is subject to China's 2017 National Intelligence Law, which can compel assistance with national security matters. Several countries and organisations have prohibited use of the hosted service for sensitive or regulated workloads as a result, while continuing to use the locally deployed open-weight versions.
Deployment requires multi-GPU, and often multi-node, setups for full-precision inference. Although the model's MoE design activates only 37B parameters per token, the full weight set must be loaded somewhere in the cluster: at one byte per parameter, the native FP8 weights occupy roughly 685 GB including the MTP module, and a BF16 conversion roughly doubles that to about 1.4 TB.[2] This puts full-precision local deployment out of reach for organisations without multi-GPU infrastructure with significant aggregate VRAM, and even with optimised inference frameworks the latency profile favours high-throughput batch serving over real-time single-user interaction.