DeepSeek-V3 is an open-source Mixture-of-Experts (MoE) large language model (LLM) developed by the DeepSeek-AI team. With 671 billion total parameters, of which 37 billion are activated per token, DeepSeek-V3 balances high performance across a wide range of tasks with a cost-effective training scheme. Notably, it integrates Multi-head Latent Attention (MLA) and a specialized MoE architecture called DeepSeekMoE, leveraging advances in load balancing and multi-token prediction.
DeepSeek-V3 is designed to push the boundaries of open-source large language models, achieving strong results in knowledge, code, and mathematical reasoning. The model was pre-trained on 14.8 trillion diverse tokens and subsequently underwent Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). It is released under an open-source license, with checkpoints publicly available at https://github.com/deepseek-ai/DeepSeek-V3.
Despite the large model size, DeepSeek-V3’s architectural and engineering optimizations enable its training to be completed with approximately 2.788 million H800 GPU hours. The total cost is estimated at $5.576 million (assuming $2 per GPU hour), which breaks down as:
2.664 million GPU hours for pre-training
119 thousand GPU hours for context-length extension (up to 128K tokens)
5 thousand GPU hours for post-training
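The stated totals can be checked with simple arithmetic:

```python
# Sanity-check the GPU-hour and cost figures stated above.
pretrain_h = 2_664_000    # pre-training GPU hours
extension_h = 119_000     # context-length extension GPU hours
posttrain_h = 5_000       # post-training GPU hours

total_h = pretrain_h + extension_h + posttrain_h
cost_usd = total_h * 2    # assumed rental price: $2 per H800 GPU hour

print(total_h)    # 2788000
print(cost_usd)   # 5576000
```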
DeepSeek-V3 follows the Transformer framework, extending it with:
Multi-head Latent Attention (MLA): Compresses key and value vectors into a low-dimensional latent representation. This approach lowers memory usage in the key-value cache for autoregressive generation.
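The cache saving can be illustrated with a toy numpy sketch; all shapes and weight names below are illustrative assumptions (the real MLA also handles RoPE dimensions separately):

```python
import numpy as np

# Minimal sketch of MLA's key-value compression: hidden states are projected
# down to a small latent vector, which is all that gets cached; keys and
# values are re-expanded from the latent when attention is computed.
rng = np.random.default_rng(0)

d_model, d_latent, n_heads, d_head = 64, 8, 4, 16   # toy sizes; d_latent << d_model

W_down = rng.standard_normal((d_model, d_latent)) * 0.1          # compress
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.1  # expand to keys
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.1  # expand to values

h = rng.standard_normal((10, d_model))   # hidden states for 10 tokens

kv_latent = h @ W_down                   # only this (10, d_latent) tensor is cached
k = kv_latent @ W_up_k                   # rebuilt on the fly during attention
v = kv_latent @ W_up_v

cache_full = h.size * 2                  # naive cache: full-width K and V
cache_mla = kv_latent.size               # MLA cache: the latent only
print(cache_mla / cache_full)            # fraction of the naive cache size
```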
DeepSeekMoE: Uses a large number of fine-grained experts to improve training cost-efficiency. Each MoE layer contains both routed experts (each token is directed to a subset of them based on learned affinity) and shared experts.
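A toy sketch of this routing pattern, with illustrative sizes and a simple sigmoid affinity (not the paper's exact gating function):

```python
import numpy as np

# Sketch of a DeepSeekMoE-style layer: shared experts process every token,
# while each token is additionally routed to its top-k routed experts by
# affinity score.
rng = np.random.default_rng(0)

n_routed, n_shared, top_k, d = 8, 1, 2, 16
experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_routed)]
shared = [rng.standard_normal((d, d)) * 0.1 for _ in range(n_shared)]
centroids = rng.standard_normal((n_routed, d))   # per-expert gating vectors

def moe_layer(x):
    scores = 1 / (1 + np.exp(-(centroids @ x)))      # token-expert affinities
    chosen = np.argsort(scores)[-top_k:]             # indices of top-k experts
    gates = scores[chosen] / scores[chosen].sum()    # normalized gate weights
    out = sum(W.T @ x for W in shared)               # shared experts: always on
    out += sum(g * experts[i].T @ x for g, i in zip(gates, chosen))
    return out, chosen

x = rng.standard_normal(d)
y, picked = moe_layer(x)
print(picked)   # which routed experts this token used
```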
Traditional MoE models rely on an auxiliary loss to prevent routing collapse. DeepSeek-V3 introduces an “auxiliary-loss-free” approach, adding a bias to the affinity scores for gating and adjusting them step-by-step to prevent any expert from becoming overloaded. This eliminates the performance penalty often seen with large auxiliary loss values.
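The balancing dynamic can be sketched as follows; the fixed step-size update used here is a simplifying assumption, not the paper's exact rule:

```python
import numpy as np

# Sketch of auxiliary-loss-free balancing: a per-expert bias is added to the
# affinity scores only for routing decisions (not for the gate values), and
# after each step the bias of overloaded experts is nudged down while that of
# underloaded experts is nudged up.
rng = np.random.default_rng(0)

n_experts, top_k, gamma = 4, 2, 0.01
bias = np.zeros(n_experts)

for step in range(200):
    scores = rng.random((32, n_experts))                     # token affinities
    routed = np.argsort(scores + bias, axis=1)[:, -top_k:]   # bias shifts routing
    load = np.bincount(routed.ravel(), minlength=n_experts)  # tokens per expert
    # push down the bias of busy experts, up for idle ones
    bias -= gamma * np.sign(load - load.mean())

print(bias)   # biases settle so that expert loads stay roughly balanced
```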
A small sequence-wise balance loss is additionally applied to prevent extreme imbalance within individual sequences, but it is set to a very low weight to minimize its impact on the model’s main objective.
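A hedged sketch of what such a sequence-wise term can look like; the normalization constants and the exact combination of selection fraction and mean affinity are assumptions for illustration:

```python
import numpy as np

# Sketch of a sequence-wise balance loss: within one sequence, penalize the
# product of how often each expert is selected (f) and the mean affinity the
# sequence assigns to it (P), scaled by a very small alpha.
rng = np.random.default_rng(0)

n_experts, top_k, seq_len, alpha = 4, 2, 16, 1e-4   # alpha kept very small

probs = rng.random((seq_len, n_experts))
probs /= probs.sum(axis=1, keepdims=True)           # normalized affinities
routed = np.argsort(probs, axis=1)[:, -top_k:]      # top-k selection per token

counts = np.bincount(routed.ravel(), minlength=n_experts)
f = counts * n_experts / (top_k * seq_len)          # scaled selection fraction
P = probs.mean(axis=0)                              # mean affinity per expert

balance_loss = alpha * float(np.sum(f * P))
print(balance_loss)   # tiny by design, so the main objective dominates
```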
Besides predicting the immediate next token, DeepSeek-V3 also predicts an additional future token at each position (multi-token prediction, MTP).
Implementation: Sequential MTP modules predict successive tokens (for example t+1, then t+2), each with its own output head while sharing the embedding layer and other components with the main model.
Benefits: Densifies training signals, boosts final performance on various benchmarks, and enables speculative decoding during inference.
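The densified training signal can be sketched with two output heads on a shared trunk; this flat two-head layout is a simplification of the paper's sequential MTP modules, and all sizes are illustrative:

```python
import numpy as np

# Toy sketch of multi-token prediction: each position contributes a loss term
# for the next token AND for the token after it, doubling the training signal
# per position.
rng = np.random.default_rng(0)

vocab, d, seq_len = 50, 16, 8
hidden = rng.standard_normal((seq_len, d))           # trunk output per position
head_next = rng.standard_normal((d, vocab)) * 0.1    # predicts token t+1
head_next2 = rng.standard_normal((d, vocab)) * 0.1   # predicts token t+2
targets = rng.integers(0, vocab, size=seq_len + 2)

def ce(logits, target):
    z = logits - logits.max()                        # stable cross-entropy
    return float(-z[target] + np.log(np.exp(z).sum()))

loss = 0.0
for t in range(seq_len):
    loss += ce(hidden[t] @ head_next, targets[t + 1])    # standard objective
    loss += ce(hidden[t] @ head_next2, targets[t + 2])   # densified extra signal
loss /= 2 * seq_len
print(loss)
```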
DeepSeek-V3 is trained on a cluster of 2,048 NVIDIA H800 GPUs. Each node has 8 GPUs connected via NVLink/NVSwitch, while nodes are interconnected via InfiniBand (IB).
Key parallelization strategies include 16-way Pipeline Parallelism (the DualPipe algorithm, which overlaps computation with communication), 64-way Expert Parallelism spanning 8 nodes, and ZeRO-1 Data Parallelism; costly tensor parallelism is avoided.
DeepSeek-V3 adopts a specialized low-precision framework to accelerate training. By default, compute-intensive matrix multiplications run in FP8, combined with carefully designed fine-grained quantization and higher-precision accumulation (BF16 or FP32) for stability. The approach significantly reduces both memory use and training time.
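The key ideas, per-tile scaling plus higher-precision accumulation, can be emulated in numpy; the tile size and the coarse rounding stand-in below are assumptions (real FP8 uses hardware e4m3/e5m2 formats):

```python
import numpy as np

# Sketch of fine-grained FP8-style quantization: each small tile gets its own
# scale so one outlier cannot wreck the whole tensor, values are rounded to a
# coarse grid standing in for FP8, and the matmul accumulates in float32.
rng = np.random.default_rng(0)

def quantize_tiles(x, tile=4, max_fp8=448.0):        # 448 = e4m3 max normal
    q = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(0, x.shape[1], tile):
            block = x[i, j:j + tile]
            s = np.abs(block).max() / max_fp8 or 1.0     # per-tile scale
            # stand-in for FP8 rounding: snap scaled values to a coarse grid
            q[i, j:j + tile] = np.round(block / s * 8) / 8 * s
    return q

a = rng.standard_normal((8, 16)).astype(np.float32)
b = rng.standard_normal((16, 8)).astype(np.float32)
qa = quantize_tiles(a)
qb = quantize_tiles(b.T).T                           # quantize along K as well

out = qa @ qb                                        # accumulate in float32
err = np.abs(out - a @ b).max()
print(err)   # small error despite the coarse storage format
```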
Recomputation Strategy: Recomputes certain layers (for example RMSNorm) in the backward pass to lower memory usage.
Low-Precision Optimizer States: Stores first- and second-moment terms in BF16 to reduce memory footprint.
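Since bfloat16 is float32 with the low 16 mantissa bits dropped, the storage format can be emulated by bit masking; the simple round-toward-zero truncation here is an illustrative stand-in:

```python
import numpy as np

# Sketch of storing Adam moments in BF16: the moments are truncated to
# bfloat16 precision after each update, halving optimizer-state memory, while
# the arithmetic itself stays in float32.
def to_bf16(x):
    bits = x.astype(np.float32).view(np.uint32)
    bits &= np.uint32(0xFFFF0000)        # drop the low 16 mantissa bits
    return bits.view(np.float32)

rng = np.random.default_rng(0)
grad = rng.standard_normal(1000).astype(np.float32)

m = np.zeros_like(grad)                  # first moment, kept in emulated BF16
v = np.zeros_like(grad)                  # second moment, kept in emulated BF16
beta1, beta2 = 0.9, 0.999

m = to_bf16(beta1 * m + (1 - beta1) * grad)
v = to_bf16(beta2 * v + (1 - beta2) * grad**2)

rel_err = np.abs(m - (1 - beta1) * grad).max() / np.abs(grad).max()
print(rel_err)   # BF16 keeps roughly 2-3 significant decimal digits
```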
Dispatch & Combine Kernels: Customized all-to-all kernels that adapt to both IB and NVLink bandwidth, limiting SM usage to only ~20 SMs per GPU.
DeepSeek-V3 is pre-trained on 14.8 trillion tokens featuring multilingual text (English, Chinese, etc.), mathematics, programming data, and more. It employs the Fill-in-the-Middle (FIM) strategy for ~10% of sequences, adding variety to the training objective. Hyper-parameters for pre-training include:
Sequence Length: 4K tokens (later extended up to 128K)
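The FIM transform itself is simple: split a document into prefix, middle, and suffix, then rearrange the pieces so the model learns to produce the middle given both sides. The sentinel token names below are placeholders, not necessarily DeepSeek-V3's actual vocabulary:

```python
import random

# Sketch of a Fill-in-the-Middle (FIM) data transform applied to ~10% of
# training sequences; the rest remain plain next-token prediction.
FIM_BEGIN, FIM_HOLE, FIM_END = "<fim_begin>", "<fim_hole>", "<fim_end>"

def to_fim(doc, rng, fim_rate=0.1):
    """With probability fim_rate, rewrite doc into prefix-suffix-middle order."""
    if rng.random() >= fim_rate:
        return doc                       # ~90% of sequences stay unchanged
    i, j = sorted(rng.sample(range(len(doc)), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"

rng = random.Random(0)
sample = to_fim("def add(a, b): return a + b", rng, fim_rate=1.0)
print(sample)
```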
After base pre-training, DeepSeek-V3’s context window is extended from 4K to 32K, then from 32K to 128K, using the YaRN method. This process preserves the model’s capabilities while enabling it to handle extremely long input sequences. In testing, DeepSeek-V3 maintains robust performance on tasks with inputs up to 128K tokens (for example the “Needle in a Haystack” test).
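The general idea behind YaRN-style rescaling can be sketched as follows: long-wavelength RoPE dimensions get their rotation frequencies divided by the extension factor, short-wavelength dimensions are left alone, and dimensions in between are blended with a ramp. The thresholds and linear ramp here are illustrative assumptions, not DeepSeek-V3's exact configuration:

```python
import numpy as np

# Hedged sketch of YaRN-style per-dimension RoPE frequency interpolation.
def yarn_freqs(d_head=64, base=10000.0, scale=8.0, orig_ctx=4096):
    inv_freq = base ** (-np.arange(0, d_head, 2) / d_head)   # standard RoPE
    wavelen = 2 * np.pi / inv_freq
    # ramp: 0 where wavelength << context (keep as-is),
    #       1 where wavelength >> context (fully interpolate)
    t = np.clip((wavelen - orig_ctx / 32) / (orig_ctx - orig_ctx / 32), 0, 1)
    return inv_freq * (1 - t) + (inv_freq / scale) * t

f_old = 10000.0 ** (-np.arange(0, 64, 2) / 64)
f_new = yarn_freqs()
print(f_new[0] / f_old[0])    # fast (high-frequency) dims unchanged
print(f_new[-1] / f_old[-1])  # slow dims slowed by the full scale factor
```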
Post-training involves two main phases:
A curated instruction dataset of roughly 1.5 million instances covers code, math, role-play, and knowledge Q&A; reasoning data distilled from DeepSeek-R1 contributes high reasoning accuracy. The SFT stage adapts the base model to user queries and a wide variety of instructions.
DeepSeek-V3 uses:
Rule-Based Reward Models: For tasks with hard-checkable correctness (for example math solutions, code test cases).
Model-Based Reward Models: For open-ended outputs where correctness cannot be checked mechanically.
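A rule-based reward needs no learned model at all; the answer formats and checks below are illustrative:

```python
# Sketch of rule-based rewards: correctness is computed by a rule rather than
# predicted by a reward model.
def math_reward(model_answer: str, reference: str) -> float:
    """Reward 1.0 iff the final stripped answer matches the reference exactly."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

def code_reward(func, test_cases) -> float:
    """Fraction of hidden test cases the generated function passes."""
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass                         # crashes simply earn no reward
    return passed / len(test_cases)

r1 = math_reward(" 42 ", "42")
r2 = code_reward(lambda a, b: a + b, [((1, 2), 3), ((2, 2), 5)])
print(r1, r2)   # 1.0 0.5
```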
Additionally, Group Relative Policy Optimization (GRPO) replaces a large critic model with group-based sampling to estimate advantages, which significantly improves alignment and generation quality.
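The critic-free advantage estimate can be sketched in a few lines; the reward values are made up for illustration:

```python
import numpy as np

# Sketch of GRPO's advantage estimation: sample a group of G responses to the
# same prompt, score each, and standardize rewards within the group instead
# of querying a learned critic.
rewards = np.array([0.0, 1.0, 1.0, 0.2, 0.8, 1.0, 0.0, 0.6])   # G = 8 samples

# group-relative advantage: reward minus group mean, scaled by group std
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

print(advantages.round(2))
# responses scored above the group mean get positive advantage and are
# reinforced; below-mean responses are pushed down, with no critic network.
```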
DeepSeek-V3 achieves state-of-the-art results among open-source LLMs, and often rivals or matches popular closed-source systems (for example GPT-4o and Claude-3.5). Highlights include:
Large Deployment Units: Efficient inference typically requires multi-node (multiple GPUs) setups, which can be resource-intensive for smaller organizations.
Throughput vs. Latency Balance: Despite major improvements (2× speedup vs. previous versions), further optimizations could be pursued for real-time user interactions.
According to the DeepSeek-AI team, future work for DeepSeek-V3 (and subsequent iterations) involves:
Further exploration of efficient architectures (potentially beyond Transformers)