DeepSeek LLM
Last reviewed: Jun 3, 2026
Sources: 5 citations
Review status: Source-backed
Revision: v1 · 1,318 words
DeepSeek LLM is the first foundational large language model series released by the Chinese AI company DeepSeek. It was published on 29 November 2023 in two sizes, 7 billion and 67 billion parameters, each shipped as a pretrained base model and an aligned chat model [1][2]. The accompanying paper, "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism" (arXiv:2401.02954), appeared on 5 January 2024 and set out the team's own scaling-law study together with benchmark results in which the 67B model surpassed Llama 2 70B on code, mathematics, and reasoning [3]. The release marked the start of the model line that later produced DeepSeek-V2, DeepSeek-V3, and the R1 reasoning model.
DeepSeek grew out of High-Flyer, a Chinese quantitative hedge fund co-founded in February 2016 by Liang Wenfeng. High-Flyer had built large GPU clusters for trading research, and on 17 July 2023 it spun off a dedicated artificial-intelligence lab as the independent company DeepSeek, with High-Flyer as its principal backer [2]. DeepSeek LLM was the new company's first publicly released model family and arrived roughly four months after the spin-off. The "longtermism" in the paper title reflects the team's stated intention to treat open-source foundation models as a long-running research program rather than a one-off release, with scaling laws used to guide how future, larger models should be built.
The series uses a pre-norm decoder-only Transformer with RMSNorm normalization and a SwiGLU feed-forward network, broadly following the LLaMA design but favoring depth over width when scaling up (the 67B model has 95 layers, against 80 in Llama 2 70B). The 7B model uses standard multi-head attention, while the 67B model uses grouped-query attention (GQA) with 8 key-value heads to reduce inference memory. Both models use a 4096-token context window and a byte-level BPE tokenizer with a vocabulary of about 100,000 tokens (100,015 entries after special tokens, with the embedding table sized at 102,400 for training) [3].
| Variant | Parameters | Layers | Hidden size | Attention heads | Attention type | Context length | Training tokens |
|---|---|---|---|---|---|---|---|
| DeepSeek LLM 7B | 7B | 30 | 4096 | 32 | Multi-head attention | 4096 | 2.0T |
| DeepSeek LLM 67B | 67B | 95 | 8192 | 64 | Grouped-query attention (8 KV heads) | 4096 | 2.0T |
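To make the memory effect of grouped-query attention concrete, the following sketch estimates the key-value cache size for the 67B configuration in the table, once with its 8 KV heads and once with a hypothetical full multi-head layout (64 KV heads). It assumes 16-bit cache entries, a head dimension equal to hidden size divided by attention heads, and a single sequence. The function name is illustrative, and real inference stacks may add overhead that this estimate ignores.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor of 2 covers the separate key and value tensors.
    bytes_per_value=2 assumes fp16/bf16 storage (an assumption, not a
    figure from the article).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# 67B configuration from the table above.
N_LAYERS = 95
HIDDEN_SIZE = 8192
N_ATTENTION_HEADS = 64
N_KV_HEADS_GQA = 8
CONTEXT_LENGTH = 4096

head_dim = HIDDEN_SIZE // N_ATTENTION_HEADS  # 128

gqa_per_token = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS_GQA, head_dim)
# Hypothetical comparison: the same model with one KV head per query head.
mha_per_token = kv_cache_bytes_per_token(N_LAYERS, N_ATTENTION_HEADS, head_dim)

gqa_full_context_gib = gqa_per_token * CONTEXT_LENGTH / 2**30
mha_full_context_gib = mha_per_token * CONTEXT_LENGTH / 2**30

print(f"GQA: {gqa_full_context_gib:.2f} GiB per 4096-token sequence")
print(f"MHA: {mha_full_context_gib:.2f} GiB per 4096-token sequence")
print(f"Reduction factor: {mha_per_token // gqa_per_token}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, which is 8 for this configuration.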
Each size was released in two forms: a base model trained only with next-token prediction, and a chat model further tuned for instruction following and dialogue. All four checkpoints (7B base, 7B chat, 67B base, 67B chat) were published openly [1].
Both models were trained from scratch on a corpus of 2 trillion tokens, described in the paper as continuously expanding, drawn predominantly from English and Chinese text [3]. The data pipeline combined deduplication, filtering, and remixing stages, with sources spanning general web text, books, code, and mathematical content. The team reported that aggressive deduplication across the full Common Crawl corpus removed far more near-duplicate documents than deduplicating within a single dump, and they argued that this data-quality work was as important to final performance as raw scale.
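The difference between deduplicating within one dump and across all dumps can be illustrated with a toy example. The sketch below is not the team's pipeline: it uses exact-match hashing on normalized text rather than near-duplicate detection, and the corpus, dump names, and `dedup` helper are all hypothetical.

```python
import hashlib


def dedup(docs):
    """Drop exact repeats (after trimming and lowercasing), keeping first-seen order."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical toy corpus: three crawl dumps that share some pages.
dumps = {
    "dump_a": ["alpha page", "beta page", "alpha page"],
    "dump_b": ["alpha page", "gamma page", "beta page"],
    "dump_c": ["beta page", "delta page", "gamma page"],
}

total = sum(len(docs) for docs in dumps.values())

# Strategy 1: deduplicate each dump on its own, then concatenate.
kept_within = sum(len(dedup(docs)) for docs in dumps.values())

# Strategy 2: pool every dump first, then deduplicate once.
pooled = [doc for docs in dumps.values() for doc in docs]
kept_across = len(dedup(pooled))

removed_within = total - kept_within
removed_across = total - kept_across
print(f"removed within dumps: {removed_within}, across dumps: {removed_across}")
```

In this toy corpus most repeats sit in different dumps, so per-dump deduplication misses them while pooled deduplication catches them.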
A large part of the paper is devoted to an empirical scaling-law study used to decide the 7B and 67B configurations before committing to full training runs. Rather than measuring model scale by parameter count, the authors introduced a metric they call non-embedding FLOPs per token: the floating-point operations needed to process one token, excluding the embedding layers. They argued that this tracks compute cost more accurately than parameter count alone [3].
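As a rough illustration of the metric, the sketch below evaluates the closed-form approximation given in the paper, M = 72 · n_layer · d_model² + 12 · n_layer · d_model · l_seq, for the layer counts and hidden sizes in the table above. The formula is quoted from the paper rather than from this article, and it assumes a vanilla Transformer layout, so the outputs are illustrative and will not exactly match the released models (which use SwiGLU and, at 67B, grouped-query attention). Function names are hypothetical.

```python
def parameter_based_flops_per_token(n_layer: int, d_model: int) -> int:
    """Conventional 6N estimate using non-embedding parameters only.

    Ignores the cost of attention over the sequence.
    """
    return 72 * n_layer * d_model**2


def non_embedding_flops_per_token(n_layer: int, d_model: int, seq_len: int) -> int:
    """Non-embedding FLOPs/token: weight matmuls plus attention over seq_len tokens."""
    weight_term = 72 * n_layer * d_model**2
    attention_term = 12 * n_layer * d_model * seq_len
    return weight_term + attention_term


# Layer counts, hidden sizes, and context length taken from the table above.
configs = {
    "7B": {"n_layer": 30, "d_model": 4096, "seq_len": 4096},
    "67B": {"n_layer": 95, "d_model": 8192, "seq_len": 4096},
}

for name, cfg in configs.items():
    m = non_embedding_flops_per_token(**cfg)
    baseline = parameter_based_flops_per_token(cfg["n_layer"], cfg["d_model"])
    attention_share = (m - baseline) / m
    print(f"{name}: M = {m:.3e} FLOPs/token, attention share = {attention_share:.1%}")
```

The attention term is the part a parameter-count estimate leaves out, and its share is larger for the smaller model at the same context length.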
Using this metric, the team fit power laws that allocate a fixed compute budget between model size and training data. Their fitted relations were M_opt proportional to C^0.5243 for the optimal model scale and D_opt proportional to C^0.4757 for the optimal number of training tokens, where C is the compute budget. They also studied how the optimal batch size and learning rate change with compute, providing fitted formulas for both. A notable finding was that the optimal data-to-model allocation depends on the quality of the training data: higher-quality data shifts the compute budget toward larger models. The paper reports that the loss and downstream performance of the 7B and 67B runs landed close to what the scaling laws predicted, which the authors presented as evidence that the laws can guide larger future models.
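The two exponents sum to 1, which is consistent with treating compute as the product of model scale and training tokens. The sketch below uses only the exponents quoted above to show how a larger budget would be split. The fitted prefactors are not given in this article, so the code reports growth factors relative to a baseline rather than absolute sizes, and the function name is illustrative.

```python
MODEL_EXPONENT = 0.5243  # M_opt grows as C ** 0.5243
DATA_EXPONENT = 0.4757   # D_opt grows as C ** 0.4757


def growth_factors(compute_multiplier: float) -> tuple[float, float]:
    """Return (model-scale growth, data growth) when compute grows by the given factor."""
    model_growth = compute_multiplier**MODEL_EXPONENT
    data_growth = compute_multiplier**DATA_EXPONENT
    return model_growth, data_growth


for multiplier in (10, 100, 1000):
    model_growth, data_growth = growth_factors(multiplier)
    print(
        f"{multiplier:>5}x compute -> "
        f"{model_growth:.2f}x model scale, {data_growth:.2f}x training tokens"
    )
```

Because the model exponent is slightly above 0.5, each increase in compute goes a little more toward model scale than toward data under this fit.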
DeepSeek LLM 67B Base was positioned directly against Llama 2 70B Base. On the standard suite reported in the model documentation, the 67B base model led on knowledge, code, mathematics, reasoning, and Chinese benchmarks [1][3].
| Benchmark | Llama 2 70B Base | DeepSeek 67B Base |
|---|---|---|
| MMLU | 69.0 | 71.3 |
| BBH | 62.9 | 68.7 |
| HumanEval (Pass@1) | 28.7 | 42.7 |
| GSM8K | 58.4 | 63.4 |
| C-Eval | 51.4 | 66.1 |
| CMMLU | 53.1 | 70.8 |
The gaps were largest on code (HumanEval) and on the Chinese examinations C-Eval and CMMLU, where DeepSeek's bilingual training data gave it a clear advantage. The 67B chat model was reported at 73.78 on HumanEval Pass@1, 84.1 on GSM8K (0-shot), 32.6 on the MATH benchmark (0-shot), and 71.1 on MMLU [1].
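The ranking of gaps can be checked directly from the base-model table. The short script below recomputes each margin from the figures listed there; the variable names are illustrative.

```python
# (Llama 2 70B Base, DeepSeek 67B Base) scores copied from the table above.
scores = {
    "MMLU": (69.0, 71.3),
    "BBH": (62.9, 68.7),
    "HumanEval": (28.7, 42.7),
    "GSM8K": (58.4, 63.4),
    "C-Eval": (51.4, 66.1),
    "CMMLU": (53.1, 70.8),
}

# Margin in points, rounded to one decimal to avoid floating-point noise.
gaps = {
    name: round(deepseek - llama, 1)
    for name, (llama, deepseek) in scores.items()
}

# Largest margin first.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)

for name, gap in ranked:
    print(f"{name:<10} +{gap:.1f}")
```

The three largest margins are CMMLU (+17.7), C-Eval (+14.7), and HumanEval (+14.0), while MMLU shows the smallest at +2.3.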
For open-ended chat quality, the team used MT-Bench in English and AlignBench in Chinese. DeepSeek 67B Chat scored 8.35 on MT-Bench, and applying Direct Preference Optimization raised it to 8.76, slightly ahead of GPT-3.5-turbo at 8.39 and below GPT-4-1106-preview at 9.26 [3]. On AlignBench, the DPO-tuned 67B chat model scored 6.69 overall, ahead of ChatGPT and trailing only the GPT-4 variants on that leaderboard, which the authors cited as evidence that the model was especially strong in Chinese [3].
The base models were trained with a multi-step learning-rate schedule rather than cosine decay, which the team noted made it easier to resume and extend training when more data became available, a practical choice consistent with the "longtermism" framing. The chat models were produced by supervised fine-tuning (SFT) on roughly 1.5 million instruction examples followed by Direct Preference Optimization to improve helpfulness and reduce repetitive or low-quality generations [3]. The paper reports that DPO improved open-ended generation scores while leaving standard benchmark accuracy largely unchanged.
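A multi-step schedule of this kind can be sketched in a few lines. The breakpoints and factors below (2,000 warmup steps, a drop to 31.6% of the peak after 80% of training and to 10% after 90%) are the values reported in the paper, not in this article, so treat them as assumptions here. The linear warmup shape, the use of steps as a stand-in for tokens processed, and the function name are also assumptions of this sketch.

```python
def multi_step_lr(step: int, total_steps: int, max_lr: float,
                  warmup_steps: int = 2000) -> float:
    """Learning rate at a given step under a three-stage step schedule.

    Assumes a constant batch size, so the fraction of steps completed
    equals the fraction of training tokens processed.
    """
    if step < warmup_steps:
        # Linear warmup to the peak rate (the warmup shape is an assumption).
        return max_lr * (step + 1) / warmup_steps
    progress = step / total_steps
    if progress < 0.8:
        return max_lr           # stage 1: hold at the peak
    if progress < 0.9:
        return max_lr * 0.316   # stage 2: first step down
    return max_lr * 0.1         # stage 3: final step down


# Hypothetical run length and peak rate, chosen only for illustration.
TOTAL_STEPS = 100_000
PEAK_LR = 1e-3

for step in (0, 1_999, 50_000, 85_000, 95_000):
    print(f"step {step:>6}: lr = {multi_step_lr(step, TOTAL_STEPS, PEAK_LR):.2e}")
```

Because the first stage holds a constant rate, a checkpoint taken before the first drop can be continued with a longer schedule without replaying a decay curve, which is the resumption benefit described above.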
The model weights and inference code were published on GitHub (deepseek-ai/DeepSeek-LLM) and Hugging Face on 29 November 2023 [1]. The source code is released under the MIT License, while the model weights are governed by a separate DeepSeek model license that permits commercial use subject to a use-based restrictions appendix [1]. Releasing competitive 7B and 67B base and chat checkpoints under a commercially usable license made DeepSeek LLM one of the more openly available frontier-class model families at the time of its release.
DeepSeek LLM established the conventions that the company carried forward: open weights, detailed technical reports, bilingual training, and an emphasis on cost-efficient scaling. Later models changed the architecture substantially. DeepSeek-V2, released in May 2024, replaced standard attention with multi-head latent attention (MLA) and adopted a mixture-of-experts design for the feed-forward layers, and it was trained on 8.1 trillion tokens. DeepSeek-V3, released in December 2024, kept the MLA and MoE approach, added multi-token prediction, and was trained on 14.8 trillion tokens. The R1 reasoning model, released in January 2025, was initialized from DeepSeek-V3-Base and trained with reinforcement learning to elicit long chains of reasoning. Against that later work, the original DeepSeek LLM is a dense, LLaMA-style model, but it is the release that defined the team's open and scaling-law-driven approach.