DeepSeek LLM

Updated Jul 16, 2026

DeepSeek LLM is the first general-purpose large language model series released by the Chinese AI company DeepSeek. It was published on 29 November 2023 in two sizes, 7 billion and 67 billion parameters, each shipped as a pretrained base model and an aligned chat model ^[1]^[2]. The accompanying paper, "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism" (arXiv:2401.02954), appeared on 5 January 2024. It set out the team's own scaling-law study together with benchmark results in which the 67B base model surpassed Llama 2 70B Base on code, mathematics, and reasoning ^[3]. The release marked the start of the general-purpose model line that later produced DeepSeek-V2, DeepSeek-V3, and the R1 reasoning model.

Background

DeepSeek grew out of High-Flyer, a Chinese quantitative hedge fund co-founded in February 2016 by Liang Wenfeng. High-Flyer had built large GPU clusters for trading research, and on 17 July 2023 it spun off a dedicated artificial-intelligence lab as the independent company DeepSeek, with High-Flyer as its principal backer ^[2]. DeepSeek LLM was the new company's first general-purpose model family. It arrived roughly four months after the spin-off and a few weeks after the code-specialized DeepSeek Coder. The "longtermism" in the paper title reflects the team's stated intention to treat open-source foundation models as a long-running research program rather than a one-off release, with scaling laws used to guide how future, larger models should be built.

Model sizes

The series uses a pre-norm decoder-only Transformer with RMSNorm normalization and a SwiGLU feed-forward network, broadly following the LLaMA design. To reach 67B parameters, the team added layers rather than widening the feed-forward network. The 7B model uses standard multi-head attention, while the 67B model uses grouped-query attention (GQA) with 8 key-value heads to reduce inference memory. Both models use a 4096-token context window and a byte-level BPE tokenizer with a vocabulary of about 100,000 tokens (100,015 entries once special tokens are included, with the embedding table padded to 102,400 entries for training) ^[3].

Variant	Parameters	Layers	Hidden size	Attention heads	Attention type	Context length	Training tokens
DeepSeek LLM 7B	7B	30	4096	32	Multi-head attention	4096	2.0T
DeepSeek LLM 67B	67B	95	8192	64	Grouped-query attention (8 KV heads)	4096	2.0T
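
The memory saving from grouped-query attention can be estimated from the figures in the table. The following is a minimal sketch, assuming a 16-bit key-value cache and a per-head dimension equal to hidden size divided by the number of attention heads; the function name and the multi-head comparison case are hypothetical and used only for illustration.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of key/value cache held per token.

    The leading factor 2 counts one key vector and one value vector
    per layer and per key-value head. bytes_per_value=2 assumes fp16/bf16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


# 67B configuration from the table: 95 layers, hidden size 8192, 64 attention heads.
n_layers, hidden_size, n_heads = 95, 8192, 64
head_dim = hidden_size // n_heads  # assumed per-head dimension

# Hypothetical multi-head variant: one key-value head per attention head.
mha = kv_cache_bytes_per_token(n_layers, n_kv_heads=n_heads, head_dim=head_dim)
# Grouped-query attention as described above: 8 key-value heads.
gqa = kv_cache_bytes_per_token(n_layers, n_kv_heads=8, head_dim=head_dim)

context = 4096  # full context window
print(f"MHA: {mha * context / 2**30:.2f} GiB per full-context sequence")
print(f"GQA: {gqa * context / 2**30:.2f} GiB per full-context sequence")
print(f"Reduction factor: {mha // gqa}x")
```

Under these assumptions the cache shrinks by the ratio of attention heads to key-value heads, here 64 to 8. The estimate covers only the key-value cache, not weights or activations.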

Each size was released in two forms: a base model trained only with next-token prediction, and a chat model further tuned for instruction following and dialogue. All four checkpoints (7B base, 7B chat, 67B base, 67B chat) were published openly ^[1].

Training data

Both models were trained from scratch on a corpus of 2 trillion tokens, described in the paper as continuously expanding, drawn predominantly from English and Chinese text ^[3]. The data pipeline combined deduplication, filtering, and remixing stages, with sources spanning general web text, books, code, and mathematical content. The team reported that aggressive deduplication across the full Common Crawl corpus removed far more near-duplicate documents than deduplicating within a single dump, and they argued that this data-quality work was as important to final performance as raw scale.
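
The difference between deduplicating inside each dump and deduplicating across all dumps can be illustrated with a toy example. The sketch below is not the team's pipeline: it uses exact hashing of normalized text as a stand-in for near-duplicate detection, and the dump names and documents are hypothetical.

```python
import hashlib


def fingerprint(text):
    """Hash of lowercased, whitespace-normalized text (a stand-in for a real near-duplicate signature)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_within_each_dump(dumps):
    """Deduplicate every dump independently; copies shared between dumps survive."""
    kept = []
    for docs in dumps.values():
        seen = set()  # reset for each dump
        for doc in docs:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
    return kept


def dedup_across_dumps(dumps):
    """Deduplicate with a single fingerprint set shared by all dumps."""
    seen, kept = set(), []
    for docs in dumps.values():
        for doc in docs:
            fp = fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
    return kept


# Hypothetical toy corpus: the same pages are crawled again in later dumps.
dumps = {
    "dump_a": ["Hello world", "hello   WORLD", "Scaling laws primer"],
    "dump_b": ["Hello world", "Scaling laws primer", "GQA explained"],
    "dump_c": ["GQA explained", "Hello world"],
}

total = sum(len(docs) for docs in dumps.values())
within = dedup_within_each_dump(dumps)
across = dedup_across_dumps(dumps)
print(f"{total} documents -> {len(within)} (per-dump) vs {len(across)} (cross-dump)")
```

Because web pages tend to reappear from one crawl to the next, the shared fingerprint set removes copies that per-dump deduplication cannot see.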

Scaling laws

A large part of the paper is devoted to an empirical scaling-law study used to decide the 7B and 67B configurations before committing to full training runs. Rather than measuring model scale by parameter count, the authors introduced a metric they call non-embedding FLOPs per token, which counts the floating-point operations needed to process one token while excluding the embedding layers, and they argued this gives a more accurate compute measure than parameter count alone ^[3].
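
To make the metric concrete, the sketch below assumes the definition M = 72 · n_layer · d_model² + 12 · n_layer · d_model · l_seq, with the conventional parameter-based estimate 6N taken as 72 · n_layer · d_model². Neither formula is quoted in this article, so both should be checked against the paper. The formulas also assume a plain Transformer block, so the values for the released SwiGLU models are approximate.

```python
def non_embedding_flops_per_token(n_layer, d_model, seq_len):
    """Assumed form of the metric: M = 72*n_layer*d_model^2 + 12*n_layer*d_model*seq_len."""
    return 72 * n_layer * d_model**2 + 12 * n_layer * d_model * seq_len


def param_based_flops_per_token(n_layer, d_model):
    """Conventional 6N estimate, with N approximated as 12*n_layer*d_model^2 non-embedding parameters."""
    return 6 * 12 * n_layer * d_model**2


# Layer counts, hidden sizes, and context length are taken from the model table.
configs = {
    "7B": dict(n_layer=30, d_model=4096, seq_len=4096),
    "67B": dict(n_layer=95, d_model=8192, seq_len=4096),
}

results = {}
for name, cfg in configs.items():
    m = non_embedding_flops_per_token(**cfg)
    six_n = param_based_flops_per_token(cfg["n_layer"], cfg["d_model"])
    results[name] = (m, six_n)
    # The difference is the sequence-length-dependent attention term that 6N ignores.
    attention_share = (m - six_n) / m
    print(f"{name}: M = {m:.3e} FLOPs/token, attention term = {attention_share:.1%} of M")
```

The second term grows with sequence length, which is why a parameter-only estimate understates compute more for smaller models at long context.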

Using this metric, the team fit power laws that allocate a fixed compute budget between model size and training data. Their fitted relations were M_opt ∝ C^0.5243 for the optimal model scale and D_opt ∝ C^0.4757 for the optimal number of training tokens, where C is the compute budget. They also studied how the optimal batch size and learning rate change with compute, providing fitted formulas for both. A notable finding was that the optimal data-to-model allocation depends on the quality of the training data: higher-quality data shifts the compute budget toward larger models. The paper reports that the loss and downstream performance of the 7B and 67B runs landed close to what the scaling laws predicted, which the authors presented as evidence that the laws can guide larger future models.
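
The practical meaning of the two exponents is easiest to see as growth factors. The sketch below uses only the exponents quoted above and assumes the compute budget factors as C = M · D, which is why the exponents sum to one. The fitted coefficients are not reproduced here, so the code reports relative growth rather than absolute model or data sizes.

```python
M_EXP = 0.5243  # exponent for optimal model scale (non-embedding FLOPs/token)
D_EXP = 0.4757  # exponent for optimal number of training tokens


def allocation_growth(compute_ratio):
    """Return (model-scale growth, data growth) when compute is multiplied by compute_ratio."""
    return compute_ratio**M_EXP, compute_ratio**D_EXP


for ratio in (10, 100, 1000):
    m_growth, d_growth = allocation_growth(ratio)
    print(f"{ratio:>5}x compute -> model scale x{m_growth:.2f}, tokens x{d_growth:.2f}")
```

Because the model-scale exponent is slightly above one half, each increase in compute is split a little in favor of model scale over training tokens.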

Benchmarks

DeepSeek LLM 67B Base was positioned directly against Llama 2 70B Base. On the standard suite reported in the model documentation, the 67B base model led on knowledge, code, mathematics, reasoning, and Chinese benchmarks ^[1]^[3].

Benchmark	Llama 2 70B Base	DeepSeek 67B Base
MMLU	69.0	71.3
BBH	62.9	68.7
HumanEval (Pass@1)	28.7	42.7
GSM8K	58.4	63.4
C-Eval	51.4	66.1
CMMLU	53.1	70.8

The gaps were largest on code (HumanEval) and on the Chinese examinations C-Eval and CMMLU, where DeepSeek's bilingual training data gave it a clear advantage. The 67B chat model was reported at 73.78 on HumanEval Pass@1, 84.1 on GSM8K (0-shot), 32.6 on the MATH benchmark (0-shot), and 71.1 on MMLU ^[1].
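
The ranking of gaps can be checked directly from the base-model table. The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# (Llama 2 70B Base, DeepSeek 67B Base) scores copied from the table above.
scores = {
    "MMLU": (69.0, 71.3),
    "BBH": (62.9, 68.7),
    "HumanEval": (28.7, 42.7),
    "GSM8K": (58.4, 63.4),
    "C-Eval": (51.4, 66.1),
    "CMMLU": (53.1, 70.8),
}

# Absolute gap in points, rounded to one decimal to avoid floating-point noise.
gaps = {name: round(deepseek - llama, 1) for name, (llama, deepseek) in scores.items()}

# Sort benchmarks from largest to smallest gap.
ranked = sorted(gaps.items(), key=lambda item: item[1], reverse=True)
for name, gap in ranked:
    print(f"{name:<10} +{gap:.1f}")
```

The three largest differences are on CMMLU, C-Eval, and HumanEval, in that order.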

For open-ended chat quality, the team used MT-Bench in English and AlignBench in Chinese. DeepSeek 67B Chat scored 8.35 on MT-Bench, and applying Direct Preference Optimization raised it to 8.76, slightly ahead of GPT-3.5-turbo at 8.39 and below GPT-4-1106-preview at 9.26 ^[3]. On AlignBench, the DPO-tuned 67B chat model scored 6.69 overall, ahead of ChatGPT and trailing only the GPT-4 variants on that leaderboard, which the authors cited as evidence that the model was especially strong in Chinese ^[3].

Training and alignment

The base models were trained with a multi-step learning-rate schedule rather than cosine decay, which the team noted made it easier to resume and extend training when more data became available, a practical choice consistent with the "longtermism" framing. The chat models were produced by supervised fine-tuning (SFT) on roughly 1.5 million instruction examples followed by Direct Preference Optimization to improve helpfulness and reduce repetitive or low-quality generations ^[3]. The paper reports that DPO improved open-ended generation scores while leaving standard benchmark accuracy largely unchanged.
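
The resumption argument can be illustrated by comparing a multi-step schedule with cosine decay. The sketch below is a simplified illustration, not the team's training code. The breakpoints (a 2,000-step warmup, a drop to 31.6% of the peak after 80% of training, and to 10% after 90%), the 4.2e-4 peak rate, and the step counts are assumptions not stated in this article, and training progress is measured in steps as a proxy for tokens.

```python
import math


def multi_step_lr(step, total_steps, max_lr, warmup_steps=2000):
    """Linear warmup, constant plateau, then two step-downs (assumed breakpoints)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = step / total_steps
    if progress < 0.8:
        return max_lr          # plateau: independent of total_steps
    if progress < 0.9:
        return max_lr * 0.316  # first step-down
    return max_lr * 0.1        # second step-down


def cosine_lr(step, total_steps, max_lr, warmup_steps=2000, min_ratio=0.1):
    """Linear warmup followed by cosine decay to min_ratio * max_lr, for comparison."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * (min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress)))


max_lr = 4.2e-4  # assumed peak learning rate
step = 50_000    # hypothetical checkpoint taken on the plateau

for planned_total in (100_000, 150_000):
    print(
        f"planned {planned_total} steps: "
        f"multi-step lr = {multi_step_lr(step, planned_total, max_lr):.2e}, "
        f"cosine lr = {cosine_lr(step, planned_total, max_lr):.2e}"
    )
```

On the plateau the multi-step rate does not depend on the planned run length, so a checkpoint taken there can be continued toward a longer horizon without rewriting its learning-rate history. The cosine rate at the same step changes as soon as the horizon changes.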

Release and licensing

The model weights and inference code were published on GitHub (deepseek-ai/DeepSeek-LLM) and Hugging Face on 29 November 2023 ^[1]. The source code is released under the MIT License, while the model weights are governed by a separate DeepSeek model license that permits commercial use subject to a use-based restrictions appendix ^[1]. Releasing 7B and 67B base and chat checkpoints that were competitive with Llama 2 under a commercially usable license made DeepSeek LLM one of the more openly available model families of its size at the time of its release.

Place in the DeepSeek lineage

DeepSeek LLM established the conventions that the company carried forward: open weights, detailed technical reports, bilingual training, and an emphasis on cost-efficient scaling. Later models changed the architecture substantially. DeepSeek-V2, released in May 2024, replaced standard attention with multi-head latent attention (MLA) and adopted a mixture-of-experts design for the feed-forward layers, and it was trained on 8.1 trillion tokens. DeepSeek-V3, released in December 2024, kept the MLA and MoE approach, added multi-token prediction, and was trained on 14.8 trillion tokens. The R1 reasoning model, released in January 2025, was initialized from DeepSeek-V3-Base and trained with reinforcement learning to elicit long chains of reasoning. Against that later work, the original DeepSeek LLM is a dense, LLaMA-style model, but it is the release that defined the team's open and scaling-law-driven approach.

References

[1] DeepSeek-AI, "DeepSeek LLM (GitHub repository)", https://github.com/deepseek-ai/DeepSeek-LLM
[2] "DeepSeek", Wikipedia, https://en.wikipedia.org/wiki/DeepSeek
[3] Xiao Bi et al., "DeepSeek LLM: Scaling Open-Source Language Models with Longtermism", arXiv:2401.02954, https://arxiv.org/abs/2401.02954
[4] DeepSeek-AI, "deepseek-llm-67b-base (model card)", Hugging Face, https://huggingface.co/deepseek-ai/deepseek-llm-67b-base
[5] DeepSeek-AI, "deepseek-llm-67b-chat (model card)", Hugging Face, https://huggingface.co/deepseek-ai/deepseek-llm-67b-chat