DeepSeek (Chinese: 杭州深度求索人工智能基础技术研究有限公司; commonly DeepSeek AI or simply DeepSeek) is a Chinese artificial intelligence company known for developing large language models (LLMs) and releasing several prominent open-source and research models. Founded in 2023 by hedge fund entrepreneur Liang Wenfeng, the company has gained international recognition for achieving competitive performance with leading Western AI models at dramatically lower training costs.[1][2]
DeepSeek rose to global prominence in January 2025 when its mobile app briefly topped the Apple App Store's free charts in the United States, following the release of its reasoning-focused DeepSeek-R1 models. The company's claim of training competitive models for under $6 million using Nvidia H800 GPUs, compared to over $100 million for Western equivalents, caused significant market disruption, with Nvidia losing nearly $600 billion in market capitalization.[3][4]
History
Background and Origins (2016–2023)
DeepSeek's origins trace back to High-Flyer Capital Management, a Chinese quantitative hedge fund co-founded in February 2016 by Liang Wenfeng and two classmates from Zhejiang University.[1] High-Flyer began adopting deep learning models for stock trading on October 21, 2016, transitioning from CPU-based linear models to GPU-dependent systems. By 2021, the fund relied exclusively on AI for trading operations.[5]
In 2019, High-Flyer built its first computing cluster, Fire-Flyer (萤火一号), at a cost of 200 million yuan, equipped with 1,100 GPUs. Anticipating U.S. export restrictions on advanced chips to China, Liang acquired 10,000 Nvidia A100 units before restrictions took effect. Construction of Fire-Flyer 2 (萤火二号) began in 2021 with a 1 billion yuan budget, incorporating 5,000 PCIe A100 GPUs across 625 nodes by 2022.[5][6]
Founding and Early Development (2023–2024)
On April 14, 2023, High-Flyer announced the establishment of an artificial general intelligence (AGI) research lab. This lab was formally incorporated as DeepSeek on July 17, 2023, with High-Flyer serving as the principal investor. Venture capital firms were initially reluctant to invest, considering the lack of short-term exit opportunities.[1][7]
The company released its first model, DeepSeek Coder, on November 2, 2023, followed by the DeepSeek-LLM series on November 29, 2023. Throughout 2024, DeepSeek continued releasing specialized models:
January 2024: DeepSeek-MoE models (Base and Chat variants)
April 2024: DeepSeek-Math models (Base, Instruct, and RL)
In December 2024, DeepSeek released DeepSeek-V3, featuring a Mixture of Experts architecture with 671 billion total parameters. On January 20, 2025, the company announced DeepSeek-R1, a reasoning-focused model trained primarily with reinforcement learning that matched the performance of OpenAI's o1 models at significantly lower cost.[8][9]
Rise to Global Prominence (2025)
DeepSeek's mobile app reached #1 among free apps on the U.S. Apple App Store on January 27–28, 2025. This surge coincided with an 18% drop in Nvidia's share price and over $1 trillion erased from U.S. tech market capitalization. Prominent tech investor Marc Andreessen described this as "AI's Sputnik moment."[3][10][4]
On January 27–28, 2025, DeepSeek reported large-scale malicious attacks on its services, temporarily restricting new sign-ups.[11]
Architecture
Mixture of Experts
DeepSeek's models employ a Mixture of Experts (MoE) architecture, which allows massive parameter counts while maintaining computational efficiency. The MoE framework in DeepSeek-V3 consists of:[12][13]
671 billion total parameters
37 billion activated parameters per forward pass
256 routed experts per layer (increased from 160 in V2)
1 shared expert per layer that is always activated
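The routing scheme described above can be sketched as follows. This is an illustrative example, not DeepSeek's actual implementation: the expert functions, router weights, and top-k value are hypothetical, and the shared expert is modeled simply as a function that always contributes to the output.

```python
import numpy as np

def moe_layer(x, routed_experts, shared_expert, router_w, top_k=8):
    """Sketch of a DeepSeek-V3-style MoE layer: many routed experts,
    one always-on shared expert, and only top_k routed experts
    activated per token (illustrative, not the real implementation)."""
    # Router scores the token against every routed expert.
    scores = x @ router_w                      # shape: (num_experts,)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Keep only the top_k experts; renormalize their gate weights.
    top = np.argsort(probs)[-top_k:]
    gate = probs[top] / probs[top].sum()
    # The shared expert always fires; routed experts add weighted terms.
    out = shared_expert(x)
    for g, idx in zip(gate, top):
        out = out + g * routed_experts[idx](x)
    return out

# Toy demo: 256 routed "experts" that output zeros, identity shared expert.
d = 16
experts = [lambda v: np.zeros_like(v)] * 256
shared = lambda v: v
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
w = rng.standard_normal((d, 256))
y = moe_layer(x, experts, shared, w)
# With zero-output routed experts, only the shared-expert path remains.
assert np.allclose(y, x)
```

The design choice the sketch highlights: with 256 routed experts but only a handful activated per token, most of the 671 billion parameters sit idle on any given forward pass, which is how the 37-billion activated-parameter figure arises.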
Multi-head Latent Attention
DeepSeek-V2 and subsequent models incorporate Multi-head Latent Attention (MLA), a modified attention mechanism that compresses the key-value (KV) cache. MLA achieves:[2][14]
KV-cache reduction to 5–13% of traditional methods
Significant memory overhead reduction during inference
Support for 128K–164K token context windows
Lower computational cost for long-context processing
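The core idea behind the cache reduction can be illustrated with a low-rank projection: instead of caching full per-head keys and values for every token, only a small shared latent vector is cached, and keys and values are reconstructed from it at attention time. The dimensions below are hypothetical, chosen only to land inside the reported 5–13% range, and do not reflect DeepSeek's actual configuration.

```python
import numpy as np

# Hypothetical dimensions for illustration only.
d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02   # compression
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

h = rng.standard_normal(d_model)     # one token's hidden state

# Standard attention caches K and V per head: 2 * n_heads * d_head floats.
full_cache = 2 * n_heads * d_head    # 2048 floats per token

# MLA-style caching stores only the shared latent (d_latent floats),
# reconstructing K and V on the fly via the up-projections.
latent = h @ W_down                  # this is what gets cached
k = latent @ W_up_k                  # recomputed at attention time
v = latent @ W_up_v

print(d_latent / full_cache)         # 0.0625: cache is 6.25% of baseline
```

In this toy configuration the per-token cache shrinks to 6.25% of the standard KV cache, consistent with the 5–13% range reported for MLA.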
Training Methodology
DeepSeek-R1 employs a distinctive training pipeline:[8][15]
1. Cold Start Phase: Fine-tuning base model with curated chain-of-thought reasoning examples
2. Reasoning-Oriented Reinforcement Learning: Large-scale RL focusing on rule-based evaluation tasks
3. Supervised Fine-Tuning: Combining reasoning and non-reasoning data
4. RL for All Scenarios: Final refinement for helpfulness and harmlessness
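The "rule-based evaluation tasks" in stage 2 can be made concrete with a small sketch: rather than a learned reward model, the reward is computed from checkable rules such as output format and answer correctness. The specific rules, tags, and reward values below are hypothetical illustrations, not DeepSeek's published reward function.

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Hypothetical rule-based reward of the kind used in
    reasoning-oriented RL: no learned reward model, only checkable rules.
    Format rule: reasoning must appear inside <think>...</think> tags.
    Accuracy rule: the final boxed answer must match the reference."""
    reward = 0.0
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.1                    # format reward (assumed value)
    m = re.search(r"\\boxed\{([^}]*)\}", response)
    if m and m.group(1).strip() == gold_answer:
        reward += 1.0                    # accuracy reward (assumed value)
    return reward

good = "<think>2 + 2 is four</think> The answer is \\boxed{4}"
assert abs(rule_based_reward(good, "4") - 1.1) < 1e-9
assert rule_based_reward("just 4", "4") == 0.0
```

Because such rewards are computed mechanically, they scale to the very large numbers of rollouts that stage 2's reinforcement learning requires, with no human labeling in the loop.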
DeepSeek Sparse Attention (DSA)
Introduced in DeepSeek-V3.2-Exp (September 2025), DSA is a fine-grained sparse attention mechanism optimized for long-context training and inference efficiency with minimal performance impact.[16]
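The general idea of fine-grained sparse attention can be sketched as follows: for each query, candidate keys are scored and only the top-k highest-scoring keys participate in attention, cutting per-query cost from all L keys to k keys. This is a generic illustration of the technique, not DSA's actual selection mechanism or kernel design.

```python
import numpy as np

def sparse_attention(q, K, V, k_keep=64):
    """Generic fine-grained sparse attention sketch (not DSA itself):
    each query attends only to the k_keep highest-scoring keys
    instead of all L keys in the context."""
    scores = K @ q / np.sqrt(q.size)          # (L,) attention logits
    keep = np.argsort(scores)[-k_keep:]       # fine-grained top-k selection
    s = scores[keep]
    w = np.exp(s - s.max())
    w /= w.sum()                              # softmax over kept keys only
    return w @ V[keep]                        # weighted mix of kept values

rng = np.random.default_rng(0)
L, d = 4096, 64
q = rng.standard_normal(d)
K = rng.standard_normal((L, d))
V = rng.standard_normal((L, d))
out = sparse_attention(q, K, V)               # attends to 64 of 4096 keys
assert out.shape == (d,)
```

At a 128K-token context, attending to a fixed small set of keys per query is what turns quadratic attention cost into something close to linear, which is the efficiency DSA targets for long-context training and inference.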
DeepSeek-OCR (2025)
In October 2025, DeepSeek released DeepSeek-OCR, an open-source end-to-end document OCR and understanding system that explores “contexts optical compression”—representing long text as images and decoding it back with a vision–language stack to save tokens for long-context LLM applications.[17][18]
Architecture: A ~380M-parameter DeepEncoder (SAM-base window attention → 16× token compression via 2-layer conv → CLIP-large global attention) feeds a 3B MoE decoder (DeepSeek-3B-MoE-A570M; ~570M active params at inference). Multiple resolution modes control vision-token budgets: Tiny (64 tokens, 512²), Small (100, 640²), Base (256, 1024²), Large (400, 1280²), plus tiled Gundam (n×100 + 256 tokens) and Gundam-M modes for ultra-high-res pages.[17]
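The vision-token budgets of the resolution modes above can be tabulated directly; the tile geometry in the comment is an assumption based on the mode descriptions (n local tiles at the Small budget plus one Base-resolution global view), not a confirmed detail of the implementation.

```python
# Vision-token budgets per resolution mode, as reported in the paper.
modes = {"Tiny": 64, "Small": 100, "Base": 256, "Large": 400}

def gundam_tokens(n_tiles: int) -> int:
    # Tiled Gundam mode: n local tiles at 100 tokens each (Small budget),
    # plus one global view at 256 tokens (Base budget). Tile geometry
    # here is an assumption for illustration.
    return n_tiles * 100 + 256

# A page split into 4 tiles costs 656 vision tokens, still well below
# a typical 1,000+ token text transcription of a dense page.
assert gundam_tokens(4) == 656
```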
Reported compression/accuracy: On a Fox benchmark subset (English pages with 600–1,300 text tokens), the paper reports ≈97% decoding precision when text tokens are <10× vision tokens, and ~60% accuracy around 20× compression. On OmniDocBench (edit distance; lower is better), Small (100 tokens) outperforms GOT-OCR 2.0 (256 tokens), and Gundam (<~800 tokens) surpasses MinerU-2.0 (~6,790 tokens) in the reported setup.[17]
Throughput/uses: DeepSeek positions the system as a data engine for LLM/VLM pretraining, claiming more than 200,000 pages per day on a single A100-40G and scalability to tens of millions of pages per day on clusters, plus "deep parsing" of charts, chemical structures (SMILES), and planar geometry into structured outputs (for example, HTML tables or dictionaries).[18]
Availability/ecosystem: Source code and weights are hosted on GitHub and Hugging Face, with examples for Transformers/vLLM inference. Community walkthroughs (for example Simon Willison) documented running the 6.6-GB model on diverse hardware and shared setup notes.[19][20][21]
Infrastructure
DeepSeek has operated two primary computing clusters:[5]
Fire-Flyer 1 (萤火一号): Built in 2019, retired after one and a half years
Fire-Flyer 2 (萤火二号): Operational since 2022, featuring:
Nvidia GPUs with 200 Gbps interconnects
Fat tree topology for high bisection bandwidth
3FS distributed file system with Direct I/O and RDMA
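The benefit of a fat-tree topology can be shown with a back-of-the-envelope calculation using figures from this article (625 nodes, 200 Gbps links); the assumption of an ideal non-blocking fat tree is an idealization for illustration.

```python
# Back-of-the-envelope bisection bandwidth for a non-blocking fat tree.
# Node count and link speed are taken from this article; the topology
# is idealized for illustration.
nodes = 625                 # Fire-Flyer 2 node count
link_gbps = 200             # per-node interconnect speed (Gbps)

# In an ideal (non-blocking) fat tree, any half of the nodes can talk
# to the other half at full line rate, so bisection bandwidth is
# (nodes / 2) * link speed.
bisection_gbps = (nodes // 2) * link_gbps
print(bisection_gbps)       # 62400 Gbps across the bisection
```

High bisection bandwidth matters for training workloads like all-to-all expert routing in MoE models, where traffic patterns regularly cross the middle of the network.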