DeepSeek-OCR
Last reviewed
May 10, 2026
Sources
19 citations
Review status
Source-backed
Revision
v3 · 2,915 words
DeepSeek-OCR is an open-source, end-to-end document OCR and layout understanding system released by DeepSeek in October 2025. It introduces a "contexts optical compression" paradigm: long textual content is represented as a compact set of vision tokens, which a lightweight decoder then turns back into text.[1] The system consists of a purpose-built vision encoder (DeepEncoder) and a 3-billion-parameter Mixture-of-Experts (MoE) decoder (DeepSeek-3B-MoE-A570M). In the accompanying preprint, the authors report ~97% OCR precision at about 10x vision-text compression and ~60% at about 20x on the Fox benchmark, and competitive end-to-end performance on OmniDocBench while using far fewer vision tokens than many baselines.[1]
The source code and BF16 weights are available on GitHub and Hugging Face under the MIT license.[2][3] On Hugging Face the model was downloaded more than 2.7 million times in its first month, which suggests heavy adoption among document-AI practitioners.[3]
| Property | Value |
|---|---|
| Developer | DeepSeek-AI |
| Lead author | Haoran Wei |
| Co-authors | Yaofeng Sun, Yukun Li |
| Initial release | 20 October 2025 (code/weights) |
| Paper release | 21 October 2025 (arXiv 2510.18234) |
| License | MIT |
| Total parameters | ~3 billion (decoder) + ~380M (encoder) |
| Activated parameters | ~570M per token (decoder) |
| Precision | BF16 (safetensors) |
| Tasks | OCR, layout parsing, chart/table extraction, geometry, chemistry, captioning, grounding |
| Languages | ~100 (PDF training set) |
| Backends | vLLM, Hugging Face Transformers, SGLang |
DeepSeek-OCR frames long-context handling as an image-to-text transduction problem: documents are rendered as images and passed through a vision encoder that emits a small set of vision tokens, which are then decoded into text (for example Markdown or HTML tables) by the MoE decoder. This approach reduces token costs that transformer LLMs incur for long sequences by leveraging vision as a high-density "optical compression" medium.[1]
Lead author Haoran Wei previously worked on GOT-OCR2.0, and DeepSeek-OCR builds on that lineage by pushing the encoder to emit far fewer tokens while still enabling near-lossless reconstruction.[1]
DeepSeek-OCR is a unified encoder-decoder VLM tailored for OCR-centric compression. The two halves are designed to keep activation memory bounded at high resolution while emitting a small, dense set of vision tokens that the decoder can read.
DeepEncoder (~380M parameters) chains three components in series: a SAM-based window-attention stage that processes the high-resolution patches, a 16x token compressor, and a CLIP-large global-attention stage that operates on the compressed tokens.
For a 1024 by 1024 input, 4,096 patch tokens are compressed to 256 before the CLIP stage, which keeps GPU memory usage manageable even at high resolution.[1] The split between a window-attention front end and a global-attention back end is the central architectural trick: the cost of window attention grows linearly with the number of patches, so the SAM stage handles high-resolution pixels cheaply, while global attention, whose cost grows quadratically, is reserved for the much smaller post-compression token grid.
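The token arithmetic above can be checked with a short sketch. The 16-pixel patch size is inferred from the figures in this paragraph (1024 / 16 = 64 patches per side, and 64 squared is 4,096), and the function name is illustrative rather than taken from the DeepSeek-OCR code base.

```python
def vision_tokens(side_px: int, patch_px: int = 16, compression: int = 16) -> tuple[int, int]:
    """Return (patch tokens before compression, vision tokens after it)."""
    if side_px % patch_px:
        raise ValueError("side_px must be a multiple of patch_px")
    patches = (side_px // patch_px) ** 2
    return patches, patches // compression


patches, compressed = vision_tokens(1024)
print(patches, compressed)  # 4096 256

# Global attention compares every token with every other token, so the number
# of pairwise interactions grows with the square of the sequence length.
pairs_before = patches ** 2
pairs_after = compressed ** 2
print(pairs_before // pairs_after)  # 256
```

Under these assumptions, running global attention after the 16x compressor involves 256 times fewer token pairs than running it on the raw patch grid.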
To support different compression ratios with a single model, DeepEncoder exposes native and dynamic resolution modes:
| Mode | Native resolution | Avg. vision tokens | Process | Notes |
|---|---|---|---|---|
| Tiny | 512 x 512 | 64 | resize | Compact pages, slides |
| Small | 640 x 640 | 100 | resize | Widely used in benchmarks |
| Base | 1024 x 1024 | 256 | padding | Preserves aspect ratio; valid tokens < actual tokens |
| Large | 1280 x 1280 | 400 | padding | Higher fidelity |
| Gundam (dynamic) | n x 640 x 640 (tiles) + 1024 x 1024 (global) | n x 100 + 256 (n in [2,9]) | resize + padding | For ultra-dense pages such as newspapers |
For padded modes, the count of valid tokens (tokens covering the original page rather than padding) is lower than the actual token count. The Gundam-M variant adds a 200 dpi global view for the densest documents, raising the budget but improving fidelity on small fonts.[1]
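The token budgets in the table can be sketched as follows. The Gundam formula comes straight from the table; the valid-token estimate is an assumption made here for illustration (valid share equals short side divided by long side of the page), and both function names are hypothetical.

```python
import math

TILE_TOKENS = 100    # one 640 x 640 tile
GLOBAL_TOKENS = 256  # one 1024 x 1024 global view


def gundam_tokens(n_tiles: int) -> int:
    """Vision tokens in Gundam mode: n tiles plus one global view."""
    if not 2 <= n_tiles <= 9:
        raise ValueError("n_tiles must be between 2 and 9")
    return n_tiles * TILE_TOKENS + GLOBAL_TOKENS


def estimated_valid_tokens(actual_tokens: int, width: int, height: int) -> int:
    """Assumed estimate: padding wastes tokens in proportion to the aspect ratio."""
    return math.ceil(actual_tokens * min(width, height) / max(width, height))


print(gundam_tokens(2), gundam_tokens(9))     # 456 1156
print(estimated_valid_tokens(256, 210, 297))  # 182 for an A4-shaped page in Base mode
```

Benchmark averages are taken over pages of varying shape, so reported valid-token counts need not match the estimate for any single aspect ratio.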
The decoder is DeepSeek-3B-MoE-A570M, a 3B-parameter Mixture-of-Experts language model that activates roughly 570M parameters per token. At inference it uses 6 of 64 routed experts plus 2 shared experts, the standard DeepSeekMoE configuration introduced in DeepSeek-V2 and refined in DeepSeek-V3.[1][4][5] It maps the compressed vision-token sequence back into text, optionally formatted as Markdown, HTML tables, SMILES strings or structured dictionaries depending on the prompt. The MoE design gives the decoder enough capacity for multilingual recognition and structured output while keeping per-token activation cost close to that of a 570M dense model.
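As a rough illustration of what "6 of 64 routed experts plus 2 shared experts" means per token, the sketch below implements generic top-k routing in plain Python. It is not DeepSeek's gating code: the random scores, the softmax over the selected experts, and all names are assumptions for illustration.

```python
import math
import random

N_ROUTED, TOP_K, N_SHARED = 64, 6, 2


def route(scores: list[float], k: int = TOP_K) -> dict[int, float]:
    """Pick the k highest-scoring routed experts and softmax their scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


rng = random.Random(0)
gates = route([rng.gauss(0.0, 1.0) for _ in range(N_ROUTED)])  # stand-in router scores
active_experts = len(gates) + N_SHARED  # shared experts always run
print(active_experts)        # 8
print(f"{570e6 / 3e9:.0%}")  # 19% of the decoder's parameters are active per token
```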
DeepSeek-OCR is trained on a mixture of document OCR ("OCR 1.0"), synthetic structure parsing ("OCR 2.0"), general vision data and text-only corpora. The published training-data shares are roughly 70% OCR (1.0 plus 2.0), 20% general vision, and 10% text-only.[1]
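One simple way to realize such a mixture is weighted sampling over data sources. The sketch below assumes per-sample sampling with the rounded shares quoted above; the source names and the function are hypothetical, and the actual data loader may balance the mixture differently.

```python
import random
from collections import Counter

# Rounded shares from the text; keys are illustrative labels.
MIXTURE = {"ocr": 0.70, "general_vision": 0.20, "text_only": 0.10}


def sample_sources(n: int, seed: int = 0) -> Counter:
    """Draw n training samples' sources according to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return Counter(rng.choices(names, weights=weights, k=n))


counts = sample_sources(100_000)
for name in MIXTURE:
    print(name, round(counts[name] / 100_000, 2))
```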
Training proceeds in two stages. First, DeepEncoder is pretrained with next-token prediction on text rendered into images. Second, the full encoder-decoder is trained jointly on the OCR/general/text-only mixture above. The model uses pipeline parallelism with four stages: SAM and the 16x compressor are treated as a frozen "vision tokenizer" (PP0), CLIP-large acts as an input embedding layer (PP1), and the 3B MoE decoder occupies PP2 and PP3.[1]
| Aspect | Value |
|---|---|
| Hardware | 20 nodes x 8 NVIDIA A100-40G GPUs (160 GPUs total) |
| Data parallelism | DP=40 |
| Optimizer | AdamW with step schedule (cosine annealing in published configs); learning rate around 5e-5 |
| Sequence length | 4,096 (multimodal); 8,192 (text-only) |
| Throughput | ~70B multimodal tokens/day or ~90B text-only tokens/day |
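The parallelism figures are mutually consistent, as a quick check shows: four pipeline stages times 40 data-parallel replicas uses all 160 GPUs. The dataclass below is a hypothetical way to record the layout, not a configuration format used by the project, and the stage descriptions paraphrase the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    nodes: int = 20
    gpus_per_node: int = 8
    pipeline_stages: int = 4  # PP0..PP3
    data_parallel: int = 40

    @property
    def total_gpus(self) -> int:
        return self.nodes * self.gpus_per_node

    def is_consistent(self) -> bool:
        """Every GPU belongs to exactly one pipeline stage of one replica."""
        return self.pipeline_stages * self.data_parallel == self.total_gpus


STAGES = {
    "PP0": "SAM + 16x compressor (frozen vision tokenizer)",
    "PP1": "CLIP-large (input embedding layer)",
    "PP2": "3B MoE decoder, part 1",
    "PP3": "3B MoE decoder, part 2",
}

layout = ParallelLayout()
print(layout.total_gpus, layout.is_consistent())  # 160 True
print(f"{70e9 / layout.total_gpus:,.0f}")         # 437,500,000 multimodal tokens per GPU per day
```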
On the deployment side, the reference implementation supports vLLM acceleration and Transformers inference. The tested environment is Python 3.12.9, CUDA 11.8, PyTorch 2.6.0 and FlashAttention 2.7.3. The README reports a vLLM PDF concurrency rate of about 2,500 tokens per second on a single A100-40G; the figure is configuration-dependent, but it gives a useful reference point for back-of-envelope planning.[2][3]
On the English subset of the Fox benchmark (documents with 600 to 1,300 text tokens), the preprint reports decoding precision as a function of compression. The numbers below give precision for the Tiny mode (64 tokens) and the Small mode (100 tokens):[1][12]
| Text tokens (range) | Precision (64) | Compression (64) | Precision (100) | Compression (100) | Pages |
|---|---|---|---|---|---|
| 600-700 | 96.5% | 10.5x | 98.5% | 6.7x | 7 |
| 700-800 | 93.8% | 11.8x | 97.3% | 7.5x | 28 |
| 800-900 | 83.8% | 13.2x | 96.8% | 8.5x | 28 |
| 900-1000 | 85.9% | 15.1x | 96.8% | 9.7x | 14 |
| 1000-1100 | 79.3% | 16.5x | 91.5% | 10.6x | 11 |
| 1100-1200 | 76.4% | 17.7x | 89.8% | 11.3x | 8 |
| 1200-1300 | 59.1% | 19.7x | 87.1% | 12.6x | 4 |
Precision generally falls as the compression ratio rises: below 10x the model is close to lossless (about 97 to 98.5%), between 10x and 13x it ranges from about 87% to 96%, and at 19.7x, the highest ratio in the table, it drops to 59.1%.
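The relationship between text length, token budget and reported precision can be expressed as a small lookup. The data below is transcribed from the Fox table; the helper names and the half-open treatment of bin edges are assumptions made for this sketch.

```python
# (low, high) text-token range -> {vision tokens: reported precision in %}
FOX_PRECISION = {
    (600, 700): {64: 96.5, 100: 98.5},
    (700, 800): {64: 93.8, 100: 97.3},
    (800, 900): {64: 83.8, 100: 96.8},
    (900, 1000): {64: 85.9, 100: 96.8},
    (1000, 1100): {64: 79.3, 100: 91.5},
    (1100, 1200): {64: 76.4, 100: 89.8},
    (1200, 1300): {64: 59.1, 100: 87.1},
}


def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens


def reported_precision(text_tokens: int, vision_tokens: int) -> float:
    """Look up the table row whose range contains text_tokens."""
    for (low, high), by_budget in FOX_PRECISION.items():
        if low <= text_tokens < high:
            return by_budget[vision_tokens]
    raise ValueError("text_tokens is outside the 600 to 1,300 range of the table")


print(f"{compression_ratio(650, 64):.1f}x", reported_precision(650, 64))    # 10.2x 96.5
print(f"{compression_ratio(1250, 64):.1f}x", reported_precision(1250, 64))  # 19.5x 59.1
```

The ratios computed here are for a single document length, so they differ slightly from the per-range averages listed in the table.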
On OmniDocBench (CVPR 2025), DeepSeek-OCR reports competitive overall edit distance while using far fewer vision tokens than most end-to-end baselines (lower is better; "Tokens" are average vision tokens per page). Selected rows are reproduced below from the paper:[1][13]
| Model | Tokens (avg.) | Overall (English) | Overall (Chinese) |
|---|---|---|---|
| Nougat | 2352 | 0.452 | 0.973 |
| SmolDocling | 392 | 0.493 | 0.816 |
| InternVL2-76B | 6790 | 0.440 | 0.443 |
| Qwen2.5-VL-7B | 3949 | 0.316 | 0.399 |
| OLMOCR | 3949 | 0.326 | 0.469 |
| GOT-OCR2.0 | 256 | 0.287 | 0.411 |
| dots.ocr | 3949 | 0.182 | 0.261 |
| MinerU-2.0 | 6790 | 0.133 | 0.115 |
| DeepSeek-OCR (Tiny) | 64 | 0.386 | 0.361 |
| DeepSeek-OCR (Small) | 100 | 0.221 | 0.284 |
| DeepSeek-OCR (Base) | 256 (182 valid) | 0.137 | 0.205 |
| DeepSeek-OCR (Large) | 400 (285 valid) | 0.138 | 0.143 |
| DeepSeek-OCR (Gundam) | 795 | 0.127 | 0.097 |
| DeepSeek-OCR (Gundam-M, 200 dpi) | 1853 | 0.123 | 0.087 |
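Two headline comparisons are worth pulling out. The Small mode (100 vision tokens) already beats GOT-OCR2.0 (256 tokens) in both languages (0.221 vs. 0.287 on English, 0.284 vs. 0.411 on Chinese). The Gundam mode (795 tokens) edges out MinerU-2.0 on English (0.127 vs. 0.133) with under an eighth of its 6,790 tokens, and Gundam-M does the same with less than a third.[1]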
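The trade-off between token count and accuracy can be made explicit by computing which rows are Pareto-optimal, meaning no other row is at least as good on both axes and strictly better on one. The sketch below uses the English column of the selected rows above; the shortened model labels and the function name are illustrative, and the result says nothing about systems not listed.

```python
# (model, average vision tokens, overall English edit distance)
ROWS = [
    ("Nougat", 2352, 0.452),
    ("SmolDocling", 392, 0.493),
    ("InternVL2-76B", 6790, 0.440),
    ("Qwen2.5-VL-7B", 3949, 0.316),
    ("OLMOCR", 3949, 0.326),
    ("GOT-OCR2.0", 256, 0.287),
    ("dots.ocr", 3949, 0.182),
    ("MinerU-2.0", 6790, 0.133),
    ("DeepSeek-OCR (Tiny)", 64, 0.386),
    ("DeepSeek-OCR (Small)", 100, 0.221),
    ("DeepSeek-OCR (Base)", 256, 0.137),
    ("DeepSeek-OCR (Large)", 400, 0.138),
    ("DeepSeek-OCR (Gundam)", 795, 0.127),
    ("DeepSeek-OCR (Gundam-M)", 1853, 0.123),
]


def pareto_front(rows):
    """Return models not dominated on (tokens, edit distance); lower is better for both."""
    front = []
    for name, tokens, dist in rows:
        dominated = any(
            t <= tokens and d <= dist and (t < tokens or d < dist)
            for _, t, d in rows
        )
        if not dominated:
            front.append(name)
    return front


for name in pareto_front(ROWS):
    print(name)
```

Among these rows, the front consists of the Tiny, Small, Base, Gundam and Gundam-M modes; Large is narrowly dominated by Base (0.138 vs. 0.137 with more tokens).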
The Hugging Face model card also reports olmOCR-bench results, where DeepSeek-OCR scores 75.7% overall, with 80.2% on table tests, 96.1% on header/footer extraction, 77.2% on arXiv math, 79.4% on long tiny text and 66.4% on multi-column layouts.[3]
Some document classes need very few tokens (slides, books), whereas dense layouts such as newspapers benefit from the dynamic Gundam modes. The table below reports edit distance by document type (lower is better):[1]
| Mode | Book | Slides | Financial report | Textbook | Exam paper | Magazine | Academic papers | Notes | Newspaper | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | 0.147 | 0.116 | 0.207 | 0.173 | 0.294 | 0.201 | 0.395 | 0.297 | 0.940 | 0.320 |
| Small | 0.085 | 0.111 | 0.079 | 0.147 | 0.171 | 0.107 | 0.131 | 0.187 | 0.744 | 0.205 |
| Base | 0.037 | 0.080 | 0.027 | 0.100 | 0.130 | 0.073 | 0.052 | 0.176 | 0.645 | 0.156 |
| Large | 0.038 | 0.108 | 0.022 | 0.084 | 0.109 | 0.060 | 0.053 | 0.155 | 0.353 | 0.117 |
| Gundam | 0.035 | 0.085 | 0.289 | 0.095 | 0.094 | 0.059 | 0.039 | 0.153 | 0.122 | 0.083 |
| Gundam-M | 0.052 | 0.090 | 0.034 | 0.091 | 0.079 | 0.079 | 0.048 | 0.100 | 0.099 | 0.077 |
The newspaper column shows the value of dynamic resolution most clearly: edit distance falls from 0.940 in Tiny mode to 0.099 in Gundam-M, roughly a 9.5x reduction on the hardest layout class.
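A practical reading of the table is to pick the cheapest mode that meets a target edit distance for a given document type. The sketch below transcribes three of the columns above; the function name and the example thresholds are hypothetical.

```python
MODES = ["Tiny", "Small", "Base", "Large", "Gundam", "Gundam-M"]  # cheapest first

# Edit distance per mode, in MODES order (subset of the table's columns).
EDIT_DISTANCE = {
    "book": [0.147, 0.085, 0.037, 0.038, 0.035, 0.052],
    "slides": [0.116, 0.111, 0.080, 0.108, 0.085, 0.090],
    "newspaper": [0.940, 0.744, 0.645, 0.353, 0.122, 0.099],
}


def cheapest_mode_for(doc_type: str, max_edit_distance: float):
    """Return the first (cheapest) mode at or below the target, or None."""
    for mode, dist in zip(MODES, EDIT_DISTANCE[doc_type]):
        if dist <= max_edit_distance:
            return mode
    return None


print(cheapest_mode_for("slides", 0.12))     # Tiny
print(cheapest_mode_for("book", 0.05))       # Base
print(cheapest_mode_for("newspaper", 0.15))  # Gundam
```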
The preprint highlights that a single A100-40G can generate 200,000+ pages per day, and a 20-node cluster (8 x A100-40G each) reaches roughly 33 million pages per day for large-scale LLM and VLM pretraining data production.[1] At those rates the system looks less like a consumer OCR tool and more like a data factory: feed in raw PDFs, get out clean training corpora.
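The cluster figure follows from the per-GPU figure under linear scaling, which is the assumption in this back-of-envelope sketch. The function name is illustrative; because the per-GPU number is quoted as a lower bound ("200,000+"), the product lands slightly below the reported 33 million.

```python
PAGES_PER_GPU_PER_DAY = 200_000  # quoted as a lower bound in the text
NODES, GPUS_PER_NODE = 20, 8
SECONDS_PER_DAY = 86_400


def cluster_pages_per_day(pages_per_gpu: int, nodes: int, gpus_per_node: int) -> int:
    """Assumes throughput scales linearly with GPU count."""
    return pages_per_gpu * nodes * gpus_per_node


total = cluster_pages_per_day(PAGES_PER_GPU_PER_DAY, NODES, GPUS_PER_NODE)
print(f"{total:,}")                                      # 32,000,000
print(f"{PAGES_PER_GPU_PER_DAY / SECONDS_PER_DAY:.2f}")  # 2.31 pages per second per GPU
```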
The official instructions provide both vLLM and Transformers inference paths, with SGLang added by the community soon after release. Tested environment notes include Python 3.12.9, CUDA 11.8, PyTorch 2.6.0 and FlashAttention 2.7.3; example prompts cover layout and non-layout OCR, table extraction and figure parsing.[2][3] The repository ships scripts that take an image or a PDF directory and emit Markdown, with optional bounding-box annotations for downstream verification.
DeepSeek-OCR empirically studies the mapping from N text tokens to the minimum number of vision tokens needed for decoding, supporting the view that near-lossless 10x "optical" context compression is feasible for many documents. The paper also outlines a memory-decay analogy: older conversational history could be rendered at progressively lower resolutions to simulate a forgetting curve, trading fidelity for token savings while keeping recent context sharp.[1] A future long-context model might keep recent turns in Large or Gundam mode, page back into Small as turns age, and finally let very old context fall to Tiny, mirroring how human memory blurs older details. The framing has drawn attention from researchers working on long-context transformer memory because it suggests an axis orthogonal to standard attention-window extension: instead of stretching the window, optical compression rewrites the unit of context itself.[1]
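The memory-decay idea can be illustrated with a toy budget calculation. Everything here is hypothetical: the age thresholds, the assumption that each turn is rendered as one page image, and the function names are chosen for illustration and do not come from the paper; only the per-mode token counts are taken from the resolution table.

```python
MODE_TOKENS = {"Large": 400, "Small": 100, "Tiny": 64}


def mode_for_age(age_in_turns: int) -> str:
    """Hypothetical schedule: older turns are rendered at lower resolution."""
    if age_in_turns < 4:
        return "Large"
    if age_in_turns < 16:
        return "Small"
    return "Tiny"


def history_budget(n_turns: int) -> int:
    """Total vision tokens for n_turns of history, one page image per turn."""
    return sum(MODE_TOKENS[mode_for_age(age)] for age in range(n_turns))


print(history_budget(20))         # 3056
print(20 * MODE_TOKENS["Large"])  # 8000 if every turn stayed in Large mode
```

With these made-up thresholds, a 20-turn history costs 3,056 vision tokens instead of 8,000, at the price of lower fidelity for older turns.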
DeepSeek-OCR builds on ideas from encoder-decoder OCR (for example GOT-OCR2.0) and high-resolution VLMs (for example Qwen-VL and its NaViT-style packing, plus InternVL tiling). Its decoder leverages the DeepSeekMoE designs introduced in DeepSeek-V2 and DeepSeek-V3.[14][15][16][4][5] The window-attention front end is borrowed conceptually from the Segment Anything Model (SAM), and the global-attention back end from CLIP.[17][18]
The closest predecessor is GOT-OCR2.0, which Haoran Wei co-authored: it already pushed end-to-end document OCR with a 256-token vision encoder, and DeepSeek-OCR extends that idea by training for variable token budgets and pairing the encoder with an MoE decoder rather than a dense one.[1][14]
In early 2026 the same authors released DeepSeek-OCR-2, subtitled "Visual Causal Flow", on arXiv. The follow-up reports an OmniDocBench v1.5 score of 91.09%, an improvement of 3.73 percentage points over the original system, with most of the gain on reading-order recognition.[19] The successor uses a similar DeepEncoder split but tightens the causal flow between vision and text tokens during decoding.