DeepSeek‑OCR is an open‑source end‑to‑end document OCR and layout understanding system released by DeepSeek in October 2025. It introduces a “contexts optical compression” paradigm that represents long textual content as compact vision tokens and then decodes them back into text using a lightweight decoder.[1] The system consists of a purpose‑built vision encoder (DeepEncoder) and a 3‑billion‑parameter Mixture‑of‑Experts (MoE) decoder (DeepSeek‑3B‑MoE‑A570M). In the accompanying preprint, the authors report ~97% OCR precision at ≈10× vision‑text compression and ~60% at ≈20× on the Fox benchmark, as well as competitive end‑to‑end performance on OmniDocBench while using far fewer vision tokens than many baselines.[1]
The source code and BF16 weights are available on GitHub and Hugging Face under the MIT license.[2][3]
DeepSeek‑OCR frames long‑context handling as an image–text transduction problem: documents are rendered as images and passed through a vision encoder that emits a small set of vision tokens, which are then decoded into text (for example Markdown or HTML tables) by the MoE decoder. This approach reduces token costs that transformer LLMs incur for long sequences by leveraging vision as a high‑density “optical compression” medium.[1]
DeepSeek‑OCR is a unified encoder–decoder vision‑language model (VLM) tailored for OCR‑centric compression.
DeepEncoder (≈380M parameters) is designed to keep activation memory low at high resolutions while emitting few vision tokens. It chains three components in series: (1) a window‑attention backbone based on SAM‑base (≈80M parameters), (2) a 2‑layer 16× convolutional token compressor (kernel=3, stride=2, padding=1; channels 256→1024), and (3) a dense global‑attention backbone based on CLIP‑large, whose first patch‑embedding layer is removed because its inputs are already tokens rather than raw pixels.
For a 1024×1024 input, 4096 patch tokens are compressed to 256 before global attention, keeping memory usage controlled.[1]
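The token arithmetic can be illustrated with a minimal sketch (not the released implementation): two stride‑2 convolutions halve each spatial dimension twice, so the 64×64 grid of patch tokens from a 1024×1024 input at patch size 16 becomes 16×16, i.e. 4096 → 256 tokens. The intermediate channel width below is an assumption.

```python
# Illustrative sketch of DeepEncoder's 16x convolutional token compressor.
# Grounded in the description above (2 layers, kernel=3, stride=2, padding=1,
# channels 256 -> 1024); the intermediate width (512) is an assumption.
import torch
import torch.nn as nn

compressor = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),   # 64x64 -> 32x32
    nn.Conv2d(512, 1024, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
)

# 1024x1024 input at patch size 16 -> 64x64 = 4096 patch tokens from the SAM stage
patch_tokens = torch.randn(1, 256, 64, 64)
compressed = compressor(patch_tokens)                  # (1, 1024, 16, 16)
vision_tokens = compressed.flatten(2).transpose(1, 2)  # (1, 256, 1024): 256 tokens
print(vision_tokens.shape)                             # torch.Size([1, 256, 1024])
```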
To support different compression ratios with a single model, DeepEncoder exposes native and dynamic resolution modes:
| Mode | Native resolution | Avg. vision tokens | Process | Notes |
|---|---|---|---|---|
| Tiny | 512 × 512 | 64 | resize | Compact pages |
| Small | 640 × 640 | 100 | resize | Widely used in tests |
| Base | 1024 × 1024 | 256 | padding | Preserves aspect ratio; valid tokens < actual tokens |
| Large | 1280 × 1280 | 400 | padding | Higher fidelity |
| Gundam (dynamic) | n tiles of 640 × 640 + one 1024 × 1024 global view | n×100 + 256 (n ∈ [2, 9]) | resize + padding | For ultra‑dense pages (for example newspapers) |
For padded modes, the valid‑token count is approximately N_valid ≈ ⌈N_actual × min(w, h) / max(w, h)⌉, where w × h is the raw image resolution and N_actual is the mode's nominal token count; tokens covering padding carry no image content (for example, 182 valid of 256 Base‑mode tokens in the OmniDocBench results below).
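A small helper, written under the assumptions above, reproduces the per‑mode token counts and the approximate valid‑token estimate; the exact rounding used by the released model may differ.

```python
# Hedged helper reproducing the mode token counts from the table above and the
# approximate valid-token estimate for padded modes; exact rounding may differ.
import math

MODE_TOKENS = {"Tiny": 64, "Small": 100, "Base": 256, "Large": 400}

def gundam_tokens(n_tiles: int) -> int:
    """Dynamic 'Gundam' mode: n local 640x640 tiles plus one 1024x1024 global view."""
    assert 2 <= n_tiles <= 9
    return n_tiles * 100 + 256

def valid_tokens(mode: str, width: int, height: int) -> int:
    """Approximate non-padding vision tokens for the padded Base/Large modes."""
    total = MODE_TOKENS[mode]
    return math.ceil(total * min(width, height) / max(width, height))

print(gundam_tokens(5))                  # 756 vision tokens
print(valid_tokens("Base", 1240, 1754))  # ~181 of 256 tokens cover image content
```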
The decoder is DeepSeek‑3B‑MoE‑A570M, a 3‑billion‑parameter MoE model that activates roughly 570M parameters per token (at inference, 6 of 64 routed experts plus 2 shared experts are active). It maps the compressed vision tokens back to text and leverages the DeepSeekMoE designs introduced in DeepSeek‑V2 and DeepSeek‑V3.[1][4][5]
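The active‑parameter behaviour can be sketched with a toy DeepSeekMoE‑style layer, assuming a standard top‑k softmax router; the hidden sizes and gating details below are illustrative, not the released model's configuration.

```python
# Toy sketch of a DeepSeekMoE-style layer matching the configuration described
# above (64 routed experts, top-6 routing, 2 always-active shared experts).
# Dimensions and the router are illustrative assumptions, not the released model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=1280, d_ff=896, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        expert = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                       nn.Linear(d_ff, d_model))
        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):                                    # x: (tokens, d_model)
        weights, idx = F.softmax(self.gate(x), -1).topk(self.top_k, dim=-1)
        out = sum(e(x) for e in self.shared)                 # shared experts: every token
        for t in range(x.size(0)):                           # naive per-token dispatch
            for w, i in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.routed[int(i)](x[t])
        return out

y = MoELayer()(torch.randn(4, 1280))  # each token touches only 6 of 64 routed experts
```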
DeepSeek‑OCR is trained on a mixture of document OCR data (“OCR 1.0”), synthetic structure‑parsing data (“OCR 2.0”), general vision data, and text‑only corpora.[1]
Training proceeds in two stages: (1) pretraining DeepEncoder with next‑token prediction, then (2) training the full encoder–decoder system. The model is trained with 4‑stage pipeline parallelism: SAM and the compressor are treated as a frozen “vision tokenizer” (PP0), CLIP‑large acts as an input embedding layer (PP1), and the 3B MoE decoder occupies PP2–PP3. The reported setup uses 20 nodes with 8 NVIDIA A100‑40G GPUs each (160 GPUs in total, DP=40), the AdamW optimizer with a step learning‑rate schedule, and a throughput of roughly 70 billion multimodal tokens per day.[1]
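Schematically, the reported layout can be written as a simple stage map; this is an illustration of the description above, not the project's actual training configuration, and the split of decoder layers across PP2 and PP3 is assumed.

```python
# Schematic stage map for the 4-stage pipeline-parallel layout described above;
# an illustration only, not the project's actual training configuration.
PIPELINE_STAGES = {
    "PP0": "SAM-base + 16x conv compressor (frozen 'vision tokenizer')",
    "PP1": "CLIP-large global-attention encoder (treated as input embedding layer)",
    "PP2": "DeepSeek-3B-MoE decoder, first half of layers (split assumed)",
    "PP3": "DeepSeek-3B-MoE decoder, second half of layers (split assumed)",
}

total_gpus = 20 * 8                                  # 20 nodes x 8 A100-40G = 160 GPUs
data_parallel = total_gpus // len(PIPELINE_STAGES)   # 40 pipeline replicas (DP=40)
print(total_gpus, data_parallel)
```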
On the deployment side, the reference implementation supports vLLM acceleration and Transformers inference (tested with CUDA 11.8, PyTorch 2.6 and FlashAttention 2.7.3). The README provides an example noting “PDF: concurrency ~2,500 tokens/s (A100‑40G)” for the vLLM path (configuration‑dependent).[2][3]
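For the Transformers path, a usage sketch along the following lines is typical; the model identifier, prompt string, and the repository‑defined infer() arguments shown below are assumptions and may differ between releases.

```python
# Hedged Transformers inference sketch; the model id, prompt format, and the
# repo-defined infer() arguments are assumptions and may differ between releases.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-OCR"          # Hugging Face model id (assumed)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,                    # loads the custom DeepSeek-OCR code
    torch_dtype=torch.bfloat16,                # BF16 weights as released
).eval().cuda()

prompt = "<image>\nConvert the document to markdown."   # example OCR prompt (assumed)
result = model.infer(                          # custom method supplied by the repo code
    tokenizer,
    prompt=prompt,
    image_file="page.png",
    base_size=1024, image_size=640, crop_mode=True,      # dynamic ("Gundam") tiling
)
print(result)
```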
On the English subset of the Fox benchmark (documents containing 600–1300 text tokens), the preprint reports decoding precision as a function of the compression ratio, defined as ground‑truth text tokens divided by vision tokens (Tiny = 64 vision tokens; Small = 100 vision tokens):[1][12]
| Text tokens (range) | Precision @ 64 tokens | Compression @ 64 | Precision @ 100 tokens | Compression @ 100 | Pages |
|---|---|---|---|---|---|
| 600–700 | 96.5% | 10.5× | 98.5% | 6.7× | 7 |
| 700–800 | 93.8% | 11.8× | 97.3% | 7.5× | 28 |
| 800–900 | 83.8% | 13.2× | 96.8% | 8.5× | 28 |
| 900–1000 | 85.9% | 15.1× | 96.8% | 9.7× | 14 |
| 1000–1100 | 79.3% | 16.5× | 91.5% | 10.6× | 11 |
| 1100–1200 | 76.4% | 17.7× | 89.8% | 11.3× | 8 |
| 1200–1300 | 59.1% | 19.7× | 87.1% | 12.6× | 4 |
On OmniDocBench (CVPR 2025), DeepSeek‑OCR reports competitive overall edit distance while using far fewer vision tokens than most end‑to‑end baselines (lower is better; “Tokens” denotes average vision tokens per page). Selected rows from the paper are reproduced below:[1][13]
| Model | Tokens (avg.) | Overall (English) | Overall (Chinese) |
|---|---|---|---|
| Nougat | 2352 | 0.452 | 0.973 |
| SmolDocling | 392 | 0.493 | 0.816 |
| InternVL2‑76B | 6790 | 0.440 | 0.443 |
| Qwen2.5‑VL‑7B | 3949 | 0.316 | 0.399 |
| OLMOCR | 3949 | 0.326 | 0.469 |
| GOT‑OCR2.0 | 256 | 0.287 | 0.411 |
| dots.ocr | 3949 | 0.182 | 0.261 |
| MinerU‑2.0 | 6790 | 0.133 | 0.115 |
| DeepSeek‑OCR (Tiny) | 64 | 0.386 | 0.361 |
| DeepSeek‑OCR (Small) | 100 | 0.221 | 0.284 |
| DeepSeek‑OCR (Base) | 256 (182 valid) | 0.137 | 0.205 |
| DeepSeek‑OCR (Large) | 400 (285 valid) | 0.138 | 0.143 |
| DeepSeek‑OCR (Gundam) | 795 | 0.127 | 0.097 |
| DeepSeek‑OCR (Gundam‑M, 200 dpi) | 1853 | 0.123 | 0.087 |
Some document classes require very few tokens (for example slides), whereas dense layouts (for example newspapers) benefit from dynamic “Gundam” modes:[1]
| Mode | Book | Slides | Financial report | Textbook | Exam paper | Magazine | Academic papers | Notes | Newspaper | Overall |
|---|---|---|---|---|---|---|---|---|---|---|
| Tiny | 0.147 | 0.116 | 0.207 | 0.173 | 0.294 | 0.201 | 0.395 | 0.297 | 0.940 | 0.320 |
| Small | 0.085 | 0.111 | 0.079 | 0.147 | 0.171 | 0.107 | 0.131 | 0.187 | 0.744 | 0.205 |
| Base | 0.037 | 0.080 | 0.027 | 0.100 | 0.130 | 0.073 | 0.052 | 0.176 | 0.645 | 0.156 |
| Large | 0.038 | 0.108 | 0.022 | 0.084 | 0.109 | 0.060 | 0.053 | 0.155 | 0.353 | 0.117 |
| Gundam | 0.035 | 0.085 | 0.289 | 0.095 | 0.094 | 0.059 | 0.039 | 0.153 | 0.122 | 0.083 |
| Gundam‑M | 0.052 | 0.090 | 0.034 | 0.091 | 0.079 | 0.079 | 0.048 | 0.100 | 0.099 | 0.077 |
The preprint highlights that a single A100‑40G GPU can process more than 200,000 pages per day, and that a 20‑node cluster (8×A100‑40G per node) reaches roughly 33 million pages per day for large‑scale production of LLM/VLM pretraining data.[1]
The official instructions provide both vLLM and Transformers inference paths. Tested environment notes include Python 3.12.9, CUDA 11.8, PyTorch 2.6.0 and FlashAttention 2.7.3; example prompts cover layout/non‑layout OCR and figure parsing.[2][3]
DeepSeek‑OCR empirically studies the mapping from N text tokens to the minimum number of vision tokens needed for decoding, supporting the view that near‑lossless ~10× “optical” context compression is feasible for many documents. The paper also outlines a memory‑decay analogy: older conversational history could be rendered at progressively lower resolutions to simulate a “forgetting curve,” trading fidelity for token savings while keeping recent context sharp.[1]
DeepSeek‑OCR builds on ideas from encoder–decoder OCR (for example GOT‑OCR2.0) and high‑resolution VLMs (for example Qwen‑VL’s NaViT‑style packing and InternVL tiling). Its decoder leverages the DeepSeekMoE designs introduced in DeepSeek‑V2 and DeepSeek‑V3.[14][15][16][4][5]