# DeepSeek-OCR

> Source: https://aiwiki.ai/wiki/deepseek-ocr
> Updated: 2026-06-23
> Categories: Chinese AI, Computer Vision, Multimodal AI, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**DeepSeek-OCR** is an open-source [optical character recognition (OCR)](/wiki/optical_character_recognition) and document-understanding system released by [DeepSeek](/wiki/deepseek) on 20 October 2025 that pioneers a **contexts optical compression** paradigm: it encodes long passages of text as a small set of compact **vision tokens** and decodes them back into text, achieving about 97% OCR precision at roughly 10x text-to-token compression and about 60% accuracy even at 20x compression.[1] The system pairs a purpose-built vision encoder (*DeepEncoder*) with a 3-billion-parameter [Mixture-of-Experts (MoE)](/wiki/mixture_of_experts) [vision-language model (VLM)](/wiki/vision_language_model) decoder (*DeepSeek-3B-MoE-A570M*, ~570M activated parameters). Its central claim, that one image token can stand in for roughly ten text tokens at near-lossless fidelity, has made it an influential reference point for long-context LLM research.[1]

The authors frame the work plainly: "We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping."[1] In the accompanying preprint they report competitive end-to-end document parsing on OmniDocBench while using far fewer vision tokens than most baselines, and note that DeepSeek-OCR "surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens."[1] The source code and BF16 weights are available on GitHub and [Hugging Face](/wiki/hugging_face) under the MIT license,[2][3] where the model has logged more than 2.2 million downloads in a recent 30-day window, reflecting heavy adoption among document-AI practitioners.[3]

## Overview

| Property | Value |
| --- | --- |
| Developer | [DeepSeek-AI](/wiki/deepseek) |
| Lead author | Haoran Wei |
| Co-authors | Yaofeng Sun, Yukun Li |
| Initial release | 20 October 2025 (code/weights) |
| Paper release | 21 October 2025 (arXiv 2510.18234) |
| License | MIT |
| Total parameters | ~3 billion (decoder) + ~380M (encoder) |
| Activated parameters | ~570M per token (decoder) |
| Precision | BF16 (safetensors) |
| Tasks | OCR, layout parsing, chart/table extraction, geometry, chemistry, captioning, grounding |
| Languages | ~100 (PDF training set) |
| Backends | [vLLM](/wiki/vllm), Hugging Face Transformers, SGLang |

DeepSeek-OCR frames long-context handling as an image-to-text transduction problem: documents are rendered as images and passed through a vision encoder that emits a small set of *vision tokens*, which are then decoded into text (for example Markdown or HTML tables) by the MoE decoder. The aim is to cut the token cost that transformer [LLMs](/wiki/large_language_model) incur on long sequences by using vision as a high-density "optical compression" medium.[1]

Lead author Haoran Wei previously worked on [GOT-OCR2.0](/wiki/got_ocr), and DeepSeek-OCR builds on that lineage by pushing the encoder to emit far fewer tokens while still enabling near-lossless reconstruction.[1]

## What is contexts optical compression?

Contexts optical compression is the core idea DeepSeek-OCR investigates: rather than feeding a long document to a language model as thousands of discrete text tokens, the document is rendered as a 2D image and compressed by a vision encoder into a far smaller number of vision tokens, which the decoder then expands back into text. The paper poses this as an empirical question about the mapping from N text tokens to the minimum number of vision tokens needed for faithful decoding. The headline result is a compression-versus-fidelity curve: "when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%."[1] In other words, a single vision token can reliably carry the information of roughly ten text tokens, which the authors argue "shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs."[1]

## History and release

- **20 October 2025, initial open-source release.** The repository, usage examples and the five inference modes were published on GitHub under the MIT license.[2]
- **21 October 2025, preprint posted.** The arXiv paper "DeepSeek-OCR: Contexts Optical Compression" by Haoran Wei, Yaofeng Sun and Yukun Li describes the method, data and experiments.[1]
- **Weights and model card.** A 3B-parameter model (BF16 safetensors) was hosted on Hugging Face the same week with example prompts and environment requirements.[3]
- **Community adoption.** Within weeks of release, third-party deployments appeared on DeepInfra, [Replicate](/wiki/replicate), Google Colab, Kaggle and Docker Model Runner, and [SGLang](/wiki/sglang) added a server backend alongside the official vLLM and Transformers paths.[2][3]

## How is DeepSeek-OCR built?

DeepSeek-OCR is a unified encoder-decoder [VLM](/wiki/vlm) tailored for OCR-centric compression. The two halves are designed to keep activation memory bounded at high resolution while emitting a small, dense set of vision tokens that the decoder can read.

### DeepEncoder (vision side)

DeepEncoder (~380M parameters) is engineered to keep activation memory low at high resolutions while outputting few vision tokens. It chains three components in series:

1. A window-attention backbone based on [SAM-base](/wiki/segment_anything) (~80M params) that handles initial visual perception at high resolution without quadratic-attention blowup.
2. A 2-layer 16x convolutional token compressor (kernel=3, stride=2, padding=1; channels 256 to 1024) that downsamples the patch grid before any global attention is applied.
3. A dense global-attention backbone based on CLIP-large (~300M params, with the first patch embedding removed because the inputs are tokens, not pixels).

For a 1024 by 1024 input, 4,096 patch tokens are compressed to 256 before the CLIP stage, which keeps GPU memory usage manageable even at high resolution.[1] The split between a window-attention front end and a global-attention back end is the central architectural trick: the cost of window attention grows linearly with the number of patches, so the SAM stage can process high-resolution input cheaply, while quadratic-cost global attention is reserved for the much smaller post-compression token grid.
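The 4,096-to-256 arithmetic can be checked with the standard convolution output-size formula. The sketch below is only an illustration: the 16-pixel patch size is an assumption inferred from the 1024 by 1024 to 4,096-token figure, and the helper names `conv_out` and `compressed_grid` are made up for this example rather than taken from the DeepSeek-OCR code.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output side length of a square 2D convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def compressed_grid(resolution: int, patch: int = 16, layers: int = 2) -> tuple[int, int]:
    """Return (patch tokens, tokens after the compressor) for a square input.

    Assumes a 16-pixel patch size, which is implied but not stated above.
    """
    side = resolution // patch          # 1024 // 16 = 64 patches per side
    patch_tokens = side * side
    for _ in range(layers):             # each stride-2 layer halves the side
        side = conv_out(side)
    return patch_tokens, side * side


before, after = compressed_grid(1024)
print(before, after, before // after)   # 4096 256 16
```

Because global attention compares every token with every other token, shrinking the grid from 4,096 to 256 tokens cuts the number of pairwise interactions from 4,096² to 256², a 256-fold reduction.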

#### Multi-resolution modes

To support different compression ratios with a single model, DeepEncoder exposes native and dynamic resolution modes:

| Mode | Native resolution | Avg. vision tokens | Process | Notes |
| --- | --- | --- | --- | --- |
| Tiny | 512 x 512 | 64 | resize | Compact pages, slides |
| Small | 640 x 640 | 100 | resize | Widely used in benchmarks |
| Base | 1024 x 1024 | 256 | padding | Preserves aspect ratio; valid tokens < actual tokens |
| Large | 1280 x 1280 | 400 | padding | Higher fidelity |
| Gundam (dynamic) | n x 640 x 640 (tiles) + 1024 x 1024 (global) | n x 100 + 256 (n in [2,9]) | resize + padding | For ultra-dense pages such as [newspapers](/wiki/newspaper) |

For padded modes, the count of *valid* tokens (tokens covering the original page rather than padding) is lower than the *actual* token count. The Gundam-M variant adds a 200 dpi global view for the densest documents, raising the budget but improving fidelity on small fonts.[1]
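The token budgets in the table, and the gap between valid and actual tokens, can be approximated with a few lines of arithmetic. This is a rough sketch: the valid-token estimate assumes the page is scaled to fit a square canvas so that it covers a min/max fraction of the area, and the function names are illustrative only.

```python
# Average vision tokens per native mode, copied from the table above.
NATIVE_TOKENS = {"Tiny": 64, "Small": 100, "Base": 256, "Large": 400}


def gundam_tokens(n_tiles: int) -> int:
    """n local 640x640 tiles (100 tokens each) plus one 1024x1024 global view."""
    if not 2 <= n_tiles <= 9:
        raise ValueError("the table gives n in [2, 9]")
    return n_tiles * NATIVE_TOKENS["Small"] + NATIVE_TOKENS["Base"]


def valid_tokens(actual: int, width: int, height: int) -> int:
    """Estimate the tokens covering the page once it is padded to a square.

    Assumption: the page fills a min/max fraction of the padded square.
    """
    short_side, long_side = min(width, height), max(width, height)
    return -(-actual * short_side // long_side)   # integer ceiling division


print(gundam_tokens(2), gundam_tokens(9))   # 456 1156
print(valid_tokens(256, 210, 297))          # A4 page in Base mode: 182
```

Under this estimate an A4 page in Base mode yields 182 valid tokens, which agrees with the "182 valid" figure in the OmniDocBench table further down; the same estimate gives 283 for Large mode against the 285 reported there, so it should be read as an approximation.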

### Decoder (language side)

The decoder is *DeepSeek-3B-MoE-A570M*, a 3B-parameter [Mixture-of-Experts](/wiki/mixture_of_experts) language model that activates roughly 570M parameters per token. At inference it uses 6 of 64 routed experts plus 2 shared experts, the standard *DeepSeekMoE* configuration introduced in [DeepSeek-V2](/wiki/deepseek_v2) and refined in [DeepSeek-V3](/wiki/deepseek_v3).[1][4][5] It maps the compressed vision-token sequence back into text, optionally formatted as Markdown, HTML tables, [SMILES](/wiki/smiles) strings or structured dictionaries depending on the prompt. The MoE design gives the decoder enough capacity for multilingual recognition and structured output while keeping per-token activation cost close to that of a 570M dense model.
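To make the "6 of 64 routed experts plus 2 shared experts" idea concrete, here is a toy top-k router in plain Python. It is a sketch of generic top-k gating, not DeepSeek's implementation: the random logits, the softmax over only the selected experts, and the `route` helper are all assumptions made for illustration.

```python
import math
import random


def route(scores: list[float], top_k: int = 6) -> dict[int, float]:
    """Pick the top_k routed experts and return normalized gate weights."""
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    exps = {i: math.exp(scores[i]) for i in chosen}
    total = sum(exps.values())
    return {i: w / total for i, w in exps.items()}


random.seed(0)
N_ROUTED, N_SHARED = 64, 2
scores = [random.gauss(0.0, 1.0) for _ in range(N_ROUTED)]   # stand-in router logits
gates = route(scores)

# Shared experts run for every token; routed experts only when selected.
active = N_SHARED + len(gates)
print(sorted(gates), active)
```

Only 8 of the 66 expert networks run for any given token in this toy setup, which is the mechanism that lets a 3B-parameter model keep its per-token compute close to that of a much smaller dense model.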

## What data is DeepSeek-OCR trained on?

DeepSeek-OCR is trained on a mixture of document OCR ("OCR 1.0"), synthetic structure parsing ("OCR 2.0"), general vision data and text-only corpora. The published training-data shares are roughly 70% OCR (1.0 plus 2.0), 20% general vision, and 10% text-only.[1]

- **OCR 1.0 (documents and scenes).** 30M PDF pages spanning roughly 100 languages, with about 25M Chinese and English pages and 5M others. Coarse labels are extracted directly from PDFs; fine labels (about 2M pages each for Chinese and English) are built with layout and OCR models such as PP-DocLayout and PaddleOCR/MinerU/GOT-OCR2.0. Natural-scene OCR uses LAION and Wukong images with PaddleOCR labels (about 10M Chinese plus 10M English).[6][7][8][9][1]
- **OCR 2.0 (structure parsing).** About 10M charts rendered via pyecharts/matplotlib and labeled as HTML tables (cf. [OneChart](/wiki/onechart)); about 5M chemical diagrams generated from PubChem [SMILES](/wiki/smiles) strings rendered with [RDKit](/wiki/rdkit); about 1M plane geometry images generated following *Slow Perception*.[10][11][1]
- **General vision.** Captioning, detection and grounding data to preserve a general VLM interface (about 20% of total).[1]
- **Text-only.** About 10% in-house text-only pretraining at 8,192 sequence length, used to keep the decoder fluent as a language model.[1]
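The mixture shares above can be pictured as sampling weights. The following sketch draws sample sources in the stated 70/20/10 proportions; it is an assumption about how such a mixture might be applied, since the paper reports the shares rather than a sampling procedure, and the source names are placeholders.

```python
import random
from collections import Counter

# Approximate shares reported in the text above.
MIXTURE = {"ocr": 0.70, "general_vision": 0.20, "text_only": 0.10}


def sample_sources(n: int, seed: int = 0) -> Counter:
    """Draw n training-sample sources in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))


counts = sample_sources(10_000)
for name in MIXTURE:
    print(name, counts[name] / 10_000)
```

With enough draws the empirical fractions settle close to the configured weights, which is all the sketch is meant to show.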

## Training and inference

Training proceeds in two stages. First, DeepEncoder is pretrained with next-token prediction on text rendered into images. Second, the full encoder-decoder is trained jointly on the OCR/general/text-only mixture above. The model uses pipeline parallelism with four stages: SAM and the 16x compressor are treated as a frozen "vision tokenizer" (PP0), CLIP-large acts as an input embedding layer (PP1), and the 3B MoE decoder occupies PP2 and PP3.[1]

| Aspect | Value |
| --- | --- |
| Hardware | 20 nodes x 8 [NVIDIA](/wiki/nvidia) A100-40G GPUs (160 GPUs total) |
| Data parallelism | DP=40 |
| Optimizer | AdamW with step schedule (cosine annealing in published configs); learning rate around 5e-5 |
| Sequence length | 4,096 (multimodal); 8,192 (text-only) |
| Throughput | ~70B multimodal tokens/day or ~90B text-only tokens/day |
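The hardware and parallelism figures are mutually consistent if each data-parallel replica spans the four pipeline stages with one GPU per stage. That layout is an assumption inferred from the numbers, not something stated explicitly above; the short check below spells out the arithmetic.

```python
NODES, GPUS_PER_NODE = 20, 8
DP, PP = 40, 4                      # data-parallel replicas, pipeline stages

total_gpus = NODES * GPUS_PER_NODE
assert total_gpus == DP * PP        # 160 GPUs either way

# Pipeline layout described in the text (PP0..PP3).
STAGES = {
    0: "SAM + 16x compressor (frozen vision tokenizer)",
    1: "CLIP-large (input embedding layer)",
    2: "MoE decoder, part 1",
    3: "MoE decoder, part 2",
}

# Average share of the reported multimodal training throughput per GPU.
multimodal_per_gpu = 70e9 / total_gpus
print(total_gpus, f"{multimodal_per_gpu:.3g} multimodal tokens per GPU per day")
```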

On the deployment side, the reference implementation supports vLLM acceleration and Transformers inference. The tested environment is [Python](/wiki/python) 3.12.9, [CUDA](/wiki/cuda) 11.8, [PyTorch](/wiki/pytorch) 2.6.0 and FlashAttention 2.7.3.[3] The README reports a PDF-processing throughput of about 2,500 tokens per second with vLLM on a single A100-40G; the figure is configuration-dependent and is best treated as a rough reference point for capacity planning rather than a guarantee.[2][3]

## How well does DeepSeek-OCR perform?

### Compression study (Fox benchmark)

On the English subset of the [Fox](/wiki/fox_benchmark) benchmark (documents with 600 to 1,300 text tokens), the preprint reports decoding precision as a function of compression. The numbers below give precision for the Tiny mode (64 tokens) and the Small mode (100 tokens):[1][12]

| Text tokens (range) | Precision (64) | Compression (64) | Precision (100) | Compression (100) | Pages |
| --- | --- | --- | --- | --- | --- |
| 600-700 | 96.5% | 10.5x | 98.5% | 6.7x | 7 |
| 700-800 | 93.8% | 11.8x | 97.3% | 7.5x | 28 |
| 800-900 | 83.8% | 13.2x | 96.8% | 8.5x | 28 |
| 900-1000 | 85.9% | 15.1x | 96.8% | 9.7x | 14 |
| 1000-1100 | 79.3% | 16.5x | 91.5% | 10.6x | 11 |
| 1100-1200 | 76.4% | 17.7x | 89.8% | 11.3x | 8 |
| 1200-1300 | 59.1% | 19.7x | 87.1% | 12.6x | 4 |

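Precision degrades gradually as the compression ratio rises: below 10x decoding is close to lossless (roughly 96 to 98%), between 10x and 12x it stays at around 90% or better, and by about 20x it has fallen to roughly 60%. The highest-compression buckets contain few pages (4 to 11 each), so that end of the curve is the least certain.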
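The table can also be queried programmatically, for example to find the highest compression ratio that still meets a precision floor. The sketch below copies the rows from the table; the 95% threshold and the helper names are arbitrary choices for illustration.

```python
# (text-token range, precision %, compression ratio) rows copied from the table above.
TINY = [("600-700", 96.5, 10.5), ("700-800", 93.8, 11.8), ("800-900", 83.8, 13.2),
        ("900-1000", 85.9, 15.1), ("1000-1100", 79.3, 16.5),
        ("1100-1200", 76.4, 17.7), ("1200-1300", 59.1, 19.7)]
SMALL = [("600-700", 98.5, 6.7), ("700-800", 97.3, 7.5), ("800-900", 96.8, 8.5),
         ("900-1000", 96.8, 9.7), ("1000-1100", 91.5, 10.6),
         ("1100-1200", 89.8, 11.3), ("1200-1300", 87.1, 12.6)]


def max_compression(rows, min_precision: float) -> float:
    """Highest compression ratio whose bucket still meets the precision floor."""
    ok = [ratio for _, precision, ratio in rows if precision >= min_precision]
    return max(ok, default=0.0)


def implied_text_tokens(ratio: float, vision_tokens: int) -> float:
    """Compression ratio = text tokens / vision tokens, so invert it."""
    return ratio * vision_tokens


print(max_compression(SMALL, 95.0))   # 9.7
print(max_compression(TINY, 95.0))    # 10.5
print(implied_text_tokens(10.5, 64))  # 672.0
```

The last line shows that the reported 10.5x ratio for the 600-700 bucket corresponds to an average of about 672 text tokens per page, so the ratios reflect each bucket's actual mean length rather than its midpoint.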

### OmniDocBench (end-to-end document parsing)

On OmniDocBench (CVPR 2025), DeepSeek-OCR reports competitive *overall* edit distance while using far fewer vision tokens than most end-to-end baselines (lower is better; "Tokens" are average vision tokens per page). Selected rows are reproduced below from the paper:[1][13]

| Model | Tokens (avg.) | Overall (English) | Overall (Chinese) |
| --- | --- | --- | --- |
| Nougat | 2352 | 0.452 | 0.973 |
| SmolDocling | 392 | 0.493 | 0.816 |
| InternVL2-76B | 6790 | 0.440 | 0.443 |
| Qwen2.5-VL-7B | 3949 | 0.316 | 0.399 |
| OLMOCR | 3949 | 0.326 | 0.469 |
| GOT-OCR2.0 | 256 | 0.287 | 0.411 |
| dots.ocr | 3949 | 0.182 | 0.261 |
| MinerU-2.0 | 6790 | 0.133 | 0.115 |
| **DeepSeek-OCR (Tiny)** | 64 | 0.386 | 0.361 |
| **DeepSeek-OCR (Small)** | 100 | 0.221 | 0.284 |
| **DeepSeek-OCR (Base)** | 256 (182 valid) | 0.137 | 0.205 |
| **DeepSeek-OCR (Large)** | 400 (285 valid) | 0.138 | 0.143 |
| **DeepSeek-OCR (Gundam)** | 795 | 0.127 | 0.097 |
| **DeepSeek-OCR (Gundam-M, 200 dpi)** | 1853 | 0.123 | 0.087 |

Two headline comparisons are worth pulling out. The Small mode (100 vision tokens) already beats GOT-OCR2.0 (256 tokens), and the Gundam mode (795 tokens) edges past MinerU-2.0 (6,790 tokens) on both overall columns while using less than one-eighth of the tokens; Gundam-M improves on it further at 1,853 tokens.[1]
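One way to read the table is as a trade-off between token count and edit distance. The sketch below computes the Pareto front over the English column only, that is, the rows that no other row beats on both axes at once. The data are copied from the table; the `dominates` and `pareto_front` helpers are illustrative and not part of any benchmark tooling.

```python
# (model, avg. vision tokens, overall English edit distance) from the table above.
ROWS = [
    ("Nougat", 2352, 0.452), ("SmolDocling", 392, 0.493),
    ("InternVL2-76B", 6790, 0.440), ("Qwen2.5-VL-7B", 3949, 0.316),
    ("OLMOCR", 3949, 0.326), ("GOT-OCR2.0", 256, 0.287),
    ("dots.ocr", 3949, 0.182), ("MinerU-2.0", 6790, 0.133),
    ("DeepSeek-OCR (Tiny)", 64, 0.386), ("DeepSeek-OCR (Small)", 100, 0.221),
    ("DeepSeek-OCR (Base)", 256, 0.137), ("DeepSeek-OCR (Large)", 400, 0.138),
    ("DeepSeek-OCR (Gundam)", 795, 0.127), ("DeepSeek-OCR (Gundam-M)", 1853, 0.123),
]


def dominates(a, b) -> bool:
    """a is at least as good as b on both axes and strictly better on one."""
    return a[1] <= b[1] and a[2] <= b[2] and (a[1] < b[1] or a[2] < b[2])


def pareto_front(rows):
    """Rows that no other row beats on both token count and edit distance."""
    front = [r for r in rows if not any(dominates(o, r) for o in rows)]
    return sorted(front, key=lambda r: r[1])


for name, tokens, score in pareto_front(ROWS):
    print(f"{name:28s} {tokens:5d} {score:.3f}")
```

On these figures the front consists of the Tiny, Small, Base, Gundam and Gundam-M modes; Large is narrowly dominated by Base on the English column, and every listed baseline is dominated by at least one DeepSeek-OCR mode.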

### olmOCR-bench

The Hugging Face model card also reports olmOCR-bench results, where DeepSeek-OCR scores 75.7% overall, with 77.2% on arXiv math and 73.6% on old-scans math among the published categories.[3]

### Category-specific results (OmniDocBench)

Some document classes need very few tokens (slides, books), whereas dense layouts such as newspapers benefit from the dynamic Gundam modes:[1]

| Mode | Book | Slides | Financial report | Textbook | Exam paper | Magazine | Academic papers | Notes | Newspaper | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tiny | 0.147 | 0.116 | 0.207 | 0.173 | 0.294 | 0.201 | 0.395 | 0.297 | 0.940 | 0.320 |
| Small | 0.085 | 0.111 | 0.079 | 0.147 | 0.171 | 0.107 | 0.131 | 0.187 | 0.744 | 0.205 |
| Base | 0.037 | 0.080 | 0.027 | 0.100 | 0.130 | 0.073 | 0.052 | 0.176 | 0.645 | 0.156 |
| Large | 0.038 | 0.108 | 0.022 | 0.084 | 0.109 | 0.060 | 0.053 | 0.155 | 0.353 | 0.117 |
| Gundam | 0.035 | 0.085 | 0.289 | 0.095 | 0.094 | 0.059 | 0.039 | 0.153 | 0.122 | 0.083 |
| Gundam-M | 0.052 | 0.090 | 0.034 | 0.091 | 0.079 | 0.079 | 0.048 | 0.100 | 0.099 | 0.077 |

The newspaper column shows the value of dynamic resolution most clearly: edit distance falls from 0.940 in Tiny mode to 0.099 in Gundam-M, a roughly 10x improvement on the hardest layout class.
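The per-category table suggests a simple selection rule: use the smallest mode that meets a target edit distance for the document class at hand. The sketch below applies that rule to the published numbers; the thresholds in the example calls are arbitrary, and `cheapest_mode` is a hypothetical helper rather than part of the released code.

```python
CATEGORIES = ["Book", "Slides", "Financial report", "Textbook", "Exam paper",
              "Magazine", "Academic papers", "Notes", "Newspaper"]
# Edit distances per mode (lower is better), in order of increasing token budget.
TABLE = {
    "Tiny":     [0.147, 0.116, 0.207, 0.173, 0.294, 0.201, 0.395, 0.297, 0.940],
    "Small":    [0.085, 0.111, 0.079, 0.147, 0.171, 0.107, 0.131, 0.187, 0.744],
    "Base":     [0.037, 0.080, 0.027, 0.100, 0.130, 0.073, 0.052, 0.176, 0.645],
    "Large":    [0.038, 0.108, 0.022, 0.084, 0.109, 0.060, 0.053, 0.155, 0.353],
    "Gundam":   [0.035, 0.085, 0.289, 0.095, 0.094, 0.059, 0.039, 0.153, 0.122],
    "Gundam-M": [0.052, 0.090, 0.034, 0.091, 0.079, 0.079, 0.048, 0.100, 0.099],
}


def cheapest_mode(category: str, max_edit_distance: float):
    """First mode (smallest budget first) at or under the target edit distance."""
    col = CATEGORIES.index(category)
    for mode, scores in TABLE.items():      # dicts keep insertion order
        if scores[col] <= max_edit_distance:
            return mode
    return None


print(cheapest_mode("Book", 0.10))          # Small
print(cheapest_mode("Newspaper", 0.15))     # Gundam
print(cheapest_mode("Slides", 0.05))        # None
```

Books reach the example target at 100 tokens, newspapers need the tiled Gundam mode, and no mode reaches 0.05 on slides, where the best reported score is 0.080.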

### Throughput and data generation

The preprint highlights that a single A100-40G can generate **200,000+ pages per day**, and a 20-node cluster (8 x A100-40G each) reaches roughly **33 million** pages per day for large-scale LLM and VLM pretraining data production.[1] Those rates position the system less as a consumer OCR tool and more as a data factory: feed in raw PDFs, get out clean training corpora.
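A quick back-of-envelope check relates the single-GPU and cluster figures. It assumes throughput scales linearly across GPUs, which the text implies but does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60
pages_per_gpu_day = 200_000          # single A100-40G, lower bound from the text
gpus = 20 * 8                        # 20 nodes x 8 GPUs

pages_per_gpu_second = pages_per_gpu_day / SECONDS_PER_DAY
cluster_pages_per_day = pages_per_gpu_day * gpus

print(f"{pages_per_gpu_second:.2f} pages/s per GPU")      # 2.31
print(f"{cluster_pages_per_day / 1e6:.0f}M pages/day")    # 32M
```

The 200,000-page floor gives 32 million pages per day across 160 GPUs; the reported 33 million corresponds to a little over 206,000 pages per GPU per day, which is consistent with the "200,000+" wording.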

## What is DeepSeek-OCR used for?

- **Layout-aware OCR** with optional detection and structure output (Markdown, HTML, JSON) controlled via prompts.[1]
- **Deep parsing** of charts, simple geometric figures and chemical diagrams embedded in documents (chart to HTML table; geometry to structured dict; chemistry to [SMILES](/wiki/smiles)).[1]
- **Multilingual recognition** covering roughly 100 languages in the PDF training set.[1]
- **General visual understanding** since the model retains image captioning, detection and grounding interfaces alongside the OCR head, which keeps it useful for broader VLM research.[1]
- **Configurable token budget** at inference time: callers pick a mode (Tiny through Gundam-M) to trade speed for fidelity on each request, without retraining.[2]
- **Training-data generation** at scale, turning raw PDF archives into clean, structured corpora for LLM and VLM pretraining.[1]
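The configurable token budget in the list above comes down to choosing a resolution preset per request. The sketch below maps each mode to a set of keyword arguments. The resolutions follow the mode table earlier in the article, but the argument names `base_size`, `image_size` and `crop_mode`, and the `infer` call in the final comment, are assumptions based on the model card examples and should be verified there before use.

```python
# Assumed per-mode keyword arguments; verify names and values against the
# official model card before relying on them.
MODE_SETTINGS = {
    "Tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "Small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "Base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "Large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "Gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}


def settings_for(mode: str) -> dict:
    """Return a copy of the keyword arguments for one resolution mode."""
    try:
        return dict(MODE_SETTINGS[mode])
    except KeyError:
        raise ValueError(
            f"unknown mode {mode!r}; choose from {sorted(MODE_SETTINGS)}"
        ) from None


kwargs = settings_for("Gundam")
print(kwargs)
# A call might then look like the following (hypothetical, not executed here):
# model.infer(tokenizer, prompt=prompt, image_file="page.png", **kwargs)
```

Keeping the presets in one table means a caller can switch between speed and fidelity by changing a single string, without touching the rest of the pipeline.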

## Is DeepSeek-OCR open source?

Yes. DeepSeek released both the source code and the BF16 model weights under the permissive MIT license, with the code on GitHub and the weights on Hugging Face.[2][3] The official instructions provide both [vLLM](/wiki/vllm) and [Transformers](/wiki/transformers_library) inference paths, with [SGLang](/wiki/sglang) added by the community soon after release. The tested environment is Python 3.12.9, [CUDA](/wiki/cuda) 11.8, PyTorch 2.6.0 and FlashAttention 2.7.3, and the example prompts cover layout and non-layout OCR, table extraction and figure parsing.[2][3] The repository ships scripts that take an image or a PDF directory and emit Markdown, with optional bounding-box annotations for downstream verification.

## Why does DeepSeek-OCR matter for long-context LLMs?

DeepSeek-OCR empirically studies the mapping from N text tokens to the minimum number of vision tokens needed for decoding, supporting the view that near-lossless 10x "optical" context compression is feasible for many documents. The paper also outlines a memory-decay analogy: older conversational history could be rendered at progressively lower resolutions to simulate a forgetting curve, trading fidelity for token savings while keeping recent context sharp.[1] A future long-context model might keep recent turns in Large or Gundam mode, page back into Small as turns age, and finally let very old context fall to Tiny, mirroring how human memory blurs older details. The framing has drawn attention from researchers working on long-context [transformer](/wiki/transformer) memory because it suggests an axis orthogonal to standard attention-window extension: instead of stretching the window, optical compression rewrites the unit of context itself.[1]
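The memory-decay analogy can be illustrated with a toy budget calculation. Everything specific here is hypothetical: the paper only outlines the idea, so the age thresholds, the inclusion of a Base tier, and the one-page-per-turn simplification are assumptions chosen to show the shape of the trade-off.

```python
# Vision tokens per mode, from the multi-resolution table.
MODE_TOKENS = {"Large": 400, "Base": 256, "Small": 100, "Tiny": 64}

# Hypothetical age thresholds (in turns); the paper does not specify any.
TIERS = [(2, "Large"), (8, "Base"), (32, "Small")]


def mode_for_age(age: int) -> str:
    """Older context gets a smaller vision-token budget."""
    for max_age, mode in TIERS:
        if age <= max_age:
            return mode
    return "Tiny"


def history_budget(n_turns: int) -> int:
    """Total vision tokens if each past turn is rendered as one page."""
    return sum(MODE_TOKENS[mode_for_age(age)] for age in range(n_turns))


print(mode_for_age(0), mode_for_age(10), mode_for_age(100))  # Large Small Tiny
print(history_budget(40), 40 * MODE_TOKENS["Large"])         # 5584 16000
```

With these made-up tiers, 40 turns of history cost 5,584 vision tokens instead of the 16,000 needed to keep every turn at Large resolution, at the price of lower fidelity for older turns.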

## How does DeepSeek-OCR relate to prior work?

DeepSeek-OCR builds on ideas from encoder-decoder OCR (for example [GOT-OCR2.0](/wiki/got_ocr)) and high-resolution VLMs (for example [Qwen-VL](/wiki/qwen_vl) and its NaViT-style packing, plus [InternVL](/wiki/internvl) tiling). Its decoder leverages the *DeepSeekMoE* designs introduced in [DeepSeek-V2](/wiki/deepseek_v2) and [DeepSeek-V3](/wiki/deepseek_v3).[14][15][16][4][5] The window-attention front end is borrowed conceptually from the [Segment Anything Model (SAM)](/wiki/segment_anything), and the global-attention back end from [CLIP](/wiki/clip).[17][18]

The closest predecessor is [GOT-OCR2.0](/wiki/got_ocr), which Haoran Wei co-authored: it already pushed end-to-end document OCR with a 256-token vision encoder, and DeepSeek-OCR extends that idea by training for variable token budgets and pairing the encoder with an MoE decoder rather than a dense one.[1][14]

## Successor: DeepSeek-OCR-2

In January 2026 the same authors released DeepSeek-OCR-2, titled *Visual Causal Flow*, on arXiv.[19] The follow-up introduces a DeepEncoder V2 that can dynamically reorder visual tokens according to image semantics rather than processing them in a rigid raster-scan order, which the authors describe as giving the encoder "causal reasoning capabilities."[19] DeepSeek-OCR-2 reports an OmniDocBench v1.5 overall score of 91.09%, an improvement of about 3.73 percentage points over the original system, with much of the gain coming from better reading-order recognition.[19][20]

## See also

- [GOT-OCR2.0](/wiki/got_ocr)
- [Qwen-VL](/wiki/qwen_vl)
- [InternVL](/wiki/internvl)
- [CLIP](/wiki/clip)
- [Segment Anything](/wiki/segment_anything)
- [OmniDocBench](/wiki/omnidocbench)
- [DeepSeek-V3](/wiki/deepseek_v3)
- [Mixture of Experts](/wiki/mixture_of_experts)
- [Hugging Face](/wiki/hugging_face)

## References

1. Wei, Haoran; Sun, Yaofeng; Li, Yukun. "DeepSeek-OCR: Contexts Optical Compression." arXiv preprint arXiv:2510.18234, 21 October 2025. https://arxiv.org/abs/2510.18234
2. DeepSeek-AI. "DeepSeek-OCR" GitHub repository (MIT license, released 20 October 2025). https://github.com/deepseek-ai/DeepSeek-OCR
3. DeepSeek-AI. "deepseek-ai/DeepSeek-OCR" Hugging Face model card. https://huggingface.co/deepseek-ai/DeepSeek-OCR
4. DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, May 2024. https://arxiv.org/abs/2405.04434
5. DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, December 2024. https://arxiv.org/abs/2412.19437
6. PaddlePaddle. "PaddleOCR" toolkit. https://github.com/PaddlePaddle/PaddleOCR
7. Schuhmann, Christoph et al. "LAION-5B: An open large-scale dataset for training next generation image-text models." NeurIPS 2022. https://laion.ai/blog/laion-5b/
8. Gu, Jiaxi et al. "Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark." NeurIPS 2022. https://arxiv.org/abs/2202.06767
9. Wang, Bin et al. "MinerU: An Open-Source Solution for Precise Document Content Extraction." arXiv:2409.18839, September 2024. https://arxiv.org/abs/2409.18839
10. Chen, Xinhua et al. "OneChart: Purify the Chart Structural Extraction via One Auxiliary Token." arXiv:2404.09987, April 2024. https://arxiv.org/abs/2404.09987
11. RDKit, Open-source cheminformatics. https://www.rdkit.org/
12. Liu, Chenglong et al. "Focus Anywhere for Fine-grained Multi-page Document Understanding." arXiv:2405.14295 (Fox benchmark), May 2024. https://arxiv.org/abs/2405.14295
13. Ouyang, Linke et al. "OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations." CVPR 2025. https://github.com/opendatalab/OmniDocBench
14. Wei, Haoran et al. "General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model." arXiv:2409.01704 (GOT-OCR2.0), September 2024. https://arxiv.org/abs/2409.01704
15. Bai, Jinze et al. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv:2308.12966, August 2023. https://arxiv.org/abs/2308.12966
16. Chen, Zhe et al. "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks." CVPR 2024. https://arxiv.org/abs/2312.14238
17. Kirillov, Alexander et al. "Segment Anything." ICCV 2023. https://arxiv.org/abs/2304.02643
18. Radford, Alec et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021 (CLIP). https://arxiv.org/abs/2103.00020
19. Wei, Haoran; Sun, Yaofeng; Li, Yukun. "DeepSeek-OCR 2: Visual Causal Flow." arXiv:2601.20552, January 2026. https://arxiv.org/abs/2601.20552
20. DeepSeek-AI. "deepseek-ai/DeepSeek-OCR-2" Hugging Face model card and paper page. https://huggingface.co/papers/2601.20552