DeepSeek-OCR

DeepSeek-OCR is an open-source end-to-end document OCR and layout understanding system released by DeepSeek in October 2025. It introduces a contexts optical compression paradigm that represents long textual content as compact vision tokens and then decodes them back into text using a lightweight decoder.^[1] The system consists of a purpose-built vision encoder (DeepEncoder) and a 3-billion-parameter Mixture-of-Experts (MoE) decoder (DeepSeek-3B-MoE-A570M). In the accompanying preprint, the authors report ~97% OCR precision at about 10x vision-text compression and ~60% at about 20x on the Fox benchmark, and competitive end-to-end performance on OmniDocBench while using far fewer vision tokens than many baselines.^[1]

The source code and BF16 weights are available on GitHub and Hugging Face under the MIT license.^[2]^[3] On Hugging Face the model was downloaded more than 2.7 million times in its first month, reflecting heavy adoption among document-AI practitioners.^[3]

Overview

Property	Value
Developer	DeepSeek-AI
Lead author	Haoran Wei
Co-authors	Yaofeng Sun, Yukun Li
Initial release	20 October 2025 (code/weights)
Paper release	21 October 2025 (arXiv 2510.18234)
License	MIT
Total parameters	~3 billion (decoder) + ~380M (encoder)
Activated parameters	~570M per token (decoder)
Precision	BF16 (safetensors)
Tasks	OCR, layout parsing, chart/table extraction, geometry, chemistry, captioning, grounding
Languages	~100 (PDF training set)
Backends	vLLM, Hugging Face Transformers, SGLang

DeepSeek-OCR frames long-context handling as an image-to-text transduction problem: documents are rendered as images and passed through a vision encoder that emits a small set of vision tokens, which are then decoded into text (for example Markdown or HTML tables) by the MoE decoder. This approach reduces token costs that transformer LLMs incur for long sequences by leveraging vision as a high-density "optical compression" medium.^[1]

Lead author Haoran Wei previously worked on GOT-OCR2.0, and DeepSeek-OCR builds on that lineage by pushing the encoder to emit far fewer tokens while still enabling near-lossless reconstruction.^[1]

History and release

20 October 2025, initial open-source release. The repository, usage examples and the five inference modes were published on GitHub under the MIT license.^[2]
21 October 2025, preprint posted. The arXiv paper "DeepSeek-OCR: Contexts Optical Compression" by Haoran Wei, Yaofeng Sun and Yukun Li describes the method, data and experiments.^[1]
Weights and model card. A 3B-parameter model (BF16 safetensors) was hosted on Hugging Face the same week with example prompts and environment requirements.^[3]
Community adoption. Within weeks of release, third-party deployments appeared on DeepInfra, Replicate, Google Colab, Kaggle and Docker Model Runner, and SGLang added a server backend alongside the official vLLM and Transformers paths.^[2]^[3]

Architecture

DeepSeek-OCR is a unified encoder-decoder VLM tailored for OCR-centric compression. The two halves are designed to keep activation memory bounded at high resolution while emitting a small, dense set of vision tokens that the decoder can read.

DeepEncoder (vision side)

DeepEncoder (~380M parameters) is engineered to keep activation memory low at high resolutions while outputting few vision tokens. It chains three components in series:

A window-attention backbone based on SAM-base (~80M params) that handles initial visual perception at high resolution without quadratic-attention blowup.
A 2-layer 16x convolutional token compressor (kernel=3, stride=2, padding=1; channels 256 to 1024) that downsamples the patch grid before any global attention is applied.
A dense global-attention backbone based on CLIP-large (~300M params, with the first patch embedding removed because the inputs are tokens, not pixels).
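
The 16x figure for the compressor follows from the standard convolution output-size formula: each stride-2 layer halves the grid along each side, so two layers shrink it 4x per side and 16x in token count. The sketch below checks that arithmetic in plain Python; the function names are illustrative, and the 64 x 64 starting grid (4,096 tokens) is an assumption corresponding to a 1024 x 1024 input split into 16-pixel patches.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def compress_grid(side: int, layers: int = 2) -> int:
    """Apply `layers` stride-2 convolutions to a square token grid of width `side`."""
    for _ in range(layers):
        side = conv_out(side)
    return side


grid_in = 64                        # assumed 64 x 64 grid = 4,096 patch tokens
grid_out = compress_grid(grid_in)   # 64 -> 32 -> 16
print(grid_in ** 2, "->", grid_out ** 2, "tokens")
print("reduction:", (grid_in ** 2) // (grid_out ** 2), "x")
```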

For a 1024 by 1024 input, 4,096 patch tokens are compressed to 256 before the CLIP stage, which keeps GPU memory usage manageable even at high resolution.^[1] The split between a window-attention front end and a global-attention back end is the central architectural idea: the cost of window attention grows linearly with the number of tokens, so the SAM stage can process high-resolution inputs cheaply, while quadratic-cost global attention is reserved for the much smaller post-compression token grid.
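
A rough count of query-key pairs per attention layer illustrates why the ordering matters. This is a back-of-envelope sketch, not a profile of the real model: the 14 x 14 window size is an assumption made for illustration, and the count ignores heads, padding and the handful of global-attention blocks a SAM-style backbone may contain.

```python
def global_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by full self-attention in one layer."""
    return num_tokens * num_tokens


def window_attention_pairs(num_tokens: int, window_tokens: int) -> int:
    """Query-key pairs when each token attends only within its own window."""
    return num_tokens * window_tokens


PATCH_TOKENS = 4096        # 1024 x 1024 input, before compression
COMPRESSED_TOKENS = 256    # after the 16x compressor
WINDOW_TOKENS = 14 * 14    # assumed window size, for illustration only

windowed = window_attention_pairs(PATCH_TOKENS, WINDOW_TOKENS)
global_full = global_attention_pairs(PATCH_TOKENS)
global_small = global_attention_pairs(COMPRESSED_TOKENS)

print(f"window attention on 4,096 tokens : {windowed:,} pairs")
print(f"global attention on 4,096 tokens : {global_full:,} pairs")
print(f"global attention on 256 tokens   : {global_small:,} pairs")
```

Under these assumptions, global attention on the uncompressed grid would score about 20 times more pairs than window attention, whereas global attention after compression is cheaper than either.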

Multi-resolution modes

To support different compression ratios with a single model, DeepEncoder exposes native and dynamic resolution modes:

Mode	Native resolution	Avg. vision tokens	Process	Notes
Tiny	512 x 512	64	resize	Compact pages, slides
Small	640 x 640	100	resize	Widely used in benchmarks
Base	1024 x 1024	256	padding	Preserves aspect ratio; valid tokens < actual tokens
Large	1280 x 1280	400	padding	Higher fidelity
Gundam (dynamic)	n x 640 x 640 (tiles) + 1024 x 1024 (global)	n x 100 + 256 (n in [2,9])	resize + padding	For ultra-dense pages such as newspapers

For padded modes, the count of valid tokens (tokens covering the original page rather than padding) is lower than the actual token count. The Gundam-M variant adds a 200 dpi global view for the densest documents, raising the budget but improving fidelity on small fonts.^[1]
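
The token budgets in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes 16-pixel patches followed by the 16x compressor (so one token per 64 x 64 pixel block), and it estimates valid tokens from the share of a padded square that the page occupies. Helper names are hypothetical, and averages reported on real benchmarks can differ slightly from this estimate.

```python
def native_tokens(side: int) -> int:
    """Vision tokens for a square input, assuming one token per 64 x 64 pixels."""
    return (side // 64) ** 2


def gundam_tokens(n_tiles: int) -> int:
    """n local 640 x 640 tiles (100 tokens each) plus one 1024 x 1024 global view."""
    if not 2 <= n_tiles <= 9:
        raise ValueError("Gundam mode uses between 2 and 9 tiles")
    return n_tiles * native_tokens(640) + native_tokens(1024)


def approx_valid_tokens(actual_tokens: int, width: int, height: int) -> int:
    """Estimate tokens covering the page after padding it to a square.

    Assumes the page fills min(w, h) / max(w, h) of the padded area.
    """
    short, long_side = min(width, height), max(width, height)
    return -(-actual_tokens * short // long_side)   # integer ceiling


for name, side in [("Tiny", 512), ("Small", 640), ("Base", 1024), ("Large", 1280)]:
    print(name, native_tokens(side))
print("Gundam range:", gundam_tokens(2), "to", gundam_tokens(9))
print("Base mode, A4 page (210 x 297):", approx_valid_tokens(256, 210, 297))
```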

Decoder (language side)

The decoder is DeepSeek-3B-MoE-A570M, a 3B-parameter Mixture-of-Experts language model that activates roughly 570M parameters per token. At inference it uses 6 of 64 routed experts plus 2 shared experts, following the DeepSeekMoE design used in DeepSeek-V2 and DeepSeek-V3.^[1]^[4]^[5] It maps the compressed vision-token sequence back into text, optionally formatted as Markdown, HTML tables, SMILES strings or structured dictionaries depending on the prompt. The MoE design gives the decoder enough capacity for multilingual recognition and structured output while keeping per-token activation cost close to that of a 570M dense model.
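
The routing pattern can be illustrated with a toy top-k selection. This is a minimal sketch of the general idea rather than the model's actual router: the random scores stand in for learned router logits, and gating weights, load balancing and the expert networks themselves are omitted.

```python
import random

NUM_ROUTED = 64   # routed experts in the decoder
TOP_K = 6         # routed experts activated per token
NUM_SHARED = 2    # shared experts, always active


def route_token(router_scores, top_k=TOP_K):
    """Return the indices of the top-k routed experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:top_k])


rng = random.Random(0)
scores = [rng.random() for _ in range(NUM_ROUTED)]   # stand-in for router logits
chosen = route_token(scores)
active = len(chosen) + NUM_SHARED

print("routed experts chosen:", chosen)
print(f"experts active for this token: {active} of {NUM_ROUTED + NUM_SHARED}")
```

Only 8 of the 66 experts run for any given token, which is why the activated parameter count is a small fraction of the total.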

Data and tasks

DeepSeek-OCR is trained on a mixture of document OCR ("OCR 1.0"), synthetic structure parsing ("OCR 2.0"), general vision data and text-only corpora. The published training-data shares are roughly 70% OCR (1.0 plus 2.0), 20% general vision, and 10% text-only.^[1]

OCR 1.0 (documents and scenes). 30M PDF pages spanning roughly 100 languages, with about 25M Chinese and English pages and 5M others. Coarse labels are extracted directly from PDFs; fine labels (about 2M pages each for Chinese and English) are built with layout and OCR models such as PP-DocLayout and PaddleOCR/MinerU/GOT-OCR2.0. Natural-scene OCR uses LAION and Wukong images with PaddleOCR labels (about 10M Chinese plus 10M English).^[6]^[7]^[8]^[9]^[1]
OCR 2.0 (structure parsing). About 10M charts rendered via pyecharts/matplotlib and labeled as HTML tables (cf. OneChart); about 5M chemical diagrams generated from PubChem SMILES strings rendered with RDKit; about 1M plane geometry images generated following Slow Perception.^[10]^[11]^[1]
General vision. Captioning, detection and grounding data to preserve a general VLM interface (about 20% of total).^[1]
Text-only. About 10% in-house text-only pretraining at 8,192 sequence length, used to keep the decoder fluent as a language model.^[1]
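
One simple way to realise such a mixture is to draw the source of each training sample according to the reported shares. The preprint does not describe the sampler, so the sketch below is an assumption about how a 70/20/10 split could be implemented, with illustrative source names.

```python
import random
from collections import Counter

# Approximate shares reported for the training mixture.
MIXTURE = {"ocr": 0.70, "general_vision": 0.20, "text_only": 0.10}


def sample_sources(num_samples: int, seed: int = 0) -> Counter:
    """Draw a data source for each training sample according to MIXTURE."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=num_samples))


counts = sample_sources(10_000)
for name in MIXTURE:
    print(f"{name:15s} {counts[name] / 10_000:.1%}")
```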

Training and inference

Training proceeds in two stages. First, DeepEncoder is pretrained with next-token prediction on text rendered into images. Second, the full encoder-decoder is trained jointly on the OCR/general/text-only mixture above. The model uses pipeline parallelism with four stages: SAM and the 16x compressor are treated as a frozen "vision tokenizer" (PP0), CLIP-large acts as an input embedding layer (PP1), and the 3B MoE decoder occupies PP2 and PP3.^[1]

Aspect	Value
Hardware	20 nodes x 8 NVIDIA A100-40G GPUs (160 GPUs total)
Data parallelism	DP=40
Optimizer	AdamW; step-based schedule with an initial learning rate of 3e-5 for the full model (DeepEncoder pretraining uses cosine annealing with a learning rate of 5e-5)
Sequence length	4,096 (multimodal); 8,192 (text-only)
Throughput	~70B multimodal tokens/day or ~90B text-only tokens/day
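
The data-parallel degree in the table follows from the cluster size and the four pipeline stages. The sketch below works through that arithmetic, assuming each pipeline replica occupies one GPU per stage; the stage labels paraphrase the description above and are not configuration keys from the repository.

```python
NODES = 20
GPUS_PER_NODE = 8

# Pipeline stages as described in the text (labels are illustrative).
PIPELINE_STAGES = {
    "PP0": "SAM + 16x compressor (frozen vision tokenizer)",
    "PP1": "CLIP-large (input embedding layer)",
    "PP2": "MoE decoder, part 1",
    "PP3": "MoE decoder, part 2",
}

total_gpus = NODES * GPUS_PER_NODE
data_parallel = total_gpus // len(PIPELINE_STAGES)   # one GPU per stage assumed

print("total GPUs:", total_gpus)
print("pipeline stages:", len(PIPELINE_STAGES))
print("data-parallel replicas:", data_parallel)
```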

On the deployment side, the reference implementation supports vLLM acceleration and Transformers inference. The tested environment is Python 3.12.9, CUDA 11.8, PyTorch 2.6.0 and FlashAttention 2.7.3. The README reports a vLLM PDF concurrency rate of about 2,500 tokens per second on a single A100-40G; the figure is configuration-dependent and is best treated as a rough reference point for back-of-envelope planning rather than a guaranteed rate.^[2]^[3]

Evaluation

Compression study (Fox benchmark)

On the English subset of the Fox benchmark (documents with 600 to 1,300 text tokens), the preprint reports decoding precision as a function of compression. The numbers below give precision for the Tiny mode (64 tokens) and the Small mode (100 tokens):^[1]^[12]

Text tokens (range)	Precision (64)	Compression (64)	Precision (100)	Compression (100)	Pages
600-700	96.5%	10.5x	98.5%	6.7x	7
700-800	93.8%	11.8x	97.3%	7.5x	28
800-900	83.8%	13.2x	96.8%	8.5x	28
900-1000	85.9%	15.1x	96.8%	9.7x	14
1000-1100	79.3%	16.5x	91.5%	10.6x	11
1100-1200	76.4%	17.7x	89.8%	11.3x	8
1200-1300	59.1%	19.7x	87.1%	12.6x	4

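Precision falls as the compression ratio rises, though not strictly monotonically (at 64 tokens the 900-1000 bin scores slightly higher than the 800-900 bin). Below about 10x, decoding is close to lossless (roughly 97 to 98%); in the 10 to 12x range precision lies between about 90% and 96%; and it drops to about 60% as the ratio approaches 20x.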
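
The table can be queried directly to find how much compression a given precision target allows. The sketch below copies the rows above into a list and returns the largest tabulated ratio that still meets a threshold; the function names are illustrative.

```python
from typing import Optional

# (text-token range, precision@64, ratio@64, precision@100, ratio@100)
FOX_ROWS = [
    ("600-700",   96.5, 10.5, 98.5,  6.7),
    ("700-800",   93.8, 11.8, 97.3,  7.5),
    ("800-900",   83.8, 13.2, 96.8,  8.5),
    ("900-1000",  85.9, 15.1, 96.8,  9.7),
    ("1000-1100", 79.3, 16.5, 91.5, 10.6),
    ("1100-1200", 76.4, 17.7, 89.8, 11.3),
    ("1200-1300", 59.1, 19.7, 87.1, 12.6),
]


def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    return text_tokens / vision_tokens


def max_ratio_at(min_precision: float) -> Optional[float]:
    """Largest tabulated compression ratio whose precision meets the threshold."""
    points = []
    for _, p64, r64, p100, r100 in FOX_ROWS:
        points += [(r64, p64), (r100, p100)]
    ok = [ratio for ratio, precision in points if precision >= min_precision]
    return max(ok) if ok else None


print(compression_ratio(1000, 100))   # 10.0
for target in (95, 90, 85):
    print(f">= {target}% precision: up to {max_ratio_at(target)}x")
```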

OmniDocBench (end-to-end document parsing)

On OmniDocBench (CVPR 2025), DeepSeek-OCR reports competitive overall edit distance while using far fewer vision tokens than most end-to-end baselines (lower is better; "Tokens" are average vision tokens per page). Selected rows are reproduced below from the paper:^[1]^[13]

Model	Tokens (avg.)	Overall (English)	Overall (Chinese)
Nougat	2352	0.452	0.973
SmolDocling	392	0.493	0.816
InternVL2-76B	6790	0.440	0.443
Qwen2.5-VL-7B	3949	0.316	0.399
OLMOCR	3949	0.326	0.469
GOT-OCR2.0	256	0.287	0.411
dots.ocr	3949	0.182	0.261
MinerU-2.0	6790	0.133	0.115
DeepSeek-OCR (Tiny)	64	0.386	0.361
DeepSeek-OCR (Small)	100	0.221	0.284
DeepSeek-OCR (Base)	256 (182 valid)	0.137	0.205
DeepSeek-OCR (Large)	400 (285 valid)	0.138	0.143
DeepSeek-OCR (Gundam)	795	0.127	0.097
DeepSeek-OCR (Gundam-M, 200 dpi)	1853	0.123	0.087

Two headline comparisons are worth pulling out. The Small mode (100 vision tokens) already beats GOT-OCR2.0 (256 tokens) on both the English and Chinese columns, and the Gundam mode (795 tokens) edges out MinerU-2.0 (6,790 tokens) on both while using roughly an eighth of the tokens.^[1]
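
The token-versus-accuracy trade-off can be made explicit by computing which rows are Pareto-optimal, that is, not beaten on both token count and edit distance by any other row. The sketch below uses the English column of the table above; the function name is illustrative.

```python
# (model, avg. vision tokens, overall edit distance on English pages)
ROWS = [
    ("Nougat", 2352, 0.452),
    ("SmolDocling", 392, 0.493),
    ("InternVL2-76B", 6790, 0.440),
    ("Qwen2.5-VL-7B", 3949, 0.316),
    ("OLMOCR", 3949, 0.326),
    ("GOT-OCR2.0", 256, 0.287),
    ("dots.ocr", 3949, 0.182),
    ("MinerU-2.0", 6790, 0.133),
    ("DeepSeek-OCR (Tiny)", 64, 0.386),
    ("DeepSeek-OCR (Small)", 100, 0.221),
    ("DeepSeek-OCR (Base)", 256, 0.137),
    ("DeepSeek-OCR (Large)", 400, 0.138),
    ("DeepSeek-OCR (Gundam)", 795, 0.127),
    ("DeepSeek-OCR (Gundam-M)", 1853, 0.123),
]


def pareto_front(rows):
    """Keep rows not dominated on (tokens, edit distance); lower is better for both."""
    front = []
    for name, tokens, ed in rows:
        dominated = any(
            t <= tokens and e <= ed and (t < tokens or e < ed)
            for _, t, e in rows
        )
        if not dominated:
            front.append(name)
    return front


for name in pareto_front(ROWS):
    print(name)
```

On these selected rows the frontier consists entirely of DeepSeek-OCR modes (Tiny, Small, Base, Gundam and Gundam-M); Large drops out because Base is marginally better with fewer tokens.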

olmOCR-bench

The Hugging Face model card also reports olmOCR-bench results, where DeepSeek-OCR scores 75.7% overall, with 80.2% on table tests, 96.1% on header/footer extraction, 77.2% on arXiv math, 79.4% on long tiny text and 66.4% on multi-column layouts.^[3]

Category-specific results (OmniDocBench)

Some document classes need very few tokens (slides, books), whereas dense layouts such as newspapers benefit from the dynamic Gundam modes:^[1]

Mode	Book	Slides	Financial report	Textbook	Exam paper	Magazine	Academic papers	Notes	Newspaper	Overall
Tiny	0.147	0.116	0.207	0.173	0.294	0.201	0.395	0.297	0.940	0.320
Small	0.085	0.111	0.079	0.147	0.171	0.107	0.131	0.187	0.744	0.205
Base	0.037	0.080	0.027	0.100	0.130	0.073	0.052	0.176	0.645	0.156
Large	0.038	0.108	0.022	0.084	0.109	0.060	0.053	0.155	0.353	0.117
Gundam	0.035	0.085	0.289	0.095	0.094	0.059	0.039	0.153	0.122	0.083
Gundam-M	0.052	0.090	0.034	0.091	0.079	0.079	0.048	0.100	0.099	0.077

The newspaper column shows the value of dynamic resolution most clearly: edit distance falls from 0.940 in Tiny mode to 0.099 in Gundam-M, a roughly 9.5-fold reduction in error on the hardest layout class.
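
In practice the table suggests picking the cheapest mode that meets an error target for a given document class. The sketch below encodes the per-category figures and walks the modes from cheapest to most expensive; the selection rule and names are illustrative rather than part of the released tooling.

```python
CATEGORIES = ["book", "slides", "financial_report", "textbook", "exam_paper",
              "magazine", "academic_papers", "notes", "newspaper"]

# Modes ordered from fewest to most vision tokens; values copied from the table.
MODES = [
    ("Tiny",     [0.147, 0.116, 0.207, 0.173, 0.294, 0.201, 0.395, 0.297, 0.940]),
    ("Small",    [0.085, 0.111, 0.079, 0.147, 0.171, 0.107, 0.131, 0.187, 0.744]),
    ("Base",     [0.037, 0.080, 0.027, 0.100, 0.130, 0.073, 0.052, 0.176, 0.645]),
    ("Large",    [0.038, 0.108, 0.022, 0.084, 0.109, 0.060, 0.053, 0.155, 0.353]),
    ("Gundam",   [0.035, 0.085, 0.289, 0.095, 0.094, 0.059, 0.039, 0.153, 0.122]),
    ("Gundam-M", [0.052, 0.090, 0.034, 0.091, 0.079, 0.079, 0.048, 0.100, 0.099]),
]


def cheapest_mode(category: str, max_edit_distance: float):
    """Return the first (cheapest) mode whose edit distance meets the target."""
    idx = CATEGORIES.index(category)
    for mode, scores in MODES:
        if scores[idx] <= max_edit_distance:
            return mode
    return None


print(cheapest_mode("slides", 0.12))
print(cheapest_mode("book", 0.05))
print(cheapest_mode("newspaper", 0.15))
```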

Throughput and data generation

The preprint highlights that a single A100-40G can generate 200,000+ pages per day, and a 20-node cluster (8 x A100-40G each) reaches roughly 33 million pages per day for large-scale LLM and VLM pretraining data production.^[1] At those rates the system functions less as a consumer OCR tool than as a data factory: feed in raw PDFs, get out clean training corpora.
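
The two throughput figures are consistent with simple linear scaling. The sketch below multiplies the per-GPU lower bound by the cluster size and estimates how long a corpus would take; it assumes perfect scaling across GPUs, and the 30-million-page corpus is an illustrative input.

```python
PAGES_PER_GPU_PER_DAY = 200_000   # lower bound quoted for one A100-40G
NODES, GPUS_PER_NODE = 20, 8
SECONDS_PER_DAY = 86_400

cluster_pages_per_day = PAGES_PER_GPU_PER_DAY * NODES * GPUS_PER_NODE
pages_per_gpu_second = PAGES_PER_GPU_PER_DAY / SECONDS_PER_DAY


def days_to_process(num_pages: int, gpus: int) -> float:
    """Days needed for `num_pages`, assuming linear scaling across `gpus`."""
    return num_pages / (PAGES_PER_GPU_PER_DAY * gpus)


print(f"cluster lower bound: {cluster_pages_per_day:,} pages/day")
print(f"per GPU: {pages_per_gpu_second:.2f} pages/second")
print(f"30M pages on 160 GPUs: {days_to_process(30_000_000, 160):.2f} days")
```

The product comes to 32 million pages per day, in line with the reported figure of roughly 33 million given that 200,000 is quoted as a floor.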

Features and capabilities

Layout-aware OCR with optional detection and structure output (Markdown, HTML, JSON) controlled via prompts.^[1]
Deep parsing of charts, simple geometric figures and chemical diagrams embedded in documents (chart to HTML table; geometry to structured dict; chemistry to SMILES).^[1]
Multilingual recognition covering roughly 100 languages in the PDF training set.^[1]
General visual understanding: the model retains image captioning, detection and grounding interfaces alongside its OCR capabilities, which keeps it useful for broader VLM research.^[1]
Configurable token budget at inference time: callers pick a mode (Tiny through Gundam-M) to trade speed for fidelity on each request, without retraining.^[2]

Installation and usage

The official instructions provide both vLLM and Transformers inference paths, with SGLang added by the community soon after release. Tested environment notes include Python 3.12.9, CUDA 11.8, PyTorch 2.6.0 and FlashAttention 2.7.3; example prompts cover layout and non-layout OCR, table extraction and figure parsing.^[2]^[3] The repository ships scripts that take an image or a PDF directory and emit Markdown, with optional bounding-box annotations for downstream verification.
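
A Transformers-based call might look like the sketch below. It is written from memory of the model card and should be checked against the current card before use: the `infer` method comes from the repository's remote code rather than from Transformers itself, and the parameter names (`base_size`, `image_size`, `crop_mode`, `save_results`), the prompt string and the per-mode settings are assumptions. The model card's example also enables FlashAttention, which is omitted here. Running `run_ocr` requires a CUDA GPU and downloads the weights.

```python
# Per-mode resolution settings (assumed; verify against the model card).
MODE_SETTINGS = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}

# Layout-aware prompt (assumed wording).
MARKDOWN_PROMPT = "<image>\n<|grounding|>Convert the document to markdown. "


def run_ocr(image_file: str, output_path: str, mode: str = "gundam"):
    """Load DeepSeek-OCR via Transformers and convert one image to Markdown."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "deepseek-ai/DeepSeek-OCR"
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModel.from_pretrained(name, trust_remote_code=True,
                                      use_safetensors=True)
    model = model.eval().cuda().to(torch.bfloat16)
    # `infer` is defined by the model's remote code (assumed signature).
    return model.infer(tokenizer, prompt=MARKDOWN_PROMPT, image_file=image_file,
                       output_path=output_path, save_results=True,
                       **MODE_SETTINGS[mode])
```

A call such as `run_ocr("page.png", "out/", mode="small")` would then trade some fidelity for a 100-token budget, where the file names are placeholders.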

Research implications

DeepSeek-OCR empirically studies the mapping from N text tokens to the minimum number of vision tokens needed for decoding, supporting the view that near-lossless 10x "optical" context compression is feasible for many documents. The paper also outlines a memory-decay analogy: older conversational history could be rendered at progressively lower resolutions to simulate a forgetting curve, trading fidelity for token savings while keeping recent context sharp.^[1] A future long-context model might keep recent turns in Large or Gundam mode, page back into Small as turns age, and finally let very old context fall to Tiny, mirroring how human memory blurs older details. The framing has drawn attention from researchers working on long-context transformer memory because it suggests an axis orthogonal to standard attention-window extension: instead of stretching the window, optical compression rewrites the unit of context itself.^[1]
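
The forgetting-curve idea can be sketched as a schedule that maps the age of a conversational turn to a resolution mode. The paper gives no concrete schedule, so the thresholds below are hypothetical, and the sketch assumes each past turn is rendered as a single page image.

```python
MODE_TOKENS = {"large": 400, "small": 100, "tiny": 64}

# Hypothetical age thresholds in turns; not taken from the paper.
DECAY_SCHEDULE = [(4, "large"), (16, "small")]


def mode_for_age(age_in_turns: int) -> str:
    """Pick a resolution mode for a turn that is `age_in_turns` old."""
    for max_age, mode in DECAY_SCHEDULE:
        if age_in_turns <= max_age:
            return mode
    return "tiny"


def history_budget(num_turns: int) -> int:
    """Total vision tokens for a history, one rendered page per turn."""
    return sum(MODE_TOKENS[mode_for_age(age)] for age in range(num_turns))


turns = 40
print("decayed budget:", history_budget(turns), "tokens")
print("all-Large budget:", turns * MODE_TOKENS["large"], "tokens")
```

With these illustrative thresholds, a 40-turn history costs 4,672 vision tokens instead of 16,000 at uniform Large resolution.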

Relation to prior work

DeepSeek-OCR builds on ideas from encoder-decoder OCR (for example GOT-OCR2.0) and high-resolution VLMs (for example Qwen-VL and its NaViT-style packing, plus InternVL tiling). Its decoder leverages the DeepSeekMoE designs introduced in DeepSeek-V2 and DeepSeek-V3.^[14]^[15]^[16]^[4]^[5] The window-attention front end is borrowed conceptually from the Segment Anything Model (SAM), and the global-attention back end from CLIP.^[17]^[18]

The closest predecessor is GOT-OCR2.0, which Haoran Wei co-authored: it already pushed end-to-end document OCR with a 256-token vision encoder, and DeepSeek-OCR extends that idea by training for variable token budgets and pairing the encoder with an MoE decoder rather than a dense one.^[1]^[14]

Successor: DeepSeek-OCR-2

In early 2026 the same authors released DeepSeek-OCR-2, subtitled Visual Causal Flow, on arXiv. The follow-up reports an OmniDocBench v1.5 score of 91.09%, an improvement of 3.73 percentage points over the original system, with most of the gain on reading-order recognition.^[19] The successor uses a similar DeepEncoder split but tightens the causal flow between vision and text tokens during decoding.

References

Wei, Haoran; Sun, Yaofeng; Li, Yukun. "DeepSeek-OCR: Contexts Optical Compression." arXiv preprint arXiv:2510.18234, 21 October 2025. https://arxiv.org/abs/2510.18234
DeepSeek-AI. "DeepSeek-OCR" GitHub repository (MIT license, released 20 October 2025). https://github.com/deepseek-ai/DeepSeek-OCR
DeepSeek-AI. "deepseek-ai/DeepSeek-OCR" Hugging Face model card. https://huggingface.co/deepseek-ai/DeepSeek-OCR
DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, May 2024. https://arxiv.org/abs/2405.04434
DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, December 2024. https://arxiv.org/abs/2412.19437
PaddlePaddle. "PaddleOCR" toolkit. https://github.com/PaddlePaddle/PaddleOCR
Schuhmann, Christoph et al. "LAION-5B: An open large-scale dataset for training next generation image-text models." NeurIPS 2022. https://laion.ai/blog/laion-5b/
Gu, Jiaxi et al. "Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark." NeurIPS 2022. https://arxiv.org/abs/2202.06767
Wang, Bin et al. "MinerU: An Open-Source Solution for Precise Document Content Extraction." arXiv:2409.18839, September 2024. https://arxiv.org/abs/2409.18839
Chen, Jinyue et al. "OneChart: Purify the Chart Structural Extraction via One Auxiliary Token." arXiv:2404.09987, April 2024. https://arxiv.org/abs/2404.09987
RDKit, Open-source cheminformatics. https://www.rdkit.org/
Liu, Chenglong et al. "Focus Anywhere for Fine-grained Multi-page Document Understanding." arXiv:2405.14295 (Fox benchmark), May 2024. https://arxiv.org/abs/2405.14295
Ouyang, Linke et al. "OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations." CVPR 2025. https://github.com/opendatalab/OmniDocBench
Wei, Haoran et al. "General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model." arXiv:2409.01704 (GOT-OCR2.0), September 2024. https://arxiv.org/abs/2409.01704
Bai, Jinze et al. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv:2308.12966, August 2023. https://arxiv.org/abs/2308.12966
Chen, Zhe et al. "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks." CVPR 2024. https://arxiv.org/abs/2312.14238
Kirillov, Alexander et al. "Segment Anything." ICCV 2023. https://arxiv.org/abs/2304.02643
Radford, Alec et al. "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021 (CLIP). https://arxiv.org/abs/2103.00020
Wei, Haoran; Sun, Yaofeng; Li, Yukun. "DeepSeek-OCR 2: Visual Causal Flow." arXiv:2601.20552, January 2026. https://arxiv.org/abs/2601.20552
