DeepSeek-VL2
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,809 words
DeepSeek-VL2 is an open-weights family of Mixture-of-Experts (MoE) vision-language models released by the Chinese AI laboratory DeepSeek in December 2024.[^1][^2] The series comprises three sizes, DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with roughly 1.0 billion, 2.8 billion, and 4.5 billion activated parameters respectively, drawn from sparse MoE backbones of 3, 16, and 27 billion total parameters.[^1][^3][^4] Compared with its dense predecessor DeepSeek-VL (March 2024), DeepSeek-VL2 swaps in the DeepSeekMoE language backbone with Multi-head Latent Attention (MLA), which compresses the key-value cache, and introduces a dynamic tiling vision encoder that handles arbitrary aspect ratios up to a 1152 by 1152 effective field (nine 384 by 384 tiles plus a global thumbnail).[^1][^5] On document and OCR benchmarks, the 4.5B-active flagship reaches 93.3 on DocVQA, 86.0 on ChartQA, 81.4 on AI2D, and 811 on OCRBench (the Small variant scores higher, at 834) while activating roughly 55 to 65 percent as many parameters as dense 7B-8B baselines such as Qwen2-VL-7B and InternVL2-8B.[^1][^5]
| Item | Value |
|---|---|
| Developer | DeepSeek (DeepSeek-AI) |
| Initial release | December 13, 2024 |
| Paper | arXiv:2412.10302 |
| Variants | Tiny (3B total / 1.0B active), Small (16B total / 2.8B active), Full (27B total / 4.5B active) |
| Vision encoder | SigLIP-SO400M-384 with dynamic tiling |
| Language backbone | DeepSeekMoE with Multi-head Latent Attention |
| Adaptor | 2x2 pixel shuffle plus two-layer MLP |
| Context length | 4096 tokens |
| Code license | MIT |
| Model license | DeepSeek Model License (commercial use permitted) |
| Repository | github.com/deepseek-ai/DeepSeek-VL2 |
DeepSeek was founded in 2023 in Hangzhou and grew out of the quantitative trading firm High-Flyer. By the spring of 2024 the lab had released the two lines of work that DeepSeek-VL2 would later combine: a dense vision-language model and a sparse language model. DeepSeek-VL (arXiv:2403.05525, March 8, 2024) was a dense vision-language model published in 1.3B and 7B sizes built around a hybrid vision encoder that combined a SigLIP variant with a SAM-derived high-resolution branch, aimed at 1024 by 1024 inputs and "real-world" use cases such as web screenshots, PDFs, OCR, charts, and knowledge questions.[^6] DeepSeek-VL emphasised preserving language ability by mixing text-only data into pretraining from the beginning, and structured its instruction tuning around a taxonomy derived from real user queries.[^6]
In parallel, DeepSeek released the language-only DeepSeek-V2 in May 2024, introducing the DeepSeekMoE architecture and the Multi-head Latent Attention (MLA) mechanism that compresses the key-value cache into a low-rank latent space to cut memory bandwidth during decoding.[^7] DeepSeek-VL2, posted to arXiv on December 13, 2024 as arXiv:2412.10302, replaces the dense DeepSeek-LLM backbone of the original DeepSeek-VL with this MoE plus MLA backbone, while simultaneously upgrading the visual front end from the hybrid SigLIP/SAM design to a single SigLIP-SO400M-384 encoder operating on a dynamic grid of tiles.[^1][^5]
The paper is led by Zhiyu Wu and Xiaokang Chen with a team of authors from DeepSeek and lists multimodal understanding (visual question answering, OCR, document/table/chart parsing, and visual grounding) as its target capabilities.[^1] The release fits into DeepSeek's broader 2024 pattern of open-weights releases under permissive terms: model code is published under the MIT License while weights are governed by the DeepSeek Model License, which allows commercial use.[^2][^3]
DeepSeek-VL2 has three modules stacked in the conventional VLM pattern: a vision encoder, a vision-language adaptor, and an MoE language model.[^5][^8]
The visual front end is the SigLIP-SO400M-384 model, a Shape-Optimised ViT variant of SigLIP trained at a fixed 384 by 384 input resolution.[^5][^8] To handle the high-resolution, variable-aspect-ratio images that arise in OCR, document parsing, and chart understanding, DeepSeek-VL2 wraps SigLIP in a dynamic tiling pipeline. The image is resized to one of a set of candidate resolutions of the form (m by 384, n by 384), where m and n are positive integers and the product m times n is bounded by nine for standard evaluation.[^5][^8] The candidate that minimises padding is selected, the image is split into m by n non-overlapping local tiles, and a separate global thumbnail tile is also produced. Each tile is independently encoded by SigLIP, yielding 27 by 27 = 729 patch embeddings per tile.[^5][^8] At the maximum 3 by 3 layout the effective resolution is 1152 by 1152, and the model uses up to ten tiles (nine local plus one global) in standard configurations. For the InfoVQA evaluation, which contains very elongated infographics, the cap is loosened to m times n less than or equal to eighteen.[^5]
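The grid selection described above can be sketched in a few lines. This is a minimal illustration, not the repository's code: the text only says that the candidate minimising padding is chosen, so the sketch assumes an AnyRes-style rule (prefer the candidate that preserves the most of the original resolution, then the one with the least wasted canvas area). The names `fit_inside` and `select_grid`, and the reading of m as columns and n as rows, are assumptions made for the example.

```python
TILE = 384  # SigLIP input side length, taken from the text


def fit_inside(width, height, canvas_w, canvas_h):
    """Aspect-preserving resize of (width, height) into the canvas, in whole pixels."""
    if canvas_w * height <= canvas_h * width:  # width is the limiting side
        return canvas_w, height * canvas_w // width
    return width * canvas_h // height, canvas_h


def select_grid(width, height, max_tiles=9):
    """Return (cols, rows) for the local tile grid; assumed selection rule, see lead-in."""
    best_key, best_grid = None, None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):  # keeps cols * rows <= max_tiles
            canvas_w, canvas_h = cols * TILE, rows * TILE
            new_w, new_h = fit_inside(width, height, canvas_w, canvas_h)
            # Pixels of real image content kept, never more than the original has.
            effective = min(new_w * new_h, width * height)
            wasted = canvas_w * canvas_h - effective
            key = (-effective, wasted)  # most content first, then least waste
            if best_key is None or key < best_key:
                best_key, best_grid = key, (cols, rows)
    return best_grid


cols, rows = select_grid(800, 600)
print(cols, rows, "local tiles:", cols * rows, "total with thumbnail:", cols * rows + 1)
```

Passing `max_tiles=18` corresponds to the relaxed cap the text mentions for InfoVQA.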
The output of SigLIP is passed through a 2 by 2 pixel shuffle that compresses each tile's 27 by 27 grid of visual tokens to 14 by 14 = 196 tokens, followed by a two-layer multilayer perceptron that projects into the language model embedding space.[^5][^8] The model then assembles a sequence of visual tokens organised by tile, with special delimiters: <tile_newline> marks the end of each row of patches within a tile, and <view_separator> separates the global thumbnail view from the local tiles.[^5][^8] For grounding tasks, the special tokens <|ref|>...<|/ref|> and <|det|>[x1,y1,x2,y2]<|/det|> encode referring expressions and bounding boxes respectively, and <|grounding|> marks grounded captioning queries.[^2][^9]
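The token arithmetic in this paragraph can be checked with a short calculation. The sketch below assumes that the 27 by 27 grid is padded to an even size before the 2 by 2 shuffle (which is how 27 becomes 14), and it counts only patch tokens, not the `<tile_newline>` or `<view_separator>` delimiters. The function names are illustrative.

```python
import math

PATCHES_PER_SIDE = 27  # SigLIP patch grid per 384x384 tile, from the text
SHUFFLE = 2            # 2x2 pixel shuffle factor, from the text


def tokens_per_tile():
    """Visual tokens per tile after the pixel shuffle (assumes padding 27 -> 28)."""
    side = math.ceil(PATCHES_PER_SIDE / SHUFFLE)
    return side * side


def visual_tokens(cols, rows):
    """Patch tokens for cols x rows local tiles plus one global thumbnail."""
    return (cols * rows + 1) * tokens_per_tile()


print(tokens_per_tile())    # 196
print(visual_tokens(3, 3))  # 1960
```

At the maximum 3 by 3 layout the patch tokens alone therefore take up a little under half of a 4096-token context.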
The language model is a DeepSeekMoE network with Multi-head Latent Attention inherited from DeepSeek-V2.[^1][^5] MLA replaces standard multi-head attention with a design that compresses the key and value vectors into a single low-rank latent vector that is cached during autoregressive decoding, shrinking the KV cache memory footprint and increasing throughput relative to vanilla multi-head self-attention or even grouped-query attention.[^7]
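To make the cache saving concrete, the following sketch compares how many values per token per layer each attention scheme would need to cache. All dimensions are hypothetical and chosen only for illustration; they are not DeepSeek-VL2's configuration, and the MLA figure ignores any additional per-token state a real implementation may keep.

```python
def mha_cache_per_token(n_heads, head_dim):
    """Standard multi-head attention: one key and one value vector per head."""
    return 2 * n_heads * head_dim


def gqa_cache_per_token(n_kv_heads, head_dim):
    """Grouped-query attention: keys and values shared across groups of query heads."""
    return 2 * n_kv_heads * head_dim


def mla_cache_per_token(latent_dim):
    """MLA as described in the text: a single low-rank latent vector is cached."""
    return latent_dim


# Hypothetical dimensions, for illustration only.
n_heads, n_kv_heads, head_dim, latent_dim = 32, 8, 128, 512

mha = mha_cache_per_token(n_heads, head_dim)
gqa = gqa_cache_per_token(n_kv_heads, head_dim)
mla = mla_cache_per_token(latent_dim)
print(f"MHA {mha}, GQA {gqa}, MLA {mla} cached values per token per layer")
print(f"MLA uses {mha // mla}x less cache than MHA and {gqa // mla}x less than GQA here")
```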
The MoE blocks follow the DeepSeekMoE pattern of routed experts plus a small number of always-on shared experts. The three sizes differ in the number of routed experts and the embedding width, while sharing the top-K = 6 routing pattern and two shared experts:[^5]
| Variant | Total params | Activated params | Routed experts | Shared experts | Routing |
|---|---|---|---|---|---|
| DeepSeek-VL2-Tiny | 3B (DeepSeekMoE-3B base) | 1.0B (0.57B LLM) | 64 | 2 | Softmax |
| DeepSeek-VL2-Small | 16B (DeepSeekMoE-16B base) | 2.8B (2.4B LLM) | 64 | 2 | Softmax |
| DeepSeek-VL2 | 27B (DeepSeekMoE-27B base) | 4.5B (4.1B LLM) | 72 | 2 | Sigmoid with expert bias |
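A toy version of the routing pattern in the table may help: two shared experts always run, and each token is additionally sent to the six routed experts with the highest gate scores. This is a minimal sketch under simplifying assumptions: it uses softmax gating only (the sigmoid-with-bias variant listed for the full model is not shown), does not renormalise the selected gates, and treats experts as plain Python callables on scalars. The names `route` and `moe_layer` are made up for the example.

```python
import math


def route(logits, top_k=6):
    """Return {expert_index: gate_weight} for the top_k routed experts (softmax gates)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    return {i: probs[i] for i in chosen}


def moe_layer(x, logits, routed_experts, shared_experts, top_k=6):
    """Shared experts always contribute; routed experts contribute weighted by their gate."""
    out = sum(expert(x) for expert in shared_experts)
    for index, gate in route(logits, top_k).items():
        out += gate * routed_experts[index](x)
    return out


# Toy setup: 64 routed experts and 2 shared experts, as in the Tiny and Small rows.
routed = [lambda x, scale=i: scale * x for i in range(64)]
shared = [lambda x: x, lambda x: x]
router_logits = [float(i) for i in range(64)]

print(sorted(route(router_logits)))
print(moe_layer(1.0, router_logits, routed, shared))
```

Only 6 of the 64 routed experts run for this token, which is why the activated parameter count is so much smaller than the total.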
The Tiny variant disables MLA in some implementations because its base LLM uses a different attention configuration, and downstream inference engines such as SGLang note this as a special case.[^4][^10] All three variants share the same vision encoder and adaptor design.[^5]
DeepSeek-VL2 is trained in three stages, all of which operate on a Mixture of Experts (MoE) backbone.[^5][^8]
In the first stage, vision-language alignment, the language model is held fixed while the vision-language MLP adaptor (and, in this work, the vision encoder) is trained on roughly 1.2 million caption and conversation samples drawn from ShareGPT4V.[^5][^8] This stage establishes a shared embedding space without disturbing the pretrained language weights.
In the pretraining stage, all parameters, including the vision encoder, adaptor, and MoE language model, are unfrozen and jointly trained on a much larger mixture.[^5][^8][^11] The pretraining mix is approximately 70 percent vision-language data and 30 percent text-only data sourced from DeepSeek-V2's corpus, totalling roughly 800 billion image-text tokens for each variant (798.5B for Tiny, 808.9B for Small, 796.5B for the full model).[^5] The vision-language component combines:[^5]
The final stage is supervised fine-tuning (SFT) over roughly 19.5-20 billion tokens of mixed multimodal and text-only instruction data.[^5] The multimodal fine-tuning corpus covers general VQA, OCR and document understanding, table/chart QA, mathematical reasoning, textbook question answering, web-to-code, visual grounding, and grounded dialogue, with text-only conversations from the DeepSeek-V2 SFT corpus folded back in to maintain language quality.[^5][^8]
Training uses bf16 mixed precision. At inference time, the recommended sampling temperature is at most 0.7.[^3]
DeepSeek-VL2 was evaluated on a broad suite of multimodal benchmarks, including document, OCR, and diagram tasks (DocVQA, OCRBench, TextVQA, ChartQA, AI2D), general multimodal understanding (MMBench-V1.1, MME, MMStar), and mathematical reasoning in visual contexts (MathVista).[^5][^9]
| Benchmark | DeepSeek-VL2-Tiny (1.0B active) | DeepSeek-VL2-Small (2.8B active) | DeepSeek-VL2 (4.5B active) |
|---|---|---|---|
| OCRBench | 809 | 834 | 811 |
| DocVQA (test) | 88.9 | 92.3 | 93.3 |
| ChartQA | 81.0 | 84.5 | 86.0 |
| TextVQA | 80.7 | 83.4 | 84.2 |
| MMBench-V1.1 | 68.3 | 79.3 | 79.2 |
| MMStar | 45.9 | 57.0 | 61.3 |
| MathVista | 53.6 | 60.7 | 62.8 |
| AI2D | 71.6 | 80.0 | 81.4 |
Source: aggregated from the arXiv paper and the third-party Moonlight literature review.[^5][^12]
Several observations follow from the paper's comparison tables. First, the Small variant occasionally outperforms or ties the full model: OCRBench peaks at 834 for the Small model against 811 for the 4.5B-active full model, and the two are within 0.1 points on MMBench-V1.1, which suggests that the full model's extra experts mainly help on harder reasoning tasks such as MMStar and MathVista.[^5][^12] Second, DeepSeek-VL2 matches or comes close to larger dense and proprietary models on document tasks while activating fewer parameters than 7B-class peers: 93.3 on DocVQA against GPT-4o's 92.8 and 84.2 on TextVQA against Qwen2-VL-7B's 84.3, while the Small variant's OCRBench score of 834 sits well above GPT-4o's 736.[^9][^13]
On general multimodal benchmarks the picture is mixed. DeepSeek-VL2 reaches 83.1 on MMBench against 85.0 for Qwen2-VL-7B and 85.0 for InternVL2-8B, and 2,253 on MME against 2,327 for Qwen2-VL-7B, leaving a small gap to the larger dense models on language-heavy reasoning while matching or nearly matching them on the OCR and chart tasks.[^13]
Third, the parameter-efficiency curve is steep. DeepSeek-VL2-Small activates 2.8B parameters, about 40 percent of a 7B-class dense model, yet lands within two points of the full DeepSeek-VL2 on the document benchmarks (92.3 versus 93.3 on DocVQA, 84.5 versus 86.0 on ChartQA) and posts the family's highest OCRBench score at 834. DeepSeek-VL2-Tiny at 1.0B active is the smallest member but still scores 88.9 on DocVQA and 80.7 on TextVQA, comparable to InternVL2-2B and Qwen2-VL-2B on document tasks despite running with fewer activated parameters per forward pass.[^5][^13] The MathVista numbers (53.6, 60.7, 62.8) and the AI2D scores (71.6, 80.0, 81.4) show monotonic gains with scale, suggesting that for general reasoning and diagram tasks the model benefits from the larger expert pool.[^5]
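The scaling observations can be checked directly against the benchmark table. The snippet below copies the table's scores into a dictionary and reports, for each benchmark, which variant scores highest and whether the scores rise strictly with model size. The helper names are illustrative.

```python
VARIANTS = ("Tiny", "Small", "Full")

# Scores copied from the benchmark table, in (Tiny, Small, Full) order.
SCORES = {
    "OCRBench": (809, 834, 811),
    "DocVQA (test)": (88.9, 92.3, 93.3),
    "ChartQA": (81.0, 84.5, 86.0),
    "TextVQA": (80.7, 83.4, 84.2),
    "MMBench-V1.1": (68.3, 79.3, 79.2),
    "MMStar": (45.9, 57.0, 61.3),
    "MathVista": (53.6, 60.7, 62.8),
    "AI2D": (71.6, 80.0, 81.4),
}


def best_variant(benchmark):
    """Name of the variant with the highest score on the benchmark."""
    row = SCORES[benchmark]
    return VARIANTS[row.index(max(row))]


def is_monotonic(benchmark):
    """True if scores strictly increase from Tiny to Small to Full."""
    tiny, small, full = SCORES[benchmark]
    return tiny < small < full


for name in SCORES:
    print(f"{name:14s} best={best_variant(name):5s} monotonic={is_monotonic(name)}")
```

On these numbers the Small variant leads only on OCRBench and MMBench-V1.1; every other row increases with size.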
The three published checkpoints are hosted on Hugging Face as deepseek-ai/deepseek-vl2-tiny, deepseek-ai/deepseek-vl2-small, and deepseek-ai/deepseek-vl2.[^3][^4][^14] All three use the same processor class (DeepseekVLV2Processor) and the same modelling code (DeepseekVLV2ForCausalLM), with the variant selected by the model path.[^3] The context window is 4096 tokens for all three variants. This is a practical limit for very long documents, although the adaptor's compression of each tile to 196 visual tokens keeps the per-image cost manageable.[^2][^3]
The dynamic tiling strategy is applied when at most two images are present in the prompt; for three or more images the system falls back to padding each image to a single 384 by 384 tile to keep token budgets manageable.[^3][^4] Visual grounding output is produced as structured tokens of the form <|ref|>description<|/ref|><|det|>[x1,y1,x2,y2]<|/det|>, allowing downstream parsers to extract bounding boxes without coordinate-token engineering.[^2][^9]
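A downstream parser for the grounding format might look like the sketch below. It follows the token layout as written in the text and is deliberately tolerant about bracket nesting and whitespace inside the `<|det|>` span, since the exact serialisation may differ between releases. The function name `parse_grounding` is made up for the example, and the coordinates are returned as raw integers without assuming any particular scale.

```python
import re

# One <|ref|>...<|/ref|> span immediately followed by one <|det|>...<|/det|> span.
GROUNDING = re.compile(
    r"<\|ref\|>(?P<label>.*?)<\|/ref\|>\s*<\|det\|>(?P<boxes>.*?)<\|/det\|>",
    re.DOTALL,
)


def parse_grounding(text):
    """Return a list of (label, [(x1, y1, x2, y2), ...]) pairs found in the text."""
    results = []
    for match in GROUNDING.finditer(text):
        numbers = [int(n) for n in re.findall(r"-?\d+", match.group("boxes"))]
        # Group the integers four at a time; a trailing incomplete group is dropped.
        boxes = [tuple(numbers[i:i + 4]) for i in range(0, len(numbers) - 3, 4)]
        results.append((match.group("label"), boxes))
    return results


sample = "<|ref|>the red mug<|/ref|><|det|>[12,34,256,400]<|/det|> on the desk"
print(parse_grounding(sample))
```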
Beyond the official Transformers integration, DeepSeek-VL2 is supported by inference engines including vLLM and SGLang, and the repository includes a Gradio demo added on December 25, 2024 and a Hugging Face Space added on February 6, 2025.[^2] The DeepSeek-VL2-Tiny variant is small enough to run on consumer GPUs with quantisation, and community tutorials describe fitting all three variants under standard 24GB and 48GB GPU budgets.[^15]
In sparse-MoE inference, DeepSeek-VL2 benefits from two distinct sources of efficiency: the MLA-compressed KV cache reduces memory traffic during autoregressive decoding, while the routing pattern activates only a small fraction of the expert weights for any given token. Together these mean that a 27B-total model can be served within roughly the latency budget of a 4-5B dense model when properly batched.[^7][^11] An implementation note in the SGLang issue tracker indicates that DeepSeek-VL2-Tiny disables MLA in favour of standard attention because its DeepSeekMoE-3B base predates the V2-style MLA design, which means runtime engines need to switch attention kernels based on the model variant.[^10]
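The sparsity behind this claim is simple arithmetic on the parameter counts given earlier in the article. The sketch below computes the fraction of parameters activated per token for each variant, and how each variant's activated count compares with a dense 7B model. The 7B reference point is chosen only for illustration.

```python
# (total parameters, activated parameters) in billions, from the variants table.
VARIANTS = {
    "DeepSeek-VL2-Tiny": (3.0, 1.0),
    "DeepSeek-VL2-Small": (16.0, 2.8),
    "DeepSeek-VL2": (27.0, 4.5),
}


def active_fraction(name):
    """Share of the model's parameters used for any single token."""
    total, active = VARIANTS[name]
    return active / total


def relative_to_dense(name, dense_billions=7.0):
    """Activated parameters as a fraction of a dense model of the given size."""
    _, active = VARIANTS[name]
    return active / dense_billions


for name in VARIANTS:
    print(f"{name:20s} active share {active_fraction(name):.1%}, "
          f"vs 7B dense {relative_to_dense(name):.0%}")
```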
DeepSeek-VL2 sits in a competitive segment of late-2024 open-weights vision-language model releases. The most relevant comparisons are with its own dense predecessor and with three other model families: Qwen2-VL (released by Alibaba in September 2024), InternVL2 (Shanghai AI Lab, July 2024), and LLaVA-OneVision (released August 2024 from the LLaVA line).[^13][^16]
| Model | Released | Active / total params | Vision encoder | Notable design |
|---|---|---|---|---|
| DeepSeek-VL (7B) | March 2024 | 7B dense | SigLIP-L plus SAM-B hybrid | Real-world taxonomy, joint LLM pretraining |
| DeepSeek-VL2 (Full) | December 2024 | 4.5B active / 27B total | SigLIP-SO400M-384 plus dynamic tiles | MoE backbone, MLA, dynamic tiling |
| Qwen2-VL-7B | September 2024 | 7B dense | ViT (about 675M parameters) with native dynamic resolution | Native dynamic-resolution ViT, M-RoPE |
| InternVL2-8B | July 2024 | 8B dense | InternViT-300M | Pixel-unshuffle high-resolution input |
| LLaVA-OneVision-7B | August 2024 | 7B dense | SigLIP-SO400M-384 | Single-image, multi-image, and video unification |
DeepSeek-VL2 differs from these peers along two architectural axes. First, the language backbone is sparse: DeepSeek-VL2-Small competes with 7B and 8B dense models while activating 2.8B parameters, a budget closer to that of Qwen2-VL-2B, and the full DeepSeek-VL2 at 4.5B active sits below 7B dense competitors on activated parameters but above them on several document benchmarks.[^5][^13] Second, where Qwen2-VL embeds native dynamic resolution inside the ViT itself and InternVL2 uses pixel-unshuffle to ingest higher resolutions through a fixed encoder, DeepSeek-VL2 keeps the SigLIP encoder at its native 384 by 384 resolution and instead handles arbitrary aspect ratios at the tiling level. This trades extra forward passes (one per tile) for the ability to reuse a strong off-the-shelf encoder without modification, an approach also used by LLaVA-OneVision's "Higher AnyRes" scheme on the same SigLIP backbone.[^5][^16]
Against its own predecessor DeepSeek-VL, the gap is substantial. The dense 7B DeepSeek-VL pre-dates the dynamic-tiling and MoE eras and operates on a fixed 1024 by 1024 hybrid encoder; DeepSeek-VL2 reports large gains on document and OCR benchmarks at fewer activated parameters and adds new capabilities such as visual grounding tokens, multi-image conversations, and explicit grounded captioning.[^5][^6]
DeepSeek itself released a contemporaneous "unified understanding and generation" multimodal line called Janus in October 2024 and Janus-Pro in January 2025; these are decoder-only models that produce both text and images, and are largely orthogonal to DeepSeek-VL2's strictly understanding-focused design.[^17]
DeepSeek-VL2's training mix and benchmark profile point to a small set of applications that the model is explicitly tuned for. Document understanding and OCR are the strongest cluster: financial filings, scientific papers, and tabular data benefit from the dynamic tiling, with DocVQA, OCRBench, ChartQA, and TextVQA all serving as proxy benchmarks for these tasks.[^5][^11] Visual grounding through the <|ref|> and <|det|> token vocabulary supports downstream pipelines that need bounding boxes for objects referenced in natural language, including grounded captioning and weakly-supervised detection workflows.[^2][^9]
Other reported uses include visual question answering, table-to-text and chart-to-text conversion, plot-to-Python translation, web-page-to-HTML reconstruction (web-to-code), and instruction-following in mixed text and image contexts.[^5][^11] Because both the model code (MIT) and the weights (DeepSeek Model License) permit commercial use, the family has been picked up by inference providers and downstream toolchains.[^2][^3] DeepSeek-VL2 was also among the open-weights VLMs evaluated by document-AI surveys and OCR benchmarks during 2025, often appearing in head-to-head comparisons with Qwen2.5-VL and InternVL-class models.[^18]
A separate strand of applications exploits the dynamic-tiling capability for non-standard aspect ratios. Tall scrolling screenshots and wide panoramic figures, which truncate or pad heavily on fixed-resolution VLMs, fit naturally into DeepSeek-VL2's m by n tile grid. The InfoVQA evaluation in the paper, which uses a relaxed cap of m times n less than or equal to eighteen tiles, demonstrates that the system can run on inputs with extreme aspect ratios when the document content justifies the additional compute.[^5] Inference providers including SiliconFlow expose the full DeepSeek-VL2 model behind an OpenAI-compatible API, while DeepWiki and Roboflow have published practical guides covering grounded captioning, table extraction, and OCR pipelines.[^9][^18]
The paper, model cards, and downstream documentation note several explicit limitations.[^3][^5]
DeepSeek-VL2's importance in the open-weights multimodal landscape comes from three contributions identified in the paper and corroborated by downstream coverage.[^1][^5][^11] First, it is one of the earliest large-scale Mixture-of-Experts vision-language models with publicly released weights, alongside Aria-MoE and MM1.5, demonstrating that MoE pretraining transfers cleanly to a multimodal target. Second, the combination of a fixed SigLIP encoder with a dynamic tiling pipeline yields strong scaling to high-resolution OCR and chart tasks without requiring a custom vision transformer, an approach shared with the earlier LLaVA-OneVision and later echoed by several smaller VLMs. Third, by carrying Multi-head Latent Attention from DeepSeek-V2 into a multimodal setting, DeepSeek-VL2 provides a working example of MLA in production with both pure-text and image-text decoding paths active, foreshadowing the MLA-driven KV-cache efficiency of DeepSeek-V3 and subsequent DeepSeek releases.[^7][^19]
In aggregate, DeepSeek-VL2 helped to consolidate the late-2024 open-weights multimodal stack around a recognisable recipe: SigLIP-SO400M-384 visual front end, dynamic tiling for resolution, MLA for efficient decoding, mixture-of-experts language backbone, and a three-stage alignment-pretraining-fine-tuning training pipeline.[^5][^11] Its release sat within DeepSeek's broader push that month, immediately preceding the December 26, 2024 release of DeepSeek-V3 and the January 2025 release of DeepSeek-R1, which together established DeepSeek as a high-profile open-weights research lab in the run-up to 2025.[^19]
Public reception of DeepSeek-VL2 mixed strong attention to the OCR and document benchmark numbers with caveats around general multimodal reasoning. The Hugging Face paper page and the GitHub repository accumulated heavy engagement in the months after release, with the official repository tracked at over five thousand stars and roughly eighteen hundred forks by early 2026.[^2] Coverage in third-party blogs frequently flagged the OCRBench 834 score as the headline number; the Startup Fortune writeup highlighted that DeepSeek-VL2 was the strongest open-weights VLM on OCRBench at the time of release and pointed to the gap to GPT-4o (834 versus 736) as evidence that open MoE models were closing the multimodal gap with proprietary systems.[^9]
Technical reviews focused on the dynamic tiling pipeline and the MLA-plus-MoE backbone. The Moonlight literature review noted that the model "achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models", but also observed that the gains over Qwen2-VL-7B and InternVL2-8B on general MMBench-style benchmarks were modest.[^12] The Zilliz blog and a paper-summary post by Aakash Kumar Nain documented the architecture in detail, including the role of the global thumbnail and the special-token grammar for grounding.[^8][^11] DeepWiki maintains a structured project wiki for the repository, which is widely cited by downstream developers building OCR and document-AI applications on DeepSeek-VL2.[^18]
Within DeepSeek's own roadmap, the VL2 release marked the visible inflection point at which the lab adopted a uniform MLA-plus-MoE recipe across its product lines. DeepSeek-V3, released two weeks later on December 26, 2024, used the same MLA mechanism and a much larger expert pool to scale into the open-weights frontier for text generation.[^19] The Janus-Pro understanding-and-generation system followed in January 2025 and DeepSeek-OCR (a smaller, OCR-focused model line) later that year, reusing many of the data-processing pipelines documented in the VL2 paper.[^17][^18]