DeepSeek-VL2

Chinese AI Mixture of Experts Multimodal AI

Updated Jun 25, 2026

DeepSeek-VL2 is an open-weights family of Mixture-of-Experts (MoE) vision-language models released by the Chinese AI laboratory DeepSeek on December 13, 2024.^[1]^[2] It comes in three sizes, DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0 billion, 2.8 billion, and 4.5 billion activated parameters respectively, drawn from sparse MoE backbones of 3, 16, and 27 billion total parameters.^[1]^[3]^[4] Its two defining features are a dynamic tiling vision encoder that splits high-resolution images into 384 by 384 tiles for OCR and document parsing, and a DeepSeekMoE language backbone with Multi-head Latent Attention (MLA) that compresses the key-value cache for efficient decoding. The 4.5B-active flagship reaches 93.3 on DocVQA and 811 on OCRBench (the Small variant peaks at 834) while activating a little over half the parameters of dense 7B-8B baselines such as Qwen2-VL-7B and InternVL2-8B.^[1]^[5]

The paper frames the release directly: "We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades."^[1] DeepSeek reports that the family "achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models."^[1] Model code is published under the MIT License while the weights are governed by the permissive DeepSeek Model License, which allows commercial use.^[2]^[3]

Item	Value
Developer	DeepSeek (DeepSeek-AI)
Initial release	December 13, 2024
Paper	arXiv:2412.10302
Variants	Tiny (3B total / 1.0B active), Small (16B total / 2.8B active), Full (27B total / 4.5B active)
Vision encoder	SigLIP-SO400M-384 with dynamic tiling
Language backbone	DeepSeekMoE with Multi-head Latent Attention
Adaptor	2x2 pixel shuffle plus two-layer MLP
Context length	4096 tokens
Code license	MIT
Model license	DeepSeek Model License (commercial use permitted)
Repository	github.com/deepseek-ai/DeepSeek-VL2

When was DeepSeek-VL2 released and where did it come from?

DeepSeek was founded in 2023 in Hangzhou and grew out of the quantitative trading firm High-Flyer. By the spring of 2024 the lab had released the two lines of work that DeepSeek-VL2 later combined. DeepSeek-VL (arXiv:2403.05525, March 8, 2024) was a dense vision-language model published in 1.3B and 7B sizes built around a hybrid vision encoder that combined a SigLIP variant with a SAM-derived high-resolution branch, aimed at 1024 by 1024 inputs and "real-world" use cases such as web screenshots, PDFs, OCR, charts, and knowledge questions.^[6] DeepSeek-VL emphasised preserving language ability by mixing text-only data into pretraining from the beginning, and structured its instruction tuning around a taxonomy derived from real user queries.^[6]

In parallel, DeepSeek released the language-only DeepSeek-V2 in May 2024, introducing the DeepSeekMoE architecture and the Multi-head Latent Attention (MLA) mechanism that compresses the key-value cache into a low-rank latent space to cut memory bandwidth during decoding.^[7] DeepSeek-VL2, posted to arXiv on December 13, 2024 as arXiv:2412.10302, replaces the dense DeepSeek-LLM backbone of the original DeepSeek-VL with this MoE plus MLA backbone, while simultaneously upgrading the visual front end from the hybrid SigLIP/SAM design to a single SigLIP-SO400M-384 encoder operating on a dynamic grid of tiles.^[1]^[5]

The paper is led by Zhiyu Wu and Xiaokang Chen with a team of authors from DeepSeek and lists multimodal understanding (visual question answering, OCR, document/table/chart parsing, and visual grounding) as its target capabilities.^[1] The release fits into DeepSeek's broader 2024 pattern of open-weights releases under permissive terms.^[2]^[3]

How does DeepSeek-VL2 work?

DeepSeek-VL2 has three modules stacked in the conventional multimodal VLM pattern: a vision encoder, a vision-language adaptor, and an MoE language model.^[5]^[8]

Vision encoder and dynamic tiling

The visual front end is the SigLIP-SO400M-384 model, a Shape-Optimised ViT variant of SigLIP trained at a fixed 384 by 384 input resolution.^[5]^[8] To handle the high-resolution, variable-aspect-ratio images that arise in OCR, document parsing, and chart understanding, DeepSeek-VL2 wraps SigLIP in a dynamic tiling pipeline. The image is resized to one of a set of candidate resolutions of the form (m by 384, n by 384), where m and n are positive integers and the product m times n is bounded by nine for standard evaluation.^[5]^[8] The candidate that minimises padding is selected, the image is split into m by n non-overlapping local tiles, and a separate global thumbnail tile is also produced. Each tile is independently encoded by SigLIP, yielding 27 by 27 = 729 patch embeddings per tile.^[5]^[8] At the maximum 3 by 3 layout the effective resolution is 1152 by 1152, and the model uses up to ten tiles (nine local plus one global) in standard configurations. For the InfoVQA evaluation, which contains very elongated infographics, the cap is loosened to m times n less than or equal to eighteen.^[5]
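The grid selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: the text only says that the candidate minimising padding is chosen, so the sketch assumes this means the smallest padded fraction of the canvas, assumes ties are broken toward more tiles, and assumes m counts tile columns and n counts tile rows. The function names are hypothetical.

```python
from fractions import Fraction

TILE = 384  # side length of one SigLIP input tile, as stated in the text


def candidate_grids(max_tiles=9):
    """Every (m, n) grid with m * n <= max_tiles; m = columns, n = rows (assumed)."""
    return [(m, n)
            for m in range(1, max_tiles + 1)
            for n in range(1, max_tiles // m + 1)]


def padding_fraction(width, height, m, n):
    """Share of the (m*TILE) x (n*TILE) canvas left as padding after an
    aspect-preserving resize. Exact rational arithmetic avoids float ties."""
    canvas_w, canvas_h = m * TILE, n * TILE
    scale = min(Fraction(canvas_w, width), Fraction(canvas_h, height))
    filled = (width * scale) * (height * scale)
    return 1 - filled / (canvas_w * canvas_h)


def select_grid(width, height, max_tiles=9):
    """Pick the grid with the least padding; ties go to the larger grid (assumption)."""
    return min(candidate_grids(max_tiles),
               key=lambda g: (padding_fraction(width, height, *g), -g[0] * g[1]))


def tile_count(m, n):
    """Local tiles plus the one global thumbnail tile."""
    return m * n + 1


print(select_grid(800, 600))    # (3, 2)
print(select_grid(1000, 1000))  # (3, 3)
print(tile_count(3, 3))         # 10
```

Passing `max_tiles=18` mirrors the relaxed cap the paper uses for InfoVQA.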

Vision-language adaptor

The output of SigLIP is passed through a 2 by 2 pixel shuffle that compresses each tile's 27 by 27 grid of visual tokens to 14 by 14 = 196 tokens, followed by a two-layer multilayer perceptron that projects into the language model embedding space.^[5]^[8] The model then assembles a sequence of visual tokens organised by tile, with special delimiters: <tile_newline> marks the end of each row of patches within a tile, and <view_separator> separates the global thumbnail view from the local tiles.^[5]^[8] For grounding tasks, the special tokens <|ref|>...<|/ref|> and <|det|>[x1,y1,x2,y2]<|/det|> encode referring expressions and bounding boxes respectively, and <|grounding|> marks grounded captioning queries.^[2]^[9]
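To make the 729-to-196 token reduction concrete, the following pure-Python sketch applies a 2 by 2 space-to-depth rearrangement to a 27 by 27 grid of token vectors. It is an illustration under stated assumptions rather than the model's code: zero-padding the odd-sized grid to 28 by 28 and the order in which the four vectors are concatenated are both assumptions, and `pixel_shuffle_2x2` is a hypothetical name.

```python
def pixel_shuffle_2x2(grid):
    """Merge each 2x2 block of token vectors into one vector with 4x the channels.

    `grid` is a square list of rows; each cell is a list of channel values.
    """
    side = len(grid)
    channels = len(grid[0][0])
    # Assumption: an odd-sized grid (27) is zero-padded to the next even size (28).
    if side % 2:
        zero = [0.0] * channels
        grid = [row + [zero] for row in grid] + [[zero] * (side + 1)]
        side += 1
    out = []
    for r in range(0, side, 2):
        out_row = []
        for c in range(0, side, 2):
            # Concatenation order (top-left, top-right, bottom-left, bottom-right)
            # is an illustrative choice.
            out_row.append(grid[r][c] + grid[r][c + 1]
                           + grid[r + 1][c] + grid[r + 1][c + 1])
        out.append(out_row)
    return out


# A 27 x 27 grid with one channel per token, numbered 0..728.
tokens = [[[float(r * 27 + c)] for c in range(27)] for r in range(27)]
merged = pixel_shuffle_2x2(tokens)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 14 14 4
print(merged[0][0])                                    # [0.0, 1.0, 27.0, 28.0]
```

The token count drops by a factor of roughly four while each surviving token carries four times the channels, which the two-layer MLP then projects into the language model's embedding width.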

MoE language backbone and Multi-head Latent Attention

The language model is a DeepSeekMoE network with Multi-head Latent Attention inherited from the DeepSeek-V2/V3 line.^[1]^[5] MLA replaces standard multi-head attention with a design that compresses the key and value vectors into a single low-rank latent vector that is cached during autoregressive decoding, shrinking the KV cache memory footprint and increasing throughput relative to vanilla multi-head self-attention or even grouped-query attention.^[7] In the abstract, DeepSeek describes leveraging "DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput."^[1]
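The effect of caching one latent vector instead of per-head keys and values can be illustrated with simple arithmetic. The dimensions below are hypothetical and chosen only for illustration; they are not DeepSeek-VL2's configuration, and the sketch ignores the small positional component that MLA implementations also cache.

```python
def mha_cache_elements(n_heads, head_dim):
    """Per token, per layer: one key and one value vector for every head."""
    return 2 * n_heads * head_dim


def latent_cache_elements(latent_dim):
    """Per token, per layer: a single shared latent vector."""
    return latent_dim


def cache_bytes(elements_per_token, n_tokens, n_layers, bytes_per_element=2):
    """Total cache size; 2 bytes per element corresponds to bf16."""
    return elements_per_token * n_tokens * n_layers * bytes_per_element


# Hypothetical dimensions, for illustration only.
N_HEADS, HEAD_DIM, LATENT_DIM, N_LAYERS, N_TOKENS = 32, 128, 512, 30, 4096

mha = cache_bytes(mha_cache_elements(N_HEADS, HEAD_DIM), N_TOKENS, N_LAYERS)
mla = cache_bytes(latent_cache_elements(LATENT_DIM), N_TOKENS, N_LAYERS)
print(mha // 2**20, "MiB vs", mla // 2**20, "MiB")  # 1920 MiB vs 120 MiB
```

Under these made-up dimensions the latent cache is sixteen times smaller; the real ratio depends on the model's actual head count and latent width.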

The MoE blocks follow the DeepSeekMoE pattern of routed experts plus a small number of always-on shared experts. The three sizes differ in the number of routed experts and the embedding width, while sharing the top-K = 6 routing pattern and two shared experts:^[5]

Variant	Total params	Activated params	Routed experts	Shared experts	Routing
DeepSeek-VL2-Tiny	3B (DeepSeekMoE-3B base)	1.0B (0.57B LLM)	64	2	Softmax
DeepSeek-VL2-Small	16B (DeepSeekMoE-16B base)	2.8B (2.4B LLM)	64	2	Softmax
DeepSeek-VL2	27B (DeepSeekMoE-27B base)	4.5B (4.1B LLM)	72	2	Sigmoid with expert bias
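The routed-plus-shared pattern in the table can be sketched as follows for the softmax-routed variants. This is a toy illustration, not DeepSeek's code: experts are stand-in Python callables, the gate weights are used without renormalisation (an assumption), and the residual connection and load-balancing terms are omitted. All function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=6):
    """Return [(expert_index, gate_weight)] for the top_k routed experts."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]


def moe_forward(x, routed_experts, shared_experts, router_logits, top_k=6):
    """Sum the always-on shared experts and the gated top_k routed experts."""
    out = [0.0] * len(x)
    for expert in shared_experts:  # shared experts run for every token
        out = [o + y for o, y in zip(out, expert(x))]
    for idx, gate in route_token(router_logits, top_k):
        out = [o + gate * y for o, y in zip(out, routed_experts[idx](x))]
    return out


# Toy setup: 64 routed experts that scale the input, 2 identity shared experts.
routed = [lambda x, s=i: [s * v for v in x] for i in range(64)]
shared = [lambda x: list(x), lambda x: list(x)]
logits = [float(i) for i in range(64)]

print([i for i, _ in route_token(logits)])  # [63, 62, 61, 60, 59, 58]
```

Only six of the 64 routed experts run for a given token, which is why the activated parameter count is a small fraction of the total.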

The Tiny variant disables MLA in some implementations because its base LLM uses a different attention configuration, and downstream inference engines such as SGLang note this as a special case.^[4]^[10] All three variants share the same vision encoder and adaptor design.^[5]

How is DeepSeek-VL2 trained?

DeepSeek-VL2 is trained in three stages, all of which operate on a Mixture of Experts (MoE) backbone.^[5]^[8]

Stage 1: Vision-language alignment

The language model is held fixed while the vision-language MLP adaptor (and, in this work, the vision encoder) is trained on roughly 1.2 million caption and conversation samples drawn from ShareGPT4V.^[5]^[8] This stage establishes a shared embedding space without disturbing the pretrained language weights.

Stage 2: Vision-language pretraining

In the pretraining stage, all parameters, including the vision encoder, adaptor, and MoE language model, are unfrozen and jointly trained on a much larger mixture.^[5]^[8]^[11] The pretraining mix is approximately 70 percent vision-language data and 30 percent text-only data sourced from DeepSeek-V2's corpus, totalling roughly 800 billion image-text tokens for each variant (798.5B for Tiny, 808.9B for Small, 796.5B for the full model).^[5] The vision-language component combines:^[5]

Interleaved image-text data from WIT, WikiHow, a 30 percent subset of OBELICS, Chinese-language content drawn from Wanjuan, and an in-house interleaved collection.
Image captioning data from multiple open-source caption datasets refined through an in-house recaptioning pipeline.
OCR data including LaTeX OCR, a 12-million-sample RenderedText set, and in-house OCR collections.
Visual QA data spanning general VQA, table and chart understanding (PubTabNet, FinTabNet, Docmatix), web-to-code translation (Websight), and plot-to-Python rendering.
Visual grounding and grounded conversation data with bounding-box annotations.

Stage 3: Supervised fine-tuning

The final stage is SFT over roughly 19.5-20 billion tokens of mixed multimodal and text-only instruction data.^[5] The multimodal fine-tuning corpus covers general VQA, OCR and document understanding, table/chart QA, mathematical reasoning, textbook question answering, web-to-code, visual grounding, and grounded dialogue, with text-only conversations from the DeepSeek-V2 SFT corpus folded back in to maintain language quality.^[5]^[8]

The training pipeline uses bf16 mixed precision. For downstream sampling, the model card recommends an inference temperature of at most 0.7.^[3]

How does DeepSeek-VL2 perform?

DeepSeek-VL2 was evaluated on a broad suite of multimodal benchmarks, including document and OCR tasks (DocVQA, OCRBench, TextVQA, ChartQA), diagram understanding (AI2D), general multimodal understanding (MMBench-V1.1, MME, MMStar), and visual mathematical reasoning (MathVista).^[5]^[9]

Benchmark	DeepSeek-VL2-Tiny (1.0B active)	DeepSeek-VL2-Small (2.8B active)	DeepSeek-VL2 (4.5B active)
OCRBench	809	834	811
DocVQA (test)	88.9	92.3	93.3
ChartQA	81.0	84.5	86.0
TextVQA	80.7	83.4	84.2
MMBench-V1.1	68.3	79.3	79.2
MMStar	45.9	57.0	61.3
MathVista	53.6	60.7	62.8
AI2D	71.6	80.0	81.4

Source: aggregated from the arXiv paper and the third-party Moonlight literature review.^[5]^[12]

Several observations follow from the paper's comparison tables. First, the Small variant frequently matches or outperforms the full model on OCR-heavy benchmarks: OCRBench peaks at 834 for the Small model against 811 for the full 4.5B-active variant, which suggests that the full model's extra experts mainly help on harder general-reasoning tasks such as MMStar and MathVista.^[5]^[12] Second, the family is competitive with larger dense and proprietary systems on document tasks while activating fewer parameters: the full model scores 93.3 on DocVQA against GPT-4o's 92.8 and 84.2 on TextVQA against Qwen2-VL-7B's 84.3, and the Small variant's OCRBench score of 834 sits well above GPT-4o's 736.^[9]^[13]

On general multimodal benchmarks the picture is mixed. DeepSeek-VL2 reaches 83.1 on MMBench against 85.0 for both Qwen2-VL-7B and InternVL2-8B, and 2,253 on MME against 2,327 for Qwen2-VL-7B, leaving a small gap to the dense models on general multimodal reasoning even as it matches or leads them on most OCR and chart tasks.^[13]

Third, the parameter-efficiency curve is steep. DeepSeek-VL2-Small activates 2.8B parameters, about 40 percent of a 7B-class dense model, yet lands within about two points of the full model on the document benchmarks (92.3 on DocVQA versus 93.3) and posts the family's highest OCRBench score of 834. DeepSeek-VL2-Tiny at 1.0B active is the smallest member but still scores 88.9 on DocVQA and 80.7 on TextVQA, comparable to InternVL2-2B and Qwen2-VL-2B on document tasks despite running with fewer activated parameters per forward pass.^[5]^[13] The MathVista numbers (53.6, 60.7, 62.8) and the AI2D scores (71.6, 80.0, 81.4) show monotonic gains with scale, suggesting that for general reasoning and diagram tasks the model benefits from the larger expert pool.^[5]
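The parameter ratios behind these comparisons are straightforward to check. The short sketch below uses only the activated and total parameter counts quoted in this article; the helper names are hypothetical.

```python
# (activated, total) parameters in billions, as quoted in the article.
VARIANTS = {
    "DeepSeek-VL2-Tiny": (1.0, 3.0),
    "DeepSeek-VL2-Small": (2.8, 16.0),
    "DeepSeek-VL2": (4.5, 27.0),
}


def activation_share(active_b, total_b):
    """Fraction of a variant's own parameters used per forward pass."""
    return active_b / total_b


def relative_to_dense(active_b, dense_b):
    """Activated parameters as a fraction of a dense model's size."""
    return active_b / dense_b


for name, (active, total) in VARIANTS.items():
    print(f"{name}: {activation_share(active, total):.1%} of its weights active, "
          f"{relative_to_dense(active, 7.0):.0%} of a 7B dense model")
```

The full model activates about 64 percent of a 7B dense model's parameters and about 56 percent of an 8B one, while the Small variant activates 40 percent of a 7B model.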

What are the variants and how is it run?

The three published checkpoints are hosted on Hugging Face as deepseek-ai/deepseek-vl2-tiny, deepseek-ai/deepseek-vl2-small, and deepseek-ai/deepseek-vl2.^[3]^[4]^[14] All three use the same processor class (DeepseekVLV2Processor) and the same modelling code (DeepseekVLV2ForCausalLM), with the variant selected by the model path.^[3] The context window is 4096 tokens for all three variants, a practical limit for very long documents that is partly offset by the adaptor compressing each tile to 196 visual tokens.^[2]^[3]
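A rough token budget shows how quickly tiles consume the context window. The sketch below counts 196 tokens for each local tile and for the global thumbnail, and deliberately leaves out delimiter tokens such as <tile_newline> and <view_separator>, so real usage is somewhat higher. The function names are hypothetical.

```python
CONTEXT_TOKENS = 4096   # context window stated in the article
TOKENS_PER_TILE = 196   # visual tokens per tile after the adaptor


def visual_tokens(m, n):
    """Tokens for an m x n tile grid plus one global thumbnail (delimiters excluded)."""
    return (m * n + 1) * TOKENS_PER_TILE


def remaining_text_budget(grids):
    """Context left for text after encoding every image grid in `grids`."""
    return CONTEXT_TOKENS - sum(visual_tokens(m, n) for m, n in grids)


print(visual_tokens(3, 3))                       # 1960
print(remaining_text_budget([(3, 3)]))           # 2136
print(remaining_text_budget([(3, 3), (3, 3)]))   # 176
```

Two images at the maximum 3 by 3 layout already leave fewer than 200 tokens for the prompt and the answer, before any delimiters are counted.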

The dynamic tiling strategy is applied when at most two images are present in the prompt; for three or more images the system falls back to padding each image to a single 384 by 384 tile to keep token budgets manageable.^[3]^[4] Visual grounding output is produced as structured tokens of the form <|ref|>description<|/ref|><|det|>[x1,y1,x2,y2]<|/det|>, allowing downstream parsers to extract bounding boxes without coordinate-token engineering.^[2]^[9]
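Because the grounding output follows a fixed token pattern, a small regular-expression parser is enough to recover labels and boxes. The sketch below assumes integer coordinates and makes no assumption about their scale; it also tolerates an extra pair of brackets around the box list. The function name `parse_grounding` is hypothetical and not part of the DeepSeek-VL2 codebase.

```python
import re

# One grounded span: <|ref|>label<|/ref|><|det|>...boxes...<|/det|>
GROUNDING_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL
)
# One box: four integers in square brackets (integer coordinates are an assumption).
BOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_grounding(text):
    """Return [(label, [(x1, y1, x2, y2), ...]), ...] for every grounded span."""
    results = []
    for label, det in GROUNDING_RE.findall(text):
        boxes = [tuple(int(v) for v in box) for box in BOX_RE.findall(det)]
        results.append((label.strip(), boxes))
    return results


sample = "<|ref|>the giraffe<|/ref|><|det|>[58,170,664,917]<|/det|> stands on the left."
print(parse_grounding(sample))  # [('the giraffe', [(58, 170, 664, 917)])]
```

Text outside the special tokens is ignored, so the same function can be run over a full grounded caption.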

Beyond the official Transformers integration, DeepSeek-VL2 is supported by inference engines including vLLM and SGLang, and the repository includes a Gradio demo added on December 25, 2024 and a Hugging Face Space added on February 6, 2025.^[2] The DeepSeek-VL2-Tiny variant is small enough to run on consumer GPUs with quantisation, and community tutorials describe fitting all three variants under standard 24GB and 48GB GPU budgets.^[15]

In sparse-MoE inference, DeepSeek-VL2 benefits from two distinct sources of efficiency: the MLA-compressed KV cache reduces memory traffic during autoregressive decoding, while the routed-expert pattern activates only a small fraction of the expert weights for any given token. As a result, the per-token compute of the 27B-total model is closer to that of a 4-5B dense model when properly batched, although the full set of weights still has to be stored.^[7]^[11] An SGLang issue notes that DeepSeek-VL2-Tiny disables MLA in favour of standard attention, attributing this to its DeepSeekMoE-3B base predating the V2-style MLA design, so runtime engines need to switch attention kernels based on the model variant.^[10]

How does DeepSeek-VL2 compare to Qwen2-VL and InternVL2?

DeepSeek-VL2 sits in a competitive segment of late-2024 open-weights vision-language model releases. The most relevant comparisons are with its own dense predecessor and with three other model families: Qwen2-VL (released by Alibaba in September 2024), InternVL2 (Shanghai AI Lab, July 2024), and LLaVA-OneVision (released August 2024 from the LLaVA line).^[13]^[16]

Model	Released	Active / total params	Vision encoder	Notable design
DeepSeek-VL (7B)	March 2024	7B dense	SigLIP-L plus SAM-B hybrid	Real-world taxonomy, joint LLM pretraining
DeepSeek-VL2 (Full)	December 2024	4.5B active / 27B total	SigLIP-SO400M-384 plus dynamic tiles	MoE backbone, MLA, dynamic tiling
Qwen2-VL-7B	September 2024	7B dense	ViT (about 675M parameters) with native dynamic resolution	Native dynamic-resolution ViT, M-RoPE
InternVL2-8B	July 2024	8B dense	InternViT-300M	Pixel-unshuffle high-resolution input
LLaVA-OneVision-7B	August 2024	7B dense	SigLIP-SO400M-384	Single-image, multi-image, and video unification

DeepSeek-VL2 differs from these peers along two architectural axes. First, the language backbone is sparse: DeepSeek-VL2-Small competes with 7B and 8B dense models while activating 2.8B parameters, a budget closer to that of Qwen2-VL-2B, and the full DeepSeek-VL2 at 4.5B active sits below the 7B dense competitors on activated parameters but at or above them on most document benchmarks.^[5]^[13] Second, where Qwen2-VL embeds native dynamic resolution inside the ViT itself and InternVL2 uses pixel-unshuffle to ingest higher resolutions through a fixed encoder, DeepSeek-VL2 keeps the SigLIP encoder at its native 384 by 384 resolution and instead handles arbitrary aspect ratios at the tiling level. This trades extra forward passes (one per tile) for the ability to reuse a strong off-the-shelf encoder without modification, an approach also used by LLaVA-OneVision's "Higher AnyRes" scheme on the same SigLIP backbone.^[5]^[16]

Against its own predecessor DeepSeek-VL, the gap is substantial. The dense 7B DeepSeek-VL pre-dates the dynamic-tiling and MoE eras and operates on a fixed 1024 by 1024 hybrid encoder; DeepSeek-VL2 reports large gains on document and OCR benchmarks at fewer activated parameters and adds new capabilities such as visual grounding tokens, multi-image conversations, and explicit grounded captioning.^[5]^[6]

DeepSeek itself released a contemporaneous "unified understanding and generation" multimodal line called Janus in October 2024 and Janus-Pro in January 2025; these are decoder-only models that produce both text and images, and are largely orthogonal to DeepSeek-VL2's strictly understanding-focused design.^[17]

What is DeepSeek-VL2 used for?

DeepSeek-VL2's training mix and benchmark profile point to a small set of applications that the model is explicitly tuned for. Document understanding and OCR are the strongest cluster: financial filings, scientific papers, and tabular data benefit from the dynamic tiling, with DocVQA, OCRBench, ChartQA, and TextVQA all serving as proxy benchmarks for these tasks.^[5]^[11] Visual grounding through the <|ref|> and <|det|> token vocabulary supports downstream pipelines that need bounding boxes for objects referenced in natural language, including grounded captioning and weakly-supervised detection workflows.^[2]^[9]

Other reported uses include visual question answering, table-to-text and chart-to-text conversion, plot-to-Python translation, web-page-to-HTML reconstruction (web-to-code), and instruction-following in mixed text and image contexts.^[5]^[11] Because both the model code (MIT) and the weights (DeepSeek Model License) permit commercial use, the family has been picked up by inference providers and downstream toolchains.^[2]^[3] DeepSeek-VL2 was also among the open-weights VLMs evaluated by document-AI surveys and OCR benchmarks during 2025, often appearing in head-to-head comparisons with Qwen2.5-VL and InternVL-class models.^[18]

A separate strand of applications exploits the dynamic-tiling capability for non-standard aspect ratios. Tall scrolling screenshots and wide panoramic figures, which truncate or pad heavily on fixed-resolution VLMs, fit naturally into DeepSeek-VL2's m by n tile grid. The InfoVQA evaluation in the paper, which uses a relaxed cap of m times n less than or equal to eighteen tiles, demonstrates that the system can run on inputs with extreme aspect ratios when the document content justifies the additional compute.^[5] Inference providers including SiliconFlow expose the full DeepSeek-VL2 model behind an OpenAI-compatible API, while DeepWiki and Roboflow have published practical guides covering grounded captioning, table extraction, and OCR pipelines.^[9]^[18]

What are DeepSeek-VL2's limitations?

The paper, model cards, and downstream documentation note several explicit limitations.^[3]^[5]

Context length. All three variants are limited to a 4096-token context window, which constrains very long documents or many-image sessions despite the visual-token compression.^[3]
Multi-image fallback. Dynamic tiling is only applied when at most two images are present in the input. For three or more images, each image is padded to a single 384 by 384 tile, which limits OCR fidelity in long multi-image conversations.^[3]^[4]
MoE training instability. The Tiny and Small variants use softmax routing; the full 27B-total model switches to sigmoid routing with an auxiliary expert bias term, reflecting the load-balancing challenges that DeepSeekMoE addresses in the V2/V3 line.^[5]
Coverage gaps versus dense peers. On purely language-heavy multimodal reasoning benchmarks (MMBench-V1.1, MME) the full model trails dense 7B-8B systems by a small margin, with the strongest comparative advantage concentrated on OCR, chart, and document tasks.^[5]^[13]
Hallucinations on long documents. Independent reviews note that, while OCR quality is strong, the model can still hallucinate on long-form document QA, especially when the relevant content sits near the edge of the candidate-resolution grid.^[11]
License terms. The model weights are released under the DeepSeek Model License rather than a fully open license. It permits commercial use but imposes use-case restrictions that do not apply to the MIT-licensed code.^[2]

Why does DeepSeek-VL2 matter?

DeepSeek-VL2's importance in the open-weights multimodal landscape comes from three contributions identified in the paper and corroborated by downstream coverage.^[1]^[5]^[11] First, it is one of the earliest large-scale Mixture-of-Experts vision-language models with publicly released weights, alongside Aria-MoE and MM1.5, demonstrating that MoE pretraining transfers cleanly to a multimodal target. Second, the combination of a fixed SigLIP encoder with a dynamic tiling pipeline yields strong scaling to high-resolution OCR and chart tasks without requiring a custom vision transformer, an approach shared with the earlier LLaVA-OneVision and later echoed by several smaller VLMs. Third, by carrying Multi-head Latent Attention from DeepSeek-V2 into a multimodal setting, DeepSeek-VL2 provides a working example of MLA in production with both pure-text and image-text decoding paths active, foreshadowing the MLA-driven KV-cache efficiency of DeepSeek-V3 and subsequent DeepSeek releases.^[7]^[19]

In aggregate, DeepSeek-VL2 helped to consolidate the late-2024 open-weights multimodal stack around a recognisable recipe: SigLIP-SO400M-384 visual front end, dynamic tiling for resolution, MLA for efficient decoding, mixture-of-experts language backbone, and a three-stage alignment-pretraining-fine-tuning training pipeline.^[5]^[11] Its release sat within DeepSeek's broader push that month, immediately preceding the December 26, 2024 release of DeepSeek-V3 and the January 2025 release of DeepSeek-R1, which together established DeepSeek as a high-profile open-weights research lab in the run-up to 2025.^[19]

How was DeepSeek-VL2 received?

Public reception of DeepSeek-VL2 mixed strong attention to the OCR and document benchmark numbers with caveats around general multimodal reasoning. The Hugging Face paper page and the GitHub repository accumulated heavy engagement in the months after release, with the official repository tracked at over five thousand stars and roughly eighteen hundred forks by early 2026.^[2] Coverage in third-party blogs frequently flagged the OCRBench 834 score as the headline number; the Startup Fortune writeup highlighted that DeepSeek-VL2 was the strongest open-weights VLM on OCRBench at the time of release and pointed to the gap to GPT-4o (834 versus 736) as evidence that open MoE models were closing the multimodal gap with proprietary systems.^[9]

Technical reviews focused on the dynamic tiling pipeline and the MLA-plus-MoE backbone. The Moonlight literature review noted that the model "achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models", but also observed that the gains over Qwen2-VL-7B and InternVL2-8B on general MMBench-style benchmarks were modest.^[12] The Zilliz blog and a paper-summary post by Aakash Kumar Nain documented the architecture in detail, including the role of the global thumbnail and the special-token grammar for grounding.^[8]^[11] DeepWiki maintains a structured project wiki for the repository, which is widely cited by downstream developers building OCR and document-AI applications on DeepSeek-VL2.^[18]

Within DeepSeek's own roadmap, the VL2 release marked the visible inflection point at which the lab adopted a uniform MLA-plus-MoE recipe across its product lines. DeepSeek-V3, released about two weeks later on December 26, 2024, used the same MLA mechanism and a much larger expert pool to scale into the open-weights frontier for text generation.^[19] The Janus-Pro understanding-and-generation system followed in January 2025 and DeepSeek-OCR (a smaller, OCR-focused model line) later in 2025, reportedly reusing many of the data-processing pipelines documented in the VL2 paper.^[17]^[18]

ELI5: What is DeepSeek-VL2 in simple terms?

Imagine an AI that can both read and look at pictures at the same time. DeepSeek-VL2 is that kind of AI: you can show it a photo of a receipt, a graph, or a page of a PDF and ask a question about it, and it answers in words. It is good at reading text inside images (OCR), understanding charts and tables, and even pointing to where something is in a picture by drawing a box around it.

The clever trick is that it does not look at a big picture all at once. It cuts the picture into smaller square tiles, looks at each tile closely, and also keeps one small "thumbnail" of the whole thing so it does not lose the big picture. That is the "dynamic tiling" part. It is also a "mixture of experts" model, which means it has many small specialist sub-networks but only switches on a few of them for each word, so it stays fast even though the whole model is large. It comes in three sizes (Tiny, Small, and the full model), and anyone can download and use it.

References

1. Wu, Zhiyu et al., "DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding", arXiv, 2024-12-13. https://arxiv.org/abs/2412.10302. Accessed 2026-06-25.
2. DeepSeek-AI, "DeepSeek-VL2 GitHub repository", GitHub, 2024-12-13. https://github.com/deepseek-ai/DeepSeek-VL2. Accessed 2026-06-25.
3. DeepSeek-AI, "deepseek-ai/deepseek-vl2 model card", Hugging Face, 2024-12-13. https://huggingface.co/deepseek-ai/deepseek-vl2. Accessed 2026-06-25.
4. DeepSeek-AI, "deepseek-ai/deepseek-vl2-tiny model card", Hugging Face, 2024-12-13. https://huggingface.co/deepseek-ai/deepseek-vl2-tiny. Accessed 2026-06-25.
5. Wu, Zhiyu et al., "DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding (HTML)", arXiv, 2024-12-13. https://arxiv.org/html/2412.10302v1. Accessed 2026-06-25.
6. Lu, Haoyu et al., "DeepSeek-VL: Towards Real-World Vision-Language Understanding", arXiv, 2024-03-08. https://arxiv.org/abs/2403.05525. Accessed 2026-06-25.
7. DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model", arXiv, 2024-05-07. https://arxiv.org/abs/2405.04434. Accessed 2026-06-25.
8. Nain, Aakash Kumar, "Paper summary: DeepSeek-VL2", aakashkumarnain.github.io, 2025-01-05. https://aakashkumarnain.github.io/posts/paper_summaries/deepseek_vl2.html. Accessed 2026-06-25.
9. Startup Fortune, "DeepSeek VL2 crushes OCRBench with 834 score, setting new open-source multimodal standard", Startup Fortune, 2025-01-15. https://startupfortune.com/deepseek-vl2-crushes-ocrbench-with-834-score-setting-new-open-source-multimodal-standard/. Accessed 2026-06-25.
10. SGLang project, "Support Deepseek-vl2-tiny model, in which mla is disabled (Issue #5537)", GitHub, 2025-04-23. https://github.com/sgl-project/sglang/issues/5537. Accessed 2026-06-25.
11. Zilliz Learn, "DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding", Zilliz blog, 2025-01-09. https://zilliz.com/blog/deepseek-vl2-mixture-of-experts-vision-language-models-for-advanced-multimodal-understanding. Accessed 2026-06-25.
12. Moonlight Literature Review, "DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding", themoonlight.io, 2025-01-08. https://www.themoonlight.io/en/review/deepseek-vl2-mixture-of-experts-vision-language-models-for-advanced-multimodal-understanding. Accessed 2026-06-25.
13. LLM-Stats, "DeepSeek-VL2 Benchmarks, Pricing & Context Window", llm-stats.com, 2025-03-15. https://llm-stats.com/models/deepseek-vl2. Accessed 2026-06-25.
14. DeepSeek-AI, "deepseek-ai/deepseek-vl2-small model card", Hugging Face, 2024-12-13. https://huggingface.co/deepseek-ai/deepseek-vl2-small. Accessed 2026-06-25.
15. NodeShift, "Mastering DeepSeek: Installing Tiny, Small, and VL2 Models with Inference and a Gradio Interface", DEV Community, 2025-02-04. https://dev.to/nodeshiftcloud/mastering-deepseek-installing-tiny-small-and-vl2-models-with-inference-and-a-gradio-interface-4d2m. Accessed 2026-06-25.
16. Wang, Peng et al., "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution", arXiv, 2024-09-18. https://arxiv.org/abs/2409.12191. Accessed 2026-06-25.
17. DeepSeek-AI, "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling", GitHub, 2025-01-27. https://github.com/deepseek-ai/Janus. Accessed 2026-06-25.
18. Roboflow, "DeepSeek Vision Models: Janus, VL2, and OCR", Roboflow blog, 2025-02-12. https://blog.roboflow.com/deepseek-vision-models/. Accessed 2026-06-25.
19. DeepSeek-AI, "DeepSeek-V3 Technical Report", arXiv, 2024-12-26. https://arxiv.org/abs/2412.19437. Accessed 2026-06-25.
