DeepSeek-VL
Last reviewed
Jun 3, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,338 words
DeepSeek-VL is the first open-source vision-language model series from DeepSeek, the Chinese AI company. It was released on 11 March 2024 in 1.3B and 7B sizes, each shipping as a base model and a chat-tuned variant. The accompanying paper, "DeepSeek-VL: Towards Real-World Vision-Language Understanding" (arXiv:2403.05525), was submitted on 8 March 2024 and revised on 11 March; its author list is headed by Haoyu Lu, followed by a team of DeepSeek researchers [1]. The series extends the company's text-only DeepSeek LLM into a vision-language model by pairing the language backbone with a hybrid vision encoder and a vision-language adaptor.
The project's stated goal was practical: rather than maximizing a single academic benchmark, DeepSeek-VL aimed for a model that handled everyday inputs such as web screenshots, PDFs, charts, diagrams, images containing text that requires optical character recognition (OCR), and natural photos, while keeping the underlying language ability of the LLM intact [1]. The Mixture-of-Experts successor DeepSeek-VL2 followed in December 2024. The Janus line, which both understands and generates images, is a related but architecturally distinct effort.
By early 2024 DeepSeek had released its DeepSeek LLM text models and was building out a broader model lineup. DeepSeek-VL was the company's entry into open multimodal models, competing with contemporaries such as LLaVA, Qwen-VL, and InternVL. The paper frames three priorities that shaped the design: a data pipeline broad enough to cover real-world use cases, a vision encoder that could read high-resolution images without excessive compute, and a pretraining recipe that did not degrade text performance [1]. The last priority responded to a recurring problem for vision-language models of the era: bolting a vision module onto an LLM and fine-tuning on image-text data tended to erode the model's language skills.
The defining architectural choice in DeepSeek-VL is a hybrid vision encoder that runs two separate pretrained encoders in parallel and fuses their outputs. One branch is a SigLIP-L encoder operating at 384 x 384 resolution, which captures coarse, text-aligned semantic content. The other is an encoder based on SAM-B, the base image encoder from Meta's Segment Anything Model, operating at 1024 x 1024 to capture fine-grained detail such as small text, document layout, and dense chart elements [1][2].
The two streams are reshaped to a common format before fusion. The SigLIP-L branch turns a 384 x 384 image into a 24 x 24 x 1024 feature map, then flattens it to 576 x 1024. The SAM-B branch turns a 1024 x 1024 image into a 64 x 64 x 256 feature map, interpolates it to 96 x 96 x 256, applies two convolutional layers with stride 2 to reach 24 x 24 x 1024, and reshapes that to 576 x 1024 [1]. The two 576 x 1024 maps are concatenated along the feature dimension to give 576 visual tokens of dimension 2048. A two-layer hybrid multilayer perceptron (MLP) vision-language adaptor then projects these 576 tokens into the LLM's embedding space, where they sit alongside the text tokens [1][2].
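The shape bookkeeping above can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only, not the released implementation. Two values are assumptions that the text does not state: the patch size of 16 is inferred from 384 / 24, and the stride-2 convolutions are given kernel size 3 with padding 1 because that choice halves the spatial size at each step. All function names are illustrative.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def siglip_tokens(image_side: int = 384, patch: int = 16, channels: int = 1024) -> tuple[int, int]:
    """Low-resolution branch: patchify, then flatten the grid into a token list."""
    grid = image_side // patch  # 384 // 16 = 24 (patch size is inferred, not stated)
    return grid * grid, channels


def sam_tokens(upsampled_side: int = 96, channels: int = 1024) -> tuple[int, int]:
    """High-resolution branch, starting after the 64x64x256 map is interpolated to 96x96x256."""
    side = conv_out(upsampled_side)  # first stride-2 convolution: 96 -> 48
    side = conv_out(side)            # second stride-2 convolution: 48 -> 24
    return side * side, channels     # the text gives 1024 output channels


def fuse(low: tuple[int, int], high: tuple[int, int]) -> tuple[int, int]:
    """Concatenate along the feature dimension; token counts must match."""
    if low[0] != high[0]:
        raise ValueError("branches must produce the same number of tokens")
    return low[0], low[1] + high[1]


visual_tokens = fuse(siglip_tokens(), sam_tokens())
print(visual_tokens)  # (576, 2048)
```

Both branches land on a 24 x 24 grid, which is what makes concatenation along the feature dimension possible without any further resampling.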
This fixed budget of 576 visual tokens per image is a deliberate efficiency trade-off. It keeps the sequence length manageable while still supplying high-resolution detail through the SAM-B branch. The fixed 1024 x 1024 input resolution later became the main limitation that the successor model addressed.
DeepSeek-VL takes its language backbone from the DeepSeek LLM family, a set of dense decoder-only transformers in the LLaMA architectural style. The 1.3B vision model is built on a roughly 1B-parameter DeepSeek language model pretrained on about 500 billion text tokens, and the 7B vision model is built on the 7B DeepSeek LLM pretrained on roughly 2 trillion text tokens [1]. Both support a 4096-token context length [2]. Because the language weights start from a fully trained DeepSeek LLM, preserving that capability through multimodal training was an explicit design constraint rather than an afterthought.
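Combining the 4096-token context with the fixed 576 visual tokens per image gives a rough sequence budget. The sketch below is back-of-the-envelope arithmetic under a simplifying assumption: it ignores special tokens, image placeholder markers, and chat-template overhead, none of which are described here. The function names are hypothetical.

```python
CONTEXT_LENGTH = 4096    # context window stated for both model sizes
TOKENS_PER_IMAGE = 576   # fixed visual token count per image


def max_images(context_length: int = CONTEXT_LENGTH,
               tokens_per_image: int = TOKENS_PER_IMAGE) -> int:
    """Upper bound on images that fit if no text tokens were present."""
    return context_length // tokens_per_image


def text_budget(num_images: int,
                context_length: int = CONTEXT_LENGTH,
                tokens_per_image: int = TOKENS_PER_IMAGE) -> int:
    """Tokens left for text after reserving space for the images."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    visual = num_images * tokens_per_image
    if visual > context_length:
        raise ValueError("images alone exceed the context length")
    return context_length - visual


print(text_budget(1))  # 3520
print(max_images())    # 7
```

The upper bound is purely arithmetic; how well a given checkpoint handles several images in one prompt is a separate question that this calculation does not answer.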
The team emphasized data construction as much as architecture. The pretraining corpus draws from web image-text pairs, web screenshots, PDFs and rendered documents, OCR data, charts and tables, and knowledge-grounded content, deduplicated and filtered to cover realistic scenarios [1]. The supervised fine-tuning mix adds instruction-following and conversational vision-language data.
Training proceeds in three stages [1]:
| Stage | What is trained | Data and scale |
|---|---|---|
| 1. Adaptor warmup | Vision-language adaptor only; vision encoder and LLM frozen | About 1.25 million image-text pairs (drawn from ShareGPT4V) plus about 2.5 million document OCR rendering pairs |
| 2. Joint vision-language pretraining | Adaptor and LLM trainable; vision encoder frozen | Mix of about 70% language-only and 30% vision-language data; the 7B run uses 42,000 steps at batch size 2,304, and the 1.3B run uses 96,000 steps at batch size 1,024 |
| 3. Supervised fine-tuning | SigLIP-L encoder, adaptor, and LLM jointly tuned | About 10,000 steps at batch size 256 |
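The freeze schedule and the scale of each stage in the table can be written down as a small data structure. This is an illustrative sketch, not the project's training configuration: the module labels and the `Stage` class are hypothetical, SAM-B is treated as frozen in stage 3 because the table lists only SigLIP-L among the tuned encoders, and "samples seen" is simply steps multiplied by batch size.

```python
from dataclasses import dataclass
from typing import Optional

MODULES = frozenset({"siglip_l", "sam_b", "adaptor", "llm"})  # hypothetical labels


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset
    steps: Optional[int] = None       # None where the table gives no step count
    batch_size: Optional[int] = None

    @property
    def frozen_modules(self) -> frozenset:
        return MODULES - self.trainable

    @property
    def samples_seen(self) -> Optional[int]:
        if self.steps is None or self.batch_size is None:
            return None
        return self.steps * self.batch_size


STAGES = [
    Stage("1. adaptor warmup", frozenset({"adaptor"})),
    Stage("2. joint pretraining (7B)", frozenset({"adaptor", "llm"}), 42_000, 2_304),
    Stage("2. joint pretraining (smaller model)", frozenset({"adaptor", "llm"}), 96_000, 1_024),
    Stage("3. supervised fine-tuning", frozenset({"siglip_l", "adaptor", "llm"}), 10_000, 256),
]

for stage in STAGES:
    print(stage.name, sorted(stage.frozen_modules), stage.samples_seen)
```

Under these figures the two stage 2 runs process a similar number of samples (about 96.8 million and 98.3 million), while stage 3 sees about 2.56 million.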
A central finding is that the modality ratio in stage 2 matters. Mixing a large share of text-only data with the vision-language data, together with a warm-up schedule that gradually increases the multimodal proportion, lets the model acquire visual grounding while retaining its language performance [1]. The supervised fine-tuning stage uses a weighted loss that balances the language-modeling and vision-language objectives.
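One way to picture the warm-up is as a sampler whose vision-language share rises from zero to the 30% target. The sketch below is a toy illustration under stated assumptions: the linear ramp, the `warmup_steps` parameter, and both function names are hypothetical, since the text gives only the target ratio and the fact that the multimodal share increases gradually.

```python
import random


def multimodal_fraction(step: int, warmup_steps: int, target: float = 0.3) -> float:
    """Share of vision-language samples at a given step (linear ramp, assumed)."""
    if warmup_steps <= 0:
        return target
    progress = min(max(step / warmup_steps, 0.0), 1.0)
    return target * progress


def sample_modality(step: int, warmup_steps: int, rng: random.Random,
                    target: float = 0.3) -> str:
    """Pick the data source for one training sample."""
    if rng.random() < multimodal_fraction(step, warmup_steps, target):
        return "vision-language"
    return "language"


rng = random.Random(0)
draws = [sample_modality(step=5_000, warmup_steps=2_000, rng=rng) for _ in range(10_000)]
observed = draws.count("vision-language") / len(draws)
print(observed)  # close to 0.3 once the warm-up is over
```

At step 0 every sample is language-only, so the LLM starts from the distribution it was pretrained on and is exposed to image-text data progressively.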
DeepSeek-VL was published in two sizes, each with a base checkpoint for further training and a chat checkpoint tuned for instruction following [2].
| Variant | Vision encoder | Language backbone | Context | Hugging Face identifiers |
|---|---|---|---|---|
| DeepSeek-VL 1.3B | SigLIP-L (384) + SAM-B (1024) hybrid | DeepSeek LLM ~1B (~500B tokens) | 4096 | deepseek-ai/deepseek-vl-1.3b-base, deepseek-ai/deepseek-vl-1.3b-chat |
| DeepSeek-VL 7B | SigLIP-L (384) + SAM-B (1024) hybrid | DeepSeek LLM 7B (~2T tokens) | 4096 | deepseek-ai/deepseek-vl-7b-base, deepseek-ai/deepseek-vl-7b-chat |
The paper compares DeepSeek-VL with open multimodal models of the same scale across general multimodal understanding, science, and math benchmarks. Selected scores for the chat models are below [1].
| Benchmark | DeepSeek-VL 1.3B chat | DeepSeek-VL 7B chat |
|---|---|---|
| MMBench (dev) | 64.6 | 73.2 |
| MMBench-CN (dev) | 61.3 | 72.8 |
| SEEDBench | 66.7 | 70.4 |
| MMMU (val) | 32.2 | 36.6 |
| MathVista | 31.1 | 36.1 |
| MM-Vet | 34.8 | 41.5 |
| POPE | 87.6 | 88.1 |
The authors position the 7B chat model as state-of-the-art or competitive among similarly sized open models on these vision-language tasks. The paper separately reports that text-only performance stays close to that of the underlying DeepSeek LLM, which it attributes to the modality-mixing pretraining recipe [1]. The scores on POPE, an object-hallucination benchmark, indicate relatively low hallucination for both sizes.
The code in the official deepseek-ai/DeepSeek-VL repository is released under the MIT License, and the model weights are governed by the separate DeepSeek Model License, which permits commercial use subject to the license's use restrictions [2]. This open and commercially permissive stance was consistent with DeepSeek's broader release strategy and made the models widely available on Hugging Face.
The direct successor is DeepSeek-VL2 (arXiv:2412.10302), released in December 2024. It replaces the dense backbone with a DeepSeekMoE Mixture-of-Experts language model using Multi-head Latent Attention. It also swaps the fixed 1024 x 1024 hybrid encoder for a dynamic tiling scheme that splits an image into tiles to support varying aspect ratios and higher effective resolution. DeepSeek-VL2 ships as Tiny, Small, and base variants with roughly 1.0B, 2.8B, and 4.5B activated parameters respectively [3].
Separately, DeepSeek's Janus and Janus-Pro models pursue a unified architecture that both understands and generates images, decoupling the visual encoding paths for the two tasks. Janus is a parallel line of work rather than a continuation of the DeepSeek-VL understanding-only design.