DeepSeek-VL

Updated Jul 16, 2026

DeepSeek-VL is the first open-source vision-language model series from DeepSeek, the Chinese AI company. It was released on 11 March 2024 in 1.3B and 7B sizes, each shipping as a base model and a chat-tuned variant. The accompanying paper, "DeepSeek-VL: Towards Real-World Vision-Language Understanding" (arXiv:2403.05525), was submitted on 8 March 2024 and revised on 11 March; its first author is Haoyu Lu, writing with a team of DeepSeek researchers ^[1]. The series extends the company's text-only DeepSeek LLM into a vision-language model by pairing the language backbone with a hybrid vision encoder and a vision-language adaptor.

The project's stated goal was practical: rather than maximizing a single academic benchmark, DeepSeek-VL aimed for a model that handled everyday inputs such as web screenshots, PDFs, charts, diagrams, optical character recognition (OCR), and natural photos, while keeping the underlying language ability of the LLM intact ^[1]. The Mixture-of-Experts successor DeepSeek-VL2 followed in December 2024, and the autoregressive image-generation line Janus is a related but architecturally distinct effort.

Background

By early 2024 DeepSeek had released its DeepSeek LLM text models and was building out a broader model lineup. DeepSeek-VL was the company's entry into open multimodal models, competing with contemporaries such as LLaVA, Qwen-VL, and InternVL. The paper frames three priorities that shaped the design: a data pipeline broad enough to cover real-world use cases, a vision encoder that could read high-resolution images without excessive compute, and a pretraining recipe that did not degrade text performance ^[1]. The last point was a recurring problem for vision-language models of the era, where bolting a vision module onto an LLM and fine-tuning on image-text data tended to erode the model's language skills.

Hybrid vision encoder

The defining architectural choice in DeepSeek-VL is a hybrid vision encoder that runs two separate pretrained encoders in parallel and fuses their outputs. One branch is a SigLIP-L encoder operating at 384 x 384 resolution, which captures coarse, text-aligned semantic content. The other is an encoder based on SAM-B, the base image encoder from Meta's Segment Anything Model, operating at 1024 x 1024 to capture fine-grained detail such as small text, document layout, and dense chart elements ^[1]^[2].

The two streams are reshaped to a common format before fusion. The SigLIP-L branch turns a 384 x 384 image into a 24 x 24 x 1024 feature map, then flattens it to 576 x 1024. The SAM-B branch turns a 1024 x 1024 image into a 64 x 64 x 256 feature map, interpolates it to 96 x 96 x 256, applies two convolutional layers with stride 2 to reach 24 x 24 x 1024, and reshapes that to 576 x 1024 ^[1]. The two 576 x 1024 maps are concatenated along the feature dimension to give 576 visual tokens of dimension 2048. A two-layer hybrid multilayer perceptron (MLP) vision-language adaptor then projects these 576 tokens into the LLM's embedding space, where they sit alongside the text tokens ^[1]^[2].
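
The shape bookkeeping described above can be traced in a few lines. This is a minimal sketch that only tracks tensor shapes and runs no real encoder. The function names are hypothetical, and the convolution kernel size and padding (3 and 1) are assumptions chosen so that each stride-2 layer halves the spatial size; the text above states only the stride and the resulting shapes.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Spatial size after one convolution (kernel and padding are assumed values)."""
    return (size + 2 * padding - kernel) // stride + 1


def siglip_branch_shape():
    """384 x 384 image -> 24 x 24 x 1024 feature map -> 576 x 1024 tokens."""
    height, width, channels = 24, 24, 1024
    return (height * width, channels)


def sam_branch_shape():
    """1024 x 1024 image -> 64 x 64 x 256 -> 96 x 96 x 256 -> 24 x 24 x 1024."""
    side, channels = 64, 256          # SAM-B feature map
    side = 96                         # interpolation step stated in the text
    side = conv_out(conv_out(side))   # two stride-2 convolutions: 96 -> 48 -> 24
    channels = 1024                   # channel width after the convolutions
    return (side * side, channels)


def fuse(a, b):
    """Concatenate two token maps along the feature dimension."""
    assert a[0] == b[0], "both branches must yield the same token count"
    return (a[0], a[1] + b[1])


fused_shape = fuse(siglip_branch_shape(), sam_branch_shape())
print(fused_shape)  # (576, 2048)
```

Because both branches are brought to the same 24 x 24 grid before flattening, concatenation doubles the feature width without adding tokens, which is why the count stays at 576.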

This fixed budget of 576 visual tokens per image is a deliberate efficiency trade-off. It keeps the sequence length manageable while still supplying high-resolution detail through the SAM-B branch. The fixed 1024 x 1024 input resolution later became the main limitation that the successor model addressed.

Language backbone

DeepSeek-VL reuses the DeepSeek LLM family as its language model, which is a dense decoder-only transformer in the LLaMA architectural style. The 1.3B vision model is built on a roughly 1B-parameter DeepSeek language model pretrained on about 500 billion text tokens, and the 7B vision model is built on the 7B DeepSeek LLM pretrained on roughly 2 trillion text tokens ^[1]. Both support a 4096-token context length ^[2]. Because the language weights start from a fully trained DeepSeek LLM, preserving that capability through multimodal training was an explicit design constraint rather than an afterthought.
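
Combining the 576-token image budget with the 4096-token context gives a simple sequence-length calculation. The following is a rough sketch, assuming every image costs exactly 576 positions and ignoring any special or separator tokens a real tokenizer may add; the helper name is hypothetical.

```python
CONTEXT_LENGTH = 4096           # context length stated for both model sizes
VISUAL_TOKENS_PER_IMAGE = 576   # fixed visual token budget per image


def text_tokens_left(num_images, context_length=CONTEXT_LENGTH):
    """Positions remaining for text after reserving room for the images."""
    used = num_images * VISUAL_TOKENS_PER_IMAGE
    if used > context_length:
        raise ValueError(f"{num_images} images need {used} tokens, over the context limit")
    return context_length - used


for n in (1, 2, 3):
    print(n, text_tokens_left(n))
# 1 3520
# 2 2944
# 3 2368
```

Under these assumptions, at most seven images fit in one context, leaving 64 positions for text.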

Data and training

The team emphasized data construction as much as architecture. The pretraining corpus draws from web image-text pairs, web screenshots, PDFs and rendered documents, OCR data, charts and tables, and knowledge-grounded content, deduplicated and filtered to cover realistic scenarios ^[1]. The supervised fine-tuning mix adds instruction-following and conversational vision-language data.

Training proceeds in three stages ^[1]:

Stage	What is trained	Data and scale
1. Adaptor warmup	Vision-language adaptor only; vision encoder and LLM frozen	About 1.25 million image-text pairs (drawn from ShareGPT4V) plus about 2.5 million document OCR rendering pairs
2. Joint vision-language pretraining	Adaptor and LLM trainable; vision encoder frozen	Mixed 70% language-only and 30% vision-language data; the 7B run uses 42,000 steps at batch size 2,304, and the 1B run uses 96,000 steps at batch size 1,024
3. Supervised fine-tuning	SigLIP-L encoder, adaptor, and LLM jointly tuned	About 10,000 steps at batch size 256
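
The freeze schedule in the table can be written down as a small lookup, together with the sample counts implied by steps times batch size. This is an illustrative sketch only: the module names are hypothetical labels rather than identifiers from the official code, and treating SAM-B as frozen in stage 3 is an inference from the table, which lists only SigLIP-L as tuned.

```python
# True means the module receives gradient updates in that stage.
STAGES = {
    "adaptor_warmup": {"siglip_l": False, "sam_b": False, "adaptor": True, "llm": False},
    "joint_pretraining": {"siglip_l": False, "sam_b": False, "adaptor": True, "llm": True},
    # SAM-B frozen here is inferred, not stated explicitly in the table.
    "supervised_finetuning": {"siglip_l": True, "sam_b": False, "adaptor": True, "llm": True},
}


def trainable_modules(stage):
    """Names of the modules that are trained in the given stage."""
    return sorted(name for name, on in STAGES[stage].items() if on)


def samples_seen(steps, batch_size):
    """Total training samples processed, counting each batch element once."""
    return steps * batch_size


for stage in STAGES:
    print(stage, trainable_modules(stage))

print(samples_seen(42_000, 2_304))  # 7B stage 2: 96768000
print(samples_seen(96_000, 1_024))  # 1B stage 2: 98304000
print(samples_seen(10_000, 256))    # stage 3: 2560000
```

In a PyTorch training loop, a mapping like this would typically drive `requires_grad_` calls on the corresponding submodules. The arithmetic also shows that the two stage 2 runs process a similar number of samples despite their different step counts.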

A central finding is that the modality ratio in stage 2 matters. Mixing a large share of text-only data with the vision-language data, together with a warm-up schedule that gradually increases the multimodal proportion, lets the model acquire visual grounding while retaining its language performance ^[1]. The supervised fine-tuning stage uses a weighted loss that balances the language-modeling and vision-language objectives.
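
One way to picture the modality warm-up is as a schedule on the multimodal share of each batch. The sketch below is hypothetical: the article states only that the multimodal proportion increases gradually and that the stage 2 mix is 70% language-only to 30% vision-language, so the linear ramp, the starting value of zero, and the warm-up length are all assumptions made for illustration.

```python
def multimodal_fraction(step, warmup_steps, target=0.3):
    """Share of vision-language samples at a given step (assumed linear ramp)."""
    if warmup_steps <= 0:
        return target
    return target * min(step / warmup_steps, 1.0)


def split_batch(batch_size, step, warmup_steps, target=0.3):
    """Return (text_only_count, multimodal_count) for one batch."""
    n_multimodal = round(batch_size * multimodal_fraction(step, warmup_steps, target))
    return batch_size - n_multimodal, n_multimodal


# Hypothetical warm-up of 1,000 steps at the 1B run's batch size of 1,024.
for step in (0, 500, 1000, 2000):
    print(step, split_batch(1024, step, warmup_steps=1000))
```

Early batches are dominated by text, which matches the stated aim of protecting language ability while visual grounding is introduced.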

Model sizes

DeepSeek-VL was published in two sizes, each with a base checkpoint for further training and a chat checkpoint tuned for instruction following ^[2].

Variant	Vision encoder	Language backbone	Context	Hugging Face identifiers
DeepSeek-VL 1.3B	SigLIP-L (384) + SAM-B (1024) hybrid	DeepSeek LLM ~1B (~500B tokens)	4096	deepseek-ai/deepseek-vl-1.3b-base, deepseek-ai/deepseek-vl-1.3b-chat
DeepSeek-VL 7B	SigLIP-L (384) + SAM-B (1024) hybrid	DeepSeek LLM 7B (~2T tokens)	4096	deepseek-ai/deepseek-vl-7b-base, deepseek-ai/deepseek-vl-7b-chat
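
The four identifiers in the table follow one naming pattern, which a small helper can reproduce. The function name is hypothetical, and the sketch only builds strings; it does not download or load any model.

```python
SIZES = ("1.3b", "7b")
VARIANTS = ("base", "chat")


def checkpoint_id(size, variant):
    """Build the Hugging Face identifier for a DeepSeek-VL checkpoint."""
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return f"deepseek-ai/deepseek-vl-{size}-{variant}"


all_ids = [checkpoint_id(s, v) for s in SIZES for v in VARIANTS]
for model_id in all_ids:
    print(model_id)
```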

Benchmarks

The paper reports DeepSeek-VL against open multimodal models of the same scale across general multimodal understanding, science, and math benchmarks. Selected scores for the chat models are below ^[1].

Benchmark	DeepSeek-VL 1.3B chat	DeepSeek-VL 7B chat
MMBench (dev)	64.6	73.2
MMBench-CN (dev)	61.3	72.8
SEEDBench	66.7	70.4
MMMU (val)	32.2	36.6
MathVista	31.1	36.1
MM-Vet	34.8	41.5
POPE	87.6	88.1

The authors position the 7B chat model as state-of-the-art or competitive among similarly sized open models on these vision-language tasks. The paper separately reports that text-only performance stays close to that of the underlying DeepSeek LLM, which it attributes to the modality-mixing pretraining recipe ^[1]. The scores on POPE, an object-hallucination benchmark, indicate relatively low hallucination for both sizes.

Licensing

The code in the official deepseek-ai/DeepSeek-VL repository is released under the MIT License, and the model weights are governed by the separate DeepSeek Model License, which permits commercial use subject to the license's use restrictions ^[2]. This open and commercially permissive stance was consistent with DeepSeek's broader release strategy and made the models widely available on Hugging Face.

Successors and related models

The direct successor is DeepSeek-VL2 (arXiv:2412.10302), released in December 2024, which replaces the dense backbone with a DeepSeekMoE Mixture-of-Experts language model using Multi-head Latent Attention, and swaps the fixed 1024 x 1024 hybrid encoder for a dynamic tiling scheme that splits an image into tiles to support varying aspect ratios and higher effective resolution. DeepSeek-VL2 ships as Tiny, Small, and base variants with roughly 1.0B, 2.8B, and 4.5B activated parameters respectively ^[3].

Separately, DeepSeek's Janus and Janus-Pro models pursue a unified architecture that both understands and generates images, decoupling the visual encoding paths for the two tasks. Janus is a parallel line of work rather than a continuation of the DeepSeek-VL understanding-only design.

References

1. Lu, Haoyu et al. "DeepSeek-VL: Towards Real-World Vision-Language Understanding." arXiv:2403.05525. https://arxiv.org/abs/2403.05525
2. DeepSeek-VL official repository (README, model list, and license). https://github.com/deepseek-ai/DeepSeek-VL
3. Wu, Zhiyu et al. "DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding." arXiv:2412.10302. https://arxiv.org/abs/2412.10302
