# DeepSeek-VL

> Source: https://aiwiki.ai/wiki/deepseek_vl
> Updated: 2026-07-16
> Categories: AI Models, Chinese AI, Multimodal AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

DeepSeek-VL is the first open-source vision-language model series from [DeepSeek](/wiki/deepseek), the Chinese AI company. It was released on 11 March 2024 in 1.3B and 7B sizes, each shipping as a base model and a chat-tuned variant. The accompanying paper, "DeepSeek-VL: Towards Real-World Vision-Language Understanding" (arXiv:2403.05525), was submitted on 8 March 2024, with a revised version on 11 March, and is led by Haoyu Lu and a team of DeepSeek researchers [1]. The series extends the company's text-only [DeepSeek LLM](/wiki/deepseek_llm) into a [multimodal](/wiki/multimodal) [vision-language model](/wiki/vision_language_model) by pairing the language backbone with a hybrid vision encoder and a vision-language adaptor.

The project's stated goal was practical: rather than maximizing a single academic benchmark, DeepSeek-VL aimed for a model that handled everyday inputs such as web screenshots, PDFs, charts, diagrams, optical character recognition (OCR), and natural photos, while keeping the underlying language ability of the LLM intact [1]. The Mixture-of-Experts successor [DeepSeek-VL2](/wiki/deepseek_vl2) followed in December 2024, and the autoregressive image-generation line [Janus](/wiki/deepseek_janus) is a related but architecturally distinct effort.

## Background

By early 2024 DeepSeek had released its [DeepSeek LLM](/wiki/deepseek_llm) text models and was building out a broader model lineup. DeepSeek-VL was the company's entry into open multimodal models, competing with contemporaries such as LLaVA, Qwen-VL, and InternVL. The paper frames three priorities that shaped the design: a data pipeline broad enough to cover real-world use cases, a vision encoder that could read high-resolution images without excessive compute, and a pretraining recipe that did not degrade text performance [1]. The last point was a recurring problem for vision-language models of the era, where bolting a vision module onto an LLM and fine-tuning on image-text data tended to erode the model's language skills.

## Hybrid vision encoder

The defining architectural choice in DeepSeek-VL is a hybrid vision encoder that runs two separate pretrained encoders in parallel and fuses their outputs. One branch is a [SigLIP](/wiki/siglip)-L encoder operating at 384 x 384 resolution, which captures coarse, text-aligned semantic content. The other is an encoder based on SAM-B, the base image encoder from Meta's Segment Anything Model, operating at 1024 x 1024 to capture fine-grained detail such as small text, document layout, and dense chart elements [1][2].

The two streams are reshaped to a common format before fusion. The SigLIP-L branch turns a 384 x 384 image into a 24 x 24 x 1024 feature map, then flattens it to 576 x 1024. The SAM-B branch turns a 1024 x 1024 image into a 64 x 64 x 256 feature map, interpolates it to 96 x 96 x 256, applies two convolutional layers with stride 2 to reach 24 x 24 x 1024, and reshapes that to 576 x 1024 [1]. The two 576 x 1024 maps are concatenated along the feature dimension to give 576 visual tokens of dimension 2048. A two-layer hybrid multilayer perceptron (MLP) vision-language adaptor then projects these 576 tokens into the LLM's embedding space, where they sit alongside the text tokens [1][2].
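
The shape bookkeeping above can be checked with a few lines of arithmetic. The sketch below is not the model's code; it only traces tensor sizes. The convolution kernel size (3) and padding (1) are assumptions chosen so that a stride of 2 halves the map, since the text states only the stride. The function names are hypothetical.

```python
# Shape trace for the hybrid encoder described in the text (no real tensors).
# ASSUMPTION: kernel=3, padding=1 for the stride-2 convolutions; only the
# stride is given in the source.

def conv_out(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Output length of a standard convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def siglip_tokens(grid: int = 24, channels: int = 1024) -> tuple[int, int]:
    """24 x 24 x 1024 feature map flattened to (tokens, channels)."""
    return grid * grid, channels


def sam_tokens(upsampled: int = 96, channels: int = 1024) -> tuple[int, int]:
    """64 x 64 map interpolated to 96 x 96, then two stride-2 convolutions."""
    side = upsampled
    for _ in range(2):  # 96 -> 48 -> 24
        side = conv_out(side)
    return side * side, channels


def hybrid_tokens() -> tuple[int, int]:
    """Concatenate both branches along the feature dimension."""
    n_sig, c_sig = siglip_tokens()
    n_sam, c_sam = sam_tokens()
    if n_sig != n_sam:
        raise ValueError("branches must yield the same number of tokens")
    return n_sig, c_sig + c_sam


print(hybrid_tokens())  # (576, 2048)
```

Both branches must land on the same 24 x 24 grid before concatenation, which is why the SAM-B map is resized and downsampled rather than flattened directly.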

This fixed budget of 576 visual tokens per image is a deliberate efficiency trade-off. It keeps the sequence length manageable while still supplying high-resolution detail through the SAM-B branch. The fixed 1024 x 1024 input resolution later became the main limitation that the successor model addressed.

## Language backbone

DeepSeek-VL reuses the [DeepSeek LLM](/wiki/deepseek_llm) family as its language model, which is a dense decoder-only transformer in the LLaMA architectural style. The 1.3B vision model is built on a roughly 1B-parameter DeepSeek language model pretrained on about 500 billion text tokens, and the 7B vision model is built on the 7B DeepSeek LLM pretrained on roughly 2 trillion text tokens [1]. Both support a 4096-token context length [2]. Because the language weights start from a fully trained DeepSeek LLM, preserving that capability through multimodal training was an explicit design constraint rather than an afterthought.
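
To illustrate how the fixed 576-token image budget interacts with the 4096-token context, the following sketch computes how many tokens remain for text. It is a rough estimate under a simplifying assumption: it ignores special tokens and chat-template overhead, which the text does not quantify. The function names are hypothetical.

```python
# Rough context budgeting from the figures stated in the article.
# ASSUMPTION: no special-token or chat-template overhead is counted.

VISUAL_TOKENS_PER_IMAGE = 576  # fixed output of the hybrid encoder
CONTEXT_LENGTH = 4096          # stated context length for both sizes


def text_budget(num_images: int, context: int = CONTEXT_LENGTH) -> int:
    """Tokens left for text after reserving space for the images."""
    used = num_images * VISUAL_TOKENS_PER_IMAGE
    if num_images < 0 or used > context:
        raise ValueError("image count does not fit in the context window")
    return context - used


def max_images(context: int = CONTEXT_LENGTH) -> int:
    """Upper bound on images that fit if no text were present."""
    return context // VISUAL_TOKENS_PER_IMAGE


print(text_budget(1))  # 3520
print(max_images())    # 7
```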

## Data and training

The team emphasized data construction as much as architecture. The pretraining corpus draws from web image-text pairs, web screenshots, PDFs and rendered documents, OCR data, charts and tables, and knowledge-grounded content, deduplicated and filtered to cover realistic scenarios [1]. The supervised fine-tuning mix adds instruction-following and conversational vision-language data.

Training proceeds in three stages [1]:

| Stage | What is trained | Data and scale |
| --- | --- | --- |
| 1. Adaptor warmup | Vision-language adaptor only; vision encoder and LLM frozen | About 1.25 million image-text pairs (drawn from ShareGPT4V) plus about 2.5 million document OCR rendering pairs |
| 2. Joint vision-language pretraining | Adaptor and LLM trainable; vision encoder frozen | Mixed 70% language-only and 30% vision-language data; the 7B run uses 42,000 steps at batch size 2,304, and the 1.3B run uses 96,000 steps at batch size 1,024 |
| 3. Supervised fine-tuning | SigLIP-L encoder, adaptor, and LLM jointly tuned | About 10,000 steps at batch size 256 |
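
Multiplying the step counts by the batch sizes in the table gives a sense of scale for each run. The sketch below does only that arithmetic. It counts training sequences, not tokens, and it applies the 30% vision-language share uniformly across stage 2, which is a simplification.

```python
# Sequences processed per run = optimizer steps x batch size.
# ASSUMPTION: the 70/30 split is treated as constant over all of stage 2.

RUNS = {
    "stage2_7b": (42_000, 2_304),
    "stage2_small": (96_000, 1_024),
    "stage3_sft": (10_000, 256),
}


def sequences_seen(steps: int, batch_size: int) -> int:
    return steps * batch_size


totals = {name: sequences_seen(*cfg) for name, cfg in RUNS.items()}
for name, total in totals.items():
    print(f"{name}: {total:,} sequences")

# Approximate vision-language share of the 7B stage-2 run at a flat 30%.
vl_share_7b = totals["stage2_7b"] * 30 // 100
print(f"stage2_7b vision-language: {vl_share_7b:,} sequences")
```

Under these figures the two stage-2 runs process a similar number of sequences (about 97 million and 98 million), with the smaller model trading a smaller batch for more steps.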

A central finding is that the modality ratio in stage 2 matters. Mixing a large share of text-only data with the vision-language data, together with a warm-up schedule that gradually increases the multimodal proportion, lets the model acquire visual grounding while retaining its language performance [1]. The supervised fine-tuning stage uses a weighted loss that balances the language-modeling and vision-language objectives.
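
One way to picture a modality warm-up is a sampler whose vision-language probability ramps up to the 30% target. The sketch below is illustrative only: the linear ramp, the `warmup_steps` parameter, and the function names are assumptions, and the paper's actual schedule may differ.

```python
import random

# Hypothetical modality warm-up: linearly raise the vision-language share
# from 0 to the target, then hold it. Not the authors' published schedule.


def multimodal_fraction(step: int, warmup_steps: int, target: float = 0.30) -> float:
    """Probability of drawing a vision-language batch at a given step."""
    if warmup_steps <= 0:
        return target
    progress = min(1.0, max(0.0, step / warmup_steps))
    return target * progress


def sample_modality(step: int, warmup_steps: int, rng: random.Random) -> str:
    """Pick the data source for one batch."""
    if rng.random() < multimodal_fraction(step, warmup_steps):
        return "vision-language"
    return "language-only"


rng = random.Random(0)
draws = [sample_modality(step=5_000, warmup_steps=1_000, rng=rng) for _ in range(10_000)]
share = draws.count("vision-language") / len(draws)
print(f"observed vision-language share after warm-up: {share:.2f}")
```

Starting text-heavy means the LLM weights initially see a distribution close to their original pretraining data, which is the intuition behind preserving language ability.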

## Model sizes

DeepSeek-VL was published in two sizes, each with a base checkpoint for further training and a chat checkpoint tuned for instruction following [2].

| Variant | Vision encoder | Language backbone | Context | Hugging Face identifiers |
| --- | --- | --- | --- | --- |
| DeepSeek-VL 1.3B | SigLIP-L (384) + SAM-B (1024) hybrid | DeepSeek LLM ~1B (~500B tokens) | 4096 | deepseek-ai/deepseek-vl-1.3b-base, deepseek-ai/deepseek-vl-1.3b-chat |
| DeepSeek-VL 7B | SigLIP-L (384) + SAM-B (1024) hybrid | DeepSeek LLM 7B (~2T tokens) | 4096 | deepseek-ai/deepseek-vl-7b-base, deepseek-ai/deepseek-vl-7b-chat |
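
The four checkpoint identifiers in the table follow a regular naming pattern. As a small convenience sketch, the hypothetical helper below builds an identifier from a size and variant and rejects combinations that are not listed. It does not download anything or call any library.

```python
# Build Hugging Face identifiers from the naming pattern shown in the table.
# `hf_identifier` is a hypothetical helper, not part of any official package.

SIZES = ("1.3b", "7b")
VARIANTS = ("base", "chat")


def hf_identifier(size: str, variant: str) -> str:
    size, variant = size.lower(), variant.lower()
    if size not in SIZES or variant not in VARIANTS:
        raise ValueError(f"unknown checkpoint: size={size!r}, variant={variant!r}")
    return f"deepseek-ai/deepseek-vl-{size}-{variant}"


ALL_CHECKPOINTS = [hf_identifier(s, v) for s in SIZES for v in VARIANTS]
print(ALL_CHECKPOINTS)
```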

## Benchmarks

The paper reports DeepSeek-VL against open multimodal models of the same scale across general multimodal understanding, science, and math benchmarks. Selected scores for the chat models are below [1].

| Benchmark | DeepSeek-VL 1.3B chat | DeepSeek-VL 7B chat |
| --- | --- | --- |
| MMBench (dev) | 64.6 | 73.2 |
| MMBench-CN (dev) | 61.3 | 72.8 |
| SEEDBench | 66.7 | 70.4 |
| MMMU (val) | 32.2 | 36.6 |
| MathVista | 31.1 | 36.1 |
| MM-Vet | 34.8 | 41.5 |
| POPE | 87.6 | 88.1 |

The authors position the 7B chat model as state-of-the-art or competitive among similarly sized open models on these vision-language tasks. The paper separately reports that text-only performance stays close to that of the underlying DeepSeek LLM, which it attributes to the modality-mixing pretraining recipe [1]. Scores on POPE, an object-hallucination benchmark, indicate relatively low hallucination for both sizes.
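
To see where model scale helps most, the gap between the two chat models can be computed directly from the table. The sketch below uses only the scores listed above. Because the benchmarks use different scales and metrics, the differences are indicative rather than strictly comparable.

```python
# Score differences (7B chat minus 1.3B chat) from the benchmark table.

SCORES = {
    "MMBench (dev)": (64.6, 73.2),
    "MMBench-CN (dev)": (61.3, 72.8),
    "SEEDBench": (66.7, 70.4),
    "MMMU (val)": (32.2, 36.6),
    "MathVista": (31.1, 36.1),
    "MM-Vet": (34.8, 41.5),
    "POPE": (87.6, 88.1),
}

# Round to one decimal to avoid floating-point noise in the subtraction.
gains = {name: round(large - small, 1) for name, (small, large) in SCORES.items()}

for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} +{gain:.1f}")

largest = max(gains, key=gains.get)
smallest = min(gains, key=gains.get)
```

The largest gap is on MMBench-CN (11.5 points) and the smallest on POPE (0.5 points), where both sizes already score in the high 80s.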

## Licensing

The code in the official deepseek-ai/DeepSeek-VL repository is released under the MIT License, and the model weights are governed by the separate DeepSeek Model License, which permits commercial use subject to the license's use restrictions [2]. This open and commercially permissive stance was consistent with DeepSeek's broader release strategy and made the models widely available on Hugging Face.

## Successors and related models

The direct successor is [DeepSeek-VL2](/wiki/deepseek_vl2) (arXiv:2412.10302), released in December 2024. It replaces the dense backbone with a DeepSeekMoE Mixture-of-Experts language model that uses Multi-head Latent Attention, and it swaps the fixed 1024 x 1024 hybrid encoder for a dynamic tiling scheme that splits an image into tiles to support varying aspect ratios and higher effective resolution. DeepSeek-VL2 ships as Tiny, Small, and base variants with roughly 1.0B, 2.8B, and 4.5B activated parameters, respectively [3].

Separately, DeepSeek's [Janus](/wiki/deepseek_janus) and Janus-Pro models pursue a unified architecture that both understands and generates images, decoupling the visual encoding paths for the two tasks. Janus is a parallel line of work rather than a continuation of the DeepSeek-VL understanding-only design.

## References

1. Lu, Haoyu et al. "DeepSeek-VL: Towards Real-World Vision-Language Understanding." arXiv:2403.05525. https://arxiv.org/abs/2403.05525
2. DeepSeek-VL official repository (README, model list, and license). https://github.com/deepseek-ai/DeepSeek-VL
3. Wu, Zhiyu et al. "DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding." arXiv:2412.10302. https://arxiv.org/abs/2412.10302