DeepSeek Janus

Updated Jul 23, 2026

DeepSeek Janus is a family of open-weight unified multimodal models from the Chinese AI lab DeepSeek that perform both image understanding and text-to-image generation in a single autoregressive transformer. Its defining design choice is to decouple visual encoding into two independent pathways, one tuned for understanding and one for generation, so that each task uses the visual representation best suited to it while a shared transformer backbone reads and writes both kinds of tokens.^[1] The original Janus (1.3B parameters) was released on 17 October 2024, and the scaled follow-up Janus-Pro (released as Janus-Pro-1B and Janus-Pro-7B) arrived on 27 January 2025; DeepSeek reports that Janus-Pro-7B scores 0.80 on the GenEval text-to-image benchmark, ahead of Stable Diffusion 3 Medium at 0.74 and DALL-E 3 at 0.67.^[4]^[11]^[13] The code is released under the MIT License and the model weights under a separate DeepSeek Model License that permits commercial use.^[6]

The series comprises three named releases: the original Janus at 1.3B parameters (October 2024),^[1] JanusFlow at 1.3B parameters, which swaps the discrete image tokenizer for a rectified-flow head (November 2024),^[2] and Janus-Pro at 1B and 7B parameters (January 2025), released a week after the viral debut of DeepSeek-R1.^[3]^[4] The Janus paper frames the contribution plainly: "we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing."^[1]

Infobox

Property	Value
Developer	DeepSeek-AI^[1]
First release (Janus)	17 October 2024^[5]
JanusFlow release	12 November 2024 (arXiv); 13 November 2024 (repo)^[2]^[6]
Janus-Pro release	27 January 2025^[4]^[6]
Model sizes	1.3B (Janus, JanusFlow); 1B and 7B (Janus-Pro)^[6]^[7]
Understanding encoder	SigLIP-Large-Patch16-384^[7]^[8]
Generation tokenizer	LlamaGen VQ tokenizer, codebook 16,384, 16x downsample^[8]
Backbone LLM	DeepSeek-LLM-1.3B / 1.5B / 7B base^[7]^[8]
Code license	MIT License^[6]
Model license	DeepSeek Model License (commercial use permitted)^[6]
Primary tasks	Image-to-text understanding and text-to-image generation^[1]^[3]

What is DeepSeek Janus?

Janus is one of the more visible attempts to show that a single autoregressive transformer can be competitive on both visual understanding (answering questions about an image, captioning, visual reasoning) and image generation (producing an image from a text prompt). Rather than treating these as two separate models, Janus runs both through one shared transformer trunk, switching only the visual encoding pathway and the output head depending on the task.^[1]^[8] The models are open: the deepseek-ai/Janus repository ships the code under the MIT License, and the released checkpoints (Janus-1.3B, JanusFlow-1.3B, Janus-Pro-1B, Janus-Pro-7B) are distributed on Hugging Face under a DeepSeek Model License that explicitly permits commercial use.^[6]^[7]^[12]

How does Janus unify understanding and generation?

The animating claim of the Janus papers is that the visual representation needed for understanding and the representation needed for generation are fundamentally different, so forcing both through one shared encoder is a compromise.^[1]^[8] Understanding favors a small number of high-dimensional, semantically dense embeddings, of the kind produced by contrastive vision encoders. Generation favors a long sequence of low-level, spatially fine-grained tokens, of the kind produced by discrete image quantizers.^[1] Prior unified systems such as Chameleon and Show-o ran both tasks through a single tokenizer, which the Janus authors argue forces a suboptimal compromise.^[1]

Janus instead instantiates two encoders in parallel.^[1]^[8]

The understanding pathway uses a frozen SigLIP-Large-Patch16-384 vision encoder, which produces continuous semantic features from a 384x384 input image. A two-layer MLP adaptor maps SigLIP features into the language-model embedding space.^[7]^[8]
The generation pathway uses a discrete vector-quantized tokenizer borrowed from the LlamaGen project, with a codebook of 16,384 entries and a 16x spatial downsample factor; a separate two-layer MLP adaptor maps each codebook index into the language-model embedding space, and an image head predicts the next code in autoregressive sequence.^[8]
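The sizes quoted for the two pathways imply a concrete sequence length, which the short calculation below works out. It is a back-of-the-envelope sketch, not code from the Janus repository: it assumes that SigLIP emits one token per 16x16 patch with no pooling, and that generated images share the 384x384 resolution that the text states only for the understanding input.

```python
import math

IMAGE_SIDE = 384        # pixels per side, from the text (understanding input)
SIGLIP_PATCH = 16       # from the encoder name "Patch16"
VQ_DOWNSAMPLE = 16      # spatial downsample factor of the VQ tokenizer
CODEBOOK_SIZE = 16_384  # number of entries in the VQ codebook


def grid_tokens(side: int, stride: int) -> int:
    """Number of tokens when a square image is cut into stride x stride cells."""
    if side % stride != 0:
        raise ValueError("image side must be a multiple of the stride")
    per_side = side // stride
    return per_side * per_side


# Assumption: one SigLIP token per patch, no pooling or class token.
understanding_tokens = grid_tokens(IMAGE_SIDE, SIGLIP_PATCH)

# Assumption: generation also targets a 384x384 image.
generation_tokens = grid_tokens(IMAGE_SIDE, VQ_DOWNSAMPLE)

# Each discrete code selects one of 16,384 entries, i.e. log2(16384) bits.
bits_per_code = math.log2(CODEBOOK_SIZE)

print(understanding_tokens, generation_tokens, bits_per_code)  # 576 576 14.0
```

Under these assumptions both pathways contribute a 24x24 grid, so an image costs 576 positions in the shared sequence whichever encoder produced it.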

Both streams feed a single autoregressive transformer initialised from DeepSeek-LLM-1.3b-base.^[7] At training and inference time the system can switch modes by selecting which adaptor projects an incoming image (for understanding) or by switching the output head from text logits to image-code logits (for generation).^[8] As the paper puts it, this decoupling "alleviates the conflict between the visual encoder's roles in understanding and generation" while keeping a single backbone.^[1]
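The mode switch described above can be pictured with a toy router. This is a minimal sketch with stand-in stubs, not the Janus implementation: every function name, the tiny dimensions, and the mean-pooling "backbone" are hypothetical and exist only to show that the adaptor and the output head change with the task while the trunk stays the same.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]

EMBED_DIM = 4        # hypothetical; real models use the LLM hidden size
TEXT_VOCAB = 10      # hypothetical toy text vocabulary
IMAGE_CODEBOOK = 16  # toy stand-in for the 16,384-entry VQ codebook


def understanding_adaptor(feature: Vector) -> Vector:
    """Stub for 'SigLIP feature -> MLP adaptor -> LLM embedding'."""
    return [0.5 * x for x in feature]


def generation_adaptor(code: int) -> Vector:
    """Stub for 'VQ code index -> MLP adaptor -> LLM embedding'."""
    return [float((code + i) % IMAGE_CODEBOOK) for i in range(EMBED_DIM)]


def backbone(sequence: List[Vector]) -> Vector:
    """Stub for the shared transformer: mean-pools the whole sequence."""
    return [sum(v[i] for v in sequence) / len(sequence) for i in range(EMBED_DIM)]


def make_head(size: int) -> Callable[[Vector], List[float]]:
    """Stub linear head producing `size` logits from the hidden state."""
    return lambda hidden: [sum(hidden) * (k + 1) for k in range(size)]


HEADS: Dict[str, Callable[[Vector], List[float]]] = {
    "understanding": make_head(TEXT_VOCAB),   # text logits
    "generation": make_head(IMAGE_CODEBOOK),  # image-code logits
}


def next_token_logits(
    task: str,
    text_embeddings: List[Vector],
    image_features: Optional[List[Vector]] = None,
    image_codes: Optional[List[int]] = None,
) -> List[float]:
    """Route the image through the task's adaptor, then use the task's head."""
    if task == "understanding":
        visual = [understanding_adaptor(f) for f in (image_features or [])]
    elif task == "generation":
        visual = [generation_adaptor(c) for c in (image_codes or [])]
    else:
        raise ValueError(f"unknown task: {task!r}")
    hidden = backbone(list(text_embeddings) + visual)  # same trunk for both
    return HEADS[task](hidden)


prompt = [[0.0] * EMBED_DIM]
print(len(next_token_logits("understanding", prompt, image_features=[[1.0] * EMBED_DIM])))
print(len(next_token_logits("generation", prompt, image_codes=[3, 7])))
```

The only task-dependent pieces are the adaptor lookup and the entry chosen from `HEADS`; `backbone` is called identically in both branches, which is the sense in which the model remains a single network.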

History

When was the original Janus released (October 2024)?

The original Janus paper, "Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation," was posted to arXiv on 17 October 2024 with identifier 2410.13848.^[5] The authors are Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo, a collaboration that draws members from DeepSeek-AI together with academic collaborators.^[5] The model card on Hugging Face confirms the same release date and lists DeepSeek-LLM-1.3b-base as the language backbone.^[7] On 20 October 2024, the team pushed a corrective update to the tokenizer configuration that had been causing degraded generation outputs in the initial release.^[6]

The paper situates Janus against earlier unified multimodal models such as Chameleon and Show-o, which use a single shared visual representation for both understanding and generation, arguing that the granularity mismatch between those two tasks leads to compromise on both.^[1]^[8] Janus instead routes input images through a SigLIP encoder for understanding queries and routes generation targets through a discrete VQ tokenizer, while the shared autoregressive transformer reads and writes both kinds of tokens in a single sequence.^[1]^[8]

What is JanusFlow (November 2024)?

Less than a month after the original release, on 12 November 2024, the same group posted "JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation" to arXiv as 2411.07975.^[2] The first authors of this paper are Yiyang Ma and Xingchao Liu, with the larger DeepSeek vision team including Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan.^[2] JanusFlow was subsequently accepted to CVPR 2025.^[2] The repository dates the JanusFlow-1.3B model release to 13 November 2024.^[6]

JanusFlow keeps the decoupled-encoder idea but replaces the discrete VQ tokenizer used for generation with a continuous rectified flow head trained jointly with the autoregressive backbone.^[2]^[9] The paper argues that rectified flow "can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications," and credits two strategies for its performance: (i) decoupling the understanding and generation encoders, a design carried over from the original Janus, and (ii) aligning their representations during unified training.^[2]

What is Janus-Pro (January 2025)?

DeepSeek released Janus-Pro on 27 January 2025, just one week after the high-profile launch of the DeepSeek-R1 reasoning model on 20 January 2025.^[4]^[10] The accompanying technical report, "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling," appeared on arXiv as 2501.17811 on 29 January 2025 with authors Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan.^[11] The arXiv listing explicitly notes "substantial text overlap" with the original Janus paper, reflecting that Janus-Pro is best understood as a scaled and re-trained edition of the same architecture rather than a redesign.^[11] The report summarizes the result directly: "Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation."^[11]

The release was widely covered as part of the broader "DeepSeek moment" of late January 2025. Western press treated Janus-Pro both as a standalone image-model story and as a follow-on to the R1 announcement that had already triggered sharp declines in U.S. AI-related equities.^[3]^[10] Coverage by TechCrunch on 27 January noted the model family ranges from 1 billion to 7 billion parameters under an MIT code license that permits commercial use.^[4] TechNode reported the announcement was made in the early hours of 28 January Beijing time, on the eve of the Lunar New Year.^[10] VentureBeat framed the release in the context of the simultaneous "AI stock bloodbath" affecting Nvidia and other U.S. technology stocks.^[3]

Technical Details

Why decouple the visual encoder?

The animating claim of the Janus papers is that the visual representation needed for understanding (answering questions about an image, captioning, visual reasoning) and the representation needed for generation (predicting the next pixel-equivalent token from text) are fundamentally different.^[1]^[8] Understanding favors a small number of high-dimensional, semantically dense embeddings, of the kind produced by contrastive vision encoders. Generation favors a long sequence of low-level, spatially fine-grained tokens, of the kind produced by discrete image quantizers.^[1] Prior unified systems such as Chameleon and Show-o ran both tasks through a single tokenizer, which the Janus authors argue forces a suboptimal compromise.^[1]


How is Janus trained?

The original Janus is trained in three stages.^[8]

Stage I (adaptor warmup): the SigLIP encoder and the LLM backbone are frozen; only the two adaptors and the image generation head are updated. Batch size 256 for roughly 10K steps.^[8]
Stage II (unified pretraining): all components except the visual encoder are unfrozen; the model is trained on a mixture of text-only data, image-understanding data, and text-to-image data. Batch size 512 for roughly 180K steps.^[8]
Stage III (supervised fine-tuning): instruction-formatted samples across modalities. Batch size 256 for roughly 24K steps.^[8]
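The three-stage schedule above can be written down as data to see how much of the training budget each stage consumes. This is an illustrative sketch only: the component names are hypothetical labels, the step counts are the rounded figures quoted above, and "samples seen" is simply batch size times steps, ignoring any sequence packing. The text does not say which components are trainable in Stage III, so that field is left unset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    batch_size: int
    steps: int  # rounded figure from the text ("roughly")
    trainable: Optional[Tuple[str, ...]]  # None = not stated in the text

    @property
    def samples_seen(self) -> int:
        """Approximate number of training samples processed in this stage."""
        return self.batch_size * self.steps


STAGES = (
    Stage("I: adaptor warmup", 256, 10_000,
          ("understanding_adaptor", "generation_adaptor", "image_head")),
    Stage("II: unified pretraining", 512, 180_000,
          ("understanding_adaptor", "generation_adaptor", "image_head", "llm_backbone")),
    Stage("III: supervised fine-tuning", 256, 24_000, None),
)

total = sum(stage.samples_seen for stage in STAGES)
for stage in STAGES:
    share = 100.0 * stage.samples_seen / total
    print(f"{stage.name}: {stage.samples_seen:,} samples ({share:.1f}% of total)")
print(f"total: {total:,}")
```

On these rounded numbers, Stage II accounts for roughly nine tenths of all samples processed, which matches its role as the main pretraining phase.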

Training data for the original 1.3B model included ShareGPT4V, ImageNet-1k, WikiHow, WIT, COCO, and LAION-derived corpora, augmented with approximately 2M in-house text-to-image samples.^[8]

How does JanusFlow replace discrete tokens with rectified flow?

JanusFlow keeps the decoupled-encoder structure but changes the generation head. Instead of predicting indices in a discrete codebook, the model uses a rectified flow formulation: it learns the velocity field of an ordinary differential equation (ODE) that transports Gaussian noise to image data, conditioned on the autoregressive context.^[2]^[9] A lightweight ConvNeXt-style network predicts the per-step velocity on the generation side; the SigLIP encoder remains for understanding.^[9] The same shared autoregressive transformer drives both pathways. Conceptually, this aligns Janus with the broader move from VQ-token-based image LLMs toward continuous-token or flow-matching formulations, while preserving the language-model-style sequence interface.^[2]
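The rectified-flow idea can be illustrated on a three-number "image" without any neural network. The sketch below is a generic toy version of rectified flow, not JanusFlow's code: it assumes the convention that t = 0 is pure noise and t = 1 is data, uses a hand-written oracle velocity field in place of the learned ConvNeXt-style network, and omits the conditioning on the autoregressive context entirely. All names are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def interpolate(noise: Vector, data: Vector, t: float) -> Vector:
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise: Vector, data: Vector) -> Vector:
    """On a straight path the velocity is constant: data minus noise."""
    return [d - n for n, d in zip(noise, data)]


def flow_loss(predict: VelocityFn, noise: Vector, data: Vector, t: float) -> float:
    """Mean squared error between predicted and target velocity at time t."""
    predicted = predict(interpolate(noise, data, t), t)
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def euler_sample(velocity: VelocityFn, noise: Vector, steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy dataset containing a single "image"; the ideal field always aims at it.
TARGET = [0.25, -1.0, 0.5]


def oracle_velocity(x: Vector, t: float) -> Vector:
    """Exact velocity field when every noise sample is paired with TARGET."""
    return [(d - xi) / (1.0 - t) for xi, d in zip(x, TARGET)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in TARGET]
loss = flow_loss(oracle_velocity, noise, TARGET, t=0.3)
sample = euler_sample(oracle_velocity, noise, steps=10)
```

Here `loss` is zero up to rounding because the oracle matches the training target exactly, and `sample` lands on `TARGET` after ten Euler steps. In a trained model the oracle would be replaced by a network that only approximates this field.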

The JanusFlow paper highlights that decoupling the understanding and generation encoders and aligning their representations during unified training are crucial to making the rectified-flow head compatible with an autoregressive LLM without requiring major architectural surgery.^[2] Like the original Janus, it is trained in three phases (adaptation, unified pretraining excluding the visual encoder, supervised fine-tuning) and ships at a compact 1.3B parameter scale.^[9]

What changed in Janus-Pro?

The Janus-Pro technical report identifies three changes relative to the original Janus:^[11]

an optimized training strategy (adjusted stage durations and data mixtures),
expanded training data including synthetic aesthetic data to improve text-to-image stability, and
larger model scaling, producing Janus-Pro-1B and Janus-Pro-7B variants in addition to the 1.3B-class lineage.^[11]^[12]

The report states that Janus-Pro "incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size."^[11] According to the Hugging Face model cards, Janus-Pro-1B is built on DeepSeek-LLM-1.5b-base and Janus-Pro-7B is built on DeepSeek-LLM-7b-base; the SigLIP-L vision encoder at 384x384 and the LlamaGen-derived 16x VQ tokenizer are unchanged from the original Janus.^[7]^[12] The image input resolution remains 384x384.^[4]^[12]

Architecture comparison

Component	Janus (1.3B)	JanusFlow (1.3B)	Janus-Pro (1B / 7B)
Understanding encoder^[7]^[8]^[9]^[12]	SigLIP-L @ 384	SigLIP-L @ 384	SigLIP-L @ 384
Generation pathway^[8]^[9]^[12]	LlamaGen VQ tokenizer (codebook 16,384, 16x)	Rectified flow head with ConvNeXt-style velocity net	LlamaGen VQ tokenizer (codebook 16,384, 16x)
LLM backbone^[7]^[11]^[12]	DeepSeek-LLM-1.3B-base	DeepSeek-LLM-1.3B-base	DeepSeek-LLM-1.5B-base / 7B-base
Output for generation^[1]^[2]^[11]	Discrete image-code logits	Continuous flow vector field	Discrete image-code logits
Release^[5]^[2]^[11]	Oct 2024 (arXiv 2410.13848)	Nov 2024 (arXiv 2411.07975)	Jan 2025 (arXiv 2501.17811)

How does Janus perform on benchmarks?

Understanding benchmarks (Janus 1.3B)

The original Janus paper reports that the 1.3B model surpasses comparably sized unified models and approaches or exceeds larger task-specific baselines.^[8] Reported scores include 69.4 on MMBench (versus 64.3 for LLaVA-v1.5-7B) and 87.0 on POPE (versus 73.8 for Show-o).^[8]

Generation benchmarks (Janus 1.3B)

For text-to-image, Janus-1.3B is reported at a GenEval overall score of 0.61 (61 percent), ahead of SDXL at 0.55 and DALL-E 2 at 0.52, and at a COCO-30K FID of 8.53 versus 9.24 for Show-o (lower FID is better).^[8]

Janus-Pro-7B generation benchmarks

The Janus-Pro technical report and corroborating third-party coverage report the following GenEval and DPG-Bench numbers for the 7B model. On GenEval overall accuracy, DeepSeek reports Janus-Pro-7B at 0.80, ahead of DALL-E 3 at 0.67 and Stable Diffusion 3 Medium at 0.74:^[12]^[13]

Benchmark	Janus-Pro-7B	DALL-E 3	SD3-Medium
GenEval overall accuracy^[13]	0.80	0.67	0.74
GenEval positional alignment^[13]	0.79	0.43	(not listed)
GenEval attribute alignment^[13]	0.66	0.45	(not listed)
GenEval color alignment^[13]	0.90	0.83	(not listed)
DPG-Bench overall^[13]	84.19	83.50	(not listed)
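To put the headline gaps in proportion, the overall scores from the table can be turned into absolute and relative margins. This is plain arithmetic on the reported numbers, not an independent evaluation; the `margin` helper is a hypothetical name introduced for illustration.

```python
from typing import Dict, Tuple

# Overall scores exactly as reported in the table above.
SCORES: Dict[str, Dict[str, float]] = {
    "GenEval overall": {"Janus-Pro-7B": 0.80, "DALL-E 3": 0.67, "SD3-Medium": 0.74},
    "DPG-Bench overall": {"Janus-Pro-7B": 84.19, "DALL-E 3": 83.50},
}


def margin(benchmark: str, baseline: str, model: str = "Janus-Pro-7B") -> Tuple[float, float]:
    """Return (absolute gap, gap as a percentage of the baseline score)."""
    row = SCORES[benchmark]
    gap = row[model] - row[baseline]
    return round(gap, 2), round(100.0 * gap / row[baseline], 1)


print(margin("GenEval overall", "DALL-E 3"))    # (0.13, 19.4)
print(margin("GenEval overall", "SD3-Medium"))  # (0.06, 8.1)
print(margin("DPG-Bench overall", "DALL-E 3"))  # (0.69, 0.8)
```

The GenEval lead over DALL-E 3 is substantial in relative terms, while the DPG-Bench lead is under one percent of the baseline score.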

For multimodal understanding, the Janus-Pro report gives the 7B model's average accuracy across POPE, MME-Perception, GQA, and MMMU and claims gains over both the earlier Janus models and several similarly sized task-specific VLMs.^[11]^[14]

DeepSeek's own claims that Janus-Pro-7B surpasses DALL-E 3 and Stable Diffusion XL on these benchmarks have been widely repeated in the trade press; reviewers including TechCrunch have noted that the cited competitor models are not all current state-of-the-art and that the 384x384 input resolution is modest.^[4]

Is DeepSeek Janus open source?

All Janus-series models are distributed through the deepseek-ai organisation on Hugging Face and the deepseek-ai/Janus repository on GitHub.^[6]^[7]^[12] The released checkpoints to date are:

deepseek-ai/Janus-1.3B, released 17 October 2024 alongside the original paper.^[7]
deepseek-ai/JanusFlow-1.3B, released 13 November 2024.^[6]
deepseek-ai/Janus-Pro-1B and deepseek-ai/Janus-Pro-7B, released 27 January 2025.^[6]^[12]

The repository's code is published under the MIT License, while the model weights are governed by a separate DeepSeek Model License that explicitly permits commercial use.^[6] In practice this makes Janus open-weight (downloadable and commercially usable) rather than fully MIT-licensed end to end. Model-card snapshots show the 7B model with tens of thousands of monthly downloads and a large ecosystem of community fine-tunes and Hugging Face Spaces built on top of it.^[12]
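For readers who want the weights locally, the checkpoints listed above can be fetched with the `huggingface_hub` client. The sketch below is one reasonable way to do it, not an official recipe: the `CHECKPOINTS` table, `model_url`, and `download` are hypothetical helpers, the model-page URL pattern for the two checkpoints not linked in the references is assumed to follow the same form, and `huggingface_hub` must be installed separately. Loading and running the model additionally needs the code from the deepseek-ai/Janus repository, which is not shown here.

```python
from typing import Dict, Optional

# Repository IDs and release dates as listed in this article.
CHECKPOINTS: Dict[str, str] = {
    "deepseek-ai/Janus-1.3B": "2024-10-17",
    "deepseek-ai/JanusFlow-1.3B": "2024-11-13",
    "deepseek-ai/Janus-Pro-1B": "2025-01-27",
    "deepseek-ai/Janus-Pro-7B": "2025-01-27",
}


def model_url(repo_id: str) -> str:
    """Model-card URL on Hugging Face for a given repository ID."""
    return f"https://huggingface.co/{repo_id}"


def download(repo_id: str, local_dir: Optional[str] = None) -> str:
    """Download every file of a listed checkpoint and return the local path."""
    if repo_id not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {repo_id!r}")
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


if __name__ == "__main__":
    for repo_id, released in CHECKPOINTS.items():
        print(released, model_url(repo_id))
    # Uncomment to fetch the weights (a multi-gigabyte download):
    # path = download("deepseek-ai/Janus-Pro-1B")
```

Downloading the files does not change the licensing split described above: the code remains under the MIT License and the weights under the DeepSeek Model License.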

Why does Janus matter?

A demonstration of unified multimodal scaling

Janus is one of the more visible attempts to show that a single autoregressive transformer can be competitive on both visual understanding and image generation. The decoupled-encoder design is a concrete answer to a research question that earlier unified models, including Chameleon and Show-o, raised but did not fully resolve.^[1]^[8] By placing each task on the visual representation best suited to it (semantic for understanding, fine-grained discrete tokens or rectified-flow continuous tokens for generation) while still sharing the language-model trunk, Janus argues that "unified" need not mean "single shared encoder."^[1]

Role in the January 2025 "DeepSeek moment"

Janus-Pro arrived in the same news cycle as DeepSeek-R1, the reasoning-focused LLM whose release on 20 January 2025 drew global attention and contributed to a sharp sell-off in U.S. AI-exposed equities.^[10]^[3] Trade press characterized Janus-Pro as a second demonstration, after R1, that a Chinese open-weight lab could ship competitive frontier-adjacent systems on a fraction of the compute budget assumed by Western incumbents.^[3]^[10] VentureBeat's coverage explicitly linked the timing to "fresh fears of Chinese tech dominance" in AI.^[3] Open licensing of both code and weights amplified that effect by enabling downstream adoption without vendor restrictions.^[4]^[6]

Influence on subsequent unified models

The decoupled-encoder pattern, with a SigLIP-style semantic encoder for understanding and a separate discrete (VQ) or continuous (rectified flow) head for generation, was widely cited in subsequent unified multimodal model proposals during 2025. JanusFlow's acceptance to CVPR 2025 indicates that the rectified-flow variant in particular had a measurable academic footprint.^[2]

What are the limitations of Janus?

Several limitations are visible in the released systems.^[4]^[8]^[11]

Input resolution: all currently released Janus variants accept 384x384 image inputs, which constrains fine-grained understanding of small text in images and high-resolution detail relative to systems that accept higher resolutions natively.^[4]^[7]^[12]
Modest generation resolution and benchmark scope: while Janus-Pro-7B reports strong GenEval and DPG-Bench numbers, those benchmarks focus on prompt-alignment metrics over a finite set of categories; they do not measure aesthetic quality, identity preservation, photorealism in human figures, or long-tail compositional ability comprehensively. TechCrunch's coverage flagged that some of the competitor models cited by DeepSeek are not the most current.^[4]
No video and no audio: the Janus releases are image-and-text only, in contrast to broader multimodal systems that also handle video or audio.^[1]^[6]
Text-overlap and reproduction concerns: the Janus-Pro arXiv listing carries an explicit "substantial text overlap" notice with the earlier Janus paper, which, while not a scientific defect, signals that Janus-Pro is an iterative scaling and data improvement rather than a new architecture.^[11]
Self-reported benchmarks: the most quoted comparisons against DALL-E 3 and Stable Diffusion are reported by DeepSeek itself; independent reproductions exist in the community but tend to be partial and use different evaluation harnesses.^[3]^[13]

How does Janus compare to other models?

Model	Tokenization for generation	Visual encoder	Released	Open weights
Janus-1.3B^[7]	LlamaGen VQ tokenizer	SigLIP-L	Oct 2024	Yes
JanusFlow-1.3B^[2]	Rectified flow head	SigLIP-L	Nov 2024	Yes
Janus-Pro-7B^[12]	LlamaGen VQ tokenizer	SigLIP-L	Jan 2025	Yes
LLaVA	Understanding-only (no generation)	CLIP / SigLIP variants	2023 onward	Yes
Chameleon (Meta)^[1]	Shared VQ tokenizer	Shared VQ tokenizer	2024	Partial
Show-o^[1]	Discrete tokens, masked + AR	MAGVIT-style tokenizer	2024	Yes
DALL-E 3	Closed	Closed	2023	No
Stable Diffusion 3	Latent diffusion with MMDiT	T5 + CLIP text encoders	2024	Weights only

Janus differs from LLaVA in that LLaVA-family models are understanding-only and do not generate images. It differs from Chameleon and Show-o by separating the tokenization stack between the two tasks. It differs from end-to-end diffusion systems such as Stable Diffusion 3 in that Janus treats image generation as part of an autoregressive sequence over an LLM, rather than a separate denoiser conditioned on text embeddings.^[1]^[2]

ELI5: What is Janus in plain terms?

Imagine one brain that can both look at a picture and describe it, and also draw a new picture when you ask for one. Most earlier systems used the same "eyes" for both jobs, which made each job a little worse. Janus gives the brain two sets of eyes: one set is good at understanding what is in a picture, and the other set is good at drawing. Both sets of eyes report to the same brain, so it stays a single model. DeepSeek gave the recipe away for free (open weights), and its bigger version, Janus-Pro-7B, scored higher than DALL-E 3 on a popular picture-accuracy test.^[1]^[13]

References

1. Wu, Chengyue et al., "Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation", arXiv, 2024-10-17. https://arxiv.org/abs/2410.13848. Accessed 2026-05-21.
2. Ma, Yiyang et al., "JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation", arXiv, 2024-11-12 (v1) / 2025-03-24 (v2), accepted CVPR 2025. https://arxiv.org/abs/2411.07975. Accessed 2026-05-21.
3. Franzen, Carl, "DeepSeek unleashes 'Janus Pro 7B' vision model amidst AI stock bloodbath, igniting fresh fears of Chinese tech dominance", VentureBeat, 2025-01-27. https://venturebeat.com/ai/deepseek-unleashes-janus-pro-7b-vision-model-amidst-ai-stock-bloodbath-igniting-fresh-fears-of-chinese-tech-dominance. Accessed 2026-05-21.
4. Wiggers, Kyle, "Viral AI company DeepSeek releases new image model family", TechCrunch, 2025-01-27. https://techcrunch.com/2025/01/27/viral-ai-company-deepseek-releases-new-image-model-family/. Accessed 2026-05-21.
5. arXiv listing page for 2410.13848, "Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation", arXiv, 2024-10-17. https://arxiv.org/abs/2410.13848v1. Accessed 2026-05-21.
6. DeepSeek-AI, "Janus-Series: Unified Multimodal Understanding and Generation Models (repository README)", GitHub. https://github.com/deepseek-ai/Janus. Accessed 2026-05-21.
7. DeepSeek-AI, "deepseek-ai/Janus-1.3B (model card)", Hugging Face, 2024-10-17. https://huggingface.co/deepseek-ai/Janus-1.3B. Accessed 2026-05-21.
8. Wu, Chengyue et al., "Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (HTML version)", arXiv, 2024-10-17. https://arxiv.org/html/2410.13848v1. Accessed 2026-05-21.
9. TheMoonlight Review, "JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation", themoonlight.io, 2024. https://www.themoonlight.io/en/review/janusflow-harmonizing-autoregression-and-rectified-flow-for-unified-multimodal-understanding-and-generation. Accessed 2026-05-21.
10. TechNode Staff, "DeepSeek releases new models Janus-Pro and JanusFlow on Lunar New Year's Eve", TechNode, 2025-01-30. https://technode.com/2025/01/30/deepseek-releases-new-models-janus-pro-and-janusflow-on-lunar-new-years-eve/. Accessed 2026-05-21.
11. Chen, Xiaokang et al., "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling", arXiv, 2025-01-29. https://arxiv.org/abs/2501.17811. Accessed 2026-05-21.
12. DeepSeek-AI, "deepseek-ai/Janus-Pro-7B (model card)", Hugging Face, 2025-01-27. https://huggingface.co/deepseek-ai/Janus-Pro-7B. Accessed 2026-05-21.
13. PromptHub, "DeepSeek Janus-Pro-7B Model Overview and How it Ranks Against DALL-E 3", PromptHub Blog, 2025. https://www.prompthub.us/blog/deepseek-janus-pro-7b-model-overview-and-how-it-ranks-against-dall-e-3. Accessed 2026-05-21.
14. Mwiti, Derrick, "DeepSeek Release Another Open-Source AI Model, Janus Pro", InfoQ, 2025-01-31. https://www.infoq.com/news/2025/01/deepseek-ai-janus/. Accessed 2026-05-21.
