DeepSeek Janus
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,905 words
Janus is a family of open-weight unified multimodal models released by Chinese AI lab DeepSeek between October 2024 and January 2025. The defining design choice across the family is to decouple visual encoding into two independent pathways, one for image understanding and one for image generation, while a single autoregressive Transformer backbone processes the resulting token streams. This architecture addresses the long-standing tension in unified vision-language systems between the high-level semantic features needed for understanding tasks and the fine-grained pixel-level features needed for generation.[^1] The series comprises three named releases: the original Janus at 1.3B parameters (October 2024),[^1] JanusFlow at 1.3B parameters which swaps the discrete image tokenizer for a rectified-flow head (November 2024),[^2] and Janus-Pro at 1B and 7B parameters (January 2025), released days after the viral debut of DeepSeek-R1.[^3][^4]
| Property | Value |
|---|---|
| Developer | DeepSeek-AI[^1] |
| First release (Janus) | 17 October 2024[^5] |
| JanusFlow release | 12 November 2024 (arXiv); 13 November 2024 (repo)[^2][^6] |
| Janus-Pro release | 27 January 2025[^4][^6] |
| Model sizes | 1.3B (Janus, JanusFlow); 1B and 7B (Janus-Pro)[^6][^7] |
| Understanding encoder | SigLIP-Large-Patch16-384[^7][^8] |
| Generation tokenizer | LlamaGen VQ tokenizer, codebook 16,384, 16x downsample[^8] |
| Backbone LLM | DeepSeek-LLM-1.3B / 1.5B / 7B base[^7][^8] |
| Code license | MIT License[^6] |
| Model license | DeepSeek Model License (commercial use permitted)[^6] |
| Primary tasks | Image-to-text understanding and text-to-image generation[^1][^3] |
The original Janus paper, "Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation," was posted to arXiv on 17 October 2024 with identifier 2410.13848.[^5] The authors are Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, and Ping Luo, a team drawn from DeepSeek-AI and academic institutions.[^5] The model card on Hugging Face confirms the same release date and lists DeepSeek-LLM-1.3b-base as the language backbone.[^7] On 20 October 2024, the team pushed a corrective update to the tokenizer configuration, which had been causing degraded generation outputs in the initial release.[^6]
The paper situates Janus against earlier unified multimodal models such as Chameleon and Show-o, which use a single shared visual representation for both understanding and generation, arguing that the granularity mismatch between those two tasks leads to compromise on both.[^1][^8] Janus instead routes input images through a SigLIP encoder for understanding queries and routes generation targets through a discrete VQ tokenizer, while the shared autoregressive transformer reads and writes both kinds of tokens in a single sequence.[^1][^8]
Less than a month after the original release, on 12 November 2024, the same group posted "JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation" to arXiv as 2411.07975.[^2] The first authors of this paper are Yiyang Ma and Xingchao Liu, with the larger DeepSeek vision team including Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, Liang Zhao, Yisong Wang, Jiaying Liu, and Chong Ruan.[^2] JanusFlow was subsequently accepted to CVPR 2025.[^2] The repository tag dates the JanusFlow-1.3B model release at 13 November 2024.[^6]
JanusFlow keeps the decoupled-encoder idea but replaces the discrete VQ tokenizer used for generation with a continuous rectified flow head trained jointly with the autoregressive backbone.[^2][^9] The paper argues that rectified flow "can be straightforwardly trained within the large language model framework, eliminating the need for complex architectural modifications," and adopts two key strategies to improve the unified model: (i) decoupling the understanding and generation encoders, and (ii) aligning their representations during unified training.[^2]
DeepSeek released Janus-Pro on 27 January 2025, just one week after the high-profile launch of the DeepSeek-R1 reasoning model on 20 January 2025.[^4][^10] The accompanying technical report, "Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling," appeared on arXiv as 2501.17811 on 29 January 2025 with authors Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan.[^11] The arXiv listing explicitly notes "substantial text overlap" with the original Janus paper, reflecting that Janus-Pro is best understood as a scaled and re-trained edition of the same architecture rather than a redesign.[^11]
The release was widely covered as part of the broader "DeepSeek moment" of late January 2025. Western press treated Janus-Pro both as a standalone image-model story and as a follow-on to the R1 announcement that had already triggered sharp declines in U.S. AI-related equities.[^3][^10] Coverage by TechCrunch on 27 January noted the model family ranges from 1 billion to 7 billion parameters under an MIT code license that permits commercial use.[^4] TechNode reported the announcement was made in the early hours of 28 January Beijing time, on the eve of the Lunar New Year.[^10] VentureBeat framed the release in the context of the simultaneous "AI stock bloodbath" affecting Nvidia and other U.S. technology stocks.[^3]
The animating claim of the Janus papers is that the visual representation needed for understanding (answering questions about an image, captioning, visual reasoning) and the representation needed for generation (predicting the next pixel-equivalent token from text) are fundamentally different.[^1][^8] Understanding favors a small number of high-dimensional, semantically dense embeddings, of the kind produced by contrastive vision encoders. Generation favors a long sequence of low-level, spatially fine-grained tokens, of the kind produced by discrete image quantizers.[^1] Prior unified systems such as Chameleon and Show-o ran both tasks through a single tokenizer, which the Janus authors argue forces a suboptimal compromise.[^1]
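Janus instead instantiates two encoders in parallel: a SigLIP encoder that produces semantic features for understanding, and a LlamaGen-derived VQ tokenizer that produces discrete codes for generation.[^1][^8]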
Both streams feed a single autoregressive transformer initialised from DeepSeek-LLM-1.3b-base.[^7] At training and inference time the system can switch modes by selecting which adaptor projects an incoming image (for understanding) or by switching the output head from text logits to image-code logits (for generation).[^8]
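The mode switch described above can be pictured as a small routing table. The following is a minimal sketch, not the released implementation: the module names (`siglip_encoder`, `understanding_adaptor`, `vq_tokenizer`, `generation_adaptor`, `text_head`, `image_head`), the `Segment` type, and `TEXT_VOCAB_SIZE` are all hypothetical labels chosen for illustration. Only the codebook size of 16,384 comes from the article.

```python
from dataclasses import dataclass
from typing import List, Tuple

CODEBOOK_SIZE = 16_384     # VQ codebook size quoted in this article
TEXT_VOCAB_SIZE = 100_000  # hypothetical placeholder, not a figure from the papers


@dataclass
class Segment:
    kind: str    # "text", "image_understanding", or "image_generation"
    length: int  # number of sequence positions the segment occupies


def input_pathway(segment: Segment) -> str:
    """Name the (hypothetical) module chain that embeds this segment for the shared LLM."""
    pathways = {
        "text": "text_embedding",
        "image_understanding": "siglip_encoder -> understanding_adaptor",
        "image_generation": "vq_tokenizer -> generation_adaptor",
    }
    if segment.kind not in pathways:
        raise ValueError(f"unknown segment kind: {segment.kind!r}")
    return pathways[segment.kind]


def output_head(mode: str) -> Tuple[str, int]:
    """Return the head name and the size of the logit vector it emits."""
    if mode == "understanding":
        return ("text_head", TEXT_VOCAB_SIZE)
    if mode == "generation":
        return ("image_head", CODEBOOK_SIZE)
    raise ValueError(f"unknown mode: {mode!r}")


def plan(segments: List[Segment], mode: str) -> Tuple[List[str], Tuple[str, int], int]:
    """Describe how one mixed sequence flows through the single shared backbone."""
    routes = [input_pathway(s) for s in segments]
    total_positions = sum(s.length for s in segments)
    return routes, output_head(mode), total_positions


# Understanding: an image plus a question go in, text logits come out.
vqa_plan = plan([Segment("image_understanding", 576), Segment("text", 12)], "understanding")
# Generation: a text prompt goes in, image-code logits come out.
t2i_plan = plan([Segment("text", 20)], "generation")
```

The point of the sketch is that only the edges of the model change between modes. The backbone in the middle sees one token sequence either way.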
The original Janus is trained in three stages: adaptation, unified pretraining, and supervised fine-tuning.[^8]
Training data for the original 1.3B model included ShareGPT4V, ImageNet-1k, WikiHow, WIT, COCO, and LAION-derived corpora, augmented with approximately 2M in-house text-to-image samples.[^8]
JanusFlow keeps the decoupled-encoder structure but changes the generation head. Instead of predicting indices in a discrete codebook, the model uses a rectified flow formulation: it learns the velocity field of an ordinary differential equation (ODE) that transports Gaussian noise to image data, conditioned on the autoregressive context.[^2][^9] A lightweight ConvNeXt-style network provides the per-step velocity field on the generation side; the SigLIP encoder remains for understanding.[^9] The same shared autoregressive transformer drives both pathways. Conceptually, this aligns Janus with the broader move from VQ-token-based image LLMs toward continuous-token or flow matching formulations, while preserving the language-model-style sequence interface.[^2]
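To make the sampling side of rectified flow concrete, here is a minimal sketch that integrates dx/dt = v(x, t) from t = 0 (noise) to t = 1 (data) with fixed-step Euler updates. It rests on several assumptions: the time convention is assumed, the "image" is a three-number vector, and the learned velocity network is replaced by a closed-form toy field that points straight at a single known target. The names `euler_sample` and `make_toy_velocity` are illustrative and do not come from the JanusFlow code.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(velocity: Callable[[Vector, float], Vector],
                 x0: Vector, num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for a learned network: the straight-line field toward one target.

    On the path x_t = (1 - t) * x_0 + t * target, the velocity target - x_0
    can be rewritten as (target - x_t) / (1 - t), which is what is returned here.
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


random.seed(0)
target = [0.5, -1.0, 2.0]                            # toy "data" point
noise = [random.gauss(0.0, 1.0) for _ in target]     # Gaussian starting point
sample = euler_sample(make_toy_velocity(target), noise, num_steps=30)
```

Because the toy trajectories are exactly straight, Euler integration lands on the target up to floating-point error. A learned velocity field on real images has no such guarantee, and the number of steps becomes a quality and speed trade-off.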
The JanusFlow paper highlights that decoupling the understanding and generation encoders and aligning their representations during unified training are crucial to making the rectified-flow head compatible with an autoregressive LLM without requiring major architectural surgery.[^2] Like the original Janus, it is trained in three phases (adaptation, unified pretraining excluding the visual encoder, supervised fine-tuning) and ships at a compact 1.3B parameter scale.[^9]
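The article does not spell out the alignment objective, so the following is an illustration of one common form rather than the paper's loss. It scores how well two sets of per-token feature vectors agree by their mean cosine similarity and negates it, so that lower is better. The function names, the choice of cosine similarity, and the pairing of LLM features with understanding-encoder features are assumptions made for this sketch.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def alignment_loss(llm_features: Sequence[Sequence[float]],
                   encoder_features: Sequence[Sequence[float]]) -> float:
    """Negative mean cosine similarity over paired token positions (hypothetical form)."""
    if len(llm_features) != len(encoder_features) or not llm_features:
        raise ValueError("need the same non-zero number of feature vectors")
    sims = [cosine_similarity(h, z) for h, z in zip(llm_features, encoder_features)]
    return -sum(sims) / len(sims)


# Features that point the same way give a loss near -1; opposite directions give +1.
aligned = alignment_loss([[1.0, 2.0], [0.0, 3.0]], [[2.0, 4.0], [0.0, 1.0]])
opposed = alignment_loss([[1.0, 0.0]], [[-1.0, 0.0]])
```

In a real training setup this term would be added to the main objective with a weight, and the vectors would first be projected to a common dimension. Both details are omitted here.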
The Janus-Pro technical report identifies three changes relative to the original Janus: an optimized training strategy, expanded training data, and scaling to a larger model size.[^11]
According to the Hugging Face model cards, Janus-Pro-1B is built on DeepSeek-LLM-1.5b-base and Janus-Pro-7B is built on DeepSeek-LLM-7b-base; the SigLIP-L vision encoder at 384x384 and the LlamaGen-derived 16x VQ tokenizer are unchanged from the original Janus.[^7][^12] The image input resolution remains 384x384.[^4][^12]
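The figures quoted above fix the size of the generation sequence. As a quick worked calculation, a 384x384 image with 16x spatial downsampling gives a 24x24 grid of codes, and a 16,384-entry codebook needs 14 bits per code. The helper name `vq_token_grid` is illustrative, and the sketch assumes one code per grid cell with no extra special tokens.

```python
import math


def vq_token_grid(image_size: int, downsample: int) -> int:
    """Side length of the code grid for a square image and a given downsampling factor."""
    if image_size % downsample != 0:
        raise ValueError("image size must be divisible by the downsampling factor")
    return image_size // downsample


side = vq_token_grid(384, 16)               # 24
tokens_per_image = side * side              # 576 codes per generated image
bits_per_token = int(math.log2(16_384))     # 14 bits index a 16,384-entry codebook
bits_per_image = tokens_per_image * bits_per_token  # 8064 bits, i.e. 1008 bytes
```

Under these assumptions, generating one image means predicting 576 discrete codes autoregressively, which is why the fixed 384x384 resolution keeps sequences short.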
| Component | Janus (1.3B) | JanusFlow (1.3B) | Janus-Pro (1B / 7B) |
|---|---|---|---|
| Understanding encoder[^7][^8][^9][^12] | SigLIP-L @ 384 | SigLIP-L @ 384 | SigLIP-L @ 384 |
| Generation pathway[^8][^9][^12] | LlamaGen VQ tokenizer (codebook 16,384, 16x) | Rectified flow head with ConvNeXt-style velocity net | LlamaGen VQ tokenizer (codebook 16,384, 16x) |
| LLM backbone[^7][^11][^12] | DeepSeek-LLM-1.3B-base | DeepSeek-LLM-1.3B-base | DeepSeek-LLM-1.5B-base / 7B-base |
| Output for generation[^1][^2][^11] | Discrete image-code logits | Continuous flow vector field | Discrete image-code logits |
| Release[^5][^2][^11] | Oct 2024 (arXiv 2410.13848) | Nov 2024 (arXiv 2411.07975) | Jan 2025 (arXiv 2501.17811) |
The original Janus paper reports that the 1.3B model surpasses comparably sized unified models and approaches or exceeds larger task-specific baselines.[^8] Reported scores include 69.4 on MMBench (versus 64.3 for LLaVA-v1.5-7B) and 87.0 on POPE (versus 73.8 for Show-o).[^8]
For text-to-image, Janus-1.3B is reported at GenEval 61 percent, beating SDXL at 55 percent and DALL-E 2 at 52 percent, and at COCO-30K FID 8.53 versus Show-o at 9.24.[^8]
The Janus-Pro technical report and corroborating third-party coverage report the following GenEval and DPG-Bench numbers for the 7B model:[^12][^13]
| Benchmark | Janus-Pro-7B | DALL-E 3 | SD3-Medium |
|---|---|---|---|
| GenEval overall accuracy[^13] | 0.80 | 0.67 | 0.74 |
| GenEval color alignment[^13] | 0.90 | 0.83 | (not listed) |
| GenEval attribute alignment[^13] | 0.66 | 0.45 | (not listed) |
| GenEval positional alignment[^13] | 0.79 | 0.43 | (not listed) |
| DPG-Bench overall[^13] | 84.19 | 83.50 | (not listed) |
For multimodal understanding, the Janus-Pro 7B report measures average accuracy across POPE, MME-Perception, GQA, and MMMU and claims gains over both the previous Janus generations and several similarly sized task-specific VLMs.[^14][^11]
DeepSeek's own claims that Janus-Pro-7B surpasses DALL-E 3 and Stable Diffusion XL on these benchmarks have been widely repeated in the trade press; reviewers including TechCrunch have noted that the cited competitor models are not all current state-of-the-art and that the 384x384 input resolution is modest.[^4]
All Janus-series models are distributed through the deepseek-ai organisation on Hugging Face and the deepseek-ai/Janus repository on GitHub.[^6][^7][^12] The released checkpoints to date are Janus-1.3B, JanusFlow-1.3B, Janus-Pro-1B, and Janus-Pro-7B.
The repository's code is published under the MIT License, while the model weights are governed by a separate DeepSeek Model License that explicitly permits commercial use.[^6] As of the cited model-card snapshot, the 7B model lists tens of thousands of monthly downloads and a large ecosystem of community fine-tunes and Hugging Face Spaces built on top of it.[^12]
Janus is one of the more visible attempts to show that a single autoregressive transformer can be competitive on both visual understanding and image generation. The decoupled-encoder design is a concrete answer to a research question that earlier unified models, including Chameleon and Show-o, raised but did not fully resolve.[^1][^8] By placing each task on the visual representation best suited to it (semantic for understanding, fine-grained discrete tokens or rectified-flow continuous tokens for generation) while still sharing the language-model trunk, Janus argues that "unified" need not mean "single shared encoder."[^1]
Janus-Pro arrived in the same news cycle as DeepSeek-R1, the reasoning-focused LLM whose release on 20 January 2025 drew global attention and contributed to a sharp sell-off in U.S. AI-exposed equities.[^10][^3] Trade press characterized Janus-Pro as a second demonstration, after R1, that a Chinese open-weight lab could ship competitive frontier-adjacent systems on a fraction of the compute budget assumed by Western incumbents.[^3][^10] VentureBeat's coverage explicitly linked the timing to "fresh fears of Chinese tech dominance" in AI.[^3] Open licensing of both code and weights amplified that effect by enabling downstream adoption without vendor restrictions.[^4][^6]
The decoupled-encoder pattern, with a SigLIP-style semantic encoder for understanding and a separate discrete (VQ) or continuous (rectified flow) head for generation, was widely cited in subsequent unified multimodal model proposals during 2025. JanusFlow's acceptance to CVPR 2025 indicates that the rectified-flow variant in particular had a measurable academic footprint.[^2]
Several limitations are visible in the released systems.[^4][^8][^11]
| Model | Tokenization for generation | Visual encoder | Released | Open weights |
|---|---|---|---|---|
| Janus-1.3B[^7] | LlamaGen VQ tokenizer | SigLIP-L | Oct 2024 | Yes |
| JanusFlow-1.3B[^2] | Rectified flow head | SigLIP-L | Nov 2024 | Yes |
| Janus-Pro-7B[^12] | LlamaGen VQ tokenizer | SigLIP-L | Jan 2025 | Yes |
| LLaVA | Understanding-only (no generation) | CLIP / SigLIP variants | 2023 onward | Yes |
| Chameleon (Meta)[^1] | Shared VQ tokenizer | Shared VQ tokenizer | 2024 | Partial |
| Show-o[^1] | Discrete tokens, masked + AR | MAGVIT-style tokenizer | 2024 | Yes |
| DALL-E 3 | Closed | Closed | 2023 | No |
| Stable Diffusion 3 | Latent diffusion with MMDiT | T5 + CLIP text encoders | 2024 | Weights only |
Janus differs from LLaVA in that LLaVA-family models are understanding-only and do not generate images. It differs from Chameleon and Show-o by separating the tokenization stack between the two tasks. It differs from end-to-end diffusion systems such as Stable Diffusion 3 in that Janus treats image generation as part of an autoregressive sequence over an LLM, rather than a separate denoiser conditioned on text embeddings.[^1][^2]