Show-o


Updated Jun 8, 2026

Overview

Show-o is a unified multimodal model, introduced in 2024, that handles both multimodal understanding and visual generation inside a single Transformer. Its distinguishing feature is that one network mixes two different generative paradigms: it predicts text autoregressively, one token at a time, while it generates images through discrete denoising diffusion, that is, by iteratively unmasking image tokens in the style of MaskGIT. The same set of weights therefore answers questions about an image, generates an image from a text prompt, fills in or extends images, and produces interleaved text-and-image output. The model was presented in the paper "Show-o: One Single Transformer to Unify Multimodal Understanding and Generation" by Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou, posted to arXiv on 2024-08-21 and later accepted at ICLR 2025 ^[1]^[2].

Most authors were affiliated with Show Lab at the National University of Singapore, with Weihao Wang and Zhijie Chen at ByteDance ^[2]. The flagship model is relatively small at about 1.3 billion parameters, yet the authors report that it matches or beats larger single-purpose models on a range of understanding and generation benchmarks ^[1]. Show-o was released within days of Meta's Transfusion, and the two are usually discussed together as contrasting answers to the same question: how to fuse a language model with an image generator in one Transformer. Where Transfusion attaches a continuous diffusion loss on image latents, Show-o keeps images as discrete tokens and uses discrete diffusion ^[1]^[3].

Motivation: unifying understanding and generation

By 2024, the strongest systems for multimodal understanding and the strongest systems for image generation had diverged in their basic machinery ^[1]. Understanding-focused multimodal large language models, such as the LLaVA family, are built on autoregressive LLMs that ingest image features and emit text. The best image generators, by contrast, are continuous diffusion models that denoise pixels or latents and have no native notion of language modeling. Combining the two typically meant one of two compromises ^[1]:

Treat everything as discrete tokens and run a fully autoregressive model over a joint text-plus-image vocabulary, as in Chameleon. This is clean and uniform, but generating an image one token at a time, scanning left to right over hundreds or thousands of image tokens, is slow.
Bolt a separate pretrained diffusion model onto a language model through adapters or connectors, as in systems like NExT-GPT or SEED. This preserves image quality, but the result is several models stitched together rather than one jointly trained network.

Show-o's goal was a single Transformer that natively does both jobs while keeping each modality on the generative paradigm best suited to it: autoregression for the inherently sequential, causal structure of text, and (discrete) diffusion for images, whose tokens can be predicted in parallel and refined over a small number of steps ^[1]. The intended payoff is flexibility, one model for many vision-language tasks, together with the sampling efficiency of parallel image decoding rather than slow token-by-token image generation ^[1].

How Show-o works

Tokenizing both modalities

Show-o operates entirely on discrete tokens. Text uses the tokenizer of its base language model. Images are quantized into discrete codes by a lookup-free quantization tokenizer of the MAGVIT-v2 type, with a codebook of 8,192 entries; a 256 by 256 input image becomes a 16 by 16 grid of 256 discrete tokens ^[1]. The Transformer is built on top of Phi-1.5, a 1.3-billion-parameter language model, whose vocabulary is expanded with the 8,192 image codes plus special tokens so that a single embedding table and a single sequence can carry both modalities ^[1]. Because images are discrete, Show-o never needs the continuous VAE latents and noise-prediction machinery that a continuous diffusion model requires; image generation is reframed as predicting masked discrete tokens.
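
The token arithmetic and the shared vocabulary described above can be illustrated with a small sketch. It is not the released implementation: the text vocabulary size, the ordering of the id ranges, and the special-token names other than [MASK] are placeholders chosen for illustration, and the downsampling factor of 16 is simply what a 256 by 256 image mapping to a 16 by 16 grid implies.

```python
# Illustrative layout of one shared vocabulary: text ids, then image codes, then specials.
TEXT_VOCAB_SIZE = 50_000       # placeholder, NOT the real tokenizer size
IMAGE_CODEBOOK_SIZE = 8_192    # codebook size stated in the article
SPECIAL_TOKENS = ["[MASK]", "[SOI]", "[EOI]"]  # hypothetical names except [MASK]

IMAGE_OFFSET = TEXT_VOCAB_SIZE
SPECIAL_OFFSET = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE
UNIFIED_VOCAB_SIZE = SPECIAL_OFFSET + len(SPECIAL_TOKENS)


def num_image_tokens(height, width, downsample=16):
    """Count the discrete tokens for an image given the spatial downsampling factor."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be a multiple of the downsampling factor")
    return (height // downsample) * (width // downsample)


def image_code_to_token_id(code):
    """Shift a codebook index into the image range of the shared vocabulary."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError("code outside the image codebook")
    return IMAGE_OFFSET + code


def special_token_id(name):
    """Look up the id of a special token in the shared vocabulary."""
    return SPECIAL_OFFSET + SPECIAL_TOKENS.index(name)


tokens_per_image = num_image_tokens(256, 256)  # 16 x 16 grid
```

Because every image code is just another row in the embedding table, a prompt and an image can sit in the same sequence without any modality-specific input pathway.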

Omni-attention

The central architectural idea is the omni-attention mechanism, a single attention operation whose mask adapts to the modality of each token ^[1]. Text tokens use causal attention, so each text token attends only to preceding tokens, exactly as in a standard autoregressive LLM. Image tokens use full, bidirectional attention among themselves, which is the natural choice for diffusion-style generation because the whole image is denoised jointly rather than written out left to right. Crucially, image tokens also attend to the text tokens that precede them (so generation is conditioned on the prompt), while text tokens that follow an image can attend back to all of that image's tokens (so understanding is conditioned on the full picture) ^[1]. When a sequence contains only text, omni-attention reduces exactly to ordinary causal attention, which lets Show-o behave as a plain language model when needed ^[1].
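
The masking rule can be made concrete with a minimal sketch. It assumes each position is labeled either as text (`None`) or with the index of the image it belongs to; the function name and this labeling scheme are illustrative, not taken from the Show-o code. A query may attend to a key if the key is not in the future (causal), or if both positions belong to the same image (bidirectional within the image).

```python
def omni_attention_mask(segments):
    """Build a boolean attention mask; mask[q][k] is True if query q may attend to key k.

    segments[i] is None for a text token, or an integer image id for an image token.
    """
    n = len(segments)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q  # every token sees itself and everything before it
            same_image = segments[q] is not None and segments[q] == segments[k]
            mask[q][k] = causal or same_image  # image tokens also see later tokens of their image
    return mask


# Two prompt tokens, three tokens of image 0, then one text token after the image.
example_mask = omni_attention_mask([None, None, 0, 0, 0, None])
```

In this example the first image token can attend to the last image token, the text token after the image can attend to all three image tokens, and the prompt tokens cannot see the image that follows them. With no image ids in the input, the result is the usual lower-triangular causal mask.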

Two objectives in one model

Show-o is trained with two losses applied to the same sequence depending on token type ^[1]:

Next Token Prediction (NTP) is the standard language-modeling objective. Text tokens are predicted autoregressively with cross-entropy, maximizing the likelihood of each next text token given everything before it.
Mask Token Prediction (MTP) is the discrete-diffusion objective for images. During training a random subset of image tokens is replaced with a special [MASK] token, and the model must reconstruct the original tokens using all of the text tokens and the unmasked image tokens. This is the discrete analogue of denoising: masking corresponds to the forward (noising) process and prediction to the reverse (denoising) process.
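
As a rough sketch of how the two objectives can share one sequence, the function below takes the probability the model assigned to the correct token at each position and averages the negative log-likelihood separately over text positions (NTP) and over masked image positions (MTP). The function name, the per-term averaging, and the `ntp_weight` parameter are assumptions made for illustration; unmasked image tokens contribute no loss here.

```python
import math


def mean(values):
    """Average a list, returning 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def combined_loss(target_probs, is_text, is_masked_image, ntp_weight=1.0):
    """Sum of MTP and (optionally weighted) NTP cross-entropy terms.

    target_probs[i] is the probability assigned to the correct token at position i.
    """
    ntp_terms = [-math.log(p) for p, t in zip(target_probs, is_text) if t]
    mtp_terms = [-math.log(p) for p, m in zip(target_probs, is_masked_image) if m]
    return mean(mtp_terms) + ntp_weight * mean(ntp_terms)


# Two text tokens, one masked image token, one visible image token (ignored by the loss).
loss = combined_loss(
    target_probs=[0.5, 0.5, 0.25, 0.9],
    is_text=[True, True, False, False],
    is_masked_image=[False, False, True, False],
)
```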

The two terms are summed into a single training objective, so one Transformer learns text autoregression and image discrete diffusion simultaneously ^[1]. For text-to-image generation at inference, the image region starts fully masked and the tokens are filled in over a small number of parallel decoding steps, keeping the most confident predictions at each step and re-masking the rest, following the MaskGIT decoding scheme ^[1]. The paper reports that this needs roughly 16 to 20 sampling steps for a 256 by 256 image, far fewer than the hundreds of steps that a left-to-right autoregressive decoder needs when it emits one image token per step ^[1]. Image quality is further improved with classifier-free guidance: during training the text condition is randomly replaced with a null-text condition, and at inference the conditioned and unconditioned logits are combined into a guided prediction that steers the output toward the prompt ^[1].
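
The decoding loop can be sketched as follows. This is a simplified, greedy illustration rather than the paper's sampler: it assumes a cosine unmasking schedule, picks the argmax token instead of sampling, and combines logits as `(1 + w) * cond - w * uncond`, which is one common form of classifier-free guidance. `logits_fn` and `toy_logits` are hypothetical stand-ins for the model.

```python
import math

MASK = -1  # stand-in id for the [MASK] token


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def maskgit_style_decode(logits_fn, num_tokens, num_steps, guidance_scale=0.0):
    """Fill a fully masked token list over a few parallel steps, most confident first."""
    tokens = [MASK] * num_tokens
    for step in range(num_steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Cosine schedule: how many tokens should remain masked after this step.
        ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        keep_masked = int(math.floor(ratio * num_tokens))
        cond = logits_fn(tokens, True)     # prompt-conditioned logits
        uncond = logits_fn(tokens, False)  # null-text logits
        candidates = []
        for i in masked:
            guided = [(1 + guidance_scale) * c - guidance_scale * u
                      for c, u in zip(cond[i], uncond[i])]
            probs = softmax(guided)
            best = max(range(len(probs)), key=probs.__getitem__)
            candidates.append((probs[best], i, best))
        candidates.sort(reverse=True)  # highest confidence first
        fill = max(1, len(masked) - keep_masked)
        for _, i, token in candidates[:fill]:
            tokens[i] = token  # everything else stays masked for the next step
    return tokens


def toy_logits(tokens, conditioned):
    """Hypothetical model: when conditioned, prefers token (position mod 4)."""
    rows = []
    for i in range(len(tokens)):
        row = [0.0] * 4
        if conditioned:
            row[i % 4] = 2.0
        rows.append(row)
    return rows


image_tokens = maskgit_style_decode(toy_logits, num_tokens=16, num_steps=4, guidance_scale=1.0)
```

With 16 tokens and 4 steps, the schedule commits a few tokens early and progressively more in later steps, so the whole grid is filled in 4 forward passes rather than 16.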

Capabilities and results

A single Show-o checkpoint supports a wide span of tasks by arranging text and image tokens differently in the sequence ^[1]:

Multimodal understanding, such as visual question answering and image captioning, where image tokens are given and text is generated autoregressively.
Text-to-image generation, where a text prompt conditions the parallel unmasking of a fully masked image region.
Text-guided inpainting and extrapolation, where some image tokens are kept fixed and the masked remainder (an erased region or an extended canvas) is regenerated to be consistent with both the kept content and a text instruction, with no task-specific fine-tuning.
Mixed-modality generation, producing interleaved sequences of text and images such as illustrated step-by-step content.
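
Inpainting and extrapolation differ from plain text-to-image generation only in which image tokens start out masked. A minimal sketch of that setup, assuming the image is held as a 2D grid of token ids and using hypothetical helper names:

```python
MASK = -1  # stand-in id for the [MASK] token


def mask_region(grid, top, left, height, width):
    """Inpainting setup: return a copy of the grid with a rectangular region masked."""
    out = [row[:] for row in grid]
    for r in range(top, top + height):
        for c in range(left, left + width):
            out[r][c] = MASK
    return out


def extend_right(grid, extra_cols):
    """Extrapolation setup: return a copy of the grid with masked columns appended."""
    return [row[:] + [MASK] * extra_cols for row in grid]


# A toy 4 x 4 grid of image token ids.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
inpaint_input = mask_region(grid, top=1, left=1, height=2, width=2)
extend_input = extend_right(grid, extra_cols=2)
```

The kept tokens are left untouched, and only the masked positions are handed to the same iterative unmasking procedure used for text-to-image generation.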

Despite the model's modest 1.3B size, the authors report results competitive with, or better than, those of larger specialized models ^[1]. Selected figures from the paper:

Task	Benchmark	Show-o (~1.3B)
Understanding	POPE	73.8
Understanding	MME (perception)	948.4
Understanding	VQAv2 (test)	59.3
Understanding	GQA	48.7
Understanding	MMMU	25.1
Understanding	Flickr30k (caption)	36.2
Generation	MS-COCO FID (30K, zero-shot)	9.24
Generation	GenEval (overall)	0.53

Values as reported in the Show-o paper ^[1]. The headline efficiency claim is that, because image tokens are produced by parallel discrete diffusion rather than left-to-right autoregression, Show-o generates images with about 20 times fewer sampling steps than an equivalent fully autoregressive model while remaining a capable understanding model ^[1].

Relationship to Transfusion and Chameleon

Show-o is best understood alongside two contemporaneous unified models, and the three map cleanly onto a design space ^[1]^[3]:

Model	Image representation	Image objective	Image attention
Chameleon	Discrete tokens	Autoregressive (next token)	Causal
Show-o	Discrete tokens	Discrete diffusion (mask token prediction)	Bidirectional
Transfusion	Continuous latents	Continuous diffusion (denoising)	Bidirectional

Chameleon, from Meta, tokenizes images and trains a single Transformer purely autoregressively over the joint vocabulary; it is fully token-based but generates images slowly, one token at a time, under causal attention ^[3]^[4]. Show-o keeps the discrete-token representation but swaps the image objective from autoregression to discrete diffusion, so image tokens are unmasked in parallel under bidirectional attention, which is the source of its sampling speedup ^[1]. Transfusion, released within days of Show-o, takes the other fork: it leaves images as continuous VAE latents and trains them with a continuous diffusion (noise-prediction) loss, avoiding lossy quantization at the cost of needing a separate continuous diffusion pathway ^[3]^[5]. In short, Show-o and Transfusion both combine autoregression for text with diffusion for images, but Show-o's diffusion is discrete (over codebook tokens) whereas Transfusion's is continuous (over latents) ^[1]^[3]^[5]. All three are routinely cited together as the canonical points of comparison in surveys of unified multimodal models ^[3].

Significance

Show-o demonstrated that a single, comparatively small Transformer can serve as both a multimodal understanding model and an image generator, and that the generation half does not have to be slow: by casting image synthesis as discrete diffusion (mask-token prediction) rather than token-by-token autoregression, it cut the number of generation steps dramatically while keeping the whole system within one set of weights and one attention operation ^[1]. The omni-attention design, causal for text and bidirectional for images, became a recognizable template for letting one network host two generative paradigms at once ^[1]^[3].

The work has continued. In June 2025 the same group released Show-o2, described as an improved native unified multimodal model that extends the approach to video as well as images. Show-o2 replaces discrete-diffusion image generation with flow matching on a dedicated flow head, builds unified visual representations on top of a 3D causal variational autoencoder using spatial and spatial-temporal fusion, and keeps autoregressive modeling on the language head; it is trained with a two-stage recipe that scales to larger models ^[6]. Together, Show-o and Show-o2 are frequently referenced as part of the move toward "native" unified models that learn understanding and generation jointly in one network rather than connecting separately trained components ^[1]^[3]^[6].

References

1. Xie, J., Mao, W., Bai, Z., Zhang, D. J., Wang, W., Lin, K. Q., Gu, Y., Chen, Z., Yang, Z., and Shou, M. Z. "Show-o: One Single Transformer to Unify Multimodal Understanding and Generation." arXiv:2408.12528, 2024-08-21. https://arxiv.org/abs/2408.12528
2. Show Lab, National University of Singapore. "Show-o" (project page). https://showlab.github.io/Show-o/
3. "Towards Unified Multimodal Models: Trends and Insights." ICLR 2025 Blogposts. https://d2jud02ci9yv69.cloudfront.net/2025-04-28-unified-models-47/blog/unified-models/
4. Chameleon Team (Meta FAIR). "Chameleon: Mixed-Modal Early-Fusion Foundation Models." arXiv:2405.09818, 2024. https://arxiv.org/abs/2405.09818
5. Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., and Levy, O. "Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model." arXiv:2408.11039, 2024. https://arxiv.org/abs/2408.11039
6. Xie, J., Yang, Z., and Shou, M. Z., et al. "Show-o2: Improved Native Unified Multimodal Models." arXiv:2506.15564, 2025. https://arxiv.org/abs/2506.15564
