Show-o
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,884 words
Show-o is a unified multimodal model, introduced in 2024, that handles both multimodal understanding and visual generation inside a single Transformer. Its distinguishing feature is that one network mixes two different generative paradigms: it predicts text autoregressively, one token at a time, while it generates images through discrete denoising diffusion, that is, by iteratively unmasking image tokens in the style of MaskGIT. The same set of weights therefore answers questions about an image, generates an image from a text prompt, fills in or extends images, and produces interleaved text-and-image output. The model was presented in the paper "Show-o: One Single Transformer to Unify Multimodal Understanding and Generation" by Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou, posted to arXiv on 2024-08-21 and later accepted at ICLR 2025 [1][2].
Most authors were affiliated with Show Lab at the National University of Singapore, with Weihao Wang and Zhijie Chen at ByteDance [2]. The flagship model is relatively small at about 1.3 billion parameters, yet the authors report that it matches or beats larger single-purpose models on a range of understanding and generation benchmarks [1]. Show-o was released within days of Meta's Transfusion, and the two are usually discussed together as contrasting answers to the same question: how to fuse a language model with an image generator in one Transformer. Where Transfusion applies a continuous diffusion loss to image latents, Show-o keeps images as discrete tokens and uses discrete diffusion [1][3].
By 2024, the strongest systems for multimodal understanding and the strongest systems for image generation had diverged in their basic machinery [1]. Understanding-focused multimodal large language models, such as the LLaVA family, are built on autoregressive LLMs that ingest image features and emit text. The best image generators, by contrast, are continuous diffusion models that denoise pixels or latents and have no native notion of language modeling. Combining the two typically meant one of two compromises [1]: either connecting a language model to a separately trained image generator, or forcing images through slow token-by-token autoregression inside a single network.
Show-o's goal was a single Transformer that natively does both jobs while keeping each modality on the generative paradigm best suited to it: autoregression for the inherently sequential, causal structure of text, and (discrete) diffusion for images, whose tokens can be predicted in parallel and refined over a small number of steps [1]. The intended payoff is flexibility, one model for many vision-language tasks, together with the sampling efficiency of parallel image decoding rather than slow token-by-token image generation [1].
Show-o operates entirely on discrete tokens. Text uses the tokenizer of its base language model. Images are quantized into discrete codes by a lookup-free quantization tokenizer of the MAGVIT-v2 type, with a codebook of 8,192 entries; a 256 by 256 input image becomes a 16 by 16 grid of 256 discrete tokens [1]. The Transformer is built on top of Phi-1.5, a 1.3-billion-parameter language model, whose vocabulary is expanded with the 8,192 image codes plus special tokens so that a single embedding table and a single sequence can carry both modalities [1]. Because images are discrete, Show-o never needs the continuous VAE latents and noise-prediction machinery that a continuous diffusion model requires; image generation is reframed as predicting masked discrete tokens.
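The token arithmetic and the shared vocabulary described above can be sketched as follows. This is a minimal illustration, not the released implementation: only the 8,192-entry codebook and the 256-pixel to 16-by-16 mapping come from the text, while `TEXT_VOCAB_SIZE`, `NUM_SPECIAL_TOKENS`, and the choice to place image codes after the text and special tokens are assumptions made for the example.

```python
# Sizes marked "assumption" are placeholders, not the real tokenizer values.
TEXT_VOCAB_SIZE = 50_000      # assumption: size of the base text vocabulary
NUM_SPECIAL_TOKENS = 10       # assumption: number of added special tokens
IMAGE_CODEBOOK_SIZE = 8_192   # from the text: image codebook entries

# Assumed layout: [text ids][special ids][image code ids]
IMAGE_OFFSET = TEXT_VOCAB_SIZE + NUM_SPECIAL_TOKENS
UNIFIED_VOCAB_SIZE = IMAGE_OFFSET + IMAGE_CODEBOOK_SIZE


def tokens_per_image(resolution: int, downsample: int = 16) -> int:
    """Number of discrete tokens for a square image of the given side length.

    A downsampling factor of 16 is implied by 256 pixels mapping to a
    16-by-16 grid.
    """
    side = resolution // downsample
    return side * side


def image_code_to_token_id(code: int) -> int:
    """Map a codebook index to its id in the shared embedding table."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"image code out of range: {code}")
    return IMAGE_OFFSET + code


def token_id_to_image_code(token_id: int) -> int:
    """Inverse of image_code_to_token_id."""
    code = token_id - IMAGE_OFFSET
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"not an image token id: {token_id}")
    return code
```

With this layout, a single embedding table of `UNIFIED_VOCAB_SIZE` rows covers both modalities, and a 256 by 256 image contributes `tokens_per_image(256)` positions to the sequence.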
The central architectural idea is the omni-attention mechanism, a single attention operation whose mask adapts to the modality of each token [1]. Text tokens use causal attention, so each text token attends only to preceding tokens, exactly as in a standard autoregressive LLM. Image tokens use full, bidirectional attention among themselves, which is the natural choice for diffusion-style generation because the whole image is denoised jointly rather than written out left to right. Crucially, image tokens also attend to the text tokens that precede them (so generation is conditioned on the prompt), while text tokens that follow an image can attend back to all of that image's tokens (so understanding is conditioned on the full picture) [1]. When a sequence contains only text, omni-attention reduces exactly to ordinary causal attention, which lets Show-o behave as a plain language model when needed [1].
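The masking rule can be made concrete with a small sketch. The function below is a hypothetical helper, not code from the Show-o repository; it assumes each position is labeled `"text"` or `"image"` and treats every contiguous run of image positions as one image.

```python
from typing import Sequence

import numpy as np


def omni_attention_mask(modality: Sequence[str]) -> np.ndarray:
    """Boolean mask where mask[q, k] is True if query q may attend to key k.

    Assumption: a contiguous run of "image" labels is a single image, so two
    images placed back to back would need a separator to be told apart.
    """
    n = len(modality)
    # Start from ordinary causal attention: every token sees itself and
    # everything before it. This already lets image tokens see the preceding
    # prompt and lets later text see all tokens of an earlier image.
    mask = np.tril(np.ones((n, n), dtype=bool))

    # Open up full bidirectional attention inside each image span.
    i = 0
    while i < n:
        if modality[i] == "image":
            j = i
            while j < n and modality[j] == "image":
                j += 1
            mask[i:j, i:j] = True
            i = j
        else:
            i += 1
    return mask
```

For a text-only sequence the loop never fires and the result is the plain lower-triangular causal mask, which matches the reduction described above.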
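Show-o is trained with two losses, applied to the same sequence according to token type [1]: next-token prediction on text tokens, as in a standard language model, and mask-token prediction on image tokens, in which a subset of image tokens is replaced by a mask token and the model learns to recover the originals.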
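One way to write the two loss terms is sketched below. The notation is an assumption introduced for this illustration: $\mathcal{T}$ is the set of text positions, $\mathcal{M}$ the set of masked image positions, $x_{<i}$ the tokens before position $i$, $\tilde{x}_{\text{img}}$ the partially masked image tokens, and $x_{\text{text}}$ the text tokens that precede the image. The weight $\alpha$ is also an assumption of the sketch; $\alpha = 1$ gives a plain sum.

```latex
\begin{aligned}
\mathcal{L}_{\text{NTP}} &= -\sum_{i \in \mathcal{T}} \log p_\theta\!\left(x_i \mid x_{<i}\right) \\
\mathcal{L}_{\text{MTP}} &= -\sum_{j \in \mathcal{M}} \log p_\theta\!\left(x_j \mid \tilde{x}_{\text{img}},\, x_{\text{text}}\right) \\
\mathcal{L} &= \mathcal{L}_{\text{MTP}} + \alpha\, \mathcal{L}_{\text{NTP}}
\end{aligned}
```

The first term is evaluated only at text positions under causal attention; the second only at masked image positions, where bidirectional attention lets each prediction see every unmasked image token.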
The two terms are summed into a single training objective, so one Transformer learns text autoregression and image discrete diffusion simultaneously [1]. For text-to-image generation at inference, the image region starts fully masked and the tokens are filled in over a small number of parallel decoding steps, keeping the most confident predictions at each step and re-masking the rest, following the MaskGIT decoding scheme [1]. The paper reports that this needs roughly 16 to 20 sampling steps for a 256 by 256 image, far fewer than the hundreds of steps a left-to-right autoregressive image decoder of comparable length would take [1]. Image quality is further improved with classifier-free guidance: during training the text condition is randomly replaced with a null-text condition, and at inference the conditional and unconditional logits are combined into a guided prediction that steers the output toward the prompt [1].
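The decoding loop can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's sampler: it picks tokens greedily instead of sampling, uses a cosine schedule for how many positions stay masked, and combines logits as `(1 + w) * cond - w * uncond`, which is one common form of classifier-free guidance. `logits_fn`, `MASK`, and `toy_logits` are hypothetical names; `toy_logits` stands in for the Transformer.

```python
import math

import numpy as np

MASK = -1  # hypothetical sentinel id for a masked image position


def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def guided_logits(cond, uncond, scale):
    """Assumed guidance form: scale = 0 returns the conditional logits."""
    return (1.0 + scale) * cond - scale * uncond


def maskgit_decode(logits_fn, num_tokens, num_steps, guidance_scale=0.0):
    """Fill `num_tokens` masked positions in `num_steps` parallel steps.

    `logits_fn(tokens, use_text)` must return an array of shape
    (num_tokens, codebook_size); use_text=False is the null-text pass.
    """
    tokens = np.full(num_tokens, MASK, dtype=np.int64)
    for step in range(num_steps):
        masked = tokens == MASK
        cond = logits_fn(tokens, True)
        uncond = logits_fn(tokens, False)
        probs = softmax(guided_logits(cond, uncond, guidance_scale))

        # Confidence of the greedy prediction; decoded positions are pinned
        # at infinity so they are never re-masked.
        confidence = probs.max(axis=-1)
        confidence[~masked] = np.inf
        tokens = np.where(masked, probs.argmax(axis=-1), tokens)

        # Cosine schedule: how many positions remain masked after this step.
        # It reaches zero on the final step, so every position gets decoded.
        frac = math.cos(math.pi / 2 * (step + 1) / num_steps)
        remaining = int(math.floor(frac * num_tokens))
        if remaining > 0:
            tokens[np.argsort(confidence)[:remaining]] = MASK
    return tokens


def toy_logits(tokens, use_text):
    """Stand-in model: fixed random logits over an 8-entry toy codebook."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(tokens), 8))
```

Calling `maskgit_decode(toy_logits, num_tokens=256, num_steps=16)` runs 16 model passes per guidance branch for 256 positions, whereas a left-to-right decoder would need one pass per token.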
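A single Show-o checkpoint supports a wide span of tasks by arranging text and image tokens differently in the sequence [1]: multimodal understanding such as visual question answering and captioning, text-to-image generation, image inpainting and extrapolation, and mixed-modality generation that interleaves text and images.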
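How the same token types can be rearranged per task is sketched below. The special-token strings (`[MMU]`, `[T2I]`, `[SOT]`, `[EOT]`, `[SOI]`, `[EOI]`, `[MASK]`) and both helper functions are assumptions made for this illustration; the text above only states that special tokens exist and that the ordering of text and image tokens differs by task.

```python
def build_sequence(task, text_tokens, image_tokens):
    """Arrange text and image tokens for a task (hypothetical layout).

    Understanding puts the image first so the following text can attend to
    all of it; generation puts the prompt first so the image is conditioned
    on it.
    """
    text = ["[SOT]", *text_tokens, "[EOT]"]
    image = ["[SOI]", *image_tokens, "[EOI]"]
    if task == "understanding":
        return ["[MMU]", *image, *text]
    if task == "generation":
        return ["[T2I]", *text, *image]
    raise ValueError(f"unknown task: {task}")


def mask_span(image_tokens, start, stop, mask_token="[MASK]"):
    """Mask image positions in [start, stop), as an inpainting-style input."""
    return [
        mask_token if start <= i < stop else tok
        for i, tok in enumerate(image_tokens)
    ]
```

Under this sketch, inpainting is the generation layout with only part of the image masked, for example `build_sequence("generation", prompt, mask_span(image, 4, 8))`, while text-to-image generation starts from an image span that is masked everywhere.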
Despite its modest 1.3B size, Show-o reported results competitive with, or better than, larger specialized models [1]. Selected figures from the paper:
| Task | Benchmark | Show-o (~1.3B) |
|---|---|---|
| Understanding | POPE | 73.8 |
| Understanding | MME (perception) | 948.4 |
| Understanding | VQAv2 (test) | 59.3 |
| Understanding | GQA | 48.7 |
| Understanding | MMMU | 25.1 |
| Understanding | Flickr30k (caption) | 36.2 |
| Generation | MS-COCO FID (30K, zero-shot) | 9.24 |
| Generation | GenEval (overall) | 0.53 |
Values as reported in the Show-o paper [1].
The headline efficiency claim is that, because image tokens are produced by parallel discrete diffusion rather than left-to-right autoregression, Show-o generates images with about 20 times fewer sampling steps than an equivalent fully autoregressive model while remaining a capable understanding model [1].
Show-o is best understood alongside two contemporaneous unified models, and the three map cleanly onto a design space [1][3]:
| Model | Image representation | Image objective | Image attention |
|---|---|---|---|
| Chameleon | Discrete tokens | Autoregressive (next token) | Causal |
| Show-o | Discrete tokens | Discrete diffusion (mask token prediction) | Bidirectional |
| Transfusion | Continuous latents | Continuous diffusion (denoising) | Bidirectional |
Chameleon, from Meta, tokenizes images and trains a single Transformer purely autoregressively over the joint vocabulary; it is fully token-based but generates images slowly, one token at a time, under causal attention [3][4]. Show-o keeps the discrete-token representation but swaps the image objective from autoregression to discrete diffusion, so image tokens are unmasked in parallel under bidirectional attention, which is the source of its sampling speedup [1]. Transfusion, released within days of Show-o, takes the other fork: it leaves images as continuous VAE latents and trains them with a continuous diffusion (noise-prediction) loss, avoiding lossy quantization at the cost of needing a separate continuous diffusion pathway [3][5]. In short, Show-o and Transfusion both combine autoregression for text with diffusion for images, but Show-o's diffusion is discrete (over codebook tokens) whereas Transfusion's is continuous (over latents) [1][3][5]. All three are routinely cited together as the canonical points of comparison in surveys of unified multimodal models [3].
Show-o demonstrated that a single, comparatively small Transformer can serve as both a multimodal understanding model and an image generator, and that the generation half does not have to be slow: by casting image synthesis as discrete diffusion (mask-token prediction) rather than token-by-token autoregression, it cut the number of generation steps dramatically while keeping the whole system within one set of weights and one attention operation [1]. The omni-attention design, causal for text and bidirectional for images, became a recognizable template for letting one network host two generative paradigms at once [1][3].
The work has continued. In June 2025 the same group released Show-o2, described as an improved native unified multimodal model that extends the approach to video as well as images. Show-o2 replaces discrete diffusion with flow matching, applied through a flow head; builds unified visual representations on top of a 3D causal variational autoencoder using spatial and spatial-temporal fusion; and keeps autoregressive modeling on the language head. It is trained with a two-stage recipe that scales to larger models [6]. Together, Show-o and Show-o2 are frequently referenced as part of the move toward "native" unified models that learn understanding and generation jointly in one network rather than connecting separately trained components [1][3][6].