# Parti (text-to-image model)

> Source: https://aiwiki.ai/wiki/parti
> Updated: 2026-06-27
> Categories: Generative AI, Google, Image Generation
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Parti** (Pathways Autoregressive Text-to-Image) is a [text-to-image](/wiki/text_to_image) generation model from [Google](/wiki/google) Research that produces images from natural-language descriptions by treating the task as a sequence-to-sequence problem rather than as a denoising process. Described in the paper "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation," posted to arXiv on 22 June 2022, Parti pairs an encoder-decoder Transformer, which generates a sequence of discrete image tokens from a text prompt, with an image tokenizer that decodes those tokens back into pixels. The Transformer is trained with the same next-token prediction recipe used for large language models.[1][2] The work is best known for two contributions: a demonstration that autoregressive image generation keeps improving as the model is scaled to 20 billion parameters (reaching a state-of-the-art zero-shot Fréchet Inception Distance of 7.23 on MS-COCO), and PartiPrompts, a benchmark of more than 1,600 English prompts designed to stress-test text-to-image systems.[1][2][3] Parti was published as a research result only: the code, model weights, and training data were deliberately withheld pending stronger safeguards.[1][2]

## What is Parti?

Parti emerged during a period of rapid progress in text-to-image generation in 2022. It was published roughly a month after [Imagen](/wiki/imagen), another Google model announced in May 2022, and the two systems were explicitly framed as complementary explorations of different generative families: Parti is autoregressive, while Imagen is a [diffusion model](/wiki/diffusion_model). According to the project page, the two are "complementary in exploring two different families of generative models, autoregressive and diffusion, respectively, opening exciting opportunities for combinations of these two powerful models."[1] Earlier autoregressive text-to-image systems, including OpenAI's [DALL-E](/wiki/dall_e) and the CogView models, had established the broad template that Parti built on, but Parti pushed the autoregressive recipe to a substantially larger scale.[3]

The seventeen-author paper was led by Jiahui Yu, with Yonghui Wu and Jason Baldridge among the senior authors. The team came out of Google's research organization, drawing on the same engineering lineage as Google's [Pathways](/wiki/pathways) system and the broader work of [Google Brain](/wiki/google_brain) and [Google Research](/wiki/google_research). The "Pathways" in the model's name reflects that infrastructure heritage rather than a separately released product.[1][4]

## How does Parti work?

The defining feature of Parti is that it casts image generation as a translation-like task. The paper states that "Parti treats text-to-image generation as a sequence-to-sequence modeling problem, akin to machine translation, with sequences of image tokens as the target outputs rather than text tokens in another language."[2] Where a machine-translation model reads a sentence in one language and emits a sentence in another, Parti reads a text prompt and emits a sequence of discrete image tokens, which are then decoded back into pixels. This framing lets the model inherit techniques and intuitions from large language models, particularly the expectation that quality scales with model and data size.[1][2][3]

The system has two main components. First, a Transformer-based image tokenizer called ViT-VQGAN learns to encode an image into a sequence of discrete tokens and to decode those tokens back into a high-fidelity image. ViT-VQGAN therefore serves as both the encoder that produces training targets and the detokenizer that turns generated token sequences back into visible pictures.[1][2][3] Second, a standard encoder-decoder Transformer is trained to map the text prompt (consumed by the encoder) to the corresponding image-token sequence, with the decoder predicting tokens one at a time in autoregressive fashion.[2][3] As the paper puts it, "Our approach is simple: First, Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens."[2]
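The tokenizer's role can be illustrated with a toy vector quantizer. This is a minimal sketch of the general idea of mapping continuous patch embeddings to discrete codebook indices and back. It is not ViT-VQGAN itself: the 2-D embeddings, the four-entry codebook, and the function names are all made up for illustration, whereas the real tokenizer learns its codebook and uses Transformer encoders and decoders.

```python
# Toy vector quantizer: continuous patch embeddings -> discrete token ids -> embeddings.
# CODEBOOK and the patch values are hypothetical; a real tokenizer learns its codebook.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(patch_embeddings, codebook):
    """Map each embedding to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(emb, codebook[i]))
        for emb in patch_embeddings
    ]


def detokenize(token_ids, codebook):
    """Look each token id up in the codebook (the lossy inverse of tokenize)."""
    return [codebook[i] for i in token_ids]


CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patches = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]

tokens = tokenize(patches, CODEBOOK)
reconstruction = detokenize(tokens, CODEBOOK)
print(tokens)          # [0, 3, 2]
print(reconstruction)  # [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
```

The round trip is lossy: each patch is replaced by its nearest codebook entry, which is why the quality of the tokenizer bounds the quality of the final image.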

Because the image is represented as a flat sequence of tokens, the generation problem reduces to next-token prediction, the same objective used to train text language models. That design choice is what allows Parti to benefit directly from the scaling behavior observed in large language modeling, and it is the central methodological difference between Parti and diffusion-based competitors such as Imagen and later systems like [Imagen 2](/wiki/imagen_2).[1][2]
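The next-token framing can be sketched in a few lines of Python. The example below is illustrative only: `toy_next_token_logits` is a hypothetical stand-in for the decoder, the vocabulary size and sequence length are arbitrary toy values, and greedy decoding is used purely to keep the example deterministic. None of this is taken from the Parti paper.

```python
import math

VOCAB_SIZE = 8         # hypothetical toy codebook size
NUM_IMAGE_TOKENS = 6   # hypothetical toy sequence length


def toy_next_token_logits(text_tokens, image_prefix):
    """Stand-in for the decoder: deterministic scores from the text and the prefix."""
    seed = sum(text_tokens) + 3 * len(image_prefix) + sum(image_prefix)
    return [math.sin(seed + 1.7 * v) for v in range(VOCAB_SIZE)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sequence_nll(text_tokens, image_tokens, logits_fn):
    """Training objective: summed negative log-likelihood of each image token
    given the text and all earlier image tokens (teacher forcing)."""
    total = 0.0
    for t, target in enumerate(image_tokens):
        log_probs = log_softmax(logits_fn(text_tokens, image_tokens[:t]))
        total -= log_probs[target]
    return total


def generate_image_tokens(text_tokens, logits_fn, length):
    """Inference: emit one token at a time, feeding each choice back as context."""
    image_tokens = []
    for _ in range(length):
        logits = logits_fn(text_tokens, image_tokens)
        image_tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return image_tokens


PROMPT_TOKENS = [5, 2, 9]  # hypothetical text token ids
image_tokens = generate_image_tokens(PROMPT_TOKENS, toy_next_token_logits, NUM_IMAGE_TOKENS)
loss = sequence_nll(PROMPT_TOKENS, image_tokens, toy_next_token_logits)
print(image_tokens, round(loss, 3))
```

Training and generation share the same conditional distribution: the loss scores a known token sequence under it, while generation builds a new sequence from it token by token. The generated ids would then be passed to the image detokenizer.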

## How large is Parti and how does it scale?

Parti was studied at four sizes so the authors could measure how performance changes with scale. The encoder-decoder Transformer was trained at roughly 350 million, 750 million, 3 billion, and 20 billion parameters.[1][5]

| Model variant | Parameters |
| --- | --- |
| Parti-350M | 350 million |
| Parti-750M | 750 million |
| Parti-3B | 3 billion |
| Parti-20B | 20 billion |

The headline finding was that quality improved consistently as the model grew, with no sign of saturation at the largest size tested.[1][2] The paper reports that the team "achieve consistent quality improvements by scaling the encoder-decoder Transformer model up to 20B parameters, with a new state-of-the-art zero-shot FID score of 7.23 and finetuned FID score of 3.22 on MS-COCO."[2] Human raters also preferred the largest model by wide margins over the 3-billion-parameter version: in side-by-side comparisons, the 20B outputs were favored 63.2 percent of the time for image realism and quality, and 75.9 percent of the time for how well the image matched the text.[1] The gains were most pronounced on prompts that demanded abstract reasoning, specific viewpoints, world knowledge, or the rendering of written text, suggesting that scale helps most on the hardest compositional cases.[1][2]
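For readers unfamiliar with the metric, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception-network features of real and generated images, with lower values indicating a closer match. The formula below is the standard definition rather than anything specific to Parti. The symbols are chosen here for illustration: $\mu_r, \Sigma_r$ are the feature mean and covariance for real images, and $\mu_g, \Sigma_g$ are those for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

"Zero-shot" here means the model is scored on MS-COCO without having been trained on it, while the finetuned score is obtained after further training on that dataset.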

## What can Parti do?

A recurring weakness of early text-to-image models was their inability to spell. Parti's scaling study showed that the largest model handled writing and symbol rendering noticeably better than smaller variants, producing more legible words and signs inside generated scenes.[1][2] More broadly, the authors emphasized content-rich synthesis: the ability to compose images that combine multiple objects in specified relationships, reflect particular artistic styles, respect requested perspectives, and incorporate factual world knowledge.[1][3]

The Parti project page illustrated these capabilities with progressively harder prompts, showing how a single detailed description could be rendered with increasing fidelity as model size grew. The authors framed prompt complexity itself as a useful axis of evaluation, since a model's failures become more informative as prompts move from simple object requests toward elaborate, multi-clause instructions.[1][2]

## What is the PartiPrompts (P2) benchmark?

Alongside the model, the team released PartiPrompts (often abbreviated P2), described in the paper as "a new holistic benchmark of over 1600 English prompts" intended to probe text-to-image systems across a wide range of difficulty.[1][2][3] Rather than scoring only photorealism, P2 is organized along two independent axes. Each prompt is tagged with one of 12 broad content categories (covering subjects such as abstract concepts, animals, vehicles, outdoor scenes, and world knowledge) and one of 11 challenge dimensions (including basic, quantity, writing and symbols, linguistic structures, imagination, fine-grained detail, and complex).[6]

This two-way tagging lets researchers analyze where a model struggles. A prompt like "7 dogs sitting around a poker table, two of which are turning away" falls under the animals category but tests the quantity challenge, while a request for an imaginary landscape might sit in outdoor scenes under the imagination challenge.[6] By separating subject matter from the kind of reasoning a prompt demands, PartiPrompts became a widely cited resource for evaluating later text-to-image models well beyond Parti itself.[6]
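The two-axis analysis can be sketched as a simple grouping step. In the example below, the records and the `preferred` flag are made-up illustration data, not real PartiPrompts ratings, and the function name is hypothetical. It shows only how results can be sliced independently by category or by challenge.

```python
from collections import defaultdict

# Hypothetical side-by-side ratings: (category, challenge, preferred),
# where preferred is True when raters favored the model under test.
ratings = [
    ("animals", "quantity", True),
    ("animals", "quantity", False),
    ("outdoor scenes", "imagination", True),
    ("animals", "basic", True),
]

CATEGORY, CHALLENGE = 0, 1  # index of each tagging axis within a record


def preference_rate_by(records, axis):
    """Fraction of prompts where the model was preferred, grouped along one axis."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        key = record[axis]
        totals[key] += 1
        wins[key] += int(record[2])
    return {key: wins[key] / totals[key] for key in totals}


by_category = preference_rate_by(ratings, CATEGORY)
by_challenge = preference_rate_by(ratings, CHALLENGE)
print(by_category)
print(by_challenge)
```

In this toy data the model looks reasonable on the animals category as a whole, but the challenge view shows that the weak spot is counting, which is the kind of distinction the two-way tagging is meant to surface.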

## How is Parti different from Imagen and diffusion models?

Parti and Imagen were presented about a month apart as parallel research bets, but they take opposite technical routes. Parti is autoregressive: it generates an image one discrete token at a time, conditioning each new token on the prompt and on tokens already produced, much as a language model predicts the next word. Imagen and other diffusion models instead start from random noise and iteratively denoise it into an image over many refinement steps. Because Parti reduces image generation to next-token prediction, it inherits the scaling behavior and tooling of large language models, which is why the authors could push it to 20 billion parameters and watch quality keep climbing.[1][2] The FID results placed autoregressive generation on competitive footing with diffusion at the time, and the project framed the two paradigms as complementary rather than as rivals.[1][2]

## Why is Parti significant?

Parti is significant as a high-water mark for the autoregressive line of text-to-image research and as a clean demonstration that the scaling behavior familiar from language modeling carries over to image generation. Together with Imagen, it showed Google pursuing both major generative paradigms in parallel.[1][2] The reported numbers, including the 7.23 zero-shot FID and the 20-billion-parameter ceiling, became reference points in subsequent text-to-image literature.[3]

Equally influential was the decision not to ship the model. Citing concerns about bias, safety, and the potential for misuse, the authors wrote: "For these reasons, we have decided not to release our Parti models, code, or data for public use without further safeguards in place."[1] They still released the PartiPrompts benchmark to support further research.[1][3] That posture mirrored a broader caution among large research labs around generative imagery in 2022. Within Google, generative-image and multimodal work would continue across the organization and into [Google DeepMind](/wiki/google_deepmind) after its 2023 formation, with PartiPrompts persisting as a standard evaluation set even as the underlying generative approach in many later products shifted toward diffusion.[6]

## References

1. [Parti: Pathways Autoregressive Text-to-Image Model (project page)](https://sites.research.google/parti/) - Google Research.
2. Yu, Jiahui, et al. ["Scaling Autoregressive Models for Content-Rich Text-to-Image Generation."](https://arxiv.org/abs/2206.10789) arXiv:2206.10789, 22 June 2022.
3. ["Google AI Researchers Propose the Pathways Autoregressive Text-to-Image (Parti) Model."](https://www.marktechpost.com/2022/06/30/google-ai-researchers-propose-the-pathways-autoregressive-text-to-image-parti-model-which-generates-high-fidelity-photorealistic-images-and-supports-content-rich-synthesis/) MarkTechPost, 30 June 2022.
4. [google-research/parti (README)](https://github.com/google-research/parti/blob/main/README.md) - GitHub.
5. [Scaling Autoregressive Models for Content-Rich Text-to-Image Generation (paper PDF)](https://gweb-research-parti.web.app/parti_paper.pdf) - Google Research.
6. [P2 (PartiPrompts) Dataset](https://paperswithcode.com/dataset/p2) - Papers With Code.

