T2I-CompBench
Last reviewed
Jun 8, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,430 words
T2I-CompBench is an AI benchmark for evaluating compositional text-to-image generation, introduced in the paper "T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation" by Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu, and published at the Datasets and Benchmarks Track of NeurIPS 2023 [1][2]. The work is a collaboration between the University of Hong Kong and Huawei Noah's Ark Lab [2]. The benchmark targets a specific weakness of modern generative image models: even when individual objects are rendered with high fidelity, models often fail to bind the correct attributes to the correct objects, place objects in the correct spatial relationships, count objects accurately, or compose multiple objects into a coherent scene [1].
The original release consists of 6,000 compositional text prompts spanning 3 categories and 6 sub-categories, together with a set of automatic evaluation metrics tailored to each sub-category and a study of how well those metrics correlate with human judgments [1]. An extended version, T2I-CompBench++, was released in 2025 and published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). It expands the benchmark to 8,000 prompts across 4 categories and 8 sub-categories, adding 3D-spatial relationships and generative numeracy as new dimensions [3][4]. Both versions are widely cited in the evaluation tables of text-to-image model reports and serve as standard reference points for compositional faithfulness, alongside GenEval, GenAI-Bench, and DPG-Bench.
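Compositionality refers to a model's ability to combine known concepts, such as objects, attributes, and relations, into novel and correct configurations. Text-to-image diffusion models such as Stable Diffusion produce visually convincing images, but they frequently make compositional errors that a human would not [1]. Common failure modes include binding an attribute to the wrong object, placing objects in the wrong spatial relationship, generating the wrong number of objects, and failing to compose multiple objects into a coherent scene [1].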
Before T2I-CompBench, evaluation of text-to-image alignment leaned heavily on the CLIP-based CLIPScore, which measures global image-text similarity. The authors argue that CLIPScore is poorly suited to fine-grained compositional checks because it captures overall semantic overlap rather than whether each attribute is bound to the right object or whether a specified relationship holds [1]. T2I-CompBench was designed to provide both a large, structured prompt set and metrics sensitive to these specific compositional properties.
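As a rough illustration of why a global similarity score can miss binding errors, the sketch below computes a CLIPScore-style value, `w * max(cos(image, text), 0)`, using the weight `w = 2.5` from the CLIPScore formulation. The embeddings are a hypothetical bag-of-words stand-in (`toy_embed` and `VOCAB` are invented for this example and are not CLIP), chosen only to show that a representation insensitive to word order gives identical scores to prompts with swapped attributes.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style value: rescaled cosine similarity, clipped at zero."""
    return w * max(cosine(image_emb, text_emb), 0.0)

# Hypothetical toy vocabulary and embedding; a real setup would use CLIP encoders.
VOCAB = ["red", "yellow", "book", "vase"]

def toy_embed(text):
    """Count vocabulary words, ignoring word order (an assumption for illustration)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]

# Stand-in for the embedding of an image showing a red book and a yellow vase.
image_emb = toy_embed("red book yellow vase")

correct = clip_score(image_emb, toy_embed("a red book and a yellow vase"))
swapped = clip_score(image_emb, toy_embed("a yellow book and a red vase"))
print(correct, swapped)
```

Because the toy embedding discards which attribute goes with which object, both prompts receive the same score. Real CLIP embeddings are not literally bags of words, so this is only an intuition for the limitation the authors describe.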
The original T2I-CompBench organizes its 6,000 prompts into three categories of compositionality, each broken into sub-categories. Each sub-category contains 1,000 prompts, split into 700 for training and 300 for testing, so that the prompt set can support both evaluation and fine-tuning research [1][3].
| Category | Sub-category | Description |
|---|---|---|
| Attribute binding | Color | Binding color attributes to the correct objects |
| Attribute binding | Shape | Binding shape attributes to the correct objects |
| Attribute binding | Texture | Binding texture or material attributes to the correct objects |
| Object relationships | Spatial | Relative position of objects (for example, "next to," "on top of") |
| Object relationships | Non-spatial | Action or interaction relationships (for example, "watching," "holding") |
| Complex compositions | Complex | Multiple objects with multiple attributes and relationships in one scene |
Prompts are constructed to be "open-world" in the sense that they combine a broad vocabulary of objects and attributes rather than being restricted to a closed label set, and they are generated through a mix of predefined rules and large-language-model assistance to increase diversity [1].
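The prompt counts described above can be checked with a few lines of arithmetic. The sketch below encodes the taxonomy from the table as a plain dictionary; the names `TAXONOMY` and `summarize` are illustrative and do not come from the benchmark's code.

```python
# Taxonomy of the original T2I-CompBench, as listed in the table above.
TAXONOMY = {
    "Attribute binding": ["Color", "Shape", "Texture"],
    "Object relationships": ["Spatial", "Non-spatial"],
    "Complex compositions": ["Complex"],
}

PROMPTS_PER_SUBCATEGORY = 1000
TRAIN_PER_SUBCATEGORY = 700
TEST_PER_SUBCATEGORY = 300

def summarize(taxonomy):
    """Derive category, sub-category, and prompt totals from the taxonomy."""
    n_sub = sum(len(subs) for subs in taxonomy.values())
    return {
        "categories": len(taxonomy),
        "sub_categories": n_sub,
        "total": n_sub * PROMPTS_PER_SUBCATEGORY,
        "train": n_sub * TRAIN_PER_SUBCATEGORY,
        "test": n_sub * TEST_PER_SUBCATEGORY,
    }

summary = summarize(TAXONOMY)
print(summary)
```

Six sub-categories at 1,000 prompts each give the 6,000 total, with 4,200 training prompts and 1,800 test prompts overall.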
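A central contribution of T2I-CompBench is a set of automatic metrics matched to each compositional sub-category, intended to align better with human perception than a single global similarity score [1]. These are Disentangled BLIP-VQA for attribute binding, a UniDet-based detection metric for spatial relationships, CLIPScore for non-spatial relationships, and a 3-in-1 metric that averages the three for complex compositions [1].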
The authors validate these metrics against human ratings and report that Disentangled BLIP-VQA correlates best with humans on attribute binding, the UniDet-based metric correlates best on spatial relationships, and CLIPScore is the strongest choice for non-spatial relationships [1]. The paper also explores using multimodal large language models for evaluation, an idea that is expanded substantially in the follow-up version.
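To make the attribute-binding metric more concrete: Disentangled BLIP-VQA, as described in the paper, asks a separate yes/no question for each noun phrase in the prompt and multiplies the probabilities of a "yes" answer. The sketch below shows only that aggregation step. The VQA model is replaced by a hypothetical stub (`fake_yes_probability`) with made-up probabilities, the noun phrases are supplied by hand rather than parsed, and the exact question wording is an assumption.

```python
from math import prod

def disentangled_vqa_score(noun_phrases, yes_probability):
    """Multiply P("yes") over one question per noun phrase.

    yes_probability is any callable mapping a question string to a
    probability in [0, 1]; in practice it would wrap a VQA model and an image.
    """
    questions = [f"{phrase}?" for phrase in noun_phrases]
    return prod(yes_probability(question) for question in questions)

def fake_yes_probability(question):
    """Hypothetical stand-in for a VQA model; the values are invented."""
    table = {"a green bench?": 0.9, "a red car?": 0.2}
    return table[question]

# Prompt: "a green bench and a red car", where the car was rendered green.
score = disentangled_vqa_score(["a green bench", "a red car"], fake_yes_probability)
print(round(score, 2))
```

Multiplying the per-phrase probabilities means a single misbound attribute pulls the whole score down, which is the behavior a global similarity score lacks.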
Beyond the benchmark itself, the original paper proposes a training method called GORS (Generative mOdel finetuning with Reward-driven Sample selection), which fine-tunes a pretrained text-to-image model on its own generations that score highly on the compositional metrics, weighting the fine-tuning loss by the reward, and shows that this improves compositional performance [1].
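The two ingredients of GORS described above, reward-driven sample selection and a reward-weighted loss, can be sketched without any model code. In the example below, each generated sample carries a compositional reward and a placeholder per-sample denoising loss. The field names, the threshold value, and the use of `>=` for selection are assumptions made for illustration, not details taken from the paper's implementation.

```python
def select_samples(samples, threshold):
    """Keep only generated samples whose reward meets the threshold."""
    return [s for s in samples if s["reward"] >= threshold]

def reward_weighted_loss(selected):
    """Average per-sample loss, each term scaled by that sample's reward."""
    if not selected:
        raise ValueError("no samples passed the reward threshold")
    return sum(s["reward"] * s["loss"] for s in selected) / len(selected)

# Hypothetical generations: "loss" stands in for the denoising loss a
# diffusion model would compute on each (image, prompt) pair.
samples = [
    {"prompt": "a red book and a yellow vase", "reward": 0.95, "loss": 0.10},
    {"prompt": "a red book and a yellow vase", "reward": 0.40, "loss": 0.30},
    {"prompt": "a brown dog on a green bench", "reward": 0.80, "loss": 0.20},
]

selected = select_samples(samples, threshold=0.7)
loss = reward_weighted_loss(selected)
print(len(selected), round(loss, 4))
```

In an actual fine-tuning loop the weighted loss would be backpropagated through the text-to-image model; here it is reduced to plain arithmetic to show how low-reward generations are dropped and high-reward ones contribute more.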
T2I-CompBench++, titled "T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation," is the journal extension authored by Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu, published in IEEE TPAMI in 2025 [3][4]. It shares the arXiv identifier of the original work (arXiv:2307.06350), with the latest revision posted in 2025 [3].
The enhanced benchmark grows to 8,000 prompts organized into four primary categories (attribute binding, object relationships, generative numeracy, and complex compositions) and eight sub-categories [3][4]. Relative to the original six sub-categories, T2I-CompBench++ adds 3D-spatial relationships and generative numeracy [3][4].
On the metrics side, T2I-CompBench++ refines the detection-based evaluation so that the UniDet-based metric covers 2D-spatial and 3D-spatial relationships as well as numeracy. It also adds evaluation based on multimodal large language models, using models such as GPT-4V and ShareGPT4V with chain-of-thought-style prompting as an alternative scoring approach [3][4]. The paper benchmarks 11 text-to-image models, including recent systems such as FLUX.1, Stable Diffusion 3, DALL-E 3, PixArt-alpha, and SDXL, against earlier baselines [3][4].
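The detection-based 2D-spatial check mentioned above can be pictured as a comparison of bounding-box centers. The sketch below is a deliberately simplified version, not the benchmark's actual scoring rule: it assumes boxes are already available as `(x_min, y_min, x_max, y_max)` in image coordinates with y growing downward, handles only left/right/top/bottom, and omits any overlap or distance thresholds. All function names are hypothetical.

```python
def center(box):
    """Center point of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)

def spatial_relation(box_a, box_b):
    """Describe where box_a sits relative to box_b along the dominant axis."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "left" if dx < 0 else "right"
    # Image coordinates: a smaller y value is higher in the image.
    return "top" if dy < 0 else "bottom"

def spatial_score(box_a, box_b, expected):
    """1.0 if the detected relation matches the prompt's relation, else 0.0."""
    return 1.0 if spatial_relation(box_a, box_b) == expected else 0.0

# Hypothetical detections for the prompt "a dog on the left of a bench".
dog_box = (10, 50, 60, 100)
bench_box = (120, 40, 220, 110)

relation = spatial_relation(dog_box, bench_box)
score = spatial_score(dog_box, bench_box, expected="left")
print(relation, score)
```

A real pipeline would first run an object detector to obtain the boxes and would need extra rules for relations such as "next to" and for 3D-spatial terms, which depend on depth rather than image-plane position.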
T2I-CompBench established compositional faithfulness as a measurable axis for text-to-image evaluation and provided ready-to-use prompt sets and metrics that subsequent work could adopt directly. Reported T2I-CompBench scores, especially for color, shape, texture, and spatial binding, appear in the evaluation sections of many image-generation systems, where they complement aesthetic and overall-alignment measures. The benchmark sits within a small family of compositional and prompt-following evaluations that emerged around the same period, including GenEval, GenAI-Bench, and DPG-Bench.
Compared with these, T2I-CompBench is distinguished by its scale (thousands of prompts), its explicit decomposition into attribute, relationship, numeracy, and complex-composition categories, and its category-specific metrics that the authors calibrated against human ratings [1][3]. The combination of a structured benchmark, automatic metrics, and a fine-tuning recipe has made T2I-CompBench and T2I-CompBench++ frequently cited reference points for measuring and improving compositional generation in systems including Stable Diffusion, DALL-E 3, PixArt, Imagen, Midjourney, and FLUX.1.