# T2I-CompBench

> Source: https://aiwiki.ai/wiki/t2i_compbench
> Updated: 2026-06-09
> Categories: AI Benchmarks, Computer Vision
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

## Overview

T2I-CompBench is an [AI benchmark](/wiki/ai_benchmark) for evaluating compositional [text-to-image](/wiki/text-to-image_models) generation. It was introduced in the paper "T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation" by Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu, published in the Datasets and Benchmarks Track of NeurIPS 2023 [1][2]. The work is a collaboration between the University of Hong Kong and Huawei Noah's Ark Lab [2]. The benchmark targets a specific weakness of modern [generative](/wiki/generative_ai) image models: even when individual objects are rendered with high fidelity, models often fail to bind the correct attributes to the correct objects, place objects in the correct spatial relationships, count objects accurately, or compose multiple objects into a coherent scene [1].

The original release consists of 6,000 compositional text prompts spanning 3 categories and 6 sub-categories, together with a set of automatic evaluation metrics tailored to each sub-category and a study of how well those metrics correlate with human judgments [1]. An extended version, T2I-CompBench++, was released in 2025 and published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). It expands the benchmark to 8,000 prompts across 4 categories and 8 sub-categories and adds 3D-spatial relationships and generative numeracy as new dimensions [3][4]. T2I-CompBench and its successor are widely cited in the evaluation tables of text-to-image model reports and serve as standard reference points for compositional faithfulness, alongside [GenEval](/wiki/geneval), [GenAI-Bench](/wiki/genai_bench), and DPG-Bench.

## Motivation: compositionality

Compositionality refers to a model's ability to combine known concepts, such as objects, attributes, and relations, into novel and correct configurations. Text-to-image [diffusion models](/wiki/diffusion_model) such as [Stable Diffusion](/wiki/stable_diffusion) produce visually convincing images, but they frequently make compositional errors that a human would not [1]. Common failure modes include:

- Attribute leakage or mis-binding, where the color, shape, or texture named for one object is applied to a different object (for example, a prompt for "a red book and a yellow vase" yielding a yellow book or a red vase).
- Incorrect spatial relationships, where "on top of," "to the left of," or "below" is ignored or reversed.
- Incorrect counts, where the number of generated objects does not match the requested quantity.
- Failure to render all objects in a complex multi-object prompt, with some entities omitted entirely.

Before T2I-CompBench, evaluation of text-to-image alignment leaned heavily on the [CLIP](/wiki/clip)-based [CLIPScore](/wiki/clip_score), which measures global image-text similarity. The authors argue that CLIPScore is poorly suited to fine-grained compositional checks because it captures overall semantic overlap rather than whether each attribute is bound to the right object or whether a specified relationship holds [1]. T2I-CompBench was designed to provide both a large, structured prompt set and metrics sensitive to these specific compositional properties.

## Structure: categories and prompts

The original T2I-CompBench organizes its 6,000 prompts into three categories of compositionality, each broken into sub-categories. Each of the six sub-categories contains 1,000 prompts, split into 700 for training and 300 for testing, so that the prompt set can support both evaluation and fine-tuning research [1][3].

| Category | Sub-category | Description |
|----------|--------------|-------------|
| Attribute binding | Color | Binding color attributes to the correct objects |
| Attribute binding | Shape | Binding shape attributes to the correct objects |
| Attribute binding | Texture | Binding texture or material attributes to the correct objects |
| Object relationships | Spatial | Relative position of objects (for example, "next to," "on top of") |
| Object relationships | Non-spatial | Action or interaction relationships (for example, "watching," "holding") |
| Complex compositions | Complex | Multiple objects with multiple attributes and relationships in one scene |
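
The prompt counts above can be checked with a few lines of arithmetic. The following is a minimal sketch that encodes only the figures stated in this section; the dictionary keys and the `split_sizes` helper are illustrative names, not identifiers from the benchmark's code.

```python
# Sub-category layout of the original benchmark, as listed in the table above.
# Key names are illustrative, not taken from the official repository.
SUBCATEGORIES = {
    "attribute_binding": ["color", "shape", "texture"],
    "object_relationships": ["spatial", "non_spatial"],
    "complex_compositions": ["complex"],
}

PROMPTS_PER_SUBCATEGORY = 1000
TRAIN_PER_SUBCATEGORY = 700
TEST_PER_SUBCATEGORY = 300


def split_sizes() -> dict:
    """Return overall prompt counts implied by the per-sub-category figures."""
    n_sub = sum(len(names) for names in SUBCATEGORIES.values())
    return {
        "sub_categories": n_sub,
        "total": n_sub * PROMPTS_PER_SUBCATEGORY,
        "train": n_sub * TRAIN_PER_SUBCATEGORY,
        "test": n_sub * TEST_PER_SUBCATEGORY,
    }


print(split_sizes())
# {'sub_categories': 6, 'total': 6000, 'train': 4200, 'test': 1800}
```

The same arithmetic with eight sub-categories gives the 8,000 prompts of T2I-CompBench++.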

Prompts are constructed to be "open-world" in the sense that they combine a broad vocabulary of objects and attributes rather than being restricted to a closed label set, and they are generated through a mix of predefined rules and large-language-model assistance to increase diversity [1].

## Evaluation metrics

A central contribution of T2I-CompBench is a set of automatic metrics matched to each compositional sub-category, intended to align better with human perception than a single global similarity score [1]:

- Disentangled BLIP-VQA, used for the attribute binding categories. Rather than asking a single complex question, the method decomposes a prompt into separate visual question answering queries (one per object-attribute pair) and uses the BLIP visual-question-answering model to check each binding independently, which reduces interference between attributes.
- UniDet-based spatial metric, used for spatial relationships. It applies the UniDet object detector to locate objects and then verifies whether the detected layout satisfies the requested spatial relation. In T2I-CompBench++ this detection-based approach is extended to cover 3D-spatial relationships and numeracy.
- CLIPScore, used for non-spatial relationships, where global image-text similarity is a reasonable proxy for whether an action or interaction is depicted.
- 3-in-1, used for complex compositions. It averages the CLIPScore, Disentangled BLIP-VQA, and UniDet scores to produce a single number for prompts that mix attributes and relationships.
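
The disentangled scoring and the 3-in-1 average can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the VQA model is replaced by a hypothetical stand-in function, the question wording is invented for the example, and combining per-question "yes" probabilities by multiplication is an assumed aggregation rule.

```python
from math import prod
from typing import Any, Callable, Sequence

# A VQA scorer maps (image, question) to the probability of answering "yes".
# In practice this would wrap a BLIP VQA model; here it is a placeholder type.
YesProbability = Callable[[Any, str], float]


def disentangled_vqa_score(
    image: Any,
    noun_phrases: Sequence[str],
    yes_probability: YesProbability,
) -> float:
    """Ask one question per object-attribute phrase and combine the answers.

    Assumption: per-question scores are combined by taking their product,
    so a single failed binding pulls the whole score down.
    """
    questions = [f"{phrase}?" for phrase in noun_phrases]
    return prod(yes_probability(image, q) for q in questions)


def three_in_one(clip_score: float, vqa_score: float, detection_score: float) -> float:
    """Average the three metric scores, as described for complex compositions."""
    return (clip_score + vqa_score + detection_score) / 3.0


def fake_yes_probability(image: Any, question: str) -> float:
    """Hypothetical stand-in for a VQA model, with fixed made-up answers."""
    answers = {"a red book?": 0.9, "a yellow vase?": 0.5}
    return answers[question]


binding = disentangled_vqa_score(
    None, ["a red book", "a yellow vase"], fake_yes_probability
)
print(round(binding, 2))  # 0.45
print(round(three_in_one(0.3, binding, 0.6), 2))  # 0.45
```

Asking about each phrase separately means that a correctly rendered book cannot compensate for a wrongly colored vase, which is the behavior a single global similarity score tends to miss.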

The authors validate these metrics against human ratings and report that Disentangled BLIP-VQA correlates best with humans on attribute binding, the UniDet-based metric correlates best on spatial relationships, and CLIPScore is the strongest choice for non-spatial relationships [1]. The paper also explores using multimodal large language models for evaluation, an idea that is expanded substantially in the follow-up version.
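
Agreement between an automatic metric and human ratings is commonly summarized with a rank correlation. The sketch below computes Kendall's tau (the simple tau-a variant, which leaves tied pairs out of the numerator); the choice of statistic is an assumption made for illustration, since this article does not state which one the authors report, and the sample numbers are made up.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # Pairs tied in either sequence count toward neither total.
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up example: automatic metric scores and human ratings for four images.
metric_scores = [0.2, 0.5, 0.4, 0.9]
human_ratings = [1, 3, 4, 5]
print(round(kendall_tau(metric_scores, human_ratings), 3))  # 0.667
```

A value near 1 means the metric ranks images in nearly the same order as human raters; a value near 0 means the rankings are unrelated.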

Beyond the benchmark itself, the original paper proposes a training method called GORS (Generative mOdel finetuning with Reward-driven Sample selection), which fine-tunes a pretrained text-to-image model on its own generations that score highly on the compositional metrics, weighting the fine-tuning loss by the reward, and shows that this improves compositional performance [1].
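
The idea of reward-driven selection with a reward-weighted loss can be sketched in a few lines. This is a simplified illustration, not the paper's training code: the `threshold` parameter, the averaging over selected samples, and the function name are assumptions, and the per-sample losses stand in for whatever loss the underlying model uses.

```python
from typing import Sequence


def reward_weighted_loss(
    per_sample_losses: Sequence[float],
    rewards: Sequence[float],
    threshold: float,
) -> float:
    """Keep samples whose reward meets the threshold, weight each loss by its reward.

    Assumptions: rewards come from the compositional metrics, selection is a
    simple threshold, and the weighted losses are averaged over kept samples.
    """
    if len(per_sample_losses) != len(rewards):
        raise ValueError("losses and rewards must have the same length")
    selected = [
        (loss, reward)
        for loss, reward in zip(per_sample_losses, rewards)
        if reward >= threshold
    ]
    if not selected:
        return 0.0  # nothing passed selection, so there is nothing to train on
    return sum(loss * reward for loss, reward in selected) / len(selected)


# Made-up values: the second sample is dropped because its reward is too low.
losses = [0.8, 0.6, 1.0]
rewards = [0.9, 0.4, 0.7]
print(round(reward_weighted_loss(losses, rewards, threshold=0.5), 2))  # 0.71
```

Weighting by the reward means that generations judged more compositionally faithful contribute more to the fine-tuning signal than those that barely pass selection.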

## T2I-CompBench++

T2I-CompBench++, titled "T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation," is the journal extension authored by Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu, published in IEEE TPAMI in 2025 [3][4]. It shares the arXiv identifier of the original work (arXiv:2307.06350), with the latest revision posted in 2025 [3].

The enhanced benchmark grows to 8,000 prompts organized into four primary categories (attribute binding, object relationships, generative numeracy, and complex compositions) and eight sub-categories [3][4]. Relative to the original six sub-categories, T2I-CompBench++ adds:

- 3D-spatial relationships, capturing depth and front-to-back arrangements rather than only 2D layout.
- Generative numeracy, testing whether models generate the correct number of objects.

On the metrics side, T2I-CompBench++ refines the detection-based evaluation so that the UniDet-based metric covers 2D-spatial and 3D-spatial relationships as well as numeracy. It also adds evaluation based on multimodal large language models, using models such as GPT-4V and ShareGPT4V with chain-of-thought-style prompting as an alternative scoring approach [3][4]. The paper benchmarks 11 text-to-image models, including recent systems such as FLUX.1, [Stable Diffusion 3](/wiki/stable_diffusion_3), [DALL-E 3](/wiki/dall_e_3), PixArt-alpha, and [SDXL](/wiki/sdxl), against earlier baselines [3][4].
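
Detection-based checks of this kind reduce to simple geometric and counting tests once objects have been located. The sketch below is a loose illustration only: the `Detection` class, the center-based left/right rule, and the per-object `depth` value are hypothetical simplifications, and the actual metrics may apply additional thresholds and obtain depth differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: a label, a bounding box, and a depth value."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    depth: float = 0.0  # assumed convention: smaller means closer to the camera


def center_x(det: Detection) -> float:
    return (det.x_min + det.x_max) / 2.0


def is_left_of(a: Detection, b: Detection) -> bool:
    """2D-spatial check: compare horizontal box centers."""
    return center_x(a) < center_x(b)


def is_in_front_of(a: Detection, b: Detection) -> bool:
    """3D-spatial check: compare the assumed per-object depth values."""
    return a.depth < b.depth


def count_matches(detections: Sequence[Detection], label: str, expected: int) -> bool:
    """Numeracy check: the number of detected objects equals the requested count."""
    return sum(d.label == label for d in detections) == expected


# Made-up detections for a prompt such as "a dog in front of two cats".
dog = Detection("dog", 10, 40, 60, 90, depth=2.0)
cat_a = Detection("cat", 100, 45, 150, 95, depth=5.0)
cat_b = Detection("cat", 160, 50, 210, 100, depth=5.5)
scene = [dog, cat_a, cat_b]

print(is_left_of(dog, cat_a))           # True
print(is_in_front_of(dog, cat_a))       # True
print(count_matches(scene, "cat", 2))   # True
```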

## Significance and use

T2I-CompBench established compositional faithfulness as a measurable axis for text-to-image evaluation and provided ready-to-use prompt sets and metrics that subsequent work could adopt directly. Reported T2I-CompBench scores, especially for color, shape, texture, and spatial binding, appear in the evaluation sections of many image-generation systems, where they complement aesthetic and overall-alignment measures. The benchmark sits within a small family of compositional and prompt-following evaluations that emerged around the same period:

- GenEval, an object-focused benchmark of several hundred prompts that uses object detection and attribute classification to check single objects, two objects, counting, colors, position, and attribute binding.
- DPG-Bench (Dense Prompt Graph Benchmark), which uses roughly a thousand long, information-dense prompts (averaging dozens of words) to stress dense prompt following.
- GenAI-Bench, which broadens evaluation to include verbs, comparisons, and scene-level reasoning.

Compared with these, T2I-CompBench is distinguished by its scale (thousands of prompts), its explicit decomposition into attribute, relationship, and complex-composition categories (with numeracy added in T2I-CompBench++), and its category-specific metrics that the authors calibrated against human ratings [1][3]. The combination of a structured benchmark, automatic metrics, and a fine-tuning recipe has made T2I-CompBench and T2I-CompBench++ frequently cited reference points for measuring and improving compositional generation in systems including [Stable Diffusion](/wiki/stable_diffusion), [DALL-E 3](/wiki/dall_e_3), PixArt, [Imagen](/wiki/imagen), [Midjourney](/wiki/midjourney), and [FLUX.1](/wiki/flux_1).

## References

1. Huang, Kaiyi; Sun, Kaiyue; Xie, Enze; Li, Zhenguo; Liu, Xihui. "T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation." Advances in Neural Information Processing Systems 36 (NeurIPS 2023), Datasets and Benchmarks Track. https://proceedings.neurips.cc/paper_files/paper/2023/hash/f8ad010cdd9143dbb0e9308c093aff24-Abstract-Datasets_and_Benchmarks.html
2. Project page, "T2I-CompBench," University of Hong Kong and Huawei Noah's Ark Lab. https://karine-h.github.io/T2I-CompBench/
3. Huang, Kaiyi; Duan, Chengqi; Sun, Kaiyue; Xie, Enze; Li, Zhenguo; Liu, Xihui. "T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation." arXiv:2307.06350 (v3, 2025). https://arxiv.org/abs/2307.06350
4. Huang, Kaiyi; et al. "T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-Image Generation." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025. https://ieeexplore.ieee.org/document/10847875/
5. T2I-CompBench code repository. https://github.com/Karine-Huang/T2I-CompBench

