GenAI-Bench

Updated Jun 28, 2026

GenAI-Bench is an AI benchmark for evaluating compositional text-to-image and text-to-video generation, introduced in 2024 by researchers from Carnegie Mellon University and Meta AI ^[1]^[2]. It consists of about 1,600 text prompts collected from professional graphic designers, each tagged with the compositional reasoning skills it exercises, together with large collections of human ratings of images and videos produced by leading generative models. The benchmark tests whether generated visuals actually match prompts that involve attributes, relationships, counting, comparison, and logical operators such as negation, rather than just whether a model can produce a photorealistic image of a single object.

GenAI-Bench is closely associated with VQAScore, an automatic alignment metric proposed in a companion paper, "Evaluating Text-to-Visual Generation with Image-to-Text Generation" ^[2]. VQAScore scores how well an image matches a caption by asking a visual question answering model how likely it is to answer "Yes" to the question "Does this figure show '{text}'?". On GenAI-Bench, VQAScore correlates with human judgments substantially better than earlier metrics such as CLIPScore, making the benchmark both a test set for generative models and a yardstick for the automatic metrics used to evaluate them. The GenAI-Bench paper received the Best Short Paper award at the CVPR 2024 SynData workshop ^[1]^[3].

What is GenAI-Bench?

GenAI-Bench is a holistic benchmark for compositional text-to-visual generation that pairs roughly 1,600 designer-written prompts with a skill taxonomy and more than 80,000 human alignment ratings ^[1]. It was released in June 2024 (arXiv:2406.13743) by Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, and Deva Ramanan ^[1]. The benchmark has two jobs at once: it ranks how faithfully image and video generators follow complex prompts, and it serves as ground truth for ranking the automatic metrics that score those generators.

Why was GenAI-Bench created?

By 2024, diffusion-based systems such as Stable Diffusion, DALL-E 3, and Midjourney could render highly photorealistic images, yet they often failed on prompts that combined multiple concepts in specific ways. The GenAI-Bench authors frame this as a gap between perceptual quality, which had improved rapidly, and faithfulness to complex compositional prompts, which had not ^[1]. The paper notes that current models "still struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison" ^[1].

The problem is compounded by the metrics used to measure alignment. CLIPScore, derived from CLIP, measures the cosine similarity between an image embedding and a text embedding. Because CLIP encodes text largely as a "bag of words," CLIPScore is insensitive to word order and structure: it cannot reliably distinguish "the moon is over the cow" from "the cow is over the moon," and it struggles with attribute binding, counting, and negation ^[3]. As a result, a metric can assign a high score to an image that ignores the relational or logical content of a prompt. Evaluating compositional generation therefore requires two things at once: prompts hard enough to expose these failures, and metrics faithful enough to detect them. GenAI-Bench was built to supply the prompts and the human ratings needed to study both.
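The word-order problem can be illustrated with a deliberately simplified sketch. The encoder below is a literal bag-of-words counter, not CLIP, and the image embedding is a made-up vector; both are assumptions chosen only to show why an order-insensitive text representation gives two different captions the same cosine similarity.

```python
import math
from collections import Counter

def bag_of_words_embedding(text, vocabulary):
    """Toy text encoder: one count per vocabulary word, so word order is discarded."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vocabulary = ["the", "moon", "is", "over", "cow"]
caption_a = "the moon is over the cow"
caption_b = "the cow is over the moon"

# Hypothetical image embedding in the same toy space (not produced by any real model).
image_embedding = [0.9, 0.4, 0.1, 0.7, 0.3]

score_a = cosine_similarity(image_embedding, bag_of_words_embedding(caption_a, vocabulary))
score_b = cosine_similarity(image_embedding, bag_of_words_embedding(caption_b, vocabulary))

# Both captions contain the same words, so the toy encoder cannot tell them apart.
print(score_a == score_b)
```

Real CLIP text encoders are not strictly order-blind, so the effect in practice is a strong tendency rather than the exact tie shown here.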

What does GenAI-Bench contain?

GenAI-Bench centers on roughly 1,600 prompts sourced from graphic designers who routinely use text-to-image tools, which the authors argue makes the prompts more natural and more demanding than synthetically templated alternatives such as T2I-CompBench or the PartiPrompts set ^[1]^[3]. Each prompt is annotated with the visio-linguistic skills it requires, grouped into two tiers:

Skill tier	Skills covered
Basic	Objects, scenes, attributes (such as color, shape, material), and relationships (spatial, action, and part relations)
Advanced	Counting, comparison, differentiation, and logic (including negation and universality)

The advanced tier is the distinguishing feature: it targets higher-order reasoning that earlier compositional benchmarks largely omitted, such as "exactly three", comparative phrasing, telling near-identical objects apart, and statements that require something to be absent ^[1]^[3].
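One way to picture the two-tier tagging is as a small record type. The sketch below is hypothetical: the `TaggedPrompt` class, the tag spellings, the example prompts, and the rule that any advanced tag makes a prompt "advanced" are all assumptions for illustration, not the released dataset schema.

```python
from dataclasses import dataclass

# Tag names mirror the skill table; the exact spellings are assumed.
BASIC_SKILLS = frozenset({"object", "scene", "attribute", "relationship"})
ADVANCED_SKILLS = frozenset({"counting", "comparison", "differentiation", "logic"})

@dataclass(frozen=True)
class TaggedPrompt:
    text: str
    skills: frozenset

def tier(prompt):
    """Assumed rule: a prompt is 'advanced' if it carries at least one advanced tag."""
    return "advanced" if prompt.skills & ADVANCED_SKILLS else "basic"

# Invented example prompts, not taken from the benchmark.
prompts = [
    TaggedPrompt("a red bicycle leaning against a brick wall",
                 frozenset({"object", "attribute", "relationship"})),
    TaggedPrompt("exactly three apples on a table, none of them green",
                 frozenset({"object", "attribute", "counting", "logic"})),
]

for p in prompts:
    print(tier(p), "|", p.text)
```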

To turn the prompts into an evaluation set, the authors generated outputs from a range of models and collected human alignment ratings on a 1-to-5 Likert scale, with multiple annotators per image-text pair ^[3]. The image side covers six text-to-image systems (Stable Diffusion v2.1, SD-XL, SD-XL Turbo, DeepFloyd-IF, Midjourney v6, and DALL-E 3), and the video side covers four text-to-video systems (ModelScope, Floor33, Pika v1, and Runway Gen-2) ^[1]. A separate split, GenAI-Rank, supports the study of best-of-N selection: it pairs 800 prompts with 9 images per prompt from each of DALL-E 3 and SD-XL ^[1]. The authors report roughly 38,400 ratings for the main GenAI-Bench evaluation and about 43,200 ratings for the GenAI-Rank images, for a total of more than 80,000 ratings ^[1].
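The reported counts can be checked with a few lines of arithmetic. The rating totals are the approximate figures quoted above; the ratings-per-image value is an inference that assumes ratings are spread evenly over the GenAI-Rank images, which the article does not state explicitly.

```python
# Figures quoted in the article (approximate, as reported).
main_ratings = 38_400
rank_ratings = 43_200

# GenAI-Rank layout: 800 prompts, 9 images per prompt, for each of 2 models.
rank_prompts = 800
images_per_prompt = 9
rank_models = 2

rank_images = rank_prompts * images_per_prompt * rank_models
total_ratings = main_ratings + rank_ratings

# Assumption: every GenAI-Rank image received the same number of ratings.
ratings_per_rank_image = rank_ratings / rank_images

print(rank_images)             # 14400
print(total_ratings)           # 81600
print(ratings_per_rank_image)  # 3.0
```

The total of 81,600 is consistent with the "more than 80,000" figure.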

What is VQAScore?

VQAScore is the alignment metric developed alongside GenAI-Bench and presented in "Evaluating Text-to-Visual Generation with Image-to-Text Generation" by Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan, published at the European Conference on Computer Vision (ECCV) 2024 ^[2]. Rather than comparing separate image and text embeddings, VQAScore reframes alignment as a question-answering problem. As the paper describes it, VQAScore "uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a 'Yes' answer to a simple 'Does this figure show "{text}"?' question" ^[2]. The alignment score is that "Yes" probability, computed with a generative vision-language model ^[2]^[3].
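The scoring rule can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `toy_vqa_model` is a hypothetical stand-in that returns logits over a two-token answer vocabulary, whereas a real VQA model would produce logits over its full vocabulary from actual pixels.

```python
import math

QUESTION_TEMPLATE = 'Does this figure show "{text}"?'

def build_question(text):
    """Fill the caption into the question template quoted in the article."""
    return QUESTION_TEMPLATE.format(text=text)

def yes_probability(answer_logits):
    """Softmax over the answer logits, returning the probability mass on 'Yes'."""
    max_logit = max(answer_logits.values())
    exps = {tok: math.exp(logit - max_logit) for tok, logit in answer_logits.items()}
    return exps["Yes"] / sum(exps.values())

def vqascore(image, text, vqa_model):
    """Alignment score = P('Yes' | image, question) under the supplied model."""
    return yes_probability(vqa_model(image, build_question(text)))

def toy_vqa_model(image, question):
    # Hypothetical stand-in: the "image" is a dict naming what it depicts,
    # and the model favors "Yes" only when that description appears in the question.
    if image["depicts"] in question:
        return {"Yes": 2.0, "No": 0.0}
    return {"Yes": 0.0, "No": 2.0}

image = {"depicts": "the cow is over the moon"}
score_match = vqascore(image, "the cow is over the moon", toy_vqa_model)
score_mismatch = vqascore(image, "the moon is over the cow", toy_vqa_model)
print(score_match > score_mismatch)
```

Because the score is a probability, it is bounded between 0 and 1 and can be compared across prompts without further normalization.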

To compute this probability accurately, the authors trained an in-house model called CLIP-FlanT5, built on the FLAN-T5 encoder-decoder. Its key design choice is a bidirectional image-question encoder, which lets the image features depend on the question being asked and the question features depend on the image, in contrast to the unidirectional attention used by many decoder-only multimodal models ^[2]^[3]. The paper reports that VQAScore, even when computed with off-the-shelf VQA models, achieves state-of-the-art correlation with human judgments across roughly eight image-text alignment benchmarks, and that the CLIP-FlanT5 variant outperforms much larger proprietary systems, including pipelines that rely on GPT-4V, without using any human feedback at scoring time ^[2]^[3]. On the Winoground compositional reasoning benchmark, the authors report VQAScore improving over CLIPScore by roughly 5x on basic skills and about 10x on advanced skills ^[2].

How is GenAI-Bench used?

GenAI-Bench is used in two complementary ways: to rank generative models and to rank the metrics that score them.

For models, the human ratings expose where current systems break down, showing that all of the evaluated image and video models degrade on advanced compositional skills relative to basic ones, with the largest drops on counting, differentiation, and logic ^[1]. For metrics, the benchmark serves as ground truth for measuring how well each automatic scorer agrees with human raters. The authors report that VQAScore substantially outperforms CLIPScore and also exceeds learned human-preference metrics such as PickScore, HPSv2, and ImageReward on GenAI-Bench prompts, especially those requiring advanced reasoning ^[1]^[3].
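Ranking metrics against human ratings amounts to measuring how often a metric orders pairs of images the same way people do. The sketch below uses a simplified pairwise agreement statistic that skips ties; it is one plausible choice and not necessarily the protocol used in the paper, and all scores are invented toy values.

```python
from itertools import combinations

def pairwise_agreement(metric_scores, human_ratings):
    """Fraction of untied item pairs that the metric orders the same way as humans."""
    agree = 0
    comparable = 0
    for i, j in combinations(range(len(human_ratings)), 2):
        human_diff = human_ratings[i] - human_ratings[j]
        metric_diff = metric_scores[i] - metric_scores[j]
        if human_diff == 0 or metric_diff == 0:
            continue  # simplified handling: ignore tied pairs
        comparable += 1
        if (human_diff > 0) == (metric_diff > 0):
            agree += 1
    return agree / comparable if comparable else float("nan")

# Toy data: mean human Likert ratings for four images, and two hypothetical metrics.
human = [4.7, 1.3, 3.0, 2.3]
metric_a = [0.91, 0.12, 0.55, 0.40]  # preserves the human ordering
metric_b = [0.31, 0.30, 0.29, 0.33]  # nearly flat, partly reversed

print(pairwise_agreement(metric_a, human))  # 1.0
print(pairwise_agreement(metric_b, human))  # 0.5
```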

A further use is improving generation without retraining. Because VQAScore provides a reliable alignment signal that needs no fine-tuning, it can be used for black-box best-of-N selection: generating a handful of candidate images (the paper uses 3 to 9) and keeping the one VQAScore ranks highest ^[1]^[3]. The authors report that ranking by VQAScore is roughly 2x to 3x more effective than ranking by PickScore, HPSv2, or ImageReward at raising human alignment ratings for DALL-E 3 and SD-XL outputs, which is the motivation behind the GenAI-Rank split ^[1]. VQAScore and its associated tooling have since been adopted in industry practice; Google's Imagen 3 technical report cites VQAScore as a stronger replacement for CLIPScore in automated text-to-image evaluation ^[3].
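Best-of-N selection itself is a short procedure: score every candidate against the prompt and keep the highest-scoring one. In the sketch below, `score_fn` is a placeholder for any alignment scorer such as VQAScore, and the candidate names and scores are invented for illustration.

```python
def best_of_n(prompt, candidates, score_fn):
    """Return (best_candidate, best_score); ties go to the earliest candidate."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scores = [score_fn(image, prompt) for image in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]

# Hypothetical precomputed alignment scores standing in for a real scorer.
toy_scores = {"candidate_0.png": 0.42, "candidate_1.png": 0.87, "candidate_2.png": 0.63}

def toy_score_fn(image, prompt):
    return toy_scores[image]

best, score = best_of_n("three cats, none of them black", list(toy_scores), toy_score_fn)
print(best, score)
```

The generator is treated as a black box here: only its outputs are scored, so no model weights or gradients are needed.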

How does GenAI-Bench compare to other generative-media benchmarks?

GenAI-Bench sits within a family of benchmarks and metrics aimed at compositional faithfulness in generative media, and it is best understood by contrast with them.

T2I-CompBench evaluates compositional text-to-image generation across categories such as attribute binding, object relationships, and complex compositions, relying largely on templated prompts and a mix of automatic metrics. GenAI-Bench instead emphasizes natural prompts written by designers and adds an explicit advanced-skill tier covering logic and comparison ^[1].
GenEval is an object-focused framework that uses an object detector to check properties like object count, color, and spatial position against the prompt. It is precise but narrow, whereas GenAI-Bench targets broader open-ended prompts and higher-order reasoning that object detection alone cannot verify ^[1].
Winoground is a discriminative benchmark: it tests whether a vision-language model can match two captions to two images that contain the same words in different arrangements. VQAScore is evaluated on Winoground as a metric, while GenAI-Bench applies the same compositional concerns to the generative setting of producing images and videos from prompts ^[2].
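The Winoground matching task can be made concrete with the pass/fail rules commonly described for it, given here as a sketch rather than the official evaluation code. Each example has two captions and two images, with caption 0 belonging to image 0 and caption 1 to image 1; the score values below are invented.

```python
def winoground_outcomes(s):
    """s[(caption, image)] is an alignment score; (0, 0) and (1, 1) are the correct pairs."""
    # Text score: for each image, the correct caption beats the wrong caption.
    text_ok = s[(0, 0)] > s[(1, 0)] and s[(1, 1)] > s[(0, 1)]
    # Image score: for each caption, the correct image beats the wrong image.
    image_ok = s[(0, 0)] > s[(0, 1)] and s[(1, 1)] > s[(1, 0)]
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}

# A scorer that is sensitive to word order separates the pairs.
order_aware = {(0, 0): 0.9, (1, 0): 0.4, (0, 1): 0.3, (1, 1): 0.8}
# A scorer that ignores word order gives every pairing the same score and fails.
order_blind = {(0, 0): 0.5, (1, 0): 0.5, (0, 1): 0.5, (1, 1): 0.5}

print(winoground_outcomes(order_aware))
print(winoground_outcomes(order_blind))
```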

Compared with these, GenAI-Bench's distinguishing contributions are its designer-sourced prompts, its skill taxonomy spanning both basic and advanced reasoning, its coverage of both text-to-image and text-to-video models, and its large pool of human ratings that lets it benchmark automatic metrics rather than only models ^[1]^[3]. The benchmark and the VQAScore metric are distributed together through the project's open-source evaluation toolkit (t2v_metrics), which lets researchers score text-to-image, text-to-video, and text-to-3D outputs with a single interface ^[3].

References

1. Li, B., Lin, Z., Pathak, D., Li, J., Fei, Y., Wu, K., Ling, T., Xia, X., Zhang, P., Neubig, G., Ramanan, D. (2024). "GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation." arXiv:2406.13743. https://arxiv.org/abs/2406.13743
2. Lin, Z., Pathak, D., Li, B., Li, J., Xia, X., Neubig, G., Zhang, P., Ramanan, D. (2024). "Evaluating Text-to-Visual Generation with Image-to-Text Generation." European Conference on Computer Vision (ECCV) 2024. arXiv:2404.01291. https://arxiv.org/abs/2404.01291
3. Lin, Z. (2024). "VQAScore: Evaluating and Improving Vision-Language Generative Models." Machine Learning Blog, ML@CMU, Carnegie Mellon University, October 7, 2024. https://blog.ml.cmu.edu/2024/10/07/vqascore-evaluating-and-improving-vision-language-generative-models/
