GenAI-Bench
Last reviewed
Jun 8, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,476 words
GenAI-Bench is an AI benchmark for evaluating compositional text-to-image and text-to-video generation, introduced in 2024 by researchers from Carnegie Mellon University and Meta AI [1][2]. It consists of about 1,600 text prompts collected from professional graphic designers, each tagged with the compositional reasoning skills it exercises, together with large collections of human ratings of images and videos produced by leading generative models. The benchmark is designed to test whether generated visuals actually match prompts that involve attributes, relationships, counting, comparison, and logical operators such as negation, rather than just whether a model can produce a photorealistic image of a single object.
GenAI-Bench is closely associated with VQAScore, an automatic alignment metric proposed in a companion paper, "Evaluating Text-to-Visual Generation with Image-to-Text Generation" [2]. VQAScore scores how well an image matches a caption by asking a visual question answering model how likely it is to answer "Yes" to the question "Does this figure show '{text}'?". On GenAI-Bench, VQAScore correlates with human judgments substantially better than earlier metrics such as CLIPScore, making the benchmark both a test set for generative models and a yardstick for the automatic metrics used to evaluate them. The GenAI-Bench paper received the Best Short Paper award at the CVPR 2024 SynData workshop [1][3].
By 2024, diffusion-based systems such as Stable Diffusion, DALL-E 3, and Midjourney could render highly photorealistic images, yet they often failed on prompts that combined multiple concepts in specific ways. The GenAI-Bench authors frame this as a gap between perceptual quality, which had improved rapidly, and faithfulness to complex compositional prompts, which had not [1].
The problem is compounded by the metrics used to measure alignment. CLIPScore, derived from CLIP, measures the cosine similarity between an image embedding and a text embedding. Because CLIP encodes text largely as a "bag of words," CLIPScore is insensitive to word order and structure: it cannot reliably distinguish "the moon is over the cow" from "the cow is over the moon," and it struggles with attribute binding, counting, and negation [3]. As a result, a metric can assign a high score to an image that ignores the relational or logical content of a prompt. Evaluating compositional generation therefore requires two things at once: prompts hard enough to expose these failures, and metrics faithful enough to detect them. GenAI-Bench was built to supply the prompts and the human ratings needed to study both.
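The word-order problem can be illustrated with a toy example. The sketch below is not CLIP: it assumes a deliberately simplified, order-blind text encoder (token counts) to show why any representation that behaves like a bag of words gives the two example sentences the same embedding, and therefore a cosine similarity of approximately 1. The function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy order-blind encoder: counts lowercase whitespace tokens.

    This is an illustrative stand-in, not how CLIP encodes text.
    """
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


first = bag_of_words("the moon is over the cow")
second = bag_of_words("the cow is over the moon")

# Both sentences contain exactly the same tokens, so the two count
# vectors are identical and the similarity is (numerically) 1.
swap_similarity = cosine_similarity(first, second)
print(first == second, round(swap_similarity, 6))
```

A metric built on such a representation cannot penalize an image that depicts the wrong one of the two scenes, which is the failure mode the paragraph above describes.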
GenAI-Bench centers on roughly 1,600 prompts sourced from graphic designers who routinely use text-to-image tools, which the authors argue makes the prompts more natural and more demanding than synthetically templated alternatives such as T2I-CompBench or the PartiPrompts set [1][3]. Each prompt is annotated with the visio-linguistic skills it requires, grouped into two tiers:
| Skill tier | Skills covered |
|---|---|
| Basic | Objects, scenes, attributes (such as color, shape, material), and relationships (spatial, action, and part relations) |
| Advanced | Counting, comparison, differentiation, and logic (including negation and universality / universal quantification) |
The advanced tier is the distinguishing feature: it targets higher-order reasoning that earlier compositional benchmarks largely omitted, such as "exactly three", comparative phrasing, telling near-identical objects apart, and statements that require something to be absent [1][3].
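One way to picture the tagging scheme is as a small data structure. The sketch below is an assumption-laden illustration: the tag strings, the example prompt, and the rule that a prompt counts as "advanced" when it carries at least one advanced skill tag are hypothetical and are not taken from the benchmark's released files.

```python
from typing import Iterable

# Skill names follow the table above; the exact tag strings are hypothetical.
BASIC_SKILLS = frozenset({"object", "scene", "attribute", "relationship"})
ADVANCED_SKILLS = frozenset({"counting", "comparison", "differentiation", "logic"})


def prompt_tier(tags: Iterable[str]) -> str:
    """Return 'advanced' if any advanced skill is tagged, else 'basic'.

    Assumed rule for illustration only.
    """
    tag_set = set(tags)
    unknown = tag_set - BASIC_SKILLS - ADVANCED_SKILLS
    if unknown:
        raise ValueError(f"unknown skill tags: {sorted(unknown)}")
    return "advanced" if tag_set & ADVANCED_SKILLS else "basic"


# Hypothetical tagged prompt, not a quotation from the benchmark.
example = {
    "prompt": "exactly three red mugs on a shelf, none of them chipped",
    "skills": ["object", "attribute", "relationship", "counting", "logic"],
}
example_tier = prompt_tier(example["skills"])
print(example_tier)
```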
To turn the prompts into an evaluation set, the authors generated outputs from a range of models and collected human alignment ratings on a 1-to-5 Likert scale, with multiple annotators per image-text pair [3]. The image side covers six text-to-image systems (Stable Diffusion v2.1, SD-XL, SD-XL Turbo, DeepFloyd-IF, Midjourney v6, and DALL-E 3), and the video side covers four text-to-video systems (ModelScope, Floor33, Pika v1, and Runway Gen-2) [1]. A separate split, GenAI-Rank, supports the study of best-of-N selection: it pairs 800 prompts with 9 images per prompt for DALL-E 3 and SD-XL [1]. The authors report releasing more than 80,000 human ratings across the two splits: roughly 38,400 for the main GenAI-Bench evaluation and about 43,200 for the GenAI-Rank images [1].
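The quoted rating counts can be sanity-checked with simple arithmetic. The sketch below assumes that "9 images per prompt" applies to each of the two GenAI-Rank models separately; under that reading, the GenAI-Rank figure works out to three ratings per image, and the two splits sum to just over 80,000.

```python
# Figures quoted in the text above.
rank_prompts = 800
images_per_prompt_per_model = 9  # assumption: 9 images for each model
rank_models = 2                  # DALL-E 3 and SD-XL
main_ratings = 38_400
rank_ratings = 43_200

# Number of GenAI-Rank images under the stated assumption.
rank_images = rank_prompts * images_per_prompt_per_model * rank_models

# Implied number of ratings collected per GenAI-Rank image.
ratings_per_rank_image = rank_ratings / rank_images

# Combined size of the two rating pools.
total_ratings = main_ratings + rank_ratings

print(rank_images, ratings_per_rank_image, total_ratings)
```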
VQAScore is the alignment metric developed alongside GenAI-Bench and presented in "Evaluating Text-to-Visual Generation with Image-to-Text Generation" by Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan, published at the European Conference on Computer Vision (ECCV) 2024 [2]. Rather than comparing separate image and text embeddings, VQAScore reframes alignment as a question-answering problem. Given an image and a caption, it poses the question "Does this figure show '{text}'?" to a generative vision-language model and uses the probability the model assigns to the answer "Yes" as the alignment score [2][3].
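The scoring rule can be sketched in a few lines. The example below is a minimal illustration rather than the authors' implementation: it assumes the vision-language model has already produced logits for candidate first answer tokens, and the helper names (`build_question`, `yes_probability`) are hypothetical. A real model normalizes over its whole vocabulary; here the softmax runs over a toy set of tokens.

```python
import math
from typing import Mapping

# Question template quoted in the text above.
QUESTION_TEMPLATE = "Does this figure show '{text}'?"


def build_question(text: str) -> str:
    """Insert the caption into the yes/no question template."""
    return QUESTION_TEMPLATE.format(text=text)


def yes_probability(answer_logits: Mapping[str, float]) -> float:
    """Softmax over answer-token logits, returning P("Yes").

    `answer_logits` stands in for the output of a vision-language model
    conditioned on the image and the question (hypothetical interface).
    """
    largest = max(answer_logits.values())
    # Subtract the maximum logit for numerical stability.
    exps = {tok: math.exp(v - largest) for tok, v in answer_logits.items()}
    return exps["Yes"] / sum(exps.values())


question = build_question("the moon is over the cow")
# Made-up logits for illustration only.
score = yes_probability({"Yes": 2.0, "No": 0.0})
print(question)
print(round(score, 3))
```

Because the score is a probability produced by a model that reads the full question, it can in principle react to word order and negation in the caption, unlike a similarity between two independently computed embeddings.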
To compute this probability accurately, the authors trained an in-house model called CLIP-FlanT5, built on the FLAN-T5 encoder-decoder. Its key design choice is a bidirectional image-question encoder, which lets the image features depend on the question being asked and the question features depend on the image, in contrast to the unidirectional attention used by many decoder-only multimodal models [2][3]. The paper reports that VQAScore, even when computed with off-the-shelf VQA models, achieves state-of-the-art correlation with human judgments across roughly eight image-text alignment benchmarks, and that the CLIP-FlanT5 variant outperforms much larger proprietary systems, including pipelines that rely on GPT-4V, without using any human feedback at scoring time [2][3]. On the Winoground compositional reasoning benchmark, the authors report VQAScore improving over CLIPScore by roughly 5x on basic skills and about 10x on advanced skills [2].
GenAI-Bench is used in two complementary ways: to rank generative models and to rank the metrics that score them.
For models, the human ratings expose where current systems break down: all of the evaluated image and video models degrade on advanced compositional skills relative to basic ones, with the largest drops on counting, differentiation, and logic [1]. For metrics, the benchmark serves as ground truth for measuring how well each automatic scorer agrees with human raters. The authors report that VQAScore substantially outperforms CLIPScore and also exceeds learned human-preference metrics such as PickScore, HPSv2, and ImageReward on GenAI-Bench prompts, especially those requiring advanced reasoning [1][3].
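Agreement between a metric and human raters can be quantified in several ways. The sketch below uses a simple pairwise agreement rate, chosen here purely for illustration; it is an assumption, not necessarily the statistic reported in the papers. For every pair of items that humans rated differently, it checks whether the metric orders the pair the same way. The data is made up.

```python
from itertools import combinations
from typing import Sequence


def pairwise_agreement(metric_scores: Sequence[float],
                       human_ratings: Sequence[float]) -> float:
    """Fraction of item pairs (with distinct human ratings) that the
    metric ranks in the same order as the human raters."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("score and rating lists must have equal length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(human_ratings)), 2):
        human_diff = human_ratings[i] - human_ratings[j]
        if human_diff == 0:
            continue  # humans expressed no preference for this pair
        total += 1
        metric_diff = metric_scores[i] - metric_scores[j]
        if metric_diff * human_diff > 0:
            agree += 1
    if total == 0:
        raise ValueError("no pairs with distinct human ratings")
    return agree / total


# Toy data: mean human ratings on a 1-to-5 scale and two made-up metrics.
human = [1, 3, 5, 2]
metric_a = [0.1, 0.6, 0.9, 0.3]  # orders every pair as the humans do
metric_b = [0.9, 0.6, 0.1, 0.3]  # mostly reversed

agreement_a = pairwise_agreement(metric_a, human)
agreement_b = pairwise_agreement(metric_b, human)
print(agreement_a, agreement_b)
```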
A further use is improving generation without retraining. Because VQAScore provides a reliable alignment signal that does not require differentiating through the generator, it can be used for black-box best-of-N selection: generating a handful of candidate images (the paper uses 3 to 9) and keeping the one VQAScore ranks highest [1][3]. The authors report that ranking by VQAScore is roughly 2x to 3x more effective than ranking by PickScore, HPSv2, or ImageReward at raising human alignment ratings for DALL-E 3 and SD-XL outputs, which is the motivation behind the GenAI-Rank split [1]. VQAScore and its associated tooling have since been adopted in industry practice; Google's Imagen 3 technical report cites VQAScore as a stronger replacement for CLIPScore in automated text-to-image evaluation [3].
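Best-of-N selection needs only a scoring function, so it can be sketched independently of any particular generator or metric. In the example below, `best_of_n` and `toy_scorer` are hypothetical names, the candidate file names and scores are made up, and the scorer is a stand-in for an alignment metric such as VQAScore rather than a call into any real library.

```python
from typing import Callable, Sequence, TypeVar

Candidate = TypeVar("Candidate")


def best_of_n(prompt: str,
              candidates: Sequence[Candidate],
              score: Callable[[Candidate, str], float]) -> Candidate:
    """Return the candidate with the highest alignment score for the prompt.

    The generator is treated as a black box: only its outputs are needed.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda candidate: score(candidate, prompt))


# Made-up scores standing in for a real alignment metric.
toy_scores = {"img_a.png": 0.41, "img_b.png": 0.87, "img_c.png": 0.63}


def toy_scorer(candidate: str, prompt: str) -> float:
    """Hypothetical scorer: looks up a precomputed score."""
    return toy_scores[candidate]


chosen = best_of_n("three red apples, none of them bitten",
                   list(toy_scores), toy_scorer)
print(chosen)
```

The quality of the selected image depends entirely on how well the scorer tracks human judgments, which is why the comparison against PickScore, HPSv2, and ImageReward matters for this use.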
GenAI-Bench sits within a family of benchmarks and metrics aimed at compositional faithfulness in generative media, and it is best understood by contrast with them.
Compared with these, GenAI-Bench's distinguishing contributions are its designer-sourced prompts, its skill taxonomy spanning both basic and advanced reasoning, its coverage of both text-to-image and text-to-video models, and its large pool of human ratings that lets it benchmark automatic metrics rather than only models [1][3]. The benchmark and the VQAScore metric are distributed together through the project's open-source evaluation toolkit, which lets researchers score text-to-image, text-to-video, and text-to-3D outputs with a single interface [3].