GenEval
Last reviewed
May 31, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 2,050 words
GenEval is an object-focused benchmark for evaluating how well text-to-image models follow the content of a prompt. Instead of scoring overall image quality, it runs an object-detection model on each generated image and checks specific compositional properties: whether the right objects appear, how many there are, what color they are, and where they sit relative to each other. The framework was introduced by Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt in a paper presented at the NeurIPS 2023 Datasets and Benchmarks track. [1][2]
GenEval became one of the standard numbers reported in text-to-image model releases through 2024 and 2025, alongside or in place of holistic scores. Because it grades fine-grained skills rather than aesthetics, it surfaces weaknesses that distribution-level metrics tend to hide, especially in counting, spatial relations, and attribute binding. [1][3]
The usual ways of scoring an image generator do not tell you much about whether it actually drew what you asked for. Frechet Inception Distance compares the statistics of a set of generated images against a set of real ones, so it measures realism and diversity at the level of a whole distribution, not the alignment of any single image to its prompt. CLIP Score does look at image-text correspondence, but it produces a single similarity value that blends everything together. A model can post a respectable CLIP Score while still putting two dogs in a picture that asked for three, or painting the wrong object red. These holistic metrics give you one number and leave you guessing about which skill is failing. [1][2]
GenEval takes a different angle. It treats text-to-image alignment as a set of separable, checkable claims about object presence, count, color, and position, then verifies each claim using off-the-shelf vision models. The result is an interpretable, instance-level signal: for any generated image you can say not just that it scored low, but exactly which property it got wrong and why. The authors argue that this kind of structured feedback is what holistic scores cannot provide. [1][2]
GenEval is built around six compositional tasks. Prompts are generated from templates that combine objects drawn from the MS-COCO category set with counts, colors, and spatial relations, which keeps the vocabulary inside the range a COCO-trained detector can recognize. The released suite contains 553 prompts spread across the six tasks, with four images sampled per prompt. [1][2]
| Skill | What the prompt asks | How it is graded |
|---|---|---|
| Single object | One named object, e.g. "a photo of a bird" | Detector finds at least one instance of the correct class |
| Two objects | Two distinct objects together | Both classes detected in the same image |
| Counting | A specific number of one object, e.g. "four cups" | Detector counts the right number of instances |
| Colors | An object in a named color | Detector locates the object, then a color check confirms the color |
| Position | One object placed relative to another, e.g. "a cat to the left of a dog" | Both objects detected, and their bounding-box centers satisfy the relation |
| Color attribution | Two objects in two different colors, e.g. "a red book and a blue vase" | Each object detected and each verified to carry its assigned color |
The first two skills are mostly about whether objects show up at all. Counting tests numerical fidelity. Colors and color attribution test attribute binding, with attribution being the harder version because the model has to keep two color-object pairings straight without swapping them. Position tests spatial reasoning, which generators have historically found difficult. [1]
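The template idea can be illustrated with a minimal sketch. The template wording and function names below are assumptions chosen to reproduce the example prompts quoted in the table; the released prompt files may phrase things differently.

```python
# Hypothetical prompt builders, one per GenEval task.
# The exact wording is an assumption; only the quoted examples come from the article.

def with_article(noun_phrase: str) -> str:
    """Prefix a noun phrase with 'a' or 'an' based on its first letter."""
    article = "an" if noun_phrase[0].lower() in "aeiou" else "a"
    return f"{article} {noun_phrase}"

def single_object(obj: str) -> str:
    return f"a photo of {with_article(obj)}"

def two_objects(obj_a: str, obj_b: str) -> str:
    return f"a photo of {with_article(obj_a)} and {with_article(obj_b)}"

def counting(count_word: str, plural_obj: str) -> str:
    return f"a photo of {count_word} {plural_obj}"

def colors(color: str, obj: str) -> str:
    return f"a photo of {with_article(f'{color} {obj}')}"

def position(obj_a: str, relation: str, obj_b: str) -> str:
    return f"a photo of {with_article(obj_a)} {relation} {with_article(obj_b)}"

def color_attribution(color_a: str, obj_a: str, color_b: str, obj_b: str) -> str:
    first = with_article(f"{color_a} {obj_a}")
    second = with_article(f"{color_b} {obj_b}")
    return f"a photo of {first} and {second}"

if __name__ == "__main__":
    print(single_object("bird"))
    print(counting("four", "cups"))
    print(position("cat", "to the left of", "dog"))
    print(color_attribution("red", "book", "blue", "vase"))
```

Because every slot is filled from a closed vocabulary, each prompt comes with a machine-checkable specification (which classes, how many, which colors, which relation) that the grader can verify.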
The core of GenEval is a detection-based check rather than a learned scoring model. Each generated image is passed through Mask2Former, an instance-segmentation network, using a Swin-S backbone from the MMDetection library. Mask2Former returns bounding boxes and masks for the object classes it recognizes, and GenEval filters these to detections above a confidence threshold of 0.3. The counting task raises that threshold to 0.9, since a loose threshold would let spurious low-confidence boxes inflate the count. [1][2]
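The two-threshold rule can be sketched as follows. The `Detection` record and the function names are hypothetical stand-ins for whatever the detector wrapper returns; only the 0.3 and 0.9 values come from the description above, and treating "above" as a strict inequality is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output for one object instance."""
    label: str
    score: float                              # detector confidence in [0, 1]
    box: Tuple[float, float, float, float]    # (x_min, y_min, x_max, y_max)

DEFAULT_THRESHOLD = 0.3    # used for most tasks
COUNTING_THRESHOLD = 0.9   # stricter, so spurious boxes do not inflate counts

def keep_confident(detections: List[Detection], task: str) -> List[Detection]:
    """Drop detections at or below the threshold that applies to this task."""
    threshold = COUNTING_THRESHOLD if task == "counting" else DEFAULT_THRESHOLD
    return [d for d in detections if d.score > threshold]

def count_label(detections: List[Detection], label: str, task: str) -> int:
    """Number of confident detections of one class."""
    return sum(d.label == label for d in keep_confident(detections, task))

if __name__ == "__main__":
    cups = [
        Detection("cup", 0.95, (0, 0, 10, 10)),
        Detection("cup", 0.92, (20, 0, 30, 10)),
        Detection("cup", 0.50, (40, 0, 50, 10)),   # kept at 0.3, dropped at 0.9
        Detection("cup", 0.20, (60, 0, 70, 10)),   # dropped under both thresholds
    ]
    print(count_label(cups, "cup", "counting"))
    print(count_label(cups, "cup", "single_object"))
```

The same set of raw detections can therefore yield different counts depending on the task, which is the intended effect of the stricter counting threshold.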
Object presence and number come straight from the detector output. Position is decided geometrically: GenEval reads the bounding-box coordinates of the two objects and verifies that their relative placement (left, right, above, below) matches the prompt. Color is the one property the detector alone cannot settle, so the pipeline crops the detected region and asks a separate discriminative model, CLIP with a ViT-L/14 image encoder, to classify the color of that crop. Color attribution chains the two steps together, detecting each object and then color-checking each crop independently. An image counts as correct only when every property the prompt specified is satisfied; a model's score on a task is the fraction of its images judged correct, and its overall GenEval score is the average across the six tasks. [1][2]
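The geometric check and the all-properties rule can be sketched like this. It is a simplified illustration, not the released implementation: it compares bounding-box centers directly and assumes no minimum separation margin, and all function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def center(box: Box) -> Tuple[float, float]:
    """Center point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)

def relation_holds(box_a: Box, relation: str, box_b: Box) -> bool:
    """True if object A stands in `relation` to object B.

    Uses image coordinates, where y grows downward, so "above" means
    a smaller y value.
    """
    ax, ay = center(box_a)
    bx, by = center(box_b)
    if relation == "left of":
        return ax < bx
    if relation == "right of":
        return ax > bx
    if relation == "above":
        return ay < by
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation}")

def image_is_correct(property_checks: Iterable[bool]) -> bool:
    """An image passes only if every property in the prompt is satisfied."""
    return all(property_checks)

def task_score(image_results: Sequence[bool]) -> float:
    """Fraction of a task's images judged correct."""
    return sum(image_results) / len(image_results)

if __name__ == "__main__":
    cat, dog = (10, 40, 50, 80), (60, 40, 100, 80)
    both_detected = True  # stand-in for the detector-based presence check
    ok = image_is_correct([both_detected, relation_holds(cat, "left of", dog)])
    print(ok, task_score([ok, False, True, True]))
```

Keeping each property as a separate boolean is what makes the result interpretable: a failed image can be traced to the specific check that returned false.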
Because the whole pipeline runs on existing vision models, it is cheap to run and fully reproducible, with no human raters in the loop at evaluation time. The authors validated it against human judgment on a sample of images. GenEval reached 83% agreement with human annotators, against 88% agreement between the annotators themselves, and 91% agreement on the images where the annotators were unanimous. On the counting task in particular, GenEval improved on CLIP Score's correlation with human ratings by 22 points, which is the kind of fine-grained skill the authors built it to capture. [1][2]
The original paper evaluated several generators and a retrieval baseline. The CLIP retrieval baseline, which simply retrieves the best-matching real image rather than generating one, scored 0.35 overall, a useful reference point for what the metric considers easy (minDALL-E, at 0.23, fell below it). Among the generators, DeepFloyd IF-XL led with 0.61 overall, ahead of Stable Diffusion XL (SDXL) at 0.55. The table below gives the per-skill breakdown reported in the paper. [1]
| Model | Overall | Single | Two | Count | Colors | Position | Attribution |
|---|---|---|---|---|---|---|---|
| CLIP retrieval | 0.35 | 0.89 | 0.22 | 0.37 | 0.62 | 0.03 | 0.00 |
| minDALL-E | 0.23 | 0.73 | 0.11 | 0.12 | 0.37 | 0.02 | 0.01 |
| Stable Diffusion v1.5 | 0.43 | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06 |
| Stable Diffusion v2.1 | 0.50 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 |
| SDXL | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 |
| DeepFloyd IF-XL | 0.61 | 0.97 | 0.74 | 0.66 | 0.81 | 0.13 | 0.35 |
The shape of these results is consistent. Single-object scores were near saturation for the Stable Diffusion models and DeepFloyd IF-XL, sitting at 0.97 to 0.98 (minDALL-E, at 0.73, was the exception), and colors were already strong for the same models at 0.76 to 0.85. The hard columns were position and color attribution, where no 2023 model exceeded 0.15 and 0.35 respectively, and counting, which separated the field (IF-XL's 0.66 was far ahead of SDXL's 0.39). This is the pattern that motivated the benchmark: models that look comparable on a single headline metric pull apart sharply once you look at compositional skills. [1]
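Since the overall score is defined as the average across the six tasks, the Overall column of the table can be checked against the per-skill columns. The short sketch below does that with the published figures; because the per-skill values are themselves rounded to two decimals, the recomputed mean can differ from the reported overall in the last digit.

```python
from statistics import mean

# (reported overall, [single, two, count, colors, position, attribution])
# Values copied from the results table above.
REPORTED = {
    "CLIP retrieval":        (0.35, [0.89, 0.22, 0.37, 0.62, 0.03, 0.00]),
    "minDALL-E":             (0.23, [0.73, 0.11, 0.12, 0.37, 0.02, 0.01]),
    "Stable Diffusion v1.5": (0.43, [0.97, 0.38, 0.35, 0.76, 0.04, 0.06]),
    "Stable Diffusion v2.1": (0.50, [0.98, 0.51, 0.44, 0.85, 0.07, 0.17]),
    "SDXL":                  (0.55, [0.98, 0.74, 0.39, 0.85, 0.15, 0.23]),
    "DeepFloyd IF-XL":       (0.61, [0.97, 0.74, 0.66, 0.81, 0.13, 0.35]),
}

def overall(per_task_scores):
    """Unweighted mean of the six task scores."""
    return mean(per_task_scores)

if __name__ == "__main__":
    for name, (reported, per_task) in REPORTED.items():
        recomputed = overall(per_task)
        print(f"{name:22s} reported={reported:.2f} recomputed={recomputed:.3f}")
```

The unweighted mean means each task counts equally regardless of how many prompts it contains, so a near-zero position score pulls the overall down as much as a near-zero single-object score would.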
Later models pushed the overall number up while keeping the same weak spots. In the Stable Diffusion 3 work, SD3-Medium reached 0.74 overall, with two-object accuracy at 0.94 and attribution improved to 0.60, though position stayed comparatively low at 0.33. DALL-E 3 was reported at 0.67 overall in the same comparison, strong on two objects (0.87) but still weak on counting (0.47). Both overall figures are also tabulated in DeepSeek's Janus-Pro report, where the unified Janus-Pro-7B model posted 0.80 overall, ahead of SD3-Medium (0.74), DALL-E 3 (0.67), and Transfusion (0.63). [4][5]
These numbers came from different papers using the public GenEval code, so small differences in sampling, prompt rewriting, and the number of generations per prompt mean they are best read as roughly comparable rather than identical-protocol results. The broad ordering, with strong single-object and color skills and persistent difficulty on counting, position, and attribution, holds across reports. [1][4][5]
GenEval was picked up quickly as a default compositional benchmark for new image generators. It has been reported as a primary measure of basic text-to-image capability across a long line of 2024 and 2025 model papers, including Stable Diffusion 3, Transfusion, Emu3, Show-o, Janus and Janus-Pro, OmniGen, and Qwen-Image, among others. [3][5] Part of the appeal is practical: the pipeline is automated and cheap, the prompts are public, and the per-skill breakdown gives a release something more informative to report than a single alignment score. Diffusion and autoregressive image models alike are routinely benchmarked on it, which makes it a convenient common axis for comparing architectures. [3]
GenEval sits in a family of compositional text-to-image evaluations. T2I-CompBench is a related suite that also targets attribute binding and spatial and non-spatial relationships, scoring each category with a different tool (BLIP-based question answering, a detector for spatial relations, and CLIP-based similarity) rather than a single detection pipeline. GenAI-Bench and TIFA push toward question-answering style checks. GenEval's particular choice, leaning on a detector plus a color classifier, is what gives it crisp, almost rule-based grading on the properties a detector can see, and what also bounds the kinds of prompts it can handle. It is complementary to FID and CLIP Score rather than a replacement: FID still speaks to image realism and diversity, and CLIP Score to broad text alignment, while GenEval isolates compositional correctness. [1][6]
The most direct limitation is that GenEval is only as reliable as the vision models inside it. If Mask2Former misses an object, hallucinates one, or mislabels a class, the grade is wrong regardless of what the image actually shows, and the same applies to the CLIP color check. The detector was trained on MS-COCO categories, so the prompt vocabulary is effectively limited to objects, counts, and colors that a COCO-trained model can recognize; concepts outside that set cannot be tested cleanly. Position is reduced to bounding-box geometry, which captures simple left or right and above or below relations but not richer spatial language. The authors are explicit that the framework inherits the blind spots of its component models. [1][2]
There is also a gameability concern. Because the grading rule is known and narrow, a model or a prompt-rewriting wrapper can be tuned to satisfy the detector's notion of correctness without genuinely improving generation, for instance by exaggerating object size or saturation to make detection and color classification easier. And as generators improved, the benchmark started to top out. The 2025 GenEval 2 paper documents this directly. It reports that the original GenEval has become saturated, with Gemini 2.5 Flash Image scoring 96.7%, and that the metric's error relative to human judgment grows to as much as 17.7% for current models. The paper describes this as benchmark drift, in which a once-calibrated judge falls behind the systems it evaluates. GenEval 2 responds in two ways: it enlarges the prompt set to 800 prompts that scale from 3 to 10 compositional atoms, and it swaps the detector for a question-answering grading method (Soft-TIFA) intended to resist that drift. On its harder compositional prompts, the strongest model reaches only 35.8% prompt-level accuracy. The existence of such successors is a reminder that GenEval captures a useful but bounded slice of text-to-image ability. [7]