WISE

Updated Jun 28, 2026

WISE (World Knowledge-Informed Semantic Evaluation) is an AI benchmark that tests whether a text-to-image model actually possesses and correctly applies real-world knowledge when it draws a scene, rather than only whether the pixels superficially match the words in the prompt ^[1]. It uses 1,000 hand-crafted prompts across 25 subdomains in cultural common sense, spatio-temporal reasoning, and natural science, scored by a new metric called WiScore, and its central finding is that even strong image generators apply world knowledge poorly ^[1]. WISE was introduced in the paper "WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation," first posted to arXiv on March 10, 2025 by Yuwei Niu, Munan Ning, Mengren Zheng, Li Yuan, and colleagues, and later accepted to the International Conference on Machine Learning (ICML) 2026 ^[1]. The work is a collaboration led by Peking University with contributors from Chongqing University, Pengcheng Laboratory, and Rabbitpre AI, and the code and prompts are released openly through the PKU-YuanGroup repository ^[1]^[2].

The authors frame the gap directly in the paper's abstract: "existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation" ^[1]. Each WISE prompt is written so that a correct image requires the model to know a fact about the world (for example, that ice melts in a warm room, that a particular historical setting has a recognizable look, or that copper turns green as it oxidizes) and then to render that fact faithfully. WISE pairs this prompt set with WiScore, an automatic metric that uses a vision-language model as a judge to rate consistency, realism, and aesthetic quality ^[1]. Under this protocol even strong text-to-image systems score poorly, exposing a gap that conventional alignment scores had largely hidden ^[1].

What is the WISE benchmark?

WISE is presented by its authors as the first benchmark designed specifically for World Knowledge-Informed Semantic Evaluation of text-to-image generation ^[1]. It shifts the evaluation question from "does the image look like the words?" to "does the image prove the model knows the underlying fact?" Most earlier text-to-image evaluation focused on two things: image realism and shallow text-image alignment. A typical prompt would name objects and attributes ("a red apple on a wooden table"), and a metric would check that the named items appear with the named properties. The dominant automatic metric for this, CLIPScore, is derived from CLIP and measures the cosine similarity between an image embedding and a text embedding ^[1]. Because CLIP tends to treat text roughly as a bag of words, such metrics reward the presence of the right nouns and adjectives but are weak on structure, reasoning, and factual correctness.
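As a rough illustration of why an embedding-similarity metric can miss factual errors, the sketch below computes a CLIPScore-style value from two vectors. The 3-dimensional vectors are made-up stand-ins for real CLIP embeddings, the function names are hypothetical, and the 2.5 rescaling factor is taken from the original CLIPScore definition rather than from this article, so treat it as an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clip_style_score(image_emb, text_emb, weight=2.5):
    """Cosine similarity clipped at zero and rescaled (weight is an assumption)."""
    return weight * max(cosine_similarity(image_emb, text_emb), 0.0)


# Toy vectors only: these are NOT real CLIP outputs.
text_famous_paris_landmark = [0.85, 0.15, 0.45]
image_of_generic_tower = [0.90, 0.10, 0.40]
image_of_eiffel_tower = [0.80, 0.20, 0.50]

generic_score = clip_style_score(image_of_generic_tower, text_famous_paris_landmark)
eiffel_score = clip_style_score(image_of_eiffel_tower, text_famous_paris_landmark)
print(round(generic_score, 3), round(eiffel_score, 3))
```

With these toy values both images land close to the text vector and receive nearly the same score. That is the failure mode the paragraph describes: proximity in embedding space says nothing about which image is the factually correct answer.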

The WISE authors argue that this leaves a large blind spot. A prompt like "the materials needed to make a campfire after the rain" does not name the objects to draw; the model has to reason that wood would be wet, that dry tinder is needed, and that the scene implies a particular set of physical conditions ^[1]. Likewise, "a fruit that is rich in vitamin C" requires biological knowledge to pick a plausible fruit, and "the most famous landmark in Paris" requires cultural and geographic knowledge to render the Eiffel Tower rather than a generic tower. Surface-alignment benchmarks and CLIP-style metrics cannot tell whether a model truly knows these facts, because they only check that some plausible image was produced, not that the image is the right answer to a knowledge question ^[1].

This gap became more consequential as image generators shifted from pure diffusion pipelines toward unified multimodal systems that share a backbone with large language models. Such models are increasingly framed as steps toward a world model, that is, a system that carries an internal model of how the world works. WISE was built to probe exactly that capability: whether a generated image reflects genuine world knowledge or merely a statistically convincing arrangement of textures ^[1].

What does WISE measure?

WISE organizes its 1,000 prompts into three top-level domains and 25 subdomains. The distribution reported in the paper is shown below ^[1].

Domain	Prompts	Subdomains
Cultural common sense	400	Festival, sports, religion, craft, construction, animal, plant, art, celebrity, life
Spatio-temporal reasoning	300	Time (horizontal and longitudinal) and space (geographic location, relative position, different view)
Natural science	300	Biology, physics, chemistry

Cultural common sense covers facts that vary by place, tradition, and human practice, such as how a specific festival is celebrated or what a traditional craft looks like. Spatio-temporal reasoning tests whether a model understands time (for example, how a landscape changes across seasons, or the chronological appearance of an object) and space (geographic settings, relative positions of objects, and viewpoint). Natural science draws on biology, physics, and chemistry, asking the model to depict outcomes that follow from scientific laws, such as a chemical reaction's product or the behavior of a physical system ^[1]. Across all three domains, prompts are written to demand reasoning and recall rather than literal transcription, so that producing the correct image is itself evidence that the model holds the underlying knowledge.
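The domain split above can be written down as a small data structure, which also makes it easy to see how a single overall number could be aggregated from per-domain results. The sketch below assumes the overall score is a plain mean over all prompts (equivalently, a prompt-count-weighted mean of domain means). That aggregation rule, the class and function names, and the example scores are illustrative assumptions, not values or code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    prompts: int


# Prompt counts as reported in the table above.
WISE_DOMAINS = (
    Domain("Cultural common sense", 400),
    Domain("Spatio-temporal reasoning", 300),
    Domain("Natural science", 300),
)


def total_prompts(domains=WISE_DOMAINS):
    """Total number of prompts across all domains."""
    return sum(d.prompts for d in domains)


def prompt_weighted_mean(domain_scores, domains=WISE_DOMAINS):
    """Mean over all prompts, given one mean score per domain (assumed rule)."""
    total = total_prompts(domains)
    return sum(domain_scores[d.name] * d.prompts for d in domains) / total


# Hypothetical per-domain means, for illustration only.
example_scores = {
    "Cultural common sense": 0.5,
    "Spatio-temporal reasoning": 0.6,
    "Natural science": 0.4,
}
overall = prompt_weighted_mean(example_scores)
print(total_prompts(), round(overall, 3))
```

Because the cultural domain holds 400 of the 1,000 prompts, it pulls hardest on any prompt-weighted overall figure. A leaderboard that weights categories differently will report a different overall number from the same per-domain scores.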

How is WiScore calculated?

WISE introduces WiScore as its primary metric, designed to replace CLIP-style similarity for knowledge-intensive prompts ^[1]. Instead of comparing embeddings, WiScore uses a powerful vision-language model as a judge to grade each generated image along three axes ^[1]:

Consistency: how accurately and completely the image reflects the prompt, capturing the key elements and the intended world knowledge.
Realism: how plausible the image is, including adherence to physical laws, accurate materials, and coherent spatial relationships.
Aesthetic quality: the overall artistic quality, including composition, color harmony, and style.

Each axis is scored on a 0 to 2 scale, with the labels rejected (0), acceptable (1), and exemplary (2) ^[1]. The three scores are then combined into a single WiScore using a weighted average that is normalized to a 0 to 1 range:

WiScore = (0.7 × Consistency + 0.2 × Realism + 0.1 × Aesthetic Quality) / 2
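The formula can be worked through in a few lines of code. The following is a minimal sketch of the arithmetic only, using the weights and the 0 to 2 scale stated above; the function and constant names are illustrative and are not taken from the WISE repository.

```python
from enum import IntEnum


class Rating(IntEnum):
    """Per-axis labels on the 0 to 2 scale."""
    REJECTED = 0
    ACCEPTABLE = 1
    EXEMPLARY = 2


# Weights as stated in the formula above.
WEIGHTS = {"consistency": 0.7, "realism": 0.2, "aesthetic": 0.1}


def wiscore(consistency, realism, aesthetic):
    """Weighted average of the three axis ratings, normalized to 0..1."""
    ratings = {"consistency": consistency, "realism": realism, "aesthetic": aesthetic}
    for name, value in ratings.items():
        if value not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2, got {value!r}")
    weighted = sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)
    return weighted / 2  # dividing by the maximum rating maps the result to 0..1


# A beautiful but factually wrong image: (0.7*0 + 0.2*2 + 0.1*2) / 2 = 0.30
pretty_but_wrong = wiscore(Rating.REJECTED, Rating.EXEMPLARY, Rating.EXEMPLARY)

# A correct but plain image: (0.7*2 + 0.2*1 + 0.1*0) / 2 = 0.80
correct_but_plain = wiscore(Rating.EXEMPLARY, Rating.ACCEPTABLE, Rating.REJECTED)

print(round(pretty_but_wrong, 2), round(correct_but_plain, 2))
```

The two examples show the effect of the weighting: an image that is rejected on consistency cannot exceed 0.30 however realistic or attractive it is, while a factually correct image starts from 0.70.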

The heavy 0.7 weight on consistency reflects the benchmark's focus: a beautiful but factually wrong image should not score well, so correctness dominates the metric while realism and aesthetics play smaller roles ^[1]. In the published study the judge is GPT-4o (specifically the gpt-4o-2024-05-13 version), and the authors report that WiScore aligns more closely with human ratings than prior automatic metrics such as CLIPScore and VQAScore on these knowledge-heavy prompts ^[1]. After publication, the project also released a "WISE_Verified" leaderboard variant that changes the default protocol, adjusting the per-category weighting and swapping in an open-source judge served via vLLM; the maintainers explicitly note that WISE_Verified is not WISE 2.0 and that its protocol is distinct from the original WiScore used in the paper ^[2].
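The judging protocol described above (generate an image, have a vision-language model grade it on three axes, combine the grades) can be sketched as a simple loop. Everything here is hypothetical scaffolding: `generate` and `judge` are placeholder callables standing in for a real image model and a real judge such as GPT-4o, the dictionary keys are invented for this example, and averaging per-prompt WiScores is an assumed aggregation rather than the project's documented procedure.

```python
from statistics import mean
from typing import Callable, Dict, Iterable

# Assumed judge output: one 0-2 rating per axis.
Scores = Dict[str, int]


def wiscore(scores: Scores) -> float:
    """Combine the three axis ratings with the 0.7 / 0.2 / 0.1 weights."""
    return (
        0.7 * scores["consistency"]
        + 0.2 * scores["realism"]
        + 0.1 * scores["aesthetic"]
    ) / 2


def evaluate(
    prompts: Iterable[str],
    generate: Callable[[str], object],
    judge: Callable[[str, object], Scores],
) -> float:
    """Mean WiScore over prompts; raises StatisticsError if prompts is empty."""
    per_prompt = []
    for prompt in prompts:
        image = generate(prompt)       # model under test
        scores = judge(prompt, image)  # vision-language model acting as judge
        per_prompt.append(wiscore(scores))
    return mean(per_prompt)


# Stubs so the sketch runs without any model or network access.
def fake_generate(prompt: str) -> str:
    return f"<image for: {prompt}>"


def fake_judge(prompt: str, image: object) -> Scores:
    return {"consistency": 1, "realism": 2, "aesthetic": 1}


result = evaluate(["a fruit that is rich in vitamin C"], fake_generate, fake_judge)
print(round(result, 2))
```

Passing the generator and judge in as callables keeps the scoring logic independent of any particular model, which matters here because the original paper and the WISE_Verified leaderboard use different judges.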

How do text-to-image models score on WISE?

The original WISE study evaluated 20 image generators, split into 10 dedicated text-to-image models and 10 unified multimodal models ^[1]. The headline result is that world-knowledge application is hard for essentially every model tested: the authors report "significant limitations in their ability to effectively integrate and apply world knowledge during image generation," and even the strongest dedicated diffusion models leave large amounts of factual content unrendered or wrong ^[1].

FLUX.1-dev, a dedicated diffusion model, was the best performer in the paper with an overall WiScore of about 0.50, ahead of SD-3.5-large (around 0.46) from the Stable Diffusion family and well ahead of the unified autoregressive model Janus-Pro-7B (around 0.35) ^[1]. In these results the unified models did not outperform diffusion-only pipelines on world knowledge: the understanding capabilities of a shared multimodal backbone did not automatically carry over into image generation ^[1].

Model (original paper)	Type	Overall WiScore
FLUX.1-dev	Dedicated T2I (diffusion)	~0.50
SD-3.5-large	Dedicated T2I (diffusion)	~0.46
Janus-Pro-7B	Unified / autoregressive	~0.35

Even the top score of about 0.50 is far from a perfect 1.0, and the authors stress that across categories the models do not reach a level indicating a complete and satisfactory understanding of world knowledge ^[1]. WISE thus documents a large, measurable gap between rendering a scene and understanding it.

How did GPT-4o score on WISE?

In the original WISE paper, GPT-4o was the judge model, not a generator under test. OpenAI's native image generation in GPT-4o launched in late March 2025, just after the benchmark first appeared, so it is not scored in the published WISE tables ^[1]. Its WISE result instead comes from a separate companion study, "GPT-ImgEval" (April 2025), which ran GPT-4o native image generation on a 200-prompt subset of WISE ^[3]. GPT-4o reached an overall WiScore of 0.89 on that subset, far above the diffusion baselines, with per-category scores reported as cultural 0.94, time 0.64, space 0.98, biology 0.93, physics 0.98, and chemistry 0.95 ^[3]. That evaluation is often cited as evidence that a unified autoregressive image generator can apply world knowledge much more reliably than diffusion-only models, though it uses a reduced 200-prompt set and should be read alongside, not merged with, the original 20-model table ^[1]^[3].

The WISE leaderboard has continued to evolve as new systems have appeared. Later strong entries tracked by the project include unified models such as BAGEL (with chain-of-thought prompting) and Qwen-Image, and the maintainers' "WISE_Verified" leaderboard has since added newer frontier image models, reflecting rapid progress in the field after the paper's release ^[2]. Among these later systems is Google's Gemini 2.5 Flash Image, the image model nicknamed "Nano Banana", which Google released on August 26, 2025 and promoted in part for its use of Gemini world knowledge in generation and editing ^[4].

Why does WISE matter?

WISE matters because it reframed text-to-image evaluation around knowledge rather than appearance. By showing that models which produce gorgeous, well-aligned images can still get the underlying facts wrong, it gave researchers a concrete way to measure the difference between rendering and understanding ^[1]. That distinction has grown more important as image generators are increasingly treated as early world models and as the field moves toward unified multimodal systems that are expected to reason as well as draw.

The benchmark also illustrates a recurring lesson in evaluation: the metric matters as much as the data. Because CLIP-style scores were largely insensitive to factual correctness, the knowledge gap WISE exposes had been easy to overlook, and the WiScore design, with a vision-language-model judge and a heavy weight on consistency, was needed to surface it ^[1]. The strong showing of GPT-4o's native image generation in GPT-ImgEval, far above the roughly 0.35 to 0.50 range reported for the models in the original paper, has helped motivate the broader shift toward image generators built on language-model backbones ^[1]^[3]. Its acceptance to ICML 2026 and its continued use as a public leaderboard reflect WISE's role as a standard probe for knowledge-grounded generation ^[1]^[2].

ELI5: What is WISE in simple terms?

Imagine you ask an art robot to draw "the most famous tower in Paris." A lazy test only checks that the robot drew a tower. WISE is a stricter test: it checks that the robot actually knew the answer is the Eiffel Tower and drew that one. It asks the robot 1,000 tricky questions like this about culture, time and space, and science, then a smart judge gives each picture a grade called WiScore. The big lesson from WISE is that even very good art robots often do not really know the facts, so they draw pretty pictures that are wrong.

References

1. Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Fanqing Meng, Kunpeng Ning, Bin Zhu, Li Yuan. "WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation." arXiv:2503.07265, March 2025 (accepted to ICML 2026). https://arxiv.org/abs/2503.07265
2. PKU-YuanGroup. "WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation" (code, prompts, and leaderboard). GitHub. https://github.com/PKU-YuanGroup/WISE
3. Zhiyuan Yan et al. "GPT-ImgEval: A Comprehensive Benchmark for Diagnosing GPT-4o in Image Generation." arXiv:2504.02782, April 2025. https://arxiv.org/abs/2504.02782
4. Google. "Introducing Gemini 2.5 Flash Image, our state-of-the-art image model." Google Developers Blog, August 26, 2025. https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/
