ERQA


Updated May 10, 2026


ERQA
Overview
Full name	Embodied Reasoning Question Answering
Release date	2025-03-25
Latest version	1.0
Authors	Gemini Robotics Team, Google DeepMind
Type	Embodied Reasoning, Visual Question Answering, Robotics
Modality	Vision, Text
Task format	Multiple-choice VQA (4 options: A, B, C, D)
Reasoning categories	7
Total examples	400 questions
Multi-image share	112 (about 28%)
Single-image share	288 (about 72%)
Evaluation metric	Accuracy
Languages	English
Random baseline	25.0%
Best score (no CoT)	53.3% (Gemini Robotics-ER)
Best general VLM (no CoT)	48.3% (Gemini 2.0 Pro Experimental)
Best score (with CoT)	54.8% (Gemini 2.0 Pro Experimental)
Saturated	No
GitHub	embodiedreasoning/ERQA
Dataset format	TFRecord
Paper	arXiv:2503.20020
License	CC BY 4.0

ERQA (Embodied Reasoning Question Answering) is a multimodal benchmark released by Google DeepMind in March 2025 to evaluate the embodied reasoning capabilities of vision-language models (VLMs) on robotics scenes. It consists of 400 multiple-choice Visual Question Answering (VQA) questions covering seven reasoning skills important for an agent acting in the physical world: spatial, trajectory, action, state estimation, pointing, multi-view, and task reasoning. ERQA was open sourced alongside the Gemini Robotics technical report and is the public benchmark most closely associated with Gemini Robotics-ER. ^[1]^[2]

ERQA is intentionally complementary to existing VLM benchmarks such as RealWorldQA, BLINK, and MMMU. Rather than testing atomic visual skills like object recognition or counting, it asks compound questions about what an embodied agent should do given a scene, where it should look or move, and how multiple frames relate to each other. At release in March 2025, the strongest scores reported for a general-purpose VLM were 48.3% for Gemini 2.0 Pro Experimental without chain-of-thought and 54.8% with chain-of-thought, well above the 25% random baseline but far from saturation. ^[1]^[3]

background and motivation

Most multimodal models in 2024 and early 2025 were evaluated on benchmarks that emphasize atomic capabilities: identifying objects, counting them, locating them, or answering trivia about web images. Those skills are necessary for a robot but not sufficient. A useful embodied agent has to reason about which object to grasp, where a thrown ball will land, whether a cup is empty or full, and what task the human in front of it is doing. The Gemini Robotics team built ERQA to make progress on those compound skills measurable. ^[1]

The benchmark first appeared in the paper Gemini Robotics: Bringing AI into the Physical World, posted to arXiv on 25 March 2025. The paper introduces two model families. Gemini Robotics is a vision-language-action (VLA) model that turns visual observations and instructions into motor commands. Gemini Robotics-ER is a vision-language model focused on perception and planning: pointing, 2D and 3D bounding boxes, trajectory prediction, multi-view correspondence, and task decomposition. ERQA is one of the public benchmarks used to compare Gemini Robotics-ER to general-purpose VLMs. It is distributed under CC BY 4.0 through the GitHub repository embodiedreasoning/ERQA, which ships an evaluation harness for the Gemini API and the OpenAI API. ^[1]^[2]

dataset structure

question composition

ERQA is small by modern benchmark standards, and that compactness is deliberate. Every question was hand authored and manually verified so there is exactly one defensible correct answer among the four choices. Each TFRecord example contains a textual question, one or more encoded images, the index positions where each image is interleaved with the prompt, and a single-letter ground truth answer (A, B, C, or D). ^[1]^[2]

Component	Quantity / Description
Total questions	400 multiple-choice VQA, hand authored and human verified
Answer options	4 per question, labeled A, B, C, D
Single-image questions	288 (about 72%)
Multi-image questions	112 (about 28%)
Storage format	TFRecord with `question`, `image/encoded`, `answer`, `question_type`, `visual_indices`
Languages	English only

The paper reports that the multi-image subset is consistently harder than the single-image subset across every model evaluated, suggesting current VLMs still struggle to fuse evidence across frames in the way an embodied agent needs. ^[1]
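The record layout listed in the table above can be mirrored in a small in-memory type. This is a minimal sketch, not the repository's loader: `ErqaExample` and `image_split` are hypothetical names, the field types are assumptions inferred from the field list, and the toy records use placeholder bytes rather than real images.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ErqaExample:
    """Hypothetical in-memory view of one ERQA record (field types are assumed)."""
    question: str
    images: List[bytes]            # stands in for the `image/encoded` field
    answer: str                    # single letter: "A", "B", "C", or "D"
    question_type: str             # label strings below are illustrative only
    visual_indices: List[int] = field(default_factory=list)

    def is_multi_image(self) -> bool:
        return len(self.images) > 1


def image_split(examples: List[ErqaExample]) -> Tuple[int, int]:
    """Return (single_image_count, multi_image_count)."""
    multi = sum(1 for ex in examples if ex.is_multi_image())
    return len(examples) - multi, multi


# Toy data only: placeholder bytes, not real ERQA images.
toy = [
    ErqaExample("Which object is closest to the gripper?", [b"img"], "B", "spatial", [0]),
    ErqaExample("Which object in image 2 matches image 1?", [b"a", b"b"], "C", "multi-view", [0, 0]),
]
single, multi = image_split(toy)
print(single, multi)
```

Run over the full dataset, a split like this should reproduce the 288 single-image and 112 multi-image counts reported above.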

reasoning categories

Questions are tagged with one of seven question_type labels, chosen to mirror the decisions a real robot pipeline has to make rather than an academic taxonomy. ^[1]

Category	What it tests	Example
Spatial reasoning	3D relationships between objects and the viewer	"Which object is closest to the gripper?"
Trajectory reasoning	Motion paths and end positions	"Where will the rolled ball end up?"
Action reasoning	Consequences of physical actions	"What happens to the stack if the robot pushes this block?"
State estimation	Discrete or continuous object states	"Is the kettle on or off?"
Pointing	Identifying an image region from a description	"Point to the handle the robot should grasp."
Multi-view reasoning	Aligning entities across frames	"Which object in image 2 corresponds to the one in image 1?"
Task reasoning	Identifying the goal or sub-step	"What task is the person most likely doing?"
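Because every question carries a single category label, results can be broken down per category. The sketch below is illustrative only: `accuracy_by_category` is a hypothetical helper, and the category strings in the toy data are placeholders rather than the dataset's exact `question_type` values.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """rows are (question_type, predicted_letter, ground_truth_letter) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, truth in rows:
        total[category] += 1
        if predicted == truth:
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy predictions; category strings are illustrative, not the dataset's exact labels.
rows = [
    ("spatial", "A", "A"),
    ("spatial", "B", "C"),
    ("pointing", "D", "D"),
    ("multi-view", "A", "B"),
]
per_category = accuracy_by_category(rows)
print(per_category)
```

With only 400 questions spread over seven categories, each per-category figure rests on a small sample, so such breakdowns are best read as rough indicators.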

data sources

Images come from public robotics and egocentric video datasets, with a small number of custom captures for edge cases. Real footage means questions can lean on textures, occlusions, and clutter that simulator scenes lack. ^[1]^[2]

Source dataset	Contribution
Open X-Embodiment (OXE)	Diverse arm, gripper, and table-top robot scenes
Universal Manipulation Interface (UMI)	Hand-held gripper, first-person manipulation footage
MECCANO	Egocentric assembly video for part identification and step ordering
HoloAssist	AR-headset video for multi-view and human-robot collaboration
EGTEA Gaze+	First-person cooking video for long-horizon task and state reasoning
Custom captures	Hand-authored scenes filling multi-view pointing and other gaps

evaluation methodology

scoring

The evaluation protocol is austere: a model sees the prompt and candidate images in the order set by visual_indices, outputs one of A, B, C, or D, and its answer is compared to the ground truth. Accuracy is the only headline metric. The harness also supports a chain-of-thought variant in which the model can reason first; only the final letter is graded. ^[2]
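The grading rule described above, where only the final letter counts, can be sketched in a few lines. This is not the harness's own parsing logic: the `final_letter` helper and its "last standalone capital A to D wins" convention are assumptions made for illustration.

```python
import re
from typing import List, Optional

# Assumed convention: the last standalone capital A-D in the response is the answer.
_LETTER = re.compile(r"\b([ABCD])\b")


def final_letter(response: str) -> Optional[str]:
    """Return the last standalone A-D letter in a response, or None if absent."""
    matches = _LETTER.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: List[str], answers: List[str]) -> float:
    """Fraction of responses whose final letter equals the ground truth."""
    if not answers or len(responses) != len(answers):
        raise ValueError("responses and answers must be non-empty and equal length")
    correct = sum(final_letter(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)


responses = [
    "B",                                                   # direct answer
    "The mug is nearest the gripper. Final answer: D",     # chain-of-thought style
    "unsure",                                              # no letter: counted wrong
]
answers = ["B", "A", "C"]
score = accuracy(responses, answers)
print(f"{score:.3f}")
```

A last-match rule is a simple heuristic and can misfire when reasoning text continues after the answer, so a stricter output format is usually safer for chain-of-thought runs.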

Aspect	Implementation
Format	Multiple choice (A, B, C, D)
Manual verification	All 400 questions human verified
API support	Native Gemini 2.0 and OpenAI bindings
Chain of thought	Optional reasoning prefix
Retry mechanism	Configurable backoff on rate limits
Random baseline	25.0% (4-way uniform)
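The retry behaviour listed in the table can be illustrated with a generic exponential-backoff wrapper. This is a sketch of the general technique, not the harness's implementation: `RateLimitError` and `call_with_backoff` are hypothetical names, and the doubling delay schedule is an assumption.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's resource-exhaustion error."""


def call_with_backoff(
    call: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimitError, doubling the delay after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")


# Fake model call that fails twice with a rate-limit error, then answers.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("quota exhausted")
    return "B"


delays: List[float] = []
result = call_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Injecting the `sleep` function keeps the example fast to run and makes the delay schedule easy to inspect.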

running the harness

The eval_harness.py script takes a TFRecord path, an API choice (gemini or openai), a model name, and a sample size. The default model list at release covered gemini-2.0-flash, gemini-2.0-pro, gemini-2.0-pro-exp-02-05, gpt-4o-2024-11-20, and gpt-4o-mini. Multiple API keys can be supplied via a keys file, and the harness retries on resource-exhaustion errors. The visual_indices field controls where images are interleaved with the prompt, which matters for multi-image questions where order is part of the problem. ^[2]
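To make the role of visual_indices concrete, the sketch below builds an ordered list of text and image parts for a multi-image prompt. It assumes each index is a character offset into the question text at which the corresponding image is inserted. That interpretation, and the `interleave` helper itself, are assumptions for illustration and should be checked against the repository's loader.

```python
from typing import List, Sequence, Union

Part = Union[str, bytes]


def interleave(question: str, images: Sequence[bytes],
               visual_indices: Sequence[int]) -> List[Part]:
    """Assumption: each index is a character offset in `question` for its image."""
    if len(images) != len(visual_indices):
        raise ValueError("need exactly one index per image")
    parts: List[Part] = []
    cursor = 0
    # Stable sort by offset, so images sharing an offset keep their original order.
    for offset, image in sorted(zip(visual_indices, images), key=lambda pair: pair[0]):
        offset = min(max(offset, cursor), len(question))  # clamp to a valid position
        if offset > cursor:
            parts.append(question[cursor:offset])
            cursor = offset
        parts.append(image)
    if cursor < len(question):
        parts.append(question[cursor:])  # trailing text after the last image
    return parts


# Toy prompt with placeholder bytes standing in for encoded images.
first = len("Image 1: ")
second = first + len("Image 2: ")
question = "Image 1: Image 2: Which object matches?"
parts = interleave(question, [b"img1", b"img2"], [first, second])
print(parts)
```

The resulting list can then be mapped onto whatever multi-part message format a given API expects.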

results on the initial release

The Gemini Robotics paper reports ERQA scores for five general-purpose VLMs and for Gemini Robotics-ER itself, both with and without chain-of-thought prompting, with results collected in February 2025. The numbers below are reproduced from Table 2 of the paper. ^[1]

Model	Accuracy without CoT	Accuracy with CoT	CoT gain
Random baseline	25.0%	25.0%	0.0
Claude 3.5 Sonnet	35.5%	45.8%	+10.3
GPT-4o-mini	37.3%	not reported	n/a
GPT-4o	47.0%	50.5%	+3.5
Gemini 2.0 Flash	46.3%	50.3%	+4.0
Gemini 2.0 Pro Experimental	48.3%	54.8%	+6.5
Gemini Robotics-ER	53.3%	not separately reported	n/a

A few patterns stand out. The best general-purpose VLM at release, Gemini 2.0 Pro Experimental with CoT, scored just under 55%, and the authors describe ERQA as the most challenging of the three embodied benchmarks (ERQA, RealWorldQA, BLINK) they evaluated. Chain-of-thought prompting helps every model for which both settings are reported, sometimes by double digits: Claude 3.5 Sonnet, the lowest-scoring of those models without CoT, gains 10.3 points, the largest improvement, although the second-largest gain (6.5 points) goes to Gemini 2.0 Pro Experimental, the strongest general-purpose model without CoT. Gemini Robotics-ER beats every general-purpose VLM in the no-CoT setting, the headline embodied-reasoning claim of the original paper. For comparison, the same paper reports Gemini 2.0 Pro Experimental at 74.5% on RealWorldQA and 65.2% on BLINK, both well above its 48.3% on ERQA. ^[1]
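The CoT gain column is simple arithmetic on the two accuracy columns, which the following snippet recomputes. The scores are copied from the results table in this section, and the variable names are illustrative.

```python
# (no_cot, with_cot) accuracies in percent, copied from the results table.
scores = {
    "Claude 3.5 Sonnet": (35.5, 45.8),
    "GPT-4o": (47.0, 50.5),
    "Gemini 2.0 Flash": (46.3, 50.3),
    "Gemini 2.0 Pro Experimental": (48.3, 54.8),
}
RANDOM_BASELINE = 100.0 / 4  # four equally likely options

# Gain in percentage points from enabling chain-of-thought prompting.
gains = {model: round(with_cot - no_cot, 1) for model, (no_cot, with_cot) in scores.items()}

# Margin of each no-CoT score over uniform random guessing.
margins = {model: round(no_cot - RANDOM_BASELINE, 1) for model, (no_cot, _) in scores.items()}

largest = max(gains, key=gains.get)
for model in scores:
    print(f"{model}: gain {gains[model]:+.1f}, margin over random {margins[model]:+.1f}")
print("largest gain:", largest)
```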

use in later Gemini Robotics releases

In September 2025, Google DeepMind released Gemini Robotics-ER 1.5 alongside Gemini Robotics 1.5. ER 1.5 is described as the high-level planning component of an agentic robot stack: it reasons about a scene, calls digital tools, and produces multi-step plans for a separate VLA model. It is reported to achieve state-of-the-art performance on 15 academic embodied reasoning benchmarks, including ERQA by name. Per-benchmark numbers are not broken out in the press release, but the tech report attributes the gain to better multi-view, pointing, and trajectory understanding, all explicit ERQA categories. ^[4]^[5]

In April 2026, Google DeepMind released Gemini Robotics-ER 1.6 through the Gemini API and Google AI Studio. The 1.6 release focused on instrument reading (a new capability), pointing precision, multi-view success detection, and physical safety. ERQA was not the headline benchmark for ER 1.6, but the skills it measures, especially pointing and multi-view reasoning, remain core to the model's evaluation suite. ^[6]

related benchmarks

Benchmark	Focus	Scale	How it differs from ERQA
RealWorldQA	General real-world VQA from xAI	~700 MCQ	Less robotics specific, no multi-view requirement
BLINK	Core human visual perception	14 MCQ tasks	Pixel-level perception rather than action or task reasoning
RoboVQA	Long-horizon manipulation video	800K+ entries (open ended)	Larger but less focused on multi-image embodied skills
AI2-THOR	Embodied simulation environment	Procedural trajectories	Full agent control rather than question answering
Habitat	Embodied navigation in 3D scans	Procedural trajectories	Navigation only, no manipulation or VQA
Point-Bench	Pointing accuracy on household scenes	Several thousand pointing tasks	Atomic skill that ERQA tests as one of seven categories
ERQA+	Egocentric expanded ERQA-style benchmark	800 mixed-format questions	Reduces data contamination from the original ERQA

ERQA+

In 2025, the FlagEval team at the Beijing Academy of Artificial Intelligence (BAAI) released ERQA+, an enhanced benchmark building on the original ERQA design. ERQA+ doubles the dataset to 800 questions, anchors them on first-person robot video frames rather than web imagery, refines the taxonomy into planning, prediction, perception, and spatial reasoning, and broadens the question types beyond multiple choice (sorting, matching, counting, composite-judgment, open-ended). The reported leaderboard top is 57.3% for Gemini 3 Pro preview, while smaller open-weight models such as Gemma 3 12B reach only around 13.6%. ERQA+ is a follow-on rather than a replacement and credits ERQA explicitly. ^[7]

practical considerations

Component	Description
Questions	400 hand-authored VQA tasks in `data/erqa.tfrecord`
Images	Real photographs and video frames, encoded inside TFRecord
Evaluation code	`eval_harness.py` (Python) with Gemini and OpenAI adapters
Loader example	`loading_example.py` (Python) reference parser
API access	Gemini or OpenAI API key via env vars, CLI flags, or keys file
Compute	Inference only; no training; modest CPU and memory
License	CC BY 4.0

A researcher with a single API key can run the full benchmark in minutes and produce a score that lines up with the published Gemini Robotics numbers. ^[2]

strengths and limitations

ERQA's main selling point is its design rather than its scale. Hand-authoring 400 multiple-choice questions with manual verification produces a dataset that is cheap to run, hard enough to discriminate between flagship VLMs, and structured enough to break down by reasoning category. Real images from established robotics and egocentric video datasets mean questions look like inputs an actual robot would see, which is not always true of synthetic VQA benchmarks. The seven-category taxonomy maps each label to a recognizable robot pipeline component (perception, motion prediction, planning, multi-view fusion). ^[1]^[2]

limitations

Limitation	Impact
400 questions only	Wide confidence intervals on per-category breakdowns
English only	Limits use in non-English research and production
Static dataset	Risk of overfitting and data contamination over time
Multiple choice format	Cannot capture free-form rationales or alternative correct actions
Web-image overlap	Some source datasets may already be in VLM pretraining; motivated the egocentric redesign in ERQA+
No human baseline	The paper does not report human accuracy on ERQA
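The first limitation can be made concrete with a normal-approximation confidence interval for a binomial proportion. The sketch below is a back-of-the-envelope estimate, not an analysis from the paper. The per-category sample size of 57 assumes an even split of 400 questions across seven categories, which the sources cited here do not confirm.

```python
import math


def normal_ci_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for accuracy measured on n items."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# 53.3% is the reported no-CoT score for Gemini Robotics-ER on all 400 questions.
full = normal_ci_half_width(0.533, 400)

# Assumed even split: 400 / 7 is about 57 questions per category (hypothetical).
per_category = normal_ci_half_width(0.533, 57)

print(f"full benchmark: +/- {full * 100:.1f} points")
print(f"one category:   +/- {per_category * 100:.1f} points")
```

Under these assumptions the interval on a single category is more than twice as wide as on the full benchmark, so small per-category differences between models may not be meaningful.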

The Gemini Robotics paper notes that even the strongest model studied still struggles with spatial relationships in long videos and fine-grained robot control, and frames ERQA as a starting point rather than a final exam. ^[1] ERQA+ targets the contamination and shortcut concerns directly with egocentric frames and multi-stage filtering. ^[7]

why ERQA matters

Once flagship VLMs began clearing 70% to 80% on RealWorldQA and saturating older multimodal benchmarks, the field needed a harder, more embodied probe to separate the next generation of models. A 400-question, hand-authored, real-image test that the best public model in early 2025 could only solve a little better than half the time fits that role. ERQA is also one of the first widely cited benchmarks released alongside a robotics-specific frontier model (Gemini Robotics-ER) and a VLA model (Gemini Robotics), a pairing successor releases such as Gemini Robotics-ER 1.5 and 1.6 still cite. ^[1]^[3]^[4]^[5]^[6]

references

1. Gemini Robotics Team, Google DeepMind. *Gemini Robotics: Bringing AI into the Physical World*. arXiv:2503.20020, 25 March 2025. https://arxiv.org/abs/2503.20020
2. Google DeepMind. *embodiedreasoning/ERQA: Embodied Reasoning Question Answer (ERQA) Benchmark*. GitHub repository, 2025. https://github.com/embodiedreasoning/ERQA
3. Google DeepMind. *Gemini Robotics: Bringing AI into the Physical World* (HTML version). arXiv, March 2025. https://arxiv.org/html/2503.20020v1
4. Google DeepMind. *Gemini Robotics 1.5 brings AI agents into the physical world*. DeepMind blog, 25 September 2025. https://deepmind.google/blog/gemini-robotics-15-brings-ai-agents-into-the-physical-world/
5. Google DeepMind. *Gemini Robotics 1.5: Pushing the Frontier of Generalist Robots* (technical report). 2025. https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf
6. Google DeepMind. *Gemini Robotics ER 1.6: Enhanced Embodied Reasoning*. DeepMind blog, 14 April 2026. https://deepmind.google/blog/gemini-robotics-er-1-6/
7. BAAI FlagEval Team. *ERQA+: An Enhanced Benchmark on Embodied Reasoning*. Project page, 2025. https://flageval-baai.github.io/ERQA-Plus-page/
