ERQA
Last reviewed
May 10, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,543 words
| ERQA | |
|---|---|
| Overview | |
| Full name | Embodied Reasoning Question Answering |
| Release date | 2025-03-25 |
| Latest version | 1.0 |
| Authors | Gemini Robotics Team, Google DeepMind |
| Type | Embodied Reasoning, Visual Question Answering, Robotics |
| Modality | Vision, Text |
| Task format | Multiple-choice VQA (4 options: A, B, C, D) |
| Reasoning categories | 7 |
| Total examples | 400 questions |
| Multi-image share | 112 (about 28%) |
| Single-image share | 288 (about 72%) |
| Evaluation metric | Accuracy |
| Languages | English |
| Random baseline | 25.0% |
| Best score (no CoT) | 53.3% (Gemini Robotics-ER) |
| Best general VLM (no CoT) | 48.3% (Gemini 2.0 Pro Experimental) |
| Best score (with CoT) | 54.8% (Gemini 2.0 Pro Experimental) |
| Saturated | No |
| GitHub | embodiedreasoning/ERQA |
| Dataset format | TFRecord |
| Paper | arXiv:2503.20020 |
| License | CC BY 4.0 |
ERQA (Embodied Reasoning Question Answering) is a multimodal benchmark released by Google DeepMind in March 2025 to evaluate the embodied reasoning capabilities of vision-language models (VLMs) on robotics scenes. It consists of 400 multiple-choice Visual Question Answering (VQA) questions covering seven reasoning skills important for an agent acting in the physical world: spatial, trajectory, action, state estimation, pointing, multi-view, and task reasoning. ERQA was open sourced alongside the Gemini Robotics technical report and is the public benchmark most closely associated with Gemini Robotics-ER. [1][2]
ERQA is intentionally complementary to existing VLM benchmarks such as RealWorldQA, BLINK, and MMMU. Rather than testing atomic visual skills like object recognition or counting, it asks compound questions about what an embodied agent should do given a scene, where it should look or move, and how multiple frames relate to each other. At release in March 2025, the strongest scores reported for a general-purpose VLM were 48.3% for Gemini 2.0 Pro Experimental without chain-of-thought and 54.8% with chain-of-thought, well above the 25% random baseline but far from saturation. [1][3]
Most multimodal models in 2024 and early 2025 were evaluated on benchmarks that emphasize atomic capabilities: identifying objects, counting them, locating them, or answering trivia about web images. Those skills are necessary for a robot but not sufficient. A useful embodied agent has to reason about which object to grasp, where a thrown ball will land, whether a cup is empty or full, and what task the human in front of it is doing. The Gemini Robotics team built ERQA to make progress on those compound skills measurable. [1]
The benchmark first appeared in the paper Gemini Robotics: Bringing AI into the Physical World, posted to arXiv on 25 March 2025. The paper introduces two model families. Gemini Robotics is a vision-language-action (VLA) model that turns visual observations and instructions into motor commands. Gemini Robotics-ER is a vision-language model focused on perception and planning: pointing, 2D and 3D bounding boxes, trajectory prediction, multi-view correspondence, and task decomposition. ERQA is one of the public benchmarks used to compare Gemini Robotics-ER to general-purpose VLMs, distributed under CC BY 4.0 through the GitHub repository embodiedreasoning/ERQA, which ships an evaluation harness for the Gemini API and OpenAI API. [1][2]
ERQA is small by modern benchmark standards, and that compactness is deliberate. Every question was hand authored and manually verified so there is exactly one defensible correct answer among the four choices. Each TFRecord example contains a textual question, one or more encoded images, the index positions where each image is interleaved with the prompt, and a single-letter ground truth answer (A, B, C, or D). [1][2]
| Component | Quantity / Description |
|---|---|
| Total questions | 400 multiple-choice VQA, hand authored and human verified |
| Answer options | 4 per question, labeled A, B, C, D |
| Single-image questions | 288 (about 72%) |
| Multi-image questions | 112 (about 28%) |
| Storage format | TFRecord with question, image/encoded, answer, question_type, visual_indices |
| Languages | English only |
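
The interleaving of images and question text can be sketched as follows. This is a minimal illustration rather than the repository's loader: it assumes each entry of `visual_indices` is a character offset into the question text at which the matching image is inserted, and `build_prompt` is a hypothetical helper name. Placeholder byte strings stand in for encoded images so the sketch runs without TensorFlow.

```python
from typing import List, Sequence, Union

Part = Union[str, bytes]


def build_prompt(question: str, images: Sequence[bytes],
                 visual_indices: Sequence[int]) -> List[Part]:
    """Interleave images with question text.

    Assumption: visual_indices[i] is the character offset in `question`
    where images[i] belongs. This is not confirmed against the official
    loader and is used here for illustration only.
    """
    if len(images) != len(visual_indices):
        raise ValueError("need exactly one index per image")

    # Place images in order of their offsets; sorted() is stable, so
    # images that share an offset keep their original relative order.
    pairs = sorted(zip(visual_indices, images), key=lambda pair: pair[0])

    parts: List[Part] = []
    cursor = 0
    for offset, image in pairs:
        offset = max(0, min(offset, len(question)))  # clamp to text bounds
        if offset > cursor:
            parts.append(question[cursor:offset])
            cursor = offset
        parts.append(image)
    if cursor < len(question):
        parts.append(question[cursor:])
    return parts


# Placeholder bytes instead of real encoded images.
example = build_prompt(
    "Compare these views. Which option matches?",
    [b"<img-1>", b"<img-2>"],
    [0, 21],
)
```

Under this assumption, an offset of 0 puts the image before any text, and the text pieces always concatenate back to the original question, so only image placement varies between examples.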
The paper reports that the multi-image subset is consistently harder than the single-image subset across every model evaluated, suggesting current VLMs still struggle to fuse evidence across frames in the way an embodied agent needs. [1]
Questions are tagged with one of seven question_type labels, chosen to mirror the decisions a real robot pipeline has to make rather than an academic taxonomy. [1]
| Category | What it tests | Example |
|---|---|---|
| Spatial reasoning | 3D relationships between objects and the viewer | "Which object is closest to the gripper?" |
| Trajectory reasoning | Motion paths and end positions | "Where will the rolled ball end up?" |
| Action reasoning | Consequences of physical actions | "What happens to the stack if the robot pushes this block?" |
| State estimation | Discrete or continuous object states | "Is the kettle on or off?" |
| Pointing | Identifying an image region from a description | "Point to the handle the robot should grasp." |
| Multi-view reasoning | Aligning entities across frames | "Which object in image 2 corresponds to the one in image 1?" |
| Task reasoning | Identifying the goal or sub-step | "What task is the person most likely doing?" |
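
Because every question carries a `question_type` label, results can be broken down by category. The snippet below is a small sketch of that grouping; `accuracy_by_type` is a hypothetical helper, the label strings are assumed spellings, and the toy records are made up for illustration rather than taken from ERQA.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_type(
    records: Iterable[Tuple[str, str, str]],
) -> Dict[str, float]:
    """Compute accuracy per category.

    Each record is a (question_type, predicted_letter, gold_letter) triple.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for qtype, predicted, gold in records:
        total[qtype] += 1
        # Compare letters case-insensitively and ignore stray whitespace.
        correct[qtype] += int(predicted.strip().upper() == gold.strip().upper())
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Made-up predictions for illustration only; these are not ERQA results.
toy_records = [
    ("Spatial Reasoning", "A", "A"),
    ("Spatial Reasoning", "B", "C"),
    ("Pointing", "d", "D"),
]
per_type = accuracy_by_type(toy_records)
```

With only 400 questions spread over seven labels, each per-category figure rests on a small sample, so breakdowns of this kind are best read alongside the overall score.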
Images come from public robotics and egocentric video datasets, with a small number of custom captures for edge cases. Real footage means questions can lean on textures, occlusions, and clutter that simulator scenes lack. [1][2]
| Source dataset | Contribution |
|---|---|
| Open X-Embodiment (OXE) | Diverse arm, gripper, and table-top robot scenes |
| Universal Manipulation Interface (UMI) | Hand-held gripper, first-person manipulation footage |
| MECCANO | Egocentric assembly video for part identification and step ordering |
| HoloAssist | AR-headset video for multi-view and human-robot collaboration |
| EGTEA Gaze+ | First-person cooking video for long-horizon task and state reasoning |
| Custom captures | Hand-authored scenes filling multi-view pointing and other gaps |
The evaluation protocol is austere: a model sees the prompt and candidate images in the order set by visual_indices, outputs one of A, B, C, or D, and its answer is compared to the ground truth. Accuracy is the only headline metric. The harness also supports a chain-of-thought variant in which the model can reason first; only the final letter is graded. [2]
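The grading step can be sketched in a few lines. This is an assumed heuristic, not the harness's own parser: `final_letter` takes the last standalone uppercase A, B, C, or D in a response, which handles chain-of-thought output that ends with the answer but can be fooled by text that follows the answer.

```python
import re
from typing import Optional, Sequence

# Standalone uppercase option letters only, so the lowercase article "a"
# and letters inside words are ignored.
_LETTER = re.compile(r"\b([ABCD])\b")


def final_letter(response: str) -> Optional[str]:
    """Return the last standalone A/B/C/D in the response, or None."""
    matches = _LETTER.findall(response)
    return matches[-1] if matches else None


def accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of responses whose final letter equals the ground truth."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty set")
    hits = sum(final_letter(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)


# Made-up responses for illustration; an unparseable reply counts as wrong.
score = accuracy(
    ["The gripper is nearest the mug, so the answer is C", "B", "I am not sure."],
    ["C", "A", "D"],
)
```

Counting unparseable replies as incorrect keeps the denominator fixed at the number of questions, which is what makes scores comparable across models.
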
| Aspect | Implementation |
|---|---|
| Format | Multiple choice (A, B, C, D) |
| Manual verification | All 400 questions human verified |
| API support | Native Gemini 2.0 and OpenAI bindings |
| Chain of thought | Optional reasoning prefix |
| Retry mechanism | Configurable backoff on rate limits |
| Random baseline | 25.0% (4-way uniform) |
The eval_harness.py script takes a TFRecord path, an API choice (gemini or openai), a model name, and a sample size. The default model list at release covered gemini-2.0-flash, gemini-2.0-pro, gemini-2.0-pro-exp-02-05, gpt-4o-2024-11-20, and gpt-4o-mini. Multiple API keys can be supplied via a keys file, and the harness retries on resource-exhaustion errors. The visual_indices field controls where images are interleaved with the prompt, which matters for multi-image questions where order is part of the problem. [2]
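Retrying on resource-exhaustion errors usually follows an exponential backoff pattern. The sketch below shows that pattern in general terms and does not reproduce the harness's code: `RateLimitError` is a hypothetical stand-in for whatever exception a provider SDK raises, and `call_with_backoff` and its parameters are illustrative names.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's resource-exhaustion error."""


def call_with_backoff(fn: Callable[[], T], max_retries: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on RateLimitError with exponential backoff."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    attempt = 0
    while True:
        try:
            return fn()
        except RateLimitError:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the error to the caller
            # Double the wait each time and add jitter so that parallel
            # workers do not all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
            attempt += 1


# Usage with a fake model call that fails twice before answering.
attempts = {"n": 0}


def flaky_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("quota exceeded")
    return "B"


delays: List[float] = []
answer = call_with_backoff(flaky_call, sleep=delays.append)  # records delays instead of sleeping
```

Injecting the `sleep` function keeps the example fast to run and makes the delay schedule easy to inspect.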
The Gemini Robotics paper reports ERQA scores for five general-purpose VLMs and for Gemini Robotics-ER itself, both with and without chain-of-thought prompting, with results collected in February 2025. The numbers below are reproduced from Table 2 of the paper. [1]
| Model | Accuracy without CoT | Accuracy with CoT | CoT gain |
|---|---|---|---|
| Random baseline | 25.0% | 25.0% | 0.0 |
| Claude 3.5 Sonnet | 35.5% | 45.8% | +10.3 |
| GPT-4o-mini | 37.3% | not reported | n/a |
| GPT-4o | 47.0% | 50.5% | +3.5 |
| Gemini 2.0 Flash | 46.3% | 50.3% | +4.0 |
| Gemini 2.0 Pro Experimental | 48.3% | 54.8% | +6.5 |
| Gemini Robotics-ER | 53.3% | not separately reported | n/a |
A few patterns stand out. The best general-purpose VLM at release, Gemini 2.0 Pro Experimental with CoT, scored just under 55%, and the authors describe ERQA as the most challenging of the three embodied benchmarks (ERQA, RealWorldQA, BLINK) they evaluated. Chain-of-thought prompting helps every model for which both settings are reported, by between 3.5 and 10.3 points. The largest gain goes to the model with the lowest no-CoT score (Claude 3.5 Sonnet, +10.3), but the pattern is not monotonic: Gemini 2.0 Pro Experimental starts highest among the general-purpose models and still gains 6.5 points, more than GPT-4o or Gemini 2.0 Flash. Gemini Robotics-ER beats every general-purpose VLM in the no-CoT setting, the headline embodied-reasoning claim of the original paper. For comparison, the same paper reports Gemini 2.0 Pro Experimental at 74.5% on RealWorldQA and 65.2% on BLINK, both well above its 48.3% on ERQA. [1]
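The chain-of-thought gains can be recomputed directly from the results table. The snippet below copies the four rows that report both settings and derives the gain and the margin over the random baseline; the variable names are illustrative.

```python
# (accuracy without CoT, accuracy with CoT) in percent, copied from the
# results table for the models that report both settings.
scores = {
    "Claude 3.5 Sonnet": (35.5, 45.8),
    "GPT-4o": (47.0, 50.5),
    "Gemini 2.0 Flash": (46.3, 50.3),
    "Gemini 2.0 Pro Experimental": (48.3, 54.8),
}
RANDOM_BASELINE = 25.0  # four answer options, uniform guessing

# Gain from chain-of-thought, rounded to one decimal like the table.
gains = {model: round(cot - base, 1) for model, (base, cot) in scores.items()}

# How far each model's CoT score sits above chance.
margin_over_chance = {
    model: round(cot - RANDOM_BASELINE, 1) for model, (_, cot) in scores.items()
}

# Print models from largest to smallest CoT gain.
for model, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    base, cot = scores[model]
    print(f"{model:<28} {base:5.1f} -> {cot:5.1f}  (+{gain:.1f})")
```

Sorting by gain puts Claude 3.5 Sonnet first and Gemini 2.0 Pro Experimental second, which shows that a low starting score is not the only driver of improvement.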
In September 2025, Google DeepMind released Gemini Robotics-ER 1.5 alongside Gemini Robotics 1.5. ER 1.5 is described as the high-level planning component of an agentic robot stack: it reasons about a scene, calls digital tools, and produces multi-step plans for a separate VLA model. It is reported to achieve state-of-the-art performance on 15 academic embodied reasoning benchmarks, including ERQA by name. Per-benchmark numbers are not broken out in the press release, but the tech report attributes the gain to better multi-view, pointing, and trajectory understanding, all explicit ERQA categories. [4][5]
In April 2026, Google DeepMind released Gemini Robotics-ER 1.6 through the Gemini API and Google AI Studio. The 1.6 release focused on instrument reading (a new capability), pointing precision, multi-view success detection, and physical safety. ERQA was not the headline benchmark for ER 1.6, but the skills it measures, especially pointing and multi-view reasoning, remain core to the model's evaluation suite. [6]
| Benchmark | Focus | Scale | How it differs from ERQA |
|---|---|---|---|
| RealWorldQA | General real-world VQA from xAI | ~700 MCQ | Less robotics specific, no multi-view requirement |
| BLINK | Core human visual perception | 14 MCQ tasks | Pixel-level perception rather than action or task reasoning |
| RoboVQA | Long-horizon manipulation video | 800K+ entries (open ended) | Larger but less focused on multi-image embodied skills |
| AI2-THOR | Embodied simulation environment | Procedural trajectories | Full agent control rather than question answering |
| Habitat | Embodied navigation in 3D scans | Procedural trajectories | Navigation only, no manipulation or VQA |
| Point-Bench | Pointing accuracy on household scenes | Several thousand pointing tasks | Atomic skill that ERQA tests as one of seven categories |
| ERQA+ | Egocentric expanded ERQA-style benchmark | 800 mixed-format questions | Reduces data contamination from the original ERQA |
In 2025, the FlagEval team at the Beijing Academy of Artificial Intelligence (BAAI) released ERQA+, an enhanced benchmark building on the original ERQA design. ERQA+ doubles the dataset to 800 questions, anchors them on first-person robot video frames rather than web imagery, refines the taxonomy into planning, prediction, perception, and spatial reasoning, and broadens the question types beyond multiple choice to include sorting, matching, counting, composite-judgment, and open-ended questions. The top reported leaderboard score is 57.3% for Gemini 3 Pro preview, while smaller open-weight models such as Gemma 3 12B reach only about 13.6%. ERQA+ is a follow-on rather than a replacement and credits ERQA explicitly. [7]
| Component | Description |
|---|---|
| Questions | 400 hand-authored VQA tasks in data/erqa.tfrecord |
| Images | Real photographs and video frames, encoded inside TFRecord |
| Evaluation code | eval_harness.py (Python) with Gemini and OpenAI adapters |
| Loader example | loading_example.py (Python) reference parser |
| API access | Gemini or OpenAI API key via env vars, CLI flags, or keys file |
| Compute | Inference only; no training; modest CPU and memory |
| License | CC BY 4.0 |
A researcher with a single API key can run the full benchmark in minutes and produce a score that lines up with the published Gemini Robotics numbers. [2]
ERQA's main selling point is its design rather than its scale. Hand-authoring 400 multiple-choice questions with manual verification produces a dataset that is cheap to run, hard enough to discriminate between flagship VLMs, and structured enough to break down by reasoning category. Real images from established robotics and egocentric video datasets mean questions look like inputs an actual robot would see, which is not always true of synthetic VQA benchmarks. The seven-category taxonomy maps each label to a recognizable robot pipeline component (perception, motion prediction, planning, multi-view fusion). [1][2]
| Limitation | Impact |
|---|---|
| 400 questions only | Wide confidence intervals on per-category breakdowns |
| English only | Limits use in non-English research and production |
| Static dataset | Risk of overfitting and data contamination over time |
| Multiple choice format | Cannot capture free-form rationales or alternative correct actions |
| Web-image overlap | Some source datasets may already be in VLM pretraining; motivated the egocentric redesign in ERQA+ |
| No human baseline | The paper does not report human accuracy on ERQA |
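
The effect of the small sample size can be made concrete with a binomial confidence interval. The sketch below uses the Wilson score interval at an accuracy of 54.8%, once for all 400 questions and once for a subset of 57. The figure of 57 assumes an even split across the seven categories, which the source does not state, so it is illustrative only.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    )
    return centre - half_width, centre + half_width


# Full benchmark: 400 questions at the best reported CoT accuracy.
full = wilson_interval(0.548, 400)

# Assumed per-category subset: 400 / 7 is roughly 57 questions.
# The real per-category counts are not given in the source.
per_category = wilson_interval(0.548, 57)

full_width = full[1] - full[0]
category_width = per_category[1] - per_category[0]
```

Under these assumptions the full-benchmark interval spans roughly five points either side of the score, while the 57-question interval is about two and a half times wider, so a gap of a few points between two models within one category is not strong evidence on its own.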
The Gemini Robotics paper notes that even the strongest model studied still struggles with spatial relationships in long videos and fine-grained robot control, and frames ERQA as a starting point rather than a final exam. [1] ERQA+ targets the contamination and shortcut concerns directly with egocentric frames and multi-stage filtering. [7]
Once flagship VLMs began clearing 70% to 80% on RealWorldQA and saturating older multimodal benchmarks, the field needed a harder, more embodied probe to separate the next generation of models. A 400-question, hand-authored, real-image test that the best public model in early 2025 could only solve a little better than half the time fits that role. ERQA is also one of the first widely cited benchmarks released alongside a robotics-specific frontier model (Gemini Robotics-ER) and a VLA model (Gemini Robotics), a pairing successor releases such as Gemini Robotics-ER 1.5 and 1.6 still cite. [1][3][4][5][6]