ERQA
| ERQA | |
|---|---|
| Overview | |
| Full name | Embodied Reasoning Question Answer |
| Abbreviation | ERQA |
| Description | A benchmark for evaluating embodied reasoning capabilities in AI models for robotics applications |
| Release date | 2025-03 |
| Latest version | 1.0 |
| Benchmark updated | 2025-03 |
| Authors | Google DeepMind Research Team |
| Organization | Google DeepMind |
| Technical Details | |
| Type | Embodied Reasoning, Visual Question Answering, Robotics |
| Modality | Vision, Text |
| Task format | Multiple-choice Visual Question Answering |
| Number of tasks | 8 reasoning categories |
| Total examples | 400 questions |
| Evaluation metric | Accuracy |
| Domains | Spatial reasoning, Trajectory reasoning, Action reasoning, State estimation, Pointing, Multi-view reasoning, Task reasoning, World knowledge |
| Languages | English |
| Performance | |
| Human performance | Not specified |
| Baseline | ~25% (random guess) |
| SOTA score | 48.3% (54.8% with chain-of-thought) |
| SOTA model | Gemini 2.0 Pro Experimental |
| SOTA date | 2025-03 |
| Saturated | No |
| Resources | |
| GitHub | Repository |
| Dataset | Download |
| License | Not specified |
ERQA (Embodied Reasoning Question Answer) is a benchmark designed to evaluate artificial intelligence models' ability to understand and reason about physical environments and robotic scenarios. Released in March 2025 by Google DeepMind's Gemini Robotics team[1], ERQA addresses the need for standardized evaluation of embodied AI capabilities in robotics applications. The benchmark consists of 400 curated visual question answering (VQA) tasks that test spatial reasoning, trajectory understanding, and physical world knowledge; the best-reported model, Gemini 2.0 Pro Experimental, achieves 48.3% accuracy (54.8% with chain-of-thought prompting).
Overview
ERQA represents a significant advance in evaluating embodied reasoning: the ability of AI systems to understand and reason about physical environments, spatial relationships, and robotic interactions. Unlike traditional AI benchmarks that focus on abstract reasoning or language understanding, ERQA targets the cognitive capabilities that robots and embodied agents need to operate effectively in real-world environments. The benchmark goes beyond testing atomic capabilities to provide an integrated assessment of how well models understand complex physical scenarios[1].
The benchmark's development was motivated by the growing need for AI systems that can understand and interact with the physical world, particularly in robotics applications. As robots move from controlled industrial settings to dynamic real-world environments, their ability to reason about spatial relationships, predict trajectories, and understand action consequences becomes crucial. ERQA provides a standardized framework for measuring these capabilities across different AI models.
Significance
ERQA's importance in the field of embodied AI stems from several key contributions:
- **Embodied Focus**: Among the first major benchmarks designed specifically to evaluate reasoning for robotics
- **Real-world Relevance**: Uses actual robotics dataset images rather than synthetic data
- **Multi-modal Integration**: Tests combined vision-language understanding in physical contexts
- **Comprehensive Coverage**: Evaluates eight distinct reasoning categories crucial for robotics
- **Standardized Evaluation**: Provides consistent framework for comparing embodied AI capabilities
Dataset Structure
Question Composition
ERQA's 400 questions are carefully structured to evaluate diverse embodied reasoning capabilities[1]:
| Component | Quantity | Description |
|---|---|---|
| **Total Questions** | 400 | Multiple-choice VQA tasks |
| **Answer Options** | 4 per question | Labeled A, B, C, D |
| **Single-image Questions** | 288 (72%) | Reasoning from individual scenes |
| **Multi-image Questions** | 112 (28%) | Cross-image reasoning required |
| **Storage Format** | TFRecord | TensorFlow Examples format |
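Since the questions ship as TensorFlow Examples in a TFRecord file, they can be loaded with `tf.data`. The sketch below is a minimal illustration: the feature keys (`question`, `answer`, `image/encoded`) are assumptions, and the actual schema is defined by the repository's evaluation code.
```python
import tensorflow as tf

# Minimal sketch of loading ERQA-style TFRecord examples.
# The feature keys below are assumed for illustration; consult the
# repository's evaluation code for the real schema.
feature_spec = {
    "question": tf.io.FixedLenFeature([], tf.string),
    "answer": tf.io.FixedLenFeature([], tf.string),
    "image/encoded": tf.io.VarLenFeature(tf.string),  # one or more encoded images
}

def parse_example(serialized):
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    images = tf.sparse.to_dense(parsed["image/encoded"], default_value=b"")
    return parsed["question"], images, parsed["answer"]

dataset = tf.data.TFRecordDataset("erqa.tfrecord").map(parse_example)
for question, images, answer in dataset.take(1):
    print(question.numpy().decode(), "->", answer.numpy().decode())
```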
Reasoning Categories
The benchmark evaluates eight distinct types of embodied reasoning:
| Category | Description | Example Task |
|---|---|---|
| **Spatial Reasoning** | Understanding 3D relationships | "Which object is to the left of the robot?" |
| **Trajectory Reasoning** | Predicting motion paths | "Where will the ball land?" |
| **Action Reasoning** | Understanding action consequences | "What happens if the robot pushes this?" |
| **State Estimation** | Inferring object/environment states | "Is the container full or empty?" |
| **Pointing** | Directional understanding | "What is the robot pointing at?" |
| **Multi-view Reasoning** | Integrating multiple perspectives | "How do these views relate?" |
| **Task Reasoning** | Understanding goal-directed behavior | "What task is being performed?" |
| **World Knowledge** | Applying real-world understanding | "What material is this made of?" |
Data Sources
ERQA incorporates images from multiple prestigious robotics datasets:
| Dataset | Type | Contribution |
|---|---|---|
| **OXE** | Open X-Embodiment | Diverse robotic scenarios |
| **UMI Data** | Universal Manipulation Interface | Manipulation tasks |
| **MECCANO** | Multi-modal dataset | Complex interactions |
| **HoloAssist** | Augmented reality assistance | Human-robot collaboration |
| **EGTEA Gaze+** | Egocentric video | First-person perspectives |
Evaluation Methodology
Evaluation Framework
ERQA employs a rigorous evaluation process[1]:
| Aspect | Implementation | Purpose |
|---|---|---|
| **Format** | Multiple-choice (A/B/C/D) | Standardized comparison |
| **Manual Verification** | All questions human-verified | Quality assurance |
| **API Support** | Gemini 2.0, OpenAI compatible | Flexible testing |
| **Chain-of-Thought** | Optional CoT prompting | Enhanced reasoning |
| **Retry Mechanism** | Built-in error handling | Robust evaluation |
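In practice, the optional chain-of-thought setting only changes the instruction appended to each question. A minimal sketch of prompt assembly follows; the wording is illustrative, not the official evaluation script's.
```python
def build_prompt(question: str, options: dict[str, str], use_cot: bool = False) -> str:
    """Assemble a multiple-choice prompt (illustrative wording only)."""
    lines = [question, ""]
    for letter in ("A", "B", "C", "D"):
        lines.append(f"{letter}. {options[letter]}")
    if use_cot:
        # Optional chain-of-thought instruction.
        lines.append("\nThink step by step, then give your final answer as a single letter.")
    else:
        lines.append("\nAnswer with a single letter (A, B, C, or D).")
    return "\n".join(lines)
```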
Technical Implementation
The evaluation pipeline operates over structured question records such as the following illustrative example:
```python
# Basic evaluation structure: an illustrative question record
example_question = {
    "question": "What action should the robot take?",
    "images": ["image1.jpg", "image2.jpg"],
    "options": {
        "A": "Move forward",
        "B": "Turn left",
        "C": "Grasp object",
        "D": "Stop",
    },
    "correct_answer": "C",
}
```
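Grading then amounts to sending the prompt to a Gemini- or OpenAI-compatible endpoint, retrying on transient errors, and extracting the predicted letter. The sketch below assumes a hypothetical `query_model` function standing in for the actual API client.
```python
import re
import time

def query_model(prompt: str) -> str:
    """Placeholder for a Gemini- or OpenAI-compatible API call (assumed, not real)."""
    raise NotImplementedError

def query_with_retry(prompt: str, attempts: int = 3, delay: float = 2.0) -> str:
    # Simple retry loop standing in for the benchmark's built-in error handling.
    for attempt in range(attempts):
        try:
            return query_model(prompt)
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

def extract_choice(response_text: str) -> str | None:
    # Take the last standalone A-D letter in the response as the predicted answer.
    matches = re.findall(r"\b([ABCD])\b", response_text)
    return matches[-1] if matches else None
```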
Performance Metrics
ERQA uses straightforward accuracy measurement:
- **Primary Metric**: Percentage of correctly answered questions
- **Baseline**: 25% (random guessing among 4 options)
- **Analysis Dimensions**: Performance by category, single vs multi-image
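A minimal scoring sketch follows, assuming each graded question is summarized as a dict with illustrative field names (`category`, `num_images`, `pred`, `gold`); these names are not the repository's schema.
```python
from collections import defaultdict

def score(results: list[dict]) -> dict[str, float]:
    """Overall, per-category, and single-/multi-image accuracy (field names assumed)."""
    total, correct = defaultdict(int), defaultdict(int)
    for r in results:
        keys = (
            "overall",
            r["category"],
            "single-image" if r["num_images"] == 1 else "multi-image",
        )
        for key in keys:
            total[key] += 1
            correct[key] += int(r["pred"] == r["gold"])
    return {key: correct[key] / total[key] for key in total}
```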
Performance Analysis
Current Results (March 2025)
| Model | Accuracy (No CoT) | Accuracy (With CoT) | Improvement |
|---|---|---|---|
| Gemini 2.0 Pro Experimental | 48.3% | 54.8% | +6.5 pts |
| Gemini 2.0 Flash | 46.3% | 50.3% | +4.0 pts |
| Gemini Robotics-ER | Reported as state-of-the-art | – | – |
| Random Baseline | 25.0% | 25.0% | 0.0 pts |
Performance Insights
Key findings from initial evaluations:
| Finding | Implication |
|---|---|
| **Multi-image Challenge** | Questions requiring reasoning across multiple images are significantly harder than single-image questions |
| **CoT Benefit** | Chain-of-thought consistently improves performance |
| **Category Variance** | Some reasoning types more challenging than others |
| **Below 60% Ceiling** | Substantial room for improvement |
Technical Specifications
Repository Structure
| Component | Description | Format |
|---|---|---|
| **Questions** | 400 VQA tasks | TFRecord |
| **Images** | Scene photographs | Various formats |
| **Evaluation Code** | API integration | Python |
| **Documentation** | Usage instructions | Markdown |
Usage Requirements
| Requirement | Specification |
|---|---|
| **Data Format** | TensorFlow Examples |
| **API Access** | Gemini or OpenAI keys |
| **Memory** | Varies by model |
| **Processing** | GPU recommended |
Research Applications
Use Cases
ERQA enables several research directions:
| Application | Description | Impact |
|---|---|---|
| **Robotics Development** | Evaluate perception systems | Better robot understanding |
| **Multi-modal Research** | Test vision-language integration | Improved fusion methods |
| **Spatial AI** | Assess 3D reasoning | Enhanced navigation |
| **Safety Testing** | Evaluate action prediction | Safer robotic systems |
Related Benchmarks
| Benchmark | Focus | Relation to ERQA |
|---|---|---|
| RoboVQA | Robot-specific VQA | Similar domain, different scale |
| AI2-THOR | Embodied AI simulation | Virtual vs real images |
| Habitat | Navigation tasks | Narrower focus |
| ERQA | Comprehensive embodied reasoning | Broader evaluation |
Limitations and Future Work
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| **Scale** | 400 questions | Statistical limitations |
| **English Only** | Single language | Limited accessibility |
| **Static Dataset** | Fixed questions | Potential overfitting |
| **Multiple Choice** | Limited format | May not capture full reasoning |
Future Directions
Potential improvements include:
1. **Dataset Expansion**: Increasing to thousands of questions
2. **Multi-lingual Support**: Adding other languages
3. **Free-form Answers**: Beyond multiple choice
4. **Dynamic Generation**: Procedural question creation
5. **Human Baselines**: Establishing human performance metrics
Impact
ERQA represents a crucial step forward in evaluating AI systems' ability to understand and reason about the physical world, a fundamental requirement for advancing robotics and embodied AI. By providing a standardized benchmark specifically designed for embodied reasoning, it enables systematic comparison and improvement of models intended for real-world robotic applications. The benchmark's focus on practical scenarios from actual robotics datasets, combined with its comprehensive coverage of eight reasoning categories, makes it an essential tool for developing AI systems capable of physical world interaction.
The relatively low performance of current state-of-the-art models (54.8% with chain-of-thought) highlights the significant challenges remaining in embodied AI and the importance of continued research in this area. As robots increasingly operate in human environments, benchmarks like ERQA will be crucial for ensuring these systems can reliably understand and reason about the physical world around them.
See Also
- Embodied AI
- Visual Question Answering
- Robotics
- Google DeepMind
- Spatial Reasoning
- Gemini Models
- Physical World Understanding
References
- [1] Google DeepMind. (2025). "ERQA: Embodied Reasoning Question Answer Benchmark". GitHub. Retrieved from https://github.com/embodiedreasoning/ERQA