SIMPLER

Updated Jun 28, 2026

What is SIMPLER?

SIMPLER (Simulated Manipulation Policy Evaluation for Real Robot Setups) is a collection of simulated robot manipulation environments, released in 2024, that reproduce the conditions of specific real-robot evaluations so that a policy's success rate in simulation correlates strongly with its success rate on the matching physical robot ^[1]^[2]. It was built to make the evaluation of generalist manipulation policies, especially vision-language-action models, cheap, fast, and reproducible, replacing slow and hard-to-reproduce real-robot testing with a simulated proxy. SIMPLER covers two widely used real-robot platforms, the Google robot used for the RT-1 and RT-2 experiments and the WidowX arm used in the BridgeData V2 collection, and exposes them through a uniform OpenAI Gym-style interface ^[1]^[2].

The benchmark was introduced in the paper "Evaluating Real-World Robot Manipulation Policies in Simulation" by Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, and 11 collaborators, and was presented at the Conference on Robot Learning (CoRL) 2024 ^[1]^[6]. For the Google robot under its highest-fidelity setup, the authors report a Pearson correlation of 0.924 between simulated and real per-task success rates, far above the 0.308 obtained by a validation-loss baseline ^[1]. SIMPLER has since become one of the standard low-cost evaluation suites for robot learning, routinely reported alongside LIBERO in papers on generalist manipulation policies such as OpenVLA, Octo, pi-0, and SpatialVLA ^[3]^[4]^[5].

Why was SIMPLER created? The cost of real-robot evaluation

Building generalist robot manipulation policies has advanced quickly, but evaluating them reliably has remained a bottleneck. Running a policy on a physical robot is slow, expensive, and operationally demanding: each trial requires a working robot, careful scene resetting between attempts, and human supervision, and obtaining statistically meaningful success rates can require hundreds of trials per policy ^[1]^[2]. Worse, real-robot results are hard to reproduce across laboratories. Small differences in lighting, camera placement, object instances, robot calibration, and background clutter mean that two groups evaluating the "same" task often cannot directly compare their numbers. This makes it difficult to track genuine progress or to fairly rank competing methods ^[1]. As the authors put it, "real-world evaluation of such policies is not scalable and faces reproducibility challenges, which are likely to worsen as policies broaden the spectrum of tasks they can perform" ^[1].

Pure simulation is the obvious alternative because it is fast, parallelizable, and perfectly reproducible. However, naive simulation suffers from a large sim-to-real gap. A simulated scene that looks and behaves differently from the real setup will produce success rates that do not predict real-world performance, so simulated scores can be misleading. SIMPLER was motivated by the observation that this gap has two main components, a visual gap (the rendered camera image differs from the real camera image the policy was trained on) and a control gap (the simulated robot's response to an action command differs from the real robot's), and that reducing both is necessary before a simulated benchmark can serve as a trustworthy stand-in for real evaluation ^[1]^[2]. The paper's central claim is that this can be done without expensive digital twins: "We demonstrate that simulation-based evaluation can be a scalable, reproducible, and reliable proxy for real-world evaluation" ^[1].

How does SIMPLER work? Matching real setups

Rather than attempting to build a fully detailed digital twin of each robot cell, the SIMPLER authors developed targeted techniques to shrink the visual and control gaps so that simulated and real success rates line up. The environments are implemented on top of the SAPIEN physics engine and the CPU-based ManiSkill2 manipulation framework, with a later GPU-accelerated ManiSkill3 implementation that the maintainers report runs roughly 10-15 times faster than the ManiSkill2 version ^[2]^[6].

To reduce the visual gap, SIMPLER uses an approach the authors call visual matching. As the paper describes it, visual matching consists of "(1) green screening, i.e. segmenting out interactive simulated assets and overlaying them onto real-world backgrounds; and (2) texture matching, which involves projecting real object textures onto simulation assets and tuning robot arm colors using real videos" ^[1]. The result is a rendered observation that closely mimics what the policy saw during real-world data collection ^[1]^[2].
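The green-screening step described above amounts to a per-pixel composite: wherever a segmentation mask marks a simulated foreground asset, keep the rendered pixel, and everywhere else show the real background photo. The following is a minimal sketch of that idea only, using plain Python lists as images. The function name, the image representation, and the tiny 2x2 example are illustrative assumptions, not SIMPLER's actual rendering code.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def green_screen_composite(
    sim_rgb: Image, fg_mask: List[List[bool]], real_bg: Image
) -> Image:
    """Overlay simulated foreground pixels onto a real background image.

    fg_mask[y][x] is True where the simulator rendered an interactive asset
    (robot, objects); those pixels come from sim_rgb, the rest from real_bg.
    """
    if not (len(sim_rgb) == len(fg_mask) == len(real_bg)):
        raise ValueError("all inputs must have the same height")
    out: Image = []
    for sim_row, mask_row, bg_row in zip(sim_rgb, fg_mask, real_bg):
        if not (len(sim_row) == len(mask_row) == len(bg_row)):
            raise ValueError("all inputs must have the same width")
        out.append([s if m else b for s, m, b in zip(sim_row, mask_row, bg_row)])
    return out


# Tiny 2x2 example with made-up colors.
ROBOT = (200, 200, 200)   # simulated robot pixel
EMPTY = (0, 0, 0)         # simulator background, to be discarded
TABLE = (90, 60, 30)      # pixel from a real photo of the scene

sim = [[ROBOT, EMPTY], [EMPTY, ROBOT]]
mask = [[True, False], [False, True]]
bg = [[TABLE, TABLE], [TABLE, TABLE]]

composite = green_screen_composite(sim, mask, bg)
```

In practice an array library would do this in one vectorized operation; the list version is used here only to keep the sketch dependency-free.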

To reduce the control gap, SIMPLER applies system identification (SysID), tuning the simulated robot's dynamics and controller so that a given action command produces a motion comparable to the real robot's response. Together, visual matching and system identification let the simulated environment behave enough like the real cell that policy rankings carry over ^[1].
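System identification can be illustrated with a toy problem: pick the simulated controller parameter that makes the simulated trajectory best match a logged real trajectory for the same commands. The sketch below assumes a one-dimensional first-order tracking model with a single gain and a simple grid search. The model, the parameter, and the search method are all hypothetical simplifications chosen for illustration; the text above does not specify the procedure the authors used.

```python
from typing import List, Sequence


def simulate(gain: float, targets: Sequence[float], x0: float = 0.0) -> List[float]:
    """Roll out a first-order tracking model: x <- x + gain * (target - x)."""
    xs, x = [], x0
    for target in targets:
        x = x + gain * (target - x)
        xs.append(x)
    return xs


def trajectory_error(sim_traj: Sequence[float], real_traj: Sequence[float]) -> float:
    """Mean squared difference between simulated and real positions."""
    return sum((s - r) ** 2 for s, r in zip(sim_traj, real_traj)) / len(real_traj)


def identify_gain(
    targets: Sequence[float], real_traj: Sequence[float], candidates: Sequence[float]
) -> float:
    """Return the candidate gain whose rollout is closest to the real trajectory."""
    return min(
        candidates,
        key=lambda g: trajectory_error(simulate(g, targets), real_traj),
    )


# Stand-in for a logged real-robot response to a constant target command.
targets = [1.0] * 10
real_traj = simulate(0.35, targets)

candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
best_gain = identify_gain(targets, real_traj, candidates)
```

Here the "real" trajectory is generated by the same model with a gain of 0.35, so the search recovers that value; with genuine robot logs the minimum error would be nonzero and the fitted parameters only approximate.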

SIMPLER provides two evaluation methodologies that trade off fidelity against robustness ^[1]^[2]:

Setup	Description	Purpose
Visual Matching	Closely replicates the real scene appearance via green screening and texture matching	High-fidelity, low-gap evaluation that aligns sim and real per task
Variant Aggregation	Averages success across many environment variants (different backgrounds, lighting, distractors, table textures, camera poses)	Robustness check that is less sensitive to any single visual choice
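The Variant Aggregation row can be read as a simple average over per-variant success rates. The sketch below assumes that reading (an unweighted mean of each variant's success rate); the variant names and episode outcomes are made up for illustration and do not come from the benchmark.

```python
from statistics import mean
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of episodes that succeeded."""
    return sum(outcomes) / len(outcomes)


def variant_aggregation(outcomes_by_variant: Dict[str, List[bool]]) -> float:
    """Unweighted mean of the per-variant success rates."""
    return mean(success_rate(o) for o in outcomes_by_variant.values())


# Hypothetical episode outcomes for one policy on one task.
outcomes_by_variant = {
    "base": [True, True, True, False],
    "dark_lighting": [True, False, True, False],
    "new_background": [True, True, False, False],
    "extra_distractors": [False, False, True, False],
}

aggregated = variant_aggregation(outcomes_by_variant)  # 0.5 for this data
```

Averaging per-variant rates rather than pooling all episodes keeps a variant with many trials from dominating the score, which is one plausible design choice under the assumption above.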

Which policies and robots does SIMPLER evaluate?

The two real-robot platforms come with their own task suites. For the Google robot, the tasks mirror the original RT-1 and RT-2 evaluations: pick up a Coke can, move one object near another ("move near"), open and close a drawer, and place an object (such as an apple) into a drawer ^[1]^[2]^[6]. For the WidowX and BridgeData V2 setup, the suite includes putting a spoon on a towel, putting a carrot on a plate, stacking a green block on a yellow block, and putting an eggplant into a yellow basket in a sink scene ^[2]^[6]. The four core Bridge environment identifiers are spoon_on_towel, carrot_on_plate, stack_cube, and put_eggplant_in_basket ^[6]. All environments share a common Gym API, and SIMPLER ships open-source inference code for the reference policies along with a guide for adding new policies and new tasks ^[2]^[6].
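A Gym-style interface means every task is driven by the same reset-and-step loop, so one evaluation function can score any policy on any environment. The sketch below shows that loop against a toy stand-in environment. `ToyReachEnv`, its observation dictionary, and both policies are hypothetical and exist only to make the example runnable; they are not SimplerEnv classes, and the real package's constructor and observation format should be taken from its documentation.

```python
import random
from typing import Callable, Dict, Tuple

Obs = Dict[str, float]


class ToyReachEnv:
    """Hypothetical 1-D reaching task exposing a Gym-style reset/step API."""

    def __init__(self, max_steps: int = 20, tol: float = 0.05):
        self.max_steps, self.tol = max_steps, tol
        self._rng = random.Random()

    def reset(self, seed: int = 0) -> Tuple[Obs, dict]:
        self._rng.seed(seed)
        self.pos, self.t = 0.0, 0
        self.goal = self._rng.uniform(0.5, 1.0)
        return self._obs(), {}

    def step(self, action: float):
        self.pos += max(-0.1, min(0.1, action))  # clip the commanded delta
        self.t += 1
        success = abs(self.goal - self.pos) < self.tol
        truncated = self.t >= self.max_steps and not success
        return self._obs(), float(success), success, truncated, {"success": success}

    def _obs(self) -> Obs:
        return {"pos": self.pos, "goal": self.goal}


def evaluate(env, policy: Callable[[Obs], float], episodes: int) -> float:
    """Run `episodes` rollouts and return the fraction that ended in success."""
    successes = 0
    for episode in range(episodes):
        obs, _ = env.reset(seed=episode)
        terminated = truncated = False
        while not (terminated or truncated):
            obs, _, terminated, truncated, info = env.step(policy(obs))
        successes += int(info["success"])
    return successes / episodes


def proportional_policy(obs: Obs) -> float:
    return obs["goal"] - obs["pos"]  # move toward the goal


def idle_policy(obs: Obs) -> float:
    return 0.0  # never moves, so it never reaches the goal


good_rate = evaluate(ToyReachEnv(), proportional_policy, episodes=10)
idle_rate = evaluate(ToyReachEnv(), idle_policy, episodes=10)
```

Because `evaluate` only touches `reset` and `step`, swapping in a different task or policy requires no change to the loop, which is the practical benefit of a shared API.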

The reference policies shipped and validated with the benchmark are RT-1, RT-1-X, RT-2-X, and Octo, spanning both the Google robot and WidowX platforms ^[1]^[6]. These provide the paired sim-and-real anchor points against which new policies are compared.

How accurate is SIMPLER? Sim-to-real correlation

The core scientific claim of SIMPLER is that performance inside the simulated environments tracks performance on the real robots. The authors validated this through paired sim-and-real evaluations, running the same set of policies, including RT-1, RT-1-X, RT-2-X, and Octo, in both SIMPLER and on the corresponding physical robots and then comparing the outcomes ^[1]^[2].

To quantify alignment, the paper reports two metrics. The first is the Pearson correlation coefficient between simulated and real per-task success rates. The second is a metric the authors introduced called Mean Maximum Rank Violation (MMRV), which measures how often the simulated evaluation ranks two policies in the wrong order relative to their real-world performance, weighting each violation by the real-world performance margin between the mis-ordered policies. MMRV directly captures the property that matters most in practice: whether the benchmark would lead a researcher to pick the better policy ^[1]. A good simulated evaluation pipeline has a high Pearson r and a low MMRV ^[1].
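Both metrics are short to compute from paired success rates. The sketch below implements Pearson r directly and implements MMRV as the name and description suggest: for each policy, take the largest real-world margin among the policies it is mis-ordered with in simulation, then average over policies. That MMRV formula is this article's reading of the description above, so check it against the paper ^[1] before relying on it. The success rates in the example are invented.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def mmrv(sim: Sequence[float], real: Sequence[float]) -> float:
    """Mean over policies of the worst margin-weighted rank violation.

    A pair (i, j) is a violation when simulation orders the two policies
    differently from reality; its weight is |real[i] - real[j]|.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            flipped = (sim[i] < sim[j]) != (real[i] < real[j])
            if flipped:
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


# Invented success rates for four policies; simulation swaps the top two.
real_success = [0.8, 0.6, 0.3, 0.1]
sim_success = [0.7, 0.75, 0.2, 0.1]

r = pearson_r(sim_success, real_success)
violation = mmrv(sim_success, real_success)
```

In this example the correlation stays high even though the two best policies are swapped, while MMRV comes out at about 0.1 because two of the four policies each incur a 0.2 real-world margin. This is the kind of ranking error that Pearson r alone can hide.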

Across the evaluated policies and tasks, SIMPLER showed strong agreement with reality. For the Google robot under the Visual Matching setup, the paper reports an average Pearson correlation of 0.924 and an MMRV of 0.056, compared with 0.308 and 0.375 respectively for a validation-loss (MSE) baseline, indicating that the simulated rankings closely matched the real ones while a naive offline metric did not ^[1]. Correlations for the BridgeData V2 (WidowX) tasks were also strong, though they varied more across individual tasks ^[1]^[2]. The authors further report that "SIMPLER evaluations accurately reflect real-world policy behavior modes such as sensitivity to various distribution shifts" ^[1]. These results support using SIMPLER as a low-cost screening tool: a policy that performs well in SIMPLER is likely, though not guaranteed, to perform well on the matching real robot.

Setup (Google robot)	Pearson r	MMRV
SIMPLER Visual Matching	0.924	0.056
Validation MSE baseline	0.308	0.375

How is SIMPLER used as a VLA benchmark?

Because it removes the need for physical hardware while remaining predictive, SIMPLER was quickly adopted as a default evaluation for vision-language-action models. The benchmark is commonly cited as "SimplerEnv," after its software repository and project page ^[2]^[6].

OpenVLA, an open-source 7-billion-parameter VLA released in 2024, used SIMPLER among its evaluations and reported outperforming the much larger 55-billion-parameter RT-2-X across the WidowX and Google robot tasks, illustrating how the benchmark lets smaller open models be compared against proprietary baselines without real-robot access ^[3]. Other models, including Octo, the pi-0 (also written pi-zero) flow-based policy from Physical Intelligence, SpatialVLA, and CogACT, have likewise reported SIMPLER numbers, typically broken out across the individual Google robot and WidowX tasks and aggregated under the Visual Matching and Variant Aggregation settings ^[3]^[4]^[5]. In these comparisons SIMPLER serves as the sim-to-real-aligned counterpart to in-simulation-only suites, giving readers a sense of how a policy might transfer to the physical platforms it was modeled on.

SIMPLER is a proxy for real evaluation, not a replacement. Reported success rates depend on the specific checkpoint, action space, and inference settings used, and users have reported policies scoring lower in SIMPLER than expected, for example OpenVLA on the Bridge tasks, which keeps real-robot confirmation relevant for high-stakes claims ^[3]^[7].

How does SIMPLER relate to other robot benchmarks?

SIMPLER occupies a specific niche in the landscape of manipulation benchmarks. It is most often paired with LIBERO, a simulated benchmark built on the robosuite and MuJoCo stack that offers task suites for testing spatial understanding, object generalization, goal-conditioned behavior, and long-horizon tasks. The two are complementary: LIBERO probes a policy's breadth across many procedurally varied simulated tasks, while SIMPLER probes how well simulated success predicts real success on a small number of carefully matched real-robot tasks. Many VLA papers report both, using LIBERO for capability coverage and SIMPLER for sim-to-real alignment ^[4]^[5].

SIMPLER is also tightly connected to the data and policies of the Open X-Embodiment collaboration. The Google robot tasks correspond to the Fractal/RT-1 data, and the WidowX tasks correspond to BridgeData V2, both of which are major components of the Open X-Embodiment dataset used to train cross-embodiment policies such as RT-1-X and RT-2-X ^[1]^[2]. By mirroring those exact platforms, SIMPLER provides a natural simulated testbed for policies trained on that data. In the broader picture, SIMPLER sits alongside real-robot evaluation as the cheap front end of a two-stage process: rapid, reproducible screening in simulation, followed by confirmation on the physical robot when it counts ^[1]^[2].

References

1. Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, Ted Xiao. "Evaluating Real-World Robot Manipulation Policies in Simulation." arXiv:2405.05941, May 2024 (CoRL 2024). https://arxiv.org/abs/2405.05941
2. SIMPLER project page, "Evaluating Real-World Robot Manipulation Policies in Simulation." https://simpler-env.github.io/
3. Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, et al. "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv:2406.09246, 2024. https://arxiv.org/abs/2406.09246
4. Delin Qu, et al. "SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model." arXiv:2501.15830, 2025. https://arxiv.org/abs/2501.15830
5. Qixiu Li, et al. "CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation." arXiv:2411.19650, 2024. https://cogact.github.io/
6. SimplerEnv GitHub repository, simpler-env/SimplerEnv. https://github.com/simpler-env/SimplerEnv
7. SimplerEnv GitHub issues, "Poor performance of OpenVLA on Bridge" (Issue #78). https://github.com/simpler-env/SimplerEnv/issues/78
