SIMPLER
Last reviewed
Jun 8, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,616 words
SIMPLER (Simulated Manipulation Policy Evaluation Environments for Real Robot Setups) is a collection of simulated robot manipulation environments, released in 2024, that are built to reproduce the conditions of specific real-robot evaluations. Its central goal is to provide a cheap, fast, and reproducible proxy for real-world testing of vision-language-action models and other manipulation policies: a policy's success rate inside SIMPLER is designed to correlate strongly with the success rate the same policy would achieve on the corresponding physical robot [1][2]. SIMPLER covers two widely used real-robot platforms, the Google robot used for the RT-1 and RT-2 experiments and the WidowX arm used in the BridgeData V2 collection, and exposes them through a uniform OpenAI Gym-style interface [2].
The benchmark was introduced in the paper "Evaluating Real-World Robot Manipulation Policies in Simulation" by Xuanlin Li, Kyle Hsu, Jiayuan Gu, and collaborators, and was presented at the Conference on Robot Learning (CoRL) 2024 [1][2]. It has since become one of the standard low-cost evaluation suites for robot learning, routinely reported alongside LIBERO in papers on generalist manipulation policies such as OpenVLA, Octo, pi-0, and SpatialVLA [3][4][5].
Building generalist robot manipulation policies has advanced quickly, but evaluating them reliably has remained a bottleneck. Running a policy on a physical robot is slow, expensive, and operationally demanding: each trial requires a working robot, careful scene resetting between attempts, and human supervision, and obtaining statistically meaningful success rates can require hundreds of trials per policy [1][2]. Worse, real-robot results are hard to reproduce across laboratories. Small differences in lighting, camera placement, object instances, robot calibration, and background clutter mean that two groups evaluating the "same" task often cannot directly compare their numbers. This makes it difficult to track genuine progress or to fairly rank competing methods [1].
Pure simulation is the obvious alternative because it is fast, parallelizable, and perfectly reproducible. However, naive simulation suffers from a large sim-to-real gap. A simulated scene that looks and behaves differently from the real setup will produce success rates that do not predict real-world performance, so simulated scores can be misleading. SIMPLER was motivated by the observation that this gap has two main components, a visual gap (the rendered camera image differs from the real camera image the policy was trained on) and a control gap (the simulated robot's response to an action command differs from the real robot's), and that reducing both is necessary before a simulated benchmark can serve as a trustworthy stand-in for real evaluation [1][2].
Rather than attempting to build a fully detailed digital twin of each robot cell, the SIMPLER authors developed targeted techniques to shrink the visual and control gaps so that simulated and real success rates line up. The environments are implemented on top of the SAPIEN physics engine and the ManiSkill2 manipulation framework, with a later GPU-accelerated ManiSkill3 implementation [2].
To reduce the visual gap, SIMPLER uses an approach the authors call visual matching. This combines "green screening," in which the simulated foreground objects and robot are composited over the real-world background image, with texture matching, in which the textures of the simulated objects and robot are tuned to resemble their real counterparts. The result is a rendered observation that closely mimics what the policy saw during real-world data collection [1][2].
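The compositing step of green screening can be illustrated with a minimal sketch. It assumes the simulator provides a per-pixel foreground mask for the robot and objects; the function name `green_screen_composite` and the tiny nested-list images are hypothetical and are not taken from the SIMPLER code base.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[bool]]


def green_screen_composite(rendered: Image, foreground_mask: Mask,
                           real_background: Image) -> Image:
    """Overlay simulated foreground pixels on a real background photo.

    foreground_mask[y][x] is True where the simulator rendered the robot or
    an object, and False where the real background should show through.
    """
    height, width = len(rendered), len(rendered[0])
    if len(real_background) != height or len(real_background[0]) != width:
        raise ValueError("rendered image and background must have the same size")
    return [
        [
            rendered[y][x] if foreground_mask[y][x] else real_background[y][x]
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy 2x2 example (hypothetical data): one simulated object pixel, top-left.
sim_render = [[(200, 30, 30), (0, 0, 0)], [(0, 0, 0), (0, 0, 0)]]
mask = [[True, False], [False, False]]
background = [[(90, 90, 90), (91, 91, 91)], [(92, 92, 92), (93, 93, 93)]]
observation = green_screen_composite(sim_render, mask, background)
```

In this sketch the policy would receive `observation`, in which only the masked pixel comes from the simulator and every other pixel comes from the real photograph.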
To reduce the control gap, SIMPLER applies system identification (SysID), tuning the simulated robot's dynamics and controller so that a given action command produces a motion comparable to the real robot's response. Together, visual matching and system identification let the simulated environment behave enough like the real cell that policy rankings carry over [1].
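The idea behind system identification can be shown on a toy problem. The sketch below is an illustration under strong assumptions, not the authors' procedure: it models a single unit-mass joint driven by a PD controller and uses a plain grid search to pick the stiffness and damping whose simulated trajectory best matches a logged one. All names (`simulate_joint`, `identify_gains`) and all numbers are hypothetical, and the "real" trajectory is itself generated by the toy model as a stand-in for robot logs.

```python
from itertools import product
from typing import List, Sequence, Tuple


def simulate_joint(targets: Sequence[float], stiffness: float,
                   damping: float, dt: float = 0.01) -> List[float]:
    """Roll out a unit-mass 1-D joint tracking `targets` with a PD controller."""
    position, velocity = 0.0, 0.0
    trajectory = []
    for target in targets:
        acceleration = stiffness * (target - position) - damping * velocity
        velocity += acceleration * dt  # semi-implicit Euler integration
        position += velocity * dt
        trajectory.append(position)
    return trajectory


def trajectory_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared position differences between two trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def identify_gains(targets: Sequence[float], real_trajectory: Sequence[float],
                   stiffness_grid: Sequence[float],
                   damping_grid: Sequence[float]) -> Tuple[float, float]:
    """Return the (stiffness, damping) pair with the lowest tracking error."""
    return min(
        product(stiffness_grid, damping_grid),
        key=lambda gains: trajectory_error(
            simulate_joint(targets, *gains), real_trajectory
        ),
    )


# Stand-in for a logged real-robot response to a step command (assumption).
targets = [1.0] * 200
real_trajectory = simulate_joint(targets, stiffness=120.0, damping=18.0)

best_stiffness, best_damping = identify_gains(
    targets, real_trajectory,
    stiffness_grid=[60.0, 90.0, 120.0, 150.0],
    damping_grid=[6.0, 12.0, 18.0, 24.0],
)
```

A real setup would replace the grid search with a more capable optimizer and the toy joint with the full simulated arm, but the objective has the same shape: choose the controller parameters that minimize the mismatch between simulated and recorded motion.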
SIMPLER provides two evaluation methodologies that trade off fidelity against robustness [1][2]:
| Setup | Description | Purpose |
|---|---|---|
| Visual Matching | Closely replicates the real scene appearance via green screening and texture matching | High-fidelity, low-gap evaluation that aligns sim and real per task |
| Variant Aggregation | Averages success across many environment variants (different backgrounds, lighting, distractors, table textures, camera poses) | Robustness check that is less sensitive to any single visual choice |
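The aggregation in the second row amounts to averaging per-variant success rates. The sketch below assumes each variant is weighted equally, which may differ from the weighting the authors use; the variant names and episode outcomes are made up for illustration.

```python
from statistics import mean
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Fraction of episodes that ended in success."""
    return sum(outcomes) / len(outcomes)


def variant_aggregation(results: Dict[str, List[bool]]) -> float:
    """Mean of per-variant success rates, with every variant weighted equally."""
    return mean(success_rate(outcomes) for outcomes in results.values())


# Hypothetical per-episode outcomes for one policy on one task.
results = {
    "base": [True, True, False, True],
    "darker_lighting": [True, False, False, True],
    "new_table_texture": [True, True, True, True],
    "extra_distractors": [False, False, True, False],
}
aggregate = variant_aggregation(results)
```

Because the score is an average over visual conditions, a policy that succeeds only under one particular background or lighting choice is pulled down, which is the robustness property the table describes.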
The two real-robot platforms come with their own task suites. For the Google robot, the tasks mirror the original RT-1 and RT-2 evaluations: pick up a Coke can, move one object near another ("move near"), open and close a drawer, and place an object (such as an apple) into a drawer [1][2][6]. For the WidowX and BridgeData V2 setup, the suite includes putting a spoon on a towel, putting a carrot on a plate, stacking a green block on a yellow block, and putting an eggplant into a yellow basket in a sink scene [2][6]. All environments share a common Gym API, and SIMPLER ships open-source inference code for the reference policies along with a guide for adding new policies and new tasks [2].
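An evaluation loop against a Gym-style interface might look like the following sketch. `ToyPickEnv` is a hypothetical stand-in written for this example, not a SIMPLER environment, and the five-value `step` return (observation, reward, terminated, truncated, info) is assumed to follow the Gymnasium convention; a real run would construct the environment through the SimplerEnv package instead.

```python
import random
from typing import Callable, Dict, Tuple

Observation = Dict[str, float]


class ToyPickEnv:
    """Hypothetical stand-in for a Gym-style manipulation environment."""

    def __init__(self, max_steps: int = 20, seed: int = 0) -> None:
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0
        self.gripper_height = 0.0

    def reset(self) -> Tuple[Observation, Dict]:
        self.steps = 0
        self.gripper_height = self.rng.uniform(0.2, 0.4)
        return {"gripper_height": self.gripper_height}, {}

    def step(self, action: float):
        self.steps += 1
        self.gripper_height = max(0.0, self.gripper_height + action)
        success = self.gripper_height <= 0.01  # gripper reached the object
        truncated = self.steps >= self.max_steps and not success
        obs = {"gripper_height": self.gripper_height}
        return obs, float(success), success, truncated, {"success": success}


def evaluate(env, policy: Callable[[Observation], float], episodes: int) -> float:
    """Run `episodes` rollouts and return the fraction that succeed."""
    successes = 0
    for _ in range(episodes):
        obs, _ = env.reset()
        terminated = truncated = False
        info: Dict = {}
        while not (terminated or truncated):
            obs, _reward, terminated, truncated, info = env.step(policy(obs))
        successes += bool(info.get("success", False))
    return successes / episodes


def lower_gripper(obs: Observation) -> float:
    """Trivial policy: always move the gripper down by 5 cm."""
    return -0.05


success = evaluate(ToyPickEnv(), lower_gripper, episodes=10)
```

Because `evaluate` depends only on `reset` and `step`, the same loop can score any policy on any environment that honors the shared interface, which is the practical benefit of a common API across the Google robot and WidowX suites.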
The core scientific claim of SIMPLER is that performance inside the simulated environments tracks performance on the real robots. The authors validated this through paired sim-and-real evaluations, running the same set of policies, including RT-1, RT-1-X, RT-2-X, and Octo, in both SIMPLER and on the corresponding physical robots and then comparing the outcomes [1][2].
To quantify alignment, the paper reports two metrics. The first is the Pearson correlation coefficient between simulated and real per-task success rates. The second is a metric the authors introduced called Mean Maximum Rank Violation (MMRV), which measures how often the simulated evaluation ranks two policies in the wrong order relative to their real-world performance, weighting each violation by the real-world performance margin between the mis-ordered policies. MMRV directly captures the property that matters most in practice, whether the benchmark would lead a researcher to pick the better policy [1].
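Both metrics are short to compute. The sketch below assumes the following reading of MMRV: for policies i and j with real success rates R and simulated success rates S, a rank violation occurs when the simulated ordering of the pair disagrees with the real ordering, it is weighted by |R_i - R_j|, and MMRV is the mean over policies of each policy's largest violation. The success rates are invented for illustration.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


def mmrv(real: Sequence[float], sim: Sequence[float]) -> float:
    """Mean Maximum Rank Violation under the definition assumed above."""
    n = len(real)
    total = 0.0
    for i in range(n):
        worst = 0.0
        for j in range(n):
            # The pair is mis-ordered if sim and real disagree on who is worse.
            mis_ordered = (sim[i] < sim[j]) != (real[i] < real[j])
            if mis_ordered:
                worst = max(worst, abs(real[i] - real[j]))
        total += worst
    return total / n


# Hypothetical success rates for four policies on one task.
real_success = [0.9, 0.6, 0.3, 0.1]
sim_success = [0.8, 0.4, 0.5, 0.1]  # the second and third policies are swapped

correlation = pearson(real_success, sim_success)
violation = mmrv(real_success, sim_success)
```

Here only the second and third policies are mis-ordered, with a real-world margin of 0.3, so each contributes 0.3 and the other two contribute 0, giving an MMRV of 0.15. A benchmark that swapped two policies with nearly identical real performance would be penalized far less, which is why the metric reflects the practical cost of picking the wrong policy.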
Across the evaluated policies and tasks, SIMPLER showed strong agreement with reality. For the Google robot under the Visual Matching setup, the paper reports a Pearson correlation of roughly 0.92 and an MMRV of about 0.06, indicating that the simulated rankings closely matched the real ones [1]. Correlations for the BridgeData V2 (WidowX) tasks were also strong, though they varied more across individual tasks [1][2]. These results support using SIMPLER as a low-cost screening tool: a policy that performs well in SIMPLER is likely, though not guaranteed, to perform well on the matching real robot.
Because it removes the need for physical hardware while remaining predictive, SIMPLER was quickly adopted as a default evaluation for vision-language-action models. The benchmark is commonly cited as "SimplerEnv," after its software repository and project page [2].
OpenVLA, an open-source 7-billion-parameter VLA released in 2024, used SIMPLER among its evaluations and reported outperforming the much larger 55-billion-parameter RT-2-X across the WidowX and Google robot tasks, illustrating how the benchmark lets smaller open models be compared against proprietary baselines without real-robot access [3]. Other policies, including Octo, the pi-0 (also written pi-zero) flow-based policy from Physical Intelligence, SpatialVLA, and CogACT, have likewise reported SIMPLER numbers, typically broken out across the individual Google robot and WidowX tasks and aggregated under the Visual Matching and Variant Aggregation settings [3][4][5]. In these comparisons SIMPLER serves as the sim-to-real-aligned counterpart to in-simulation-only suites, giving readers a sense of how a policy might transfer to the physical platforms it was modeled on.
SIMPLER is a proxy for real evaluation, not a replacement. Reported success rates depend on the specific checkpoint, action space, and inference settings used, and some policies have been observed to score differently in SIMPLER than expected, so real-robot confirmation remains important for high-stakes claims [3][7].
SIMPLER occupies a specific niche in the landscape of manipulation benchmarks. It is most often paired with LIBERO, a simulated benchmark built on the robosuite and MuJoCo stack that offers task suites for testing spatial understanding, object generalization, goal-conditioned behavior, and long-horizon tasks. The two are complementary: LIBERO probes a policy's breadth across many procedurally varied simulated tasks, while SIMPLER probes how well simulated success predicts real success on a small number of carefully matched real-robot tasks. Many VLA papers report both, using LIBERO for capability coverage and SIMPLER for sim-to-real alignment [4][5].
SIMPLER is also tightly connected to the data and policies of the Open X-Embodiment collaboration. The Google robot tasks correspond to the Fractal/RT-1 data, and the WidowX tasks correspond to BridgeData V2, both of which are major components of the Open X-Embodiment dataset used to train cross-embodiment policies such as RT-1-X and RT-2-X [1][2]. By mirroring those exact platforms, SIMPLER provides a natural simulated testbed for policies trained on that data. In the broader picture, SIMPLER sits alongside real-robot evaluation as the cheap front end of a two-stage process: rapid, reproducible screening in simulation, followed by confirmation on the physical robot when it counts [1][2].