LIBERO

Updated Jun 8, 2026

Overview

LIBERO ("LIfelong learning BEnchmark on RObot manipulation tasks") is an AI benchmark for studying knowledge transfer in lifelong robot learning. It was introduced in the 2023 paper "LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning" by Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone, published in the Datasets and Benchmarks Track of the Conference on Neural Information Processing Systems (NeurIPS) 2023. LIBERO provides 130 language-conditioned manipulation tasks in a simulated tabletop environment, organized into task suites, most of which isolate a single kind of distribution shift ^[1]^[2]^[3].

LIBERO was originally designed to study lifelong (continual) learning for decision-making agents, but it has since become a de facto standard evaluation for vision-language-action models (VLA models), with policies such as OpenVLA, Octo, and pi-0 from Physical Intelligence reporting per-suite success rates on it ^[4]^[5].

The benchmark is built on top of the robosuite simulation framework and the MuJoCo physics engine, using a simulated Franka Emika Panda 7 degree-of-freedom arm. It ships with high-quality, human-teleoperated demonstrations for every task, making it usable for imitation learning, multitask learning, and pretraining in addition to continual learning ^[2]^[6].

Motivation: lifelong robot learning

The central premise of LIBERO is that lifelong learning in decision-making (LLDM) differs in an important way from lifelong learning in static domains such as image classification or text. Whereas continual learning in vision and language is dominated by the transfer of declarative knowledge (facts about entities and concepts), an embodied agent that learns a stream of manipulation tasks must also transfer procedural knowledge: the actions, skills, and behaviors needed to accomplish goals. LIBERO is constructed to let researchers disentangle and study both kinds of transfer, as well as their combination ^[1]^[2].

The authors frame the benchmark around five research questions in lifelong decision-making: how to efficiently transfer declarative knowledge, procedural knowledge, or a mixture of the two; how to design effective policy architectures; how to design effective lifelong-learning algorithms; how robust a lifelong learner is to the order in which tasks are presented; and what effect model pretraining has on subsequent continual learning ^[2]^[3]. Two empirical findings in the paper stand out: a simple sequential fine-tuning baseline can outperform several established lifelong-learning algorithms on forward transfer, and naive supervised pretraining can hurt downstream performance in the lifelong setting. These results motivated treating LIBERO as an open research testbed rather than a solved benchmark ^[1]^[7].

To make the task space open-ended, LIBERO includes a procedural generation pipeline that can, in principle, generate an unbounded number of manipulation tasks. The pipeline extracts behavioral templates from human-activity data, instantiates them into natural-language instructions paired with formal goal predicates, samples scene layouts, and encodes each task specification in a behavior-description format that robosuite can render and simulate automatically ^[2]^[6].
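The pipeline stages described above (template, instruction plus goal predicate, sampled layout) can be illustrated with a toy generator. This is a minimal sketch and not LIBERO's actual generator or file format: the `TaskSpec` class, the `TEMPLATES` list, the `generate_task` function, the object names, and the coordinate ranges are all hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """Hypothetical container for one generated task (not LIBERO's format)."""
    instruction: str  # natural-language instruction shown to the policy
    goal: tuple       # formal goal predicate, e.g. ("On", "bowl", "plate")
    layout: dict      # object name -> sampled (x, y) position on the table


# Hypothetical behavioral templates: (instruction pattern, predicate name).
TEMPLATES = [
    ("put the {obj} on the {target}", "On"),
    ("put the {obj} in the {target}", "In"),
]


def generate_task(rng, objects, targets):
    """Instantiate one template and sample a scene layout for it."""
    pattern, predicate = rng.choice(TEMPLATES)
    obj = rng.choice(objects)
    target = rng.choice(targets)
    # Illustrative workspace bounds in metres; real bounds are scene-specific.
    layout = {
        name: (round(rng.uniform(-0.3, 0.3), 2), round(rng.uniform(-0.3, 0.3), 2))
        for name in (obj, target)
    }
    instruction = pattern.format(obj=obj, target=target)
    return TaskSpec(instruction, (predicate, obj, target), layout)


rng = random.Random(0)  # fixed seed so the sampled task is reproducible
task = generate_task(rng, ["bowl", "mug"], ["plate", "drawer"])
print(task.instruction, task.goal, task.layout)
```

Because the instruction and the goal predicate are filled from the same template slots, the language a policy sees and the condition the simulator checks stay consistent by construction.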

Structure: the task suites

LIBERO's 130 tasks are grouped so that researchers can control which type of distribution shift a policy must handle. Three suites of 10 tasks each hold most factors fixed while varying exactly one, and a larger collection of 100 tasks captures entangled shifts ^[2]^[3].

The four task suites most commonly cited in the literature are:

Suite	Tasks	What varies (controlled shift)
LIBERO-Spatial	10	Same objects and task; spatial layout/arrangement changes
LIBERO-Object	10	Same task and layout; object types change
LIBERO-Goal	10	Same objects and layout; the goal/task changes
LIBERO-Long (LIBERO-10)	10	Long-horizon tasks mixing object, layout, and goal shifts

In the original paper's organization, the 130 tasks consist of LIBERO-Spatial, LIBERO-Object, and LIBERO-Goal (10 tasks each, with controlled distribution shifts requiring transfer of one specific kind of knowledge) plus LIBERO-100, a set of 100 tasks that require transfer of entangled knowledge. LIBERO-100 is in turn split into LIBERO-90, a 90-task suite intended for large-scale pretraining and multitask learning, and LIBERO-Long (also called LIBERO-10), 10 long-horizon, multi-step tasks held out for evaluation ^[2]^[3]. The four-suite framing used by most VLA papers (Spatial, Object, Goal, Long) therefore reports on 40 tasks drawn from this larger pool, while LIBERO-90 and LIBERO-100 are used when a method targets multitask or pretraining evaluation ^[4]^[5].

Each task ships with 50 high-quality human-teleoperated demonstrations. A demonstration records RGB camera images, robot proprioception, and continuous 7-dimensional actions, giving 6,500 demonstrations across all 130 tasks and 2,000 across the four core suites ^[2]^[6]. Tasks are specified with natural-language instructions (for example, picking up a particular object and placing it in a target location), making LIBERO well suited to language-conditioned and instruction-following policies.
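The task and demonstration counts in this section follow from simple arithmetic, which the short sketch below makes explicit. The dictionary and variable names are illustrative only; the suite sizes and the 50 demonstrations per task are the figures stated above.

```python
# Suite sizes as described in this section (LIBERO-100 = LIBERO-90 + LIBERO-Long).
SUITES = {
    "LIBERO-Spatial": 10,
    "LIBERO-Object": 10,
    "LIBERO-Goal": 10,
    "LIBERO-90": 90,
    "LIBERO-Long": 10,
}
DEMOS_PER_TASK = 50

# The four suites most VLA papers report on.
FOUR_SUITES = ["LIBERO-Spatial", "LIBERO-Object", "LIBERO-Goal", "LIBERO-Long"]

libero_100_tasks = SUITES["LIBERO-90"] + SUITES["LIBERO-Long"]
total_tasks = sum(SUITES.values())
four_suite_tasks = sum(SUITES[name] for name in FOUR_SUITES)

total_demos = total_tasks * DEMOS_PER_TASK
four_suite_demos = four_suite_tasks * DEMOS_PER_TASK

print(f"LIBERO-100 tasks: {libero_100_tasks}")
print(f"all tasks: {total_tasks}, demonstrations: {total_demos}")
print(f"four-suite tasks: {four_suite_tasks}, demonstrations: {four_suite_demos}")
```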

Original use

In its original framing, LIBERO is a testbed for lifelong-learning research in robotics. The paper evaluates combinations of three lifelong-learning algorithms, three neural network policy architectures (the recurrent ResNet-RNN and the transformer-based ResNet-T and ViT-T), and several knowledge-transfer strategies, measuring how well agents acquire new manipulation skills over a sequence of tasks without forgetting earlier ones ^[2]^[7]. Because the suites isolate spatial, object, goal, and long-horizon shifts, researchers can attribute forgetting or transfer to a specific source of variation rather than to an uncontrolled mixture.

The primary metric is per-suite success rate: the fraction of evaluation rollouts in which the robot achieves the task's goal predicate. A rollout is typically counted as successful when the goal condition is satisfied and held for a short window of simulation steps. Lifelong-learning studies additionally report aggregate measures such as forward transfer, backward transfer (forgetting), and area-under-the-curve over the task sequence, computed from these success rates ^[2]^[7].
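The metric described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the benchmark's evaluation code: the `hold_steps=10` window is an arbitrary placeholder, and `backward_transfer` uses a generic continual-learning definition (final success minus success right after learning each task), which may differ from the exact formulas in the LIBERO paper.

```python
def rollout_succeeded(goal_satisfied, hold_steps=10):
    """Return True if the goal predicate held for `hold_steps` consecutive steps.

    goal_satisfied: sequence of booleans, one per simulation step.
    hold_steps: assumed window length (placeholder value, not from the paper).
    """
    streak = 0
    for ok in goal_satisfied:
        streak = streak + 1 if ok else 0
        if streak >= hold_steps:
            return True
    return False


def success_rate(rollouts, hold_steps=10):
    """Fraction of rollouts counted as successful."""
    outcomes = [rollout_succeeded(r, hold_steps) for r in rollouts]
    return sum(outcomes) / len(outcomes)


def backward_transfer(success):
    """Generic backward transfer over a task sequence.

    success[i][j] is the success rate on task j after training on task i.
    Returns the mean change on earlier tasks between the moment each was
    learned and the end of the sequence; negative values indicate forgetting.
    """
    n = len(success)
    deltas = [success[n - 1][j] - success[j][j] for j in range(n - 1)]
    return sum(deltas) / len(deltas)


# Three toy rollouts: only the first keeps the goal satisfied for 10 steps.
rollouts = [
    [False] * 5 + [True] * 10,
    [True] * 3 + [False] * 12,
    [True] * 9,
]
rate = success_rate(rollouts)

# Toy success matrix for a three-task sequence (rows: after learning task i).
matrix = [
    [0.9, 0.0, 0.0],
    [0.7, 0.8, 0.0],
    [0.6, 0.8, 0.9],
]
bwt = backward_transfer(matrix)
print(f"success rate: {rate:.3f}, backward transfer: {bwt:.3f}")
```

In the toy matrix, task 0 drops from 0.9 to 0.6 while task 1 is unchanged, so the mean change is negative, which is the signature of forgetting.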

Adoption as a VLA benchmark

Although LIBERO was conceived for continual learning, its clean, discrete suites and ready-made demonstrations made it a convenient yardstick for the wave of generalist robot policies that followed. Beginning around 2024, vision-language-action models adopted LIBERO as a standard simulated manipulation benchmark, usually reporting success rates on LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long after fine-tuning on the per-suite demonstrations ^[4]^[5].

OpenVLA, an open-source 7B-parameter VLA released in 2024, popularized this protocol. Kim et al. report OpenVLA success rates of 84.7 percent on LIBERO-Spatial, 88.4 percent on LIBERO-Object, 79.2 percent on LIBERO-Goal, and 53.7 percent on LIBERO-Long, for an average of 76.5 percent, with each number averaged over multiple seeds and 500 rollouts per suite ^[4]. The Octo generalist policy is a common baseline, reported at 78.9 percent (Spatial), 85.7 percent (Object), 84.6 percent (Goal), and 51.1 percent (Long) ^[5]^[8]. Physical Intelligence's pi-0 and its faster pi-0-FAST variant push scores higher: pi-0 is reported at roughly 96.8, 98.8, 95.8, and 85.2 percent across the four suites (average about 94.2 percent), while pi-0-FAST reports about 96.4, 96.8, 88.6, and 60.2 percent (average about 85.5 percent) ^[5]^[8]. Numerous later systems, including TraceVLA, SpatialVLA, and reinforcement-learning-tuned policies, report LIBERO results, with several approaching or exceeding 98 percent average on the four canonical suites ^[5].
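The headline "average" in these comparisons is an unweighted mean of the four per-suite success rates. The sketch below reproduces that calculation for the pi-0 and pi-0-FAST figures quoted above. The even split of 500 rollouts into 50 episodes for each of 10 tasks is an assumption made here for illustration, and the variable names are hypothetical.

```python
SUITE_ORDER = ("Spatial", "Object", "Goal", "Long")

# Per-suite success rates (percent) as quoted in this section.
reported = {
    "pi-0": (96.8, 98.8, 95.8, 85.2),
    "pi-0-FAST": (96.4, 96.8, 88.6, 60.2),
}


def four_suite_average(scores):
    """Unweighted mean over the four suites."""
    return sum(scores) / len(scores)


averages = {name: four_suite_average(scores) for name, scores in reported.items()}

# Assumed rollout budget: 10 tasks per suite, 50 episodes per task.
TASKS_PER_SUITE = 10
EPISODES_PER_TASK = 50
rollouts_per_suite = TASKS_PER_SUITE * EPISODES_PER_TASK

# With this budget, one rollout moves a suite score by this many percentage points.
resolution = 100 / rollouts_per_suite

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
print(f"rollouts per suite: {rollouts_per_suite}, resolution: {resolution} points")
```

Because the mean is unweighted, a weak LIBERO-Long score pulls the average down as much as any other suite, which is why pi-0-FAST's average sits well below pi-0's despite similar Spatial and Object results.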

A recurring caution in this literature is that very high LIBERO scores are obtained under test conditions that closely match the training demonstrations within each suite. Follow-up robustness studies, such as LIBERO-Plus and related perturbation analyses, show that policies scoring near 98 percent on the standard suites can degrade sharply, in some cases to low or even single-digit success rates, when systematic perturbations to lighting, camera, objects, or layout are introduced. Strong in-distribution LIBERO numbers therefore do not by themselves establish robust generalization ^[9].

Relationship to other robot benchmarks

LIBERO sits within a broader ecosystem of robot manipulation and VLA benchmarks, each emphasizing different axes of evaluation ^[10]:

CALVIN is a simulated benchmark for long-horizon, language-conditioned manipulation that emphasizes chaining multiple subtasks from free-form instructions. Like LIBERO it is simulation-based and language-driven, and the two are frequently reported together by VLA papers.
RoboCasa is a large-scale simulated benchmark of everyday kitchen tasks, also built on robosuite and MuJoCo, with thousands of procedurally generated scenes; it stresses scene and asset diversity more heavily than LIBERO.
SIMPLER (Simulated Manipulation Policy Evaluation for Real Robot Setups, also distributed as SimplerEnv) provides simulated environments designed to correlate with real-robot performance for policies such as RT-1 and RT-2, targeting the simulation-to-reality evaluation gap rather than continual learning.
Meta-World is an earlier suite of 50 simulated manipulation tasks built for multitask and meta-reinforcement learning; it predates the language-conditioned VLA wave and focuses on state-based reinforcement learning rather than instruction-following from pixels.

A 2026 audit of manipulation benchmarks examined LIBERO, CALVIN, SimplerEnv, RoboCasa, and RoboTwin 2.0. It found that LIBERO and CALVIN failed several diagnostic checks (for example, exhibiting exploitable shortcuts), whereas RoboCasa and RoboTwin 2.0 failed fewer despite appearing less often in headline progress claims. The result reinforces the view that LIBERO is best used alongside other benchmarks rather than as a sole measure of manipulation capability ^[10].

References

[1] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, Peter Stone. "LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning." arXiv:2306.03310, 2023. https://arxiv.org/abs/2306.03310
[2] LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. NeurIPS 2023 Datasets and Benchmarks Track proceedings (PDF). https://proceedings.neurips.cc/paper_files/paper/2023/file/8c3c666820ea055a77726d66fc7d447f-Paper-Datasets_and_Benchmarks.pdf
[3] UT Austin Robot Perception and Learning Lab, project page for "LIBERO: Benchmarking Knowledge Transfer in Lifelong Robot Learning." https://rpl.cs.utexas.edu/publications/2023/12/10/liu-neurips23-libero/
[4] Moo Jin Kim et al. "OpenVLA: An Open-Source Vision-Language-Action Model." 2024. Project and code: https://github.com/openvla/openvla
[5] EmergentMind, "LIBERO Benchmark: Vision-Language-Action in Robotics" (topic overview). https://www.emergentmind.com/topics/libero
[6] Lifelong-Robot-Learning, LIBERO code repository. https://github.com/Lifelong-Robot-Learning/LIBERO
[7] LIBERO, OpenReview record (reviews and discussion), NeurIPS 2023 Datasets and Benchmarks. https://openreview.net/forum?id=xzEtNSuDJk
[8] Physical Intelligence, pi-0 and openpi resources, LIBERO evaluation. https://github.com/Physical-Intelligence/openpi
[9] "LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models." arXiv:2510.13626, 2025. https://arxiv.org/abs/2510.13626
[10] "What Are We Actually Benchmarking in Robot Manipulation?" arXiv (2026). https://arxiv.org/html/2606.04233
