LIBERO
Last reviewed
Jun 8, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,567 words
LIBERO ("LIfelong learning BEnchmark on RObot manipulation tasks") is an AI benchmark for studying knowledge transfer in lifelong robot learning. Introduced in the 2023 paper "LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning" by Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone, and published at the Conference on Neural Information Processing Systems (NeurIPS) 2023 Datasets and Benchmarks Track, LIBERO provides 130 language-conditioned manipulation tasks in a simulated tabletop environment, organized into task suites that each isolate a different kind of distribution shift [1][2][3].
LIBERO was originally designed to study lifelong (continual) learning for decision-making agents, but it has since become a de facto standard evaluation for vision-language-action models (VLA models), with policies such as OpenVLA, Octo, and pi-0 from Physical Intelligence reporting per-suite success rates on it [4][5].
The benchmark is built on the robosuite simulation framework and the MuJoCo physics engine, and uses a simulated 7-degree-of-freedom Franka Emika Panda arm. It ships with high-quality, human-teleoperated demonstrations for every task, making it usable for imitation learning, multitask learning, and pretraining in addition to continual learning [2][6].
The central premise of LIBERO is that lifelong learning in decision-making (LLDM) differs in an important way from lifelong learning in static domains such as image classification or text. Whereas continual learning in vision and language is dominated by the transfer of declarative knowledge (facts about entities and concepts), an embodied agent that learns a stream of manipulation tasks must also transfer procedural knowledge: the actions, skills, and behaviors needed to accomplish goals. LIBERO is constructed to let researchers disentangle and study both kinds of transfer, as well as their combination [1][2].
The authors frame the benchmark around five research questions in lifelong decision-making: how to efficiently transfer declarative knowledge, procedural knowledge, or a mixture of the two; how to design effective policy architectures; how to design effective lifelong-learning algorithms; how robust a lifelong learner is to the order in which tasks are presented; and what effect model pretraining has on subsequent continual learning [2][3]. Two empirical findings reported in the paper stand out: a simple sequential fine-tuning baseline outperforms established lifelong-learning algorithms in forward transfer, and naive supervised pretraining can hurt downstream performance in the lifelong setting. These results motivated treating LIBERO as an open research testbed rather than a solved benchmark [1][7].
To make the task space open-ended, LIBERO includes a procedural generation pipeline that can, in principle, generate an unbounded number of manipulation tasks. The pipeline extracts behavioral templates from human-activity data, instantiates them into natural-language instructions paired with formal goal predicates, samples scene layouts, and encodes each task specification in a behavior-description format that robosuite can render and simulate automatically [2][6].
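The stages of such a pipeline can be illustrated with a toy sketch. Everything here is hypothetical and chosen for illustration: the two templates, the `TaskSpec` record, the `generate_task` and `render_spec` helpers, the coordinate range, and the text format are assumptions, not LIBERO's actual generator or file format.

```python
import random
from dataclasses import dataclass

# Hypothetical templates: (instruction pattern, goal predicate name).
# The real pipeline derives its templates from human-activity data.
TEMPLATES = [
    ("put the {obj} on the {target}", "On"),
    ("put the {obj} in the {target}", "In"),
]


@dataclass(frozen=True)
class TaskSpec:
    instruction: str   # natural-language command given to the policy
    goal: tuple        # formal goal predicate: (name, object, target)
    layout: dict       # object name -> sampled (x, y) tabletop position


def generate_task(objects, targets, rng):
    """Instantiate one template and sample a scene layout for it."""
    pattern, predicate = rng.choice(TEMPLATES)
    obj, target = rng.choice(objects), rng.choice(targets)
    layout = {
        name: (round(rng.uniform(-0.3, 0.3), 3), round(rng.uniform(-0.3, 0.3), 3))
        for name in (obj, target)
    }
    return TaskSpec(pattern.format(obj=obj, target=target),
                    (predicate, obj, target), layout)


def render_spec(task):
    """Serialize to a toy text format (not the benchmark's real file format)."""
    name, obj, target = task.goal
    lines = [f"(:language {task.instruction})", f"(:goal ({name} {obj} {target}))"]
    lines += [f"(:init ({k} at {x} {y}))" for k, (x, y) in sorted(task.layout.items())]
    return "\n".join(lines)


task = generate_task(["bowl", "mug"], ["plate", "tray"], random.Random(0))
print(render_spec(task))
```

Because each template, object, and layout is sampled independently, the number of distinct specifications grows multiplicatively with the size of each pool, which is the sense in which the task space is open-ended.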
LIBERO's 130 tasks are grouped so that researchers can control which type of distribution shift a policy must handle. Three suites of 10 tasks each hold most factors fixed while varying exactly one, and a larger collection of 100 tasks captures entangled shifts [2][3].
The four task suites most commonly cited in the literature are:
| Suite | Tasks | What varies (controlled shift) |
|---|---|---|
| LIBERO-Spatial | 10 | Same objects and task; spatial layout/arrangement changes |
| LIBERO-Object | 10 | Same task and layout; object types change |
| LIBERO-Goal | 10 | Same objects and layout; the goal/task changes |
| LIBERO-Long (LIBERO-10) | 10 | Long-horizon tasks mixing object, layout, and goal shifts |
In the original paper's organization, the 130 tasks consist of LIBERO-Spatial, LIBERO-Object, and LIBERO-Goal (10 tasks each, with controlled distribution shifts requiring transfer of one specific kind of knowledge) plus LIBERO-100, a set of 100 tasks that require transfer of entangled knowledge. LIBERO-100 is in turn split into LIBERO-90, a 90-task suite intended for large-scale pretraining and multitask learning, and LIBERO-Long (also called LIBERO-10), 10 long-horizon, multi-step tasks held out for evaluation [2][3]. The four-suite framing used by most VLA papers (Spatial, Object, Goal, Long) therefore reports on 40 tasks drawn from this larger pool, while LIBERO-90 and LIBERO-100 are used when a method targets multitask or pretraining evaluation [4][5].
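The bookkeeping behind these groupings is easy to check in code. The sketch below uses only the task counts stated above; the dictionary and helper names are hypothetical and are not identifiers from the LIBERO codebase.

```python
# Task counts per suite, as described in the text above.
SUITES = {
    "LIBERO-Spatial": 10,
    "LIBERO-Object": 10,
    "LIBERO-Goal": 10,
    "LIBERO-90": 90,
    "LIBERO-Long": 10,   # also called LIBERO-10
}

# LIBERO-100 is the union of the pretraining split and the long-horizon split.
LIBERO_100 = ("LIBERO-90", "LIBERO-Long")

# The four-suite protocol used by most VLA papers.
FOUR_SUITE_PROTOCOL = ("LIBERO-Spatial", "LIBERO-Object", "LIBERO-Goal", "LIBERO-Long")


def total_tasks(suite_names):
    """Sum the task counts of the named suites."""
    return sum(SUITES[name] for name in suite_names)


print(total_tasks(SUITES))               # all suites together
print(total_tasks(LIBERO_100))           # entangled-knowledge collection
print(total_tasks(FOUR_SUITE_PROTOCOL))  # tasks covered by typical VLA reports
```

The three totals come out to 130, 100, and 40 tasks respectively, matching the figures in the text.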
Each task ships with 50 high-quality human-teleoperated demonstrations. A demonstration records RGB camera images, robot proprioception, and continuous 7-dimensional actions, giving 6,500 demonstrations across all 130 tasks and 2,000 across the four core suites [2][6]. Tasks are specified with natural-language instructions (for example, picking up a particular object and placing it in a target location), making LIBERO well suited to language-conditioned and instruction-following policies.
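As a worked check on the dataset size, the demonstration totals follow directly from 50 demonstrations per task. The constant and function names below are illustrative only.

```python
DEMOS_PER_TASK = 50  # stated in the text above


def demo_count(num_tasks, demos_per_task=DEMOS_PER_TASK):
    """Total demonstrations for a collection of tasks."""
    return num_tasks * demos_per_task


all_tasks = demo_count(130)        # every LIBERO task
four_core_suites = demo_count(40)  # Spatial + Object + Goal + Long
per_suite = demo_count(10)         # any single 10-task suite

print(all_tasks, four_core_suites, per_suite)
```

This gives 6,500 demonstrations for the full benchmark, 2,000 for the four core suites, and 500 for a single 10-task suite.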
In its original framing, LIBERO is a testbed for lifelong-learning research in robotics. The paper evaluates combinations of three lifelong-learning algorithms, three neural network policy architectures (one recurrent and two transformer-based), and several knowledge-transfer strategies, measuring how well agents acquire new manipulation skills over a sequence of tasks without forgetting earlier ones [2][7]. Because the Spatial, Object, and Goal suites each isolate one source of variation, while LIBERO-Long deliberately mixes them, researchers can attribute forgetting or transfer to a specific source of variation rather than to an uncontrolled mixture.
The primary metric is per-suite success rate: the fraction of evaluation rollouts in which the robot achieves the task's goal predicate. A rollout is typically counted as successful when the goal condition is satisfied and held for a short window of simulation steps. Lifelong-learning studies additionally report aggregate measures such as forward transfer, backward transfer (forgetting), and area-under-the-curve over the task sequence, computed from these success rates [2][7].
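A minimal sketch of how these quantities relate is given below. It assumes a matrix `S` where `S[i][j]` is the success rate on task `j` after training on tasks `0..i`, and it uses simplified definitions of forward transfer, forgetting, and area under the curve; the paper's exact formulas differ in detail (for example, in how training checkpoints are averaged). All function names are hypothetical.

```python
def held_success(goal_flags, hold_steps):
    """True if the goal predicate holds for `hold_steps` consecutive steps."""
    run = 0
    for flag in goal_flags:
        run = run + 1 if flag else 0
        if run >= hold_steps:
            return True
    return False


def success_rate(outcomes):
    """Fraction of rollouts that were counted as successful."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one rollout")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def lifelong_metrics(S):
    """Simplified aggregate metrics from a lower-triangular success matrix."""
    K = len(S)
    # Forward transfer: how well each task is learned when it is first trained.
    fwt = sum(S[k][k] for k in range(K)) / K
    # Forgetting: mean drop from the just-learned score to later scores.
    drops = []
    for k in range(K - 1):
        later = [S[k][k] - S[t][k] for t in range(k + 1, K)]
        drops.append(sum(later) / len(later))
    forgetting = sum(drops) / len(drops) if drops else 0.0
    # Area under the curve: mean score on each task from learning onward.
    per_task = [sum(S[t][k] for t in range(k, K)) / (K - k) for k in range(K)]
    auc = sum(per_task) / K
    return {"fwt": fwt, "forgetting": forgetting, "auc": auc}


# Toy 3-task sequence; entries above the diagonal are unused.
S = [
    [0.8, 0.0, 0.0],
    [0.6, 0.9, 0.0],
    [0.4, 0.7, 1.0],
]
print(lifelong_metrics(S))
```

In this toy sequence the learner acquires each new task well but loses ground on earlier ones, which shows up as a positive forgetting value alongside a high forward-transfer value.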
Although LIBERO was conceived for continual learning, its clean, discrete suites and ready-made demonstrations made it a convenient yardstick for the wave of generalist robot policies that followed. Beginning around 2024, vision-language-action models adopted LIBERO as a standard simulated manipulation benchmark, usually reporting success rates on LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, and LIBERO-Long after fine-tuning on the per-suite demonstrations [4][5].
OpenVLA, an open-source 7B-parameter VLA released in 2024, popularized this protocol. Kim et al. report OpenVLA success rates of 84.7 percent on LIBERO-Spatial, 88.4 percent on LIBERO-Object, 79.2 percent on LIBERO-Goal, and 53.7 percent on LIBERO-Long, for an average of 76.5 percent, with each number averaged over multiple seeds and 500 rollouts per suite [4]. The Octo generalist policy is a common baseline, reported around 78.9 percent (Spatial), 85.7 percent (Object), 84.6 percent (Goal), and 51.1 percent (Long) [5][8]. Physical Intelligence's pi-0 and its faster pi-0-FAST variant push scores higher; pi-0 is reported at roughly 96.8, 98.8, 95.8, and 85.2 percent across the four suites (average about 94.2 percent), while pi-0-FAST reports about 96.4, 96.8, 88.6, and 60.2 percent [5][8]. Numerous later systems, including TraceVLA, SpatialVLA, and reinforcement-learning-tuned policies, report LIBERO results, with several approaching or exceeding 98 percent average on the four canonical suites [5].
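The headline "average" in these comparisons is an unweighted mean over the four suites. As a quick arithmetic check on the two pi-0 variants, using the per-suite figures quoted above (the dictionary layout and function name are illustrative only):

```python
SUITE_ORDER = ("Spatial", "Object", "Goal", "Long")

# Per-suite success rates in percent, as quoted in the text above.
REPORTED = {
    "pi-0": (96.8, 98.8, 95.8, 85.2),
    "pi-0-FAST": (96.4, 96.8, 88.6, 60.2),
}


def four_suite_average(scores):
    """Unweighted mean over the four canonical suites."""
    if len(scores) != len(SUITE_ORDER):
        raise ValueError("expected one score per suite")
    return sum(scores) / len(scores)


for model, scores in REPORTED.items():
    print(f"{model}: {four_suite_average(scores):.2f}")
```

The means are 94.15 for pi-0 and 85.50 for pi-0-FAST. Because every suite is weighted equally, a weak LIBERO-Long score pulls the average down sharply, as the gap between the two variants shows.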
A recurring caution in this literature is that very high LIBERO scores are measured under evaluation conditions that closely match the training demonstrations within each suite. Follow-up robustness studies, such as LIBERO-Plus and related perturbation analyses, show that policies scoring near 98 percent on the standard suites can degrade sharply, in some cases to single-digit success rates, when systematic perturbations to lighting, camera, objects, or layout are introduced. Strong in-distribution LIBERO numbers therefore do not by themselves establish robust generalization [9].
LIBERO sits within a broader ecosystem of robot manipulation and VLA benchmarks, each emphasizing different axes of evaluation [10].
A 2026 audit of manipulation benchmarks that examined LIBERO, CALVIN, SimplerEnv, RoboCasa, and RoboTwin 2.0 found that LIBERO and CALVIN failed several diagnostic checks (for example, exhibiting exploitable shortcuts), whereas RoboCasa and RoboTwin 2.0 failed fewer despite appearing less often in headline progress claims, reinforcing the view that LIBERO is best used alongside other benchmarks rather than as a sole measure of manipulation capability [10].