π*0.6 (pi-star-0.6)
Last reviewed
Jun 7, 2026
Sources
9 citations
Review status
Source-backed
Revision
v3 · 1,536 words
π*0.6 (written "Pi-star-0.6") is a vision-language-action robot foundation model developed by Physical Intelligence, a San Francisco robotics startup. Announced on November 17, 2025, it is the company's first model trained to improve from its own on-robot experience, not just from human demonstrations. The accompanying research describes a recipe called RECAP, short for Reinforcement learning with Experience and Corrections via Advantage-conditioned Policies, that combines demonstration data, expert corrections, and autonomous trial-and-error into a single training pipeline. Physical Intelligence reports that the resulting policies run for long stretches without help and handle some of the hardest manipulation tasks the company has attempted, including making espresso on a commercial machine, folding laundry, and assembling cardboard boxes. [1][2]
The model sits at the end of a short lineage of Physical Intelligence releases. It builds directly on π0 and π0.5, the company's earlier generalist policies, and represents the point where the lab moved past pure imitation learning toward reinforcement learning from real-world deployment. The star in the name marks that shift: π0.6 is the underlying supervised model, and π*0.6 is the version refined with experience through RECAP. [1][3]
Most recent robot learning systems are trained by behavioral cloning. A person teleoperates the robot to perform a task many times, and the policy learns to copy those demonstrations. This works well enough to produce impressive demos, but it has a known weakness. A policy trained only to imitate has never seen what happens after it makes a mistake, so once it drifts even slightly off the distribution of expert behavior, errors tend to compound. A small slip leads to an unfamiliar state, the unfamiliar state produces a worse action, and reliability falls apart over long horizons. Physical Intelligence frames π*0.6 as an answer to exactly this problem: a robot that gets better the more it practices, the way a person improves at a skill through repetition rather than by watching more examples. [1][2]
The company's stated goal is a single general robot foundation model that can be adapted to many physical tasks, in the same way a large language model is adapted to many text tasks. Demonstrations alone are expensive to collect and cap out at human-level execution. Learning from autonomous experience offers a path to keep improving after deployment, using data the robot generates itself. [1]
RECAP is the training method behind the model rather than a separate system. It brings together three kinds of data, each addressing a different gap in how the policy learns. [1][2]
| Data source | What it provides | Role in training |
|---|---|---|
| Demonstrations | Teleoperated examples of the task done well | Supervised base policy, the same foundation used for π0 and π0.5 |
| Expert corrections | Human interventions that take over when the robot starts to fail, then hand control back | Teaches recovery from mistakes and covers states demonstrations never reach |
| Autonomous experience | The robot's own attempts during long unattended runs, labeled by outcome | On-policy data that lets the policy self-improve through trial and error |
The technical core is what Physical Intelligence calls advantage conditioning. In reinforcement learning, the advantage of an action measures how much better or worse it is than the policy's average behavior in a given state. RECAP trains a value function to predict how well the task is going, uses those predictions to estimate the advantage of each recorded action, and then feeds that estimate to the policy as an extra input during training. In effect the model is told which of its past actions were good and which were bad; at inference time it is conditioned on the "good" signal, so it reproduces the better actions. The appeal of this design is that it keeps all of the training data in play, including failed attempts, instead of discarding everything except successful trajectories. The policy learns from the full range of outcomes rather than only from clean expert behavior. [1][2]
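The labeling step behind advantage conditioning can be illustrated with a minimal sketch. It is not Physical Intelligence's implementation: the `Step` record, the `label_steps` helper, the zero threshold, and the "positive"/"negative" tags are all assumptions made for illustration, and the advantage is approximated as the observed return minus the value estimate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One recorded timestep (hypothetical structure, for illustration only)."""
    observation: str      # stand-in for camera images plus the instruction
    action: List[float]   # stand-in for a chunk of motor commands
    ret: float            # return actually observed from this step onward
    value: float          # value-function estimate for this state


def advantage(step: Step) -> float:
    """Simple advantage estimate: observed return minus predicted value."""
    return step.ret - step.value


def label_steps(steps: List[Step], threshold: float = 0.0) -> List[dict]:
    """Tag every step with an advantage indicator instead of filtering.

    Failed attempts are kept in the training set; they are only marked
    "negative" so the policy can tell them apart from good actions.
    The threshold is an assumed hyperparameter.
    """
    labeled = []
    for s in steps:
        tag = "positive" if advantage(s) > threshold else "negative"
        labeled.append(
            {"observation": s.observation, "indicator": tag, "action": s.action}
        )
    return labeled


# Toy data: one step that went better than predicted, one that went worse.
steps = [
    Step("cup under spout", [0.1, 0.2], ret=1.0, value=0.6),
    Step("cup tipped over", [0.3, -0.1], ret=0.0, value=0.5),
]
batch = label_steps(steps)
for example in batch:
    print(example["indicator"], example["observation"])
```

In a setup like this, the policy would be trained to predict `action` from `observation` and `indicator` together, and at inference time the indicator would be fixed to the "positive" value.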
The pipeline runs in stages. The model is first pre-trained offline with reinforcement learning on a large mixed dataset, which gives a strong general starting point. It is then specialized to particular tasks by alternating between autonomous data collection on real robots and further RECAP training on the experience that collection produces. Each round of practice feeds the next round of learning. [1][2]
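The alternation between data collection and training can be sketched as a plain loop. The function name `iterate_rounds`, the `collect` and `train` callables, and the choice to accumulate all earlier data across rounds are assumptions for illustration; the source describes only the alternation itself.

```python
from typing import Callable, List, Tuple, TypeVar

Policy = TypeVar("Policy")
Episode = TypeVar("Episode")


def iterate_rounds(
    policy: Policy,
    dataset: List[Episode],
    collect: Callable[[Policy], List[Episode]],
    train: Callable[[Policy, List[Episode]], Policy],
    rounds: int,
) -> Tuple[Policy, List[Episode]]:
    """Alternate on-robot data collection with further training.

    Each round runs the current policy to gather new episodes, adds them
    to the dataset (an assumed accumulation strategy), and retrains.
    """
    for _ in range(rounds):
        new_episodes = collect(policy)       # autonomous runs and corrections
        dataset = dataset + new_episodes     # keep earlier data in play
        policy = train(policy, dataset)      # another pass of training
    return policy, dataset


# Stand-ins so the loop can run: the "policy" is just a version counter.
def fake_collect(version: int) -> List[str]:
    return [f"episode-from-v{version}"]


def fake_train(version: int, data: List[str]) -> int:
    return version + 1


final_policy, final_data = iterate_rounds(
    0, ["demonstration"], fake_collect, fake_train, rounds=3
)
print(final_policy)
print(final_data)
```

The stand-in functions carry no learning dynamics; they only show that each round's training sees the experience produced by the policy from the round before.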
The underlying architecture follows the family pattern Physical Intelligence established with π0. A roughly 5-billion-parameter vision-language model handles perception and instruction following, and a separate action expert produces the continuous motor commands that drive the robot. π0.6 is described as a refinement of π0.5 with a somewhat larger backbone, and π*0.6 is that model after RECAP-based refinement on experience and corrections. [1][3]
| Model | Year | What it is |
|---|---|---|
| π0 | 2024 | First generalist vision-language-action flow policy from Physical Intelligence |
| π0.5 | 2025 | Successor focused on open-world generalization to new homes and environments |
| π0.6 | 2025 | Supervised base model, a refinement of π0.5 with a larger backbone |
| π*0.6 | 2025 | π0.6 trained with RECAP to learn from experience and corrections |
Physical Intelligence reports sizable gains from adding experience and corrections on top of the supervised base. The figures below are the company's own measurements from its blog post and paper, not independent benchmarks. [1][2]
According to the company, throughput on some of the hardest tasks more than doubles, and failure rates fall by a factor of two or more, when policies are trained with RECAP rather than demonstrations alone. The headline demonstrations are about endurance as much as accuracy. The company says a π*0.6 policy can make espresso drinks on a commercial machine for an entire day, reaching over a 90 percent success rate on the task. In a laundry test the robot folded roughly 50 unfamiliar items of clothing over several hours without a person stepping in. In a packaging test it assembled and labeled 59 boxes in a row. These runs are meant to show that the policy can sustain useful work over long unattended periods, which is where imitation-only systems tend to break down. [1][2]
| Task | Reported result |
|---|---|
| Espresso (commercial machine) | Over 90 percent success; runs for a full day |
| Folding laundry | About 50 novel garments folded over several hours, unattended |
| Assembling and labeling boxes | 59 boxes completed in sequence |
| Hardest tasks overall | Throughput more than doubled and failure rate at least halved, compared with training on demonstrations alone |
The espresso and box tasks are notable because they involve precise, multi-step manipulation with real consequences for small errors, the kind of work where a policy needs to recover gracefully rather than simply repeat a memorized motion. [1][2]
π*0.6 is part of a broader move in robotics toward foundation models that learn from deployment rather than from fixed datasets alone. Several groups have built large vision-language-action models, but most are trained primarily by imitation. By folding autonomous experience and human corrections into the same recipe and showing day-long reliability on practical tasks, Physical Intelligence makes a concrete case that on-robot reinforcement learning can push past the ceiling of behavioral cloning. The advantage-conditioning idea is also a pragmatic way to use reinforcement learning with large pre-trained policies without throwing away the imitation data those policies depend on. [1][2]
The long-run demonstrations matter for the field's near-term ambitions. A robot that can fold laundry for an afternoon or pull espresso shots all day is closer to the kind of sustained, real-world usefulness that has been hard to reach, even if the tasks remain narrow and the settings controlled. [1][2]
The results come from Physical Intelligence's own evaluations rather than third-party testing, and the headline figures cover a small set of curated tasks in controlled settings. The method depends on a steady supply of human corrections during training, which keeps a person in the loop and limits how far the autonomy extends in practice. Collecting on-robot experience is slower and more costly than gathering data in simulation, and the company has not published broad comparisons against other robot foundation models on shared benchmarks. As with earlier models in the family, generalization to genuinely new objects, tasks, and environments outside the training distribution remains an open question. The work is best read as evidence that experience-driven training improves reliability on hard manipulation, not as a claim of general-purpose physical intelligence. [1][2]