π*0.6 (pi-star-0.6)

Updated Jun 7, 2026

π*0.6 (also written "pi-star-0.6") is a vision-language-action (VLA) robot foundation model developed by Physical Intelligence, a San Francisco robotics startup. Announced on November 17, 2025, it is the company's first model trained to improve from its own on-robot experience, not just from human demonstrations. The accompanying research describes a recipe called RECAP, short for Reinforcement learning with Experience and Corrections via Advantage-conditioned Policies, that combines demonstration data, expert corrections, and autonomous trial-and-error into a single training pipeline. Physical Intelligence reports that the resulting policies run for hours at a time without human intervention and handle some of the hardest manipulation tasks the company has attempted, including making espresso on a commercial machine, folding laundry, and assembling cardboard boxes. ^[1]^[2]

The model sits at the end of a short lineage of Physical Intelligence releases. It builds directly on π0 and π0.5, the company's earlier generalist policies, and represents the point where the lab moved past pure imitation learning toward reinforcement learning from real-world deployment. The star in the name marks that shift: π0.6 is the underlying supervised model, and π*0.6 is the version refined with experience through RECAP. ^[1]^[3]

Background and motivation

Most recent robot learning systems are trained by behavioral cloning. A person teleoperates the robot to perform a task many times, and the policy learns to copy those demonstrations. This works well enough to produce impressive demos, but it has a known weakness. A policy trained only to imitate has never seen what happens after it makes a mistake, so once it drifts even slightly off the distribution of expert behavior, errors tend to compound. A small slip leads to an unfamiliar state, the unfamiliar state produces a worse action, and reliability falls apart over long horizons. Physical Intelligence frames π*0.6 as an answer to exactly this problem: a robot that gets better the more it practices, the way a person improves at a skill through repetition rather than by watching more examples. ^[1]^[2]
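
The effect of small errors over long horizons can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that every step of a task succeeds independently with the same fixed probability. That assumption and the numbers used are hypothetical and do not come from Physical Intelligence; the paragraph above describes errors that feed on each other, which this simple model does not capture.

```python
def episode_success_probability(per_step_success: float, num_steps: int) -> float:
    """Chance of completing every step when steps succeed independently.

    Illustrative model only: real policies do not fail independently per step.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must lie in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


if __name__ == "__main__":
    # Hypothetical numbers: a policy that is right 99% of the time per step.
    for steps in (10, 100, 1000):
        overall = episode_success_probability(0.99, steps)
        print(f"{steps:>5} steps -> {overall:.3f} chance of a clean run")
```

Even under this simplified model, a per-step success rate that looks high leaves little chance of a clean run once the task is long enough, which is why recovery behavior matters for long-horizon work.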

The company's stated goal is a single general robot foundation model that can be adapted to many physical tasks, in the same way a large language model is adapted to many text tasks. Demonstrations alone are expensive to collect and cap out at human-level execution. Learning from autonomous experience offers a path to keep improving after deployment, using data the robot generates itself. ^[1]

What RECAP does

RECAP is the training method behind the model rather than a separate system. It brings together three kinds of data, each addressing a different gap in how the policy learns. ^[1]^[2]

Data source	What it provides	Role in training
Demonstrations	Teleoperated examples of the task done well	Supervised base policy, the same foundation used for π0 and π0.5
Expert corrections	Human interventions that take over when the robot starts to fail, then hand control back	Teaches recovery from mistakes and covers states demonstrations never reach
Autonomous experience	The robot's own attempts during long unattended runs, labeled by outcome	On-policy data that lets the policy self-improve through trial and error

The technical core is what Physical Intelligence calls advantage conditioning. In reinforcement learning, the advantage of an action measures how much better or worse it is than the policy's average behavior in a given state. RECAP trains a value function to estimate this signal, then conditions the policy on it during training. In effect, the model is told during training which of its past actions turned out better than expected and which turned out worse; at inference time it is conditioned on the "better" signal so that it reproduces the good ones. The appeal of this design is that it keeps all of the training data in play, including failed attempts, instead of discarding everything except successful trajectories. The policy learns from the full range of outcomes rather than only from clean expert behavior. ^[1]^[2]
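
The labeling step behind advantage conditioning can be sketched as follows. This is a minimal illustration, not the company's implementation. The names `LabeledStep`, `returns_to_go`, and `label_by_advantage` are hypothetical, and so are the use of a return-minus-value advantage estimate and a binary threshold. Observations and actions are stand-in strings.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LabeledStep:
    observation: str  # stand-in for camera images and robot state
    action: str       # stand-in for a motor command
    is_good: bool     # advantage label the policy would be conditioned on


def returns_to_go(rewards: Sequence[float], discount: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each step to the end of the episode."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        out[t] = running
    return out


def label_by_advantage(
    observations: Sequence[str],
    actions: Sequence[str],
    rewards: Sequence[float],
    values: Sequence[float],
    threshold: float = 0.0,
) -> List[LabeledStep]:
    """Attach a good/bad label to every step; no step is discarded."""
    if not (len(observations) == len(actions) == len(rewards) == len(values)):
        raise ValueError("all per-step sequences must have the same length")
    labeled = []
    for obs, act, ret, val in zip(observations, actions, returns_to_go(rewards), values):
        advantage = ret - val  # outcome relative to the value function's estimate
        labeled.append(LabeledStep(obs, act, advantage > threshold))
    return labeled
```

In a setup like this, both successful and failed episodes stay in the training set and differ only in their labels. A policy trained on such data would then be queried with the label fixed to "good" at inference time.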

The pipeline runs in stages. The model is first pre-trained with offline reinforcement learning on a large mixed dataset, which gives a strong general starting point. It is then specialized to particular tasks by alternating between autonomous data collection on real robots and further RECAP training on the experience that collection produces. Each round of practice feeds the next round of learning. ^[1]^[2]
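
The alternation between collection and training can be sketched as a simple loop. The function name `alternate_collect_and_train` and its `collect` and `train` callables are hypothetical placeholders for robot rollouts and a training run. Accumulating every round's data into one growing dataset is an assumption made here for illustration.

```python
from typing import Callable, List, Tuple, TypeVar

P = TypeVar("P")  # policy type (placeholder)
E = TypeVar("E")  # episode type (placeholder)


def alternate_collect_and_train(
    policy: P,
    dataset: List[E],
    collect: Callable[[P], List[E]],
    train: Callable[[P, List[E]], P],
    num_rounds: int,
) -> Tuple[P, List[E]]:
    """Alternate data collection with the current policy and further training."""
    if num_rounds < 0:
        raise ValueError("num_rounds must be non-negative")
    data = list(dataset)  # copy so the caller's list is not mutated
    for _ in range(num_rounds):
        data.extend(collect(policy))  # new experience from the current policy
        policy = train(policy, data)  # train on everything gathered so far
    return policy, data


if __name__ == "__main__":
    # Toy stand-ins: the "policy" is a round counter, an "episode" records
    # which policy version produced it.
    final_policy, all_data = alternate_collect_and_train(
        policy=0,
        dataset=[],
        collect=lambda p: [p],
        train=lambda p, d: p + 1,
        num_rounds=3,
    )
    print(final_policy, all_data)
```

The point of the structure is that each round's data comes from the most recent policy, so the training set tracks the states the robot actually visits.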

Model and lineage

The underlying architecture follows the family pattern Physical Intelligence established with π0. A roughly 5-billion-parameter vision-language model handles perception and instruction following, and a separate action expert produces the continuous motor commands that drive the robot. π0.6 is described as a refinement of π0.5 with a somewhat larger backbone, and π*0.6 is that model after RECAP-based refinement on experience and corrections. ^[1]^[3]

Model	Year	What it is
π0	2024	First generalist vision-language-action flow policy from Physical Intelligence
π0.5	2025	Successor focused on open-world generalization to new homes and environments
π0.6	2025	Supervised base model, a refinement of π0.5 with a larger backbone
π*0.6	2025	π0.6 trained with RECAP to learn from experience and corrections

Reported results

Physical Intelligence reports sizable gains from adding experience and corrections on top of the supervised base. The figures below are the company's own measurements from its blog post and paper, not independent benchmarks. ^[1]^[2]

According to the company, throughput on some of the hardest tasks more than doubles, and failure rates fall by a factor of two or more, when policies are trained with RECAP rather than demonstrations alone. The headline demonstrations are about endurance as much as accuracy. The company says a π*0.6 policy can make espresso drinks on a commercial machine for an entire day, reaching over a 90 percent success rate on the task. In a laundry test the robot folded roughly 50 unfamiliar items of clothing over several hours without a person stepping in. In a packaging test it assembled and labeled 59 boxes in a row. These runs are meant to show that the policy can sustain useful work over long unattended periods, which is where imitation-only systems tend to break down. ^[1]^[2]

Task	Reported result
Espresso (commercial machine)	Over 90 percent success; runs for a full day
Folding laundry	About 50 novel garments folded over several hours, unattended
Assembling and labeling boxes	59 boxes completed in sequence
Hardest tasks overall	Throughput more than doubled; failure rate cut by 2x or more vs. demonstrations alone
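
The two headline metrics can be computed from episode logs along the following lines. This sketch assumes that throughput means successful completions per hour of robot time, which this article does not define. The `Episode` record and the function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Episode:
    succeeded: bool
    duration_s: float  # wall-clock time spent on the attempt, in seconds


def throughput_and_failure_rate(episodes: Iterable[Episode]) -> Tuple[float, float]:
    """Return (successes per hour, fraction of episodes that failed)."""
    logged = list(episodes)
    if not logged:
        raise ValueError("need at least one episode")
    total_hours = sum(e.duration_s for e in logged) / 3600.0
    if total_hours <= 0:
        raise ValueError("total duration must be positive")
    successes = sum(1 for e in logged if e.succeeded)
    return successes / total_hours, 1.0 - successes / len(logged)
```

Under this definition the two numbers can move independently: a policy that succeeds just as often but finishes faster raises throughput without changing the failure rate, which is one reason to report both.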

The espresso and box tasks are notable because they involve precise, multi-step manipulation with real consequences for small errors, the kind of work where a policy needs to recover gracefully rather than simply repeat a memorized motion. ^[1]^[2]

Significance

π*0.6 is part of a broader move in robotics toward foundation models that learn from deployment rather than from fixed datasets alone. Several groups have built large vision-language-action models, but most are trained primarily by imitation. By folding autonomous experience and human corrections into the same recipe and showing day-long reliability on practical tasks, Physical Intelligence makes a concrete case that on-robot reinforcement learning can push past the ceiling of behavioral cloning. The advantage-conditioning idea is also a pragmatic way to use reinforcement learning with large pre-trained policies without throwing away the imitation data those policies depend on. ^[1]^[2]

The long-run demonstrations matter for the field's near-term ambitions. A robot that can fold laundry for an afternoon or pull espresso shots all day is closer to the kind of sustained, real-world usefulness that has been hard to reach, even if the tasks remain narrow and the settings controlled. ^[1]^[2]

Limitations

The results come from Physical Intelligence's own evaluations rather than third-party testing, and the headline figures cover a small set of curated tasks in controlled settings. The method depends on a steady supply of human corrections during training, which keeps a person in the loop and limits how far the autonomy extends in practice. Collecting on-robot experience is slower and more costly than gathering data in simulation, and the company has not published broad comparisons against other robot foundation models on shared benchmarks. As with earlier models in the family, generalization to genuinely new objects, tasks, and environments outside the training distribution remains an open question. The work is best read as evidence that experience-driven training improves reliability on hard manipulation, not as a claim of general-purpose physical intelligence. ^[1]^[2]

References

1. Physical Intelligence. "π*0.6: a VLA That Learns From Experience." Physical Intelligence blog, November 17, 2025. https://www.pi.website/blog/pistar06
2. Physical Intelligence. "Recipes for Pre-training Robot Foundation Models with RL: π*0.6." arXiv:2511.14759, November 2025. https://arxiv.org/abs/2511.14759
3. Physical Intelligence. "π0.5: a VLA with Open-World Generalization." Physical Intelligence blog, 2025. https://www.pi.website/blog/pi05
4. Physical Intelligence. "π0: Our First Generalist Policy." Physical Intelligence blog, 2024. https://www.pi.website/blog/pi0
5. Physical Intelligence. π*0.6 model card. 2025. https://www.physicalintelligence.company/
6. Physical Intelligence. Company homepage. https://www.pi.website/
7. Hugging Face. "Recipes for Pre-training Robot Foundation Models with RL (Paper page)." https://huggingface.co/papers/2511.14759
8. Black, K., et al. "π0: A Vision-Language-Action Flow Model for General Robot Control." arXiv:2410.24164, 2024. https://arxiv.org/abs/2410.24164
9. The Moonlight. "Review: Recipes for Pre-training Robot Foundation Models with RL." https://www.themoonlight.io/en/review/recipes-for-pre-training-robot-foundation-models-with-rl
