π₀ (pi-zero)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,317 words
π₀ (pronounced pi-zero and sometimes written pi0 or pizero) is a vision-language-action model (VLA) developed by the robotics foundation-model startup Physical Intelligence. Announced on 31 October 2024 in the technical report π₀: A Vision-Language-Action Flow Model for General Robot Control by Kevin Black, Noah Brown, Danny Driess and 21 co-authors, π₀ is the company's first generalist robot policy and one of the most widely cited flagship VLAs of the 2024-2025 wave of generalist manipulation systems.[1][2] The model couples a 3-billion-parameter PaliGemma vision-language backbone with a separate 300 M-parameter "action expert" that produces continuous, high-frequency motor commands using a flow matching objective borrowed from continuous-time diffusion modelling.[2] Together with the proprietary cross-embodiment dataset of roughly 10 000 hours of teleoperated data on which it is pretrained, π₀ established a template that was rapidly adopted across the field: a pretrained VLM contributing semantic grounding, an action head producing 50 Hz action chunks, and large heterogeneous robot data providing cross-embodiment transfer.[2][3]
The model attracted unusual attention for two reasons. First, the accompanying demonstrations, recorded on bimanual mobile manipulators in real apartments and offices, showed sustained, multi-minute dexterous behaviour such as folding laundry from a tangled pile, bussing dining tables, assembling cardboard boxes and bagging groceries, far exceeding the seconds-long horizons typical of prior VLAs such as RT-2, OpenVLA or Octo.[1] Second, on 4 February 2025 Physical Intelligence released the model weights, training code and fine-tuning recipes under an Apache 2.0 licence in the openpi repository, making π₀ the first frontier-scale generalist robot policy with openly downloadable weights.[4][5] Subsequent variants, π₀-FAST (January 2025), introducing the frequency-space action tokenizer of the same name, and π₀.₅ (April 2025), targeting open-world generalisation, are released through the same repository.[6][7][8]
In framing the model, Physical Intelligence positioned it as a "first step" rather than a finished product, drawing an explicit analogy with the trajectory of early generative pretrained text models: pretrain on a broad cross-embodiment data soup, fine-tune on a task-specific dataset of a few hours, deploy on a robot. That recipe, even where individual components were not new (action chunking from ACT, flow matching from continuous diffusion, large robot data pooling from Open X-Embodiment), turned out to be unusually effective in combination, and its publication catalysed a wave of open-source VLA follow-ons throughout 2025.
Physical Intelligence (often stylised "π" or "PI") is a San Francisco artificial-intelligence company founded in March 2024 to build foundation models for general-purpose robot control.[9] Its co-founding team is dominated by veterans of academic robot-learning labs and of Google's robotics group: Karol Hausman (CEO; formerly Staff Research Scientist and Robot Manipulation Lead at Google Brain), Sergey Levine (Chief Scientist; associate professor at UC Berkeley), Chelsea Finn (Research Lead; assistant professor at Stanford), Brian Ichter (formerly Research Scientist at Google DeepMind), Quan Vuong, Adnan Esmail (engineering; previously Anduril and Tesla) and Lachy Groom (operations; previously product lead at Stripe).[9][10] Several of the team's senior members were authors of earlier influential robot-learning papers such as RT-1, RT-2, PaLM-E and Octo, providing direct continuity from Google-era VLA research into the company's product line.
Physical Intelligence emerged from stealth in March 2024 with a $70 million seed round and subsequently raised a $400 million Series A in November 2024 led by Jeff Bezos, OpenAI, Thrive Capital and Lux Capital at a roughly $2.4 billion valuation, then a $600 million Series B in November 2025 led by CapitalG at a $5.6 billion valuation, with NVIDIA participating through NVentures.[9][11] The company has positioned itself as a software-only "robot brain" provider rather than a robot manufacturer, training its models across third-party hardware including ALOHA/ALOHA 2 bimanual rigs, Trossen and AgileX Arx arms, UR5e and Franka single-arm platforms, and Fibocom mobile bases. π₀ is the flagship public artefact of that strategy and was followed in 2025 by π₀.₅ and the experimental π∗₀.₆ ("pi-star-zero-point-six") policy, which adds reinforcement-learning fine-tuning via the RECAP (Reinforcement Learning with Experience and Corrections via Advantage-conditioned Policies) algorithm.[12]
The choice to centre the company around a single foundation model rather than a portfolio of task-specific policies was deliberate, and reflects an explicit bet by the founding team that the same scaling hypothesis underlying large language models will eventually apply to robot manipulation: that performance and generality will both improve with more parameters, more diverse demonstrations and more compute, given the right architecture. The π₀ technical report is, in effect, the first concrete formulation of that bet from Physical Intelligence, and many of its design choices, particularly the decision to keep the action expert physically separate from the VLM and to use action chunking rather than per-step prediction, are framed as engineering compromises in service of that scaling argument.[1][2]
π₀ follows the now-standard recipe of pairing an internet-scale pretrained vision-language model with a robot-specific action head, but it differs from earlier VLAs in two key ways: (i) the action head is a separate transformer expert that runs alongside the VLM rather than sharing all of its weights, and (ii) it generates actions through flow matching rather than autoregressive discrete tokens.[2]
The VLM backbone is Google's PaliGemma, a 3-billion-parameter vision-language transformer combining a 400 M-parameter SigLIP image encoder with the 2.6 B-parameter Gemma decoder-only language model.[2][13] PaliGemma was selected because it is one of the smallest contemporary VLMs with strong open-image-and-text capabilities, which keeps inference latency on a single GPU compatible with real-time control. The backbone is initialised from the publicly released PaliGemma weights, then trained jointly with the action expert during robot pretraining.[2]
In parallel with the language tokens, π₀ injects a stream of action tokens and state tokens into the same transformer stack. These tokens are processed not by the original PaliGemma weights but by a separate set of 300 million additional parameters that are randomly initialised and trained from scratch. The Physical Intelligence team calls this set the "action expert"; architecturally it is a smaller transformer that shares the attention with the VLM tokens but maintains its own MLP and projection weights, so total model size is 3.3 billion parameters.[2]
Each timestep, the model receives one or more RGB images (typically three, from base and wrist cameras), a natural-language instruction, the current proprioceptive robot state, and a vector of Gaussian noise. The VLM tokens fully attend to one another with bidirectional attention, while state and action-noise tokens attend through a block-causal mask. This design preserves the bidirectional attention pattern of the original VLM (preventing catastrophic forgetting of internet pretraining) while letting actions and states form their own temporally structured sequence.[2][14]
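The attention pattern described above can be illustrated with a small mask builder. This is a minimal sketch, not the openpi implementation: the function name `block_causal_mask` and the toy block sizes are assumptions made for illustration, and the grouping into three blocks (VLM prefix, state, action-noise tokens) is one reading of the description above.

```python
def block_causal_mask(block_sizes):
    """Return mask[i][j] = True if token i may attend to token j.

    Tokens attend bidirectionally inside their own block and to every
    earlier block, but never to a later block.
    """
    # Block index of each token position, e.g. [0, 0, 0, 0, 1, 2, 2, 2].
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]


# Hypothetical toy sizes: 4 image/language tokens, 1 state token, 3 action tokens.
mask = block_causal_mask([4, 1, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, the first four rows (the VLM prefix) never see the state or action columns, which is the property that keeps the prefix computation identical to that of the original VLM.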
Rather than discretising actions into bins and emitting them autoregressively (the approach used by RT-2 and OpenVLA), π₀ predicts a vector field that transports random Gaussian noise to a chunk of future actions, in the spirit of flow matching and rectified flow.[2] At inference, the model starts from random noise of shape H × A (where H is the chunk length and A is the action dimension), runs roughly 10 integration steps using the predicted vector field, and outputs a smooth H-step action chunk. Because all H actions are produced in a single forward pass per integration step, the per-chunk latency is dominated by the small number of denoising steps, allowing chunk-rate inference roughly every 0.5-0.8 seconds and per-action control frequencies of up to 50 Hz when the chunk is consumed by an underlying controller.[14]
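The sampling procedure can be sketched as a fixed-step Euler integration. This is an illustrative sketch under stated assumptions: `sample_action_chunk` and `toy_vector_field` are hypothetical names, the time convention (0 = noise, 1 = actions) is assumed rather than taken from the report, the action dimension of 7 is arbitrary, and the toy field merely stands in for the trained network so that the loop can be run end to end.

```python
import random


def sample_action_chunk(vector_field, horizon, action_dim, num_steps=10, seed=0):
    """Integrate a vector field from Gaussian noise to an H x A action chunk."""
    rng = random.Random(seed)
    # Start from noise of shape H x A.
    chunk = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * dt  # assumed convention: 0 = pure noise, 1 = clean actions
        velocity = vector_field(chunk, tau)  # one "forward pass" for all H actions
        chunk = [[a + dt * v for a, v in zip(actions, vels)]
                 for actions, vels in zip(chunk, velocity)]
    return chunk


H, A = 50, 7  # H from the text; A = 7 is a hypothetical action dimension
target = [[0.01 * h] * A for h in range(H)]  # made-up smooth trajectory


def toy_vector_field(chunk, tau):
    """Stand-in for the trained model: straight-line flow towards `target`."""
    return [[(t - a) / (1.0 - tau) for a, t in zip(actions, goal)]
            for actions, goal in zip(chunk, target)]


chunk = sample_action_chunk(toy_vector_field, H, A)
```

With a straight-line field the ten Euler steps land on the target trajectory up to floating-point error; a learned field only approximates this, which is why the number of integration steps is a tunable trade-off between latency and accuracy.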
π₀ predicts an action chunk of length H = 50 future timesteps at each invocation, an approach borrowed from earlier action-chunking work such as ACT and Diffusion Policy that smooths transitions and reduces compounding error.[2] To unify heterogeneous robot platforms with different numbers of joints, gripper modalities and command modes, π₀ fixes the width of all state and action vectors at the dimension of the largest robot in the training mix (18 dimensions in the published configuration) and zero-pads the vectors of narrower platforms; the language prompt and visual inputs disambiguate which embodiment is currently in use.[2]
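The padding scheme amounts to a one-line operation. The following is a minimal sketch; the helper name `pad_to_max_dim` and the 7-dimensional example vector are assumptions for illustration, while the width of 18 comes from the text above.

```python
MAX_DIM = 18  # width of the largest robot in the published configuration


def pad_to_max_dim(vector, max_dim=MAX_DIM):
    """Zero-pad a state or action vector to the shared width."""
    if len(vector) > max_dim:
        raise ValueError(f"vector has {len(vector)} dims, more than {max_dim}")
    return list(vector) + [0.0] * (max_dim - len(vector))


single_arm_action = [0.1] * 7  # hypothetical 7-D single-arm command
padded = pad_to_max_dim(single_arm_action)
print(len(padded))  # 18
```

Because every platform then shares one input and output shape, a single set of projection weights can serve all embodiments.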
A subtle but important consequence of chunking with flow matching is that the policy effectively reasons over horizons of a second or more (50 steps at control rates of 20-50 Hz) in a single forward pass. Whereas autoregressive token-by-token decoding has to repeatedly recommit to a previous action prefix and is correspondingly sensitive to early mistakes, π₀ can re-sample the entire 50-step trajectory whenever the latest observation suggests a strategy change. The team reports that this property is critical for fine, dynamic behaviours such as catching a falling utensil or shaking out a tangled garment, which would be brittle if decoded one action at a time. It also means that two rates must be distinguished: new observations enter the policy only when a chunk is regenerated, approximately every 0.5-0.8 seconds (roughly 1-2 Hz), whereas the low-level controller consumes the actions in each chunk at up to 50 Hz, so the commanded motion on dexterous tasks is updated at a rate closer to 50 Hz than to 1-2 Hz.[14]
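The relationship between chunk length, control rate and re-planning period is simple arithmetic and can be checked directly. In this sketch the function name `chunk_schedule` is hypothetical, and pairing the 0.5 s period with 50 Hz robots and the 0.8 s period with 20 Hz robots is an assumption rather than something stated above.

```python
def chunk_schedule(chunk_len, control_hz, replan_period_s):
    """Return (seconds covered by a chunk, actions executed, actions discarded)."""
    chunk_duration_s = chunk_len / control_hz
    executed = round(replan_period_s * control_hz)
    return chunk_duration_s, executed, chunk_len - executed


# H = 50 throughout; the rate and period pairings are assumed.
fast = chunk_schedule(50, control_hz=50, replan_period_s=0.5)
slow = chunk_schedule(50, control_hz=20, replan_period_s=0.8)
print(fast)  # (1.0, 25, 25)
print(slow)  # (2.5, 16, 34)
```

Under these assumptions each chunk covers more time than is executed before the next one arrives, so the controller always has fresh actions available while the next inference call is running.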
| Component | Parameters or value | Description |
|---|---|---|
| SigLIP visual encoder (within PaliGemma) | ~400 M | Frozen-then-fine-tuned image encoder, sigmoid-loss CLIP variant |
| Gemma decoder (within PaliGemma) | ~2.6 B | Decoder-only language model providing text-side processing |
| PaliGemma VLM backbone (total) | ~3.0 B | Pretrained on web image-text data, fine-tuned on robot data |
| Action expert (separate MLP / projection weights) | ~300 M | Randomly initialised, processes state/action/noise tokens |
| Total π₀ | ~3.3 B | Real-time inference at ~50 Hz on consumer GPUs |
| Action chunk length H | 50 timesteps | Generated per forward integration |
| Control frequency | up to 50 Hz | 20 Hz on slower UR5e/Franka setups |
| Inference time (3 cameras, RTX 4090) | ~73 ms / chunk | ~10 flow-matching integration steps |
There is also a smaller π₀-small variant of approximately 470 M parameters that omits PaliGemma initialisation; it is used in the paper for ablations isolating the contribution of internet-scale pretraining.[2]
Pretraining π₀ to behave as a generalist required a robot dataset large and diverse enough to be reminiscent of internet-scale text corpora. Physical Intelligence assembled two such pools.
The bulk of the data is an in-house dataset collected by company teleoperators on seven robot configurations across approximately 68 tasks, including: a single-arm UR5e, a bimanual UR5e, a single-arm Franka, a bimanual Trossen, a bimanual AgileX Arx, a mobile bimanual Trossen and a mobile bimanual Fibocom platform.[2] The full corpus comprises roughly 903 million timesteps, equivalent to around 10 000 hours of teleoperated robot experience, making it by some margin the largest robot manipulation dataset assembled in 2024.[14] Tasks include cloth folding, table bussing, grocery bagging, box assembly, plug insertion, food packing, drawer manipulation, dish loading and a long tail of household manipulation behaviours.
The proprietary data is mixed with the public Open X-Embodiment (OXE) dataset, the 2023 community release aggregating over 1 million trajectories from 22 distinct robot embodiments across 21 research institutions, of which π₀ uses a curated subset.[2] OXE provides additional embodiment diversity (especially for single-arm robots not represented in the proprietary mix) and approximately 90 million additional timesteps; mixing weights are tuned to balance high- and low-quality demonstrations.[14]
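The dataset figures quoted in the two paragraphs above can be cross-checked with a short calculation. The results are implied values derived only from the numbers in this article, not figures reported by Physical Intelligence, and the share is computed on raw timesteps before any mixing weights are applied.

```python
IN_HOUSE_TIMESTEPS = 903_000_000  # proprietary corpus, from the text
IN_HOUSE_HOURS = 10_000           # approximate duration, from the text
OXE_TIMESTEPS = 90_000_000        # approximate OXE contribution, from the text

# Average number of logged timesteps per second of teleoperation.
mean_rate_hz = IN_HOUSE_TIMESTEPS / (IN_HOUSE_HOURS * 3600)

# Fraction of all raw timesteps that come from the public dataset.
oxe_share = OXE_TIMESTEPS / (IN_HOUSE_TIMESTEPS + OXE_TIMESTEPS)

print(f"implied mean logging rate: {mean_rate_hz:.1f} Hz")  # 25.1 Hz
print(f"OXE share of raw timesteps: {oxe_share:.1%}")       # 9.1%
```

An implied mean rate of about 25 Hz sits between the 20 Hz and 50 Hz control rates quoted for the individual platforms, so the timestep and hour counts are mutually consistent.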
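The model is trained in two stages: pretraining on the full cross-embodiment mixture described above, followed by post-training (fine-tuning) on a smaller, higher-quality dataset for the target task.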
This decoupling of broad pretraining and task-specific post-training mirrors the practice of language-model finetuning and is, in the company's framing, one of the key arguments for VLAs over per-task imitation policies.[1]
A practical complication of building such a heterogeneous dataset is that demonstrations vary widely in quality. Even within a single platform, some episodes are collected by an experienced operator under good lighting and clear instructions, while others contain noisy teleoperation, recovery from grasp failures, dropped objects or partial task completion. Rather than discarding lower-quality data outright, the π₀ team mixes it with carefully weighted batch sampling so that pretraining sees the full distribution but post-training is dominated by high-quality, in-domain examples. This loosely parallels the "curriculum" strategy used in modern language-model finetuning, where instruction-tuning data is filtered for quality even as the base pretraining corpus tolerates noise.[2][14]
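Weighted batch sampling of this kind can be sketched with the standard library. Everything specific here is hypothetical: the dataset names, episode identifiers and weights are invented for illustration and are not the mixture used for π₀, and setting the post-training weights of the noisier pools to zero is an extreme case of "dominated by high-quality data".

```python
import random


def sample_batch(datasets, weights, batch_size, rng):
    """Draw (dataset_name, episode) pairs, choosing datasets by weight."""
    names = list(datasets)
    picks = rng.choices(names, weights=[weights[name] for name in names], k=batch_size)
    return [(name, rng.choice(datasets[name])) for name in picks]


# Hypothetical data pools and mixing weights.
datasets = {
    "expert_in_domain": ["ep_a", "ep_b"],
    "noisy_teleop": ["ep_c", "ep_d", "ep_e"],
    "open_x_subset": ["ep_f"],
}
pretrain_weights = {"expert_in_domain": 1.0, "noisy_teleop": 1.0, "open_x_subset": 0.5}
posttrain_weights = {"expert_in_domain": 1.0, "noisy_teleop": 0.0, "open_x_subset": 0.0}

rng = random.Random(0)
pretrain_batch = sample_batch(datasets, pretrain_weights, 8, rng)
posttrain_batch = sample_batch(datasets, posttrain_weights, 8, rng)
```

Changing only the weight table between stages leaves the data pipeline untouched, which is one simple way to let pretraining see the full distribution while post-training concentrates on curated episodes.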
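The accompanying blog post and videos demonstrate π₀ executing several long-horizon, dexterous behaviours that prior VLAs could not sustain, including folding laundry from a tangled pile, bussing dining tables, assembling cardboard boxes and bagging groceries.[1]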
Several of these behaviours run for more than 100 seconds end-to-end and in some cases approach 10 minutes for full clothing folding episodes, an order-of-magnitude jump in temporal horizon over what was previously reported for generalist policies.[1]
The π₀ paper presents two main evaluations: (i) zero-shot performance on tasks drawn from the pretraining distribution, and (ii) fine-tuned performance on novel downstream tasks. In both settings, π₀ outperforms strong contemporaneous baselines: the 7-billion-parameter OpenVLA, the 93-million-parameter Octo diffusion policy, the Diffusion Policy of Chi et al. (2023), and the per-task imitation baselines ACT and BC (behaviour cloning).[2]
| Model | Backbone / family | Params | Action representation | Notes |
|---|---|---|---|---|
| π₀ | PaliGemma + action expert | 3.3 B | Flow-matching action chunks, 50 Hz | Generalist VLA, multi-embodiment |
| π₀-small | Custom transformer | 470 M | Flow-matching | Ablation without VLM init |
| π₀-FAST | PaliGemma + autoregressive head | 3 B+ | FAST discrete tokens, autoregressive | Up to 5× faster training[6] |
| π₀.₅ | π₀ + knowledge insulation | 3 B+ | Hybrid (text plan + flow actions) | Open-world generalisation[7] |
| OpenVLA | Llama-2-7B + DINOv2/SigLIP | 7 B | Discretised action bins (autoregressive) | Open-weights generalist (Stanford/Berkeley, 2024) |
| RT-2 | PaLI-X / PaLM-E | 55 B | Discretised action tokens | Google DeepMind, 2023 |
| Octo | Custom transformer | 93 M | Diffusion head | Open-weights generalist (Berkeley, 2024) |
| Diffusion Policy | UNet / Transformer | ~100 M | Continuous diffusion | Single-task imitation baseline |
| ACT | Conditional VAE | ~80 M | Direct action regression | Single-task imitation baseline |
On the four "seen" tasks reported in the paper (shirt folding, table bussing, grocery bagging, removing toast from a toaster), π₀ achieves an average normalised success score of roughly 0.8 out of 1.0, against approximately 0.35 for the strongest baseline, Diffusion Policy. On the bowl-stacking subtask in particular, π₀ scores near 1.0 while OpenVLA and Octo each score below 0.1.[14][15] Across language-conditioned tasks π₀ also follows mid-level instructions (for example, "pick up the green block and place it in the brown bowl") substantially more reliably than the baselines.[1] On unseen but related downstream fine-tuning tasks, π₀ reaches 40-60 % success in zero-shot rollouts and 80-95 % after small amounts of task-specific fine-tuning, again above the baselines.[15]
A robustness caveat is that, in independent third-party evaluations (notably the "π₀ in the Wild" study by Penn-PAL Lab), the gap to baselines narrows on tasks involving distractor objects, novel camera positions and unusual lighting, indicating that part of π₀'s advantage stems from in-distribution coverage of its proprietary dataset rather than purely from architecture.[16] This motivated the follow-on π₀.₅ release.
On 4 February 2025 Physical Intelligence released code and weights for π₀ in the public openpi repository at github.com/Physical-Intelligence/openpi, under the Apache 2.0 licence (with an additional Gemma licence applying to the PaliGemma-derived weights).[4][5] The repository was an unusually open release for a frontier robot model and quickly became a community reference implementation, with downstream PyTorch ports (notably HuggingFace's integration into LeRobot) appearing within days.[17]
The initial release included:
Reported hardware requirements are modest by VLA standards: inference needs a little over 8 GB of GPU memory; LoRA fine-tuning requires roughly 22 GB (an RTX 4090 suffices); full fine-tuning requires 70+ GB and is typically performed on A100 or H100 GPUs.[5] In Physical Intelligence's own experiments, only 1-20 hours of task-specific data are needed to fine-tune the base model to a new manipulation task, which represents a step-change reduction relative to the hundreds of hours typically required to train a single-task imitation policy from scratch.[4]
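As a rough planning aid, the memory tiers above can be turned into a lookup. This is a hypothetical helper, not part of openpi: the thresholds are the approximate figures quoted in this article, and real requirements depend on batch size, precision and the number of camera inputs.

```python
# Approximate GPU memory needs in GB, taken from the figures quoted above.
REQUIREMENTS_GB = {"inference": 8, "lora_finetune": 22, "full_finetune": 70}


def supported_modes(gpu_memory_gb):
    """List the modes whose approximate memory need fits on the given GPU."""
    return [mode for mode, need in REQUIREMENTS_GB.items() if gpu_memory_gb >= need]


print(supported_modes(24))  # ['inference', 'lora_finetune']
print(supported_modes(80))  # ['inference', 'lora_finetune', 'full_finetune']
```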
Subsequent updates to the repository (through to 2025) added:
In January 2025 Physical Intelligence published the paper FAST: Efficient Action Tokenization for Vision-Language-Action Models (Pertsch, Stachowicz, Ichter, Driess et al., arXiv:2501.09747), introducing FAST (Frequency-space Action Sequence Tokenization).[6] Standard per-dimension binning of robot actions, the discretisation used by RT-2 and OpenVLA, was shown to fail on high-frequency dexterous tasks because consecutive action vectors are highly correlated and binning each independently produces an enormous redundant token stream. FAST instead applies a per-dimension Discrete Cosine Transform to action chunks, scales and rounds the coefficients so that low-magnitude (mostly high-frequency) components quantise to zero, and applies Byte-Pair Encoding to the resulting integer sequences. The BPE stage is itself lossless, so the only information lost is the rounding error of the quantisation step. The result is a compact and almost universal tokenisation governed by only two hyperparameters (a scaling coefficient and a BPE vocabulary size).
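The transform-and-quantise stage can be illustrated in a few lines. This is a simplified sketch, not the released FAST tokenizer: the Byte-Pair Encoding stage is omitted, the function names are hypothetical, and the scale of 10 and the sine-wave trajectory are arbitrary choices made for the example.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * total)
    return out


def idct(coeffs):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        total = coeffs[0] * math.sqrt(1.0 / n)
        total += sum(c * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                     for k, c in enumerate(coeffs) if k > 0)
        out.append(total)
    return out


def quantise(chunk, scale):
    """DCT each action dimension, then scale and round to integers."""
    return [[round(scale * c) for c in dct(trajectory)] for trajectory in chunk]


def dequantise(tokens, scale):
    """Undo the scaling and invert the DCT for each action dimension."""
    return [idct([t / scale for t in coeffs]) for coeffs in tokens]


H = 50
trajectory = [math.sin(2 * math.pi * t / H) for t in range(H)]  # one smooth dimension
tokens = quantise([trajectory], scale=10.0)
nonzero = sum(1 for t in tokens[0] if t != 0)
reconstruction = dequantise(tokens, scale=10.0)[0]
max_error = max(abs(a - b) for a, b in zip(trajectory, reconstruction))
print(nonzero, "non-zero coefficients out of", H)
```

For a smooth trajectory most integer coefficients are zero, which is what makes the subsequent BPE stage effective; raising the scale keeps more coefficients and lowers the reconstruction error at the cost of longer token sequences.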
π₀-FAST combines this tokenizer with the same PaliGemma backbone as π₀, but predicts actions autoregressively instead of via flow matching. The training data and overall architecture are otherwise comparable. The principal benefit is training efficiency: π₀-FAST reaches similar performance to flow-matching π₀ while training up to 5× faster on the same data, because each example contributes more tokens per gradient step and autoregressive prediction is well-supported by existing language-model training kernels.[6] A pretrained universal tokenizer, FAST+, trained on 1 million action sequences from single-arm, bimanual and mobile platforms, is released as a black-box tool and is the default for downstream π₀-FAST users.[6]
π₀-FAST checkpoints are included in openpi from February 2025 onwards, alongside the original π₀.[4]
The most prominent follow-up model, π₀.₅ (often written pi-zero-point-five or pi0.5), was announced on 22 April 2025 in the paper π₀.₅: a Vision-Language-Action Model with Open-World Generalization (arXiv:2504.16054).[7][8] Its principal motivation is open-world generalisation: π₀ achieves strong performance in environments resembling its training distribution but degrades on entirely novel homes, kitchens and offices. π₀.₅ targets this gap with three main changes:
The released demos show a mobile bimanual robot cleaning up entirely unseen kitchens and bedrooms, putting dishes in sinks, making beds, organising clothes in laundry baskets and wiping spills with sponges. The team reports an out-of-distribution success rate of roughly 94 % at following language commands, with performance approaching in-distribution baselines after training on around 100 distinct home environments.[7] π₀.₅ checkpoints are included in openpi from September 2025.
A further iteration, π∗₀.₆, was posted to arXiv in November 2025 (π∗₀.₆: a VLA That Learns From Experience, arXiv:2511.14759), adding the RECAP reinforcement-learning algorithm to allow continued improvement from real-world deployment data, but is outside the scope of this article.[12]
π₀'s significance lies less in any single architectural component and more in the way it crystallised several converging trends in robot learning into a coherent, openly available system.
Limitations are also widely acknowledged. Independent third-party evaluations have found that π₀'s zero-shot success on out-of-distribution objects and environments degrades substantially relative to in-distribution rates, and that long-horizon stability still depends on careful prompting and high-quality post-training data.[16] These observations motivated the π₀.₅ generalisation work and the π∗₀.₆ continual-learning approach. They also frame π₀ less as a finished product than as the first widely available "GPT-2 moment" for generalist robot policies.[4]