π0
Last reviewed
May 4, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 3,234 words
π0 (pronounced "pi-zero") is a vision-language-action model for general-purpose robot control developed by Physical Intelligence, a San Francisco-based robotics startup. The model was introduced in the paper "π0: A Vision-Language-Action Flow Model for General Robot Control," first posted to arXiv on October 31, 2024. π0 combines a pretrained vision-language backbone with a flow-matching action head, allowing a single neural network to control multiple robot embodiments and to perform dexterous manipulation tasks at high control rates.
π0 was Physical Intelligence's first publicly released model and the company's debut research artifact, appearing approximately eight months after the company was founded. It uses Google's PaliGemma 3 billion parameter vision-language model as its backbone and adds an "action expert" of roughly 300 million parameters that produces continuous joint commands through flow matching, a generative modeling technique closely related to diffusion models. The total model size is approximately 3.3 billion parameters. Training data spanned more than 10,000 hours of robot demonstrations collected on seven distinct robot configurations, plus open datasets such as Open X-Embodiment.
The public demonstration that accompanied the October 2024 release showed a single π0 policy folding laundry, bussing tables, assembling cardboard boxes, and bagging groceries, making it one of the first examples of a single foundation model executing long-horizon dexterous skills across very different robot platforms. Physical Intelligence released model weights and source code under an Apache 2.0 license through the openpi GitHub repository in February 2025, together with an autoregressive variant called π0-FAST. Weights and code for the follow-up model π0.5 were added to the repository later.
| Field | Details |
|---|---|
| Developer | Physical Intelligence |
| First arXiv release | October 31, 2024 |
| Model type | Vision-language-action (VLA) flow-matching policy |
| Base model | PaliGemma 3B (SigLIP-So400m vision encoder + Gemma 2B language model) |
| Architecture | Transformer VLM with separate "action expert" and flow-matching action head |
| Total parameters | ~3.3 billion (3B VLM backbone + ~300M action expert) |
| Action representation | Action chunks of H = 50 future actions, generated via flow matching |
| Control frequency | Up to 50 Hz |
| Training data | 10,000+ hours of robot demonstrations across 7 robot configurations and 68 tasks, plus Open X-Embodiment, Bridge v2, DROID, and PaliGemma's web pretraining |
| Embodiments supported | Single-arm (UR5e, Franka), bimanual (UR5e, Trossen ViperX, ARX/AgileX), mobile manipulators (Trossen, Fibocom) |
| Open source | openpi GitHub repository, released February 4, 2025 |
| License | Apache 2.0 (with included Gemma license terms) |
| Successors | π0-FAST (January 2025), π0.5 (April 2025), π0.5+KI (Knowledge Insulation, 2025), π0.6 (November 2025) |
| Venue | Robotics: Science and Systems (RSS) 2025 |
Physical Intelligence (often stylized as π) was incorporated in San Francisco in 2024 to build foundation models for robotics. The company's co-founders are Karol Hausman (Chief Executive Officer, formerly a staff research scientist at Google DeepMind and adjunct professor at Stanford), Sergey Levine (Chief Scientist and a tenured faculty member at the University of California, Berkeley), Chelsea Finn (assistant professor at Stanford), Brian Ichter, Quan Vuong, Adnan Esmail, and the entrepreneur and investor Lachy Groom. The company raised an initial $70 million seed round in March 2024.
In early November 2024, days after the π0 paper appeared, Physical Intelligence announced a $400 million Series A at a $2.4 billion post-money valuation. The round was led by Jeff Bezos, Thrive Capital, and Lux Capital, with participation from OpenAI, Bond, Sequoia Capital, and Khosla Ventures.
The paper "π0: A Vision-Language-Action Flow Model for General Robot Control" was posted to arXiv as 2410.24164 on October 31, 2024 with 24 listed authors, including Kevin Black, Noah Brown, Danny Driess, Chelsea Finn, Karol Hausman, Brian Ichter, Sergey Levine, Karl Pertsch, Lucy Xiaoyang Shi, Quan Vuong, and Ury Zhilinsky among others. The accompanying blog post, titled "π0: Our First Generalist Policy," included videos of robots folding laundry, bussing tables, and assembling cardboard boxes. Physical Intelligence stated that the model had been built over roughly eight months of work.
Key milestones before and after the release are summarized below:
| Date | Milestone |
|---|---|
| March 2024 | Physical Intelligence raises $70M seed round |
| October 31, 2024 | π0 paper posted to arXiv |
| November 4, 2024 | $400M Series A at $2.4B post-money valuation announced |
| January 16, 2025 | FAST paper (arXiv 2501.09747) introduces π0-FAST tokenizer variant |
| February 4, 2025 | openpi repository released with π0 and π0-FAST weights and code |
| April 22, 2025 | π0.5 paper (arXiv 2504.16054) released, adding open-world generalization |
| 2025 | Knowledge Insulation (KI) variant published, used in π0.5+KI |
| November 17, 2025 | π0.6 model card released |
The π0 paper was later accepted for presentation at Robotics: Science and Systems (RSS) 2025.
π0 is a transformer-based policy that fuses three modalities: images, language, and proprioceptive robot state. Its design treats action generation as a conditional flow-matching problem on top of a pretrained vision-language model.
The backbone is the publicly released PaliGemma 3B model from Google, which combines the SigLIP-So400m image encoder with the Gemma 2B autoregressive language model and projects image patch features into the language model's token space. PaliGemma uses full bidirectional attention over its prefix (images plus text instruction) and causal attention for any generated text. π0 inherits this prefix structure and feeds it the camera images from the robot together with a natural-language task instruction.
In addition to the PaliGemma backbone, π0 introduces a dedicated "action expert" submodule of roughly 300 million parameters. The action expert is a separate set of transformer weights that processes proprioceptive robot state tokens and noisy candidate action tokens. The two modules share attention so that action tokens can read from the image and language tokens, but the action expert has its own parameters and is initialized from scratch rather than from PaliGemma weights. The total parameter count of the combined model is approximately 3.3 billion.
Rather than emitting a single action per timestep, π0 generates an "action chunk" of H = 50 future actions. At a 50 Hz control rate, this corresponds to one second of motion. Action chunks are produced through flow matching, a continuous-time generative formulation in which the model learns a velocity field that carries samples from Gaussian noise to clean actions. At inference time, π0 integrates this field in 10 steps to denoise a noisy action chunk into the final commanded trajectory. Flow matching can represent smooth, multi-modal action distributions, which makes it well suited to high-frequency dexterous manipulation, where discretized token-based policies have historically struggled.
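The sampling procedure described above can be illustrated with a minimal NumPy sketch. It assumes forward Euler integration from noise at τ = 0 to actions at τ = 1, and it replaces the learned network with a hypothetical closed-form velocity field (`toy_velocity`) that points at a known target chunk. `ACTION_DIM` and all function names are illustrative and are not taken from the π0 code.

```python
import numpy as np

H = 50            # actions per chunk (from the text)
CONTROL_HZ = 50   # control rate (from the text)
NUM_STEPS = 10    # integration steps at inference (from the text)
ACTION_DIM = 7    # illustrative action width, not a value from the source


def toy_velocity(noisy_actions, tau, target):
    """Hypothetical stand-in for the learned velocity field.

    On a straight-line path from noise to `target`, the velocity that
    carries a point at time tau to the target at time 1 is
    (target - x) / (1 - tau).
    """
    return (target - noisy_actions) / (1.0 - tau)


def sample_action_chunk(velocity_fn, rng, num_steps=NUM_STEPS):
    """Integrate a velocity field from noise (tau=0) to actions (tau=1)."""
    actions = rng.standard_normal((H, ACTION_DIM))  # start from Gaussian noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * dt
        actions = actions + dt * velocity_fn(actions, tau)  # forward Euler step
    return actions


rng = np.random.default_rng(0)
# A smooth ramp stands in for the "clean" action chunk.
target_chunk = np.tile(np.linspace(0.0, 1.0, H)[:, None], (1, ACTION_DIM))
chunk = sample_action_chunk(lambda a, tau: toy_velocity(a, tau, target_chunk), rng)
chunk_seconds = H / CONTROL_HZ  # duration covered by one chunk
```

In the real model the velocity is predicted by the action expert conditioned on images, language, and robot state. The toy field here only makes the integration loop checkable.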
π0 uses a custom block-sparse two-dimensional attention mask. Image and language tokens form a bidirectional prefix in the style of PaliGemma. Robot state tokens attend to the prefix, and action tokens attend to all earlier tokens (prefix and state) plus to one another within a chunk. The Hugging Face port implements this mask with PyTorch FlexAttention.
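A small NumPy sketch of such a mask is shown here, assuming the token order prefix, then state, then actions, and assuming full attention inside each block. The function name and the token counts are hypothetical, and the sketch does not reproduce the FlexAttention implementation.

```python
import numpy as np


def blockwise_attention_mask(num_prefix, num_state, num_action):
    """Return a boolean mask where mask[i, j] is True if token i may attend to token j.

    Tokens are ordered as [prefix (images + language), state, actions].
    A token attends to every token in its own block and in earlier blocks.
    """
    block_sizes = [num_prefix, num_state, num_action]
    block_ids = np.repeat(np.arange(len(block_sizes)), block_sizes)
    return block_ids[:, None] >= block_ids[None, :]


# Illustrative sizes only: 4 prefix tokens, 1 state token, 3 action tokens.
mask = blockwise_attention_mask(num_prefix=4, num_state=1, num_action=3)
print(mask.astype(int))
```

Because prefix rows never attend to state or action columns, the prefix computation does not depend on the noisy actions. In principle this allows it to be computed once and reused across denoising steps.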
π0 is trained as a single policy that maps to many robot bodies. Robot-specific information such as joint counts, action dimensions, and image viewpoints is encoded into the input sequence so that the same network can drive single-arm, bimanual, and mobile manipulators without architectural changes. A 470-million-parameter ablation called π0-small, trained without any VLM pretraining, is reported in the paper as a comparison baseline.
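One simple way to let a single network handle robots with different action dimensions is to zero-pad every action vector to a shared width. The sketch below assumes that approach for illustration only. `MAX_ACTION_DIM`, `pad_actions`, and the example widths are hypothetical, not names or values from the π0 code.

```python
import numpy as np

MAX_ACTION_DIM = 18  # assumed shared width, illustrative only


def pad_actions(actions, max_dim=MAX_ACTION_DIM):
    """Zero-pad a (horizon, dim) action chunk to (horizon, max_dim).

    Also returns a boolean vector marking which dimensions are real.
    """
    actions = np.asarray(actions, dtype=np.float32)
    horizon, dim = actions.shape
    if dim > max_dim:
        raise ValueError(f"action dim {dim} exceeds shared width {max_dim}")
    padded = np.zeros((horizon, max_dim), dtype=np.float32)
    padded[:, :dim] = actions
    valid = np.zeros(max_dim, dtype=bool)
    valid[:dim] = True
    return padded, valid


# Example: a 7-dimensional single-arm chunk and a 14-dimensional bimanual chunk.
single_arm, single_valid = pad_actions(np.ones((50, 7)))
bimanual, bimanual_valid = pad_actions(np.ones((50, 14)))
```

After padding, both chunks have the same shape and can be batched together. The validity vector lets downstream code ignore the padded dimensions when commanding the robot.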
π0 is trained on a mixture of Physical Intelligence's own teleoperated robot data and publicly available cross-embodiment datasets, on top of PaliGemma's prior web-scale pretraining.
| Data source | Description |
|---|---|
| π Dataset (proprietary) | Approximately 903 million timesteps collected by Physical Intelligence across 68 tasks and 7 robot configurations |
| Open X-Embodiment (RT-X) | Pooled open dataset spanning 22 robot embodiments contributed by 21 institutions |
| Bridge v2 | UC Berkeley dataset of single-arm tabletop manipulation |
| DROID | Distributed open dataset of Franka Emika Panda manipulation collected by a multi-university consortium |
| Web data | Inherited from PaliGemma's image-text pretraining |
The paper reports that the combined corpus represents over 10,000 hours of robot interaction data, which Physical Intelligence describes as the largest cross-embodiment robot training mix used at the time of release. The seven robot configurations covered by Physical Intelligence's own data are listed below.
| Configuration | Type | Notes |
|---|---|---|
| UR5e | Single-arm | 6-DoF industrial arm |
| Bimanual UR5e | Two arms | Two UR5e arms on a shared workspace |
| Franka | Single-arm | Franka research arm |
| Bimanual Trossen ViperX | Two arms | ALOHA-style low-cost bimanual setup |
| Bimanual ARX / AgileX | Two arms | Higher-payload bimanual platform |
| Mobile Trossen / ARX | Mobile manipulator | Bimanual arms on a nonholonomic mobile base |
| Mobile Fibocom | Mobile manipulator | Bimanual arms on a holonomic base |
Most of the demonstration data was collected through human teleoperation, building on the imitation learning tradition rather than reinforcement learning.
The original release demonstrated a single π0 policy performing a wide range of long-horizon dexterous tasks. The paper and accompanying blog post show:
| Task | Embodiment shown | Notes |
|---|---|---|
| Laundry folding | Mobile bimanual and static bimanual | The task that drew the most attention; a single policy fetches laundry from a dryer, transports it, and folds clothing |
| Table bussing | Bimanual | Clearing dishes and trash |
| Box assembly | Bimanual | Folding flat-pack cardboard boxes |
| Grocery bagging | Bimanual | Packing items into shopping bags |
| Food prep and scooping | ALOHA-style bimanual | Demonstrated in the openpi fine-tuning examples |
| Object retrieval | Single-arm and bimanual | Picking and placing objects from clutter |
A distinguishing property of π0 is that the model was shown executing chained sub-skills from a single language instruction, such as "fold the laundry," rather than depending on a hand-engineered task scheduler. Physical Intelligence reported that fine-tuning π0 on between one and twenty hours of additional task-specific data was typically sufficient to adapt the base model to a new manipulation task on the company's robots.
The demonstrations were performed at the company's San Francisco facility. The robots ran autonomously during evaluation, although the underlying training data was collected through teleoperation. The publicly shown laundry-folding video drew particular attention in the trade press because cloth manipulation is widely regarded as difficult: it involves deformable objects, occlusions, and very long task horizons.
In January 2025, Physical Intelligence published "FAST: Efficient Action Tokenization for Vision-Language-Action Models" (arXiv 2501.09747) by Karl Pertsch and colleagues. The paper introduces FAST, short for Frequency-space Action Sequence Tokenization, and the corresponding π0-FAST policy variant.
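FAST converts a continuous action chunk into a sequence of discrete tokens by normalizing the actions, applying a discrete cosine transform (DCT) to each action dimension over the chunk, quantizing the resulting coefficients by scaling and rounding, flattening them into a single integer sequence, and compressing that sequence with byte-pair encoding (BPE).
The scheme is invertible, so tokens predicted by the autoregressive model can be decoded back into a continuous action chunk. Because actions are now discrete tokens, π0-FAST can be trained with a standard autoregressive next-token objective and cross-entropy loss, with no flow-matching or diffusion sampling at inference time.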
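The frequency-space part of FAST (the comparison table in this article lists it as DCT + BPE) can be sketched in NumPy as follows. This is a simplified illustration rather than the released tokenizer: it builds an orthonormal DCT-II matrix by hand, uses an assumed quantization `scale` of 10, and omits normalization and the final BPE compression step. All function names are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II matrix of size (n, n)."""
    k = np.arange(n)[:, None]   # frequency index
    t = np.arange(n)[None, :]   # time index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    mat[0] /= np.sqrt(2.0)      # rescale the constant row so that mat @ mat.T == I
    return mat


def encode_chunk(actions, scale=10.0):
    """DCT along time for each action dimension, then scale and round to integers."""
    coeffs = dct_matrix(actions.shape[0]) @ actions
    return np.round(coeffs * scale).astype(np.int64)


def decode_chunk(quantized, scale=10.0):
    """Undo the scaling and apply the inverse DCT (the transpose of an orthonormal matrix)."""
    coeffs = quantized.astype(np.float64) / scale
    return dct_matrix(quantized.shape[0]).T @ coeffs


H = 50
t = np.linspace(0.0, 1.0, H)
# Illustrative smooth two-dimensional action chunk.
actions = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)

quantized = encode_chunk(actions)
reconstructed = decode_chunk(quantized)
max_error = np.abs(reconstructed - actions).max()
num_nonzero = np.count_nonzero(quantized)
```

In this sketch the only loss comes from rounding, so the reconstruction error is bounded by the quantization step. For a smooth chunk most of the high-frequency coefficients round to zero, which is what makes the integer sequence compressible by a later step such as BPE.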
The authors report that π0-FAST trains roughly five times faster than the original flow-matching π0 while matching or improving its task performance. Physical Intelligence also released a universal tokenizer called FAST+, trained on one million action sequences spanning single-arm, bimanual, and mobile-manipulation robots, available on Hugging Face for use as a black-box action tokenizer for other VLA projects. π0-FAST weights were released as part of the openpi repository in February 2025.
On April 22, 2025, Physical Intelligence released "π0.5: a Vision-Language-Action Model with Open-World Generalization" (arXiv 2504.16054). π0.5 is built on the π0 architecture but is co-trained on a substantially broader, heterogeneous mixture of data, including data from multiple robots, high-level semantic prediction tasks, and web data.
The central claim of the paper is open-world generalization. π0.5 was evaluated by deploying it in real homes that the model had never seen during training, where it performed long-horizon manipulation tasks such as cleaning kitchens and bedrooms, putting dishes in sinks, and tidying bedding. Physical Intelligence described π0.5 as the first end-to-end learned robotic policy able to perform multi-step manipulation in entirely new home environments.
A later paper, "Knowledge Insulating Vision-Language-Action Models," formalizes a single-stage training recipe associated with π0.5 in which the action expert is updated alongside the VLM backbone but with action gradients prevented from propagating into the VLM weights. Physical Intelligence reports that this Knowledge Insulation (KI) recipe requires roughly 7.5 times fewer training steps than the original π0 schedule while preserving inference-time performance. The technique is the basis for the π0.5+KI variant. In November 2025 the company published a model card for π0.6, the next iteration in the family.
π0 was widely covered as a notable step toward general-purpose robot foundation models. The Robot Report, IEEE Spectrum, InfoQ, New Atlas, and CNBC reported on the model and on the related Series A funding announcement. Press attention focused on the laundry-folding video, which demonstrated a complete clothing-folding workflow on a bimanual mobile robot, and on the broader claim that one model could control several distinct robot bodies.
Within the robotics research community, π0 was compared with Google DeepMind's RT-2 and with the OpenVLA project from Stanford and collaborators. Commentary noted that π0 was the first widely circulated VLA built on the open PaliGemma backbone and the first to use flow matching as the action representation, in contrast to the discrete-token approach used by RT-2 and OpenVLA. The model's relatively modest 3.3 billion parameter count, compared with the 55 billion parameters of RT-2, was also seen as evidence that smaller, well-architected VLAs could produce high-frequency dexterous behavior. Independent in-the-wild evaluations of π0, including the "Evaluating π0 in the Wild" study from the University of Pennsylvania PAL Lab, examined the model's strengths and failure modes on tasks outside its training distribution.
Physical Intelligence is widely seen as one of the leading robot-foundation-model startups alongside Skild AI, Covariant, and others. Comparisons were also drawn with Tesla's Optimus program and with NVIDIA's GR00T humanoid foundation models, although those efforts target different deployment hardware.
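On February 4, 2025, Physical Intelligence released the openpi repository at github.com/Physical-Intelligence/openpi. The release was announced in a blog post titled "Open Sourcing π0" and included weights for π0 and π0-FAST, fine-tuned checkpoints for several robot platforms, and code for inference and fine-tuning.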
The distribution is licensed under Apache 2.0, with the underlying Gemma terms also included. Subsequent additions to the repository have included weights and code for π0.5 and the Knowledge Insulation variant.
The fine-tuned checkpoints made available at launch covered the following platforms:
| Checkpoint | Robot platform | Notes |
|---|---|---|
| π0-DROID | Franka via DROID | Single-arm tabletop manipulation |
| π0-FAST-DROID | Franka via DROID | Autoregressive variant on DROID |
| π0-ALOHA-towel | ALOHA bimanual | Towel folding on internal data |
| π0-ALOHA-tupperware | ALOHA bimanual | Container manipulation |
| π0-ALOHA-pen-uncap | ALOHA bimanual | Trained on the public ALOHA pen-uncap dataset |
Reported hardware requirements for the released code are roughly 8 GB of GPU memory for inference, 22.5 GB for LoRA fine-tuning, and 70 GB or more for full fine-tuning, with Ubuntu 22.04 as the recommended host operating system. Hugging Face also released a PyTorch port of π0 and π0-FAST integrated into the LeRobot library.
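As a rough planning aid, the reported memory figures can be encoded in a small lookup. The helper below is a hypothetical sketch that treats the figures as approximate minimums. It is not part of openpi, and actual memory use depends on configuration.

```python
# Approximate GPU memory needs in GB, taken from the figures reported above.
REQUIRED_GPU_MEMORY_GB = {
    "inference": 8.0,
    "lora_finetune": 22.5,
    "full_finetune": 70.0,
}


def supported_modes(gpu_memory_gb):
    """Return the modes whose reported memory requirement fits on the given GPU."""
    return [
        mode
        for mode, required in REQUIRED_GPU_MEMORY_GB.items()
        if gpu_memory_gb >= required
    ]


print(supported_modes(24.0))  # a 24 GB card
print(supported_modes(80.0))  # an 80 GB card
```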
| Model | Developer | Year | Parameters | Action representation | Open weights | Backbone |
|---|---|---|---|---|---|---|
| RT-2 | Google DeepMind | 2023 | 55B (RT-2-X variant) | Discrete action tokens | No | PaLI-X / PaLM-E |
| OpenVLA | Stanford and collaborators | June 2024 | 7B | Discrete action tokens | Yes | Llama-2 with DINOv2 + SigLIP vision |
| π0 | Physical Intelligence | October 2024 | ~3.3B | Continuous action chunks via flow matching | Yes (Feb 2025) | PaliGemma 3B |
| π0-FAST | Physical Intelligence | January 2025 | ~3.3B | Discrete tokens via FAST (DCT + BPE) | Yes | PaliGemma 3B |
| π0.5 | Physical Intelligence | April 2025 | Comparable to π0 | Flow matching with co-training on heterogeneous data | Yes | PaliGemma-based |
| GR00T N1 | NVIDIA | 2025 | ~2B | Dual-system VLM + action transformer | Yes | NVIDIA in-house |
The table outlines the design space in the period around π0's release. Compared with RT-2, π0 is much smaller and openly distributed, and it uses continuous flow-matching actions rather than discretized text tokens. Compared with OpenVLA, π0 uses the smaller PaliGemma backbone but trains on a larger mixture of teleoperated robot data and supports more embodiments. Compared with NVIDIA's GR00T N1, π0 emphasizes mobile and tabletop manipulators rather than humanoids and uses a single-network design rather than an explicit System 1 / System 2 split.