π0


Updated Jun 22, 2026


π0 (pronounced "pi-zero") is a vision-language-action model for general-purpose robot control developed by Physical Intelligence, a San Francisco-based robotics startup, and introduced on October 31, 2024.^[1]^[2] It is a single neural network that controls multiple robot bodies and performs dexterous manipulation tasks such as folding laundry, bussing tables, and assembling boxes by combining a pretrained vision-language backbone with a flow matching action head.^[1] The model was the first widely circulated robot foundation model built on Google's open PaliGemma backbone, and Physical Intelligence describes it as "our first generalist policy" capable of performing "a wide range of different skills" while controlling "a wide range of different robots."^[2]

π0 was introduced in the paper "π0: A Vision-Language-Action Flow Model for General Robot Control," first posted to arXiv on October 31, 2024.^[1] It uses PaliGemma, a 3 billion parameter vision-language model, as its backbone and adds an "action expert" of roughly 300 million parameters that produces continuous joint commands through flow matching, a generative modeling technique closely related to diffusion models.^[1] The total model size is approximately 3.3 billion parameters.^[1] Training data spanned more than 10,000 hours of robot demonstrations collected on seven distinct robot configurations, plus open datasets such as Open X-Embodiment.^[1] π0 was Physical Intelligence's first publicly released model, appearing approximately eight months after the company was founded.^[2]

The paper frames the work around "generalist robot policies (i.e., robot foundation models)," proposing "a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge."^[1] The public demonstration that accompanied the October 2024 release showed a single π0 policy folding laundry, bussing tables, assembling cardboard boxes, and bagging groceries, one of the first examples of a single foundation model executing long-horizon dexterous skills across very different robot platforms.^[2] In February 2025, Physical Intelligence released model weights and source code under an Apache 2.0 license through the openpi GitHub repository, together with the autoregressive variant π0-FAST; weights and code for the follow-up model π0.5 were added to the repository later.^[4]^[5]

Infobox

Field	Details
Developer	Physical Intelligence
First arXiv release	October 31, 2024
Model type	Vision-language-action (VLA) flow-matching policy
Base model	PaliGemma 3B (SigLIP-So400m vision encoder + Gemma 2B language model)
Architecture	Transformer VLM with separate "action expert" and flow-matching action head
Total parameters	~3.3 billion (3B VLM backbone + ~300M action expert)
Action representation	Action chunks of H = 50 future actions, generated via flow matching
Control frequency	Up to 50 Hz
Training data	10,000+ hours of robot demonstrations across 7 robot configurations and 68 tasks, plus Open X-Embodiment, Bridge v2, DROID, and PaliGemma's web pretraining
Embodiments supported	Single-arm (UR5e, Franka), bimanual (UR5e, Trossen ViperX, ARX/AgileX), mobile manipulators (Trossen, Fibocom)
Open source	openpi GitHub repository, released February 4, 2025
License	Apache 2.0 (with included Gemma license terms)
Successors	π0-FAST (January 2025), π0.5 (April 2025), π0.5+KI (Knowledge Insulation, 2025), π0.6 (November 2025)
Venue	Robotics: Science and Systems (RSS) 2025

When was π0 released and who built it?

Physical Intelligence (often stylized as π) was incorporated in San Francisco in 2024 to build foundation models for robotics. The company's co-founders are Karol Hausman (Chief Executive Officer, formerly a staff research scientist at Google DeepMind and adjunct professor at Stanford), Sergey Levine (Chief Scientist and a tenured faculty member at the University of California, Berkeley), Chelsea Finn (assistant professor at Stanford), Brian Ichter, Quan Vuong, Adnan Esmail, and the entrepreneur and investor Lachy Groom. The company raised an initial $70 million seed round in March 2024.

In early November 2024, days after the π0 paper appeared, Physical Intelligence announced a $400 million Series A at a $2.4 billion post-money valuation.^[8] The round was led by Jeff Bezos, Thrive Capital, and Lux Capital, with participation from OpenAI, Bond, Sequoia Capital, and Khosla Ventures.^[8]

The paper "π0: A Vision-Language-Action Flow Model for General Robot Control" was posted to arXiv as 2410.24164 on October 31, 2024 with 24 listed authors, including Kevin Black, Noah Brown, Danny Driess, Chelsea Finn, Karol Hausman, Brian Ichter, Sergey Levine, Karl Pertsch, Lucy Xiaoyang Shi, Quan Vuong, and Ury Zhilinsky.^[1] The accompanying blog post, titled "π0: Our First Generalist Policy," included videos of robots folding laundry, bussing tables, and assembling cardboard boxes.^[2] Physical Intelligence stated that the model had been built over roughly eight months of work.^[2]

Key funding, publication, and release milestones around π0 are listed below:

Date	Milestone
March 2024	Physical Intelligence raises $70M seed round
October 31, 2024	π0 paper posted to arXiv
November 4, 2024	$400M Series A at $2.4B post-money valuation announced
January 16, 2025	FAST paper (arXiv 2501.09747) introduces π0-FAST tokenizer variant
February 4, 2025	openpi repository released with π0 and π0-FAST weights and code
April 22, 2025	π0.5 paper (arXiv 2504.16054) released, adding open-world generalization
2025	Knowledge Insulation (KI) variant published, used in π0.5+KI
November 17, 2025	π0.6 model card released

The π0 paper was later accepted for presentation at Robotics: Science and Systems 2025.

How does π0 work?

π0 is a transformer-based policy that fuses three modalities: images, language, and proprioceptive robot state.^[1] Its design treats action generation as a conditional flow-matching problem on top of a pretrained vision-language model.^[1]

Vision-language backbone

The backbone is the publicly released PaliGemma 3B model from Google, which combines the SigLIP-So400m image encoder with the Gemma 2B autoregressive language model and projects image patch features into the language model's token space.^[11] PaliGemma uses full bidirectional attention over its prefix (images plus text instruction) and causal attention for any generated text.^[11] π0 inherits this prefix structure and feeds it the camera images from the robot together with a natural-language task instruction.^[1]

Action expert

In addition to the PaliGemma backbone, π0 introduces a dedicated "action expert" submodule of roughly 300 million parameters.^[1] The action expert is a separate set of transformer weights that processes proprioceptive robot state tokens and noisy candidate action tokens. The two modules share attention so that action tokens can read from the image and language tokens, but the action expert has its own parameters and is initialized from scratch rather than from PaliGemma weights.^[1] The total parameter count of the combined model is approximately 3.3 billion.^[1]

Flow-matching action head

Rather than emitting a single action per timestep, π0 generates an "action chunk" of H = 50 future actions.^[1] At a 50 Hz control rate, this corresponds to one second of motion. Action chunks are produced through flow matching, a continuous-time generative formulation in which the network learns a velocity field that is integrated to transport Gaussian noise to clean actions.^[1] Physical Intelligence describes flow matching as "a variant of diffusion models" used to "augment pre-trained VLMs with continuous action outputs," letting the robot output motor commands "up to 50 times per second."^[2] At inference time, π0 uses 10 integration steps to denoise a noisy action chunk into the final commanded trajectory.^[1] Flow matching can represent smooth, multi-modal action distributions, which suits high-frequency dexterous manipulation, a setting where discretized token-based policies have historically struggled.
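The sampling loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Physical Intelligence's implementation: `toy_velocity` is a hypothetical stand-in for the learned network that simply points toward a known target along a straight path, the chunk covers a single action dimension, and the convention that noise sits at t = 0 and clean actions at t = 1 is chosen here for illustration. Only H = 50, the 50 Hz control rate, and the 10 integration steps come from the article.

```python
import random

H = 50           # actions per chunk (from the article)
CONTROL_HZ = 50  # control rate in Hz (from the article)
NUM_STEPS = 10   # Euler integration steps at inference (from the article)


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned velocity field.

    For a straight path from noise (t = 0) to a known target (t = 1),
    the velocity at point x and time t is (target - x) / (1 - t).
    The real model predicts this from images, language, and robot state.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_chunk(target, num_steps=NUM_STEPS, seed=0):
    """Integrate the velocity field from Gaussian noise with forward Euler."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # noisy chunk at t = 0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t runs over 0.0, 0.1, ..., 0.9, so 1 - t never hits zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Made-up trajectory for one joint: a slow ramp over the chunk.
target = [0.01 * i for i in range(H)]
chunk = sample_chunk(target)

# Duration covered by one chunk at the stated control rate.
chunk_seconds = H / CONTROL_HZ
```

With this toy field the ten Euler steps land on the target up to floating-point error; a learned field would only approximate it, and the result would depend on the observation it is conditioned on.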

Attention pattern

π0 uses a custom blockwise attention mask with three blocks: image and language tokens, robot state tokens, and action tokens.^[1] Image and language tokens form a bidirectional prefix in the style of PaliGemma. Robot state tokens attend to the prefix, and action tokens attend to all earlier tokens (prefix and state) plus to one another within a chunk.^[1] The Hugging Face port implements this mask with PyTorch FlexAttention.^[10]
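The attention pattern can be made concrete with a small boolean mask. The sketch below is illustrative only: the function name `block_mask` and the token counts are assumptions, and it encodes just the rule stated above (a token may attend to any token in its own block or an earlier block), not the actual openpi or LeRobot code.

```python
def block_mask(n_prefix, n_state, n_action):
    """Return mask[q][k] = True when query token q may attend to key token k.

    Blocks are ordered prefix (images + language), state, actions.
    A token sees every token in its own block and in earlier blocks.
    """
    blocks = [0] * n_prefix + [1] * n_state + [2] * n_action
    return [[key_block <= query_block for key_block in blocks]
            for query_block in blocks]


def render(mask):
    """Draw the mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


# Hypothetical tiny sequence: 3 prefix tokens, 1 state token, 2 action tokens.
mask = block_mask(n_prefix=3, n_state=1, n_action=2)
print(render(mask))
```

Because prefix tokens never attend to state or action tokens, their computation does not depend on the noisy actions, which is what allows the prefix to be processed once and reused across integration steps in principle.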

Cross-embodiment handling

π0 is trained as a single policy that maps to many robot bodies.^[1] Robot-specific information such as joint counts, action dimensions, and image viewpoints is encoded into the input sequence so that the same network can drive single-arm, bimanual, and mobile manipulators without architectural changes.^[1] A 470-million-parameter ablation called π0-small, trained without any VLM pretraining, is reported in the paper as a comparison baseline.^[1]^[2]
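One simple way a single network can accept robots with different action dimensions is to zero-pad every state and action vector to a shared maximum width. The sketch below assumes that approach; the helper names and the value `MAX_DIM = 18` are illustrative assumptions rather than details taken from this article.

```python
MAX_DIM = 18  # assumed shared width, large enough for the biggest robot


def pad_vector(vec, max_dim=MAX_DIM):
    """Zero-pad a state or action vector to the shared width."""
    if len(vec) > max_dim:
        raise ValueError(f"vector has {len(vec)} dims, more than {max_dim}")
    return list(vec) + [0.0] * (max_dim - len(vec))


def unpad_vector(padded, dim):
    """Keep only the dimensions that exist on the target robot."""
    return list(padded[:dim])


# Hypothetical 7-dim single-arm command (6 joints + gripper).
single_arm = [0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 1.0]
padded = pad_vector(single_arm)
restored = unpad_vector(padded, dim=len(single_arm))
```

At deployment time only the first `dim` entries of each predicted action would be sent to the robot; the padded entries carry no meaning.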

What data was π0 trained on?

π0 is trained on a mixture of Physical Intelligence's own teleoperated robot data and publicly available cross-embodiment datasets, on top of PaliGemma's prior web-scale pretraining.^[1]

Data source	Description
π Dataset (proprietary)	Approximately 903 million timesteps collected by Physical Intelligence across 68 tasks and 7 robot configurations
Open X-Embodiment (RT-X)	Pooled open dataset spanning 22 robot embodiments contributed by 21 institutions
Bridge v2	UC Berkeley dataset of single-arm tabletop manipulation
DROID	Distributed open dataset of Franka Emika Panda manipulation collected by a multi-university consortium
Web data	Inherited from PaliGemma's image-text pretraining

The paper reports that the combined corpus represents over 10,000 hours of robot interaction data, which Physical Intelligence describes as the largest cross-embodiment robot training mix used at the time of release.^[1] The seven robot configurations covered by π's own data are listed below.

Configuration	Type	Notes
UR5e	Single-arm	6-DoF industrial arm
Bimanual UR5e	Two arms	Two UR5e arms on a shared workspace
Franka	Single-arm	Franka research arm
Bimanual Trossen ViperX	Two arms	ALOHA-style low-cost bimanual setup
Bimanual ARX / AgileX	Two arms	Higher-payload bimanual platform
Mobile Trossen / ARX	Mobile manipulator	Bimanual arms on a nonholonomic mobile base
Mobile Fibocom	Mobile manipulator	Bimanual arms on a holonomic base

Most of the demonstration data was collected through human teleoperation, building on the imitation learning tradition rather than reinforcement learning.^[1]

What can π0 do?

The original release demonstrated a single π0 policy performing a wide range of long-horizon dexterous tasks.^[2] The paper and accompanying blog post show:

Task	Embodiment shown	Notes
Laundry folding	Mobile bimanual and static bimanual	The task that drew the most attention; a single policy fetches laundry from a dryer, transports it, and folds clothing
Table bussing	Bimanual	Clearing dishes and trash
Box assembly	Bimanual	Folding flat-pack cardboard boxes
Grocery bagging	Bimanual	Packing items into shopping bags
Food prep and scooping	ALOHA-style bimanual	Demonstrated in the openpi fine-tuning examples
Object retrieval	Single-arm and bimanual	Picking and placing objects from clutter

A distinguishing property of π0 is that the model was shown executing chained sub-skills from a single language instruction, such as "fold the laundry," rather than depending on a hand-engineered task scheduler.^[2] Physical Intelligence reported that fine-tuning π0 on between one and twenty hours of additional task-specific data was typically sufficient to adapt the base model to a new manipulation task on the company's robots.^[1]

The demonstrations were performed at the company's San Francisco facility and were not teleoperated during evaluation, although the underlying training data was teleoperated.^[2] The publicly shown laundry-folding video, in particular, drew comparisons in trade press to the well-known difficulty of cloth manipulation, which involves deformable objects, occlusions, and very long task horizons.^[12]

π0-FAST

In January 2025, Physical Intelligence published "FAST: Efficient Action Tokenization for Vision-Language-Action Models" (arXiv 2501.09747) by Karl Pertsch and colleagues.^[3] The paper introduces FAST, short for Frequency-space Action Sequence Tokenization, and the corresponding π0-FAST policy variant.^[3]

FAST converts a continuous action chunk into a sequence of discrete tokens through the following pipeline:^[3]

Per-dimension normalization that maps the 1st and 99th quantiles of each action dimension to the range [-1, 1].
A discrete cosine transform along the time dimension, moving the action chunk from the time domain to the frequency domain.
Scale-and-round of the resulting coefficients to retain only the most significant frequency components.
Byte-pair encoding over the resulting integer sequences, producing a compact discrete vocabulary.
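The first three steps of this pipeline, and their inverse, can be sketched in plain Python for a single action dimension. This is a minimal illustration, not the FAST implementation: the function names, the `scale=10.0` rounding factor, and the sine-wave test chunk are assumptions, the quantiles are computed from the chunk itself rather than from a training dataset, and the byte-pair encoding step is omitted.

```python
import math
import statistics


def normalize(chunk, q_lo, q_hi):
    """Map q_lo to -1 and q_hi to +1 (values beyond the quantiles fall outside)."""
    return [2.0 * (a - q_lo) / (q_hi - q_lo) - 1.0 for a in chunk]


def denormalize(chunk, q_lo, q_hi):
    return [(a + 1.0) * (q_hi - q_lo) / 2.0 + q_lo for a in chunk]


def _scale(k, n):
    # Orthonormal DCT scaling factor for coefficient k.
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def dct(x):
    """Orthonormal DCT-II along the time axis."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n))
            for k in range(n)]


def idct(c):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n))
            for i in range(n)]


def encode(chunk, q_lo, q_hi, scale=10.0):
    """Normalize, transform, then scale-and-round to integers."""
    return [round(scale * c) for c in dct(normalize(chunk, q_lo, q_hi))]


def decode(ints, q_lo, q_hi, scale=10.0):
    """Undo scaling, inverse-transform, and denormalize."""
    return denormalize(idct([q / scale for q in ints]), q_lo, q_hi)


H = 50
# Made-up smooth trajectory for one joint.
chunk = [0.3 * math.sin(2 * math.pi * (t + 0.5) / H) for t in range(H)]
cuts = statistics.quantiles(chunk, n=100, method="inclusive")
q_lo, q_hi = cuts[0], cuts[-1]  # 1st and 99th percentiles

ints = encode(chunk, q_lo, q_hi)
recon = decode(ints, q_lo, q_hi)
nonzero = sum(1 for q in ints if q != 0)
max_err = max(abs(a - b) for a, b in zip(chunk, recon))
```

For a smooth chunk like this one, most of the rounded coefficients are zero, which is what leaves a short, repetitive integer sequence for byte-pair encoding to compress. In this sketch decoding recovers the chunk only up to the rounding error, and a larger `scale` trades longer sequences for lower error.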

The scheme is invertible, so tokens predicted by the autoregressive model can be decoded back into a continuous action chunk. Because actions are now discrete tokens, π0-FAST can be trained with a standard autoregressive next-token objective and cross-entropy loss, with no flow-matching or diffusion sampling at inference time.^[3]

The authors report that π0-FAST trains roughly five times faster than the original flow-matching π0 while matching or improving its task performance.^[3] Physical Intelligence also released a universal tokenizer called FAST+, trained on one million action sequences spanning single-arm, bimanual, and mobile-manipulation robots, available on Hugging Face for use as a black-box action tokenizer for other VLA projects.^[3] π0-FAST weights were released as part of the openpi repository in February 2025.^[5]

What is π0.5 and how does it differ from π0?

On April 22, 2025, Physical Intelligence released "π0.5: a Vision-Language-Action Model with Open-World Generalization" (arXiv 2504.16054).^[6] π0.5 is built on the π0 architecture but is co-trained on a substantially broader mixture of data, including:^[6]

Robot teleoperation data from multiple platforms (the same kind of data used to train π0).
Web data, including image-text pairs and image captioning corpora.
Human-video data and verbal instruction data from people.
High-level semantic prediction tasks such as object detection and subtask labeling.

The central claim of the paper is open-world generalization.^[6] In the π0.5 training mixture, only a minority of the data comes from mobile manipulators, with the company reporting that 97.6% of the total training data comes from sources other than the target mobile-manipulation platform.^[6] π0.5 was evaluated by deploying it in real homes that the model had never seen during training, where it performed long-horizon manipulation tasks such as cleaning kitchens and bedrooms, putting dishes in sinks, and tidying bedding.^[6] Physical Intelligence described π0.5 as the first end-to-end learned robotic policy able to perform multi-step manipulation in entirely new home environments.^[6]

A later paper, "Knowledge Insulating Vision-Language-Action Models," formalizes a single-stage training recipe associated with π0.5 in which the action expert is updated alongside the VLM backbone but with action gradients prevented from propagating into the VLM weights.^[7] Physical Intelligence reports that this Knowledge Insulation (KI) recipe requires roughly 7.5 times fewer training steps than the original π0 schedule while preserving inference-time performance.^[7] The technique is the basis for the π0.5+KI variant.^[7] In November 2025 the company published a model card for π0.6, the next iteration in the family.^[14]

Reception

π0 was widely covered as a notable step toward general-purpose robot foundation models. The Robot Report, IEEE Spectrum, InfoQ, New Atlas, and CNBC reported on the model and on the related Series A funding announcement.^[9]^[12]^[8] Press attention focused on the laundry-folding video, which demonstrated a complete clothing-folding workflow on a bimanual mobile robot, and on the broader claim that one model could control several distinct robot bodies.^[12]

Within the robotics research community, π0 was compared with Google DeepMind's RT-2 and with the OpenVLA project from Stanford and collaborators. Commentary noted that π0 was the first widely circulated VLA built on the open PaliGemma backbone and the first to use flow matching as the action representation, in contrast to the discrete-token approach used by RT-2 and OpenVLA. The model's relatively modest 3.3 billion parameter count, compared with the 55 billion parameters of RT-2, was also seen as evidence that smaller, well-architected VLAs could produce high-frequency dexterous behavior. Independent in-the-wild evaluations of π0, including the "Evaluating π0 in the Wild" study from the University of Pennsylvania PAL Lab, examined the model's strengths and failure modes on tasks outside its training distribution.^[13]

Physical Intelligence is widely seen as one of the leading robot-foundation-model startups alongside Skild AI, Covariant, and others. Comparisons were also drawn with Tesla's Optimus program and with NVIDIA's GR00T humanoid foundation models, although those efforts target different deployment hardware.

Is π0 open source?

Yes. On February 4, 2025, Physical Intelligence released the openpi repository at github.com/Physical-Intelligence/openpi.^[4]^[5] The release was announced in a blog post titled "Open Sourcing π0" and included:^[4]

Pretrained base checkpoints for π0 and for π0-FAST, both trained on the company's full 10,000-hour robot data mix.
Reference JAX code for model definition, inference, and fine-tuning.
Fine-tuned checkpoints intended to work out of the box on common community platforms.
Recipes and scripts for fine-tuning on a new robot or task.

The distribution is licensed under Apache 2.0, with the underlying Gemma terms also included.^[5] Subsequent additions to the repository have included weights and code for π0.5 and the Knowledge Insulation variant.^[5]

The fine-tuned checkpoints made available at launch covered the following platforms:

Checkpoint	Robot platform	Notes
π0-DROID	Franka via DROID	Single-arm tabletop manipulation
π0-FAST-DROID	Franka via DROID	Autoregressive variant on DROID
π0-ALOHA-towel	ALOHA bimanual	Towel folding on internal data
π0-ALOHA-tupperware	ALOHA bimanual	Container manipulation
π0-ALOHA-pen-uncap	ALOHA bimanual	Trained on the public ALOHA pen-uncap dataset

Reported hardware requirements for the released code are roughly 8 GB of GPU memory for inference, 22.5 GB for LoRA fine-tuning, and 70 GB or more for full fine-tuning, with Ubuntu 22.04 as the recommended host operating system.^[5] Hugging Face also released a PyTorch port of π0 and π0-FAST integrated into the LeRobot library.^[10]
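As a quick planning aid, the reported thresholds can be turned into a lookup. The function and dictionary names below are hypothetical, and the figures are the approximate values quoted above, so results near a threshold should be treated as rough guidance only.

```python
# Approximate minimum GPU memory in GB, as quoted in the article.
MIN_GPU_GB = {
    "inference": 8.0,
    "lora_finetune": 22.5,
    "full_finetune": 70.0,
}


def supported_modes(gpu_gb):
    """Return the usage modes whose reported minimum fits in gpu_gb."""
    return [mode for mode, need in MIN_GPU_GB.items() if gpu_gb >= need]


modes_24gb = supported_modes(24)  # e.g. a 24 GB consumer GPU
modes_80gb = supported_modes(80)  # e.g. an 80 GB data-center GPU
```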

How does π0 compare with other VLA models?

Model	Developer	Released	Parameters	Action representation	Open weights	Backbone
RT-2	Google DeepMind	2023	Up to 55B (PaLI-X variant)	Discrete action tokens	No	PaLI-X / PaLM-E
OpenVLA	Stanford and collaborators	June 2024	7B	Discrete action tokens	Yes	Llama-2 with DINOv2 + SigLIP vision
π0	Physical Intelligence	October 2024	~3.3B	Continuous action chunks via flow matching	Yes (Feb 2025)	PaliGemma 3B
π0-FAST	Physical Intelligence	January 2025	~3.3B	Discrete tokens via FAST (DCT + BPE)	Yes	PaliGemma 3B
π0.5	Physical Intelligence	April 2025	Comparable to π0	Flow matching with co-training on heterogeneous data	Yes	PaliGemma-based
GR00T N1	NVIDIA	2025	~2B	Continuous actions from a diffusion-transformer action module (dual-system design)	Yes	NVIDIA in-house

The table shows the broad design space at the time of π0's release. Compared with RT-2, π0 is much smaller and openly distributed, and it uses continuous flow-matching actions rather than discretized text tokens. Compared with OpenVLA, π0 uses the smaller PaliGemma backbone but trains on a larger mixture of teleoperated robot data and supports more embodiments. Compared with NVIDIA's GR00T N1, π0 emphasizes mobile and tabletop manipulators rather than humanoids and uses a single-network design rather than an explicit System 1 / System 2 split.

References

1. Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L. X., Tanner, J., Vuong, Q., Walling, A., Wang, H., and Zhilinsky, U. "π0: A Vision-Language-Action Flow Model for General Robot Control." arXiv:2410.24164, October 31, 2024. https://arxiv.org/abs/2410.24164
2. Physical Intelligence. "π0: Our First Generalist Policy." Blog post, October 31, 2024. https://www.pi.website/blog/pi0
3. Pertsch, K. et al. "FAST: Efficient Action Tokenization for Vision-Language-Action Models." arXiv:2501.09747, January 16, 2025. https://arxiv.org/abs/2501.09747
4. Physical Intelligence. "Open Sourcing π0." Blog post, February 4, 2025. https://www.pi.website/blog/openpi
5. Physical-Intelligence. openpi GitHub repository. https://github.com/Physical-Intelligence/openpi
6. Physical Intelligence. "π0.5: a Vision-Language-Action Model with Open-World Generalization." arXiv:2504.16054, April 22, 2025. https://arxiv.org/abs/2504.16054
7. Physical Intelligence. "Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better." Research page. https://www.pi.website/research/knowledge_insulation
8. Beilein, R., Kahn, J., and Field, H. "Jeff Bezos and OpenAI invest in robot startup Physical Intelligence at $2.4 billion valuation." CNBC, November 4, 2024. https://www.cnbc.com/2024/11/04/jeff-bezos-and-openai-invest-in-robot-startup-physical-intelligence.html
9. Crowe, S. "Physical Intelligence open-sources Pi0 robotics foundation model." The Robot Report, February 2025. https://www.therobotreport.com/physical-intelligence-open-sources-pi0-robotics-foundation-model/
10. Hugging Face. "π0 and π0-FAST: Vision-Language-Action Models for General Robot Control." Hugging Face Blog. https://huggingface.co/blog/pi0
11. Beyer, L. et al. "PaliGemma: A versatile 3B VLM for transfer." arXiv:2407.07726. https://arxiv.org/abs/2407.07726
12. Blain, L. "Incredible generalist robots do your laundry and dishes." New Atlas, November 2024. https://newatlas.com/robotics/pi-generalist-autonomous-robot/
13. Penn PAL Lab. "Evaluating π0 in the Wild: Strengths, Problems, and the Future of Generalist Robot Policies." https://penn-pal-lab.github.io/Pi0-Experiment-in-the-Wild/
14. Physical Intelligence. "π0.6 Model Card." November 17, 2025. https://website.pi-asset.com/pi06star/PI06_model_card.pdf
