# π₀ (pi-zero)

> Source: https://aiwiki.ai/wiki/pi_zero
> Updated: 2026-05-20
> Categories: AI Models, Embodied AI, Robotics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**π₀** (pronounced *pi-zero* and sometimes written *pi0* or *pizero*) is a [vision-language-action model](vision_language_action_model) (VLA) developed by the robotics foundation-model startup [Physical Intelligence](physical_intelligence). Announced on 31 October 2024 in the technical report *π₀: A Vision-Language-Action Flow Model for General Robot Control* by Kevin Black, Noah Brown, Danny Driess and 21 co-authors, π₀ is the company's first generalist robot policy and one of the most widely cited flagship VLAs of the 2024-2025 wave of generalist manipulation systems.[^1][^2] The model couples a 3-billion-parameter [PaliGemma](paligemma) vision-language backbone with a separate 300 M-parameter "action expert" that produces continuous, high-frequency motor commands using a [flow matching](flow_matching) objective borrowed from continuous-time diffusion modelling.[^2] Together with the proprietary cross-embodiment dataset of roughly 10 000 hours of teleoperated data on which it is pretrained, π₀ established a template that was rapidly adopted across the field: a pretrained VLM contributing semantic grounding, an action head producing 50 Hz action chunks, and large heterogeneous robot data providing cross-embodiment transfer.[^2][^3]

The model attracted unusual attention for two reasons. First, the accompanying demonstrations, recorded on bimanual mobile manipulators in real apartments and offices, showed sustained, multi-minute dexterous behaviour such as folding laundry from a tangled pile, bussing dining tables, assembling cardboard boxes and bagging groceries, far exceeding the seconds-long horizons typical of prior VLAs such as [RT-2](rt_2), [OpenVLA](openvla) or [Octo](octo).[^1] Second, on 4 February 2025 Physical Intelligence released the model weights, training code and fine-tuning recipes under an Apache 2.0 licence in the **openpi** repository, making π₀ the first frontier-scale generalist robot policy with openly downloadable weights.[^4][^5] Two variants, **π₀-FAST** (January 2025), built on the FAST frequency-space action tokenizer, and **π₀.₅** (April 2025), targeting open-world generalisation, were also released through the same repository.[^6][^7][^8]

In framing the model, Physical Intelligence positioned it as a "first step" rather than a finished product, drawing an explicit analogy with the trajectory of early generative pretrained text models: pretrain on a broad cross-embodiment data soup, fine-tune on a task-specific dataset of a few hours, deploy on a robot. That recipe, even where individual components were not new (action chunking from ACT, flow matching from continuous diffusion, large robot data pooling from Open X-Embodiment), turned out to be unusually effective in combination, and its publication catalysed a wave of open-source VLA follow-ons throughout 2025.

## Background: Physical Intelligence

Physical Intelligence (often stylised "π" or "PI") is a San Francisco artificial-intelligence company founded in March 2024 to build foundation models for general-purpose robot control.[^9] Its co-founding team is dominated by veterans of academic robot-learning labs and of Google's robotics group: Karol Hausman (CEO; formerly Staff Research Scientist and Robot Manipulation Lead at Google Brain), Sergey Levine (Chief Scientist; associate professor at UC Berkeley), Chelsea Finn (Research Lead; assistant professor at Stanford), Brian Ichter (formerly Research Scientist at Google DeepMind), Quan Vuong, Adnan Esmail (engineering; previously Anduril and Tesla) and Lachy Groom (operations; previously product lead at Stripe).[^9][^10] Several of the team's senior members were authors of earlier influential robot-learning papers such as RT-1, RT-2, PaLM-E and Octo, providing direct continuity from Google-era VLA research into the company's product line.

Physical Intelligence emerged from stealth in March 2024 with a $70 million seed round and subsequently raised a $400 million Series A in November 2024 led by Jeff Bezos, OpenAI, Thrive Capital and Lux Capital at a roughly $2.4 billion valuation, then a $600 million Series B in November 2025 led by CapitalG at a $5.6 billion valuation, with NVIDIA participating through NVentures.[^9][^11] The company has positioned itself as a software-only "robot brain" provider rather than a robot manufacturer, training its models across third-party hardware including [ALOHA](aloha_robot)/[ALOHA 2](aloha_2) bimanual rigs, AgileX, Trossen and ARX arms, UR5e and Franka single-arm platforms, and Fibocom mobile bases. π₀ is the flagship public artefact of that strategy and was followed in 2025 by π₀.₅ and the experimental π∗₀.₆ ("pi-star-zero-point-six") policy, which adds reinforcement-learning fine-tuning via the RECAP (Reinforcement Learning with Experience and Corrections via Advantage-conditioned Policies) algorithm.[^12]

The choice to centre the company around a single foundation model rather than a portfolio of task-specific policies was deliberate, and reflects an explicit bet by the founding team that the same scaling hypothesis underlying large language models will eventually apply to robot manipulation: that performance and generality will both improve with more parameters, more diverse demonstrations and more compute, given the right architecture. The π₀ technical report is, in effect, the first concrete formulation of that bet from Physical Intelligence, and many of its design choices, particularly the decision to keep the action expert physically separate from the VLM and to use action chunking rather than per-step prediction, are framed as engineering compromises in service of that scaling argument.[^1][^2]

## Architecture

π₀ follows the now-standard recipe of pairing an internet-scale pretrained vision-language model with a robot-specific action head, but it differs from earlier VLAs in two key ways: (i) the action head is a *separate* transformer expert that runs alongside the VLM rather than sharing all of its weights, and (ii) it generates actions through flow matching rather than autoregressive discrete tokens.[^2]

### PaliGemma backbone

The VLM backbone is Google's [PaliGemma](paligemma), a 3-billion-parameter vision-language transformer combining a 400 M-parameter SigLIP image encoder with the 2.6 B-parameter Gemma decoder-only language model.[^2][^13] PaliGemma was selected because it is one of the smallest contemporary open VLMs with strong image-and-text capabilities, which keeps inference latency on a single GPU compatible with real-time control. The backbone is initialised from the publicly released PaliGemma weights, then trained jointly with the action expert during robot pretraining.[^2]

### Action expert

In parallel with the language tokens, π₀ injects a stream of **action tokens** and **state tokens** into the same transformer stack. These tokens are processed not by the original PaliGemma weights but by a separate set of 300 million additional parameters that are randomly initialised and trained from scratch. The Physical Intelligence team calls this set the "action expert"; architecturally it is a smaller transformer that shares the *attention* with the VLM tokens but maintains its own MLP and projection weights, so total model size is 3.3 billion parameters.[^2]
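The idea of "separate weights, shared attention" can be illustrated with a toy sketch. Everything here is an assumption made for illustration: features are single scalars, the weight values in `EXPERTS` are invented, and the attention mask is omitted. It is not the openpi implementation, only a picture of how tokens can be projected by different parameter sets and still attend over one mixed sequence.

```python
import math

# Hypothetical scalar "weights" standing in for full projection matrices.
EXPERTS = {
    "vlm":    {"wq": 0.5, "wk": 0.4,  "wv": 1.0},  # stands in for the PaliGemma weights
    "action": {"wq": 1.5, "wk": -0.3, "wv": 2.0},  # stands in for the action expert
}

def shared_attention(tokens):
    """tokens: list of (expert_name, scalar_feature); returns one output per token."""
    # Each token is projected with the weights of its own expert...
    q = [EXPERTS[e]["wq"] * x for e, x in tokens]
    k = [EXPERTS[e]["wk"] * x for e, x in tokens]
    v = [EXPERTS[e]["wv"] * x for e, x in tokens]
    outputs = []
    for qi in q:
        # ...but the softmax runs over the whole mixed sequence.
        scores = [qi * kj for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(sum(w * vj for w, vj in zip(weights, v)) / total)
    return outputs

tokens = [("vlm", 1.0), ("vlm", -0.5), ("action", 0.2), ("action", 0.8)]
out = shared_attention(tokens)
print(out)
```

Each output is a weighted average of value vectors produced by both parameter sets, which is the only point of interaction between the two "experts" in this sketch.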

At each inference step, the model receives one or more RGB images (typically three, from base and wrist cameras), a natural-language instruction, the current proprioceptive robot state, and a vector of Gaussian noise. The VLM tokens fully attend to one another with bidirectional attention, while state and action-noise tokens attend through a block-causal mask. This design preserves the bidirectional attention pattern of the original VLM (preventing catastrophic forgetting of internet pretraining) while letting actions and states form their own temporally structured sequence.[^2][^14]
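A block-causal mask of this kind can be sketched in a few lines. The sketch assumes three blocks (image and language tokens, then the state token, then the action-noise tokens) with full attention inside each block and no attention to later blocks. The block sizes are toy values chosen for readability, and the helper name `block_causal_mask` is hypothetical.

```python
def block_causal_mask(block_sizes):
    """Return mask[i][j] == True when token i may attend to token j."""
    # Label every token position with the index of the block it belongs to.
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(block_of)
    # A token sees its own block and all earlier blocks, never later ones.
    return [[block_of[j] <= block_of[i] for j in range(n)] for i in range(n)]

# Toy sizes (assumed): 4 image+language tokens, 1 state token, 3 action tokens.
mask = block_causal_mask([4, 1, 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid the first four rows stop after the fourth column, so the VLM tokens never see state or action tokens, while the action rows are fully filled.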

### Flow matching for action generation

Rather than discretising actions into bins and emitting them autoregressively (the approach used by RT-2 and OpenVLA), π₀ predicts a vector field that transports random Gaussian noise to a chunk of future actions, in the spirit of *flow matching* and rectified flow.[^2] At inference, the model starts from random noise of shape `H × A` (where `H` is the chunk length and `A` is the action dimension), runs roughly 10 integration steps using the predicted vector field, and outputs a smooth `H`-step action chunk. Because all `H` actions are produced in a single forward pass per integration step, the per-chunk latency is dominated by the small number of denoising steps, allowing chunk-rate inference roughly every 0.5-0.8 seconds and per-action control frequencies of up to 50 Hz when the chunk is consumed by an underlying controller.[^14]
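The sampling loop can be sketched with forward-Euler integration. This is a minimal illustration under stated assumptions: the learned network is replaced by a hand-written `toy_vector_field` that pulls every sample toward a fixed target chunk along a straight line, and the action dimension is 2 rather than a real robot's. Only the loop structure (start from noise, take 10 equal steps from τ = 0 to τ = 1) mirrors the description above.

```python
import math
import random

def toy_vector_field(x, tau, target):
    """Stand-in for the learned field: straight-line velocity reaching `target` at tau = 1."""
    return [[(t - xi) / (1.0 - tau) for xi, t in zip(x_row, t_row)]
            for x_row, t_row in zip(x, target)]

def sample_chunk(target, num_steps=10, seed=0):
    rng = random.Random(seed)
    horizon, action_dim = len(target), len(target[0])
    # Start from Gaussian noise of shape H x A.
    x = [[rng.gauss(0.0, 1.0) for _ in range(action_dim)] for _ in range(horizon)]
    delta = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * delta
        v = toy_vector_field(x, tau, target)
        # One Euler step updates all H actions at once.
        x = [[xi + delta * vi for xi, vi in zip(x_row, v_row)]
             for x_row, v_row in zip(x, v)]
    return x

H = 50
target = [[math.sin(h / 10.0), math.cos(h / 10.0)] for h in range(H)]
chunk = sample_chunk(target)
max_err = max(abs(a - b) for row, t_row in zip(chunk, target) for a, b in zip(row, t_row))
print(len(chunk), len(chunk[0]), max_err)
```

With the toy field the final Euler step lands on the target up to floating-point error; a trained model would instead condition the field on images, language and robot state.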

### Action chunking and cross-embodiment encoding

π₀ predicts an action chunk of length **H = 50** future timesteps at each invocation, an approach borrowed from earlier action-chunking work such as ACT and Diffusion Policy that smooths transitions and reduces compounding error.[^2] To unify heterogeneous robot platforms with different numbers of joints, gripper modalities and command modes, π₀ fixes the size of all state and action vectors to the dimension of the largest robot in the training mix (18 dimensions in the published configuration) and zero-pads the vectors of narrower platforms; the language prompt and visual inputs disambiguate which embodiment is currently in use.[^2]
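The padding scheme is simple enough to show directly. In this sketch the 7- and 14-dimensional commands are hypothetical example robots, the helper names are invented, and padding at the end of the vector is an assumption about layout; only the 18-dimensional target comes from the text above.

```python
MAX_DIM = 18  # largest state/action dimension quoted for the published configuration

def pad_action(action, max_dim=MAX_DIM):
    """Zero-pad a robot-specific action vector to the shared dimension."""
    if len(action) > max_dim:
        raise ValueError(f"action has {len(action)} dims, more than {max_dim}")
    return list(action) + [0.0] * (max_dim - len(action))

def unpad_action(padded, robot_dim):
    """Drop the padding again before sending the command to a specific robot."""
    return padded[:robot_dim]

single_arm = [0.1] * 7   # hypothetical 7-dimensional single-arm command
bimanual = [0.2] * 14    # hypothetical 14-dimensional bimanual command

padded_single = pad_action(single_arm)
padded_bimanual = pad_action(bimanual)
print(len(padded_single), len(padded_bimanual))
```

Because every platform then shares one tensor shape, batches from different robots can be mixed freely during training.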

A subtle but important consequence of chunking with flow matching is that the policy plans a full 50-step trajectory in one sampling pass, which spans one second of motion at 50 Hz (2.5 seconds at 20 Hz). Whereas autoregressive token-by-token decoding has to repeatedly recommit to a previous action prefix and is correspondingly sensitive to early mistakes, π₀ can re-sample the entire 50-step trajectory whenever the latest observation suggests a strategy change. The team reports that this property is critical for fine, dynamic behaviours such as catching a falling utensil or shaking out a tangled garment, which would be brittle if decoded one action at a time. It also means that two different rates describe the system: actions are issued to the robot at up to 50 Hz, while new observations enter the policy only when a chunk is regenerated, approximately every 0.5-0.8 seconds (about 1.25-2 Hz). Motion therefore stays smooth at the actuator level even though replanning happens far less often.[^14]
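The timing implied by these figures can be worked out with a little arithmetic. The pairing of the 0.5-second interval with 50 Hz robots and the 0.8-second interval with 20 Hz robots is an assumption made for this sketch, and `replan_schedule` is a hypothetical helper.

```python
def replan_schedule(control_hz, replan_interval_s, chunk_len=50):
    """Derive how much of each chunk is executed before a new one replaces it."""
    steps_executed = round(control_hz * replan_interval_s)
    return {
        "chunk_duration_s": chunk_len / control_hz,
        "steps_executed": steps_executed,
        "steps_discarded": chunk_len - steps_executed,
        "replan_rate_hz": 1.0 / replan_interval_s,
    }

fast = replan_schedule(control_hz=50, replan_interval_s=0.5)
slow = replan_schedule(control_hz=20, replan_interval_s=0.8)
print(fast)
print(slow)
```

Under these assumptions a 50 Hz robot executes 25 of the 50 predicted steps before replanning and a 20 Hz robot executes 16, so a large part of every chunk is discarded in favour of a fresher prediction.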

| Component | Parameters | Description |
|---|---|---|
| SigLIP visual encoder (within PaliGemma) | ~400 M | Frozen-then-fine-tuned image encoder, sigmoid-loss CLIP variant |
| Gemma decoder (within PaliGemma) | ~2.6 B | Decoder-only language model providing text-side processing |
| **PaliGemma VLM backbone (total)** | **~3.0 B** | Pretrained on web image-text data, fine-tuned on robot data |
| Action expert (separate MLP / projection weights) | ~300 M | Randomly initialised, processes state/action/noise tokens |
| **Total π₀** | **~3.3 B** | Supports control at up to 50 Hz through action chunking |
| Action chunk length `H` | 50 timesteps | Generated per inference call (about 10 integration steps) |
| Control frequency | up to 50 Hz | 20 Hz on slower UR5e/Franka setups |
| Inference time (3 cameras, RTX 4090) | ~73 ms / chunk | ~10 flow-matching integration steps |

There is also a smaller **π₀-small** variant of approximately 470 M parameters that omits PaliGemma initialisation; it is used in the paper for ablations isolating the contribution of internet-scale pretraining.[^2]

## Training data

Pretraining π₀ to behave as a generalist required a robot dataset large and diverse enough to be reminiscent of internet-scale text corpora. Physical Intelligence assembled two such pools.

### Proprietary cross-embodiment π dataset

The bulk of the data is an in-house dataset collected by company teleoperators on seven robot configurations across approximately 68 tasks: a single-arm UR5e, a bimanual UR5e, a single-arm Franka, a bimanual Trossen, a bimanual AgileX Arx, a mobile bimanual Trossen and a mobile bimanual Fibocom platform.[^2] The full corpus comprises roughly **903 million timesteps**, equivalent to around **10 000 hours** of teleoperated robot experience, making it by some margin the largest robot manipulation dataset assembled in 2024.[^14] Tasks include cloth folding, table bussing, grocery bagging, box assembly, plug insertion, food packing, drawer manipulation, dish loading and a long tail of household manipulation behaviours.
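As a sanity check, the timestep and hour figures can be related through the control rates quoted elsewhere in this article (20 Hz and 50 Hz). The calculation below is only arithmetic on those published numbers; the actual per-robot split of the data is not given here.

```python
TIMESTEPS = 903_000_000   # timesteps in the proprietary corpus
REPORTED_HOURS = 10_000   # approximate duration quoted for the same corpus

def hours_at(rate_hz, timesteps=TIMESTEPS):
    """Hours of data if every timestep had been recorded at `rate_hz`."""
    return timesteps / rate_hz / 3600.0

hours_if_all_50hz = hours_at(50)   # about 5 017 hours
hours_if_all_20hz = hours_at(20)   # about 12 542 hours
implied_mean_rate_hz = TIMESTEPS / (REPORTED_HOURS * 3600.0)  # about 25 Hz

print(round(hours_if_all_50hz), round(hours_if_all_20hz), round(implied_mean_rate_hz, 1))
```

The reported 10 000 hours falls between the two bounds, which is what one would expect from a mixture of 20 Hz and 50 Hz platforms.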

### Open X-Embodiment

The proprietary data is mixed with the public [Open X-Embodiment](open_x_embodiment) (OXE) dataset, the 2023 community release aggregating over **1 million trajectories** from **22 distinct robot embodiments** across 21 research institutions, of which π₀ uses a curated subset.[^2] OXE provides additional embodiment diversity (especially for single-arm robots not represented in the proprietary mix) and approximately 90 million additional timesteps; mixing weights are tuned to balance high- and low-quality demonstrations.[^14]

### Two-stage training

The model is trained in two stages:

1. **Pretraining.** π₀ is trained on the full mixture (proprietary corpus + OXE) for hundreds of thousands of steps with the flow-matching objective. The resulting base model is a generalist that can be prompted zero-shot for tasks resembling the training distribution. The flagship "full" π₀ checkpoint is trained for approximately 700 000 steps, with a "π₀-parity" ablation trained for 160 000 steps to match the compute used by some baselines.[^2]
2. **Post-training.** For specific deployments, the base model is fine-tuned on a smaller curated, high-quality dataset for the target task (typically 1-20 hours of additional demonstrations). Long-horizon tasks (laundry folding, box assembly) use additional supervision from a high-level VLM planner that issues mid-level sub-task instructions to the policy, an architecture similar to the SayCan/PaLM-E hierarchies.[^1][^2]

This decoupling of broad pretraining and task-specific post-training mirrors the practice of language-model finetuning and is, in the company's framing, one of the key arguments for VLAs over per-task imitation policies.[^1]

A practical complication of building such a heterogeneous dataset is that demonstrations vary widely in quality. Even within a single platform, some episodes are collected by an experienced operator under good lighting and clear instructions, while others contain noisy teleoperation, recovery from grasp failures, dropped objects or partial task completion. Rather than discarding lower-quality data outright, the π₀ team mixes it with carefully weighted batch sampling so that pretraining sees the full distribution but post-training is dominated by high-quality, in-domain examples. This loosely parallels the "curriculum" strategy used in modern language-model finetuning, where instruction-tuning data is filtered for quality even as the base pretraining corpus tolerates noise.[^2][^14]
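One common way to implement weighted batch sampling over unevenly sized data sources is to raise each source's size to a power below one, which down-weights the largest sources without discarding the smaller ones. The sketch below illustrates that general technique only; the exponent `alpha`, the source names and the sizes are all hypothetical, and the text above does not state which exact rule π₀ uses.

```python
import random

def mixture_weights(sizes, alpha):
    """Sampling probability proportional to size**alpha (alpha < 1 flattens the mix)."""
    raw = {name: n ** alpha for name, n in sizes.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

def sample_batch_sources(weights, batch_size, seed=0):
    """Pick which data source each example in a batch is drawn from."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=batch_size)

# Hypothetical per-source timestep counts.
sizes = {"laundry": 1_000_000, "bussing": 100_000, "plug_insertion": 10_000}
weights = mixture_weights(sizes, alpha=0.5)
batch = sample_batch_sources(weights, batch_size=32)
print(weights)
```

Compared with sampling in proportion to raw size, the smallest source is drawn noticeably more often and the largest one less often, which is the balancing effect described above.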

## Demonstrated capabilities

The accompanying blog post and videos demonstrate π₀ executing several long-horizon, dexterous behaviours that prior VLAs could not sustain:[^1]

- **Laundry folding.** The bimanual mobile platform pulls one item at a time from a tangled basket of mixed clothing, flattens it on a counter and produces a folded stack. Single attempts run for several minutes; the model recovers gracefully from grasp slips and folds going wrong.
- **Table bussing.** π₀ clears a dining table, sorting plates, cutlery and trash into the correct bins; mid-task it exhibits emergent stacking strategies (e.g. nesting bowls) that were not explicitly demonstrated.
- **Box assembly.** Multi-stage cardboard folding and tucking, requiring precise bimanual coordination.
- **Grocery bagging.** Sorting heterogeneous items by fragility and density into shopping bags.
- **Plug insertion and cable routing.** Fine-motor contact-rich manipulation tasks.
- **Toast retrieval and dish loading.** Kitchen-style household chores.

Several of these behaviours run for more than 100 seconds end-to-end and in some cases approach 10 minutes for full clothing folding episodes, an order-of-magnitude jump in temporal horizon over what was previously reported for generalist policies.[^1]

## Comparison to baselines

The π₀ paper presents two main evaluations: (i) zero-shot performance on tasks drawn from the pretraining distribution, and (ii) fine-tuned performance on novel downstream tasks. In both settings, π₀ outperforms strong contemporaneous baselines: the 7-billion-parameter [OpenVLA](openvla), the 93-million-parameter Octo diffusion policy, the [Diffusion Policy](diffusion_policy) of Chi et al. (2023), and the per-task imitation baselines ACT and BC (behaviour cloning).[^2]

| Model | Backbone / family | Params | Action representation | Notes |
|---|---|---|---|---|
| π₀ | PaliGemma + action expert | 3.3 B | Flow-matching action chunks, 50 Hz | Generalist VLA, multi-embodiment |
| π₀-small | Custom transformer | 470 M | Flow-matching | Ablation without VLM init |
| π₀-FAST | PaliGemma + autoregressive head | 3 B+ | FAST discrete tokens, autoregressive | Up to 5× faster training[^6] |
| π₀.₅ | π₀ + knowledge insulation | 3 B+ | Hybrid (text plan + flow actions) | Open-world generalisation[^7] |
| [OpenVLA](openvla) | Llama-2-7B + DINOv2/SigLIP | 7 B | Discretised action bins (autoregressive) | Open-weights generalist (Stanford/Berkeley, 2024) |
| [RT-2](rt_2) | PaLI-X / PaLM-E | 55 B | Discretised action tokens | Google DeepMind, 2023 |
| Octo | Custom transformer | 93 M | Diffusion head | Open-weights generalist (Berkeley, 2024) |
| Diffusion Policy | UNet / Transformer | ~100 M | Continuous diffusion | Single-task imitation baseline |
| ACT | Conditional VAE | ~80 M | Direct action regression | Single-task imitation baseline |

On the four "seen" tasks reported in the paper (shirt folding, table bussing, grocery bagging, removing toast from a toaster), π₀ achieves an average normalised success score of roughly **0.8** out of 1.0, against approximately **0.35** for the strongest baseline, Diffusion Policy. On the bowl-stacking subtask in particular, π₀ scores near 1.0 while OpenVLA and Octo each score below 0.1.[^14][^15] Across language-conditioned tasks π₀ also follows mid-level instructions (for example, "pick up the green block and place it in the brown bowl") substantially more reliably than the baselines.[^1] On unseen but related downstream fine-tuning tasks, π₀ reaches 40-60 % success in zero-shot rollouts and 80-95 % after small amounts of task-specific fine-tuning, again above the baselines.[^15]

A robustness caveat is that, in independent third-party evaluations (notably the "π₀ in the Wild" study by the Penn PAL Lab), the gap to baselines narrows on tasks involving distractor objects, novel camera positions and unusual lighting, indicating that part of π₀'s advantage stems from in-distribution coverage of its proprietary dataset rather than purely from architecture.[^16] This motivated the follow-on π₀.₅ release.

## Open-source release: openpi

On **4 February 2025** Physical Intelligence released code and weights for π₀ in the public **openpi** repository at `github.com/Physical-Intelligence/openpi`, under the **Apache 2.0** licence (with an additional Gemma licence applying to the PaliGemma-derived weights).[^4][^5] The repository was an unusually open release for a frontier robot model and quickly became a community reference implementation, with downstream PyTorch ports (notably HuggingFace's integration into [LeRobot](lerobot)) appearing within days.[^17]

The initial release included:

- Base π₀ checkpoints pretrained on the 10 000+ hours of cross-embodiment data;
- "Expert" fine-tuned checkpoints for ALOHA (towel folding, food scooping, tupperware unpacking) and the DROID Franka setup (pen uncapping and similar tasks);
- Inference and policy-server code for running the model on real robots;
- A full JAX training stack with mixed-precision and FSDP support;
- Example fine-tuning scripts for the LIBERO simulation benchmark and for user-supplied datasets.

Reported hardware requirements are modest by VLA standards: inference needs more than 8 GB of GPU memory; LoRA fine-tuning needs roughly 22 GB or more (an RTX 4090 suffices); full fine-tuning needs 70+ GB and is typically performed on A100 or H100 GPUs.[^5] In Physical Intelligence's own experiments, only **1-20 hours** of task-specific data are needed to fine-tune the base model to a new manipulation task, which represents a step-change reduction relative to the hundreds of hours typically required to train a single-task imitation policy from scratch.[^4]

Subsequent updates to the repository during 2025 added:

- π₀-FAST checkpoints, building on the FAST tokenizer published in January 2025;[^6]
- π₀.₅ checkpoints (September 2025, following the model's April 2025 announcement) and training instructions on the full DROID dataset;[^7]
- A PyTorch implementation (September 2025) validated for π₀ and π₀.₅ inference and fine-tuning, with feature parity gradually catching up with the JAX version.[^4]

## π₀-FAST

In January 2025 Physical Intelligence published the paper *FAST: Efficient Action Tokenization for Vision-Language-Action Models* (Pertsch, Stachowicz, Ichter, Driess et al., arXiv:2501.09747), introducing **FAST** (Frequency-space Action Sequence Tokenization).[^6] Standard per-dimension binning of robot actions, the discretisation used by RT-2 and OpenVLA, was shown to fail on high-frequency dexterous tasks because consecutive action vectors are highly correlated and binning each independently produces an enormous redundant token stream. FAST instead applies a per-dimension Discrete Cosine Transform to action chunks, quantises the coefficients by scaling and rounding (which zeroes most low-magnitude high-frequency terms) and applies Byte-Pair Encoding to the resulting integer sequences. The quantisation step is lossy while the BPE step is lossless, and the result is a compact, almost universal tokenisation governed by only two hyperparameters (a scaling coefficient and a BPE vocabulary size).
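The frequency-space step can be illustrated on a single action dimension. This is a simplified sketch, not the released tokenizer: it uses a hand-written orthonormal DCT, an arbitrary scaling coefficient of 10, a synthetic sine-wave trajectory, and it leaves out the Byte-Pair Encoding stage entirely. It shows why a smooth trajectory collapses to a few non-zero integers and that rounding discards a small amount of information.

```python
import math

def _scale(k, n):
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    return [_scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, x in enumerate(signal))
            for k in range(n)]

def idct(coeffs):
    """Inverse of the orthonormal DCT-II above."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]

def quantise(coeffs, scale):
    # Scaling then rounding maps small coefficients to zero.
    return [round(c * scale) for c in coeffs]

def dequantise(tokens, scale):
    return [t / scale for t in tokens]

H = 50
SCALE = 10  # hypothetical scaling coefficient
trajectory = [math.sin(2 * math.pi * t / H) for t in range(H)]  # smooth toy joint trajectory

tokens = quantise(dct(trajectory), SCALE)
reconstruction = idct(dequantise(tokens, SCALE))

num_zero = sum(1 for t in tokens if t == 0)
max_err = max(abs(a - b) for a, b in zip(trajectory, reconstruction))
print(num_zero, "of", H, "coefficients are zero; max reconstruction error", max_err)
```

Most of the 50 integers come out as zero, which is the redundancy that a subsequent compression stage such as BPE can exploit, and a larger scaling coefficient would trade more tokens for a smaller reconstruction error.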

**π₀-FAST** combines this tokenizer with the same PaliGemma backbone as π₀, but predicts actions *autoregressively* instead of via flow matching. The training data and overall architecture are otherwise comparable. The principal benefit is training efficiency: π₀-FAST reaches similar performance to flow-matching π₀ while training **up to 5× faster** on the same data, because each example contributes more tokens per gradient step and autoregressive prediction is well-supported by existing language-model training kernels.[^6] A pretrained universal tokenizer, **FAST+**, trained on 1 million action sequences from single-arm, bimanual and mobile platforms, is released as a black-box tool and is the default for downstream π₀-FAST users.[^6]

π₀-FAST checkpoints are included in openpi from February 2025 onwards, alongside the original π₀.[^4]

## π₀.₅

The most prominent follow-up model, **π₀.₅** (often written *pi-zero-point-five* or *pi0.5*), was announced on **22 April 2025** in the paper *π₀.₅: a Vision-Language-Action Model with Open-World Generalization* (arXiv:2504.16054).[^7][^8] Its principal motivation is open-world generalisation: π₀ achieves strong performance in environments resembling its training distribution but degrades on entirely novel homes, kitchens and offices. π₀.₅ targets this gap with three main changes:

- **Heterogeneous co-training.** Beyond the original cross-embodiment robot data, π₀.₅ co-trains on object-detection data, image-caption data, web visual-question-answering data and "semantic subtask" annotations describing high-level steps of long-horizon tasks. Ablations show that web data contributes the largest improvement on out-of-distribution object recognition, while cross-embodiment robot data dominates on cross-embodiment performance.[^7]
- **Hybrid action decoding.** π₀.₅ retains flow-matching continuous action prediction but augments it with autoregressive discrete-token decoding of high-level "chain-of-thought" plans expressed in natural language. The model first emits a textual sub-goal (e.g. "pick up the cup and place it in the sink") and then generates the corresponding flow-matching action chunk, in a hierarchical SayCan-like manner but with a single policy.[^7]
- **Knowledge insulation.** A training scheme intended to prevent robot data from washing out semantic knowledge acquired during web pretraining, by routing different data sources through different sub-paths of the model.[^7][^4]

The released demos show a mobile bimanual robot cleaning up entirely unseen kitchens and bedrooms, putting dishes in sinks, making beds, organising clothes in laundry baskets and wiping spills with sponges. The team reports an out-of-distribution success rate of roughly **94 %** at following language commands, with performance approaching in-distribution baselines after training on around 100 distinct home environments.[^7] π₀.₅ checkpoints are included in openpi from September 2025.

A further iteration, **π∗₀.₆**, was posted to arXiv in November 2025 (*π∗₀.₆: a VLA That Learns From Experience*, arXiv:2511.14759), adding the RECAP reinforcement-learning algorithm to allow continued improvement from real-world deployment data, but is outside the scope of this article.[^12]

## Significance

π₀'s significance lies less in any single architectural component and more in the way it crystallised several converging trends in robot learning into a coherent, openly available system.

- **Establishing flow matching as a viable action head.** Earlier generalist VLAs were dominated by autoregressive token discretisation ([RT-2](rt_2), [OpenVLA](openvla)) or diffusion ([Octo](octo), Diffusion Policy). π₀ showed that flow matching, with its smaller number of integration steps and continuous outputs, scales to billion-parameter models and delivers smooth, high-frequency control. Subsequent models, including the open-weights [SmolVLA](smolvla) and Figure AI's [Helix](helix_vla), adopted similar dual-branch flow-matching designs.
- **Validating cross-embodiment pretraining at scale.** By absorbing seven proprietary embodiments plus Open X-Embodiment into one model, π₀ provided the strongest evidence yet that a single policy can transfer across hardware platforms when given a sufficiently varied dataset. This argument has been central to subsequent foundation-model robot policies including NVIDIA's [GR00T N1](groot_n1) / [Isaac GR00T](isaac_gr00t) and DeepMind's Gemini Robotics line.
- **Opening frontier robot weights.** The openpi release on 4 February 2025 was the first time a frontier-scale generalist robot policy was published under a permissive licence with full training code; it has become a default starting point for academic VLA fine-tuning and a teaching reference for the field, paralleling the role of Llama and Mistral in language models.[^4]
- **Setting demonstration expectations.** The visual record of π₀ folding laundry, bussing tables and assembling boxes raised the perceived state-of-the-art for "what a VLA can do" and reshaped fundraising expectations for the entire generalist-robotics sector, contributing to Physical Intelligence's own valuation trajectory from $2.4 B (November 2024) to $5.6 B (November 2025).[^9]

Limitations are also widely acknowledged. Independent third-party evaluations have found that π₀'s zero-shot success on out-of-distribution objects and environments degrades substantially relative to in-distribution rates, and that long-horizon stability still depends on careful prompting and high-quality post-training data.[^16] These observations motivated the π₀.₅ generalisation work and the π∗₀.₆ continual-learning approach. They also frame π₀ less as a finished product than as the first widely available "GPT-2 moment" for generalist robot policies.[^4]

## See also

- [Physical Intelligence](physical_intelligence)
- [Vision-language-action model](vision_language_action_model)
- [PaliGemma](paligemma)
- [Flow matching](flow_matching)
- [OpenVLA](openvla)
- [RT-2](rt_2)
- [Octo](octo)
- [Diffusion Policy](diffusion_policy)
- [Helix VLA](helix_vla)
- [SmolVLA](smolvla)
- [GR00T N1](groot_n1) / [Isaac GR00T](isaac_gr00t)
- [Open X-Embodiment](open_x_embodiment)
- [LeRobot](lerobot)
- [ALOHA robot](aloha_robot) / [ALOHA 2](aloha_2)
- [World model](world_model)

## References

[^1]: Physical Intelligence. "π₀: Our First Generalist Policy". Blog post, 31 October 2024. https://www.physicalintelligence.company/blog/pi0 (and mirrored at https://www.pi.website/blog/pi0). Accessed 2026-05-20.

[^2]: Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., Levine, S., Li-Bell, A., Mothukuri, M., Nair, S., Pertsch, K., Shi, L. X., Tanner, J., Vuong, Q., Walling, A., Wang, H., Zhilinsky, U. "π₀: A Vision-Language-Action Flow Model for General Robot Control". arXiv:2410.24164, v1 31 October 2024, v4 8 January 2026. https://arxiv.org/abs/2410.24164. Accessed 2026-05-20.

[^3]: PaliGemma Team. "PaliGemma: A versatile 3B VLM for transfer". arXiv:2407.07726, July 2024. https://arxiv.org/abs/2407.07726. Accessed 2026-05-20.

[^4]: Physical Intelligence. "openpi" GitHub repository. https://github.com/Physical-Intelligence/openpi. Accessed 2026-05-20.

[^5]: Physical Intelligence. "Open Sourcing π₀". Blog post, 4 February 2025. https://www.pi.website/blog/openpi. Accessed 2026-05-20.

[^6]: Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C., Levine, S. "FAST: Efficient Action Tokenization for Vision-Language-Action Models". arXiv:2501.09747, 16 January 2025. https://arxiv.org/abs/2501.09747. Accessed 2026-05-20.

[^7]: Physical Intelligence. "π₀.₅: a VLA with Open-World Generalization". Blog post, 22 April 2025. https://www.pi.website/blog/pi05. Accessed 2026-05-20.

[^8]: Physical Intelligence (Pertsch et al.). "π₀.₅: a Vision-Language-Action Model with Open-World Generalization". arXiv:2504.16054, April 2025. https://arxiv.org/abs/2504.16054. Accessed 2026-05-20.

[^9]: Crunchbase / Sequoia Capital / EMCAP portfolio entries on Physical Intelligence. https://sequoiacap.com/companies/physical-intelligence/ and https://www.emcap.com/portfolio/physical-intelligence. Accessed 2026-05-20.

[^10]: Hausman, K. "Why Robots Still Struggle With Simple Tasks (And What Might Finally Change That)". Interview, *The Generalist*. https://www.generalist.com/p/karol-hausman-physical-intelligence. Accessed 2026-05-20.

[^11]: Humanoids Daily. "Physical Intelligence Secures $600 Million to Build a Universal Robot Brain, Hitting $5.6 Billion Valuation". November 2025. https://www.humanoidsdaily.com/news/physical-intelligence-secures-600-million-to-build-a-universal-robot-brain-hitting-5-6-billion-valuation. Accessed 2026-05-20.

[^12]: Physical Intelligence. "π∗₀.₆: a VLA That Learns From Experience". arXiv:2511.14759, November 2025. https://arxiv.org/abs/2511.14759. Accessed 2026-05-20.

[^13]: Beyer, L., Steiner, A., Pinto, A. S., et al. "PaliGemma: A versatile 3B VLM for transfer". Google DeepMind, 2024. https://arxiv.org/pdf/2407.07726. Accessed 2026-05-20.

[^14]: Cloderic. "Notes on: 'π₀: A Vision-Language-Action Flow Model for General Robot Control'". 27 February 2025. https://www.cloderic.com/content/2025-02-27-notes-on-pi0. Accessed 2026-05-20.

[^15]: HuggingFace LeRobot Team. "π₀ and π₀-FAST: Vision-Language-Action Models for General Robot Control". HuggingFace blog. https://huggingface.co/blog/pi0. Accessed 2026-05-20.

[^16]: Penn PAL Lab. "Evaluating π₀ in the Wild: Strengths, Problems, and the Future of Generalist Robot Policies". 2025. https://penn-pal-lab.github.io/Pi0-Experiment-in-the-Wild/. Accessed 2026-05-20.

[^17]: HuggingFace. "π₀ (Pi0)" documentation in LeRobot. https://huggingface.co/docs/lerobot/pi0. Accessed 2026-05-20.

