π0.5 (also written pi0.5 or π₀.₅) is a vision-language-action model developed by Physical Intelligence and released on April 22, 2025. Building directly on the π0 architecture, π0.5 addresses one of the most persistent obstacles in embodied AI: getting a robot to operate reliably in environments it has never seen before. Where most VLA models are evaluated in settings that closely resemble their training data, π0.5 was tested in entirely new homes, performing complex domestic tasks such as cleaning kitchens and tidying bedrooms without prior exposure to those spaces.
The paper describing the model, "π0.5: a Vision-Language-Action Model with Open-World Generalization," was published simultaneously as an arXiv preprint (arXiv:2504.16054) and as a Physical Intelligence blog post on April 22, 2025. It subsequently appeared in the Proceedings of The 9th Conference on Robot Learning (CoRL 2025), PMLR Volume 305, pages 17 to 40. The paper carries 35 listed authors from Physical Intelligence, including Kevin Black, Noah Brown, Chelsea Finn, Karol Hausman, Brian Ichter, Sergey Levine, Allen Z. Ren, and Homer Walke, among others.
The central technical contribution is a co-training framework that mixes six heterogeneous data categories: mobile manipulator household demonstrations, non-mobile multi-environment robot data, cross-embodiment laboratory recordings, high-level semantic subtask annotations, internet-scale web data, and verbal instruction demonstrations. In experiments, π0.5 approached the performance of models trained directly on test-environment data when evaluated in three San Francisco rental homes that the development team had never visited during data collection.
Physical Intelligence (abbreviated as π or pi) is a San Francisco robotics AI company founded in 2023. Its founders include Karol Hausman, Brian Ichter, Sergey Levine, and Chelsea Finn, researchers with backgrounds spanning Google Brain, Google DeepMind, UC Berkeley, and Stanford. The company focuses on developing foundation models for physical systems, pursuing the same scaling approach that produced general-purpose text and image models, applied to robot control. Physical Intelligence raised early funding from investors including Bezos Expeditions, OpenAI, Sequoia Capital, and Khosla Ventures.
The company's research strategy is to train a single large model on data from many different robot types and tasks, rather than building narrow task-specific policies. The hypothesis is that a model trained on diverse enough physical interaction data will develop representations that transfer across embodiments and environments, much as large language models trained on diverse text transfer across domains.
π0 was Physical Intelligence's first major model release, published in October 2024. The model combined a 3-billion-parameter PaliGemma vision-language backbone with a separately parameterized action expert. The action expert used flow matching, a technique from generative modeling that learns smooth vector fields to convert a Gaussian noise distribution into a target action distribution. During inference, the model runs 10 denoising steps to produce a 50-step action chunk covering one second of motor commands.
π0 demonstrated broad dexterity across tasks like laundry folding, box assembly, and table bussing. However, its generalization remained largely constrained to environments and object configurations seen during training. Evaluating it in genuinely new homes, with unfamiliar furniture arrangements and unseen objects, revealed substantial performance drops.
To extend generalization beyond the training distribution, Physical Intelligence pursued two parallel lines of follow-up work. One produced FAST (Frequency-space Action Sequence Tokenization), which compresses robot trajectories into compact discrete token sequences using the discrete cosine transform, enabling more efficient training on heterogeneous data. The other produced π0.5, which incorporated FAST tokens into a broader co-training framework explicitly designed for open-world generalization.
Generalization in robotics has a specific meaning distinct from its use in machine learning benchmarks. A robot that "generalizes" in the standard ML sense may succeed on held-out test objects from the same distribution as training objects. Open-world generalization refers to a harder condition: operating in physical spaces with different floor plans, furniture arrangements, lighting conditions, and object inventories from anything seen during training.
Most VLA papers published before 2025 evaluated robots in environments that closely matched their training settings. Even when test objects differed, the room, the table positions, and the camera angles were familiar. π0.5 defined its evaluation around a stricter criterion: the three test homes used for evaluation were locations the Physical Intelligence team had never entered during data collection. This framing shaped the paper's entire methodology.
π0.5 was announced via a blog post on the Physical Intelligence website on April 22, 2025, accompanied by the arXiv preprint and demonstration videos showing the robot operating in unfamiliar San Francisco rental homes. The blog framed the work as "a significant step forward toward truly generalizable physical intelligence," emphasizing the departure from prior VLA evaluations in training-matched settings.
The release generated substantial discussion in the robotics and machine learning communities. A Hacker News thread on the paper highlighted the counterintuitive 97.6% figure: the share of pretraining data drawn from sources other than the mobile manipulator. The model used to train a household mobile robot turned out to rely overwhelmingly on data from other robots, web images, and non-mobile manipulation setups, yet still generalized to mobile household tasks. Participants noted that this data-efficiency argument, namely that large amounts of on-robot household data are unnecessary if the model is trained on enough diverse data from other sources, was one of the paper's most practically significant claims.
The paper's acceptance to CoRL 2025, the premier conference for robot learning research, confirmed its standing in the field. Physical Intelligence had previously published π0 at CoRL 2024.
π0.5 is built on PaliGemma, Google's open-weight vision-language model. PaliGemma combines two pretrained components: SigLIP-So400m, a contrastive image encoder trained with a sigmoid loss on large image-text datasets, and Gemma-2B, a 2-billion-parameter decoder-only language model. These are joined by a learned linear projection that maps SigLIP image patch tokens into the Gemma token embedding space.
PaliGemma processes images at 224×224 resolution, generating 256 image tokens per image. The model can accept multiple images in a single context, which is how π0.5 handles the robot's four-camera input: the forward camera, the rear camera, and the two wrist cameras are all encoded as separate image token sequences concatenated into the context.
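A minimal sketch of how the four camera views could be assembled into a single token context, assuming a SigLIP-style encoder that maps each 224×224 image to 256 patch embeddings and a learned linear projection into the language model's embedding space. The dimensions and function names below are illustrative assumptions, not the actual openpi implementation.

```python
import numpy as np

PATCHES_PER_IMAGE = 256
VISION_DIM, LM_DIM = 1152, 2048   # approximate SigLIP-So400m / Gemma-2B widths

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for the SigLIP encoder: (224, 224, 3) -> (256, VISION_DIM)."""
    return np.zeros((PATCHES_PER_IMAGE, VISION_DIM))

projection = np.zeros((VISION_DIM, LM_DIM))   # stand-in for the learned linear projection

def build_image_prefix(cameras: dict) -> np.ndarray:
    """Concatenate projected patch tokens from all cameras into one prefix."""
    chunks = [encode_image(img) @ projection for img in cameras.values()]
    return np.concatenate(chunks, axis=0)

views = {name: np.zeros((224, 224, 3)) for name in
         ("forward", "rear", "left_wrist", "right_wrist")}
prefix = build_image_prefix(views)
print(prefix.shape)  # (1024, 2048): 4 cameras x 256 tokens each
```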
The total parameter count for π0.5 is approximately 3.3 billion. PaliGemma accounts for roughly 3 billion of these parameters. A separate action expert module contributes approximately 300 million additional parameters initialized from scratch rather than from pretrained weights.
The action expert is a transformer-based module that takes PaliGemma's contextualized token representations and generates continuous robot action sequences. It processes a combination of image embeddings, language instruction tokens, robot proprioceptive state tokens, and, during post-training, semantic subtask prediction tokens.
The action expert uses adaptive RMSNorm (AdaRMS) to inject the flow-matching diffusion timestep into each transformer layer. A two-layer MLP maps the timestep scalar to a set of scale and shift parameters that modulate the layer normalization applied to each hidden state. This is the primary architectural change from π0's action expert, which fused the timestep directly with the noisy action input rather than conditioning each layer independently.
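A minimal sketch of the adaptive RMSNorm idea described above: a small two-layer MLP maps the flow-matching timestep to per-channel scale and shift parameters that modulate the normalized hidden state of each layer. The sinusoidal timestep embedding, layer sizes, and initialization are illustrative assumptions, not the published configuration.

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Plain RMSNorm without a learned gain."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

class AdaRMSNorm:
    """Adaptive RMSNorm sketch: timestep -> (scale, shift) via a 2-layer MLP."""

    def __init__(self, dim: int, time_dim: int = 64, rng=np.random.default_rng(0)):
        self.w1 = rng.normal(scale=0.02, size=(time_dim, 4 * time_dim))
        self.w2 = rng.normal(scale=0.02, size=(4 * time_dim, 2 * dim))
        self.time_dim, self.dim = time_dim, dim

    def _embed_time(self, t: float) -> np.ndarray:
        # Sinusoidal embedding of the scalar timestep t in [0, 1].
        freqs = np.exp(np.linspace(0.0, np.log(1000.0), self.time_dim // 2))
        return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

    def __call__(self, hidden: np.ndarray, t: float) -> np.ndarray:
        emb = np.maximum(self._embed_time(t) @ self.w1, 0.0) @ self.w2  # 2-layer MLP
        scale, shift = emb[: self.dim], emb[self.dim:]
        return rms_norm(hidden) * (1.0 + scale) + shift

layer_norm = AdaRMSNorm(dim=1024)
out = layer_norm(np.zeros((16, 1024)), t=0.3)   # (16, 1024)
```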
At inference time, the action expert runs 10 denoising steps using the flow matching procedure. Starting from Gaussian noise, each step moves the action representation along a learned vector field toward the target action distribution. The result is a 50-step action chunk covering one second of motion at 50 Hz, representing target end-effector positions for both arms and the mobile base.
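A minimal sketch of that sampling loop: start from Gaussian noise and integrate a learned vector field for 10 steps to obtain a 50-step action chunk. The `action_expert` function is a placeholder for the trained network, and the simple Euler integration scheme is an assumption about the solver.

```python
import numpy as np

HORIZON, ACTION_DIM, NUM_STEPS = 50, 19, 10   # one second of actions at 50 Hz

def action_expert(obs_context, noisy_actions, t):
    """Placeholder: predicts the velocity field v(a_t, t | context)."""
    return np.zeros_like(noisy_actions)

def sample_action_chunk(obs_context, rng=np.random.default_rng(0)) -> np.ndarray:
    actions = rng.normal(size=(HORIZON, ACTION_DIM))  # pure noise at t = 0
    dt = 1.0 / NUM_STEPS
    for i in range(NUM_STEPS):
        t = i * dt
        velocity = action_expert(obs_context, actions, t)
        actions = actions + dt * velocity              # Euler step along the vector field
    return actions                                     # (50, ACTION_DIM) action chunk
```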
π0.5 uses two distinct action representations in different training phases. During pretraining, robot demonstration data is encoded using FAST tokens. The FAST tokenizer applies a discrete cosine transform to an action trajectory, which concentrates the signal in a small number of low-frequency coefficients, then quantizes those coefficients into a discrete token vocabulary. FAST tokens allow robot demonstration data to be treated as discrete sequences comparable to text tokens, enabling joint training with language and web data using a single cross-entropy objective.
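A toy sketch of the frequency-space idea: transform each action dimension with a DCT, keep the leading (low-frequency) coefficients, and round them to integers. The real FAST tokenizer additionally compresses the quantized coefficients into a learned vocabulary; the cutoff and rounding scale below are made-up illustration values.

```python
import numpy as np

def dct_ii(x: np.ndarray) -> np.ndarray:
    """Orthonormal DCT-II along the first (time) axis; x has shape (horizon, dof)."""
    n = x.shape[0]
    k = np.arange(n)[:, None]             # frequency index
    t = np.arange(n)[None, :]             # time index
    basis = np.cos(np.pi * (t + 0.5) * k / n)
    coef = basis @ x
    coef[0] *= 1.0 / np.sqrt(n)
    coef[1:] *= np.sqrt(2.0 / n)
    return coef

def tokenize_chunk(actions: np.ndarray, keep: int = 8, scale: float = 50.0) -> np.ndarray:
    """Map a (horizon, dof) chunk to a flat sequence of integer tokens."""
    coef = dct_ii(actions)[:keep]                    # retain low-frequency terms only
    return np.round(coef * scale).astype(int).T.ravel()

chunk = np.sin(np.linspace(0, np.pi, 50))[:, None] * np.ones((1, 19)) * 0.1
tokens = tokenize_chunk(chunk)
print(tokens.shape)  # (19 * 8,) discrete tokens for one 50-step chunk
```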
During post-training, the action expert transitions from producing FAST token predictions to producing continuous flow matching outputs. The FAST token head is retained but the flow matching head becomes the primary output for robot control. This design means the model can leverage the computational efficiency of discrete token training at scale during pretraining, then recover continuous action precision during the post-training specialization phase.
One of the clearest architectural additions in π0.5 relative to π0 is the explicit high-level subtask prediction stage. Before generating motor action tokens, the model first produces a short natural-language description of the current semantic subtask (for example, "pick up the mug" or "open the drawer"). This text is generated by the Gemma language head using autoregressive decoding.
The predicted subtask text is then prepended to the action expert's input context before flow matching begins. This gives the action expert an explicit semantic anchor: rather than inferring the current goal entirely from visual observations, it receives a natural-language description of what it is about to do. The authors described this as a form of chain-of-thought reasoning applied to robot control.
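A sketch of this two-stage inference flow: decode a short subtask phrase first, then condition action generation on it. The stub class and its method names are illustrative placeholders, not the openpi API.

```python
import numpy as np

class Pi05Stub:
    def decode_subtask(self, observation, task_prompt: str) -> str:
        # Autoregressive text decoding by the Gemma language head in the real model.
        return "pick up the mug"

    def sample_actions(self, observation, conditioning_text: str) -> np.ndarray:
        # Flow-matching denoising (10 steps) in the real model.
        return np.zeros((50, 19))

def infer_step(model, observation, task_prompt: str):
    subtask = model.decode_subtask(observation, task_prompt)        # stage 1: subtask text
    conditioning = f"{task_prompt} | {subtask}"                     # add subtask to context
    actions = model.sample_actions(observation, conditioning)       # stage 2: action chunk
    return subtask, actions

subtask, chunk = infer_step(Pi05Stub(), observation=None, task_prompt="clean the kitchen")
print(subtask, chunk.shape)  # "pick up the mug" (50, 19)
```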
Ablation experiments showed that including subtask annotation data during training provided benefits even when the predicted subtask was not passed to the action expert at inference time. The training signal from subtask prediction improved the quality of the visual representations used for action generation, suggesting that semantic annotation is useful as a training objective independent of its role as an inference-time input.
Like π0, π0.5 uses a prefix-suffix attention structure. Image, language, and state tokens form the prefix, over which bidirectional (full) attention is applied. Action tokens form the suffix, over which causal (left-to-right) attention is applied, with each action token attending to all prefix tokens and to preceding action tokens. This structure ensures the model can integrate full context when generating actions while preserving the autoregressive structure needed for FAST token generation.
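A small sketch of what such a blockwise mask looks like: full attention within the prefix, suffix tokens attending to the entire prefix, and causal attention among suffix tokens. This is a generic construction consistent with the description above, not the exact implementation.

```python
import numpy as np

def build_attention_mask(num_prefix: int, num_suffix: int) -> np.ndarray:
    n = num_prefix + num_suffix
    mask = np.zeros((n, n), dtype=bool)            # True = attention allowed
    mask[:num_prefix, :num_prefix] = True          # full attention within prefix
    mask[num_prefix:, :num_prefix] = True          # suffix attends to all prefix tokens
    causal = np.tril(np.ones((num_suffix, num_suffix), dtype=bool))
    mask[num_prefix:, num_prefix:] = causal        # causal attention within suffix
    return mask

m = build_attention_mask(num_prefix=5, num_suffix=3)
print(m.astype(int))
```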
Several training-time choices distinguish π0.5 from π0. In π0, the PaliGemma backbone was kept frozen throughout training, with only the action expert parameters updated. In π0.5, the PaliGemma backbone is unfrozen during pretraining. The authors attributed this decision to the larger and more diverse training dataset: unfreezing the backbone lets the model adapt its visual and language representations to the robot-specific observation distribution without catastrophic forgetting, because the pretraining data includes enough web and language examples to maintain the backbone's general capabilities.
Physical Intelligence has also described a technique called knowledge insulation, in which the gradient from the action expert is blocked from flowing back into the VLM backbone. π0.5 applies a variant of this idea during post-training: after pretraining with the unfrozen backbone, the VLM is frozen again for the post-training phase so that the action expert can be fine-tuned on mobile manipulation data without distorting the broader representations built during pretraining.
The defining methodological contribution of π0.5 is training a single model jointly on six data categories that differ dramatically in their format, source robot, semantic content, and action space. The hypothesis driving this design is that different data sources teach complementary capabilities: mobile manipulation data teaches the target skill, non-mobile robot data teaches generalizable manipulation primitives, cross-embodiment data transfers motor knowledge across platforms, semantic annotation teaches the structure of household tasks, web data teaches object recognition and scene understanding, and verbal instruction data teaches language grounding.
Joint training on all six categories is not straightforward because the data types are heterogeneous: some examples include action sequences while others do not, some have bounding box annotations while others have only image-text pairs, and the robot embodiments have different observation and action spaces. The FAST tokenizer and the dual action representation solve part of this problem by providing a common discrete token format for action data. The remainder is handled through a shared token vocabulary and a multi-task loss that applies different output heads to different data types depending on what labels are available.
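A sketch of how such a multi-task loss might be dispatched over heterogeneous co-training examples, assuming each example carries only the labels its data category provides. The head names and example schema are illustrative, not the actual training code.

```python
def active_losses(example: dict) -> list[str]:
    """Return which output heads receive supervision for this example."""
    heads = []
    if "actions" in example:            # MM, ME, CE robot demonstrations
        heads.append("fast_action_tokens")
    if "subtask_label" in example:      # HL semantic annotations
        heads.append("subtask_text")
    if "caption" in example or "vqa" in example or "boxes" in example:  # WD web data
        heads.append("webdata_text")
    return heads

print(active_losses({"actions": "...", "subtask_label": "pick up the plate"}))
# ['fast_action_tokens', 'subtask_text']
```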
| Source | Abbreviation | Format | Scale |
|---|---|---|---|
| Mobile Manipulator data | MM | Video + joint states + actions | ~400 hours, ~100 homes |
| Multi-Environment non-mobile | ME | Video + joint states + actions | Multiple indoor environments |
| Cross-Embodiment laboratory | CE | Video + joint states + actions | OXE dataset + in-house recordings |
| High-Level subtask prediction | HL | Video + natural language subtask labels | Manually annotated |
| Web Data | WD | Images + captions, VQA pairs, object bounding boxes | Internet-scale |
| Verbal Instruction demonstrations | VI | Video + natural language instructions | Expert-collected, post-training only |
The most striking aspect of the data composition is the relative scale of non-MM sources. During pretraining, 97.6% of training examples come from sources other than mobile manipulator household data. Cross-embodiment recordings, multi-environment robot demonstrations, semantic annotations, and web images collectively dominate the training distribution. Despite this, the final model generalizes to mobile household manipulation.
The MM data was collected by Physical Intelligence's data collection team operating the mobile manipulator hardware in approximately 100 distinct household environments. Demonstrations were collected via teleoperation, with human operators remotely controlling the robot's base, torso, arms, and grippers while a researcher indicated the task objective. The target tasks were the four evaluation categories: dishes in sink, items in drawer, laundry basket, and make bed. Approximately 400 hours of MM demonstrations were collected across the roughly 100 homes.
Even with 100 training homes, the MM data represents a small fraction of the total training volume. The paper's key empirical finding is that this relatively modest collection of on-robot household data, combined with diverse data from other sources, is sufficient to achieve near-parity with oracle baselines trained directly on the test environments.
The CE data includes two components. The first is a subset of the Open X-Embodiment (OXE) dataset, a publicly available corpus assembled by Google and academic collaborators that aggregates robot manipulation demonstrations from over 20 institutions and 22 robot platforms. The second is in-house cross-embodiment data collected by Physical Intelligence across multiple robot configurations, including single-arm manipulators and bimanual setups used for the original π0 training.
The OXE and in-house CE data provide manipulation demonstrations across a wide range of object types, task structures, and robot kinematics. Because these demonstrations are mostly from laboratory settings rather than household environments, they do not directly teach household task completion. Instead, they appear to teach generalizable manipulation skills: how to grasp objects of different shapes, how to handle soft materials, how to operate drawers and doors.
The web data consists of internet-scale image understanding examples: image captioning pairs, visual question-answering examples, and object localization examples with bounding box annotations. These examples do not include robot actions. They are included in training as a non-robotic co-training signal.
The motivation for including web data is to prevent catastrophic forgetting of the PaliGemma backbone's visual recognition capabilities. A model trained exclusively on robot demonstration data may lose the ability to recognize object categories not represented in robot demonstrations. Since the test homes contained arbitrary household objects not present in training demonstrations, the web data was expected to be critical for out-of-distribution object recognition.
Ablation experiments confirmed this: removing web data produced the largest performance drop specifically for out-of-distribution object recognition tasks, while having a smaller effect on tasks involving familiar objects.
The HL data consists of robot observation frames or short video clips paired with manually written natural-language descriptions of the current semantic subtask. These descriptions are typically short phrases ("pick up the plate," "walk to the kitchen") rather than full sentences. Annotation was performed by the data collection team after the fact, labeling segments of existing robot demonstrations with semantic subtask descriptions.
The HL data trains the model's high-level language head: the capacity to generate a short text description of what the robot is about to do before generating actions. Including HL data also provides a richer supervision signal for the visual representations, since the model must learn to associate specific visual configurations with specific semantic subtask labels.
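For illustration, a single HL annotation record might look like the following, assuming a simple JSON-style schema. All field names and values are hypothetical; the paper does not publish the dataset format.

```python
hl_example = {
    "observation": "frames from a segment of an existing robot demonstration",
    "task_prompt": "clean up the kitchen",
    "subtask_label": "pick up the plate",   # short phrase, manually annotated
}
```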
Training proceeds in two sequential phases.
The pretraining phase runs for 280,000 gradient steps and uses all six data categories (MM, ME, CE, HL, WD, and partial VI). During this phase, all action data is represented using FAST tokens. The PaliGemma backbone is unfrozen so it can adapt to the robot-specific distribution while retaining general visual and language representations through the stabilizing influence of web and language data. The loss function is a weighted sum of cross-entropy over all discrete token outputs.
The post-training phase runs for an additional 80,000 gradient steps and focuses the model on mobile manipulation. This phase uses only MM, ME, HL, and VI data. The PaliGemma backbone is frozen to preserve the representations built during pretraining. The flow matching action head is introduced during this phase, and its loss is combined with the cross-entropy loss at a weight of α = 10.0. Post-training specializes the action expert for the continuous action precision required by mobile manipulation tasks.
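A minimal sketch of the post-training objective as described: a flow-matching regression loss for the action expert combined with the token cross-entropy loss at the stated weight α = 10.0. The linear interpolation path and mean-squared-error form are the standard flow-matching recipe, used here as an assumption; the example loss values are made up.

```python
import numpy as np

ALPHA = 10.0

def flow_matching_loss(predict_velocity, actions, rng=np.random.default_rng(0)):
    """actions: (horizon, dof) ground-truth chunk from a demonstration."""
    noise = rng.normal(size=actions.shape)
    t = rng.uniform()                                  # random timestep in [0, 1]
    noisy = (1.0 - t) * noise + t * actions            # linear interpolation path
    target_velocity = actions - noise                  # velocity of that path
    pred = predict_velocity(noisy, t)
    return np.mean((pred - target_velocity) ** 2)

def post_training_loss(token_ce_loss, fm_loss):
    return token_ce_loss + ALPHA * fm_loss             # weighted combination

fm = flow_matching_loss(lambda noisy, t: noisy * 0.0, np.zeros((50, 19)))
total = post_training_loss(token_ce_loss=2.3, fm_loss=fm)
```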
The term "open-world generalization" in π0.5's framing refers to performance in physical environments not represented in the training data. Prior VLA evaluations, including those for π0 and most VLAs published before 2025, tested robots in environments matched to or very close to training conditions. Even evaluation on novel test objects typically occurred in familiar rooms with familiar furniture. The few prior works that did evaluate in new environments generally did so with simple pick-and-place tasks using a single robot arm, not with long-horizon mobile manipulation tasks.
π0.5 was evaluated in three rental homes in San Francisco that the development team had never visited during data collection. The homes had different furniture, kitchen layouts, counter arrangements, clutter distributions, and object inventories from anything in the training set. The evaluation tasks required the robot to complete 10 to 15 minute sequences of subtasks, navigating the space, identifying relevant objects, and performing dexterous manipulation.
π0.5 was evaluated on two mobile manipulator platforms. Each robot has a wheeled holonomic base (allowing movement in any direction without turning), a torso lift, two 6-degree-of-freedom arms with parallel-jaw grippers, and four RGB cameras: one forward-facing camera mounted on the torso, one rear-facing camera, and one wrist-mounted camera on each arm. The combined state and action space has 18 to 19 degrees of freedom. Motor commands are tracked by a proportional-derivative controller at 50 Hz. Neural network inference runs on off-robot compute, with commands streamed to the robot over WiFi.
This hardware configuration is simpler than a humanoid robot but more capable than a fixed-base arm: the mobile base lets the robot navigate household spaces, the torso lift helps with varying counter and shelf heights, and the wrist cameras provide close-up views useful for grasping.
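An illustrative configuration summarizing the hardware description above; the values repeat figures stated in the text, the exact per-joint breakdown of the 18-19 degree-of-freedom action space is not spelled out here, and the field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MobileManipulatorConfig:
    cameras: tuple = ("forward", "rear", "left_wrist", "right_wrist")
    arms: int = 2
    arm_dof_each: int = 6
    gripper: str = "parallel_jaw"
    base: str = "wheeled_holonomic"
    torso_lift: bool = True
    total_action_dof: int = 19        # combined state/action space is 18-19 DoF
    control_rate_hz: int = 50         # PD controller tracking rate
    inference: str = "off_robot_over_wifi"

print(MobileManipulatorConfig())
```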
Four primary tasks structured the evaluation:
| Task | Description | Complexity |
|---|---|---|
| Dishes in sink | Locate dirty dishes distributed around the kitchen; transport each to the sink | Requires navigation, object search, repeated grasping |
| Items in drawer | Find scattered small items; sort and deposit in appropriate drawers | Requires fine manipulation and categorical sorting |
| Laundry basket | Collect clothing and textiles from floor and surfaces; deposit in basket | Requires handling deformable objects |
| Make bed | Smooth sheets, arrange pillows, tuck loose bedding | Requires whole-body manipulation of large soft objects |
Each task was evaluated over 10 trials per environment. The primary metric was task progress: the fraction of relevant items or subtasks successfully completed in each trial. A secondary metric tracked out-of-distribution generalization specifically to objects not present in any training demonstration.
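A brief sketch of the task progress metric as described: the fraction of relevant items or subtasks completed in a trial, averaged over trials. The example numbers are made up for illustration.

```python
def task_progress(completed: int, total: int) -> float:
    return completed / total

def mean_progress(trials):
    """trials: iterable of (completed, total) pairs, one per trial."""
    scores = [task_progress(c, t) for c, t in trials]
    return sum(scores) / len(scores)

print(mean_progress([(3, 4), (4, 4), (2, 4)]))  # 0.75 average progress
```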
Across the four tasks and three test homes, π0.5 achieved a 94% success rate on out-of-distribution object manipulation and an 83% to 86% success rate on in-distribution conditions, depending on the specific evaluation configuration. Performance approached that of an oracle baseline model trained directly on data from the test environments, demonstrating that the co-training methodology can produce nearly environment-matched performance without requiring data collection in the target environments.
The paper also reported a scaling curve: as the number of distinct training homes in the MM dataset increased from 3 to 104, task performance improved steadily and monotonically. At 3 training homes, generalization to new homes was poor. At approximately 100 training homes, performance approached oracle levels. This curve confirmed that environment diversity in training data was more important than total demonstration volume.
The paper reported ablations isolating each data source's contribution; the web-data and subtask-annotation ablations described above are two examples.
Physical Intelligence published demonstration videos alongside the paper showing π0.5 operating in the three previously unseen homes. The kitchen cleaning sequences showed the robot identifying scattered dishes on countertops and tables, grasping items one at a time, navigating to the sink, placing items in the sink, and then returning to find additional items. The bedroom sequences showed the robot picking up clothing from the floor, handling soft textiles, locating a laundry basket through visual search, and depositing items.
One demonstration sequence showed the robot's response to human interference. During a task, a person moved an object the robot had already handled. The robot perceived the changed state on its next visual scan and re-executed the relevant subtask, demonstrating closed-loop replanning rather than execution of a fixed predetermined sequence. The robot also handled spills using a sponge in one demonstration, a task requiring both tool use and surface contact estimation.
The blog post noted that the robot was observed to display "the flexibility and resourcefulness with which a person might approach a new challenge," adapting its approach based on the specific object arrangement it found in each room rather than following a memorized path through the space.
All demonstrations were conducted without prior teleoperation in the test homes. A setup visit took place before evaluation to install the robot's charging station and confirm that the space was accessible, but no robot demonstrations or data collection occurred in the test homes.
π0.5 extends π0 in several ways that reflect both architectural changes and methodological shifts in training:
| Property | π0 | π0.5 |
|---|---|---|
| Release date | October 2024 | April 22, 2025 |
| Base VLM | PaliGemma 3B (SigLIP + Gemma-2B) | PaliGemma 3B (SigLIP + Gemma-2B) |
| Total parameters | ~3.3B | ~3.3B |
| VLM frozen during training | Yes (always) | No during pretraining; yes during post-training |
| Action representation | Flow matching only | FAST tokens (pretraining) + flow matching (post-training) |
| High-level subtask prediction | No | Yes (text chain-of-thought before actions) |
| Web data in training | Limited | Yes (image captioning, VQA, object localization) |
| Training environments | Focused task sets | ~100 household environments |
| Open-world evaluation | No | Yes (3 unseen homes) |
| Primary evaluation tasks | Laundry folding, box assembly, table bussing | Kitchen cleaning, bedroom tidying, drawer organization |
| Primary focus | Task dexterity across known settings | Generalization to unknown settings |
In direct comparison experiments within the π0.5 paper, π0.5 substantially outperformed both the original π0 and a variant called π0-FAST+Flow (which added the FAST tokenizer and flow matching to π0 without the full co-training data mix and semantic subtask prediction). The gap was largest on out-of-distribution evaluation conditions, confirming that the improvement came from the co-training methodology rather than the architectural changes alone.
π0.5 appeared alongside several other significant VLA model releases in early 2025. Two of the most prominent were Helix from Figure AI and GR00T N1 from NVIDIA.
| Property | π0.5 | Helix (VLA model) | Isaac GR00T N1 |
|---|---|---|---|
| Developer | Physical Intelligence | Figure AI | NVIDIA |
| Release date | April 22, 2025 | February 2025 | March 2025 |
| Base model | PaliGemma 3B | 7B VLM (System 2) + visuomotor policy (System 1) | NVIDIA Eagle-2 VLM + Diffusion Transformer |
| Total parameters | ~3.3B | ~7B (S2) + S1 components | ~2.2B (public release) |
| Target platform | Wheeled mobile manipulator (two 6-DoF arms) | Figure humanoid robot (35 DoF) | Generalist humanoid robots |
| High-level control | Semantic subtask text prediction | System 2 VLM at 7 to 9 Hz | Reasoning module |
| Low-level control frequency | 50 Hz | 200 Hz | 120 Hz |
| Action generation method | Flow matching (10 steps) | Visuomotor policy outputting action chunks | Diffusion Transformer |
| Training data emphasis | Diverse environments, web data, cross-embodiment | ~500 hours teleoperation | 20,000+ hours egocentric human video + synthetic |
| Open-world evaluation setting | 3 unseen household environments | Household and logistics settings | Simulation benchmarks, multiple embodiments |
| Open weights | Partial (π0 base via openpi GitHub) | No | Yes (GR00T-N1-2B on Hugging Face) |
| Architectural decomposition | Unified transformer with dual action heads | Hard two-model split (S2 and S1 as separate networks) | Hard two-model split (System 2 and Diffusion Transformer) |
The architecturally deepest distinction is how each system handles the decomposition between semantic reasoning and motor control. Helix and GR00T N1 both implement a hard two-system split, with separate model components for high-level reasoning (running at lower frequency) and low-level motor control (running at higher frequency). π0.5 implements a softer decomposition: a single unified transformer produces both semantic subtask text and motor action tokens, using different output heads for different modalities but sharing weights throughout. Whether the unified or split architecture is ultimately more capable remains an open research question.
Another distinction concerns embodiment. Helix is purpose-built for the Figure humanoid robot, with 35 degrees of freedom including individual finger control. GR00T N1 targets humanoid robots broadly. π0.5 was built and evaluated on a wheeled mobile manipulator without humanoid legs, making direct performance comparisons across platforms difficult since the platforms have different capabilities and face different challenges.
In terms of data strategy, GR00T N1 placed heavy emphasis on synthetic data from the Isaac simulator and egocentric human video (the EgoScale dataset totaling over 20,000 hours). π0.5 did not use synthetic simulation data, relying instead on real robot demonstrations and web images. Helix trained on approximately 500 hours of human teleoperation data, much less than either competitor.
The π0.5 paper acknowledged several persistent failure modes and scope limitations.
The model has no persistent spatial map. It cannot see behind itself and has no memory of object positions across large displacements. Tasks requiring the robot to track where it placed objects after moving away, or to remember the state of areas not in the current field of view, can cause failures.
The high-level subtask predictor can be distracted by salient but irrelevant objects. A visually prominent item that has already been handled may re-attract the model's attention and cause repeated approach attempts. This is a symptom of the model lacking an explicit world state representation: it must infer the current state from raw visual observations at each timestep rather than consulting a maintained task progress record.
The system handles simple and direct commands reliably but was not evaluated on complex conditional or ambiguous instructions. Performance on instructions like "put the dishes away unless the dishwasher is already full" or "clean the kitchen when you have time" was not characterized.
π0.5 has no explicit failure recovery mechanism. If a grasp fails or an object is dropped, the model does not detect the failure and replan; it continues its current action sequence until the next high-level subtask boundary. Recovery from dropped objects depends on whether the next subtask prediction happens to address the dropped item.
All evaluation was conducted in standard residential environments with ordinary household objects. Performance on unusual furniture mechanisms (push-to-open cabinets, non-standard drawers), non-standard appliances, or living spaces that deviate from typical residential layouts was not characterized. Commercial kitchens, industrial settings, and outdoor environments were entirely outside the evaluation scope.
The evaluation was also limited in scale: three test homes with 10 trials per task is a relatively small sample for establishing robust performance claims. The homes were all in the same city and likely shared some demographic and architectural characteristics.
Physical Intelligence followed π0.5 with π0.6 (also written pi0.6 or π₀.₆), released in November 2025. The primary innovation in π0.6 was incorporating online reinforcement learning, allowing the robot to improve its performance through experience gathered during real-world deployments rather than relying solely on offline demonstration data.
π0.6 was reported to show substantial improvements over π0.5 on tasks like laundry folding and box assembly, which previously required task-specific fine-tuning with high-quality demonstrations to achieve non-zero success rates. The π0.6 model card noted that its training data composition was largely inherited from π0.5, with the reinforcement learning component providing the primary performance gain beyond the π0.5 baseline.
Physical Intelligence also released the openpi GitHub repository, which provided open-weight versions of the π0 base model and code for fine-tuning on custom tasks. The openpi repository became one of the most widely used open-source robotics model repositories in 2025.