π0.5 (also written pi0.5 or π₀.₅) is a vision-language-action model developed by Physical Intelligence and released on April 22, 2025. Building directly on the π0 architecture, π0.5 addresses one of the most persistent obstacles in embodied AI: getting a robot to operate reliably in environments it has never seen before. Where most VLA models are evaluated in settings that closely resemble their training data, π0.5 was tested in entirely new homes, performing complex domestic tasks such as cleaning kitchens and tidying bedrooms without prior exposure to those spaces.
The paper describing the model, "π0.5: a Vision-Language-Action Model with Open-World Generalization," was published simultaneously as an arXiv preprint (arXiv:2504.16054) and as a Physical Intelligence blog post on April 22, 2025. It subsequently appeared in the Proceedings of The 9th Conference on Robot Learning (CoRL 2025), PMLR Volume 305, pages 17 to 40. The paper carries 35 listed authors from Physical Intelligence, including Kevin Black, Noah Brown, Chelsea Finn, Karol Hausman, Brian Ichter, Sergey Levine, Allen Z. Ren, and Homer Walke, among others.
The central technical contribution is a co-training framework that mixes six heterogeneous data categories: mobile manipulator household demonstrations, non-mobile multi-environment robot data, cross-embodiment laboratory recordings, high-level semantic subtask annotations, internet-scale web data, and verbal instruction demonstrations. In experiments, π0.5 approached the performance of models trained directly on test-environment data when evaluated in three San Francisco rental homes that the development team had never visited during data collection.
Physical Intelligence (abbreviated as π or pi) is a San Francisco robotics AI company founded in 2023. Its founders include Karol Hausman, Brian Ichter, Sergey Levine, and Chelsea Finn, researchers with backgrounds spanning Google Brain, Google DeepMind, UC Berkeley, and Stanford. The company focuses on developing foundation models for physical systems, pursuing the same scaling approach that produced general-purpose text and image models, applied to robot control. Physical Intelligence raised early funding from investors including Bezos Expeditions, OpenAI, Sequoia Capital, and Khosla Ventures.
The company's research strategy is to train a single large model on data from many different robot types and tasks, rather than building narrow task-specific policies. The hypothesis is that a model trained on diverse enough physical interaction data will develop representations that transfer across embodiments and environments, much as large language models trained on diverse text transfer across domains.
π0 was Physical Intelligence's first major model release, published in October 2024. The model combined a 3-billion-parameter PaliGemma vision-language backbone with a separately parameterized action expert. The action expert used flow matching, a technique from generative modeling that learns smooth vector fields to convert a Gaussian noise distribution into a target action distribution. During inference, the model runs 10 denoising steps to produce a 50-step action chunk covering one second of motor commands.
π0 demonstrated broad dexterity across tasks like laundry folding, box assembly, and table bussing. However, its generalization remained largely constrained to environments and object configurations seen during training. Evaluating it in genuinely new homes, with unfamiliar furniture arrangements and unseen objects, revealed substantial performance drops.
To extend generalization beyond the training distribution, Physical Intelligence pursued two parallel lines of follow-up work. One produced FAST (Frequency-space Action Sequence Tokenization), which compresses robot trajectories into compact discrete token sequences using the discrete cosine transform, enabling more efficient training on heterogeneous data. The other produced π0.5, which incorporated FAST tokens into a broader co-training framework explicitly designed for open-world generalization.
Generalization in robotics has a specific meaning distinct from its use in machine learning benchmarks. A robot that "generalizes" in the standard ML sense may succeed on held-out test objects from the same distribution as training objects. Open-world generalization refers to a harder condition: operating in physical spaces with different floor plans, furniture arrangements, lighting conditions, and object inventories from anything seen during training.
Most VLA papers published before 2025 evaluated robots in environments that closely matched their training settings. Even when test objects differed, the room, the table positions, and the camera angles were familiar. π0.5 defined its evaluation around a stricter criterion: the three test homes used for evaluation were locations the Physical Intelligence team had never entered during data collection. This framing shaped the paper's entire methodology.
π0.5 was announced via a blog post on the Physical Intelligence website on April 22, 2025, accompanied by the arXiv preprint and demonstration videos showing the robot operating in unfamiliar San Francisco rental homes. The blog framed the work as "a significant step forward toward truly generalizable physical intelligence," emphasizing the departure from prior VLA evaluations in training-matched settings.
The release generated substantial discussion in the robotics and machine learning communities. A Hacker News thread on the paper highlighted the counterintuitive 97.6% figure: the share of pretraining data drawn from sources other than the mobile manipulator. The model used to train a household mobile robot turned out to rely overwhelmingly on data from other robots, web images, and non-mobile manipulation setups, yet still generalized to mobile household tasks. Participants noted that this data-efficiency argument, namely that large amounts of on-robot household data are unnecessary if the model is trained on enough diverse data from other sources, was one of the paper's most practically significant claims.
The paper's acceptance to CoRL 2025, the premier conference for robot learning research, confirmed its standing in the field. Physical Intelligence had previously published π0 at CoRL 2024.
π0.5 is built on PaliGemma, Google's open-weight vision-language model. PaliGemma combines two pretrained components: SigLIP-So400m, a contrastive image encoder trained with a sigmoid loss on large image-text datasets, and Gemma-2B, a 2-billion-parameter decoder-only language model. These are joined by a learned linear projection that maps SigLIP image patch tokens into the Gemma token embedding space.
PaliGemma processes images at 224×224 resolution, generating 256 image tokens per image. The model can accept multiple images in a single context, which is how π0.5 handles the robot's four-camera input: the forward camera, the rear camera, and the two wrist cameras are all encoded as separate image token sequences concatenated into the context.
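A minimal sketch of how the four camera views could be assembled into a single token context, assuming a SigLIP-style encoder that maps each 224×224 image to 256 patch embeddings and a learned linear projection into the language model's embedding space. The dimensions and function names below are illustrative assumptions, not the actual openpi implementation.

```python
import numpy as np

PATCHES_PER_IMAGE = 256
VISION_DIM, LM_DIM = 1152, 2048   # approximate SigLIP-So400m / Gemma-2B widths

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for the SigLIP encoder: (224, 224, 3) -> (256, VISION_DIM)."""
    return np.zeros((PATCHES_PER_IMAGE, VISION_DIM))

projection = np.zeros((VISION_DIM, LM_DIM))   # stand-in for the learned linear projection

def build_image_prefix(cameras: dict) -> np.ndarray:
    """Concatenate projected patch tokens from all cameras into one prefix."""
    chunks = [encode_image(img) @ projection for img in cameras.values()]
    return np.concatenate(chunks, axis=0)

views = {name: np.zeros((224, 224, 3)) for name in
         ("forward", "rear", "left_wrist", "right_wrist")}
prefix = build_image_prefix(views)
print(prefix.shape)  # (1024, 2048): 4 cameras x 256 tokens each
```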
The total parameter count for π0.5 is approximately 3.3 billion. PaliGemma accounts for roughly 3 billion of these parameters. A separate action expert module contributes approximately 300 million additional parameters initialized from scratch rather than from pretrained weights.
The action expert is a transformer-based module that takes PaliGemma's contextualized token representations and generates continuous robot action sequences. It processes a combination of image embeddings, language instruction tokens, robot proprioceptive state tokens, and, during post-training, semantic subtask prediction tokens.
The action expert uses adaptive RMSNorm (AdaRMS) to inject the flow-matching diffusion timestep into each transformer layer. A two-layer MLP maps the timestep scalar to a set of scale and shift parameters that modulate the layer normalization applied to each hidden state. This is the primary architectural change from π0's action expert, which fused the timestep directly with the noisy action input rather than conditioning each layer independently.
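A minimal sketch of the adaptive RMSNorm idea described above: a small two-layer MLP maps the flow-matching timestep to per-channel scale and shift parameters that modulate the normalized hidden state of each layer. The sinusoidal timestep embedding, layer sizes, and initialization are illustrative assumptions, not the published configuration.

```python
import numpy as np

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Plain RMSNorm without a learned gain."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

class AdaRMSNorm:
    """Adaptive RMSNorm sketch: timestep -> (scale, shift) via a 2-layer MLP."""

    def __init__(self, dim: int, time_dim: int = 64, rng=np.random.default_rng(0)):
        self.w1 = rng.normal(scale=0.02, size=(time_dim, 4 * time_dim))
        self.w2 = rng.normal(scale=0.02, size=(4 * time_dim, 2 * dim))
        self.time_dim, self.dim = time_dim, dim

    def _embed_time(self, t: float) -> np.ndarray:
        # Sinusoidal embedding of the scalar timestep t in [0, 1].
        freqs = np.exp(np.linspace(0.0, np.log(1000.0), self.time_dim // 2))
        return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

    def __call__(self, hidden: np.ndarray, t: float) -> np.ndarray:
        emb = np.maximum(self._embed_time(t) @ self.w1, 0.0) @ self.w2  # 2-layer MLP
        scale, shift = emb[: self.dim], emb[self.dim:]
        return rms_norm(hidden) * (1.0 + scale) + shift

layer_norm = AdaRMSNorm(dim=1024)
out = layer_norm(np.zeros((16, 1024)), t=0.3)   # (16, 1024)
```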
At inference time, the action expert runs 10 denoising steps using the flow matching procedure. Starting from Gaussian noise, each step moves the action representation along a learned vector field toward the target action distribution. The result is a 50-step action chunk covering one second of motion at 50 Hz, representing target end-effector positions for both arms and the mobile base.
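A minimal sketch of that sampling loop: start from Gaussian noise and integrate a learned vector field for 10 steps to obtain a 50-step action chunk. The `action_expert` function is a placeholder for the trained network, and the simple Euler integration scheme is an assumption about the solver.

```python
import numpy as np

HORIZON, ACTION_DIM, NUM_STEPS = 50, 19, 10   # one second of actions at 50 Hz

def action_expert(obs_context, noisy_actions, t):
    """Placeholder: predicts the velocity field v(a_t, t | context)."""
    return np.zeros_like(noisy_actions)

def sample_action_chunk(obs_context, rng=np.random.default_rng(0)) -> np.ndarray:
    actions = rng.normal(size=(HORIZON, ACTION_DIM))  # pure noise at t = 0
    dt = 1.0 / NUM_STEPS
    for i in range(NUM_STEPS):
        t = i * dt
        velocity = action_expert(obs_context, actions, t)
        actions = actions + dt * velocity              # Euler step along the vector field
    return actions                                     # (50, ACTION_DIM) action chunk
```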
π0.5 uses two distinct action representations in different training phases. During pretraining, robot demonstration data is encoded using FAST tokens. The FAST tokenizer applies a discrete cosine transform to an action trajectory, which concentrates the signal in a small number of low-frequency coefficients, then quantizes those coefficients into a discrete token vocabulary. FAST tokens allow robot demonstration data to be treated as discrete sequences comparable to text tokens, enabling joint training with language and web data using a single cross-entropy objective.
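A toy sketch of the frequency-space idea: transform each action dimension with a DCT, keep the leading (low-frequency) coefficients, and round them to integers. The real FAST tokenizer additionally compresses the quantized coefficients into a learned vocabulary; the cutoff and rounding scale below are made-up illustration values.

```python
import numpy as np

def dct_ii(x: np.ndarray) -> np.ndarray:
    """Orthonormal DCT-II along the first (time) axis; x has shape (horizon, dof)."""
    n = x.shape[0]
    k = np.arange(n)[:, None]             # frequency index
    t = np.arange(n)[None, :]             # time index
    basis = np.cos(np.pi * (t + 0.5) * k / n)
    coef = basis @ x
    coef[0] *= 1.0 / np.sqrt(n)
    coef[1:] *= np.sqrt(2.0 / n)
    return coef

def tokenize_chunk(actions: np.ndarray, keep: int = 8, scale: float = 50.0) -> np.ndarray:
    """Map a (horizon, dof) chunk to a flat sequence of integer tokens."""
    coef = dct_ii(actions)[:keep]                    # retain low-frequency terms only
    return np.round(coef * scale).astype(int).T.ravel()

chunk = np.sin(np.linspace(0, np.pi, 50))[:, None] * np.ones((1, 19)) * 0.1
tokens = tokenize_chunk(chunk)
print(tokens.shape)  # (19 * 8,) discrete tokens for one 50-step chunk
```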
During post-training, the action expert transitions from producing FAST token predictions to producing continuous flow matching outputs. The FAST token head is retained but the flow matching head becomes the primary output for robot control. This design means the model can leverage the computational efficiency of discrete token training at scale during pretraining, then recover continuous action precision during the post-training specialization phase.
One of the clearest architectural additions in π0.5 relative to π0 is the explicit high-level subtask prediction stage. Before generating motor action tokens, the model first produces a short natural-language description of the current semantic subtask (for example, "pick up the mug" or "open the drawer"). This text is generated by the Gemma language head using autoregressive decoding.
The predicted subtask text is then prepended to the action expert's input context before flow matching begins. This gives the action expert an explicit semantic anchor: rather than inferring the current goal entirely from visual observations, it receives a natural-language description of what it is about to do. The authors described this as a form of chain-of-thought reasoning applied to robot control.
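A sketch of this two-stage inference flow: decode a short subtask phrase first, then condition action generation on it. The stub class and its method names are illustrative placeholders, not the openpi API.

```python
import numpy as np

class Pi05Stub:
    def decode_subtask(self, observation, task_prompt: str) -> str:
        # Autoregressive text decoding by the Gemma language head in the real model.
        return "pick up the mug"

    def sample_actions(self, observation, conditioning_text: str) -> np.ndarray:
        # Flow-matching denoising (10 steps) in the real model.
        return np.zeros((50, 19))

def infer_step(model, observation, task_prompt: str):
    subtask = model.decode_subtask(observation, task_prompt)        # stage 1: subtask text
    conditioning = f"{task_prompt} | {subtask}"                     # add subtask to context
    actions = model.sample_actions(observation, conditioning)       # stage 2: action chunk
    return subtask, actions

subtask, chunk = infer_step(Pi05Stub(), observation=None, task_prompt="clean the kitchen")
print(subtask, chunk.shape)  # "pick up the mug" (50, 19)
```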
Ablation experiments showed that including subtask annotation data during training provided benefits even when the predicted subtask was not passed to the action expert at inference time. The training signal from subtask prediction improved the quality of the visual representations used for action generation, suggesting that semantic annotation is useful as a training objective independent of its role as an inference-time input.
Like π0, π0.5 uses a prefix-suffix attention structure. Image, language, and state tokens form the prefix, over which bidirectional (full) attention is applied. Action tokens form the suffix, over which causal (left-to-right) attention is applied, with each action token attending to all prefix tokens and to preceding action tokens. This structure ensures the model can integrate full context when generating actions while preserving the autoregressive structure needed for FAST token generation.
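A small sketch of what such a blockwise mask looks like: full attention within the prefix, suffix tokens attending to the entire prefix, and causal attention among suffix tokens. This is a generic construction consistent with the description above, not the exact implementation.

```python
import numpy as np

def build_attention_mask(num_prefix: int, num_suffix: int) -> np.ndarray:
    n = num_prefix + num_suffix
    mask = np.zeros((n, n), dtype=bool)            # True = attention allowed
    mask[:num_prefix, :num_prefix] = True          # full attention within prefix
    mask[num_prefix:, :num_prefix] = True          # suffix attends to all prefix tokens
    causal = np.tril(np.ones((num_suffix, num_suffix), dtype=bool))
    mask[num_prefix:, num_prefix:] = causal        # causal attention within suffix
    return mask

m = build_attention_mask(num_prefix=5, num_suffix=3)
print(m.astype(int))
```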
Several training-time choices distinguish π0.5 from π0. In π0, the PaliGemma backbone was kept frozen throughout training, with only the action expert parameters updated. In π0.5, the PaliGemma backbone is unfrozen during pretraining. The authors attributed this decision to the larger and more diverse training dataset: unfreezing the backbone lets the model adapt its visual and language representations to the robot-specific observation distribution without catastrophic forgetting, because the pretraining data includes enough web and language examples to maintain the backbone's general capabilities.
Physical Intelligence has also described a technique called knowledge insulation, in which the gradient from the action expert is blocked from flowing back into the VLM backbone. π0.5 applies a variant of this idea during post-training: after pretraining with the unfrozen backbone, the VLM is frozen again for the post-training phase so that the action expert can be fine-tuned on mobile manipulation data without distorting the broader representations built during pretraining.
The defining methodological contribution of π0.5 is training a single model jointly on six data categories that differ dramatically in their format, source robot, semantic content, and action space. The hypothesis driving this design is that different data sources teach complementary capabilities: mobile manipulation data teaches the target skill, non-mobile robot data teaches generalizable manipulation primitives, cross-embodiment data transfers motor knowledge across platforms, semantic annotation teaches the structure of household tasks, web data teaches object recognition and scene understanding, and verbal instruction data teaches language grounding.
Joint training on all six categories is not straightforward because the data types are heterogeneous: some examples include action sequences while others do not, some have bounding box annotations while others have only image-text pairs, and the robot embodiments have different observation and action spaces. The FAST tokenizer and the dual action representation solve part of this problem by providing a common discrete token format for action data. The remainder is handled through a shared token vocabulary and a multi-task loss that applies different output heads to different data types depending on what labels are available.
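A sketch of how such a multi-task loss might be dispatched over heterogeneous co-training examples, assuming each example carries only the labels its data category provides. The head names and example schema are illustrative, not the actual training code.

```python
def active_losses(example: dict) -> list[str]:
    """Return which output heads receive supervision for this example."""
    heads = []
    if "actions" in example:            # MM, ME, CE robot demonstrations
        heads.append("fast_action_tokens")
    if "subtask_label" in example:      # HL semantic annotations
        heads.append("subtask_text")
    if "caption" in example or "vqa" in example or "boxes" in example:  # WD web data
        heads.append("webdata_text")
    return heads

print(active_losses({"actions": "...", "subtask_label": "pick up the plate"}))
# ['fast_action_tokens', 'subtask_text']
```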
| Source | Abbreviation | Format | Scale |
|---|---|---|---|
| Mobile Manipulator data | MM | Video + joint states + actions | ~400 hours, ~100 homes |
| Multi-Environment non-mobile | ME | Video + joint states + actions | Multiple indoor environments |
| Cross-Embodiment laboratory | CE | Video + joint states + actions | OXE dataset + in-house recordings |
| High-Level subtask prediction | HL | Video + natural language subtask labels | Manually annotated |
| Web Data | WD | Images + captions, VQA pairs, object bounding boxes | Internet-scale |
| Verbal Instruction demonstrations | VI | Video + natural language instructions | Expert-collected, post-training only |
The most striking aspect of the data composition is the relative scale of non-MM sources. During pretraining, 97.6% of training examples come from sources other than mobile manipulator household data. Cross-embodiment recordings, multi-environment robot demonstrations, semantic annotations, and web images collectively dominate the training distribution. Despite this, the final model generalizes to mobile household manipulation.
The MM data was collected by Physical Intelligence's data collection team operating the mobile manipulator hardware in approximately 100 distinct household environments. Demonstrations were collected via teleoperation, with human operators remotely controlling the robot's base, torso, arms, and grippers while a researcher indicated the task objective. The target tasks were the four evaluation categories: dishes in sink, items in drawer, laundry basket, and make bed. Approximately 400 hours of MM demonstrations were collected across the roughly 100 homes.
Even with 100 training homes, the MM data represents a small fraction of the total training volume. The paper's key empirical finding is that this relatively modest collection of on-robot household data, combined with diverse data from other sources, is sufficient to achieve near-parity with oracle baselines trained directly on the test environments.
The CE data includes two components. The first is a subset of the Open X-Embodiment (OXE) dataset, a publicly available corpus assembled by Google and academic collaborators that aggregates robot manipulation demonstrations from over 20 institutions and 22 robot platforms. The second is in-house cross-embodiment data collected by Physical Intelligence across multiple robot configurations, including single-arm manipulators and bimanual setups used for the original π0 training.
The OXE and in-house CE data provide manipulation demonstrations across a wide range of object types, task structures, and robot kinematics. Because these demonstrations are mostly from laboratory settings rather than household environments, they do not directly teach household task completion. Instead, they appear to teach generalizable manipulation skills: how to grasp objects of different shapes, how to handle soft materials, how to operate drawers and doors.
The web data consists of internet-scale image understanding examples: image captioning pairs, visual question-answering examples, and object localization examples with bounding box annotations. These examples do not include robot actions. They are included in training as a non-robotic co-training signal.
The motivation for including web data is to prevent catastrophic forgetting of the PaliGemma backbone's visual recognition capabilities. A model trained exclusively on robot demonstration data may lose the ability to recognize object categories not represented in robot demonstrations. Since the test homes contained arbitrary household objects not present in training demonstrations, the web data was expected to be critical for out-of-distribution object recognition.
Ablation experiments confirmed this: removing web data produced the largest performance drop specifically for out-of-distribution object recognition tasks, while having a smaller effect on tasks involving familiar objects.
The HL data consists of robot observation frames or short video clips paired with manually written natural-language descriptions of the current semantic subtask. These descriptions are typically short phrases ("pick up the plate," "walk to the kitchen") rather than full sentences. Annotation was performed by the data collection team after the fact, labeling segments of existing robot demonstrations with semantic subtask descriptions.
The HL data trains the model's high-level language head: the capacity to generate a short text description of what the robot is about to do before generating actions. Including HL data also provides a richer supervision signal for the visual representations, since the model must learn to associate specific visual configurations with specific semantic subtask labels.
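For illustration, a single HL annotation record might look like the following, assuming a simple JSON-style schema. All field names and values are hypothetical; the paper does not publish the dataset format.

```python
hl_example = {
    "observation": "frames from a segment of an existing robot demonstration",
    "task_prompt": "clean up the kitchen",
    "subtask_label": "pick up the plate",   # short phrase, manually annotated
}
```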
Training proceeds in two sequential phases.
The pretraining phase runs for 280,000 gradient steps and uses all six data categories (MM, ME, CE, HL, WD, and partial VI). During this phase, all action data is represented using FAST tokens. The PaliGemma backbone is unfrozen so it can adapt to the robot-specific distribution while retaining general visual and language representations through the stabilizing influence of web and language data. The loss function is a weighted sum of cross-entropy over all discrete token outputs.
The post-training phase runs for an additional 80,000 gradient steps and focuses the model on mobile manipulation. This phase uses only MM, ME, HL, and VI data. The PaliGemma backbone is frozen to preserve the representations built during pretraining. The flow matching action head is introduced during this phase, and its loss is combined with the cross-entropy loss at a weight of α = 10.0. Post-training specializes the action expert for the continuous action precision required by mobile manipulation tasks.
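A minimal sketch of the post-training objective as described: a flow-matching regression loss for the action expert combined with the token cross-entropy loss at the stated weight α = 10.0. The linear interpolation path and mean-squared-error form are the standard flow-matching recipe, used here as an assumption; the example loss values are made up.

```python
import numpy as np

ALPHA = 10.0

def flow_matching_loss(predict_velocity, actions, rng=np.random.default_rng(0)):
    """actions: (horizon, dof) ground-truth chunk from a demonstration."""
    noise = rng.normal(size=actions.shape)
    t = rng.uniform()                                  # random timestep in [0, 1]
    noisy = (1.0 - t) * noise + t * actions            # linear interpolation path
    target_velocity = actions - noise                  # velocity of that path
    pred = predict_velocity(noisy, t)
    return np.mean((pred - target_velocity) ** 2)

def post_training_loss(token_ce_loss, fm_loss):
    return token_ce_loss + ALPHA * fm_loss             # weighted combination

fm = flow_matching_loss(lambda noisy, t: noisy * 0.0, np.zeros((50, 19)))
total = post_training_loss(token_ce_loss=2.3, fm_loss=fm)
```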
The term "open-world generalization" in π0.5's framing refers to performance in physical environments not represented in the training data. Prior VLA evaluations, including those for π0 and most VLAs published before 2025, tested robots in environments matched to or very close to training conditions. Even evaluation on novel test objects typically occurred in familiar rooms with familiar furniture. The few prior works that did evaluate in new environments generally did so with simple pick-and-place tasks using a single robot arm, not with long-horizon mobile manipulation tasks.
π0.5 was evaluated in three rental homes in San Francisco that the development team had never visited during data collection. The homes had different furniture, kitchen layouts, counter arrangements, clutter distributions, and object inventories from anything in the training set. The evaluation tasks required the robot to complete 10 to 15 minute sequences of subtasks, navigating the space, identifying relevant objects, and performing dexterous manipulation.
π0.5 was evaluated on two mobile manipulator platforms. Each robot has a wheeled holonomic base (allowing movement in any direction without turning), a torso lift, two 6-degree-of-freedom arms with parallel-jaw grippers, and four RGB cameras: one forward-facing camera mounted on the torso, one rear-facing camera, and one wrist-mounted camera on each arm. The combined state and action space has 18 to 19 degrees of freedom. Motor commands are tracked by a proportional-derivative controller at 50 Hz. Neural network inference runs on off-robot compute, with commands streamed to the robot over WiFi.
This hardware configuration is simpler than a humanoid robot but more capable than a fixed-base arm: the mobile base lets the robot navigate household spaces, the torso lift helps with varying counter and shelf heights, and the wrist cameras provide close-up views useful for grasping.
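An illustrative configuration summarizing the hardware description above; the values repeat figures stated in the text, the exact per-joint breakdown of the 18-19 degree-of-freedom action space is not spelled out here, and the field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MobileManipulatorConfig:
    cameras: tuple = ("forward", "rear", "left_wrist", "right_wrist")
    arms: int = 2
    arm_dof_each: int = 6
    gripper: str = "parallel_jaw"
    base: str = "wheeled_holonomic"
    torso_lift: bool = True
    total_action_dof: int = 19        # combined state/action space is 18-19 DoF
    control_rate_hz: int = 50         # PD controller tracking rate
    inference: str = "off_robot_over_wifi"

print(MobileManipulatorConfig())
```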
Four primary tasks structured the evaluation:
| Task | Description | Complexity |
|---|---|---|
| Dishes in sink | Locate dirty dishes distributed around the kitchen; transport each to the sink | Requires navigation, object search, repeated grasping |
| Items in drawer | Find scattered small items; sort and deposit in appropriate drawers | Requires fine manipulation and categorical sorting |
| Laundry basket | Collect clothing and textiles from floor and surfaces; deposit in basket | Requires handling deformable objects |
| Make bed | Smooth sheets, arrange pillows, tuck loose bedding | Requires whole-body manipulation of large soft objects |
Each task was evaluated over 10 trials per environment. The primary metric was task progress: the fraction of relevant items or subtasks successfully completed in each trial. A secondary metric tracked out-of-distribution generalization specifically to objects not present in any training demonstration.
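A brief sketch of the task progress metric as described: the fraction of relevant items or subtasks completed in a trial, averaged over trials. The example numbers are made up for illustration.

```python
def task_progress(completed: int, total: int) -> float:
    return completed / total

def mean_progress(trials):
    """trials: iterable of (completed, total) pairs, one per trial."""
    scores = [task_progress(c, t) for c, t in trials]
    return sum(scores) / len(scores)

print(mean_progress([(3, 4), (4, 4), (2, 4)]))  # 0.75 average progress
```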
Across the four tasks and three test homes, π0.5 achieved a 94% success rate on out-of-distribution object manipulation and an 83% to 86% success rate on in-distribution conditions, depending on the specific evaluation configuration. Performance approached that of an oracle baseline model trained directly on data from the test environments, demonstrating that the co-training methodology can produce nearly environment-matched performance without requiring data collection in the target environments.
The paper also reported a scaling curve: as the number of distinct training homes in the MM dataset increased from 3 to 104, task performance improved steadily and monotonically. At 3 training homes, generalization to new homes was poor. At approximately 100 training homes, performance approached oracle levels. This curve confirmed that environment diversity in training data was more important than total demonstration volume.
The paper reported ablations isolating each data source's contribution; the web-data and subtask-annotation ablations described above are two examples.
Physical Intelligence published demonstration videos alongside the paper showing π0.5 operating in the three previously unseen homes. The kitchen cleaning sequences showed the robot identifying scattered dishes on countertops and tables, grasping items one at a time, navigating to the sink, placing items in the sink, and then returning to find additional items. The bedroom sequences showed the robot picking up clothing from the floor, handling soft textiles, locating a laundry basket through visual search, and depositing items.
One demonstration sequence showed the robot's response to human interference. During a task, a person moved an object the robot had already handled. The robot perceived the changed state on its next visual scan and re-executed the relevant subtask, demonstrating closed-loop replanning rather than execution of a fixed predetermined sequence. The robot also handled spills using a sponge in one demonstration, a task requiring both tool use and surface contact estimation.
The blog post noted that the robot was observed to display "the flexibility and resourcefulness with which a person might approach a new challenge," adapting its approach based on the specific object arrangement it found in each room rather than following a memorized path through the space.
All demonstrations were conducted without prior teleoperation in the test homes. A setup visit took place before evaluation to install the robot's charging station and confirm that the space was accessible, but no robot demonstrations or data collection occurred in the test homes.
π0.5 extends π0 in several ways that reflect both architectural changes and methodological shifts in training:
| Property | π0 | π0.5 |
|---|---|---|
| Release date | October 2024 | April 22, 2025 |
| Base VLM | PaliGemma 3B (SigLIP + Gemma-2B) | PaliGemma 3B (SigLIP + Gemma-2B) |
| Total parameters | ~3.3B | ~3.3B |
| VLM frozen during training | Yes (always) | No during pretraining; yes during post-training |
| Action representation | Flow matching only | FAST tokens (pretraining) + flow matching (post-training) |
| High-level subtask prediction | No | Yes (text chain-of-thought before actions) |
| Web data in training | Limited | Yes (image captioning, VQA, object localization) |
| Training environments | Focused task sets | ~100 household environments |
| Open-world evaluation | No | Yes (3 unseen homes) |
| Primary evaluation tasks | Laundry folding, box assembly, table bussing | Kitchen cleaning, bedroom tidying, drawer organization |
| Primary focus | Task dexterity across known settings | Generalization to unknown settings |
In direct comparison experiments within the π0.5 paper, π0.5 substantially outperformed both the original π0 and a variant called π0-FAST+Flow (which added the FAST tokenizer and flow matching to π0 without the full co-training data mix and semantic subtask prediction). The gap was largest on out-of-distribution evaluation conditions, confirming that the improvement came from the co-training methodology rather than the architectural changes alone.
π0.5 appeared alongside several other significant VLA model releases in early 2025. Two of the most prominent were Helix from Figure AI and GR00T N1 from NVIDIA.
| Property | π0.5 | Helix (VLA model) | Isaac GR00T N1 |
|---|---|---|---|
| Developer | Physical Intelligence | Figure AI | NVIDIA |
| Release date | April 22, 2025 | February 2025 | March 2025 |
| Base model | PaliGemma 3B | 7B VLM (System 2) + visuomotor policy (System 1) | NVIDIA Eagle-2 VLM + Diffusion Transformer |
| Total parameters | ~3.3B | ~7B (S2) + S1 components | ~2.2B (public release) |
| Target platform | Wheeled mobile manipulator (two 6-DoF arms) | Figure humanoid robot (35 DoF) | Generalist humanoid robots |
| High-level control | Semantic subtask text prediction | System 2 VLM at 7 to 9 Hz | Reasoning module |
| Low-level control frequency | 50 Hz | 200 Hz | 120 Hz |
| Action generation method | Flow matching (10 steps) | Visuomotor policy outputting action chunks | Diffusion Transformer |
| Training data emphasis | Diverse environments, web data, cross-embodiment | ~500 hours teleoperation | 20,000+ hours egocentric human video + synthetic |
| Open-world evaluation setting | 3 unseen household environments | Household and logistics settings | Simulation benchmarks, multiple embodiments |
| Open weights | Partial (π0 base via openpi GitHub) | No | Yes (GR00T-N1-2B on Hugging Face) |
| Architectural decomposition | Unified transformer with dual action heads | Hard two-model split (S2 and S1 as separate networks) | Hard two-model split (System 2 and Diffusion Transformer) |
The architecturally deepest distinction is how each system handles the decomposition between semantic reasoning and motor control. Helix and GR00T N1 both implement a hard two-system split, with separate model components for high-level reasoning (running at lower frequency) and low-level motor control (running at higher frequency). π0.5 implements a softer decomposition: a single unified transformer produces both semantic subtask text and motor action tokens, using different output heads for different modalities but sharing weights throughout. Whether the unified or split architecture is ultimately more capable remains an open research question.
Another distinction concerns embodiment. Helix is purpose-built for the Figure humanoid robot, with 35 degrees of freedom including individual finger control. GR00T N1 targets humanoid robots broadly. π0.5 was built and evaluated on a wheeled mobile manipulator without humanoid legs, making direct performance comparisons across platforms difficult since the platforms have different capabilities and face different challenges.
In terms of data strategy, GR00T N1 placed heavy emphasis on synthetic data from the Isaac simulator and egocentric human video (the EgoScale dataset totaling over 20,000 hours). π0.5 did not use synthetic simulation data, relying instead on real robot demonstrations and web images. Helix trained on approximately 500 hours of human teleoperation data, much less than either competitor.
The π0.5 paper acknowledged several persistent failure modes and scope limitations.
The model has no persistent spatial map. It cannot see behind itself and has no memory of object positions across large displacements. Tasks requiring the robot to track where it placed objects after moving away, or to remember the state of areas not in the current field of view, can cause failures.
The high-level subtask predictor can be distracted by salient but irrelevant objects. A visually prominent item that has already been handled may re-attract the model's attention and cause repeated approach attempts. This is a symptom of the model lacking an explicit world state representation: it must infer the current state from raw visual observations at each timestep rather than consulting a maintained task progress record.
The system handles simple and direct commands reliably but was not evaluated on complex conditional or ambiguous instructions. Performance on instructions like "put the dishes away unless the dishwasher is already full" or "clean the kitchen when you have time" was not characterized.
π0.5 has no explicit failure recovery mechanism. If a grasp fails or an object is dropped, the model does not detect the failure and replan; it continues its current action sequence until the next high-level subtask boundary. Recovery from dropped objects depends on whether the next subtask prediction happens to address the dropped item.
All evaluation was conducted in standard residential environments with ordinary household objects. Performance on unusual furniture mechanisms (push-to-open cabinets, non-standard drawers), non-standard appliances, or living spaces that deviate from typical residential layouts was not characterized. Commercial kitchens, industrial settings, and outdoor environments were entirely outside the evaluation scope.
The evaluation was also limited in scale: three test homes with 10 trials per task is a relatively small sample for establishing robust performance claims. The homes were all in the same city and likely shared some demographic and architectural characteristics.
Physical Intelligence followed π0.5 with π0.6 (also written pi0.6 or π₀.₆), released in November 2025. The primary innovation in π0.6 was incorporating online reinforcement learning, allowing the robot to improve its performance through experience gathered during real-world deployments rather than relying solely on offline demonstration data.
π0.6 was reported to show substantial improvements over π0.5 on tasks like laundry folding and box assembly, which previously required task-specific fine-tuning with high-quality demonstrations to achieve non-zero success rates. The π0.6 model card noted that its training data composition was largely inherited from π0.5, with the reinforcement learning component providing the primary performance gain beyond the π0.5 baseline.
Physical Intelligence also released the openpi GitHub repository, which provided open-weight versions of the π0 base model and code for fine-tuning on custom tasks. The openpi repository became one of the most widely used open-source robotics model repositories in 2025.