Vision-language-action model
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 5,320 words
A vision-language-action model (VLA model, or VLA) is a class of foundation model for robotic control that takes one or more RGB camera images and a natural-language instruction as input and produces robot actions, typically expressed as end-effector pose deltas, joint-space targets, or tokenized action sequences, as output. VLA models extend the vision-language model paradigm with an action head or action expert so that internet-scale pretraining on image-text data can be transferred to closed-loop manipulation and locomotion policies. The term was coined in July 2023 by Brohan and colleagues at Google DeepMind in the paper that introduced RT-2: "We refer to such category of models as vision-language-action models (VLA)."[^1]
By 2026 VLAs had become the dominant architecture for general-purpose manipulation research. Open-weights releases such as OpenVLA (Stanford, Toyota Research Institute and the University of Washington, June 2024)[^2], the openpi family from Physical Intelligence (π₀ in October 2024, π₀-FAST in January 2025, π₀.₅ in April 2025)[^3][^4][^5], NVIDIA's GR00T N1 for humanoids (March 2025)[^6], Figure AI's Helix dual-system policy (February 2025)[^7] and Hugging Face's lightweight SmolVLA (June 2025)[^8] together turned VLAs from a single Google research demo into a competitive open ecosystem trained on hundreds of thousands of teleoperated robot demonstrations and on consortium datasets such as Open X-Embodiment[^9] and DROID[^10].
Modern VLA work has three identifiable predecessors in 2022 and early 2023, none of which were marketed under the "vision-language-action" label at the time.
Gato, introduced by DeepMind in May 2022 with about 1.18 billion parameters, was a single decoder-only transformer trained on 604 tasks spanning Atari games, image captioning, dialogue, simulated 3D navigation and real robot-arm control, with all modalities serialized into a shared token stream.[^11] Gato established the premise that a single transformer with the same weights could output tokens for joystick inputs, robotic gripper commands and natural language. It did not, however, leverage a strong pretrained vision-language model: it was trained from scratch on a curated mixture of supervised and behaviour-cloning data.
RT-1, by Brohan and colleagues at Google in December 2022, was a 35-million-parameter Robotics Transformer trained on roughly 130,000 episodes of teleoperated Everyday Robots manipulation across 700+ tasks.[^12] RT-1 tokenized images with EfficientNet, fused them with a language embedding via FiLM conditioning, and emitted discretized end-effector actions. RT-1 is widely credited with showing that supervised behaviour cloning at scale could deliver high task success across many skills and with introducing the action discretization recipe later inherited by RT-2 and OpenVLA.
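The FiLM conditioning mentioned above can be illustrated with a minimal sketch. FiLM (feature-wise linear modulation) scales and shifts each visual feature channel using values derived from the instruction. The numbers below are toy values, and `gamma` and `beta` are hard-coded assumptions standing in for what would be a learned projection of the language embedding; this is not RT-1's actual code.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift every channel.

    features: list of tokens, each a list of channel values
    gamma, beta: one scale and one shift per channel
    """
    return [[g * x + b for x, g, b in zip(token, gamma, beta)]
            for token in features]

# Two toy image tokens with three channels each.
image_tokens = [[1.0, 2.0, 3.0],
                [0.5, 0.0, -1.0]]

# Assumption: in a real model these come from a learned function of the
# instruction embedding; here they are fixed for illustration.
gamma = [1.0, 0.0, 2.0]
beta = [0.0, 1.0, -1.0]

modulated = film(image_tokens, gamma, beta)
print(modulated)
```

The instruction therefore changes how the visual features are weighted without being concatenated to them as extra tokens.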
PaLM-E, published by Google in March 2023, was an "embodied multimodal language model" that integrated the 540-billion-parameter PaLM language model with a 22-billion-parameter Vision Transformer, producing a 562-billion-parameter model.[^13] PaLM-E injected continuous image and state observations into the language embedding space, generated high-level plans in text, and delegated low-level control to RT-1. It was a step toward foundation-model-based robotics but did not directly emit actions itself, blurring the boundary between a vision-language model and a VLA.
RT-2, published as arXiv preprint 2307.15818 on 28 July 2023 by a 54-author team led by Brohan, was the paper that explicitly defined the "vision-language-action" category.[^1] Two backbones were tried: a 12-billion-parameter variant built on PaLM-E and a 55-billion-parameter variant built on the PaLI-X vision-language model.[^14] The defining engineering trick was to represent each dimension of the 7-DoF end-effector action (x, y, z, roll, pitch, yaw, gripper) as a single integer in 0-255, then reserve 256 unused tokens in the language model's vocabulary to encode those integers. With this representation, fine-tuning a VLM on co-mingled internet image-text data and robot demonstrations produced a single decoder that could either answer a visual question or emit an action sequence depending on the prompt.
The headline empirical result was that RT-2 exhibited substantial generalization to unseen objects, backgrounds and instructions, and could perform crude chain-of-thought planning at the level of "pick up the extinct animal" (which the model resolved to a dinosaur figurine). RT-2 was never released as open weights and was tied to Google's internal Everyday Robots hardware, but the paper's framing and terminology proved durable: by 2024 essentially every robot-foundation-model release positioned itself as a VLA or as a competitor to a VLA.
After RT-2, VLAs proliferated along three roughly parallel branches. One branch kept the decoder-only "actions as tokens" recipe and pushed it into open weights, exemplified by OpenVLA (June 2024). A second branch replaced the discrete action token head with a continuous action expert trained by flow matching or diffusion, an approach already visible in the diffusion head of the generalist policy Octo (May 2024). This branch, originating among VLAs with Physical Intelligence's π₀ in October 2024 and continued by RDT-1B (October 2024), Helix (February 2025) and GR00T N1 (March 2025), targeted higher-frequency dexterous control where 256-bin discretization breaks down. A third branch, the dual-system or "system 1 / system 2" architecture promoted by Figure AI and NVIDIA, factored the policy into a slow VLM that reasons over scene and language and a fast neural controller that emits motor commands. By 2026 these branches were converging: π₀.₅ adds high-level subtask prediction on top of a flow-matching action expert, and GR00T N1 and Helix both bolt a small action transformer onto a frozen or lightly tuned VLM backbone.
The field moved quickly. In the roughly eighteen months between OpenVLA's June 2024 release and the GR00T N1.7 announcement in late 2025, at least a dozen open-weights VLAs and roughly twice as many closed ones were released. Over the same period the median size of an open VLA fell from about seven billion parameters to under a billion, as engineers learned which parts of a generic VLM could be discarded once the action expert took over the dexterous-control workload. Concurrent advances in synthetic data pipelines, low-cost teleoperation hardware, and PyTorch-based training infrastructure such as LeRobot made it feasible for academic labs and even individuals to train serviceable VLAs from scratch on a single workstation GPU, whereas RT-2 had required the dedicated TPU pod of a hyperscaler.
The most common working definition, traceable directly to the RT-2 paper, is that a VLA is a model satisfying three properties:
In addition, most authors require that the model be built on top of a pretrained vision-language model, so that internet-scale visual grounding is inherited rather than learned from robot data alone. That criterion is what distinguishes RT-2 and its successors from Gato or RT-1, which were trained from scratch.
The boundary cases are themselves informative. Octo[^15] is sometimes excluded from the VLA category because it conditions on a "task token" produced by a small language encoder rather than processing free-form text through a full LLM, and its diffusion action head sits on top of a transformer trained from scratch on robot data. RDT-1B[^16] is sometimes excluded because language enters through a T5 encoder rather than via a co-trained autoregressive VLM. World models such as 1X's video-based imagination model[^17] and DeepMind's Genie family are not VLAs because they predict next-frame imagery rather than robot actions. These boundary debates matter less than the underlying engineering choices, which the architecture section addresses next.
VLAs split into four broad architectural families, each tied to a different action representation.
The decoder-only family inherits RT-2's recipe: a single autoregressive transformer, initialized from a pretrained VLM, predicts action tokens after the image and instruction tokens. Each action dimension is quantized into 256 bins and assigned a token in the vocabulary, so action prediction is reduced to next-token prediction.
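The 256-bin scheme can be sketched in a few lines. The example below is a minimal illustration, not the RT-2 or OpenVLA implementation: the per-dimension ranges `LOW` and `HIGH` and the vocabulary offset `FIRST_ACTION_TOKEN_ID` are hypothetical values chosen for the demonstration.

```python
NUM_BINS = 256  # bins per action dimension, as described in the text

def discretize(action, low, high, num_bins=NUM_BINS):
    """Map each continuous action dimension to an integer bin in [0, num_bins - 1]."""
    bins = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)        # keep the value inside its range
        fraction = (clipped - lo) / (hi - lo)    # normalize to [0, 1]
        bins.append(min(int(fraction * num_bins), num_bins - 1))
    return bins

def undiscretize(bins, low, high, num_bins=NUM_BINS):
    """Recover the bin-centre value for each dimension."""
    return [lo + (b + 0.5) * (hi - lo) / num_bins
            for b, lo, hi in zip(bins, low, high)]

def to_token_ids(bins, first_action_token_id):
    """Shift bin indices into a reserved block of the language-model vocabulary."""
    return [first_action_token_id + b for b in bins]

# Hypothetical ranges for (x, y, z, roll, pitch, yaw, gripper).
LOW = [-1.0] * 6 + [0.0]
HIGH = [1.0] * 7
# Hypothetical start of the 256 reserved token ids.
FIRST_ACTION_TOKEN_ID = 31744

action = [0.0, 0.5, -1.0, 1.0, 0.25, -0.25, 1.0]
bins = discretize(action, LOW, HIGH)
token_ids = to_token_ids(bins, FIRST_ACTION_TOKEN_ID)
recovered = undiscretize(bins, LOW, HIGH)
print(bins)  # [128, 192, 0, 255, 160, 96, 255]
```

Decoding returns the bin centre, so the round trip loses at most half a bin width per dimension. That bounded but non-zero error is the coarseness that later designs try to avoid.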
RT-2 uses this scheme on a 12B PaLM-E or 55B PaLI-X backbone.[^1][^14] OpenVLA, the open-source counterpart by Kim, Pertsch and colleagues, fine-tunes a Llama-2 7B language model paired with a fused DINOv2 plus SigLIP visual encoder on 970,000 manipulation episodes drawn from Open X-Embodiment, and reports outperforming RT-2-X by 16.5 absolute percentage points in success rate while using roughly one-seventh of the parameters.[^2][^18]
The advantage of this design is that it requires almost no architectural change relative to a standard VLM, so it benefits directly from progress in language and vision-language pretraining. The disadvantage is that fine-grained dexterous control suffers from the coarse 256-bin discretization and from the high latency of autoregressively decoding several tokens per action. The FAST tokenizer from Pertsch, Stachowicz and colleagues, published in January 2025, addresses both problems by applying a discrete cosine transform to action chunks before tokenization, achieving up to a fivefold reduction in training time and substantial gains in dexterous task performance.[^19] The π₀-FAST model from Physical Intelligence is an autoregressive VLA built directly on this tokenizer.[^4][^19]
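The idea of transforming an action chunk with a discrete cosine transform before tokenizing it can be illustrated as follows. This is a simplified sketch and not the published FAST tokenizer: it handles a single action dimension, uses a hand-written orthonormal DCT, and quantizes by plain rounding with a hypothetical `scale`; any further compression of the resulting integers is omitted. The toy trajectory is deliberately built from three low-frequency cosines so that the effect is easy to see.

```python
import math

def _scale(k, n):
    # Normalization that makes the DCT-II / DCT-III pair orthonormal.
    return math.sqrt((1.0 if k == 0 else 2.0) / n)

def dct(signal):
    """Orthonormal DCT-II of a 1-D list."""
    n = len(signal)
    return [_scale(k, n) * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(signal))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]

def compress(chunk, scale=100.0):
    """Transform the chunk and round the coefficients to integers."""
    return [round(c * scale) for c in dct(chunk)]

def decompress(tokens, scale=100.0):
    return idct([t / scale for t in tokens])

# Toy 50-step trajectory for one action dimension (smooth by construction).
N = 50
chunk = [0.3
         + 0.5 * math.cos(math.pi * (i + 0.5) / N)
         + 0.1 * math.cos(3 * math.pi * (i + 0.5) / N)
         for i in range(N)]

tokens = compress(chunk)
nonzero = sum(1 for t in tokens if t != 0)
restored = decompress(tokens)
max_error = max(abs(a - b) for a, b in zip(chunk, restored))
print(nonzero, "non-zero coefficients out of", N)
```

Because smooth motion concentrates its energy in a few low-frequency coefficients, most quantized values are zero. A 50-step chunk can then be described by far fewer informative tokens than 50 per-step bins.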
The action-expert family freezes or lightly fine-tunes a pretrained VLM and adds a separate, smaller transformer that emits continuous action chunks. The new module is trained as a conditional flow-matching or diffusion model: at training time the action expert is shown noisy actions and predicts the denoising velocity field, and at inference time it integrates that field for a handful of steps to produce a smooth action trajectory.
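The training target and the inference-time integration described above can be sketched with a toy example. Everything here is an illustrative assumption rather than any released model's code: the convention that t = 0 is pure noise and t = 1 is the clean action (papers differ on this), the straight-line interpolation path, and the `oracle_velocity` function, which stands in for the learned action expert and is only valid for a single known target action.

```python
import random

def interpolate(noise, action, t):
    """Point on the straight line from noise (t = 0) to the action (t = 1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]

def target_velocity(noise, action):
    """Regression target for the network along that straight-line path."""
    return [a - n for n, a in zip(noise, action)]

def mse(predicted, target):
    """Flow-matching loss for one sample: mean squared velocity error."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

def euler_integrate(velocity_fn, start, num_steps=10):
    """Integrate the velocity field from t = 0 to t = 1 with fixed Euler steps."""
    x = list(start)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step / num_steps
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

random.seed(0)
action = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]    # toy 7-D action
noise = [random.gauss(0.0, 1.0) for _ in action]

def oracle_velocity(x, t):
    # Stand-in for the trained network: the exact field when there is
    # only one possible target action.
    return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]

# Training view: at a noisy point the prediction should match the target.
t = 0.3
x_t = interpolate(noise, action, t)
loss = mse(oracle_velocity(x_t, t), target_velocity(noise, action))

# Inference view: start from noise and integrate for a handful of steps.
sampled = euler_integrate(oracle_velocity, noise, num_steps=10)
print(loss, sampled)
```

In a real model the velocity function is a transformer conditioned on the VLM's features, and the sample is an entire action chunk rather than one 7-dimensional vector. The loop structure is the same.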
π₀ by Physical Intelligence (October 2024) is the canonical example. It uses Google's PaliGemma 3B as the VLM backbone and bolts on a roughly 300-million-parameter action expert that emits 50-step continuous action chunks via flow matching, supporting up to 50 Hz dexterous control.[^3][^20] The model is trained on a proprietary Physical Intelligence dataset covering eight robot embodiments, mixed with Open X-Embodiment. π₀.₅ (April 2025) keeps the flow-matching head but co-trains on web data, high-level subtask prediction and verbal instructions in addition to robot data, and is reported to perform multi-minute kitchen and bedroom cleaning tasks in homes that were never seen during training.[^5]
RDT-1B (Liu et al., Tsinghua, October 2024) is a 1.2-billion-parameter diffusion transformer with a "physically interpretable unified action space" that lets a single model fine-tune to many bimanual robots without rewriting the action head.[^16] Language enters via a T5 encoder rather than a co-trained VLM, which makes RDT-1B closer in spirit to a diffusion policy than to a pure VLA, but the paper is routinely listed alongside VLAs in the literature.
The dual-system family decomposes the policy into a slow, large vision-language module that reasons over the scene and language ("System 2") and a fast, small controller that runs at sensor rates and emits motor commands ("System 1"). The split is inspired by the dual-process theory of human cognition.
Helix, announced by Figure AI in February 2025, sets System 2 as a 7-billion-parameter open-weights VLM running at 7-9 Hz, while System 1 is an 80-million-parameter cross-attention encoder-decoder transformer that emits 200 Hz continuous control for 35 degrees of freedom across the torso, head, wrists and individual fingers.[^7] Figure trained Helix on roughly 500 hours of teleoperated demonstrations annotated automatically by a VLM, and demonstrated it on multi-robot collaborative manipulation of unseen household objects.
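The timing relationship between the two systems can be made concrete with a small scheduling sketch. It is a simplified, synchronous simulation under stated assumptions: the 8 Hz slow rate is one value inside the reported 7-9 Hz range, and `slow_policy` and `fast_policy` are hypothetical placeholders for the VLM and the controller. A deployed system would more plausibly run the two modules asynchronously.

```python
def slow_policy(tick):
    """Hypothetical System 2: returns a latent goal (here just a counter)."""
    return {"goal_id": tick}

def fast_policy(latent, tick):
    """Hypothetical System 1: emits a command conditioned on the latest latent."""
    return (latent["goal_id"], tick)

def run_dual_system(duration_s, fast_hz=200, slow_hz=8):
    """Run the fast loop at fast_hz, refreshing the latent at slow_hz."""
    ticks_per_update = fast_hz // slow_hz      # fast ticks per slow update
    latent = None
    slow_updates = 0
    commands = []
    for tick in range(int(duration_s * fast_hz)):
        if tick % ticks_per_update == 0:
            latent = slow_policy(tick)         # refresh scene/language latent
            slow_updates += 1
        commands.append(fast_policy(latent, tick))
    return slow_updates, commands

slow_updates, commands = run_dual_system(1.0)
print(slow_updates, len(commands))  # 8 200
```

With these rates each latent is reused for 25 consecutive control ticks, which is why the fast module has to stay reactive on its own between updates from the slow module.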
GR00T N1, released by NVIDIA in March 2025, takes essentially the same shape but for humanoid robots and with open weights.[^6] System 2 is a fine-tuned Eagle-2 vision-language model (about 1.34 billion of the model's 2.2 billion total parameters), and System 1 is a diffusion transformer with adaptive layer-norm conditioning that emits 16-step continuous action chunks at roughly 60 millisecond latency on an NVIDIA L40 GPU.[^21] GR00T N1 is trained on a heterogeneous mix of real teleoperation, human video and synthetically generated trajectories from the GR00T-Dreams blueprint. The follow-up GR00T N1.5, announced at Computex 2025, improves new-environment generalization and language following and was further extended in Isaac GR00T N1.7 in late 2025.[^22]
A few prominent generalist policies sit on the boundary of the VLA category. Octo (Berkeley, May 2024) is a transformer-based diffusion policy with 27 million parameters (Octo-Small) or 93 million parameters (Octo-Base), trained on 800,000 trajectories from Open X-Embodiment.[^15][^23] Language enters through a frozen T5 encoder that produces a "task token", and the action head is a diffusion decoder. Most authors classify Octo as a "generalist robot policy" rather than a VLA in the strict sense, because there is no co-trained autoregressive language model in the loop.
SmolVLA (Hugging Face, June 2025) shows that the action-expert recipe scales down: a trimmed SmolVLM-2 backbone plus a small action transformer yields a 450-million-parameter model that can be trained on a single consumer GPU and that, with an asynchronous inference stack, approaches the success rate of models an order of magnitude larger on the LIBERO benchmark.[^8] SmolVLA is built on, and shipped through, the Hugging Face LeRobot library.
The following table summarizes the principal verified VLA-style models released between July 2023 and mid-2025. Open-weights status reflects the state at original release; subsequent open-source re-implementations are not counted.
| Model | Organization | First release | Backbone VLM | Action format | Total params | Open weights |
|---|---|---|---|---|---|---|
| RT-2 (PaLM-E variant) | Google DeepMind | Jul 2023 | PaLM-E | Discretized tokens, 256 bins | 12 B | No |
| RT-2 (PaLI-X variant) | Google DeepMind | Jul 2023 | PaLI-X | Discretized tokens, 256 bins | 55 B | No |
| Octo-Small / Base | UC Berkeley et al. | May 2024 | T5 (task token) | Diffusion | 27 M / 93 M | Yes (MIT) |
| OpenVLA | Stanford / TRI / UW | Jun 2024 | Prismatic VLM (Llama-2 7B + DINOv2 + SigLIP) | Discretized tokens, 256 bins | 7 B | Yes (MIT) |
| RDT-1B | Tsinghua | Oct 2024 | T5-XXL | Diffusion (unified action space) | 1.2 B | Yes |
| π₀ | Physical Intelligence | Oct 2024 | PaliGemma 3B | Flow matching, 50-step chunks | ~3.3 B | Yes (Apache 2.0, openpi) |
| π₀-FAST | Physical Intelligence | Dec 2024 / Jan 2025 | PaliGemma 3B | Autoregressive FAST tokens | ~3 B | Yes (openpi) |
| Helix | Figure AI | Feb 2025 | Open-weights VLM (S2) | Continuous, 200 Hz | 7 B (S2) + 80 M (S1) | No |
| GR00T N1-2B | NVIDIA | Mar 2025 | Eagle-2 VLM (~1.34 B) | Diffusion Transformer, 16-step chunks | 2.2 B | Yes (NVIDIA OneWay) |
| π₀.₅ | Physical Intelligence | Apr 2025 | PaliGemma 3B + co-training | Flow matching plus subtask prediction | ~3.3 B | Yes (openpi) |
| GR00T N1.5 | NVIDIA | May 2025 | Eagle-2 VLM | Diffusion Transformer | ~2.2 B | Yes |
| SmolVLA | Hugging Face | Jun 2025 | SmolVLM-2 (trimmed) | Action expert, chunked | 450 M | Yes |
Parameter counts are taken from the original papers or model cards; where the paper splits parameters across modules, the sub-totals are kept explicit.
VLA training depends on robot data far more than VLM training depends on image-text data, because high-quality robot demonstrations are much more expensive to collect than scraped web pages. By 2026 the field had standardized on a small number of consortium datasets plus proprietary in-house pools.
Open X-Embodiment (often abbreviated OXE, sometimes called the RT-X dataset) is the collaborative dataset assembled by 21 academic and industrial institutions and released alongside the RT-X paper at the 2023 Conference on Robot Learning.[^9] The October 2023 release aggregated demonstrations from 22 robot embodiments, 527 distinct skills and 160,266 tasks across more than 1 million trajectories, all converted into a standard episodic format. OpenVLA, π₀, Octo and most other open VLAs use OXE either as the primary pretraining corpus or as a co-training mixture component.
The associated paper also released RT-1-X and RT-2-X, multi-embodiment versions of the original RT-1 and RT-2, trained on the consortium data. Both demonstrated positive transfer, in the sense that adding data from other robot embodiments improved performance on the source robot's tasks.
DROID (Distributed Robot Interaction Dataset), introduced by Khazatsky and colleagues in March 2024, contributes 76,000 demonstration trajectories totalling about 350 hours of interaction across 564 scenes and 84 tasks, collected by 50 data collectors at 13 institutions over 12 months on identical Franka Panda hardware.[^10] DROID supplies three synchronized RGB streams, camera calibration, depth, and natural-language instructions for every episode, and is the de facto benchmark for "in-the-wild" Franka generalization. Physical Intelligence, in June 2025, released openpi checkpoints fine-tuned on the full DROID dataset, claiming the first models able to follow instructions on Franka platforms in entirely new environments.[^24]
BridgeData V2 (Walke et al., August 2023) contains 60,096 trajectories collected across 24 environments on the low-cost WidowX 250 platform, and is engineered specifically to support open-vocabulary, multi-task learning conditioned on goal images or language instructions.[^25] It is one of the most widely used datasets for SIMPLER and real-robot evaluations, and the WidowX subset of SIMPLER is built from BridgeData V2 scenes.
RH20T (Fang et al., July 2023) is a contact-rich manipulation dataset with more than 110,000 trajectories spanning 147 tasks, collected across multiple Franka, Kuka and UR robots with synchronized vision, force, audio and action data, plus a paired human demonstration video for each episode.[^26] RH20T is unusual in its emphasis on contact-rich skills such as cutting, plugging, pouring and folding.
Several leading VLAs are trained on proprietary in-house pools that are not released. Physical Intelligence reports collecting more than 10,000 hours of dexterous teleoperation across eight robots, including UR5e, bimanual UR5e, Franka, bimanual Trossen, bimanual ARX, and mobile Trossen / Fibocom variants.[^3] Figure trained Helix on roughly 500 hours of teleoperation on its humanoid hardware.[^7] NVIDIA augmented GR00T training with the GR00T-Dreams blueprint, which uses world-model video generation to produce synthetic robot trajectories at scale, claiming to compress what would have been three months of human teleoperation into 36 hours of synthetic data generation.[^22]
| Dataset | Year | Size (trajectories or hours) | Embodiments | Notes |
|---|---|---|---|---|
| BridgeData V2 | 2023 | ~60,000 | 1 (WidowX) | Open vocabulary, language and goal images |
| RH20T | 2023 | 110,000+ | Mixed Franka / Kuka / UR | Force and audio modalities, contact-rich |
| RT-1 dataset | 2022 | ~130,000 | 1 (Everyday Robots) | RT-1 and RT-2 training corpus |
| Open X-Embodiment | 2023 | ~1,000,000+ | 22 | Standard format, consortium of 21 institutions |
| DROID | 2024 | 76,000 | 1 (Franka Panda) | Diverse scenes, 13 institutions, 350 h |
| Physical Intelligence | 2024-25 | proprietary | 8+ | More than 10,000 h dexterous teleop |
| Figure Helix | 2025 | ~500 h | 1 (Figure 02) | Auto-annotated by VLM |
Evaluating VLAs is harder than evaluating language or vision models because the most informative outcome is a success rate on a physical robot, which is expensive and hard to reproduce. Three classes of benchmark have emerged.
LIBERO (Liu et al., June 2023) is a simulation benchmark for lifelong robot learning that procedurally generates 130 language-conditioned manipulation tasks grouped into four suites probing object, spatial, goal and long-horizon distribution shifts.[^27] By 2025 LIBERO had become the standard "first-pass" benchmark for VLAs because it is cheap to run, supports language instructions natively and yields task-success numbers that are reasonably correlated with real-robot generalization. OpenVLA, π₀, RDT-1B, GR00T N1 and SmolVLA all report LIBERO results in their papers.
SIMPLER (Simulated Manipulation Policy Evaluation in Real-to-Sim) is a fully simulated evaluation framework targeting policies trained on real data. It exposes two protocols: a "visual matching" track in which the simulator mimics the camera viewpoint and visual appearance of the real cell as closely as possible, and a "variant aggregation" track that perturbs lighting, textures and object positions to test robustness. SIMPLER was released specifically to evaluate Google Robot and WidowX + Bridge policies including RT-1-X, RT-2-X, Octo and OpenVLA, and is now extended by community forks such as SimplerEnv-OpenVLA.[^28]
Despite progress in simulation, VLA papers still report extensive real-robot evaluations. RT-2 was evaluated on Google's Everyday Robots cell. OpenVLA reported 29 tasks across multiple embodiments. π₀ was evaluated on five complex tasks including laundry folding, table bussing and bagging, and π₀.₅ reported multi-minute kitchen and bedroom cleaning in homes that were not part of training. GR00T N1 and Helix were evaluated on humanoid platforms (Fourier GR-1 and Figure 02 respectively). The lack of a shared real-robot leaderboard remains one of the field's biggest gaps; the community has responded with newer real-to-sim benchmarks such as REALM and RobotArena-infinity.
By 2026 the VLA field had matured to the point where its open problems were reasonably well agreed.
The most actively debated question is how to represent actions. The original RT-2 recipe of 256-bin discretized tokens degrades on high-frequency dexterous control, motivating the flow-matching action experts of π₀ and the diffusion transformers of RDT-1B and GR00T N1. The FAST tokenizer attempts to recover the simplicity of autoregressive decoding while restoring high-frequency expressivity. The trade-offs are still being worked out: autoregressive tokens are easy to integrate with VLM pretraining but slow at inference; flow matching is fast and smooth but harder to combine with discrete language outputs in a single model; diffusion is robust but requires several denoising steps and can be hard to tune.
Whether VLAs follow language-model-style scaling laws in robot data is unresolved. The Physical Intelligence team reports continued improvements out to more than 10,000 hours of teleoperation, but the slope of those curves is unclear, and most academic groups cannot afford comparable data collection. Synthetic data from world models (GR00T-Dreams) and from large-scale simulation (Isaac Sim, MimicGen) is the leading proposed answer, but synthetic-to-real transfer for dexterous manipulation remains imperfect.
Despite progress in physics-based simulation and in real-to-sim benchmarks such as SIMPLER and REALM, the gap between simulated success rates and real-robot success rates is still routinely 10 to 30 percentage points on the same task. Real-to-sim approaches such as RobotArena-infinity attempt to convert real-robot rollouts into simulated counterparts using neural rendering, but accuracy on contact-rich, deformable or articulated objects is still a research frontier.
VLAs have surpassed earlier generalist policies on short manipulation primitives, but multi-minute long-horizon tasks remain hard. π₀.₅ is the highest-profile demonstration of long-horizon cleaning in new homes, achieved by co-training on high-level subtask prediction, but it still depends on careful task framing. Truly dexterous in-hand manipulation, bimanual coordination on deformable objects, and tool use are at the edge of current capability.
Open X-Embodiment showed positive transfer across robot bodies, but only in limited regimes. Skild AI[^29] markets an "omni-bodied" policy capable of running on quadrupeds, humanoids, tabletop arms and mobile manipulators without prior knowledge of the body form, but the underlying claims have not yet been independently published. Cross-embodiment policies trained naively on heterogeneous bodies often underperform body-specific policies, and the right inductive bias for body abstraction is still open.
Running a multi-billion-parameter VLA in a tight closed loop at 50-200 Hz on embedded hardware is non-trivial. Helix splits the policy across a slow VLM and a fast controller specifically for this reason; SmolVLA pushes the total parameter count down to 450 million; π₀ and π₀-FAST emphasize action chunking and asynchronous inference. Quantization, distillation and speculative decoding are active engineering frontiers. Safety, robustness to adversarial scenes, and the ability to recover from out-of-distribution states are also flagged in essentially every deployment writeup.
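The arithmetic behind action chunking and asynchronous inference can be worked through in a short sketch. The 50-step chunk and 50 Hz control rate are the π₀ figures quoted earlier in this article; the 100 ms inference latency and the function names are assumptions made for illustration only.

```python
import math

def chunk_duration_s(horizon_steps, control_hz):
    """Seconds of motion covered by one action chunk."""
    return horizon_steps / control_hz

def steps_consumed_during_inference(latency_ms, control_hz):
    """Control steps that elapse while the next chunk is being computed."""
    return math.ceil(latency_ms * control_hz / 1000)

def latest_request_step(horizon_steps, latency_ms, control_hz):
    """Last step of the current chunk at which the next chunk can still be
    requested without the robot running out of actions."""
    return horizon_steps - steps_consumed_during_inference(latency_ms, control_hz)

HORIZON = 50        # steps per chunk (pi-0 figure from the text)
CONTROL_HZ = 50     # control rate in Hz (pi-0 figure from the text)
LATENCY_MS = 100    # assumed model latency, not a reported number

duration = chunk_duration_s(HORIZON, CONTROL_HZ)
consumed = steps_consumed_during_inference(LATENCY_MS, CONTROL_HZ)
deadline = latest_request_step(HORIZON, LATENCY_MS, CONTROL_HZ)
print(duration, consumed, deadline)  # 1.0 5 45
```

Under these assumptions one forward pass buys a full second of motion, so the model needs to run only about once per second rather than 50 times. Requesting the next chunk by step 45 hides the latency entirely.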
A subtler but increasingly visible problem is that real-robot evaluations published in different papers are not directly comparable. Cell geometries, lighting, object sets, prompt wording and even the seed used to randomize start poses can shift a reported success rate by 20 percentage points or more. Workshops such as the CoRL Real Robot Challenge and the RoboCup-style competitions have proposed standardized cells, but no community-wide protocol comparable to ImageNet for vision or MMLU for language has emerged. The SIMPLER and REALM benchmarks partially address this by fixing simulation environments, yet the simulation-to-real gap means that good SIMPLER numbers do not always imply good real-robot performance, and vice versa. Building a trusted real-robot leaderboard is widely regarded as a precondition for the field's continued empirical maturation.
Closed-loop foundation-model policies inherit the failure modes of their underlying VLMs. Reported issues include hallucinated affordances (predicting actions for objects that are not actually in the workspace), instruction misinterpretation, and prompt-injection-style vulnerabilities in which printed text in the scene overrides the user instruction. Several 2025 papers also document that VLAs occasionally output unsafe joint targets when the camera is occluded or when the prompt is adversarial; mitigations include constrained decoding, separate safety filters and conservative action chunking. Because VLAs are now deployed on humanoid hardware with substantial mass and reach, even rare failure modes can cause property damage, and the field is starting to converge on the view that safety-critical behaviour should be enforced by a non-learned controller layer beneath the VLA rather than by the VLA itself.
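A non-learned safety layer of the kind described above can be as simple as clamping every commanded joint target before it reaches the motors. The following is a minimal sketch under assumed limits; the joint ranges, the per-step rate limit and the function name are hypothetical, and a production controller would add checks such as collision and torque limits.

```python
def safety_filter(target, current, lower, upper, max_step):
    """Clamp a policy's joint targets to rate and position limits.

    target:   joint positions requested by the learned policy (rad)
    current:  measured joint positions (rad)
    lower, upper: per-joint position limits (rad)
    max_step: largest allowed change per control step (rad)
    """
    safe = []
    for q_target, q_current, lo, hi in zip(target, current, lower, upper):
        # Rate limit: never move further than max_step in one control step.
        step = min(max(q_target - q_current, -max_step), max_step)
        # Position limit: never leave the allowed joint range.
        safe.append(min(max(q_current + step, lo), hi))
    return safe

# Hypothetical 3-joint example.
current = [0.0, 1.5, -0.2]
requested = [3.0, 1.6, -0.25]    # first target is far outside the limits
lower = [-1.57, -1.57, -1.57]
upper = [1.57, 1.57, 1.57]

safe = safety_filter(requested, current, lower, upper, max_step=0.1)
print(safe)
```

Because the filter is deterministic and independent of the network, its guarantees hold even when the policy output is corrupted by occlusion or an adversarial prompt.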
VLAs moved from research demos into commercial deployments faster than most prior generations of robot learning. By mid-2026 the principal players, all backed by primary sources, were:
Several other startups, Google DeepMind's Gemini Robotics team, and Tesla's Optimus program publish episodic demos but have not released primary architecture papers comparable to the ones above.
The VLA open-source stack in 2026 has three tightly coupled layers.
Model checkpoints: OpenVLA, openpi (π₀, π₀-FAST, π₀.₅), GR00T N1 / N1.5 / N1.7, Octo, RDT-1B and SmolVLA are all released with weights and inference code. OpenVLA and Octo use MIT-style licenses; openpi uses Apache 2.0; NVIDIA uses a custom OneWay license for GR00T models. All major checkpoints are mirrored on Hugging Face Hub.
Training and fine-tuning libraries: The Hugging Face LeRobot library is the most general-purpose, providing PyTorch implementations of diffusion policies, OpenVLA, π₀, π₀-FAST, π₀.₅ and SmolVLA together with the LeRobotDataset format.[^30] The openpi repository[^24] provides both the original JAX implementations and a 2025 PyTorch port. The OpenVLA project supplies LoRA and quantization recipes specifically aimed at consumer GPUs.[^2][^18]
Datasets: The LeRobotDataset format on the Hugging Face Hub has become a de facto standard for community-collected robot data; it stores videos as MP4 and per-step state and action as Parquet files. Open X-Embodiment remains the standard pretraining corpus, with DROID, BridgeData V2 and RH20T as the most common single-platform fine-tuning sources.
The ecosystem also extends to affordable open-source robot hardware: Hugging Face's LeRobot project distributes designs for the SO-100 arm and other low-cost platforms, with the goal of getting VLA-capable hardware into university and hobbyist hands at sub-thousand-dollar price points.