Vision-language-action model

Updated Jul 23, 2026

A vision-language-action model (VLA model, or VLA) is a class of foundation model for robotic control that takes one or more RGB camera images and a natural-language instruction as input and directly produces robot actions as output, typically expressed as end-effector pose deltas, joint-space targets, or tokenized action sequences. In short, a VLA turns a vision-language model into a closed-loop robot policy: it adds an action head or action expert so that internet-scale pretraining on image-text data can be transferred to manipulation and locomotion. The term was coined in July 2023 by Brohan and colleagues at Google DeepMind in the paper that introduced RT-2, which stated plainly: "We refer to such category of models as vision-language-action models (VLA)."^[1]

One widely cited result illustrates why VLAs caught on: OpenVLA, a 7-billion-parameter open model, reported "outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters."^[2] By 2026 VLAs had become the dominant architecture for general-purpose manipulation research. Open-weights releases such as OpenVLA (Stanford, Toyota Research Institute and the University of Washington, June 2024)^[2], the openpi family from Physical Intelligence (π₀ in October 2024, π₀-FAST in January 2025, π₀.₅ in April 2025)^[3]^[4]^[5], NVIDIA's GR00T N1 for humanoids (March 2025)^[6], Figure AI's Helix dual-system policy (February 2025)^[7] and Hugging Face's lightweight SmolVLA (June 2025)^[8] together turned VLAs from a single Google research demo into a competitive open ecosystem trained on hundreds of thousands of teleoperated robot demonstrations and on consortium datasets such as Open X-Embodiment^[9] and DROID^[10].

What is a VLA used for?

A VLA is used as the end-to-end control policy for a robot performing language-specified tasks. A user (or a higher-level planner) provides a natural-language instruction such as "pick up the bag of chips" or "fold the towel," the model observes the scene through one or more cameras, and it emits the low-level actions that drive the robot's joints or gripper in a closed loop. The dominant application is general-purpose tabletop and bimanual manipulation (sorting, bussing tables, folding laundry, packing), with a fast-growing branch in humanoid whole-body control. Because a VLA inherits the visual grounding and language understanding of its underlying vision-language model, it can generalize to objects, backgrounds and phrasings it never saw during robot training, which is the property that distinguishes it from earlier task-specific manipulation policies.^[1]^[2]

History

Pre-VLA generalist policies

Modern VLA work has three identifiable predecessors in 2022 and early 2023, none of which were marketed under the "vision-language-action" label at the time.

Gato, introduced by DeepMind in May 2022 with about 1.18 billion parameters, was a single decoder-only transformer trained on 604 tasks spanning Atari games, image captioning, dialogue, simulated 3D navigation and real robot-arm control, with all modalities serialized into a shared token stream.^[11] Gato established the premise that a single transformer with the same weights could output tokens for joystick inputs, robotic gripper commands and natural language. It did not, however, leverage a strong pretrained vision-language model: it was trained from scratch on a curated cocktail of supervised and behaviour-cloning data.

RT-1, by Brohan and colleagues at Google in December 2022, was a 35-million-parameter Robotics Transformer trained on roughly 130,000 episodes of teleoperated Everyday Robots manipulation across 700+ tasks.^[12] RT-1 tokenized images with EfficientNet, fused them with a language embedding via FiLM conditioning, and emitted discretized end-effector actions. RT-1 is widely credited with showing that supervised behaviour cloning at scale could deliver high task success across many skills and with introducing the action discretization recipe later inherited by RT-2 and OpenVLA.

PaLM-E, published by Google in March 2023, was an "embodied multimodal language model" that integrated the 540-billion-parameter PaLM language model with a 22-billion-parameter Vision Transformer, producing a 562-billion-parameter model.^[13] PaLM-E injected continuous image and state observations into the language embedding space, generated high-level plans in text, and delegated low-level control to RT-1. It was a step toward foundation-model-based robotics but did not directly emit actions itself, blurring the boundary between a vision-language model and a VLA.

When was the term coined? RT-2 (July 2023)

RT-2, published as arXiv preprint 2307.15818 on 28 July 2023 by a 54-author team led by Brohan, was the paper that explicitly defined the "vision-language-action" category.^[1] Two backbones were tried: a 12-billion-parameter variant built on PaLM-E and a 55-billion-parameter variant built on the PaLI-X vision-language model.^[14] The defining engineering trick was to represent each dimension of the 7-DoF end-effector action (x, y, z, roll, pitch, yaw, gripper) as a single integer in 0-255, then reserve 256 unused tokens in the language model's vocabulary to encode those integers. With this representation, fine-tuning a VLM on co-mingled internet image-text data and robot demonstrations produced a single decoder that could either answer a visual question or emit an action sequence depending on the prompt.

The headline empirical claim was that, in the authors' words, RT-2 showed "significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data ... and the ability to perform rudimentary reasoning in response to user commands."^[1] In practice this meant chain-of-thought planning at the level of "pick up the extinct animal" (which the model resolved to a dinosaur figurine). RT-2 was never released as open weights and was tied to Google's internal Everyday Robots hardware, but the paper's framing and terminology proved durable: by 2024 essentially every robot-foundation-model release positioned itself as a VLA or as a competitor to a VLA.

The 2024-2026 explosion

After RT-2, VLAs proliferated along three roughly parallel branches. One branch kept the decoder-only "actions as tokens" recipe and pushed it into open weights, exemplified by Octo (May 2024) and OpenVLA (June 2024). A second branch replaced the discrete action token head with a continuous action expert trained by flow matching or diffusion. This branch, originating with Physical Intelligence's π₀ in October 2024 and continued by RDT-1B (October 2024), Helix (February 2025) and GR00T N1 (March 2025), targeted higher-frequency dexterous control where 256-bin discretization breaks down. A third branch, the dual-system or "system 1 / system 2" architecture promoted by Figure AI and NVIDIA, factored the policy into a slow VLM that reasons over scene and language and a fast neural controller that emits motor commands. By 2026 these branches were converging: π₀.₅ adds high-level subtask prediction on top of a flow-matching action expert, and GR00T N1 and Helix both bolt a small action transformer onto a frozen or lightly tuned VLM backbone.

The field moved quickly. In the eighteen months between OpenVLA's June 2024 release and the GR00T N1.7 announcement in late 2025, at least a dozen open-weights VLAs and roughly twice as many closed ones were released. Over the same period the median size of an open VLA fell from about seven billion parameters to under a billion, as engineers learned which parts of a generic VLM could be discarded once the action expert took over the dexterous-control workload. Concurrent advances in synthetic data pipelines, low-cost teleoperation hardware, and PyTorch-based training infrastructure such as LeRobot made it feasible for academic labs and even individuals to train serviceable VLAs from scratch on a single workstation GPU, whereas RT-2 had required a hyperscaler's dedicated TPU pod.

Definition and scope

The most common working definition, traceable directly to the RT-2 paper, is that a VLA is a model satisfying three properties:

Vision input: it consumes one or more RGB images (sometimes augmented with depth or wrist-camera views).
Language input: it consumes a natural-language instruction or goal description.
Action output: it emits a sequence of robot actions either as discrete tokens, as continuous vectors, or as a parameterized trajectory, such that the model can be deployed as a closed-loop policy on a physical or simulated robot.
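
Read as an interface contract, the three properties amount to "images plus an instruction in, a short sequence of actions out, called repeatedly in a loop." The following is a minimal sketch of that contract, not any library's API: `Observation`, `VLAPolicy`, `ZeroPolicy` and `run_closed_loop` are hypothetical names introduced here for illustration, and the 7-number action layout is an assumption borrowed from the end-effector format described elsewhere in this article.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence

# Hypothetical types for illustration; not taken from any VLA library.
Image = Sequence[Sequence[Sequence[int]]]  # H x W x 3 RGB values as nested sequences
Action = List[float]                       # e.g. x, y, z, roll, pitch, yaw, gripper


@dataclass
class Observation:
    images: List[Image]  # vision input: one or more RGB views
    instruction: str     # language input: natural-language goal


class VLAPolicy(Protocol):
    def act(self, obs: Observation) -> List[Action]:
        """Return a short sequence (chunk) of actions for this observation."""
        ...


class ZeroPolicy:
    """Stand-in policy that always outputs 'do nothing' actions."""

    def __init__(self, action_dim: int = 7, chunk_len: int = 4) -> None:
        self.action_dim = action_dim
        self.chunk_len = chunk_len

    def act(self, obs: Observation) -> List[Action]:
        return [[0.0] * self.action_dim for _ in range(self.chunk_len)]


def run_closed_loop(
    policy: VLAPolicy,
    get_obs: Callable[[], Observation],
    apply_action: Callable[[Action], None],
    max_steps: int,
) -> int:
    """Observe, predict a chunk, execute it, then observe again."""
    steps = 0
    while steps < max_steps:
        chunk = policy.act(get_obs())
        for action in chunk:
            if steps >= max_steps:
                break
            apply_action(action)
            steps += 1
    return steps
```

Any model that satisfies the three properties could, in principle, be wrapped behind an `act` method like this one; what varies between VLA families is how the chunk is produced.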

In addition, most authors require that the model be built on top of a pretrained vision-language model, so that internet-scale visual grounding is inherited rather than learned from robot data alone. That criterion is what distinguishes RT-2 and its successors from Gato or RT-1, which were trained from scratch.

How does a VLA differ from a vision-language model?

A vision-language model (VLM) maps images and text to text: it answers questions, captions images and follows instructions in language. A VLA extends a VLM with an action interface so that the same network, or a tightly coupled action expert beside it, also outputs robot actions. The distinction is the third modality: a VLM stops at language, while a VLA closes the loop on a physical robot by predicting low-level controls. Most modern VLAs are literally fine-tuned or co-trained from a VLM checkpoint (RT-2 from PaLI-X and PaLM-E, OpenVLA from a Llama-2-based Prismatic VLM, π₀ from PaliGemma), which is why they inherit the VLM's open-vocabulary grounding.^[1]^[2]^[3]

The boundary cases are themselves informative. Octo^[15] is sometimes excluded from the VLA category because it conditions on a "task token" produced by a small language encoder rather than processing free-form text through a full LLM, and its diffusion action head sits on top of a transformer trained from scratch on robot data. RDT-1B^[16] is sometimes excluded because language enters through a T5 encoder rather than via a co-trained autoregressive VLM. World models such as 1X's video-based imagination model^[17] and DeepMind's Genie family are not VLAs because they predict next-frame imagery rather than robot actions. These boundary debates matter less than the underlying engineering choices, which the architecture section addresses next.

Architecture patterns

VLAs split into three broad architectural families, each tied to a different action representation, plus a set of boundary cases that fit none of them cleanly.

Decoder-only with discretized action tokens

The decoder-only family inherits RT-2's recipe: a single autoregressive transformer, initialized from a pretrained VLM, predicts action tokens after the image and instruction tokens. Each action dimension is quantized into 256 bins and assigned a token in the vocabulary, so action prediction is reduced to next-token prediction.
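
The quantization step can be sketched in a few lines. This is an illustrative version under stated assumptions: the per-dimension bounds `low` and `high` (here `[-1, 1]`) and the function names are hypothetical, and the mapping from bin index to a concrete vocabulary token id is model-specific and omitted.

```python
from typing import List, Sequence

NUM_BINS = 256  # one bin per reserved action token


def discretize(
    action: Sequence[float], low: Sequence[float], high: Sequence[float]
) -> List[int]:
    """Map each continuous action dimension to an integer bin in [0, 255]."""
    bins = []
    for a, lo, hi in zip(action, low, high):
        clipped = min(max(a, lo), hi)             # keep the value inside the bounds
        frac = (clipped - lo) / (hi - lo)         # normalize to [0, 1]
        bins.append(min(int(frac * NUM_BINS), NUM_BINS - 1))
    return bins


def undiscretize(
    bins: Sequence[int], low: Sequence[float], high: Sequence[float]
) -> List[float]:
    """Map each bin index back to the centre of its bin."""
    return [lo + (b + 0.5) / NUM_BINS * (hi - lo) for b, lo, hi in zip(bins, low, high)]


# Assumed bounds for a 7-DoF action: x, y, z, roll, pitch, yaw, gripper.
LOW = [-1.0] * 7
HIGH = [1.0] * 7

action = [0.30, -0.70, 0.05, 0.0, 0.0, 0.25, 1.0]
tokens = discretize(action, LOW, HIGH)
recovered = undiscretize(tokens, LOW, HIGH)
```

The round trip is lossy: each recovered value can differ from the original by up to half a bin width, which is the resolution limit that the continuous action heads discussed below are designed to avoid.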

RT-2 uses this scheme on a 12B PaLM-E or 55B PaLI-X backbone.^[1]^[14] OpenVLA, the open-source counterpart by Kim, Pertsch and colleagues, fine-tunes a Llama-2 7B language model paired with a fused DINOv2 plus SigLIP visual encoder on 970,000 manipulation episodes drawn from Open X-Embodiment, and reports outperforming RT-2-X by 16.5 absolute percentage points in success rate while using roughly one-seventh of the parameters.^[2]^[18]

The advantage of this design is that it requires almost no architectural change relative to a standard VLM, so it benefits directly from progress in language and vision-language pretraining. The disadvantage is that fine-grained dexterous control suffers from the coarse 256-bin discretization and from the high latency of autoregressively decoding several tokens per action. The FAST tokenizer from Pertsch, Stachowicz and colleagues, published in January 2025, addresses both problems by applying a discrete cosine transform to action chunks before tokenization, achieving up to a fivefold reduction in training time and substantial gains in dexterous task performance.^[19] The π₀-FAST model from Physical Intelligence is an autoregressive VLA built directly on this tokenizer.^[4]^[19]
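
Why a discrete cosine transform helps can be seen on a toy action chunk: a smooth trajectory concentrates its energy in a few low-frequency coefficients, so most coefficients round to zero. The sketch below is a pure-Python illustration of that idea only. The scale-and-round step and its `scale` value are assumptions made for the example, and any further compression stages of the published tokenizer are omitted.

```python
import math
from typing import List


def dct(x: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1-D signal."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct(c: List[float]) -> List[float]:
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def compress(chunk: List[float], scale: float = 100.0) -> List[int]:
    """Quantize DCT coefficients to integers; small ones become zero."""
    return [round(scale * c) for c in dct(chunk)]


def decompress(q: List[int], scale: float = 100.0) -> List[float]:
    """Undo the scaling and invert the transform."""
    return idct([v / scale for v in q])


# One action dimension over a 50-step chunk: an offset plus one slow cosine sweep.
N = 50
chunk = [0.2 + 0.1 * math.cos(math.pi * (i + 0.5) / N) for i in range(N)]
coeffs = compress(chunk)
nonzero = sum(1 for c in coeffs if c != 0)
restored = decompress(coeffs)
```

For this particular chunk only the first two quantized coefficients are non-zero, so 50 per-step values collapse to two integers, whereas per-step binning would spend one token on every step.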

VLM backbone plus action expert (flow matching or diffusion)

The action-expert family freezes or lightly fine-tunes a pretrained VLM and adds a separate, smaller transformer that emits continuous action chunks. The new module is trained as a conditional flow-matching or diffusion model: at training time the action expert is shown noisy actions and predicts the denoising velocity field, and at inference time it integrates that field for a handful of steps to produce a smooth action trajectory.
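
The training target and the inference-time integration can be sketched as follows. This is a toy illustration under assumptions, not the recipe of any specific model: it uses a straight-line path from Gaussian noise (t = 0) to the action (t = 1), plain Euler integration, and an `oracle_velocity` function that stands in for the trained network by returning the exact field for a single known target action. All function names are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Vector = List[float]


def training_pair(action: Vector, rng: random.Random) -> Tuple[Vector, float, Vector]:
    """Build one training example: (noisy action, time, target velocity).

    Path: x_t = (1 - t) * noise + t * action, so the velocity is action - noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = [(1 - t) * n + t * a for n, a in zip(noise, action)]
    velocity = [a - n for n, a in zip(noise, action)]
    return x_t, t, velocity


def sample(
    velocity_fn: Callable[[Vector, float], Vector],
    dim: int,
    steps: int,
    rng: random.Random,
) -> Vector:
    """Integrate the velocity field from noise (t=0) to an action (t=1)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for a trained network: exact field when there is one target."""
    def fn(x: Vector, t: float) -> Vector:
        return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]
    return fn


rng = random.Random(0)
target = [0.1, -0.2, 0.3]
predicted = sample(oracle_velocity(target), dim=3, steps=10, rng=rng)
```

In a real action expert the velocity function is a transformer conditioned on the VLM's features, and the output is a whole chunk of future actions rather than a single vector; the integration loop has the same shape.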

π₀ by Physical Intelligence (October 2024) is the canonical example. It uses Google's PaliGemma 3B as the VLM backbone and bolts on a roughly 300-million-parameter action expert that emits 50-step continuous action chunks via flow matching, supporting up to 50 Hz dexterous control.^[3]^[20] The model is trained on a proprietary Physical Intelligence dataset covering eight robot embodiments, mixed with Open X-Embodiment. π₀.₅ (April 2025) keeps the flow-matching head but co-trains on web data, high-level subtask prediction and verbal instructions in addition to robot data, and is reported to perform multi-minute kitchen and bedroom cleaning tasks in homes that were never seen during training.^[5]

RDT-1B (Liu et al., Tsinghua, October 2024) is a 1.2-billion-parameter diffusion transformer with a "physically interpretable unified action space" that lets a single model fine-tune to many bimanual robots without rewriting the action head.^[16] Language enters via a T5 encoder rather than a co-trained VLM, which makes RDT-1B closer in spirit to a diffusion policy than to a pure VLA, but the paper is routinely listed alongside VLAs in the literature.

Dual-system or hierarchical (System 1 plus System 2)

The dual-system family decomposes the policy into a slow, large vision-language module that reasons over the scene and language ("System 2") and a fast, small controller that runs at sensor rates and emits motor commands ("System 1"). The split is inspired by the dual-process theory of human cognition.

Helix, announced by Figure AI in February 2025, sets System 2 as a 7-billion-parameter open-weights VLM running at 7-9 Hz, while System 1 is an 80-million-parameter cross-attention encoder-decoder transformer that emits 200 Hz continuous control for 35 degrees of freedom across the torso, head, wrists and individual fingers.^[7] Figure trained Helix on roughly 500 hours of teleoperated demonstrations annotated automatically by a VLM, and demonstrated it on multi-robot collaborative manipulation of unseen household objects.
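
The rate split can be illustrated with a tick-based simulation in which the fast controller runs every tick and reuses the most recent output of the slow module. This is a hypothetical, synchronous sketch: real systems run the two modules concurrently on separate compute, and `run_dual_system`, `slow_fn` and `fast_fn` are names invented for the example. The 200 Hz and 8 Hz values are taken from the rates quoted above.

```python
from typing import Any, Callable, List, Tuple


def run_dual_system(
    slow_fn: Callable[[int], Any],       # stand-in for the VLM ("System 2")
    fast_fn: Callable[[Any, int], Any],  # stand-in for the controller ("System 1")
    fast_hz: float,
    slow_hz: float,
    duration_s: float,
) -> Tuple[List[Any], int]:
    """Return the fast controller's commands and the number of slow-module calls."""
    ticks_per_slow = max(1, round(fast_hz / slow_hz))
    total_ticks = int(fast_hz * duration_s)
    latent = None
    commands = []
    slow_calls = 0
    for tick in range(total_ticks):
        if tick % ticks_per_slow == 0:
            latent = slow_fn(tick)  # refresh the scene/language latent
            slow_calls += 1
        commands.append(fast_fn(latent, tick))  # every tick uses the latest latent
    return commands, slow_calls


commands, slow_calls = run_dual_system(
    slow_fn=lambda tick: tick,                   # latent records when it was computed
    fast_fn=lambda latent, tick: (latent, tick),
    fast_hz=200.0,
    slow_hz=8.0,
    duration_s=1.0,
)
```

At these rates the fast controller issues 25 commands per slow-module update, which is why the slow module can be large without limiting the control frequency.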

GR00T N1, released by NVIDIA in March 2025, takes essentially the same shape but for humanoid robots and with open weights.^[6] System 2 is a fine-tuned Eagle-2 vision-language model (about 1.34 billion of the model's 2.2 billion total parameters), and System 1 is a diffusion transformer with adaptive layer-norm conditioning that emits 16-step continuous action chunks at roughly 60 millisecond latency on an NVIDIA L40 GPU.^[21] GR00T N1 is trained on a heterogeneous mix of real teleoperation, human video and synthetically generated trajectories from the GR00T-Dreams blueprint. The follow-up GR00T N1.5, announced at Computex 2025, improves new-environment generalization and language following and was further extended in Isaac GR00T N1.7 in late 2025.^[22]

Other patterns

A few prominent generalist policies sit on the boundary of the VLA category. Octo (Berkeley, May 2024) is a transformer-based diffusion policy with 27 million parameters (Octo-Small) or 93 million parameters (Octo-Base), trained on 800,000 trajectories from Open X-Embodiment.^[15]^[23] Language enters through a frozen T5 encoder that produces a "task token", and the action head is a diffusion decoder. Most authors classify Octo as a "generalist robot policy" rather than a VLA in the strict sense, because there is no co-trained autoregressive language model in the loop.

SmolVLA (Hugging Face, June 2025) shows that the action-expert recipe scales down: a trimmed SmolVLM-2 backbone plus a small action transformer (about 100 million of the model's parameters) yields a 450-million-parameter model that can be trained on a single consumer GPU and deployed on consumer-grade GPUs or even CPUs.^[8] With an asynchronous inference stack that decouples action prediction from execution, the authors report roughly 30% faster response and about 2x task throughput, letting SmolVLA approach the success rate of models an order of magnitude larger on the LIBERO benchmark.^[8] SmolVLA is built on, and shipped through, the Hugging Face LeRobot library.
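
The benefit of decoupling prediction from execution can be seen in a toy tick simulation that counts how often the robot has no action to execute. This is a sketch under simplifying assumptions, not the SmolVLA implementation: inference takes a fixed number of ticks, a newly arrived chunk is simply appended to the queue, and the function name, `threshold` parameter and numbers are all hypothetical.

```python
from typing import Optional


def count_idle_ticks(
    chunk_len: int,
    latency_ticks: int,
    total_ticks: int,
    asynchronous: bool,
    threshold: int = 0,
) -> int:
    """Count ticks on which the action queue is empty.

    Synchronous: a new chunk is requested only once the queue is empty.
    Asynchronous: a new chunk is requested when the queue drops to `threshold`.
    """
    limit = threshold if asynchronous else 0
    queued = 0                     # actions waiting to be executed
    pending: Optional[int] = None  # ticks until the in-flight chunk arrives
    idle = 0
    for _ in range(total_ticks):
        if pending == 0:           # the requested chunk has arrived
            queued += chunk_len
            pending = None
        if pending is None and queued <= limit:
            pending = latency_ticks  # start a new inference call
        if queued > 0:
            queued -= 1            # execute one action
        else:
            idle += 1              # nothing to execute this tick
        if pending is not None:
            pending -= 1
    return idle


sync_idle = count_idle_ticks(10, 3, 100, asynchronous=False)
async_idle = count_idle_ticks(10, 3, 100, asynchronous=True, threshold=3)
```

With the threshold set equal to the inference latency, the asynchronous variant idles only during start-up, while the synchronous variant stalls before every chunk.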

Survey of major VLA models

The following table summarizes the principal verified VLA-style models released between July 2023 and mid-2025. Open-weights status reflects the state at original release; subsequent open-source re-implementations are not counted.

Model	Organization	First release	Backbone VLM	Action format	Total params	Open weights
RT-2 (PaLM-E variant)	Google DeepMind	Jul 2023	PaLM-E	Discretized tokens, 256 bins	12 B	No
RT-2 (PaLI-X variant)	Google DeepMind	Jul 2023	PaLI-X	Discretized tokens, 256 bins	55 B	No
Octo-Small / Base	UC Berkeley et al.	May 2024	T5 (task token)	Diffusion	27 M / 93 M	Yes (MIT)
OpenVLA	Stanford / TRI / UW	Jun 2024	Prismatic VLM (Llama-2 7B + DINOv2 + SigLIP)	Discretized tokens, 256 bins	7 B	Yes (MIT)
RDT-1B	Tsinghua	Oct 2024	T5-XXL	Diffusion (unified action space)	1.2 B	Yes
π₀	Physical Intelligence	Oct 2024	PaliGemma 3B	Flow matching, 50-step chunks	~3.3 B	Yes (Apache 2.0, openpi)
π₀-FAST	Physical Intelligence	Dec 2024 / Jan 2025	PaliGemma 3B	Autoregressive FAST tokens	~3 B	Yes (openpi)
Helix	Figure AI	Feb 2025	Open-weights VLM (S2)	Continuous, 200 Hz	7 B (S2) + 80 M (S1)	No
GR00T N1-2B	NVIDIA	Mar 2025	Eagle-2 VLM (~1.34 B)	Diffusion Transformer, 16-step chunks	2.2 B	Yes (NVIDIA OneWay)
π₀.₅	Physical Intelligence	Apr 2025	PaliGemma 3B + co-training	Flow matching plus subtask prediction	~3.3 B	Yes (openpi)
GR00T N1.5	NVIDIA	May 2025	Eagle-2 VLM	Diffusion Transformer	~2.2 B	Yes
SmolVLA	Hugging Face	Jun 2025	SmolVLM-2 (trimmed)	Action expert, chunked	450 M	Yes

Parameter counts are taken from the original papers or model cards; where the paper splits parameters across modules, the sub-totals are kept explicit.

Training datasets

Robot data is a tighter bottleneck for VLA training than image-text data is for VLM training, because high-quality robot demonstrations are much more expensive to collect than scraped web pages. By 2026 the field had standardized on a small number of consortium datasets plus proprietary in-house pools.

Open X-Embodiment

Open X-Embodiment (often abbreviated OXE, sometimes called the RT-X dataset) is the collaborative dataset assembled by 21 academic and industrial institutions and released alongside the RT-X paper at the 2023 Conference on Robot Learning.^[9] The October 2023 release aggregated demonstrations from 22 robot embodiments, 527 distinct skills and 160,266 tasks across more than 1 million trajectories, pooled from 60 existing robot datasets and converted into a standard episodic format.^[9] OpenVLA, π₀, Octo and most other open VLAs use OXE either as the primary pretraining corpus or as a co-training mixture component.

The associated paper also released RT-1-X and RT-2-X, multi-embodiment versions of the original RT-1 and RT-2, trained on the consortium data. Both demonstrated positive transfer in the sense that adding data from foreign robot bodies improved performance on the source robot's tasks.

DROID

DROID (Distributed Robot Interaction Dataset), introduced by Khazatsky and colleagues in March 2024, contributes 76,000 demonstration trajectories totalling about 350 hours of interaction across 564 scenes and 84 tasks, collected by 50 data collectors at 13 institutions over 12 months on identical Franka Panda hardware.^[10] DROID supplies three synchronized RGB streams, camera calibration, depth, and natural-language instructions for every episode, and is the de facto benchmark for "in-the-wild" Franka generalization. Physical Intelligence, in June 2025, released openpi checkpoints fine-tuned on the full DROID dataset, claiming the first models able to follow instructions on Franka platforms in entirely new environments.^[24]

BridgeData V2

BridgeData V2 (Walke et al., August 2023) contains 60,096 trajectories collected across 24 environments on the low-cost WidowX 250 platform, and is engineered specifically to support open-vocabulary, multi-task learning conditioned on goal images or language instructions.^[25] It is one of the most widely used datasets for SIMPLER and real-robot evaluations, and the WidowX subset of SIMPLER is built from BridgeData V2 scenes.

RH20T

RH20T (Fang et al., July 2023) is a contact-rich manipulation dataset with more than 110,000 trajectories spanning 147 tasks, collected across multiple Franka, Kuka and UR robots with synchronized vision, force, audio and action data, plus a paired human demonstration video for each episode.^[26] RH20T is unusual in its emphasis on contact-rich skills such as cutting, plugging, pouring and folding.

Proprietary datasets

Several leading VLAs are trained on proprietary in-house pools that are not released. Physical Intelligence reports collecting more than 10,000 hours of dexterous teleoperation across eight robots, including UR5e, bimanual UR5e, Franka, bimanual Trossen, bimanual ARX, and mobile Trossen / Fibocom variants.^[3] Figure trained Helix on roughly 500 hours of teleoperation on its humanoid hardware.^[7] NVIDIA augmented GR00T training with the GR00T-Dreams blueprint, which uses world-model video generation to produce synthetic robot trajectories at scale, claiming to compress what would have been three months of human teleoperation into 36 hours of synthetic data generation.^[22]

Comparative summary

Dataset	Year	Trajectories	Embodiments	Notes
BridgeData V2	2023	~60,000	1 (WidowX)	Open vocabulary, language and goal images
RH20T	2023	110,000+	Mixed Franka / Kuka / UR	Force and audio modalities, contact-rich
RT-1 dataset	2022	~130,000	1 (Everyday Robots)	RT-1 and RT-2 training corpus
Open X-Embodiment	2023	~1,000,000+	22	Standard format, consortium of 21 institutions
DROID	2024	76,000	1 (Franka Panda)	Diverse scenes, 13 institutions, 350 h
Physical Intelligence	2024-25	proprietary	8+	More than 10,000 h dexterous teleop
Figure Helix	2025	~500 h	1 (Figure 02)	Auto-annotated by VLM

Evaluation benchmarks

Evaluating VLAs is harder than evaluating language or vision models because the most informative outcome is a success rate on a physical robot, which is expensive and hard to reproduce. Three classes of benchmark have emerged.

LIBERO

LIBERO (Liu et al., June 2023) is a simulation benchmark for lifelong robot learning that procedurally generates 130 language-conditioned manipulation tasks grouped into four suites probing object, spatial, goal and long-horizon distribution shifts.^[27] By 2025 LIBERO had become the standard "first-pass" benchmark for VLAs because it is cheap to run, supports language instructions natively and yields task-success numbers that are reasonably correlated with real-robot generalization. OpenVLA, π₀, RDT-1B, GR00T N1 and SmolVLA all report LIBERO results in their papers.

SIMPLER

SIMPLER (Simulated Manipulation Policy Evaluation in Real-to-Sim) is a fully simulated evaluation framework targeting policies trained on real data. It exposes two protocols: a "visual matching" track in which the simulator mimics the camera viewpoint and visual appearance of the real cell as closely as possible, and a "variant aggregation" track that perturbs lighting, textures and object positions to test robustness. SIMPLER was released specifically to evaluate Google Robot and WidowX + Bridge policies including RT-1-X, RT-2-X, Octo and OpenVLA, and is now extended by community forks such as SimplerEnv-OpenVLA.^[28]

Real-robot evaluations

Despite progress in simulation, VLA papers still report extensive real-robot evaluations. RT-2 was evaluated on Google's Everyday Robots cell. OpenVLA reported 29 tasks across multiple embodiments. π₀ was evaluated on five complex tasks including laundry folding, table bussing and bagging, and π₀.₅ reported multi-minute kitchen and bedroom cleaning in homes that were not part of training. GR00T N1 and Helix were evaluated on humanoid platforms (Fourier GR-1 and Figure 02 respectively). The lack of a shared real-robot leaderboard remains one of the field's biggest gaps; the community has responded with newer real-to-sim benchmarks such as REALM and RobotArena-infinity.

Open challenges

By 2026 the VLA field has matured to the point where there is broad agreement on what its open problems are.

Action representation

The most actively debated question is how to represent actions. The original RT-2 recipe of 256-bin discretized tokens degrades on high-frequency dexterous control, motivating the flow-matching action experts of π₀ and the diffusion transformers of RDT-1B and GR00T N1. The FAST tokenizer attempts to recover the simplicity of autoregressive decoding while restoring high-frequency expressivity. The trade-offs are still being worked out: autoregressive tokens are easy to integrate with VLM pretraining but slow at inference; flow matching is fast and smooth but harder to combine with discrete language outputs in a single model; diffusion is robust but requires several denoising steps and can be hard to tune.

Data scaling and the "scaling laws" question

Whether VLAs follow language-model-style scaling laws in robot data is unresolved. The Physical Intelligence team reports continued improvements out to more than 10,000 hours of teleoperation, but the slope of those curves is unclear, and most academic groups cannot afford comparable data collection. Synthetic data from world models (GR00T-Dreams) and from large-scale simulation (Isaac Sim, MimicGen) is the leading proposed answer, but synthetic-to-real transfer for dexterous manipulation remains imperfect.

Sim-to-real and real-to-sim

Despite progress in physics-based simulation and in real-to-sim benchmarks such as SIMPLER and REALM, the gap between simulated success rates and real-robot success rates is still routinely 10 to 30 percentage points on the same task. Real-to-sim approaches such as RobotArena-infinity attempt to convert real-robot rollouts into simulated counterparts using neural rendering, but accuracy on contact-rich, deformable or articulated objects is still a research frontier.

Dexterity and long-horizon tasks

VLAs have surpassed earlier generalist policies on short manipulation primitives, but multi-minute long-horizon tasks remain hard. π₀.₅ is the highest-profile demonstration of long-horizon cleaning in new homes, achieved by co-training on high-level subtask prediction, but it still depends on careful task framing. Truly dexterous in-hand manipulation, bimanual coordination on deformable objects, and tool use are at the edge of current capability.

Multi-embodiment generalization

Open X-Embodiment showed positive transfer across robot bodies, but only in limited regimes. Skild AI^[29] markets an "omni-bodied" policy capable of running on quadrupeds, humanoids, tabletop arms and mobile manipulators without prior knowledge of the body form, but the underlying claims have not yet been independently published. Cross-embodiment policies trained naively on heterogeneous bodies often underperform body-specific policies, and the right inductive bias for body abstraction is still open.

Deployment

Running a multi-billion-parameter VLA in a tight closed loop at 50-200 Hz on embedded hardware is non-trivial. Helix splits across a slow VLM and a fast controller specifically for this reason; SmolVLA pushes the total parameter count down to 450 million; π₀ and π₀-FAST emphasize action chunking and asynchronous inference. Quantization, distillation and speculative decoding are active engineering frontiers. Safety, robustness to adversarial scenes, and the ability to recover from out-of-distribution states are also flagged in essentially every deployment writeup.

Evaluation reproducibility

A subtler but increasingly visible problem is that real-robot evaluations published in different papers are not directly comparable. Cell geometries, lighting, object sets, prompt wording and even the seed used to randomize start poses can shift a reported success rate by 20 percentage points or more. Workshops such as the CoRL Real Robot Challenge and the RoboCup-style competitions have proposed standardized cells, but no community-wide protocol comparable to ImageNet for vision or MMLU for language has emerged. The SIMPLER and REALM benchmarks partially address this by fixing simulation environments, yet the simulation-to-real gap means that good SIMPLER numbers do not always imply good real-robot performance, and vice versa. Building a trusted real-robot leaderboard is widely regarded as a precondition for the field's continued empirical maturation.
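
Part of the comparability problem is statistical: a success rate measured over a few dozen trials carries a wide uncertainty band even before any difference in setup is considered. The sketch below computes a standard Wilson score interval for a binomial success rate. The trial counts are hypothetical and are not taken from any paper cited here.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical evaluation: 14 successes out of 20 real-robot trials.
low, high = wilson_interval(14, 20)
```

For 14 successes in 20 trials the interval runs from roughly 48% to 85%, so two papers reporting 70% and 55% on 20 trials each are not clearly distinguishable on trial count alone.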

Failure modes and safety

Closed-loop foundation-model policies inherit the failure modes of their underlying VLMs. Reported issues include hallucinated affordances (predicting actions for objects that are not actually in the workspace), instruction misinterpretation, and prompt-injection-style vulnerabilities in which printed text in the scene overrides the user instruction. Several 2025 papers also document that VLAs occasionally output unsafe joint targets when the camera is occluded or when the prompt is adversarial; mitigations include constrained decoding, separate safety filters and conservative action chunking. Because VLAs are now deployed on humanoid hardware with substantial mass and reach, even rare failure modes can cause property damage, and the field is starting to converge on the view that safety-critical behaviour should be enforced by a non-learned controller layer beneath the VLA rather than by the VLA itself.
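
A non-learned layer of this kind can be as simple as a deterministic clamp between the policy and the motor drivers. The following is a minimal sketch, assuming joint-position targets; the function name, limits and step size are hypothetical, and a production safety layer would also cover velocity, torque, collision and workspace constraints.

```python
from typing import List, Sequence


def safety_filter(
    target: Sequence[float],
    current: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    max_step: float,
) -> List[float]:
    """Limit per-step joint motion, then clamp to the joint position limits."""
    safe = []
    for q_target, q_now, lo, hi in zip(target, current, lower, upper):
        step = max(-max_step, min(max_step, q_target - q_now))  # rate limit
        safe.append(max(lo, min(hi, q_now + step)))             # position limit
    return safe


# A policy output that jumps far outside the limits is reduced to a small step.
command = safety_filter(
    target=[5.0, 0.05, -3.0],
    current=[0.0, 0.0, 0.0],
    lower=[-1.0, -1.0, -1.0],
    upper=[1.0, 1.0, 1.0],
    max_step=0.1,
)
```

Because the filter contains no learned parameters, its behaviour can be checked exhaustively and does not change when the VLA is retrained.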

Industry deployment

VLAs moved from research demos into commercial deployments faster than most prior generations of robot learning. By mid-2026 the principal players, all backed by primary sources, were:

Physical Intelligence released the π₀ family under the openpi banner on GitHub, including base checkpoints pre-trained on more than 10,000 hours of robot data, plus DROID and ALOHA fine-tunes.^[24] The company's commercial focus is general-purpose dexterous manipulation across multiple robot OEMs.
Figure AI deploys Helix on its Figure 02 humanoid in pilot programs for general-purpose home and warehouse tasks; the model was first demonstrated on multi-robot collaborative manipulation of unseen household objects.^[7]
NVIDIA ships GR00T N1, N1.5 and N1.7 through the Isaac GR00T platform as foundation models for humanoid OEMs, with early adopters including AeiRobot, Foxlink, Lightwheel and NEURA Robotics. The Isaac GR00T platform additionally supplies the GR00T-Dreams synthetic data blueprint, the Isaac Lab simulator and reference policies.^[22]
1X Technologies released a video-based world model in January 2026 to let the NEO humanoid learn from observed video without further human teleoperation, complementing rather than replacing its VLA-style action policies.^[17]
Skild AI raised approximately 1.4 billion US dollars in a January 2026 Series C round led by SoftBank, NVIDIA's NVentures and Bezos Expeditions at a roughly 14-billion-dollar valuation, on the strength of its "omni-bodied" foundation model.^[29]
Hugging Face distributes both SmolVLA and the LeRobot library under permissive licenses, and in 2025 acquired Pollen Robotics to bring affordable open-source robots to market.^[30]

Several other startups, Google DeepMind's Gemini Robotics team, and Tesla's Optimus program publish episodic demos but have not released primary architecture papers comparable to the ones above.

Open-source ecosystem

Are VLAs open source?

Many of the most influential VLAs are released with open weights, which is unusual for frontier robotics and is a large part of why the field moved so fast. The VLA open-source stack in 2026 has three tightly coupled layers.

Model checkpoints: OpenVLA, openpi (π₀, π₀-FAST, π₀.₅), GR00T N1 / N1.5 / N1.7, Octo, RDT-1B and SmolVLA are all released with weights and inference code. OpenVLA and Octo use MIT-style licenses; openpi uses Apache 2.0; NVIDIA uses a custom OneWay license for GR00T models. The notable closed exceptions are RT-2 (never released) and Figure AI's Helix (demonstrated, but its weights have not been published). All major open checkpoints are mirrored on the Hugging Face Hub.

Training and fine-tuning libraries: The Hugging Face LeRobot library is the most general-purpose, providing PyTorch implementations of diffusion policies, OpenVLA, π₀, π₀-FAST, π₀.₅ and SmolVLA together with the LeRobotDataset format.^[30] The openpi repository^[24] provides both the original JAX implementations and a 2025 PyTorch port. The OpenVLA project supplies LoRA and quantization recipes specifically aimed at consumer GPUs.^[2]^[18]

Datasets: The LeRobotDataset format on the Hugging Face Hub has become a de facto standard for community-collected robot data; it stores videos as MP4 and per-step state and action as Parquet files. Open X-Embodiment remains the standard pretraining corpus, with DROID, BridgeData V2 and RH20T as the most common single-platform fine-tuning sources.
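
The split between video files and a per-step table can be sketched in a few lines. This is a standard-library illustration only, not the LeRobot API: the `StepRow` class, its field names and the helper functions are hypothetical stand-ins for rows that would in practice be read from Parquet, and the frame rate is an assumed parameter.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StepRow:
    """One row of the per-step table (hypothetical field names)."""
    episode_index: int
    frame_index: int           # position of this step within its episode
    state: Tuple[float, ...]   # proprioceptive state at this step
    action: Tuple[float, ...]  # action commanded at this step


def frame_timestamp(row: StepRow, fps: float) -> float:
    """Time in seconds at which to seek the episode's video for this step."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return row.frame_index / fps


def group_by_episode(rows: List[StepRow]) -> Dict[int, List[StepRow]]:
    """Group rows by episode and order each episode by frame index."""
    episodes: Dict[int, List[StepRow]] = {}
    for row in rows:
        episodes.setdefault(row.episode_index, []).append(row)
    for steps in episodes.values():
        steps.sort(key=lambda r: r.frame_index)
    return episodes
```

Keeping images in a compressed video and the low-dimensional signals in a columnar table means a training loader can scan states and actions without decoding frames, and seek into the video only for the steps it samples.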

The ecosystem also extends to affordable open-source robot hardware: Hugging Face's LeRobot project distributes designs for the SO-100 arm and other low-cost platforms, with the goal of getting VLA-capable hardware into university and hobbyist hands at sub-thousand-dollar price points.

References

[1] Brohan, A. et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15818, 28 July 2023. arxiv.org/...2307.15818 Accessed 2026-06-20.
[2] Kim, M. J., Pertsch, K., Karamcheti, S. et al. "OpenVLA: An Open-Source Vision-Language-Action Model." arXiv:2406.09246, 13 June 2024. arxiv.org/...2406.09246 Accessed 2026-06-20.
[3] Physical Intelligence. "π₀: Our First Generalist Policy." Blog post, 31 October 2024. pi.website/...pi0 Accessed 2026-06-20.
[4] Pertsch, K., Stachowicz, K., Ichter, B., Driess, D., Nair, S., Vuong, Q., Mees, O., Finn, C. and Levine, S. "FAST: Efficient Action Tokenization for Vision-Language-Action Models." arXiv:2501.09747, 16 January 2025. arxiv.org/...2501.09747 Accessed 2026-06-20.
[5] Physical Intelligence team. "π₀.₅: a Vision-Language-Action Model with Open-World Generalization." arXiv:2504.16054, 22 April 2025. arxiv.org/...2504.16054 Accessed 2026-06-20.
[6] NVIDIA. "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots." arXiv:2503.14734, 18 March 2025. arxiv.org/...2503.14734 Accessed 2026-06-20.
[7] Figure AI. "Helix: A Vision-Language-Action Model for Generalist Humanoid Control." Company blog, 20 February 2025. figure.ai/...helix Accessed 2026-06-20.
[8] Shukor, M., Aubakirova, D., Capuano, F. et al. "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics." arXiv:2506.01844, 2 June 2025. arxiv.org/...2506.01844 Accessed 2026-06-20.
[9] Open X-Embodiment Collaboration. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv:2310.08864, 13 October 2023. arxiv.org/...2310.08864 Accessed 2026-06-20.
[10] Khazatsky, A. et al. "DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset." arXiv:2403.12945, 19 March 2024. arxiv.org/...2403.12945 Accessed 2026-06-20.
[11] Reed, S. et al. "A Generalist Agent." arXiv:2205.06175, 12 May 2022. arxiv.org/...2205.06175 Accessed 2026-06-20.
[12] Brohan, A. et al. "RT-1: Robotics Transformer for Real-World Control at Scale." arXiv:2212.06817, 13 December 2022. arxiv.org/...2212.06817 Accessed 2026-06-20.
[13] Driess, D. et al. "PaLM-E: An Embodied Multimodal Language Model." arXiv:2303.03378, 6 March 2023. arxiv.org/...2303.03378 Accessed 2026-06-20.
[14] Google DeepMind. "RT-2: New model translates vision and language into action." Blog post, 28 July 2023. deepmind.google/...vision-and-language-into-action Accessed 2026-06-20.
[15] Octo Model Team. "Octo: An Open-Source Generalist Robot Policy." arXiv:2405.12213, 20 May 2024. arxiv.org/...2405.12213 Accessed 2026-06-20.
[16] Liu, S., Wu, L., Li, B., Tan, H., Chen, H., Wang, Z., Xu, K., Su, H. and Zhu, J. "RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation." arXiv:2410.07864, 10 October 2024. arxiv.org/...2410.07864 Accessed 2026-06-20.
[17] 1X Technologies. "1X World Model." Company page, January 2026. 1x.tech/...1x-world-model Accessed 2026-06-20.
[18] OpenVLA project. "OpenVLA Project Page." openvla.github.io Accessed 2026-06-20.
[19] Pertsch, K. et al. "FAST: Efficient Action Tokenization for Vision-Language-Action Models." arXiv:2501.09747. arxiv.org/...2501.09747 Accessed 2026-06-20.
[20] Black, K. et al. "π₀: A Vision-Language-Action Flow Model for General Robot Control." Physical Intelligence technical report, October 2024. pi.website/...pi0.pdf Accessed 2026-06-20.
[21] NVIDIA GR00T N1 model card. huggingface.co/...GR00T-N1-2B Accessed 2026-06-20.
[22] NVIDIA. "GR00T N1.5." NVIDIA Research page. research.nvidia.com/...gr00t-n1_5 Accessed 2026-06-20.
[23] Octo project. "Octo: An Open-Source Generalist Robot Policy." octo-models.github.io Accessed 2026-06-20.
[24] Physical Intelligence. "openpi GitHub repository." github.com/...openpi Accessed 2026-06-20.
[25] Walke, H., Black, K., Lee, A. et al. "BridgeData V2: A Dataset for Robot Learning at Scale." arXiv:2308.12952, 24 August 2023. arxiv.org/...2308.12952 Accessed 2026-06-20.
[26] Fang, H.-S. et al. "RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot." arXiv:2307.00595, 2 July 2023. arxiv.org/...2307.00595 Accessed 2026-06-20.
[27] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y. and Stone, P. "LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning." arXiv:2306.03310, 5 June 2023. arxiv.org/...2306.03310 Accessed 2026-06-20.
[28] SIMPLER project / SimplerEnv-OpenVLA GitHub repository. github.com/...SimplerEnv-OpenVLA Accessed 2026-06-20.
[29] Skild AI. "Announcing Series C." Company blog, January 2026. skild.ai/...series-c Accessed 2026-06-20.
[30] Hugging Face. "LeRobot GitHub repository." github.com/...lerobot Accessed 2026-06-20.
