VLA
Last reviewed
May 16, 2026
Sources
32 citations
Review status
Source-backed
Revision
v2 · 5,299 words
A Vision-Language-Action model (VLA) is a class of foundation model for robotics that ingests one or more camera images together with a natural-language instruction and emits low-level robot control actions, all within a single neural network.[1][2] First named by Google DeepMind in the July 2023 RT-2 paper, VLAs extend large vision-language models (VLMs) with an action output head so that the same architecture used for visual question answering can also drive a gripper, mobile base, or humanoid upper body.[3] The approach reframes robot control as a sequence prediction problem in which actions are produced in the same representational space as text and image tokens, allowing robots to inherit the world knowledge and semantic reasoning of internet-scale pre-training.[1] By 2026 the VLA paradigm has become the dominant approach to general-purpose robot policies, spawning systems such as RT-2, OpenVLA, the Pi0 and Pi0.5 series from Physical Intelligence, NVIDIA Isaac GR00T N1 and N1.5, Figure AI's Helix, and Gemini Robotics.[4][5][6][7][8]
A Vision-Language-Action model unifies three modalities that were historically separated in robotics pipelines.[1] The vision input typically consists of one or more RGB camera streams from the robot's head, wrist, or third-person viewpoint, sometimes augmented with depth maps or proprioceptive state vectors describing joint angles and end-effector pose. The language input is a free-form natural-language instruction such as "fold the blue towel in the laundry basket" or "put the dishes from the sink into the dishwasher". The action output is a sequence of low-level control commands, most commonly end-effector deltas in Cartesian space combined with a gripper signal, or full joint targets for higher degree-of-freedom platforms such as humanoid upper bodies.[2][9]
What distinguishes a VLA from older language-conditioned policies is that the same transformer backbone that processes vision and language tokens also predicts the action tokens. There is no separate planner, no symbolic intermediate representation, and no hand-engineered grasp library. The model is end-to-end differentiable from pixels and characters to motor commands.[3] This architectural choice is what allows web-scale pre-training to transfer to physical control: semantic concepts learned from billions of image-caption pairs are grounded in the same embedding space that ultimately produces the action distribution.[10]
The road to modern VLAs runs through several distinct generations of language-conditioned robot policies, each of which solved one piece of the puzzle.
The immediate precursor to VLAs was CLIPort, introduced by Mohit Shridhar, Lucas Manuelli, and Dieter Fox at the Conference on Robot Learning in 2021.[11] CLIPort combined a frozen CLIP encoder for semantic understanding with a Transporter Network for spatial precision, demonstrating that internet-pretrained vision-language features could ground out into pick-and-place actions on a real tabletop robot. The same year saw a wave of related work including BC-Z and Interactive Language, all establishing that broad semantic generalization required leveraging external pre-trained representations rather than learning from robot data alone.
In April 2022, Google's SayCan project (formally "Do As I Can, Not As I Say") paired a large language model with a learned value function to ground long-horizon instructions in feasible robot skills.[12] When refreshed with the 540-billion-parameter PaLM model, PaLM-SayCan chose the correct skill sequence 84 percent of the time and completed tasks successfully 74 percent of the time, halving the error rate of the prior Flan-T5 baseline. SayCan demonstrated that LLM planning combined with a library of pre-trained low-level skills could already produce surprisingly capable mobile manipulators, although the skill library was hand-curated and the LLM did not directly produce motor commands.
In December 2022, computer vision and language work converged in robotics with Google's RT-1 (Robotics Transformer 1).[13] RT-1 was the first "large robot model" trained end-to-end on 130,000 episodes covering more than 700 tasks, collected over 17 months by a fleet of 13 Everyday Robots manipulators. Architecturally it combined a FiLM-conditioned EfficientNet, a TokenLearner module, and a Transformer decoder, with discrete action tokens for 7-DoF end-effector control. At only 35 million parameters, RT-1 hit 97 percent success on its training instructions and ran on robot hardware at 3 Hz. However, it did not yet leverage internet-scale vision-language pre-training, and its action tokens lived in their own vocabulary disjoint from the text vocabulary.
In March 2023, Google released PaLM-E, an "embodied multimodal language model" that injected raw image patches, neural 3D representations, and robot state directly into the embedding layer of a pre-trained PaLM language model.[14] The largest variant, PaLM-E-562B, set a state of the art on visual question answering benchmarks while also producing high-level plans for mobile manipulators and tabletop arms. PaLM-E showed that a language model could be "embodied" simply by feeding it sensor tokens, but it still output text plans that had to be executed by separate low-level controllers.
Four months later, in July 2023, Google DeepMind released RT-2, which is widely considered the first true VLA.[3] The crucial innovation was tokenizing 7-DoF end-effector actions as eight discrete integers and embedding them inside the same vocabulary as natural-language text. RT-2 came in two variants based on PaLM-E (12 billion parameters) and PaLI-X (55 billion parameters). The model was co-fine-tuned on Internet-scale visual question answering alongside the RT-1 robot demonstration dataset, and in more than 6,000 physical evaluation trials it nearly doubled the success rate of RT-1 on unseen objects, backgrounds, and environments. RT-2 also exhibited "emergent" symbolic reasoning capabilities, such as identifying which object on a table could function as an improvised hammer.[3]
In October 2023, the Open X-Embodiment collaboration, a consortium of 21 research institutions led by Google DeepMind, released a unified dataset of more than 1 million robot trajectories from 22 distinct embodiments, along with the RT-1-X and RT-2-X cross-embodiment models trained on it.[15][16] Cross-embodiment training on this corpus improved RT-1-X performance by 50 percent on average across five different labs' robots, and tripled the performance of RT-2-X on real-world skills relative to single-embodiment baselines. The release of Open X-Embodiment is widely credited as the inflection point at which VLA research transitioned from being a Google-internal effort to a broad open-source community.
The first major open-source generalist robot policy was Octo, released in May 2024 by a team from Berkeley, Stanford, and the Allen Institute.[17] Octo was a transformer-based diffusion policy with 27 million and 93 million parameter variants, pre-trained on 800,000 episodes from the Open X-Embodiment mixture. The model supported flexible camera and action specifications via modality tokens, could be conditioned on either language goals or goal images, and matched the larger RT-2-X-55B on standard evaluation suites despite being orders of magnitude smaller.
Octo was followed in June 2024 by OpenVLA, a 7-billion-parameter VLA released by a Stanford, UC Berkeley, Google DeepMind, and Toyota Research Institute team.[5] OpenVLA combined a Llama 2 7B backbone with a fused SigLIP and DINOv2 visual encoder, trained on 970,000 Open X-Embodiment trajectories using 64 A100 GPUs over 15 days. It predicted normalized 7-DoF end-effector deltas as discrete action tokens. OpenVLA outperformed RT-2-X-55B by 16.5 percentage points in absolute success rate on a 29-task evaluation suite despite having seven times fewer parameters, and it could be fine-tuned for a new robot or task on a single consumer GPU with LoRA. Crucially, all training code, datasets, model weights, and fine-tuning recipes were released under open licenses, making OpenVLA the de facto research baseline for the next year of VLA work.
In October 2024, Physical Intelligence (a San Francisco startup founded by alumni of Google Brain and the Princeton robot learning group) announced Pi0, a 3.3-billion-parameter VLA built on the PaliGemma VLM with an added 300-million-parameter "action expert".[6][18] Rather than discretizing actions into tokens, Pi0 used flow matching to model the conditional distribution over continuous action chunks at up to 50 Hz, enabling the high-frequency dexterous control needed for tasks such as folding laundry from a hamper, bussing a table, and assembling cardboard boxes. The model was trained on data from seven robotic platforms covering 68 unique tasks, and Physical Intelligence subsequently released the weights and reference implementation as openpi.
The same period saw a flurry of other open contributions, including RDT-1B (October 2024) from Tsinghua University, a 1.2-billion-parameter diffusion foundation model for bimanual manipulation pre-trained on 46 datasets and fine-tuned on 6,000 ALOHA episodes, and CogACT (November 2024) from a Tsinghua, Microsoft Research, and University of Hong Kong collaboration, which paired a 7-billion-parameter VLM with a dedicated diffusion transformer action head and beat OpenVLA by over 35 percent in simulation.[19][20]
The story of 2025 in VLAs is the story of large companies and well-funded startups bringing VLA-driven humanoids and dexterous manipulators into commercial deployment.
In February 2025, Figure AI introduced Helix, the first VLA designed for high-rate continuous control of an entire humanoid upper body.[7][21] Helix is structured as a two-system architecture: a 7-billion-parameter onboard VLM ("System 2") runs at 7 to 9 Hz to provide scene understanding and language comprehension, while an 80-million-parameter visuomotor transformer ("System 1") consumes System 2's latent embeddings and runs at 200 Hz to produce continuous control over a 35-degree-of-freedom action space spanning wrists, torso, head, and individual fingers. Helix is the first VLA known to run entirely on embedded low-power-consumption GPUs on the robot itself, and it was rapidly deployed in commercial logistics pilots and a BMW factory partnership. Figure followed with Helix 02 in early 2026, completing full eight-hour autonomous shifts.
In March 2025, Google DeepMind announced Gemini Robotics and Gemini Robotics-ER ("embodied reasoning"), a pair of models built on Gemini 2.0 that added action as a new output modality.[8][22] Gemini Robotics-ER focused on spatial understanding, grasp prediction, and code generation, while Gemini Robotics added direct VLA control. A subsequent Gemini Robotics On-Device variant, announced in mid-2025, optimized inference for execution entirely on the robot, eliminating cloud round-trips and adapting to new tasks with as few as 50 to 100 demonstrations. Gemini Robotics 1.5 followed in late September 2025, introducing a "think before acting" reasoning trace and an ER-to-VLA agentic stack capable of cross-embodiment skill transfer from the Aloha 2 to Apptronik's Apollo humanoid and dual-arm Franka research platforms.
Also in March 2025, NVIDIA announced Isaac GR00T N1, billed as the world's first open humanoid robot foundation model.[4][23] GR00T N1 used a dual-system architecture in which System 2 (an Eagle-2 VLM combining a SigLIP-2 image encoder with a SmolLM2 language model) ran at 10 Hz for high-level reasoning, while System 1 (a Diffusion Transformer trained with flow matching) ran at up to 120 Hz to produce continuous motor commands by cross-attending to System 2's outputs. The model was trained on a mixture of egocentric human videos, real and simulated robot trajectories, and synthetic data. NVIDIA followed with GR00T N1.5 on June 11, 2025, freezing the VLM during fine-tuning to improve language following, upgrading the backbone to Eagle 2.5, and adding Future Latent Representation Alignment to enable learning from human videos. GR00T N1.5 improved success on the DreamGen task suite from 13.1 percent to 38.3 percent versus N1.
In April 2025, Physical Intelligence released Pi0.5, an extension of Pi0 designed for "open-world generalization".[24] Pi0.5 used a hierarchical architecture in which the model first produced a high-level language step ("open the cabinet", "wipe the counter") and then conditioned its flow-matching action expert on that intermediate plan. It was co-trained on data from multiple robots, web-scale vision-language data, and human verbal annotations. The headline demonstration was a mobile manipulator that could clean a kitchen or bedroom it had never seen during training, executing 10-to-15-minute multi-stage tasks in private homes.
February 2025 also saw Microsoft Research release Magma, a foundation model for multimodal AI agents that operated across digital user interfaces and physical robots within a single architecture.[25] Magma introduced Set-of-Mark and Trace-of-Mark annotations that gave the model structured understanding of both UI elements and robot manipulation traces, and it was published at CVPR 2025. Other notable 2025 entries include Beijing Academy of Artificial Intelligence's RoboBrain and RoboBrain 2.0, an embodied vision-language foundation model focused on planning, affordance perception, and trajectory prediction with 7B and 32B variants.[26]
While every modern VLA shares the same general shape (vision encoder feeding a transformer language backbone that produces action outputs), the field has developed several distinct architectural patterns for the action head.
The original RT-2 approach discretizes each dimension of the action into 256 bins and reuses the least-frequent tokens in the language model vocabulary as action tokens.[3] A 7-DoF end-effector action is then represented as eight integers, which the autoregressive language model produces one token at a time exactly as it would generate text. This pattern has the advantage of requiring no architectural change to the base VLM, allowing direct co-training on language and robot data. Its disadvantages are limited action precision (capped by the bin resolution), slow inference (because actions must be sampled sequentially), and difficulty representing high-frequency continuous control. RT-2 and OpenVLA both use variants of this pattern.
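The binning step described above can be illustrated with a minimal Python sketch. It is not the RT-2 or OpenVLA implementation: the normalization range of [-1, 1], the function names, and the example action are assumptions made for illustration, and the sketch covers only the seven continuous dimensions, not the mapping of bin indices onto vocabulary tokens.

```python
NUM_BINS = 256  # bins per action dimension, as described in the text


def tokenize_action(action, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map each continuous action dimension to a bin index in [0, num_bins - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)
        fraction = (clipped - low) / (high - low)
        # The upper edge (fraction == 1.0) is folded into the last bin.
        tokens.append(min(int(fraction * num_bins), num_bins - 1))
    return tokens


def detokenize_action(tokens, low=-1.0, high=1.0, num_bins=NUM_BINS):
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical 7-DoF delta (x, y, z, roll, pitch, yaw, gripper),
# assumed to be pre-normalized to [-1, 1].
action = [0.10, -0.25, 0.00, 0.50, -1.00, 1.00, 0.75]
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)
max_error = max(abs(a - r) for a, r in zip(action, recovered))
```

Under these assumptions the round-trip error is bounded by half a bin width, (high - low) / (2 * 256), which is the precision cap mentioned above.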
A second school of designs attaches a small diffusion or flow-matching network to the VLM that learns to denoise a chunk of continuous actions conditioned on the VLM's latent embedding.[6][20] Physical Intelligence's Pi0 uses flow matching; CogACT and GR00T N1 use Diffusion Transformers. The principal advantage is high-fidelity continuous control at high frequencies (Pi0 reaches 50 Hz, GR00T N1 reaches 120 Hz) without quantization error. The disadvantage is that the action head is no longer reused from language pre-training, so the architecture is more heterogeneous and somewhat harder to train from scratch.
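A rough sketch of how a flow-matching head produces an action chunk at inference time: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 with a few Euler steps. In a real system the velocity comes from the learned action expert conditioned on the VLM embedding. Here `toy_velocity` is a hypothetical closed-form stand-in that points along the straight-line path toward a fixed target chunk, and the step count and time convention are assumptions.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for the learned network: velocity of the straight path to `target`."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action_chunk(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from noise (t = 0) toward an action chunk (t = 1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # initial noise sample
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical flattened chunk: 4 timesteps x 2 action dimensions.
target_chunk = [0.1, -0.2, 0.15, -0.1, 0.2, 0.0, 0.25, 0.1]
chunk = sample_action_chunk(target_chunk)
chunk_error = max(abs(c - g) for c, g in zip(chunk, target_chunk))
```

Because the whole chunk is denoised jointly and stays continuous, there is no bin resolution to limit precision. The cost is several network evaluations per chunk instead of one.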
FAST (Frequency-space Action Sequence Tokenization), introduced by Physical Intelligence in January 2025, sits between the two prior approaches.[27] FAST first applies a Discrete Cosine Transform to each action dimension, removes insignificant coefficients, and then applies Byte Pair Encoding to compress the result. The output is 30 to 60 dense tokens per action chunk, a roughly tenfold compression over naive binning. The Pi0-FAST variant matches the dexterity of flow-matching Pi0 while training up to five times faster and remaining fully autoregressive.
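The frequency-space idea can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the FAST implementation: the quantization scale of 100, the synthetic 50-step trajectory, and the function names are hypothetical, and the final Byte Pair Encoding stage is omitted.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal)))
    return out


def idct(coeffs):
    """Inverse of `dct` (orthonormal DCT-III)."""
    n = len(coeffs)
    return [
        sum(math.sqrt((1.0 if k == 0 else 2.0) / n) * c
            * math.cos(math.pi * (i + 0.5) * k / n) for k, c in enumerate(coeffs))
        for i in range(n)
    ]


SCALE = 100  # hypothetical quantization scale; larger keeps more detail

# Synthetic smooth reach for one action dimension over a 50-step chunk.
steps = 50
trajectory = [0.5 * (1.0 - math.cos(math.pi * i / (steps - 1))) for i in range(steps)]

quantized = [round(c * SCALE) for c in dct(trajectory)]  # small coefficients become 0
nonzero = sum(1 for q in quantized if q != 0)
reconstructed = idct([q / SCALE for q in quantized])
max_error = max(abs(a - b) for a, b in zip(trajectory, reconstructed))
```

For smooth motions most high-frequency coefficients round to zero, so the sequence of integers is sparse and compresses well in a subsequent entropy or BPE stage.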
Inspired by Daniel Kahneman's two-system model of human cognition, several recent VLAs explicitly decompose the architecture into a slow "thinker" and a fast "doer".[7][4] Figure's Helix runs a 7B VLM at 7-9 Hz to produce a semantic latent, which a separate 80M visuomotor transformer consumes at 200 Hz to emit joint commands. GR00T N1 runs the Eagle-2 VLM at 10 Hz feeding a 120 Hz Diffusion Transformer. Gemini Robotics 1.5 uses a similar decomposition in which Gemini Robotics-ER produces reasoning traces that condition the lower-level VLA. This pattern is especially valuable for humanoid platforms with high-DoF action spaces and tight real-time constraints, because it allows the expensive VLM to amortize one inference over many low-level control cycles.
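The amortization argument can be made concrete with a single-threaded simulation. The two functions are hypothetical placeholders for the slow VLM and the fast visuomotor policy, and 8 Hz is an assumed value inside the 7-9 Hz range quoted for Helix. Production systems run the two loops asynchronously, which this sketch does not model.

```python
def slow_system2(tick):
    """Hypothetical VLM stand-in: returns a latent tagged with its creation tick."""
    return {"computed_at": tick}


def fast_system1(latent, tick):
    """Hypothetical policy stand-in: reports how many ticks old its latent is."""
    return tick - latent["computed_at"]


def run_dual_system(duration_s=1.0, s1_hz=200, s2_hz=8):
    ticks_per_latent = s1_hz // s2_hz  # control cycles served by one VLM pass
    total_ticks = int(duration_s * s1_hz)
    latent = None
    s2_calls = 0
    staleness = []
    for tick in range(total_ticks):
        if tick % ticks_per_latent == 0:
            latent = slow_system2(tick)  # refresh the semantic latent
            s2_calls += 1
        staleness.append(fast_system1(latent, tick))
    return s2_calls, staleness


s2_calls, staleness = run_dual_system()
```

With these rates, one second of control involves 200 fast-policy calls but only 8 VLM calls, and the fast policy never conditions on a latent more than 24 ticks old.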
A cross-cutting technique used by Pi0, RDT-1B, and modern OpenVLA fine-tunes is action chunking, in which the model predicts a short sequence of future actions (typically 8 to 50 steps) in a single forward pass rather than one action at a time.[28] Because the policy makes fewer closed-loop decisions per episode, chunking reduces the compounding error that behavior cloning accumulates over long rollouts, and it produces smoother trajectories. The OpenVLA-OFT recipe published in February 2025 combined parallel decoding with action chunking, a continuous action representation, and an L1 regression loss to lift OpenVLA's LIBERO success rate from 76.5 percent to 97.1 percent and to raise inference throughput from roughly 5 Hz to over 100 Hz, a 26-fold speedup.
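A minimal sketch of chunked execution, showing how the chunk length changes the number of forward passes per episode. `dummy_policy` is a hypothetical stand-in that returns the step index as its "action", and the 200-step episode length is an assumption.

```python
def dummy_policy(observation, chunk_size):
    """Hypothetical policy: one forward pass returns `chunk_size` future actions."""
    return [observation + i for i in range(chunk_size)]


def rollout(episode_steps, chunk_size):
    """Execute each predicted chunk open-loop, then query the policy again."""
    forward_passes = 0
    executed = []
    step = 0
    while step < episode_steps:
        chunk = dummy_policy(step, chunk_size)
        forward_passes += 1
        # Truncate the final chunk so the episode ends exactly on time.
        for action in chunk[: episode_steps - step]:
            executed.append(action)
            step += 1
    return forward_passes, executed


passes_single, _ = rollout(200, 1)
passes_chunk8, actions_chunk8 = rollout(200, 8)
passes_chunk50, _ = rollout(200, 50)
```

Longer chunks cut the number of model calls but make the robot slower to react to changes within a chunk, which is why chunk length is usually tuned per task.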
| Model | Developer | Year | Parameters | Action head | Openness |
|---|---|---|---|---|---|
| RT-1 | Google | 2022 | 35M | Discrete tokens | Code open |
| PaLM-E | Google | 2023 | up to 562B | Text plans | Closed |
| RT-2 | Google DeepMind | 2023 | 12B / 55B | Discrete tokens | Closed |
| RT-2-X | Open X-Embodiment | 2023 | up to 55B | Discrete tokens | Partial weights |
| Octo | Berkeley / Stanford | 2024 | 27M / 93M | Diffusion | Fully open |
| OpenVLA | Stanford / Berkeley / DeepMind / TRI | 2024 | 7B | Discrete tokens | Fully open |
| Pi0 | Physical Intelligence | 2024 | 3.3B | Flow matching | Weights open |
| Pi0-FAST | Physical Intelligence | 2025 | 3.3B | FAST tokens | Weights open |
| Pi0.5 | Physical Intelligence | 2025 | ~3.3B | Hierarchical flow matching | Partial |
| RDT-1B | Tsinghua | 2024 | 1.2B | Diffusion | Fully open |
| CogACT | Tsinghua / Microsoft | 2024 | ~7.3B | Diffusion | Open |
| Helix | Figure AI | 2025 | 7B + 80M | Continuous (S1/S2) | Closed |
| GR00T N1 | NVIDIA | 2025 | 2B class | Diffusion (S1/S2) | Open |
| GR00T N1.5 | NVIDIA | 2025 | 3B | Diffusion (S1/S2) | Open |
| Gemini Robotics | Google DeepMind | 2025 | Undisclosed | VLA + ER | Closed |
| Gemini Robotics On-Device | Google DeepMind | 2025 | Undisclosed | VLA | Closed (preview) |
| Magma | Microsoft Research | 2025 | 8B | Multi-modal action | Open |
| RoboBrain 2.0 | BAAI | 2025 | 7B / 32B | Hybrid | Open |
| SmolVLA | Hugging Face | 2025 | 450M | Continuous (async) | Fully open |
The overwhelming majority of robot demonstration data used to train VLAs comes from human teleoperation, in which an operator drives the robot through a task using a joystick, VR controller, leader-follower puppet rig, or specialized data-collection device.[29] The collected trajectories then serve as supervised behavior cloning targets: the model is trained to predict the operator's action given the observation history. This paradigm is simple, scalable in principle, and produces high-quality data, but it remains expensive (a single demonstration can take a minute or more of skilled operator time) and biased toward whatever the operator chose to do.
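As a toy illustration of behavior cloning, the sketch below fits a one-dimensional linear policy to demonstrations by gradient descent on the mean squared error between predicted and demonstrated actions. Everything in it is hypothetical: real VLAs use transformer policies and image observations, and `operator_action` simply plays the role of the teleoperator.

```python
def operator_action(observation):
    """Hypothetical teleoperator: the expert the policy should imitate."""
    return 2.0 * observation - 1.0


# Demonstrations are (observation, action) pairs recorded from the operator.
observations = [i / 10 for i in range(-10, 11)]
demonstrations = [(o, operator_action(o)) for o in observations]


def train_behavior_cloning(demos, learning_rate=0.1, steps=500):
    """Fit action = w * observation + b by gradient descent on squared error."""
    w, b = 0.0, 0.0
    n = len(demos)
    for _ in range(steps):
        grad_w = sum(2 * (w * o + b - a) * o for o, a in demos) / n
        grad_b = sum(2 * (w * o + b - a) for o, a in demos) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train_behavior_cloning(demonstrations)
```

The learned policy can only reproduce what the demonstrations contain, which is the bias toward operator choices noted above.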
The Open X-Embodiment effort established that pooling data from many different robot types can yield positive transfer to each individual platform, despite obvious differences in kinematics, sensors, and action spaces.[15] Modern VLAs nearly always co-train on a diverse mixture of embodiments and use embodiment-aware tokenization or adapter heads to convert the unified policy output into the action space of each specific robot. RDT-1B introduced a "Physically Interpretable Unified Action Space" specifically to handle this multi-robot setting.
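One simple way to picture a unified action space is a fixed vector of named slots, with each robot filling only the slots it has and a mask marking which entries are valid. The slot names and embodiments below are hypothetical and do not reproduce the layout used by RDT-1B or any other model.

```python
# Hypothetical unified layout: arm end-effector deltas, gripper, and base velocities.
UNIFIED_SLOTS = [
    "ee_dx", "ee_dy", "ee_dz", "ee_droll", "ee_dpitch", "ee_dyaw",
    "gripper", "base_vx", "base_wz",
]
SLOT_INDEX = {name: i for i, name in enumerate(UNIFIED_SLOTS)}

# Which unified slots each (hypothetical) embodiment actually controls.
EMBODIMENT_SLOTS = {
    "tabletop_arm": ["ee_dx", "ee_dy", "ee_dz", "ee_droll", "ee_dpitch", "ee_dyaw", "gripper"],
    "mobile_base": ["base_vx", "base_wz"],
}


def to_unified(embodiment, native_action):
    """Scatter a robot-specific action into the unified vector, plus a validity mask."""
    slots = EMBODIMENT_SLOTS[embodiment]
    if len(native_action) != len(slots):
        raise ValueError("action length does not match embodiment")
    vector = [0.0] * len(UNIFIED_SLOTS)
    mask = [0] * len(UNIFIED_SLOTS)
    for name, value in zip(slots, native_action):
        vector[SLOT_INDEX[name]] = value
        mask[SLOT_INDEX[name]] = 1
    return vector, mask


def from_unified(embodiment, vector):
    """Gather the slots a given robot understands from a unified policy output."""
    return [vector[SLOT_INDEX[name]] for name in EMBODIMENT_SLOTS[embodiment]]


base_vector, base_mask = to_unified("mobile_base", [0.3, -0.1])
base_native = from_unified("mobile_base", base_vector)
```

During training the mask would typically be used to exclude unused slots from the loss, so that a mobile base is not penalized for arm dimensions it does not have.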
Co-fine-tuning on internet-scale vision-language data alongside robot trajectories was an explicit design choice of RT-2 and has since been adopted by almost every major VLA.[3] The intuition is that without this co-training the model's language and visual capabilities would degrade quickly during behavior cloning fine-tuning. Pi0.5 takes this further by including web data, video data, and verbal human annotations in its pre-training mixture to support open-world generalization.
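Co-training is often implemented as weighted sampling over data sources when batches are assembled. The sketch below shows that mechanism only. The source names and mixing weights are hypothetical, since the ratios used by RT-2, Pi0.5, and other models are tuned per system and are not given in this article.

```python
import random

# Hypothetical sources and weights; real mixtures are tuned per model.
MIXTURE = {"robot_trajectories": 0.5, "web_vqa": 0.3, "captioning": 0.2}


def sample_batch_sources(batch_size, rng):
    """Pick the data source for each example in a batch according to MIXTURE."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sources = sample_batch_sources(10_000, rng)
robot_fraction = sources.count("robot_trajectories") / len(sources)
```

Keeping vision-language examples in every batch is what lets the backbone retain its pre-trained capabilities while it learns to emit actions.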
Real robot data is expensive, so several VLA programs lean heavily on simulation to generate synthetic trajectories. NVIDIA's GR00T-Dreams blueprint, used to train GR00T N1.5, generated synthetic data in approximately 36 hours that would otherwise have required nearly three months of manual collection.[23] The RobotArena Infinity benchmark, released in late 2025, converts real demonstrations into simulated counterparts for scalable evaluation. Sim-to-real transfer remains an active research area because gaps in physics, rendering, and contact dynamics can degrade real-world performance.
While most VLAs are trained purely by supervised behavior cloning, several recent efforts use reinforcement learning to fine-tune the action policy after pre-training, often with simulated reward functions or learned reward models. RL fine-tuning remains less mature than for language models, in part because the cost of failed exploration on real hardware is high, but the field expects this paradigm to play an increasing role through 2026 and beyond.
Unlike vision and language research, where common benchmarks have driven rapid progress for a decade, VLA evaluation remains contested and fragmented.
LIBERO, released at NeurIPS 2023, is the most widely used simulated benchmark for VLA manipulation.[30] It provides 130 tasks grouped into four suites: LIBERO-Spatial (testing spatial relation transfer), LIBERO-Object (object identity transfer), LIBERO-Goal (goal-condition transfer), and LIBERO-100 (long-horizon, entangled-knowledge transfer). High-quality human teleoperation demonstrations accompany each task. By 2025 LIBERO had become saturated for top VLAs, with OpenVLA-OFT reaching 97.1 percent average success.
The SIMPLER benchmark provides a simulation environment designed to closely match real-robot setups used in RT-1, RT-2, and Bridge data evaluations, allowing researchers to predict real-world performance from cheap simulated rollouts.[20] CALVIN evaluates long-horizon language-conditioned manipulation in tabletop scenes and has been used widely by the open-source community.
RoboArena, introduced in June 2025, takes a different approach: it crowd-sources real-world double-blind evaluation across a distributed network of evaluators who pick their own tasks and environments, generating thousands of comparison episodes per policy pair.[31] RobotArena Infinity, released in October 2025, extends this concept by converting real demonstrations into simulated environments and combining VLM-judge scoring with crowd-sourced human preferences. Both benchmarks have highlighted that current VLAs remain notably sensitive to dataset shift and are not yet true generalists.
For frontier industrial VLAs, the most informative benchmark is direct deployment. Figure AI publishes videos of Helix executing eight-hour autonomous shifts in logistics workflows. Physical Intelligence demonstrates Pi0.5 cleaning previously unseen kitchens and bedrooms. NVIDIA showcases GR00T N1.5 on the Fourier Intelligence GR-1 humanoid and on the Franka Research 3 dual-arm setup. These demonstrations are difficult to compare apples-to-apples but have become the de facto signal that the technology is approaching commercial readiness.
| Dataset | Year | Scale | Notes |
|---|---|---|---|
| Bridge / BridgeData V2 | 2021 / 2023 | 60,096 trajectories | Berkeley low-cost WidowX |
| RT-1 dataset | 2022 | 130k episodes | Everyday Robots fleet |
| RoboNet | 2019 | 162k trajectories | Multi-lab early effort |
| Open X-Embodiment | 2023 | 1M+ trajectories | 22 embodiments, 60 datasets |
| DROID | 2024 | 76k trajectories, 350 hours | Franka Panda, 13 institutions |
| LIBERO datasets | 2023 | 130 tasks, teleop | Lifelong-learning study |
| LeRobot community | 2024-2025 | 487 datasets aggregated | Hugging Face SmolVLA training set |
| ALOHA datasets | 2024 | thousands of bimanual episodes | RDT-1B fine-tuning data |
The Open X-Embodiment dataset, released in October 2023, remains the most influential training corpus, pooling 60 existing datasets from 34 robotics labs into a single standardized format with more than 1 million trajectories spanning 22 robot embodiments and 527 manipulation skills.[15] DROID, released in March 2024, complements this with 76,000 in-the-wild Franka Panda trajectories covering 564 scenes across 52 buildings, providing the kind of unstructured environmental diversity that the lab-collected datasets often lack.[32] BridgeData V2 from Berkeley provides 60,096 low-cost teleoperation trajectories on the WidowX 250 platform and is one of the most widely used baselines in academic VLA work.
The VLA field is split fairly cleanly between open-weights models released by academic labs, consortia, and some companies (OpenVLA, Pi0, RDT-1B, CogACT, Octo, SmolVLA, GR00T N1.5, RoboBrain) and closed commercial models from large companies and well-funded startups (RT-2, Gemini Robotics, Helix). Hugging Face's SmolVLA, released in June 2025, pushed the open frontier further by demonstrating that a 450-million-parameter VLA pre-trained on 10 million frames of community LeRobot data could run on a MacBook at 18 ms per step and outperform much larger baselines on simulated and real tasks. The trade-off between the two ecosystems is the familiar one from language models: open models enable academic study, fine-tuning, and on-premises deployment, while closed frontier models tend to have more aggressive scale, proprietary data, and tighter integration with specific hardware platforms.
Moving a VLA from a benchmark paper to a production robot exposes several practical problems that the research literature is only now starting to address.
Robots demand control frequencies of 10 to 200 Hz depending on the task, while a single forward pass of a 7-billion-parameter VLM can take tens or hundreds of milliseconds on automotive-grade GPUs. The dominant solutions are (a) action chunking, so that one VLM pass yields many control cycles; (b) System-1/System-2 decomposition, so that the expensive VLM runs at low frequency while a small visuomotor head runs at high frequency; (c) on-device quantization, including INT8 and INT4 weights with carefully tuned activation precision; and (d) speculative or parallel decoding, as in OpenVLA-OFT, which removed the autoregressive bottleneck for action tokens.
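The relationship between model latency, control rate, and chunk length in option (a) is simple arithmetic: one forward pass must yield at least as many actions as the robot consumes while the next pass is being computed. The latency figures in the sketch are illustrative assumptions, not measurements of any particular model or GPU.

```python
def min_chunk_length(forward_pass_ms, control_hz):
    """Smallest number of actions one forward pass must cover so control never stalls."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-forward_pass_ms * control_hz // 1000)


# Hypothetical latencies for a large VLM on embedded hardware.
chunk_at_50hz = min_chunk_length(forward_pass_ms=100, control_hz=50)
chunk_at_200hz = min_chunk_length(forward_pass_ms=100, control_hz=200)
chunk_slow_model = min_chunk_length(forward_pass_ms=120, control_hz=50)
```

This is a lower bound that assumes inference for the next chunk overlaps with execution of the current one. A purely sequential loop needs additional headroom.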
LLMs train on trillions of text tokens and VLMs on billions of images, but the best public VLA datasets contain on the order of one million trajectories: many orders of magnitude smaller. Because teleoperation is the dominant data source and cannot easily scale, the field is exploring synthetic data generation (GR00T-Dreams), passive video data (Pi0.5 and Magma), and learned reward functions to enable RL on hardware.
Even within the manipulator class of robots, gripper geometry, camera placement, control rate, and joint limits vary considerably. VLAs handle this with embodiment-aware tokenization, learned adapters, and unified action spaces, but performance on a brand-new embodiment with little fine-tuning data remains hit-or-miss. Humanoid platforms compound the problem by adding bipedal locomotion, full-body coordination, and tens of additional degrees of freedom.
A hallucinating chatbot is annoying. A hallucinating robot can break things or hurt people. The field has only the beginnings of formal verification, runtime safety filters, and learned uncertainty estimation for VLA control. Most commercial deployments still rely on hard kinematic constraints, force limits, and human-in-the-loop monitoring as safety nets behind the learned policy.
Despite headline demonstrations such as Pi0.5 cleaning unseen kitchens, recent benchmarks like RobotArena Infinity have shown that VLA performance still drops noticeably when evaluated outside the training distribution.[31] The field has not yet achieved the level of distribution-shift robustness that humans take for granted, and improving this remains a central research priority for 2026.
VLA models are being deployed or piloted in a growing list of application domains, including the logistics, factory, and household settings described above.
Applications outside manipulation are also emerging, including autonomous driving research that frames the driving stack as a VLA, augmented reality agents that interpret physical scenes, and agricultural robots for precision harvesting.
Looking ahead from mid-2026, the most active VLA research themes include scaling laws (whether VLAs benefit from the same kind of clean log-log scaling seen in language models), data scaling beyond teleoperation (synthetic data, video pre-training, RL from real rollouts), agentic VLAs that combine planning, tool use, and long-horizon reasoning, and verifiable safety guarantees that would unlock deployment in high-stakes settings such as healthcare and elderly care. Hardware co-design is also moving quickly, with NVIDIA's Jetson Thor platform, Qualcomm's robotics chips, and bespoke accelerators from humanoid startups all targeting on-board VLA inference as a first-class workload.