VLA


A Vision-Language-Action model (VLA) is a type of foundation model that enables robot control through the integration of visual perception, natural language processing, and action generation capabilities.[1] VLAs represent a significant advancement in embodied AI, allowing robots to understand and execute complex tasks based on visual inputs and natural language commands while directly outputting motor control actions.[2] These models are transforming robotics by reducing the need for complex, task-specific programming, enabling robots to learn from visual and language data much like humans do.[3]

Overview

VLA models are constructed by extending vision-language models (VLMs) with the capability to generate robot action sequences.[4] Unlike traditional robotic control systems that require separate modules for perception, planning, and control, VLAs provide an end-to-end solution that directly maps visual observations and language instructions to robot actions.[5]

The key innovation of VLA models is their ability to leverage pre-trained knowledge from large-scale internet data while adapting to robotic control tasks.[6] This approach addresses one of the fundamental challenges in robotics: the need for extensive task-specific training data for each robot, environment, and application.[7] Actions are represented in the same token space as text so that gradient updates align linguistic and kinematic concepts, allowing the model to inherit world knowledge from internet-scale corpora while grounding it in robot trajectories.[8]

How It Works

A VLA model typically consists of three main components that work together in a unified, end-to-end system:[3]

Vision Encoder

This component processes visual input from cameras and other sensors, converting raw pixel data into a meaningful representation of the environment. It identifies objects, understands their spatial relationships, and extracts other relevant visual features.

Language Model

The language model interprets natural language commands given to the robot in the form of text or speech. It understands the user's intent and extracts the goal of the task, as well as any constraints or preferences.

Action Policy

This component takes the outputs from the vision encoder and the language model and generates a sequence of actions for the robot to execute. It plans the robot's movements, controls its actuators, and ensures that the task is performed safely and efficiently.

This integrated approach is a key advantage of VLAs over traditional robotic systems, where perception, planning, and control are often treated as separate and independent modules.
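As a concrete illustration of how these pieces fit together, the minimal PyTorch sketch below maps an image and a tokenized instruction to a low-level action vector. It is only a toy: the layer sizes, the 7-dimensional action space, and the stock PyTorch modules are illustrative assumptions, not the architecture of any model described in this article.

    import torch
    import torch.nn as nn

    class MinimalVLA(nn.Module):
        """Toy end-to-end VLA: image + instruction tokens -> action vector."""

        def __init__(self, vocab_size=32000, d_model=512, action_dim=7):
            super().__init__()
            # Vision encoder: patchify pixels into perceptual tokens.
            self.vision_encoder = nn.Sequential(
                nn.Conv2d(3, d_model, kernel_size=16, stride=16),
                nn.Flatten(2),                       # (B, d_model, N_patches)
            )
            # Language model stand-in: embeddings plus a small transformer.
            self.token_embed = nn.Embedding(vocab_size, d_model)
            self.backbone = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
                num_layers=2,
            )
            # Action policy head: fused features -> motor command
            # (e.g. 6-DoF end-effector delta plus gripper).
            self.action_head = nn.Linear(d_model, action_dim)

        def forward(self, image, instruction_ids):
            vis = self.vision_encoder(image).transpose(1, 2)   # (B, N_patches, d)
            txt = self.token_embed(instruction_ids)            # (B, N_tokens, d)
            fused = self.backbone(torch.cat([vis, txt], dim=1))
            return self.action_head(fused.mean(dim=1))         # (B, action_dim)

    model = MinimalVLA()
    action = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 12)))
    print(action.shape)  # torch.Size([1, 7])

Real VLAs replace the toy encoder and transformer with large pre-trained backbones and are trained on robot demonstrations, but the data flow is the same.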

History

The concept of Vision-Language-Action models emerged from the convergence of advances in large language models (LLMs), computer vision, and robotic learning. The term "VLA" was coined in the Google RT-2 paper, which described the model as transforming "Pixels to Actions".[9]

Early Development

The foundation for VLA models was laid by earlier work in robotic learning, including:

  • RT-1 (Robotics Transformer 1): Introduced in late 2022, RT-1 was the first "large robot model" trained with robot demonstration data collected from 13 robots over 17 months.[10] It demonstrated the potential of transformer architecture for robotic control but did not yet integrate vision-language pre-training.
  • PaLM-E: An embodied multimodal language model that enhanced robots' ability to understand their surroundings through vision models.[11]

RT-2: The First True VLA

In July 2023, Google DeepMind announced Robotic Transformer 2 (RT-2), described as "a first-of-its-kind vision-language-action (VLA) model".[12] RT-2 represented a breakthrough in enabling robots to:

  • Apply knowledge and reasoning from large web datasets to real-world robotic tasks
  • Understand and execute commands not present in robot training data
  • Perform rudimentary reasoning in response to user commands
  • Use chain-of-thought reasoning for multi-stage semantic reasoning[13]

Architecture

VLA models typically consist of three main components:[14]

Vision Module

The vision module processes raw image inputs into perceptual tokens. Common choices are pre-trained vision transformer encoders, such as the SigLIP and DINOv2 encoders fused in OpenVLA.

Language Module

The language module integrates visual information with natural language instructions and performs cognitive reasoning. This role is typically filled by a pre-trained large language model backbone, such as PaLM-E or PaLI-X in RT-2 or Llama 2 in OpenVLA.

Action Module

The action module generates robot control commands. Two main approaches have emerged:

Discrete Action Tokenization

Early VLA models like RT-2 represent robot actions as discrete tokens within the language model's vocabulary. Actions are encoded as text strings (for example "1 128 91 241 5 101 127 217") that map to specific motor commands.[17]
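A minimal sketch of this discretization scheme is shown below. It assumes an RT-2-style setup in which each action dimension is normalized and binned into 256 values; the bin count, action ordering, and bounds are illustrative assumptions rather than the exact encoding of any particular model.

    import numpy as np

    NUM_BINS = 256   # assumed number of discrete bins per action dimension

    def actions_to_tokens(action, low, high):
        """Map a continuous action vector to a space-separated token string."""
        action = np.clip(action, low, high)
        normalized = (action - low) / (high - low)             # scale to [0, 1]
        bins = np.round(normalized * (NUM_BINS - 1)).astype(int)
        return " ".join(str(b) for b in bins)                  # e.g. "1 128 91 ..."

    def tokens_to_actions(token_str, low, high):
        """Decode the model's text output back into motor commands."""
        bins = np.array([int(t) for t in token_str.split()])
        return low + bins / (NUM_BINS - 1) * (high - low)

    # Hypothetical 7-DoF bounds: xyz deltas, rotation deltas, gripper.
    low = np.array([-0.1, -0.1, -0.1, -0.5, -0.5, -0.5, 0.0])
    high = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 1.0])
    tokens = actions_to_tokens(np.array([0.02, -0.05, 0.0, 0.1, 0.0, -0.2, 1.0]), low, high)
    print(tokens)                                # text the language model can emit
    print(tokens_to_actions(tokens, low, high))  # approximately recovers the action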

Continuous Action Generation

More recent models employ continuous action generation through:

  • Flow Matching: Used in models like π0, enabling direct prediction of continuous action sequences at high frequencies (up to 50Hz); a minimal sampling sketch follows this list[18]
  • Diffusion Models: Applied in models like CogACT for generating smooth, multi-modal action distributions[14]
  • FAST Tokenization: Frequency-space Action Sequence Tokenization using Discrete Cosine Transform for efficient autoregressive generation[19]
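To make the flow-matching idea concrete, the sketch below integrates a learned velocity field from Gaussian noise to an action chunk with a few Euler steps. The velocity network here is an untrained stand-in, and the chunk length, step count, and lack of observation conditioning are simplifying assumptions; it illustrates the sampling procedure, not π0's implementation.

    import torch
    import torch.nn as nn

    ACTION_DIM, CHUNK_LEN = 7, 50   # e.g. a one-second chunk at 50Hz (assumed)

    # Stand-in for the learned velocity field v_theta(x_t, t | observation).
    # A real model conditions this on the VLM's visual and language features.
    velocity_net = nn.Sequential(
        nn.Linear(ACTION_DIM * CHUNK_LEN + 1, 256),
        nn.ReLU(),
        nn.Linear(256, ACTION_DIM * CHUNK_LEN),
    )

    @torch.no_grad()
    def sample_action_chunk(num_steps=10):
        """Euler-integrate dx/dt = v_theta(x_t, t) from noise (t=0) to actions (t=1)."""
        x = torch.randn(1, ACTION_DIM * CHUNK_LEN)        # start from Gaussian noise
        for i in range(num_steps):
            t = torch.full((1, 1), i / num_steps)
            v = velocity_net(torch.cat([x, t], dim=-1))   # predicted velocity
            x = x + v / num_steps                         # one Euler step
        return x.reshape(CHUNK_LEN, ACTION_DIM)           # 50 timesteps x 7 DoF

    print(sample_action_chunk().shape)  # torch.Size([50, 7])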

Training

Data Requirements

VLA models are typically trained on two types of data:

  1. Internet-scale vision-language data: Provides semantic understanding and reasoning capabilities
  2. Robot demonstration data: Teaches specific motor control and manipulation skills

The Open X-Embodiment dataset, released in October 2023, is one of the largest open collections of robot learning data, containing:[20]

  • Over 1 million real robot trajectories
  • 22 different robot embodiments
  • 527 skills across 160,266 tasks
  • Data from 21 institutions worldwide[21]

Training Methodology

VLA training typically involves:

  1. Pre-training: Starting from a pre-trained VLM to inherit semantic knowledge
  2. Co-fine-tuning: Training on both vision-language tasks and robot control data simultaneously (see the sketch after this list)
  3. Action representation learning: Teaching the model to output appropriate action tokens or continuous values
  4. Multi-task learning: Training on diverse tasks to improve generalization[22]
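The loop below sketches step 2 (co-fine-tuning) under simple assumptions: batches from a generic vision-language dataset and from robot demonstrations are interleaved at a fixed ratio, and the model is assumed to return a token-level loss through the same interface for both (with actions encoded as tokens). The mixing ratio and loss interface are illustrative, not a published recipe.

    import random

    def co_finetune(model, vl_loader, robot_loader, optimizer, steps=10_000,
                    robot_fraction=0.5):
        """Interleave vision-language batches with robot-demonstration batches."""
        vl_iter, robot_iter = iter(vl_loader), iter(robot_loader)
        for _ in range(steps):
            use_robot = random.random() < robot_fraction   # sampling ratio
            try:
                batch = next(robot_iter if use_robot else vl_iter)
            except StopIteration:
                # Restart whichever stream ran out and keep training.
                if use_robot:
                    robot_iter = iter(robot_loader)
                    batch = next(robot_iter)
                else:
                    vl_iter = iter(vl_loader)
                    batch = next(vl_iter)

            loss = model(**batch).loss    # assumed HF-style output exposing .loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()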

Notable Models

RT-2 (Robotic Transformer 2)

Developed by Google DeepMind, RT-2 was the first model to be explicitly called a VLA. It comes in two variants:

  • A 12B parameter version based on PaLM-E
  • A 55B parameter version based on PaLI-X[11]

Key capabilities include:

  • Understanding of spatial relationships and object properties
  • Chain-of-thought reasoning for complex tasks
  • Generalization to novel objects and instructions[23]

In over 6,000 physical trials, RT-2 nearly doubled the success rate on unseen tasks (62%) compared with RT-1 (32%).[1]

OpenVLA

Released in 2024, OpenVLA is a 7B-parameter open-source VLA that democratized access to VLA technology.[24] Key features include:

  • Built on Llama 2 7B with fused SigLIP and DINOv2 vision encoders
  • Trained on 970k robot manipulation trajectories
  • Support for LoRA fine-tuning for efficient adaptation
  • 16.5% higher absolute task success rate than RT-2-X (55B) despite having 7x fewer parameters[4]
  • Can be fine-tuned on a single GPU (see the hedged sketch after this list)[25]
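As a hedged sketch of what single-GPU LoRA adaptation can look like, the snippet below loads the public OpenVLA checkpoint from the Hugging Face Hub and wraps it with low-rank adapters using the peft library. The LoRA rank, dropout, and target module names are illustrative guesses, not the settings of the official OpenVLA fine-tuning scripts, and a full run would still need a dataloader of demonstration data and a training loop.

    import torch
    from transformers import AutoModelForVision2Seq, AutoProcessor
    from peft import LoraConfig, get_peft_model

    MODEL_ID = "openvla/openvla-7b"   # public checkpoint on the Hugging Face Hub

    # The checkpoint ships custom modeling code, hence trust_remote_code=True.
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    vla = AutoModelForVision2Seq.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
    )

    # Low-rank adapters: only a small set of added weights is trained, which is
    # what makes single-GPU fine-tuning of a 7B model practical.
    lora_config = LoraConfig(
        r=32,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    )
    vla = get_peft_model(vla, lora_config)
    vla.print_trainable_parameters()   # a tiny fraction of the 7B weights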

π0 (Pi-Zero)

Developed by Physical Intelligence, π0 introduced several innovations:[26]

  • Dexterous manipulation across single-arm, dual-arm, and mobile manipulator platforms
  • Multi-robot collaboration capabilities
  • Flow matching for high-frequency (50Hz) continuous control
  • Training on data from 7 robotic platforms and 68 unique tasks[27]

π0.5

An extension of π0 that demonstrates "open-world generalization":[28]

  • Can operate in entirely new environments not seen during training
  • Performs long-horizon tasks like cleaning kitchens or bedrooms
  • Uses hierarchical reasoning with high-level language planning and low-level motor control
  • Co-trained on heterogeneous data sources for improved transfer learning[29]

Gemini Robotics

Google DeepMind's Gemini Robotics On-Device is a VLA that runs directly on the robot, executing dexterous tasks without cloud connectivity, which is important for privacy-sensitive sites and edge deployment.[30]

Other Notable Models

  • Helix (Figure AI): First VLA for full humanoid upper-body control, with multi-robot collaboration; deployed in BMW factories and runs entirely on embedded GPUs. Parameters: 7B (System 2), 80M (System 1).[31]
  • VLAS (various researchers): First end-to-end model integrating a speech modality for robot manipulation. Parameters: not specified.[32]
  • CogACT (various researchers): Uses a Diffusion Transformer for action decoding; strong performance on the SIMPLER benchmark. Parameters: ~7.3B total.[14]
  • SmolVLA (Hugging Face): Lightweight model optimized for CPU inference and consumer hardware, with an asynchronous inference stack. Parameters: 450M.[16]
  • MiniVLA (Stanford AI Lab): 7x smaller than OpenVLA with improved inference speed. Parameters: ~1B.[33]

Applications

VLA models have demonstrated capabilities across a wide range of robotic manipulation tasks:

Basic Manipulation

  • Pick and place operations
  • Opening and closing drawers, doors, and containers
  • Grasping objects of various shapes and materials
  • Tool use and manipulation[34]

Complex Tasks

  • Folding laundry from a hamper
  • Assembling cardboard boxes
  • Table bussing and kitchen cleaning
  • Multi-step meal preparation
  • Collaborative tasks between multiple robots[35]

Industry Applications

  • Manufacturing: VLA-powered robots automate assembly, packaging, and quality control tasks. Figure AI's Helix has been deployed in BMW factories for industrial automation.[31]
  • Logistics and Warehousing: Zero-shot grasping of unseen SKUs, sorting packages, and loading/unloading operations. Fine-tuning OpenVLA with few-shot LoRA achieves reliable warehouse picking.[24]
  • Healthcare: Assisting with delivering supplies, patient care support, and simple medical procedures, freeing healthcare professionals for more complex tasks.[3]
  • Home and Personal Assistance: Cooking, cleaning, laundry tasks, and providing companionship and assistance to the elderly and people with disabilities.[3]
  • Disaster Response: Search and rescue operations, damage assessment, and emergency supply delivery in hazardous environments.[3]


Emergent Capabilities

VLA models have shown emergent behaviors not explicitly programmed:

  • Understanding spatial relationships (for example "pick up the object to the left of the red block")
  • Semantic reasoning (for example identifying which object could serve as an improvised hammer)
  • Adapting to novel objects and environments
  • Following abstract instructions (for example "clean the kitchen")[1]

Technical Challenges

Action Space Representation

One of the key challenges in VLA development is representing the continuous, high-dimensional action space of robots in a way that can be effectively learned and generated by neural networks. Solutions include:

  • Discretization into action tokens
  • Flow matching for continuous generation
  • Hierarchical action representations
  • Action chunking for temporal coherence[36] (see the sketch below)
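The sketch below illustrates the last item, action chunking, under simple assumptions: a policy that returns a fixed-length chunk of future actions, a dummy environment, and open-loop playback of each chunk before the policy is queried again. The chunk size and 7-dimensional action space are arbitrary choices for illustration.

    import numpy as np

    CHUNK_SIZE = 16   # future timesteps predicted per model call (assumed)

    class DummyEnv:
        """Trivial stand-in environment so the sketch runs end to end."""
        def reset(self):
            return np.zeros(10)
        def step(self, action):
            return np.zeros(10)

    def chunked_policy(observation):
        """Stand-in for a VLA that outputs a chunk of actions, not a single step."""
        return np.zeros((CHUNK_SIZE, 7))           # (horizon, action_dim)

    def run_with_action_chunking(env, policy, episode_len=160):
        """Query the policy once per chunk, then replay the chunk open-loop.

        Chunking keeps commanded motion temporally coherent and reduces the
        number of expensive model calls by a factor of CHUNK_SIZE.
        """
        obs = env.reset()
        for _ in range(episode_len // CHUNK_SIZE):
            chunk = policy(obs)                    # one forward pass
            for action in chunk:
                obs = env.step(action)             # execute at control frequency
        return obs

    run_with_action_chunking(DummyEnv(), chunked_policy)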

Real-Time Performance

Achieving real-time control (typically 10-50Hz) while running large neural networks remains challenging. Approaches to address this include:

  • Model quantization and optimization
  • Efficient inference architectures
  • Hardware acceleration
  • Asynchronous inference pipelines[37] (see the sketch below)
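The sketch below illustrates the asynchronous-pipeline idea: while the robot plays back the current action chunk, the next chunk is already being computed in a worker thread, so the slow model call is hidden behind execution time. The timings, chunk size, and stand-in functions are assumptions for illustration only.

    import time
    from concurrent.futures import ThreadPoolExecutor

    import numpy as np

    CONTROL_HZ = 20      # assumed control rate
    CHUNK_SIZE = 16      # actions per inference call

    def slow_vla_inference(observation):
        """Stand-in for a large VLA forward pass slower than one control step."""
        time.sleep(0.3)                            # pretend inference takes 300 ms
        return np.zeros((CHUNK_SIZE, 7))

    def control_loop(get_observation, send_action, num_chunks=5):
        """Overlap inference with execution using a single worker thread."""
        with ThreadPoolExecutor(max_workers=1) as pool:
            future = pool.submit(slow_vla_inference, get_observation())
            for _ in range(num_chunks):
                chunk = future.result()            # block only if not ready yet
                # Kick off the next prediction before executing this chunk.
                future = pool.submit(slow_vla_inference, get_observation())
                for action in chunk:
                    send_action(action)
                    time.sleep(1.0 / CONTROL_HZ)   # hold the control rate

    # Minimal stand-ins so the sketch runs end to end.
    control_loop(get_observation=lambda: np.zeros(10), send_action=lambda a: None)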

Generalization

While VLAs show improved generalization compared to traditional approaches, challenges remain:

  • Sim-to-real transfer
  • Cross-embodiment transfer between different robot types
  • Long-horizon task planning
  • Robustness to environmental variations[38]

Safety and Reliability

Ensuring the safety and reliability of VLA-powered robots is critical, especially in human-centric environments. Robots must operate safely and predictably, even in the presence of unexpected events and disturbances. Current access to many VLA models remains gated to mitigate potential misuse.[30]

Data Efficiency

Training VLA models requires large amounts of data, which can be expensive and time-consuming to collect. Improving the data efficiency of VLA models through better architectures and training techniques is an active area of research.[25]

Ethical Considerations

The development and deployment of VLA-powered robots raise several ethical concerns that need careful consideration:

  • Job displacement in manufacturing and service sectors
  • Privacy concerns with robots operating in personal spaces
  • Bias in decision-making inherited from training data
  • Ensuring equitable access to beneficial robotic technologies[3]

Future Directions

Research Areas

Active areas of VLA research include:

  • Multi-modal integration: Incorporating tactile, force, and proprioceptive feedback beyond vision and language
  • Self-improvement: Enabling robots to learn from their own experience through reinforcement learning
  • Long-horizon planning: Improving capabilities for complex, multi-step tasks requiring extended reasoning
  • Safety and robustness: Ensuring reliable operation in human environments with formal verification methods
  • Efficiency: Reducing computational requirements for deployment on edge devices
  • Agentic AI adaptation: Enabling VLAs to adapt more independently to new tasks without extensive fine-tuning[8]

Industry Adoption

VLA models are beginning to see widespread adoption in:

  • Manufacturing and assembly lines
  • Warehouse automation and logistics
  • Service robotics in hospitality and retail
  • Healthcare applications and elderly care
  • Home assistance and personal robotics
  • Agricultural automation[5]

References

  1. "RT-2: New model translates vision and language into action". Google DeepMind. https://deepmind.google/discover/blog/rt-2-new-model-translates-vision-and-language-into-action/.
  2. "Vision-language-action model - Wikipedia". https://en.wikipedia.org/wiki/Vision-language-action_model.
  3. "Vision-Language-Action Models". Labellerr. https://labellerr.com/blog/vision-language-action-models/.
  4. Kim, Moo Jin et al. (2024). "OpenVLA: An Open-Source Vision-Language-Action Model". https://arxiv.org/abs/2406.09246.
  5. "Vision-Language-Action (VLA) Models: LLMs for robots". Black Coffee Robotics. https://www.blackcoffeerobotics.com/blog/vision-language-action-vla-models-llms-for-robots.
  6. "What is RT-2? Google DeepMind's vision-language-action model for robotics". 2023-12-05. https://blog.google/technology/ai/google-deepmind-rt2-robotics-vla-model/.
  7. "Scaling up learning across many different robot types". Google DeepMind. https://deepmind.google/discover/blog/scaling-up-learning-across-many-different-robot-types/.
  8. "A Survey on Vision-Language-Action Models for Embodied AI". 2024. https://arxiv.org/abs/2405.14093.
  9. "Vision Language Action Models (VLA) & Policies for Robots". 2025-04-11. https://learnopencv.com/vision-language-action-models-lerobot-policy/.
  10. "RT-1: Robotics Transformer for real-world control at scale". Google Research. https://research.google/blog/rt-1-robotics-transformer-for-real-world-control-at-scale/.
  11. "Google DeepMind Announces LLM-Based Robot Controller RT-2". 2023-10-17. https://www.infoq.com/news/2023/10/deepmind-robot-transformer/.
  12. "Google DeepMind Unveils RT-2, Bringing Robots Closer to General Intelligence". 2023-07-28. https://www.maginative.com/article/google-deepmind-unveils-rt-2-bringing-robots-closer-to-general-intelligence/.
  13. "Enhanced Robotic Control with DeepMind RT-2". 2023-09-12. https://www.rtinsights.com/google-deepmind-unveils-enhanced-robotic-control-with-rt-2/.
  14. "CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation". https://cogact.github.io/.
  15. "Re-implementation of pi0 vision-language-action (VLA) model from Physical Intelligence". GitHub. https://github.com/allenzren/open-pi-zero.
  16. "SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data". Hugging Face. https://huggingface.co/blog/smolvla.
  17. "RT-2: Vision-Language-Action Models". https://robotics-transformer2.github.io/.
  18. "Our First Generalist Policy". Physical Intelligence. 2024-10-31. https://www.physicalintelligence.company/blog/pi0.
  19. "π0 and π0-FAST: Vision-Language-Action Models for General Robot Control". Hugging Face. https://huggingface.co/blog/pi0.
  20. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models". 2023. https://arxiv.org/abs/2310.08864.
  21. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models". https://robotics-transformer-x.github.io/.
  22. "OpenVLA: An open-source vision-language-action model for robotic manipulation". GitHub. https://github.com/openvla/openvla.
  23. "Google Deepmind's latest AI model RT-2 "can speak robot"". 2023-07-29. https://the-decoder.com/google-deepminds-latest-ai-model-rt-2-can-speak-robot/.
  24. "OpenVLA: An Open-Source Vision-Language-Action Model". https://openvla.github.io/.
  25. "OpenVLA Paper". Hugging Face. https://huggingface.co/papers/2506.01844.
  26. "Physical Intelligence Unveils Robotics Foundation Model Pi-Zero". 2024-12-03. https://www.infoq.com/news/2024/12/pi-zero-robot/.
  27. "Incredible generalist robots show us a future free of chores". 2024-11-01. https://newatlas.com/robotics/pi-generalist-autonomous-robot/.
  28. "A VLA with Open-World Generalization". Physical Intelligence. 2025-04-22. https://www.physicalintelligence.company/blog/pi05.
  29. "Physical Intelligence's π0.5 VLA with Open-World Generalization". 2025-04-23. https://mikekalil.com/blog/pi-vla-open-world-generalization/.
  30. "Google DeepMind's optimized AI model runs directly on robots". The Verge. 2024-06-24. https://theverge.com/google-deepmind-optimized-ai-model-robots.
  31. "Helix: A Vision-Language-Action Model for Generalist Humanoid Control". Figure. https://www.figure.ai/news/helix.
  32. "VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation". OpenReview. https://openreview.net/forum?id=K4FAFNRpko.
  33. "MiniVLA: A Better VLA with a Smaller Footprint". SAIL Blog. 2024-12-12. http://ai.stanford.edu/blog/minivla/.
  34. "Benchmarking Vision, Language, & Action Models on Robotic Learning Tasks". 2024. https://arxiv.org/abs/2411.05821.
  35. "π₀: A Vision-Language-Action Flow Model for General Robot Control". https://arxiv.org/html/2410.24164v1.
  36. "A curated list of state-of-the-art research in embodied AI". GitHub. https://github.com/jonyzhang2023/awesome-embodied-vla-va-vln.
  37. "OpenVLA - NVIDIA Jetson AI Lab". NVIDIA. https://www.jetson-ai-lab.com/openvla.html.
  38. "VLAs that Train Fast, Run Fast, and Generalize Better". Physical Intelligence. https://www.physicalintelligence.company/research/knowledge_insulation.
