OpenVLA
Last reviewed
May 2, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 3,111 words
OpenVLA is a 7-billion-parameter open-source vision-language-action model (VLA) for robotic manipulation, released in June 2024 by a collaboration of researchers from Stanford University, UC Berkeley, the Toyota Research Institute, Google DeepMind, Physical Intelligence, and MIT. It was the first generalist robot policy of its size to ship with public weights, training code, and a permissive license, and it set a reference point for what an open VLA could look like in an area that had been dominated by closed industrial models such as RT-2.
The model is built on a Prismatic VLM backbone that fuses DINOv2 and SigLIP visual encoders with a Llama 2 7B language model. It was trained on roughly 970,000 robot manipulation trajectories drawn from the Open X-Embodiment dataset, covering more than 70 individual robot datasets and 22 embodiments. Despite being seven times smaller than RT-2-X (Google DeepMind's 55B closed model), OpenVLA outperformed it by 16.5 percentage points on a 29-task evaluation suite spanning multiple robots. The paper was accepted at the Conference on Robot Learning (CoRL) 2024.
OpenVLA matters less because of any single benchmark number and more because of what it unlocked. Within months of release, dozens of follow-ups began fine-tuning it on new arms, distilling it into smaller policies, retrofitting it with continuous action heads, and using it as the baseline that any new VLA had to beat. The 2025 sequel OpenVLA-OFT pushed the same checkpoint to state of the art on the LIBERO simulation benchmark and made bimanual high-frequency control practical on ALOHA-style hardware.
Vision-language-action models are a class of foundation models that take an image (or several) plus a natural-language instruction and output low-level robot actions, usually end-effector deltas or joint targets. The lineage runs through Google's RT-1 (2022), which was a small transformer policy trained on robot demonstrations, and RT-2 (2023), which fine-tuned a closed vision-language model (PaLI-X, up to 55B parameters) to emit action tokens directly. RT-2 showed that internet-scale visual and linguistic priors transfer to manipulation in ways pure robot data cannot match, but it was never released. RT-X / RT-2-X, announced alongside the Open X-Embodiment collaboration in late 2023, retrained RT-2 on the cross-embodiment dataset and reported strong generalization across robots, again without releasing weights.
The situation by spring 2024 was awkward. The strongest generalist policies were closed and ran on Google's internal infrastructure. Open alternatives existed (notably Octo from UC Berkeley, a 27M to 93M parameter transformer trained on the same Open X-Embodiment data), but they did not use a pretrained language model as the backbone and lagged on language-conditioned generalization. OpenVLA was designed to close that gap: take the recipe RT-2 had validated, swap in an open VLM, train on the same public data, and release everything.
OpenVLA is, mechanically, a fine-tuned Prismatic VLM. The Prismatic family was introduced earlier in 2024 by Karamcheti and collaborators as an open recipe for vision-language model training; OpenVLA adopts the prism-dinosiglip-224px configuration. Three components matter.
Each RGB image is fed through two pretrained vision encoders in parallel. SigLIP (a CLIP-style sigmoid-loss image-text model from Google) provides features that are well aligned with language. DINOv2 (Meta's self-supervised vision transformer) provides features that are stronger for spatial and geometric reasoning. The two feature maps are concatenated channel-wise. This dual-encoder setup is one of the Prismatic recipe's key contributions: it gave better grounding than either encoder alone.
A two-layer MLP projects the fused visual features into the language model's embedding space, where they are treated as if they were tokens.
The backbone is Llama 2 7B. It autoregressively predicts a sequence of tokens, conditioned on the visual embeddings followed by the natural-language instruction. The output sequence is interpreted as a robot action.
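The data flow through the three components can be sketched with plain Python lists. This is a toy illustration, not the Prismatic implementation: all widths are tiny hypothetical values, the weights are random, and ReLU stands in for whatever activation the real projector uses.

```python
import random

# Hypothetical toy sizes; the real encoders and Llama 2 are far wider.
NUM_PATCHES = 4   # image patches per frame
SIGLIP_DIM = 3    # SigLIP feature width (toy value)
DINO_DIM = 2      # DINOv2 feature width (toy value)
HIDDEN_DIM = 6    # projector hidden width (toy value)
LLM_DIM = 5       # language-model embedding width (toy value)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    # a is (n x k), b is (k x m); zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse_channelwise(siglip_feats, dino_feats):
    """Concatenate the two per-patch feature vectors along the channel axis."""
    assert len(siglip_feats) == len(dino_feats)
    return [s + d for s, d in zip(siglip_feats, dino_feats)]


def project(fused, w1, w2):
    """Two-layer MLP mapping fused patch features to LLM-width embeddings."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(fused, w1)]  # ReLU stand-in
    return matmul(hidden, w2)


rng = random.Random(0)
siglip = random_matrix(NUM_PATCHES, SIGLIP_DIM, rng)
dino = random_matrix(NUM_PATCHES, DINO_DIM, rng)
fused = fuse_channelwise(siglip, dino)

w1 = random_matrix(SIGLIP_DIM + DINO_DIM, HIDDEN_DIM, rng)
w2 = random_matrix(HIDDEN_DIM, LLM_DIM, rng)
visual_tokens = project(fused, w1, w2)

# Stand-in for the embedded instruction tokens (3 tokens here).
text_tokens = random_matrix(3, LLM_DIM, rng)

# The language model sees the visual embeddings followed by the instruction.
llm_input = visual_tokens + text_tokens
print(len(fused[0]), len(llm_input), len(llm_input[0]))  # 5 7 5
```

The point of the sketch is the shapes: concatenation widens each patch vector, the projector brings it to the language model's width, and the result is simply prepended to the text sequence.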
This is where OpenVLA diverges from a typical VLM. Robot actions are continuous (7 degrees of freedom for a single arm: 3 translation, 3 rotation, 1 gripper), but Llama 2 emits discrete tokens. The authors discretize each action dimension separately into 256 bins of equal width spanning the interval between the 1st and 99th percentiles of that dimension in the training data. Using quantile bounds rather than the minimum and maximum keeps outlier actions from stretching the range and coarsening the bins.
Rather than expand the vocabulary, OpenVLA overwrites the 256 least-used tokens in the Llama 2 tokenizer (in practice, the last 256 entries of the vocabulary) and reassigns them to action bins. At inference time the model produces 7 of these tokens per timestep, which a small de-tokenizer maps back into a continuous action vector that the robot can execute. The trick is cheap and simple, and it leaves the vocabulary size and the rest of the VLM architecture unchanged.
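A minimal sketch of the tokenize and de-tokenize round trip follows. It assumes the bin edges for each dimension have already been computed from training data and are passed in; the equal-width edges over [-1, 1] in the demo are purely illustrative. The placement of action tokens at the end of the 32,000-entry vocabulary, and the ordering of bins within that range, are assumptions of this sketch rather than a description of the released code.

```python
from bisect import bisect_right

VOCAB_SIZE = 32_000                          # Llama 2 tokenizer vocabulary size
NUM_BINS = 256
FIRST_ACTION_TOKEN = VOCAB_SIZE - NUM_BINS   # assumption: action ids occupy the tail


def tokenize_action(action, edges_per_dim):
    """Map each continuous value to the token id of the bin that contains it."""
    tokens = []
    for value, edges in zip(action, edges_per_dim):
        clipped = min(max(value, edges[0]), edges[-1])
        bin_index = min(bisect_right(edges, clipped) - 1, NUM_BINS - 1)
        tokens.append(FIRST_ACTION_TOKEN + bin_index)
    return tokens


def detokenize_action(tokens, edges_per_dim):
    """Map each token id back to the center of its bin."""
    action = []
    for token, edges in zip(tokens, edges_per_dim):
        bin_index = token - FIRST_ACTION_TOKEN
        action.append(0.5 * (edges[bin_index] + edges[bin_index + 1]))
    return action


# Illustrative edges: 257 equally spaced points over [-1, 1] for each of 7 dimensions.
demo_edges = [[-1.0 + 2.0 * i / NUM_BINS for i in range(NUM_BINS + 1)]] * 7

original = [0.10, -0.25, 0.0, 0.5, -1.0, 1.0, 0.3]
tokens = tokenize_action(original, demo_edges)
recovered = detokenize_action(tokens, demo_edges)
max_error = max(abs(a - b) for a, b in zip(original, recovered))
print(tokens)
print(max_error)
```

With these illustrative edges each bin is 2/256 wide, so the round-trip error is bounded by half a bin (1/256). That bound is the quantization cost the discrete formulation pays on every action dimension.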
The input image resolution is 224x224. The model takes a single third-person camera frame at each step (no proprioception, no history), which keeps things simple but also limits the kinds of tasks the base model can solve.
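The single-frame, no-history interface can be made concrete with a small control-loop sketch. Every name here (`Action`, `Policy`, `run_episode`, `dummy_policy`, `blank_frame`) is hypothetical and exists only for illustration; the treatment of translation and rotation as deltas is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frame = List[List[Tuple[int, int, int]]]  # hypothetical H x W grid of RGB pixels


@dataclass(frozen=True)
class Action:
    translation: Tuple[float, float, float]  # assumed end-effector position delta
    rotation: Tuple[float, float, float]     # assumed end-effector orientation delta
    gripper: float                           # gripper command


# A policy maps the current frame and the instruction to one action, nothing else.
Policy = Callable[[Frame, str], Action]


def run_episode(policy: Policy,
                get_frame: Callable[[], Frame],
                apply_action: Callable[[Action], None],
                instruction: str,
                max_steps: int) -> List[Action]:
    executed = []
    for _ in range(max_steps):
        frame = get_frame()                  # one third-person frame per step
        action = policy(frame, instruction)  # no history or proprioception passed in
        apply_action(action)
        executed.append(action)
    return executed


def dummy_policy(frame: Frame, instruction: str) -> Action:
    # Placeholder for the model: always nudges upward with the gripper open.
    return Action((0.0, 0.0, 0.01), (0.0, 0.0, 0.0), 1.0)


def blank_frame() -> Frame:
    return [[(0, 0, 0)] * 224 for _ in range(224)]


log: List[Action] = []
trajectory = run_episode(dummy_policy, blank_frame, log.append,
                         "pick up the cup", max_steps=3)
print(len(trajectory), trajectory[0].gripper)  # 3 1.0
```

Because the policy signature carries no state between calls, anything the task needs to remember has to be visible in the current image, which is the limitation the paragraph above describes.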
The pretraining mixture is a curated subset of Open X-Embodiment, the cross-institution dataset that aggregates over 70 robot datasets and roughly 2 million trajectories from 22 embodiments. The OpenVLA team filtered this down to about 970,000 trajectories. The filtering removed datasets that were too small, used non-standard action spaces, or had sparse language annotations, and rebalanced the mixture so that no single source dominated. The action space is unified to a single 7-DoF end-effector Cartesian formulation; per-dataset proprioceptive and gripper conventions are normalized.
OpenVLA-7B was trained on 64 NVIDIA A100 GPUs for roughly 14 days. Total compute was about 21,500 A100-hours at a global batch size of 2,048. By 2024 foundation model standards, that is small. By robotics standards, it is enormous; most published manipulation policies before OpenVLA used a few thousand GPU-hours at most.
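The GPU-hour total follows directly from the GPU count and duration quoted above, as this short check shows:

```python
gpus = 64
days = 14
hours_per_day = 24

gpu_hours = gpus * days * hours_per_day
print(gpu_hours)  # 21504
```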
The training objective is plain next-token cross-entropy on the action token sequence. There is no separate action loss, no diffusion head, no flow-matching component (those approaches came later, in pi0 and OpenVLA-OFT). The full Llama 2 backbone is updated. The vision encoders are fine-tuned along with it rather than frozen, which the authors found important: freezing them, the usual choice when training VLMs, hurt policy performance.
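As a rough sketch of that objective, the snippet below computes cross-entropy only at positions whose label is an action token. The `IGNORE_INDEX = -100` masking convention is borrowed from PyTorch's cross-entropy loss and is an assumption here, as are the toy vocabulary and logits.

```python
import math

IGNORE_INDEX = -100  # assumed masking convention for non-action positions


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def action_token_loss(logits_per_position, labels):
    """Mean cross-entropy over positions that carry an action-token label."""
    losses = [
        cross_entropy(logits, label)
        for logits, label in zip(logits_per_position, labels)
        if label != IGNORE_INDEX
    ]
    return sum(losses) / len(losses)


# Toy example: vocabulary of 4, two prompt positions (masked), three action positions.
uniform = [0.0, 0.0, 0.0, 0.0]
logits = [uniform] * 5
labels = [IGNORE_INDEX, IGNORE_INDEX, 1, 3, 0]

loss = action_token_loss(logits, labels)
print(round(loss, 4))  # 1.3863, i.e. ln(4) for a uniform prediction
```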
The original paper reports zero-shot evaluation on 29 tasks across multiple robots, comparing OpenVLA-7B to RT-1-X, Octo, and the closed RT-2-X (55B). Results are aggregated by embodiment.
| Model | Parameters | Open weights | BridgeData V2 | Google robot | Average |
|---|---|---|---|---|---|
| RT-1-X | 35M | yes | low | low | low |
| Octo | 93M | yes | mid | mid | mid |
| RT-2-X | 55B | no | mid | high | mid-high |
| OpenVLA-7B | 7B | yes | high | mid-high | high |
The headline number is +16.5 percentage points absolute success rate over RT-2-X averaged across the 29 tasks, despite OpenVLA having seven times fewer parameters. On the WidowX BridgeData V2 benchmark in particular, OpenVLA outperforms RT-2-X by a wide margin; on the Google robot evaluation it is roughly comparable. Against Octo and RT-1-X, OpenVLA wins on essentially every task category, with the gap widening on language-conditioned generalization (novel object combinations, novel attribute descriptions).
The paper also reports fine-tuning experiments. Given a few hundred demonstrations of a new task on a new robot (Franka Panda is the headline platform), full fine-tuning takes 5 to 15 hours on 8 A100s and reaches 50% or higher success on most of the tasks tested. With LoRA fine-tuning, comparable results can be reached in 10 to 15 hours on a single A100 while updating only about 1.4% of parameters, an 8x reduction in training compute over full fine-tuning. The authors observe that LoRA fine-tunes match full fine-tuning quality on the tasks they tested, which made OpenVLA the first generalist VLA that could be adapted to a new robot overnight on a single GPU rather than on a TPU pod.
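One way to read the 8x figure is to convert both recipes to GPU-hours using the ranges quoted above. The pairing of range endpoints is an interpretation made for this sketch, not something the text states.

```python
def gpu_hours(num_gpus, hours):
    return num_gpus * hours


full_low, full_high = gpu_hours(8, 5), gpu_hours(8, 15)    # 40 to 120 GPU-hours
lora_low, lora_high = gpu_hours(1, 10), gpu_hours(1, 15)   # 10 to 15 GPU-hours

ratio_at_upper_ends = full_high / lora_high   # 120 / 15
ratio_at_lower_ends = full_low / lora_low     # 40 / 10

print(ratio_at_upper_ends, ratio_at_lower_ends)  # 8.0 4.0
```

Under this reading the saving ranges from about 4x to 8x depending on the task, with the 8x figure matching the longest runs of each recipe.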
| Model | Year | Parameters | Backbone | Action representation | Open weights |
|---|---|---|---|---|---|
| RT-1 | 2022 | 35M | EfficientNet + transformer | Discrete tokens | No |
| RT-2 | 2023 | up to 55B | PaLI-X | Discrete tokens | No |
| RT-2-X | 2023 | 55B | PaLI-X | Discrete tokens | No |
| Octo | 2024 | 27M / 93M | Transformer + diffusion head | Continuous (diffusion) | Yes |
| OpenVLA | 2024 | 7B | Llama 2 + DINOv2 + SigLIP | Discrete tokens (256 bins) | Yes (Apache 2.0 + Llama 2) |
| pi0 | 2024 | 3.3B | PaliGemma + flow-matching action expert | Continuous (flow matching) | Partial |
| OpenVLA-OFT | 2025 | 7B | OpenVLA + L1 regression head | Continuous (action chunks) | Yes |
| GR00T N1 | 2025 | 2.2B | VLM + diffusion transformer | Continuous (action chunks) | Yes |
A few observations are worth pulling out. OpenVLA is the largest model in the open-VLA category, and the only one that ships a pure autoregressive decoder. Almost every model that came after it moved to continuous action representations (diffusion or flow matching) for one main reason: faster inference. Token-by-token decoding caps OpenVLA at roughly 5 Hz on an A100 for a single arm, which is not enough for high-frequency control. The flip side is that the discrete-token formulation lets the VLM be trained essentially as a language model, which is conceptually simple and lets it inherit instruction-following behavior from Llama 2 cleanly.
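The cap on control rate follows from simple latency arithmetic: each action needs seven tokens, and each token needs its own pass through the language model. The per-token latency below is a hypothetical value chosen for illustration, and the sketch ignores the cost of encoding the image and prompt.

```python
TOKENS_PER_ACTION = 7  # one token per action dimension for a single arm


def autoregressive_rate_hz(seconds_per_token, tokens_per_action=TOKENS_PER_ACTION):
    """Actions per second when tokens are decoded one after another."""
    return 1.0 / (tokens_per_action * seconds_per_token)


def max_token_latency(target_hz, tokens_per_action=TOKENS_PER_ACTION):
    """Largest per-token latency (seconds) that still meets a target action rate."""
    return 1.0 / (target_hz * tokens_per_action)


hypothetical_latency = 0.030  # 30 ms per decoded token, an assumed figure
print(round(autoregressive_rate_hz(hypothetical_latency), 2))  # 4.76
print(round(max_token_latency(50.0) * 1000, 2))                # 2.86 (ms per token for 50 Hz)
```

Reaching 50 Hz this way would need each token to decode in under 3 ms, which is why later models moved away from per-token decoding rather than trying to speed it up.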
Against Octo, OpenVLA wins on language understanding and out-of-distribution objects but is roughly 75 times larger (7B versus 93M parameters) and much slower at inference. Octo's 93M parameter model fine-tunes in 2 to 4 hours on a single GPU and runs at around 10 Hz. The choice between them is a tradeoff, not a strict ordering.
Against pi0, released a few months later by Physical Intelligence, OpenVLA is open under a permissive license while pi0's pretrained weights were initially gated. pi0 uses flow matching to produce 50 Hz continuous action chunks, which is more practical for dexterous tasks. On the laundry-folding and table-bussing demos that pi0 leaned on, OpenVLA out of the box would not be competitive. With OFT-style fine-tuning, the gap narrows considerably.
Against GR00T N1, released by NVIDIA in March 2025, OpenVLA is older and smaller in vision capability but more general across embodiments. GR00T N1 is humanoid-focused and uses a dual-system architecture (slow planner + fast diffusion action policy), trained on 50,000 H100 GPU-hours, much more compute than OpenVLA used.
OpenVLA was released with code under the Apache 2.0 license at https://github.com/openvla/openvla, model weights on Hugging Face at https://huggingface.co/openvla/openvla-7b, training scripts, the data filter and weighting code, and inference servers for several robots. The weights are subject to the Llama 2 community license inherited from the backbone, which is permissive for most uses but has Meta's standard restrictions for very large commercial deployments.
The practical effect of releasing the entire pipeline (and not just weights) was significant. Within weeks, third parties had ports for Jetson edge devices, 4-bit quantized inference (which the authors showed matches bfloat16 within noise while halving memory), and adaptations to robots that were not in the training data. The official codebase supports full fine-tuning, partial fine-tuning, and quantized LoRA via Hugging Face's PEFT library out of the box.
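A back-of-the-envelope estimate shows why 4-bit quantization matters for edge devices. The sketch counts weights only and uses a rounded 7 billion parameters; activations, caches, and quantization metadata are not included, so a measured footprint will be larger than these numbers and the measured saving smaller than the ratio shown.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # rounded parameter count, an approximation

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(bf16_gb, int4_gb)  # 14.0 3.5
```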
Fine-tuning OpenVLA on a new task usually means collecting 100 to 500 teleoperated demonstrations on the target robot and running one of three recipes.
Full fine-tuning. Update all parameters, typically with 8 A100 or H100 GPUs for 5 to 15 hours. Best ceiling, most compute.
LoRA fine-tuning. Update low-rank adaptation modules over the attention and MLP layers, typically rank 32. One A100, 10 to 15 hours, comparable success rate on the tasks the authors tested. This is the most common recipe in published follow-ups.
Quantized LoRA. Combine LoRA with 4-bit weight quantization. Memory drops by more than half, and the authors report no measurable performance loss. Several papers in 2025 have shown this configuration running on consumer GPUs with as little as 8 GB of VRAM, though throughput remains the bottleneck.
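To give a sense of how few parameters a rank-32 adapter touches, the sketch below counts LoRA parameters for the language model alone. It uses the published Llama 2 7B dimensions (hidden size 4096, MLP width 11008, 32 layers) and assumes adapters on the four attention projections and three MLP projections of every layer; which layers actually carry adapters in a given run is an assumption here.

```python
HIDDEN = 4096         # Llama 2 7B hidden size
INTERMEDIATE = 11008  # Llama 2 7B MLP width
NUM_LAYERS = 32
RANK = 32


def lora_params(d_in, d_out, rank):
    """A rank-r adapter adds an (d_in x r) and an (r x d_out) matrix."""
    return rank * (d_in + d_out)


# Assumed targets: q, k, v, o projections plus gate, up, down projections.
attention = 4 * lora_params(HIDDEN, HIDDEN, RANK)
mlp = 2 * lora_params(HIDDEN, INTERMEDIATE, RANK) + lora_params(INTERMEDIATE, HIDDEN, RANK)

total = NUM_LAYERS * (attention + mlp)
fraction = total / 7e9  # against a rounded 7B parameter count

print(f"{total:,} adapter parameters, about {fraction:.2%} of 7B")
# 79,953,920 adapter parameters, about 1.14% of 7B
```

This language-model-only estimate lands a little below the roughly 1.4% reported for OpenVLA, which is plausible if adapters are also attached to modules not counted here, such as the vision encoders and projector.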
The fine-tuning workflow is one of the main reasons OpenVLA caught on. RT-2 fine-tuning is something only Google can do; OpenVLA fine-tuning is something a small lab can do over a weekend. That accessibility, more than any benchmark number, is what made OpenVLA the default VLA for academic robotics work through 2024 and into 2025.
The paper and follow-up work are candid about what OpenVLA does not solve.
Inference throughput. Autoregressive decoding produces one token per forward pass, so each 7-token action requires seven sequential passes, with most of the time spent in the 7B language model. Real measurements give roughly 3 to 5 Hz on an A100 for a single arm and lower for bimanual setups, well below the 25 to 50 Hz that contact-rich and high-frequency control needs. This is the limitation that drove most subsequent VLA work.
Single-image, no history. The base model takes one third-person camera frame and no proprioception. Tasks that need wrist cameras, force feedback, or memory of recent states are out of distribution.
Coarse action discretization. 256 bins per dimension is fine for pick-and-place but coarse for fine insertion or pouring tasks. Continuous action heads (added in OpenVLA-OFT, pi0, GR00T) close this gap.
Bimanual and dexterous manipulation. With a single image and slow autoregressive decoding, bimanual tasks running at 50 Hz on an ALOHA cell are essentially impossible with the base recipe. OFT-style fine-tuning was needed to make this work.
Reliability ceiling. Even on tasks where OpenVLA outperforms RT-2-X, success rates typically live in the 70-90% band rather than the 95%+ range that a deployed system would want. Generalist robot policies are not yet at the reliability that, say, language models reached for translation; OpenVLA is no exception.
The most directly tied successor is OpenVLA-OFT (Optimized Fine-Tuning), released in February 2025 by Moo Jin Kim, Chelsea Finn, and Percy Liang. OFT is not a new model so much as a fine-tuning recipe applied to the same OpenVLA-7B checkpoint. It does four things at once: replaces the discrete action token head with an L1 regression head producing continuous values, predicts action chunks (multiple future timesteps at once) instead of single actions, decodes those chunks in parallel rather than autoregressively, and fine-tunes with a simple regression objective. Together these push throughput up by 26x over base OpenVLA and lift LIBERO benchmark success from 76.5% to 97.1%. On real bimanual ALOHA hardware, OpenVLA-OFT outperforms pi0 and RDT-1B fine-tuned with their default recipes, as well as scratch-trained Diffusion Policy and ACT, by up to 15 percentage points on average. With 25-step action chunks, throughput reaches 43x the base model.
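Two of the OFT ingredients are easy to illustrate in isolation: the L1 regression objective over an action chunk, and the drop in sequential decoding steps when a chunk is produced in one parallel pass. This is a hedged sketch with made-up values, not the OFT code. Counting passes also says nothing about the cost of each pass, so it does not reproduce the measured 26x and 43x speedups.

```python
def l1_loss(predicted_chunk, target_chunk):
    """Mean absolute error over every dimension of every timestep in the chunk."""
    total, count = 0.0, 0
    for pred_step, target_step in zip(predicted_chunk, target_chunk):
        for pred, target in zip(pred_step, target_step):
            total += abs(pred - target)
            count += 1
    return total / count


def sequential_passes_autoregressive(chunk_size, action_dims):
    """Token-by-token decoding: one pass per action dimension per timestep."""
    return chunk_size * action_dims


def sequential_passes_parallel(chunk_size, action_dims):
    """Parallel decoding: the whole chunk comes out of a single pass."""
    return 1


# Toy chunk of 2 timesteps with 7 dimensions each (values are made up).
predicted = [[0.0] * 7, [0.5] * 7]
target = [[0.1] * 7, [0.5] * 7]
print(round(l1_loss(predicted, target), 4))  # 0.05

print(sequential_passes_autoregressive(25, 7), sequential_passes_parallel(25, 7))  # 175 1
```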
Other notable follow-ups include MiniVLA, smaller variants for edge deployment; numerous task-specific fine-tunes released on Hugging Face across the second half of 2024 and 2025; and the use of OpenVLA as a starting point for online reinforcement learning experiments, where a pretrained generalist policy is bootstrapped with on-robot RL. Most VLA papers published after mid-2024 use OpenVLA as a baseline, which is its own kind of citation impact.
In the months after release, OpenVLA was widely covered in robotics-focused press as the first credibly open competitor to RT-2. VentureBeat called it an "open-source generalist robotics model" and tied it to the broader push to make embodied AI reproducible. The MIT AI Agent Index and several academic surveys list it as the canonical reference for open VLAs.
The research-community impact is harder to summarize cleanly because it is still ongoing. Two things are clear. First, almost every open VLA paper since June 2024 either uses OpenVLA's recipe (Prismatic backbone, Open X-Embodiment data filter, action tokenization scheme) or compares against it. Second, the model became the default starting point for graduate students training policies on lab arms; a search of recent CoRL and RSS papers finds dozens of OpenVLA-derived systems. The paper has been cited in many subsequent surveys of VLA models and robot foundation models.
The authors went on to other work in roughly the directions you would expect. Moo Jin Kim led OpenVLA-OFT. Karl Pertsch and others moved closer to Physical Intelligence and the pi0 line. Sergey Levine remains at UC Berkeley, where his group continues releasing open robot policies. Chelsea Finn and Percy Liang co-advised the OFT follow-up. Russ Tedrake's involvement at TRI signaled the institute's continued bet on open VLAs alongside its own internal work.
The full author list, in order: Moo Jin Kim (Stanford), Karl Pertsch (Stanford / UC Berkeley), Siddharth Karamcheti (Stanford), Ted Xiao (Google DeepMind), Ashwin Balakrishna (TRI), Suraj Nair (TRI), Rafael Rafailov (Stanford), Ethan Foster (Stanford), Grace Lam (Stanford), Pannag Sanketi (Google DeepMind), Quan Vuong (Google DeepMind), Thomas Kollar (TRI), Benjamin Burchfiel (TRI), Russ Tedrake (TRI / MIT), Dorsa Sadigh (Stanford), Sergey Levine (UC Berkeley), Percy Liang (Stanford), Chelsea Finn (Stanford). Lead authorship (Kim, Pertsch, Karamcheti) is shared.