SmolVLA
Last reviewed
Sources
4 citations
Review status
Source-backed
Revision
v4 · 2,542 words
SmolVLA (Small Vision-Language-Action) is a compact, open-source vision-language-action model (VLA) for robotics developed by Hugging Face and released in June 2025. At roughly 450 million parameters, it is about 10 times smaller than typical VLAs yet achieves comparable task performance, can be trained on a single GPU, and runs on consumer-grade hardware such as a consumer GPU, a CPU, or a MacBook.[1][2] It was pretrained entirely on about 10 million frames drawn from open, community-contributed LeRobot datasets (481 by the paper's count, 487 by the blog's), making it the first widely adopted VLA built on freely shared, affordable-robot data rather than proprietary fleets.[2][3]
The paper, titled "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics," states the design goal plainly: "we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance."[1] The name "SmolVLA" reflects its small size, being a playful spelling of "small."
SmolVLA is a robotics foundation model that takes camera images, a natural-language instruction, and the robot's current joint state, and outputs continuous low-level actions (motor commands) for a manipulator. It follows the same three-modality vision, language, and action paradigm as larger VLAs such as OpenVLA, RT-2, and Physical Intelligence's pi0, but it is built explicitly for affordability and accessibility rather than scale.[1]
The model was created to address three obstacles that the authors argue keep robot learning out of reach for most people: the high compute cost of billion-parameter VLAs, the reliance on closed academic and industrial datasets, and the difficulty of deploying large policies on cheap hardware.[1] As the abstract puts it, existing VLAs are "typically massive, often with billions of parameters, leading to high training costs and limited real-world deployability," and they "rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms."[1] SmolVLA is Hugging Face's answer: a model and recipe that a student, hobbyist, or small lab can fine-tune and run end to end.
The model card describes it as "a compact, efficient Vision-Language-Action (VLA) model designed for affordable and efficient robotics, trainable on a single GPU and deployable on consumer hardware."[3] Its code, weights, and training data are released openly under the LeRobot project.
SmolVLA aims to broaden access to robotics by bringing vision-language-action capabilities to consumer-grade hardware, which makes it suitable for educational projects, research, and small-scale automation.[2] Despite its compact size, the paper reports that "SmolVLA achieves performance comparable to VLAs that are 10x larger."[1]
The model achieves its efficiency through several design choices:
Its open-source nature and reliance on community-driven data foster collaboration, potentially accelerating innovation in robotics.[2]
SmolVLA was developed by a team at Hugging Face and academic collaborators. The arXiv paper (ID 2506.01844) was submitted on June 2, 2025, and the official Hugging Face blog post, model weights, and code were published on June 3, 2025.[1][2] The model is part of Hugging Face's LeRobot ecosystem, the open robotics stack launched in 2024 that provides robotics-focused models, datasets, and tools.[2]
The listed authors are: Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf (co-founder of Hugging Face), and Remi Cadene (research scientist leading LeRobot at Hugging Face).[1] Lead author Mustafa Shukor and Dana Aubakirova are affiliated with Sorbonne University and ENS Paris-Saclay respectively, and Matthieu Cord is also at Sorbonne University; the work is a Hugging Face and academic collaboration.
| Date | Milestone |
|---|---|
| June 2, 2025 | SmolVLA arXiv paper (2506.01844) submitted[1] |
| June 3, 2025 | Official Hugging Face blog post, model weights, and code released[2] |
| June 4, 2025 | Press coverage of a robotics model efficient enough to "run on a MacBook"[4] |
SmolVLA's architecture has two main components: a pretrained vision-language backbone for perception, and a lightweight action expert that generates robot motions.[1]
The perception module is based on SmolVLM-2, an efficient open vision-language model (~500M parameters) optimized for multi-image and video inputs. It pairs a SigLIP visual encoder with a compact language decoder built on SmolLM2.[1] Key features:
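One efficiency measure listed in the specifications for this backbone is that only the first N = L/2 language-decoder layers are used. The sketch below illustrates that idea with toy NumPy layers. The layer definition, the sizes, and the names `make_layer` and `run_first_half` are assumptions made for illustration and are not taken from the SmolVLA code.

```python
import numpy as np


def make_layer(seed: int, dim: int):
    """Return a toy residual layer, x + tanh(x @ W), standing in for a transformer block."""
    rng = np.random.default_rng(seed)
    weight = rng.normal(scale=0.1, size=(dim, dim))
    return lambda x: x + np.tanh(x @ weight)


def run_first_half(layers, hidden):
    """Apply only the first N = L // 2 layers and return (features, N)."""
    n_used = len(layers) // 2
    for layer in layers[:n_used]:
        hidden = layer(hidden)
    return hidden, n_used


dim, total_layers = 16, 8          # hypothetical sizes, not the real model's
layers = [make_layer(i, dim) for i in range(total_layers)]
tokens = np.ones((5, dim))         # five toy input tokens

features, n_used = run_first_half(layers, tokens)
print(n_used, features.shape)
```

The remaining layers are never evaluated, so the compute spent in this toy stack is roughly halved. The features handed to a downstream module come from the middle of the network instead of its final layer.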
The action expert is a specialized transformer module of about 100 million parameters that generates continuous robot actions:[1]
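As a rough illustration of how flow matching turns noise into a continuous action chunk, the sketch below integrates a velocity field with fixed Euler steps. It is a minimal sketch under stated assumptions: `toy_velocity` is a hypothetical closed-form stand-in for the learned action expert, and the chunk length, action dimension, step count, and time convention (noise at t = 0, actions at t = 1) are all illustrative choices.

```python
import numpy as np


def toy_velocity(x, t, target):
    """Hypothetical stand-in for the learned network.

    Returns the ideal velocity that moves x toward `target` along a
    straight-line path; a real action expert would predict this from
    vision-language features instead of being handed the target.
    """
    return (target - x) / (1.0 - t)


def sample_action_chunk(target, num_steps=10, seed=0):
    """Euler-integrate the velocity field from Gaussian noise (t=0) toward t=1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)   # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * toy_velocity(x, t, target)
    return x


chunk_len, action_dim = 50, 6               # assumed sizes for illustration
target = np.linspace(-1.0, 1.0, chunk_len * action_dim).reshape(chunk_len, action_dim)

chunk = sample_action_chunk(target)
print(chunk.shape, np.allclose(chunk, target))
```

Because the toy field is exact, the integration lands on the target chunk. With a learned field the result is an approximation, and its quality depends on the model and the number of integration steps.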
A key contribution is SmolVLA's asynchronous inference system, which introduces a RobotClient and PolicyServer schema:[2]
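The benefit of decoupling action execution from chunk prediction can be illustrated with a small step-based simulation. This is a sketch under assumptions, not the LeRobot RobotClient or PolicyServer API: `simulate`, `latency`, and `threshold` are hypothetical names, the "server" is modeled only as a fixed delay, and actions are placeholders.

```python
from collections import deque


def simulate(total_steps, chunk_size, latency, threshold):
    """Count control steps on which the robot has no action to execute.

    A new chunk is requested when the queue holds <= `threshold` actions
    and no request is already in flight. threshold=0 mimics synchronous
    inference (the client waits until the queue is empty).
    """
    queue = deque()
    arrival = None          # step at which the pending chunk arrives
    idle = 0
    for step in range(total_steps):
        # Receive the chunk from the (simulated) policy server.
        if arrival is not None and step >= arrival:
            queue.extend(range(chunk_size))   # placeholder actions
            arrival = None
        # Ask for the next chunk before the queue runs dry.
        if arrival is None and len(queue) <= threshold:
            arrival = step + latency
        # Execute one action, or stall if none is available.
        if queue:
            queue.popleft()
        else:
            idle += 1
    return idle


sync_idle = simulate(total_steps=100, chunk_size=10, latency=3, threshold=0)
async_idle = simulate(total_steps=100, chunk_size=10, latency=3, threshold=3)
print(sync_idle, async_idle)
```

In this toy setting, requesting the next chunk while a few actions remain hides the inference delay after the initial warm-up, whereas the synchronous variant stalls once per chunk. The sketch does not model how a real system would merge a newly received chunk with actions still in the queue.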
SmolVLA processes three input types:[1]
SmolVLA was pretrained exclusively on open, community-contributed datasets from the LeRobot ecosystem. The paper reports roughly 481 datasets totaling about 22,900 episodes and 10.6 million frames, while the Hugging Face blog cites "487 community datasets" and "10 million frames"; both sources describe the corpus as "fewer than 30k episodes."[1][2] The authors note that this is far less data than state-of-the-art VLAs such as OpenVLA use.[1] The training data consists of:[2]
The datasets were curated and cleaned by the community: a custom filtering tool surfaced low-quality data, and noisy task labels were standardized automatically using a vision-language model (Qwen2.5-VL) to rewrite instructions into concise, action-verb phrases.[2]
The training methodology follows a two-stage approach inspired by large language models:[1][2]
Key training facts:[3]
Despite its size, SmolVLA matches or beats much larger VLAs on standard simulation benchmarks and on real low-cost robots.[1]
On the LIBERO benchmark, the 0.45B SmolVLA reaches an average success rate of 87.3%, outperforming Octo (0.09B, 75.1%), OpenVLA (7B, 76.5%), and the robotics-pretrained pi0 (3.3B, 86.0%), as well as a Diffusion Policy baseline (72.4%).[1]
| Model | Parameters | LIBERO avg. success |
|---|---|---|
| SmolVLA | 0.45B | 87.3% |
| pi0 | 3.3B | 86.0% |
| OpenVLA | 7B | 76.5% |
| Octo | 0.09B | 75.1% |
| Diffusion Policy | (baseline) | 72.4% |
On Meta-World, the 0.45B SmolVLA reaches an average success rate of 57.3%, far above a Diffusion Policy baseline at 10.5% and TinyVLA at 31.6%, and ahead of pi0 (50.5%) in the same no-pretraining setting.[1]
On real, low-cost SO-100 and SO-101 arms, SmolVLA outperforms strong baselines despite being multitask:[1]
| Model | Parameters | Real-world avg. success (SO-100 / SO-101) |
|---|---|---|
| SmolVLA | 0.45B | 78.3% |
| pi0 (multitask) | 3.5B | 61.7% |
| ACT (single-task) | 0.08B | 48.3% |
On a generalization test with the SO-101 "pick-place-Lego" task, SmolVLA reached 90% success in-distribution and 50% out-of-distribution, versus 70% and 40% respectively for ACT.[1]
Pretraining on community data is decisive. On multitask SO-100 manipulation, success rose from 51.7% without pretraining to 78.3% with community pretraining, a gain of about 26.6 percentage points.[1]
The asynchronous inference stack improves real-world responsiveness:[2]
| Specification | Value |
|---|---|
| Total parameters | ~450 million (about 10x smaller than typical VLAs)[1] |
| VLM backbone | SmolVLM-2 (~500M)[1] |
| Action expert | ~100 million parameters[1] |
| VLM layers used | First N = L/2 (half the language-decoder layers)[1] |
| Visual tokens per frame | 64 (pixel-shuffle reduction)[1] |
| Action output | Continuous action chunks via flow matching[1] |
| Training data | ~481 to 487 community datasets, ~10.6M frames, <30k episodes[1][2] |
| License | Apache-2.0 (open code and weights)[3] |
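The "64 visual tokens per frame (pixel-shuffle reduction)" entry can be made concrete with a space-to-depth rearrangement, which trades spatial resolution for channel width. The sketch below is illustrative only: the 32×32 patch grid, the 8-dimensional embeddings, and the shuffle ratio of 4 are assumptions chosen so that the result is 64 tokens, and `pixel_shuffle_tokens` is a hypothetical helper, not a function from the SmolVLA codebase.

```python
import numpy as np


def pixel_shuffle_tokens(features, ratio):
    """Merge each ratio x ratio block of patch tokens into one wider token.

    features: (H, W, C) grid of patch embeddings.
    Returns an array of shape ((H / ratio) * (W / ratio), C * ratio**2).
    """
    h, w, c = features.shape
    if h % ratio or w % ratio:
        raise ValueError("grid size must be divisible by the shuffle ratio")
    # Split both spatial axes into (block index, position inside block).
    x = features.reshape(h // ratio, ratio, w // ratio, ratio, c)
    # Bring the two block indices to the front so each block is contiguous.
    x = x.transpose(0, 2, 1, 3, 4)
    # Flatten every block into a single token with a wider channel dimension.
    return x.reshape((h // ratio) * (w // ratio), ratio * ratio * c)


grid = np.arange(32 * 32 * 8, dtype=np.float32).reshape(32, 32, 8)  # assumed sizes
tokens = pixel_shuffle_tokens(grid, ratio=4)
print(grid.shape[0] * grid.shape[1], "->", tokens.shape[0], "tokens of width", tokens.shape[1])
```

No values are discarded: the 1,024 patch embeddings are regrouped into 64 tokens that are each 16 times wider, which shortens the sequence the language model must attend over.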
SmolVLA is fully integrated with the LeRobot framework, so the same library is used to download a dataset, fine-tune lerobot/smolvla_base, and run inference. A minimal fine-tuning command looks like:
```bash
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/your_dataset \
  --batch_size=64 \
  --steps=20000
```
SmolVLA targets low-cost manipulation tasks and edge deployment.[1][2]
| Model | Parameters | Training data | Hardware | Open weights |
|---|---|---|---|---|
| SmolVLA | ~0.45B | Community LeRobot datasets | Consumer GPU / CPU / MacBook | Yes |
| OpenVLA | 7B | Open X-Embodiment | High-end GPU | Yes |
| RT-2 | 55B | Proprietary (Google) | Enterprise GPU cluster | No |
| pi0 | 3.3B-3.5B | Mixed proprietary and open | High-end GPU | Partial |
| ACT | ~0.08B | Task-specific demonstrations | Mid-range GPU | Yes |
SmolVLA's distinguishing claim is that it reaches the performance band of models 10 times its size while being trainable and deployable on hardware most people already own.[1]
SmolVLA was widely covered as a step toward democratizing robot learning. Reporting emphasized that the model is "so efficient it can run on a MacBook," highlighting how it lowers the barrier to entry for robotics research and hobbyist use.[4] Commentators noted three themes:[2][4]
The authors acknowledge several limitations:[1]
The SmolVLA team and community point to several next steps:[1]
Imagine a robot arm that can see with a camera, listen to a simple instruction like "put the red block in the box," and then actually move to do it. Most robot brains that smart are huge and need expensive computers. SmolVLA is a much smaller robot brain made by Hugging Face. It is small enough to run on a regular laptop, even a MacBook, but it still does the job almost as well as the big ones. It learned by watching lots of videos that people around the world shared for free, instead of secret company data, so anyone can download it and teach their own cheap robot arm new tricks.