SmolVLA (Small Vision-Language-Action) is a compact, open-source vision-language-action (VLA) model for robotics developed by Hugging Face in collaboration with Google DeepMind. Released in June 2025, SmolVLA represents a significant step toward making advanced robotic control accessible to researchers and practitioners with limited computational resources.[1] The name reflects the model's compact size: "smol" is a playful internet spelling of "small".[2]
Unlike existing VLAs, which typically require billions of parameters and extensive computational resources, SmolVLA achieves competitive performance with only 450 million parameters, up to two orders of magnitude smaller than contemporary VLAs, and can run on consumer-grade hardware: CPUs, single GPUs, and even a MacBook or Raspberry Pi-class edge device.[3][4]
SmolVLA is designed to address key challenges in robotic learning: the high computational costs of existing VLA models, limited access to training data, and the difficulty of deploying models on affordable hardware.[5] The model democratizes robotics by enabling advanced vision-language-action capabilities on consumer-grade hardware, making it suitable for educational projects, research, and small-scale automation.[6]
The model achieves its efficiency through several architectural and systems innovations, described below.
SmolVLA follows the same three-modality paradigm as larger VLAs (vision, language, and action) but emphasizes efficiency and accessibility over scale. Its open-source nature and reliance on community-driven data foster collaboration, potentially accelerating innovation in robotics.[8]
SmolVLA emerged from Hugging Face's broader initiative to democratize robotics through open-source tools and models. The project builds upon the company's LeRobot ecosystem, launched in 2024, which provides a collection of robotics-focused models, datasets, and tools.[4] The development was motivated by the observation that existing VLA models, while powerful, remained inaccessible to most researchers due to their computational requirements and reliance on proprietary datasets.[3]
The development represents a shift in robotics foundation models toward more open, efficient, and reproducible systems. By leveraging community-contributed data and affordable hardware, SmolVLA lowers the barrier to entry for robotics research and encourages broader participation.[9]
SmolVLA was developed by a team of researchers at Hugging Face and collaborating institutions, including Google DeepMind. The primary authors include:[1]
| Date | Milestone | Significance |
|---|---|---|
| June 2, 2025 | Initial arXiv paper released[1] | First public disclosure of the model |
| June 3, 2025 | Official blog post and model release on Hugging Face[3] | Model weights and code made publicly available |
| June 4, 2025 | Media coverage highlighting the model's efficiency[4] | Widespread recognition of accessibility features |
| June 10, 2025 | Community adoption milestone | Over 1,000 downloads and first community contributions |
| June 13, 2025 | Asynchronous inference stack released | Performance improvements made available |
SmolVLA's architecture consists of two main components that work together to process visual inputs, language instructions, and generate robot actions:[7]
The perception module is based on SmolVLM-2, an efficient vision-language model optimized for multi-image and video inputs. It comprises a SigLIP visual encoder paired with a compact language decoder based on SmolLM2.[12]
The action expert is a specialized transformer module (~100M parameters) that generates chunks of continuous robot actions.[3]
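The action expert produces continuous actions by iteratively integrating a learned velocity field (a flow-matching-style sampler). The sketch below illustrates only the sampling loop with a hand-made velocity function standing in for the trained expert; `ACTION_DIM`, `STEPS`, and the toy field are illustrative assumptions, not SmolVLA's actual configuration.

```python
# Hedged sketch of flow-matching-style action sampling: integrate a
# velocity field v(a, t) from Gaussian noise (t = 0) to an action chunk
# (t = 1). A trained action expert would predict v from observations;
# this toy field simply points from the current sample toward `target`.
import random

random.seed(0)
ACTION_DIM = 6   # e.g. joint targets for a 6-DoF arm (assumed)
STEPS = 10       # number of Euler integration steps (assumed)

def velocity(a, t, target):
    # Toy stand-in for the learned velocity field.
    return [(g - x) / max(1.0 - t, 1.0 / STEPS) for x, g in zip(a, target)]

def sample_action(target):
    a = [random.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]  # noise at t = 0
    for k in range(STEPS):
        t = k / STEPS
        v = velocity(a, t, target)
        a = [x + (1.0 / STEPS) * vi for x, vi in zip(a, v)]  # Euler step
    return a

goal = [0.1 * i for i in range(ACTION_DIM)]
action = sample_action(goal)
err = max(abs(x - g) for x, g in zip(action, goal))  # sample lands near goal
```

With this particular field the integrated sample converges to `goal`; the real expert instead learns the field from demonstration data.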
A key innovation is SmolVLA's asynchronous inference system that introduces a RobotClient ↔ PolicyServer schema:[14]
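The schema can be illustrated with a toy queue-based simulation. The class names mirror the RobotClient ↔ PolicyServer terminology above, but everything else (chunk sizes, the message protocol, the fabricated actions) is an assumption for illustration, not LeRobot's actual implementation.

```python
# Toy sketch of SmolVLA-style asynchronous inference: a policy server
# streams chunks of actions while the robot client executes them,
# decoupling action execution from model inference.
import queue
import threading

CHUNK_SIZE = 4   # actions returned per policy call (hypothetical value)
NUM_CHUNKS = 3

class PolicyServer:
    """Consumes observations and streams back action chunks."""
    def serve(self, obs_q: queue.Queue, act_q: queue.Queue) -> None:
        while True:
            obs = obs_q.get()
            if obs is None:          # shutdown signal
                break
            # A real server would run VLA inference here; we fabricate a chunk.
            act_q.put([f"action_{obs}_{i}" for i in range(CHUNK_SIZE)])

class RobotClient:
    """Requests the next chunk while executing the current one."""
    def run(self, obs_q: queue.Queue, act_q: queue.Queue) -> list:
        executed = []
        for step in range(NUM_CHUNKS):
            obs_q.put(step)          # send observation to the server
            chunk = act_q.get()      # in real use this overlaps execution
            executed.extend(chunk)   # "execute" each action in the chunk
        obs_q.put(None)
        return executed

obs_q, act_q = queue.Queue(), queue.Queue()
server = threading.Thread(target=PolicyServer().serve, args=(obs_q, act_q))
server.start()
executed = RobotClient().run(obs_q, act_q)
server.join()
print(len(executed))  # 12 actions executed across 3 chunks
```

In the real stack the client keeps executing the current chunk while the server computes the next one, which is what hides inference latency.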
SmolVLA processes three types of inputs: multi-view RGB camera images, the robot's proprioceptive (sensorimotor) state, and a natural-language task instruction.[7]
SmolVLA was trained exclusively on community-contributed datasets from the LeRobot ecosystem, totaling approximately 10 million frames across ~30,000 episodes. The training data consists of:[3]
| Dataset Family | Episodes | Frames (M) | Robot Types | Notes |
|---|---|---|---|---|
| SO-100 multi-task | ~20,000 | ~7.2 | SO-100 arm | Primary training data |
| SO-101 OOD test | ~3,000 | ~1.1 | SO-101 arm | Out-of-distribution testing |
| LeKiwi mobile base | ~2,000 | ~0.7 | Mobile manipulator | Navigation + manipulation |
| Misc. hobby datasets | ~5,000 | ~1.6 | Various DIY rigs | Community contributions |
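The table's totals are consistent with the figures quoted above, which a quick arithmetic check confirms (frame counts are approximate):

```python
# Sum the per-family episode and frame counts from the table above.
episodes = [20_000, 3_000, 2_000, 5_000]   # SO-100, SO-101, LeKiwi, misc.
frames_m = [7.2, 1.1, 0.7, 1.6]            # in millions of frames
print(sum(episodes), round(sum(frames_m), 1))  # 30000 episodes, 10.6M frames
```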
The datasets were curated using a custom filtering tool created by Alexandre Chapin and Ville Kuosmanen, with manual review by Marina Barannikov. Automatic instruction rewriting with Qwen2.5-VL-3B-Instruct standardized noisy labels to a maximum of 30 characters with action verbs.[3]
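The standardization rule described above (short labels that start with an action verb) can be sketched as a simple validator. The verb list and checks here are illustrative only; the actual pipeline used a Qwen2.5-VL-3B-Instruct prompt, not this function.

```python
# Hedged sketch of the instruction-standardization rule: each rewritten
# label should start with an action verb and fit in 30 characters.
ACTION_VERBS = {"pick", "place", "stack", "sort", "push", "open", "close"}
MAX_LEN = 30

def is_standard(instruction: str) -> bool:
    """Return True if the label satisfies the length and verb constraints."""
    words = instruction.lower().split()
    return (
        0 < len(instruction) <= MAX_LEN
        and bool(words)
        and words[0] in ACTION_VERBS
    )

print(is_standard("Pick up the red cube"))            # True
print(is_standard("the robot should maybe grab it"))  # False: no action verb
```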
Camera views were standardized as follows:
| Camera View | Description |
|---|---|
| OBS_IMAGE_1 | Top-down view |
| OBS_IMAGE_2 | Wrist-mounted camera |
| OBS_IMAGE_3+ | Additional views |
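Mapping heterogeneous community camera names onto these standardized keys can be sketched as a simple remapping step. The alias names (`overhead`, `wrist`, etc.) are hypothetical examples, not the actual dataset field names.

```python
# Hedged sketch of standardizing community camera names to the
# OBS_IMAGE_* convention in the table above.
VIEW_ALIASES = {
    "top": "OBS_IMAGE_1",
    "overhead": "OBS_IMAGE_1",   # top-down view
    "wrist": "OBS_IMAGE_2",      # wrist-mounted camera
    "gripper_cam": "OBS_IMAGE_2",
}

def standardize_views(frame: dict) -> dict:
    """Rename known views; assign remaining views to OBS_IMAGE_3, 4, ..."""
    out, extra = {}, 3
    for name, image in sorted(frame.items()):
        key = VIEW_ALIASES.get(name)
        if key is None:
            key = f"OBS_IMAGE_{extra}"
            extra += 1
        out[key] = image
    return out

frame = {"overhead": "img_a", "wrist": "img_b", "side": "img_c"}
result = standardize_views(frame)
```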
The training methodology follows a two-phase approach inspired by large language models: general pretraining on community datasets followed by task-specific fine-tuning.[3]
Key training specifications:[7]
SmolVLA demonstrates strong performance on established robotics benchmarks despite its compact size:[7]
| Benchmark | SmolVLA (0.45B) | π₀ (3.3B) | OpenVLA (7B) | Diffusion Policy | ACT |
|---|---|---|---|---|---|
| LIBERO-40 | 87.3% | ~85% | Lower | Lower | Lower |
| Meta-World MT50 | Outperforms | - | Lower | Lower | - |
| Average Success Rate | 82.5% | 80.2% | 78.9% | 75.3% | 76.8% |
On real-world robotic platforms, SmolVLA achieves:[5]
| Platform | Task | Success Rate | Notes |
|---|---|---|---|
| SO100 | Pick-Place | 78.3% (avg) | Trained on this platform |
| SO100 | Stacking | n/a | In-distribution performance |
| SO100 | Sorting | n/a | With object variations |
| SO101 | Pick-Place | 76.5% | Zero-shot generalization |
| SO101 | Complex manipulation | 72.1% | Out-of-distribution |
The effectiveness of community dataset pretraining is demonstrated by:[3]
The asynchronous inference stack provides:[5]
| Operation | Minimum Hardware | Recommended Hardware | Performance Notes |
|---|---|---|---|
| Training | Single consumer GPU (6GB VRAM) | GPU with 12GB+ VRAM | 4 hours for 20k steps on A100 |
| Inference | CPU (modern laptop) | Consumer GPU | Real-time on MacBook Pro |
| Fine-tuning | RTX 3080Ti (12GB) | A100 GPU | Batch size adjustable |
| Edge Deployment | Raspberry Pi 4 | Jetson Nano | For privacy-sensitive installations |
SmolVLA is fully integrated with the LeRobot framework:[18]
```shell
# Example training command
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/your_dataset \
  --batch_size=64 \
  --steps=20000
```
SmolVLA has been successfully deployed for various robotic manipulation tasks and environments:[3]
| Model | Parameters | Training Data | Hardware Requirements | Open Source | Real-time Capable |
|---|---|---|---|---|---|
| SmolVLA | 450M | Community datasets | Consumer GPU/CPU | ✓ | ✓ |
| OpenVLA | 7B | OXE dataset | High-end GPU | ✓ | ✗ |
| RT-2 | 55B | Proprietary | Enterprise GPU cluster | ✗ | ✗ |
| π0 | 3.3B | Mixed proprietary/open | High-end GPU | Partial | ✗ |
| ACT | 1B | Task-specific | Mid-range GPU | ✓ | Partial |
SmolVLA has been cited as a significant advancement in democratizing robotic learning. Researchers have noted its importance in:[20]
The model has seen rapid adoption in:[21]
The robotics community has responded positively, with researchers describing it as potentially a "BERT moment for robotics".[4] Key community contributions include:
Despite its achievements, SmolVLA has several acknowledged limitations:[16]
The SmolVLA team and community have identified several areas for future development:[3]