SmolVLA
Last reviewed
Jun 1, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v3 · 2,515 words
SmolVLA (Small Vision-Language-Action) is a compact, open-source vision-language-action model (VLA) for robotics developed by Hugging Face in collaboration with Google DeepMind. Released in June 2025, it aims to make advanced robotic control accessible to researchers and practitioners with limited computational resources.[1] The "Smol" in the name is a playful spelling of "small" and refers to the model's compact size.[2]
Unlike most existing VLAs, which have billions of parameters and need extensive computational resources, SmolVLA achieves competitive performance with only 450 million parameters, roughly 7 to 15 times fewer than the 3.3B to 7B models it is usually compared against. It can run on consumer-grade hardware, including CPUs, single GPUs, and devices such as a MacBook or Raspberry Pi-class edge computers.[3][4]
SmolVLA is designed to address key challenges in robotic learning: the high computational costs of existing VLA models, limited access to training data, and the difficulty of deploying models on affordable hardware.[5] The model democratizes robotics by enabling advanced vision-language-action capabilities on consumer-grade hardware, making it suitable for educational projects, research, and small-scale automation.[6]
The model achieves its efficiency through several innovations:
Efficient Architecture: Utilizes only the first half of the vision-language model's layers, reducing computational cost by approximately 50% and latency by ~40%[7]
Community-Driven Training: Trained exclusively on open-source, community-contributed datasets under the LeRobot tag[3]
Asynchronous Inference: Decouples perception and action prediction from execution, enabling 30% faster response times and 2× task throughput[5]
Hardware Accessibility: Can be trained on a single GPU and deployed on consumer-grade hardware, including edge devices for privacy-sensitive installations[4]
SmolVLA follows the same three-modality paradigm as larger VLAs (vision, language, and action) but emphasizes efficiency and accessibility over scale. Its open-source nature and reliance on community-driven data foster collaboration, potentially accelerating innovation in robotics.[8]
SmolVLA emerged from Hugging Face's broader initiative to democratize robotics through open-source tools and models. The project builds upon the company's LeRobot ecosystem, launched in 2024, which provides a collection of robotics-focused models, datasets, and tools.[4] The development was motivated by the observation that existing VLA models, while powerful, remained inaccessible to most researchers due to their computational requirements and reliance on proprietary datasets.[3]
The development represents a shift in robotics foundation models toward more open, efficient, and reproducible systems. By leveraging community-contributed data and affordable hardware, SmolVLA lowers the barrier to entry for robotics research and encourages broader participation.[9]
SmolVLA was developed by a team of researchers at Hugging Face and collaborating institutions, including Google DeepMind. The primary authors include:[1]
Mustafa Shukor - PhD student at Sorbonne University[10]
Dana Aubakirova - M2 student in MVA at ENS Paris-Saclay[11]
Francesco Capuano
Pepijn Kooijmans
Steven Palma
Adil Zouitine
Michel Aractingi
Caroline Pascal
Martino Russi
Andres Marafioti
Simon Alibert
Matthieu Cord
Thomas Wolf - Co-founder of Hugging Face
Remi Cadene - Research Scientist at Hugging Face
| Date | Milestone | Significance |
|---|---|---|
| June 2, 2025 | Initial arXiv paper released[1] | First public disclosure of the model |
| June 3, 2025 | Official blog post and model release on Hugging Face[3] | Model weights and code made publicly available |
| June 4, 2025 | Media coverage highlighting the model's efficiency[4] | Widespread recognition of accessibility features |
| June 10, 2025 | Community adoption milestone | Over 1,000 downloads and first community contributions |
| June 13, 2025 | Asynchronous inference stack released | Performance improvements made available |
SmolVLA's architecture consists of two main components that work together to process visual inputs, language instructions, and generate robot actions:[7]
The perception module is based on SmolVLM-2, an efficient vision-language model optimized for multi-image and video inputs. It comprises a SigLIP visual encoder and a compact language decoder based on SmolLM2.[12] Key features include:
Vision Encoder: Uses SigLIP for robust visual feature encoding
Language Decoder: Employs SmolLM2, a compact language model
Token Efficiency: Limits visual tokens to 64 per frame through pixel-shuffle token reduction techniques
Layer Pruning: Uses only the first N layers (N=L/2, where L is the total number of layers) of the VLM's language decoder, reducing latency by approximately 40%[13]
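The token and layer figures above can be checked with a little arithmetic. The sketch below is illustrative only: the 16-pixel patch size and the pixel-shuffle factor of 4 are assumptions (they are one configuration that turns a 512×512 image into 64 tokens), the function names are hypothetical, and the 32-element list merely stands in for a decoder stack.

```python
def visual_tokens(image_size: int, patch_size: int, shuffle_factor: int) -> int:
    """Tokens left after patchifying a square image and applying pixel shuffle.

    Pixel shuffle folds each shuffle_factor x shuffle_factor block of patch
    embeddings into one token, so the count drops by shuffle_factor ** 2.
    """
    patches_per_side = image_size // patch_size
    if patches_per_side % shuffle_factor != 0:
        raise ValueError("patch grid must be divisible by the shuffle factor")
    tokens_per_side = patches_per_side // shuffle_factor
    return tokens_per_side * tokens_per_side


def pruned_layers(layers: list, keep_fraction: float = 0.5) -> list:
    """Keep only the first N = L * keep_fraction layers of a decoder stack."""
    n_keep = int(len(layers) * keep_fraction)
    return layers[:n_keep]


# Assumed values: 16-pixel patches and a shuffle factor of 4.
tokens = visual_tokens(image_size=512, patch_size=16, shuffle_factor=4)

# Stand-in for a 32-layer decoder; real layers would be transformer blocks.
decoder = [f"layer_{i}" for i in range(32)]
kept = pruned_layers(decoder)

print(tokens)     # 64
print(len(kept))  # 16
```

Under these assumptions a 32×32 patch grid shrinks to 8×8 tokens, and halving the stack keeps layers 0 through 15.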
The action expert is a specialized transformer module (~100M parameters) that generates continuous robot actions:[3]
Architecture: Alternates between self-attention and cross-attention blocks with causal masking
Training Method: Uses a flow matching objective, which trains the network to predict a vector field that moves noisy action samples toward the ground-truth actions
Action Generation: Produces "action chunks" - sequences of future robot actions (default 50 timesteps)
Temporal Consistency: Applies causal masking to ensure temporal coherence and improve smoothness
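The flow matching idea behind the action expert can be sketched in a few lines. This is a generic linear-path formulation written for illustration, not SmolVLA's exact parameterization: the interpolation convention, the one-dimensional actions, the Euler sampler, and the "oracle" stand-in for a trained network are all assumptions.

```python
import random


def flow_matching_pair(actions, noise, t):
    """Build one training example for a linear-path flow matching objective.

    Returns the interpolated sample x_t and the target velocity that a
    network would be trained to predict at (x_t, t).
    """
    x_t = [(1.0 - t) * n + t * a for a, n in zip(actions, noise)]
    target_velocity = [a - n for a, n in zip(actions, noise)]
    return x_t, target_velocity


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 (noise) to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
chunk = [0.1 * k for k in range(50)]        # a flattened 50-step action chunk
noise = [rng.gauss(0.0, 1.0) for _ in chunk]

x_t, target = flow_matching_pair(chunk, noise, t=0.3)


def oracle(x, t):
    """Stand-in for a perfectly trained network: returns the true velocity."""
    return target


loss = mse(oracle(x_t, 0.3), target)        # zero for the oracle
recovered = euler_sample(oracle, noise)     # ends at the clean action chunk
```

In training, the oracle would be replaced by the action expert's prediction and the loss minimized over many sampled noise vectors and time values; at inference, integrating the learned velocity field turns a noise sample into an action chunk.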
A key innovation is SmolVLA's asynchronous inference system that introduces a RobotClient ↔ PolicyServer schema:[14]
The robot executes the current action chunk while the server predicts the next
Fills an action queue until a guard band threshold is reached
Enables low-latency control suitable for real-time applications
Makes the system more adaptable and capable of faster recovery from errors
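The queueing behaviour described above can be illustrated with a toy simulation. It is a simplified sketch, not the LeRobot client or server code: the function name, the tick-based timing, and the rule of simply appending each new chunk to the queue are assumptions made for the example.

```python
from collections import deque


def run_control_loop(total_ticks, chunk_size, latency_ticks, threshold):
    """Simulate a robot consuming action chunks from a remote policy.

    threshold is the queue length at or below which a new chunk is
    requested. threshold=0 mimics synchronous inference: the robot asks
    for more actions only once the queue is empty, then idles while waiting.
    Returns (executed_actions, idle_ticks).
    """
    queue = deque(range(chunk_size))   # start with one chunk already queued
    arrival_tick = None                # tick at which a pending chunk lands
    executed = idle = 0

    for tick in range(total_ticks):
        # Deliver a pending chunk once its simulated latency has elapsed.
        if arrival_tick is not None and tick >= arrival_tick:
            queue.extend(range(chunk_size))
            arrival_tick = None

        # Request the next chunk when the queue runs low.
        if arrival_tick is None and len(queue) <= threshold:
            arrival_tick = tick + latency_ticks

        if queue:
            queue.popleft()
            executed += 1
        else:
            idle += 1

    return executed, idle


sync_result = run_control_loop(300, chunk_size=50, latency_ticks=10, threshold=0)
async_result = run_control_loop(300, chunk_size=50, latency_ticks=10, threshold=15)

print(sync_result)   # (250, 50)
print(async_result)  # (300, 0)
```

With these made-up numbers the synchronous loop spends 50 of 300 ticks waiting for predictions, while requesting the next chunk early keeps the robot moving on every tick. The real gains depend on network latency, chunk size, and how overlapping chunks are merged.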
SmolVLA processes three types of inputs:[7]
Multiple RGB Images: Up to four frames from different camera views (resized to 512×512 pixels, global view only without tiling)
Language Instructions: Natural language task descriptions tokenized into text tokens
Sensorimotor States: The robot's current state, projected into a single token via a linear layer
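As a rough illustration of how these inputs add up, the sketch below counts the tokens passed to the VLM for one observation, using the 64 visual tokens per frame and the single state token mentioned in this article. The function name and the example instruction length are hypothetical.

```python
def prefix_length(num_cameras: int, text_tokens: int,
                  tokens_per_frame: int = 64, state_tokens: int = 1) -> int:
    """Count the tokens fed to the VLM for one observation."""
    if not 1 <= num_cameras <= 4:
        raise ValueError("this sketch assumes between one and four camera views")
    return num_cameras * tokens_per_frame + text_tokens + state_tokens


# Two cameras and a 12-token instruction (the instruction length is made up).
print(prefix_length(num_cameras=2, text_tokens=12))  # 141
```

Because each extra camera adds a fixed 64 tokens, the sequence stays short even with four views.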
SmolVLA was trained exclusively on community-contributed datasets from the LeRobot ecosystem, totaling approximately 10 million frames across ~30,000 episodes. The training data consists of:[3]
487 high-quality datasets focused on the SO100 robot platform
Diverse task coverage: Including pick-and-place, stacking, sorting, and manipulation tasks
Natural diversity: Varied lighting conditions, suboptimal demonstrations, and heterogeneous control schemes
Multiple environments: Data collected in homes, maker spaces, and research labs
| Dataset Family | Episodes | Frames (M) | Robot Types | Notes |
|---|---|---|---|---|
| SO-100 multi-task | ~20,000 | ~7.2 | SO-100 arm | Primary training data |
| SO-101 OOD test | ~3,000 | ~1.1 | SO-101 arm | Out-of-distribution testing |
| LeKiwi mobile base | ~2,000 | ~0.7 | Mobile manipulator | Navigation + manipulation |
| Misc. hobby datasets | ~5,000 | ~1.6 | Various DIY rigs | Community contributions |
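The rows of the table can be summed to check them against the stated totals of roughly 30,000 episodes and 10 million frames. The snippet only re-adds the approximate figures listed above; the variable names are arbitrary.

```python
# (episodes, frames in millions) copied from the table; all values approximate.
families = {
    "SO-100 multi-task": (20_000, 7.2),
    "SO-101 OOD test": (3_000, 1.1),
    "LeKiwi mobile base": (2_000, 0.7),
    "Misc. hobby datasets": (5_000, 1.6),
}

total_episodes = sum(episodes for episodes, _ in families.values())
total_frames_m = sum(frames for _, frames in families.values())
frames_per_episode = total_frames_m * 1e6 / total_episodes
so100_share = families["SO-100 multi-task"][1] / total_frames_m

print(total_episodes)                # 30000
print(round(total_frames_m, 1))      # 10.6
print(round(frames_per_episode))     # 353
print(round(so100_share * 100))      # 68
```

The rows add up to about 10.6 million frames, consistent with the rounded "approximately 10 million" figure, and the SO-100 data accounts for about two thirds of all frames.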
The datasets were curated using a custom filtering tool created by Alexandre Chapin and Ville Kuosmanen, with manual review by Marina Barannikov. Noisy task labels were rewritten automatically with Qwen2.5-VL-3B-Instruct into short instructions of at most 30 characters that start with an action verb.[3]
Camera views were standardized as follows:
| Camera View | Description |
|---|---|
| OBS_IMAGE_1 | Top-down view |
| OBS_IMAGE_2 | Wrist-mounted camera |
| OBS_IMAGE_3+ | Additional views |
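A mapping like the one in the table could be applied with a small helper. The sketch below is hypothetical: the function, the raw camera names, and the rule of numbering extra views in sorted order are assumptions for illustration, not the tooling actually used for the datasets.

```python
def standardize_cameras(frames: dict, top_key: str, wrist_key: str) -> dict:
    """Rename dataset-specific camera names to the standardized view keys.

    frames maps a dataset's own camera names to image data. The caller says
    which name is the top-down view and which is the wrist camera; any
    remaining cameras become OBS_IMAGE_3, OBS_IMAGE_4, ... in sorted order.
    """
    renamed = {
        "OBS_IMAGE_1": frames[top_key],
        "OBS_IMAGE_2": frames[wrist_key],
    }
    extras = sorted(k for k in frames if k not in (top_key, wrist_key))
    for index, key in enumerate(extras, start=3):
        renamed[f"OBS_IMAGE_{index}"] = frames[key]
    return renamed


# Made-up camera names; strings stand in for image arrays.
raw = {"laptop": "img_a", "phone": "img_b", "gripper_cam": "img_c"}
standardized = standardize_cameras(raw, top_key="laptop", wrist_key="gripper_cam")
print(standardized)
# {'OBS_IMAGE_1': 'img_a', 'OBS_IMAGE_2': 'img_c', 'OBS_IMAGE_3': 'img_b'}
```

Fixing the order of views this way means a model always sees the top-down image first and the wrist image second, whatever each contributor called their cameras.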
The training methodology follows a two-phase approach inspired by large language models:[3]
Pretraining Phase: 200,000 steps on general manipulation data from community datasets
Task-Specific Post-Training: 100,000-200,000 steps of fine-tuning on specific tasks
Key training specifications:[7]
Can be trained on a single consumer GPU (for example RTX 3080Ti with 12GB VRAM)
Batch size: 44 (adjustable based on available VRAM, 16 for 6GB GPUs)
Training time: Approximately 4 hours for 20,000 steps on a single A100 GPU[15]
Memory usage: ~11.53 GB GPU memory during training
Loss convergence: From 1.198 to 0.004 over 200,000 steps
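The quoted training time implies a per-step cost that can be used for rough planning. The calculation below assumes the same hardware and batch size throughout and linear scaling with step count, which real runs only approximate; the helper names are arbitrary.

```python
def seconds_per_step(hours: float, steps: int) -> float:
    """Average wall-clock seconds per optimizer step."""
    return hours * 3600 / steps


def projected_hours(steps: int, sec_per_step: float) -> float:
    """Wall-clock hours for a run of the given length at a fixed step cost."""
    return steps * sec_per_step / 3600


# 4 hours for 20,000 steps on a single A100, as quoted above.
rate = seconds_per_step(hours=4, steps=20_000)

# Extrapolation to a 200,000-step run at the same rate (an estimate only).
long_run_hours = projected_hours(200_000, rate)

print(round(rate, 2))            # 0.72
print(round(long_run_hours, 1))  # 40.0
```

At about 0.72 seconds per step, a 200,000-step run would take on the order of 40 hours on the same GPU, before accounting for data loading or evaluation overhead.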
SmolVLA demonstrates strong performance on established robotics benchmarks despite its compact size:[7]
| Benchmark | SmolVLA (0.45B) | π₀ (3.3B) | OpenVLA (7B) | Diffusion Policy | ACT |
|---|---|---|---|---|---|
| LIBERO-40 | 87.3% | ~85% | <87.3% | <87.3% | <87.3% |
| Meta-World MT50 | Outperforms | - | Lower | Lower | - |
| Average Success Rate | 82.5% | 80.2% | 78.9% | 75.3% | 76.8% |
On real-world robotic platforms, SmolVLA achieves:[5]
| Platform | Task | Success Rate | Notes |
|---|---|---|---|
| SO100 | Pick-Place | 78.3% (avg) | Trained on this platform |
| SO100 | Stacking | - | In-distribution performance |
| SO100 | Sorting | - | With object variations |
| SO101 | Pick-Place | 76.5% | Zero-shot generalization |
| SO101 | Complex manipulation | 72.1% | Out-of-distribution |
The effectiveness of community dataset pretraining is demonstrated by:[3]
Without pretraining: 51.7% success rate on SO100 tasks
With pretraining: 78.3% success rate (an absolute improvement of 26.6 percentage points)
With multitask finetuning: Further improvements in low-data regimes (up to 85% on specific tasks)
Asynchronous inference, together with the model's other efficiency measures, yields:[5]
30% reduction in average task completion time with asynchronous inference
Roughly 2× as many completed actions in a fixed time window (19 vs. 9 cubes moved)
40% reduction in inference latency through layer pruning
Average inference time: about 0.087 seconds (0.086982 s measured)
Maximum GPU memory usage: 908.43 MB during inference
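These figures can be turned into rates with simple arithmetic. The sketch assumes that the quoted average inference time is the time to predict one full action chunk and that a chunk holds 50 actions; both are assumptions, and the result is an upper bound that ignores network and execution time.

```python
inference_seconds = 0.086982   # average inference time quoted above
chunk_size = 50                # assumed number of actions per predicted chunk

chunks_per_second = 1 / inference_seconds
actions_per_second = chunks_per_second * chunk_size

# Cubes moved in the fixed-time comparison, asynchronous vs. synchronous.
throughput_ratio = 19 / 9

print(round(chunks_per_second, 1))  # 11.5
print(round(throughput_ratio, 2))   # 2.11
```

Under these assumptions the policy can produce actions far faster than a robot arm consumes them, which is why overlapping prediction with execution, rather than raw model speed, is what removes the idle time.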
Total Parameters: 450 million (roughly an order of magnitude fewer than VLAs such as the 3.3B-parameter π0 and the 7B-parameter OpenVLA)[3]
Action Expert Parameters: ~100 million[7]
VLM Layers Used: First 16 layers (out of 32 total)[16]
Visual Tokens per Frame: 64[7]
Action Chunk Size: Configurable (typically 50 timesteps for 1 second)[7]
License: Apache-2.0 (code & model weights)[17]
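The parameter count gives a quick lower bound on memory. The estimate below covers weights only and assumes 16-bit or 32-bit storage; the precision used in practice is not stated in this article, and activations, buffers, and framework overhead come on top.

```python
def weight_memory_mb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


half_precision_mb = weight_memory_mb(450e6, bytes_per_param=2)
full_precision_mb = weight_memory_mb(450e6, bytes_per_param=4)

print(half_precision_mb)  # 900.0
print(full_precision_mb)  # 1800.0
```

Roughly 0.9 GB at 16-bit precision is in the same range as the sub-1 GB inference footprint reported for the model, which is what makes laptop-class and small-GPU deployment plausible.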
| Operation | Minimum Hardware | Recommended Hardware | Performance Notes |
|---|---|---|---|
| Training | Single consumer GPU (6GB VRAM) | GPU with 12GB+ VRAM | 4 hours for 20k steps on A100 |
| Inference | CPU (modern laptop) | Consumer GPU | Real-time on MacBook Pro |
| Fine-tuning | RTX 3080Ti (12GB) | A100 GPU | Batch size adjustable |
| Edge Deployment | Raspberry Pi 4 | Jetson Nano | For privacy-sensitive installations |
SmolVLA is fully integrated with the LeRobot framework:[18]
python lerobot/scripts/train.py \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=lerobot/your_dataset \
  --batch_size=64 \
  --steps=20000
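When launching several fine-tuning runs, the command can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of LeRobot; it only reuses the script path and flags shown above, which may differ between LeRobot versions, and the batch size of 16 follows the suggestion for 6 GB GPUs made earlier in this article.

```python
import shlex


def build_train_command(dataset_repo_id: str, batch_size: int = 64,
                        steps: int = 20_000) -> list:
    """Assemble the fine-tuning command shown above as an argument list."""
    if batch_size < 1 or steps < 1:
        raise ValueError("batch_size and steps must be positive")
    return [
        "python", "lerobot/scripts/train.py",
        "--policy.path=lerobot/smolvla_base",
        f"--dataset.repo_id={dataset_repo_id}",
        f"--batch_size={batch_size}",
        f"--steps={steps}",
    ]


# Smaller batch for a low-memory GPU; the dataset name is a placeholder.
cmd = build_train_command("lerobot/your_dataset", batch_size=16)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside a LeRobot checkout; building it as a list avoids shell-quoting problems with unusual dataset names.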
SmolVLA has been successfully deployed for various robotic manipulation tasks and environments:[3]
Pick-and-Place Operations: Grasping and relocating objects with various shapes and sizes
Stacking Tasks: Building stable structures with blocks
Sorting Activities: Organizing objects by category or properties
Assembly Operations: Simple construction tasks
Kitchen Tasks: Basic food preparation and cleaning
Mobile Manipulation: Combined navigation and manipulation tasks
SO100: Primary training platform
SO101: Demonstrates zero-shot generalization
Koch Arm: Community-tested implementation[4]
ALOHA-style robots: Compatible through LeRobot framework
Raspberry Pi robots: Edge deployment for education
Custom DIY platforms: Community-built robots
Education & Hobby Robotics: Runs on Raspberry Pi-class edge computers, enabling classroom demos and maker projects[4]
Research Prototyping: Quick fine-tuning with only a handful of additional demos using the LeRobot trainer[19]
Edge Deployment: Works fully offline on consumer GPUs or CPUs, important for privacy-sensitive installations[8]
Research Baseline: Serves as a reproducible small-scale reference when studying VLA design choices[1]
| Model | Parameters | Training Data | Hardware Requirements | Open Source | Real-time Capable |
|---|---|---|---|---|---|
| SmolVLA | 450M | Community datasets | Consumer GPU/CPU | ✓ | ✓ |
| OpenVLA | 7B | OXE dataset | High-end GPU | ✓ | ✗ |
| RT-2 | 55B | Proprietary | Enterprise GPU cluster | ✗ | ✗ |
| π0 | 3.3B | Mixed proprietary/open | High-end GPU | Partial | ✗ |
| ACT | ~80M | Task-specific | Mid-range GPU | ✓ | Partial |
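The size gap between SmolVLA and the larger VLAs in the table can be quantified directly. The calculation divides the parameter counts listed above for π0, OpenVLA, and RT-2 by SmolVLA's 450M; the dictionary keys are just labels chosen for this sketch.

```python
# Parameter counts in billions, taken from the comparison table.
params_billion = {"SmolVLA": 0.45, "pi0": 3.3, "OpenVLA": 7.0, "RT-2": 55.0}

ratios = {
    name: size / params_billion["SmolVLA"]
    for name, size in params_billion.items()
    if name != "SmolVLA"
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x larger than SmolVLA")
# pi0: 7.3x larger than SmolVLA
# OpenVLA: 15.6x larger than SmolVLA
# RT-2: 122.2x larger than SmolVLA
```

By these numbers SmolVLA is about one order of magnitude smaller than π0 and OpenVLA, and about two orders of magnitude smaller than RT-2.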
SmolVLA has been cited as a significant advancement in democratizing robotic learning. Researchers have noted its importance in:[20]
Lowering barriers to entry for robotics research
Demonstrating the effectiveness of community-driven datasets
Proving that compact models can achieve competitive performance
Challenging the trend of scaling up model sizes
Promoting sustainable and efficient AI development
The model has seen rapid adoption in:[21]
Educational institutions with limited budgets
Small robotics startups
Research labs in developing countries
Hobbyist and maker communities
Privacy-conscious industrial applications
The robotics community has responded positively, with researchers describing it as potentially a "BERT moment for robotics".[4] Key community contributions include:
Over 100 additional dataset contributions to the LeRobot ecosystem
Ports to various robot platforms including mobile manipulators
Performance optimizations reducing inference time by an additional 15%
Integration with popular robotics frameworks like ROS
Despite its achievements, SmolVLA has several acknowledged limitations:[16]
Dataset Diversity: Training data is predominantly from SO100 platform, limiting cross-embodiment generalization
Dataset Size: Uses far less data (about 30,000 episodes) than state-of-the-art VLAs such as OpenVLA (roughly one million trajectories)
Long-Horizon Tasks: Limited evaluation on extended task sequences beyond 1-2 minutes
VLM Backbone: Uses a general-purpose VLM not specifically pretrained for robotics
Single-Arm Focus: Primary evaluation on single-arm manipulation tasks
Complex Language Grounding: Compact size trades off complex language understanding compared to billion-parameter VLAs
The SmolVLA team and community have identified several areas for future development:[3]
Cross-Embodiment Training: Expanding to more diverse robot platforms including quadrupeds and humanoids
Scaling Studies: Investigating optimal model sizes for different applications (exploring 200M-1B parameter variants)
Joint Multimodal Training: Combining robotics data with general vision-language datasets
Real-Time Optimizations: Further improvements to inference speed targeting sub-50ms latency
Sim-to-Real Transfer: Better integration with simulation environments like Isaac Gym
Reinforcement Learning: Integration of RL fine-tuning for improved task performance
Larger Community Datasets: Goal of reaching 1 million demonstration episodes by 2026
Vision-Language-Action Model
LeRobot
OpenVLA
RT-2
π0 (Pi-Zero)