# SmolVLA

> Source: https://aiwiki.ai/wiki/smolvla
> Updated: 2026-06-25
> Categories: AI Hardware, AI Models, Artificial Intelligence, Embodied AI, Google DeepMind, Multimodal AI, Open Source AI, Robotics
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

SmolVLA (Small Vision-Language-Action) is a compact, open-source [vision-language-action model](/wiki/vision_language_action_model) ([VLA](/wiki/vla)) for [robotics](/wiki/robotics) developed by [Hugging Face](/wiki/hugging_face) and released in June 2025. At roughly 450 million parameters, it is about 10 times smaller than typical VLAs yet achieves comparable task performance, can be trained on a single GPU, and runs on consumer-grade hardware such as a consumer GPU, a CPU, or a MacBook.[1][2] It was pretrained entirely on around 10 million frames drawn from 481 to 487 open, community-contributed [LeRobot](/wiki/lerobot) datasets (the paper and the blog post give slightly different counts), making it the first widely adopted VLA built on freely shared, affordable-robot data rather than proprietary fleets.[2][3]

The paper, titled "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics," states the design goal plainly: "we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance."[1] The "Smol" in the name is a playful spelling of "small" and refers to the model's compact size.

## What is SmolVLA?

SmolVLA is a robotics foundation model that takes camera images, a natural-language instruction, and the robot's current joint state, and outputs continuous low-level [actions](/wiki/action) (motor commands) for a manipulator. It follows the same three-modality vision, language, and action paradigm as larger VLAs such as OpenVLA, RT-2, and Physical Intelligence's pi0, but it is built explicitly for affordability and accessibility rather than scale.[1]

The model was created to address three obstacles that the authors argue keep robot learning out of reach for most people: the high compute cost of billion-parameter VLAs, the reliance on closed academic and industrial datasets, and the difficulty of deploying large policies on cheap hardware.[1] As the abstract puts it, existing VLAs are "typically massive, often with billions of parameters, leading to high training costs and limited real-world deployability," and they "rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms."[1] SmolVLA is Hugging Face's answer: a model and recipe that a student, hobbyist, or small lab can fine-tune and run end to end.

The model card describes it as "a compact, efficient Vision-Language-Action (VLA) model designed for affordable and efficient robotics, trainable on a single GPU and deployable on consumer hardware."[3] Its code, weights, and training data are released openly under the LeRobot project.

## Overview

SmolVLA is designed to bring vision-language-action capabilities to consumer-grade hardware, which makes it suitable for educational projects, research, and small-scale automation.[2] Despite its compact size, the paper reports that "SmolVLA achieves performance comparable to VLAs that are 10x larger."[1]

The model achieves its efficiency through several design choices:

- **Efficient backbone**: Uses only the first half of the vision-language model's layers (the first N = L/2 layers of the language decoder), cutting compute and latency without retraining a custom encoder.[1]
- **Community-driven training**: Pretrained exclusively on open, community-contributed datasets tagged "lerobot" on the [Hugging Face](/wiki/hugging_face) Hub, with no proprietary data.[2]
- **Asynchronous [inference](/wiki/inference)**: An inference stack that decouples perception and action prediction from action execution, reported to give about 30% faster response and 2x task throughput.[2]
- **Hardware accessibility**: Trainable on a single consumer or A100 GPU and deployable on consumer GPUs, CPUs, or a MacBook for offline, privacy-sensitive use.[3]

Because the model is open source and trained on community-contributed data, others can reproduce and extend it, which may accelerate progress in robotics.[2]

## Who built SmolVLA, and when was it released?

SmolVLA was developed by a team at [Hugging Face](/wiki/hugging_face) and academic collaborators. The arXiv paper (ID 2506.01844) was submitted on June 2, 2025, and the official Hugging Face blog post, model weights, and code were published on June 3, 2025.[1][2] The model is part of Hugging Face's LeRobot ecosystem, the open robotics stack launched in 2024 that provides robotics-focused models, datasets, and tools.[2]

The listed authors are: Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf (co-founder of Hugging Face), and Remi Cadene (research scientist leading LeRobot at Hugging Face).[1] Lead author Mustafa Shukor and Dana Aubakirova are affiliated with Sorbonne University and ENS Paris-Saclay respectively, and Matthieu Cord is also at Sorbonne University; the work is a Hugging Face and academic collaboration.

| Date | Milestone |
| --- | --- |
| June 2, 2025 | SmolVLA arXiv paper (2506.01844) submitted[1] |
| June 3, 2025 | Official Hugging Face blog post, model weights, and code released[2] |
| June 4, 2025 | Press coverage of a robotics model efficient enough to "run on a MacBook"[4] |

## What is SmolVLA's architecture?

SmolVLA's architecture has two main components: a pretrained vision-language backbone for perception, and a lightweight action expert that generates robot motions.[1]

### Perception module (SmolVLM-2)

The perception module is based on SmolVLM-2, an efficient open vision-language model (~500M parameters) optimized for multi-image and video inputs. It pairs a SigLIP visual encoder with a compact language decoder built on SmolLM2.[1] Key features:

- **Vision encoder**: SigLIP for visual feature encoding.
- **Language decoder**: SmolLM2, a compact language model.
- **[Token](/wiki/token) efficiency**: Limits visual tokens to 64 per frame using pixel-shuffle token reduction (down from over 1,000), greatly cutting compute.[1]
- **Layer [pruning](/wiki/pruning)**: Uses only the first N layers (N = L/2, half the total) of the VLM's language decoder, reducing latency.[1]
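
The token reduction and layer skipping listed above can be illustrated with a minimal Python sketch. It assumes a 32 x 32 patch grid (1,024 tokens) and a shuffle ratio of 4, values chosen only because they reproduce the 64-token figure; the function names, the toy embedding size, and the 16-layer decoder are hypothetical and are not taken from the SmolVLA code.

```python
def pixel_shuffle(grid, ratio):
    """Merge each ratio x ratio block of tokens into one wider token.

    grid is a list of rows, each row a list of tokens, each token a list
    of floats. The token count shrinks by ratio**2 while the per-token
    dimension grows by the same factor (a space-to-depth rearrangement).
    """
    height, width = len(grid), len(grid[0])
    if height % ratio or width % ratio:
        raise ValueError("grid size must be divisible by the shuffle ratio")
    merged = []
    for top in range(0, height, ratio):
        row = []
        for left in range(0, width, ratio):
            token = []
            for dy in range(ratio):
                for dx in range(ratio):
                    token.extend(grid[top + dy][left + dx])
            row.append(token)
        merged.append(row)
    return merged


def keep_first_half(layers):
    """Keep only the first N = L/2 layers of a decoder stack."""
    return layers[: len(layers) // 2]


# Assumed sizes: 32 x 32 patches with a toy 2-dimensional embedding.
grid = [[[float(r), float(c)] for c in range(32)] for r in range(32)]
merged = pixel_shuffle(grid, ratio=4)

tokens_before = len(grid) * len(grid[0])
tokens_after = len(merged) * len(merged[0])
print(tokens_before, tokens_after, len(merged[0][0]))

# Hypothetical 16-layer decoder, represented by layer indices only.
print(len(keep_first_half(list(range(16)))))
```

In a real model the widened tokens would normally pass through a learned projection back to the decoder's hidden size; that step is omitted here.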

### Action expert

The action expert is a specialized [transformer](/wiki/transformer) module of about 100 million parameters that generates continuous robot actions:[1]

- **Architecture**: Alternates self-attention and cross-attention blocks with causal masking, conditioned on the VLM's hidden states.
- **Training objective**: Uses flow matching to map noisy samples back toward ground-truth actions.
- **Action chunks**: Produces sequences of future robot actions ("action chunks") rather than one step at a time.
- **Temporal consistency**: Causal masking improves smoothness and temporal coherence.
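
A minimal sketch of a flow-matching objective and sampler follows. It assumes one common convention (linear interpolation between Gaussian noise at τ = 0 and the ground-truth action at τ = 1, with target velocity "action minus noise"); the paper's exact parameterization, sign convention, and sampling of τ may differ. The action chunk is flattened to a plain list, and `predict_velocity` is a hypothetical stand-in for the action expert.

```python
def interpolate(action, noise, tau):
    """Noisy sample on the straight path from noise (tau=0) to action (tau=1)."""
    return [(1.0 - tau) * n + tau * a for a, n in zip(action, noise)]


def target_velocity(action, noise):
    """Constant velocity of that straight path."""
    return [a - n for a, n in zip(action, noise)]


def flow_matching_loss(predict_velocity, action, noise, tau):
    """Mean squared error between predicted and target velocity."""
    x_tau = interpolate(action, noise, tau)
    v_pred = predict_velocity(x_tau, tau)
    v_true = target_velocity(action, noise)
    return sum((p - t) ** 2 for p, t in zip(v_pred, v_true)) / len(action)


def sample_with_euler(predict_velocity, noise, num_steps=10):
    """Integrate the learned velocity field from noise toward an action."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        tau = step * dt
        v = predict_velocity(x, tau)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with an oracle that already knows the true action.
true_action = [0.2, -0.5, 1.0]
noise = [1.3, 0.4, -0.7]


def oracle(x_tau, tau):
    # On the straight path, (action - x_tau) / (1 - tau) equals action - noise.
    return [(a - x) / (1.0 - tau) for a, x in zip(true_action, x_tau)]


loss = flow_matching_loss(oracle, true_action, noise, tau=0.3)
recovered = sample_with_euler(oracle, noise, num_steps=10)
print(loss, recovered)
```

With a trained network in place of the oracle, the same Euler loop would turn a noise sample into an action chunk in a fixed number of steps.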

### Asynchronous inference stack

A key contribution is SmolVLA's asynchronous inference system, which introduces a RobotClient and PolicyServer schema:[2]

- The robot executes the current action chunk while the server already predicts the next one.
- The client keeps a queue of pending actions and requests a new chunk when the queue drops below a threshold, so the queue is refilled before it runs empty.
- This enables higher control rates and low-latency control suitable for real-time applications.
- It makes the system more adaptive and able to recover faster from errors.
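
The effect of overlapping prediction with execution can be sketched with a small tick-based simulation. This is an illustration under stated assumptions, not the LeRobot RobotClient or PolicyServer code: the chunk size, latency, and threshold values are made up, a new chunk is simply appended to the queue (a real system would have to reconcile overlapping chunks), and a threshold of 0 stands in for synchronous inference.

```python
from collections import deque


def simulate(chunk_size, threshold, latency_ticks, total_ticks):
    """Return how many control ticks the robot spent idle with an empty queue."""
    queue = deque()
    arrival_tick = None  # tick at which the in-flight chunk will arrive
    next_action_id = 0
    idle = 0
    for tick in range(total_ticks):
        # Deliver a finished prediction from the (simulated) policy server.
        if arrival_tick is not None and tick >= arrival_tick:
            queue.extend(range(next_action_id, next_action_id + chunk_size))
            next_action_id += chunk_size
            arrival_tick = None
        # Request a new chunk once the queue falls to the threshold fraction.
        if arrival_tick is None and len(queue) / chunk_size <= threshold:
            arrival_tick = tick + latency_ticks
        # Execute one action per tick, or wait if nothing is queued.
        if queue:
            queue.popleft()
        else:
            idle += 1
    return idle


sync_idle = simulate(chunk_size=10, threshold=0.0, latency_ticks=3, total_ticks=100)
async_idle = simulate(chunk_size=10, threshold=0.5, latency_ticks=3, total_ticks=100)
print(sync_idle, async_idle)
```

In this toy setting the synchronous robot stalls for the full prediction latency after every chunk, while the asynchronous one only waits for the very first chunk.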

### Input processing

SmolVLA processes three input types:[1]

1. **RGB images**: One or more camera views (for example a top-down view plus a wrist camera).
2. **Language instructions**: Natural-language task descriptions tokenized into text tokens.
3. **Sensorimotor state**: The robot's current joint state, projected into a token via a linear layer.
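
As a rough illustration of how the three inputs become one token sequence, the sketch below projects a joint-state vector into a single token with a linear layer and counts the resulting sequence length. Only the 64 visual tokens per frame and the single state token come from the description above; the weights, dimensions, camera count, and text length are hypothetical.

```python
VISUAL_TOKENS_PER_FRAME = 64  # figure quoted in the article


def project_state(state, weight, bias):
    """Linear layer mapping the joint state to one embedding-sized token."""
    return [
        sum(w * s for w, s in zip(row, state)) + b
        for row, b in zip(weight, bias)
    ]


def prefix_token_count(num_cameras, num_text_tokens):
    """Visual tokens for every camera, plus text tokens, plus one state token."""
    return num_cameras * VISUAL_TOKENS_PER_FRAME + num_text_tokens + 1


# Toy example: 2 joint values projected into a 3-dimensional token.
state_token = project_state(
    state=[1.0, 2.0],
    weight=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    bias=[0.0, 0.0, 0.5],
)
print(state_token)

# Assumed setup: two cameras and a 12-token instruction.
print(prefix_token_count(num_cameras=2, num_text_tokens=12))
```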

## How was SmolVLA trained?

### Datasets

SmolVLA was pretrained exclusively on open, community-contributed datasets from the LeRobot ecosystem. The paper reports roughly 481 datasets totaling about 22,900 episodes and 10.6 million frames, while the Hugging Face blog rounds this to "487 community datasets" and "10 million frames"; both sources put the total at "fewer than 30k episodes."[1][2] The authors note that this is far less data than state-of-the-art VLAs such as OpenVLA use.[1] The training data consists of:[2]

- **Community datasets** tagged "lerobot" on the Hugging Face Hub, focused largely on the affordable SO-100 / SO-101 robot arms.
- **Diverse task coverage**: pick-and-place, stacking, sorting, and other manipulation tasks.
- **Natural diversity**: varied lighting, suboptimal demonstrations, and heterogeneous control schemes.
- **Multiple environments**: data collected in homes, maker spaces, and research labs.

The datasets were curated and cleaned by the community: a custom filtering tool surfaced low-quality data, and noisy task labels were standardized automatically using a vision-language model (Qwen2.5-VL) to rewrite instructions into concise, action-verb phrases.[2]

### Training process

The training methodology follows a two-stage approach inspired by [large language models](/wiki/large_language_model):[1][2]

1. **Community pretraining**: General manipulation pretraining on the LeRobot community datasets.
2. **Task-specific post-training**: Fine-tuning on the target robot and tasks, including in low-data regimes with only a handful of demonstrations.

Key training facts:[3]

- Can be trained on a single GPU (consumer or A100 class).
- Training for 20,000 steps takes roughly 4 hours on a single A100 GPU.[3]
- Batch size and VRAM use are adjustable to the available GPU.
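
The reported training time translates into a per-step budget with simple arithmetic. The sketch below uses the 20,000 steps and 4 hours quoted above; the batch size of 64 is borrowed from the example fine-tuning command later in this article and is only an assumption about the setting behind the 4-hour figure.

```python
def training_throughput(steps, hours, batch_size):
    """Derive per-step time and total samples from a reported training run."""
    seconds = hours * 3600
    return {
        "steps_per_second": steps / seconds,
        "seconds_per_step": seconds / steps,
        "samples_seen": steps * batch_size,
    }


stats = training_throughput(steps=20_000, hours=4, batch_size=64)
print(stats)
```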

## How does SmolVLA perform?

Despite its size, SmolVLA matches or beats much larger VLAs on standard simulation benchmarks and on real low-cost robots.[1]

### Simulation benchmarks

On the [LIBERO](/wiki/libero) benchmark, the 0.45B SmolVLA reaches an average success rate of 87.3%, outperforming Octo (0.09B, 75.1%), OpenVLA (7B, 76.5%), and the robotics-pretrained pi0 (3.3B, 86.0%), as well as a Diffusion Policy baseline (72.4%).[1]

| LIBERO (avg. success) | Parameters | Score |
| --- | --- | --- |
| **SmolVLA** | 0.45B | **87.3%** |
| pi0 | 3.3B | 86.0% |
| OpenVLA | 7B | 76.5% |
| Octo | 0.09B | 75.1% |
| Diffusion Policy | (baseline) | 72.4% |

On Meta-World, the 0.45B SmolVLA reaches an average success rate of 57.3%, far above a Diffusion Policy baseline at 10.5% and TinyVLA at 31.6%, and ahead of pi0 (50.5%) in the same no-pretraining setting.[1]

### Real-world performance

On real, low-cost SO-100 and SO-101 arms, a single multitask SmolVLA policy outperforms both the larger multitask pi0 and ACT policies trained separately for each task:[1]

| Real-world (SO-100 / SO-101) | Parameters | Avg. success |
| --- | --- | --- |
| **SmolVLA** | 0.45B | **78.3%** |
| pi0 (multitask) | 3.5B | 61.7% |
| ACT (single-task) | 0.08B | 48.3% |

On the SO-101 "pick-place-Lego" generalization test, SmolVLA reached 90% success in-distribution and 50% out-of-distribution, versus 70% and 40% respectively for ACT.[1]

### Impact of community pretraining

Pretraining on community data is decisive. On multitask SO-100 manipulation, success rose from 51.7% without pretraining to 78.3% with community pretraining, a gain of 26.6 percentage points.[1]

### Asynchronous inference benefits

The asynchronous inference stack improves real-world responsiveness:[2]

- About **30% faster** task completion (roughly 9.7 seconds versus 13.75 seconds in the reported comparison).[1]
- About **2x more tasks completed** in a fixed time window (19 cubes moved versus 9 in the synchronous baseline).[1]
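
The two headline ratios follow directly from the figures quoted in the list above, as the short calculation below shows. The function names are illustrative only.

```python
def relative_time_saving(sync_seconds, async_seconds):
    """Fraction of task time saved by asynchronous inference."""
    return (sync_seconds - async_seconds) / sync_seconds


def throughput_ratio(async_count, sync_count):
    """How many times more tasks were completed in the same time window."""
    return async_count / sync_count


saving = relative_time_saving(sync_seconds=13.75, async_seconds=9.7)
ratio = throughput_ratio(async_count=19, sync_count=9)
print(f"time saved: {saving:.1%}, throughput: {ratio:.2f}x")
```

The time saving works out to just under 30% and the throughput ratio to slightly more than 2x, which matches the rounded figures in the text.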

## Technical specifications

| Specification | Value |
| --- | --- |
| Total parameters | ~450 million (about 10x smaller than typical VLAs)[1] |
| VLM backbone | SmolVLM-2 (~500M)[1] |
| Action expert | ~100 million parameters[1] |
| VLM layers used | First N = L/2 (half the language-decoder layers)[1] |
| Visual tokens per frame | 64 (pixel-shuffle reduction)[1] |
| Action output | Continuous action chunks via flow matching[1] |
| Training data | ~481 to 487 community datasets, ~10.6M frames, <30k episodes[1][2] |
| License | Apache-2.0 (open code and weights)[3] |

### Software integration

SmolVLA is fully integrated with the [LeRobot](/wiki/lerobot) framework, so the same library is used to download a dataset, fine-tune `lerobot/smolvla_base`, and run inference. A minimal fine-tuning command looks like:

```bash
python lerobot/scripts/train.py \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=lerobot/your_dataset \
    --batch_size=64 \
    --steps=20000
```

## What is SmolVLA used for?

SmolVLA targets low-cost manipulation tasks and edge deployment.[1][2]

### Supported tasks

- Pick-and-place of objects with varied shapes and sizes.
- Stacking blocks into stable structures.
- Sorting objects by category or property.
- Simple assembly and manipulation sequences.

### Robot platforms

- **SO-100**: the primary low-cost training platform.
- **SO-101**: used to demonstrate generalization to a new arm.
- **Other LeRobot-compatible arms**: including community-built and ALOHA-style rigs through the LeRobot framework.

### Use cases

- **Education and hobby robotics**: cheap enough for classroom demos and maker projects.[4]
- **Research prototyping**: quick fine-tuning with a handful of demonstrations using the LeRobot trainer.[2]
- **Edge and offline deployment**: runs on consumer GPUs or CPUs, useful for privacy-sensitive installations.[3]
- **Research baseline**: a reproducible small-scale reference for studying VLA design choices.[1]

## How does SmolVLA compare with other VLAs?

| Model | Parameters | Training data | Hardware | Open weights |
| --- | --- | --- | --- | --- |
| **SmolVLA** | ~0.45B | Community LeRobot datasets | Consumer GPU / CPU / MacBook | Yes |
| OpenVLA | 7B | Open X-Embodiment | High-end GPU | Yes |
| RT-2 | 55B | Proprietary (Google) | Enterprise GPU cluster | No |
| pi0 | 3.3B-3.5B | Mixed proprietary and open | High-end GPU | Partial |
| ACT | ~0.08B | Task-specific demonstrations | Mid-range GPU | Yes |

SmolVLA's distinguishing claim is that it reaches the performance band of models 10 times its size while being trainable and deployable on hardware most people already own.[1]

## Impact and reception

SmolVLA was widely covered as a step toward democratizing robot learning. Reporting emphasized that the model is "so efficient it can run on a MacBook," highlighting how it lowers the barrier to entry for robotics research and hobbyist use.[4] Commentators noted three themes:[2][4]

- It lowers the cost and hardware barrier for robotics research.
- It demonstrates the value of open, community-collected datasets over proprietary fleets.
- It shows that compact models can stay competitive with much larger VLAs, pushing back on pure scaling.

## Limitations

The authors acknowledge several limitations:[1]

1. **Dataset breadth**: training data skews toward the SO-100 / SO-101 arms, limiting cross-embodiment generalization.
2. **Dataset size**: it uses far fewer episodes (<30k) than the largest VLAs, which train on much larger trajectory collections.
3. **Long-horizon tasks**: evaluation focuses on relatively short manipulation tasks.
4. **General-purpose backbone**: the VLM is not specifically pretrained for robotics.
5. **Single-arm focus**: primary evaluation is on single-arm manipulation.
6. **Language grounding**: the compact model gives up some capacity for complex language understanding relative to billion-parameter VLAs.

## Future directions

The SmolVLA team and community point to several next steps:[1]

- **Cross-embodiment training**: extending beyond SO-100 / SO-101 to more diverse platforms.
- **Scaling studies**: exploring how performance changes across model sizes.
- **Joint multimodal training**: combining robotics data with general vision-language data.
- **Real-time optimization**: further reducing inference latency.
- **Larger community datasets**: growing the open LeRobot data pool to broaden coverage.

## ELI5 (Explain Like I'm 5)

Imagine a robot arm that can see with a camera, listen to a simple instruction like "put the red block in the box," and then actually move to do it. Most robot brains that smart are huge and need expensive computers. SmolVLA is a much smaller robot brain made by Hugging Face. It is small enough to run on a regular laptop, even a MacBook, but it still does the job almost as well as the big ones. It learned by watching lots of videos that people around the world shared for free, instead of secret company data, so anyone can download it and teach their own cheap robot arm new tricks.

## See also

- [Vision-language-action model](/wiki/vision_language_action_model)
- [LeRobot](/wiki/lerobot)
- [Hugging Face](/wiki/hugging_face)
- [OpenVLA](/wiki/openvla)
- [Embodied AI](/wiki/embodied_ai)
- [Robot learning](/wiki/robot_learning)
- [Foundation models](/wiki/foundation_models)
- [Edge AI](/wiki/edge_ai)
- [Robotics](/wiki/robotics)

## References

1. Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., et al. "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics." arXiv:2506.01844, June 2, 2025. https://arxiv.org/abs/2506.01844
2. Hugging Face. "SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data." Hugging Face Blog, June 3, 2025. https://huggingface.co/blog/smolvla
3. Hugging Face. "SmolVLA model card (lerobot/smolvla_base)" and SmolVLA documentation. https://huggingface.co/lerobot/smolvla_base and https://huggingface.co/docs/lerobot/smolvla
4. Wiggers, K. "Hugging Face says its new robotics model is so efficient it can run on a MacBook." TechCrunch, June 4, 2025. https://techcrunch.com/2025/06/04/hugging-face-says-its-new-robotics-model-is-so-efficient-it-can-run-on-a-macbook/

## External links

- [Official SmolVLA blog post](https://huggingface.co/blog/smolvla)
- [SmolVLA model on Hugging Face Hub](https://huggingface.co/lerobot/smolvla_base)
- [SmolVLA arXiv paper](https://arxiv.org/abs/2506.01844)
- [LeRobot GitHub repository](https://github.com/huggingface/lerobot)
- [SmolVLA documentation](https://huggingface.co/docs/lerobot/smolvla)

