SmolVLA

Updated Jun 25, 2026

See also: Terms and artificial intelligence terms

SmolVLA (Small Vision-Language-Action) is a compact, open-source vision-language-action model (VLA) for robotics developed by Hugging Face and released in June 2025. At roughly 450 million parameters, it is about 10 times smaller than typical VLAs yet reports comparable task performance. It can be trained on a single GPU and runs on consumer-grade hardware, including a consumer GPU, a CPU, or a MacBook.^[1]^[2] It was pretrained entirely on about 10 million frames from open, community-contributed LeRobot datasets (481 according to the paper, 487 according to the Hugging Face blog post), making it the first widely adopted VLA built on freely shared, affordable-robot data rather than proprietary fleets.^[2]^[3]

The paper, titled "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics," states the design goal plainly: "we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs, while retaining competitive performance."^[1] The "Smol" in the name is a playful spelling of "small" and refers to the model's size.

What is SmolVLA?

SmolVLA is a robotics foundation model that takes camera images, a natural-language instruction, and the robot's current joint state, and outputs continuous low-level actions (motor commands) for a manipulator. It follows the same three-modality vision, language, and action paradigm as larger VLAs such as OpenVLA, RT-2, and Physical Intelligence's pi0, but it is built explicitly for affordability and accessibility rather than scale.^[1]

The model was created to address three obstacles that the authors argue keep robot learning out of reach for most people: the high compute cost of billion-parameter VLAs, the reliance on closed academic and industrial datasets, and the difficulty of deploying large policies on cheap hardware.^[1] As the abstract puts it, existing VLAs are "typically massive, often with billions of parameters, leading to high training costs and limited real-world deployability," and they "rely on academic and industrial datasets, overlooking the growing availability of community-collected data from affordable robotic platforms."^[1] SmolVLA is Hugging Face's answer: a model and recipe that a student, hobbyist, or small lab can fine-tune and run end to end.

The model card describes it as "a compact, efficient Vision-Language-Action (VLA) model designed for affordable and efficient robotics, trainable on a single GPU and deployable on consumer hardware."^[3] Its code, weights, and training data are released openly under the LeRobot project.

Overview

SmolVLA aims to make vision-language-action capabilities usable on consumer-grade hardware, which makes it suitable for educational projects, research, and small-scale automation.^[2] Despite its compact size, the paper reports that "SmolVLA achieves performance comparable to VLAs that are 10x larger."^[1]

The model achieves its efficiency through several design choices:

Efficient backbone: Uses only the first half of the vision-language model's layers (the first N = L/2 layers of the language decoder), cutting compute and latency without retraining a custom encoder.^[1]
Community-driven training: Pretrained exclusively on open, community-contributed datasets tagged "lerobot" on the Hugging Face Hub, with no proprietary data.^[2]
Asynchronous inference: An inference stack that decouples perception and action prediction from action execution, reported to give about 30% faster response and 2x task throughput.^[2]
Hardware accessibility: Trainable on a single consumer or A100 GPU and deployable on consumer GPUs, CPUs, or a MacBook for offline, privacy-sensitive use.^[3]

Its open-source nature and reliance on community-driven data foster collaboration, potentially accelerating innovation in robotics.^[2]

Who built SmolVLA, and when was it released?

SmolVLA was developed by a team at Hugging Face and academic collaborators. The arXiv paper (ID 2506.01844) was submitted on June 2, 2025, and the official Hugging Face blog post, model weights, and code were published on June 3, 2025.^[1]^[2] The model is part of Hugging Face's LeRobot ecosystem, the open robotics stack launched in 2024 that provides robotics-focused models, datasets, and tools.^[2]

The listed authors are: Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf (co-founder of Hugging Face), and Remi Cadene (research scientist leading LeRobot at Hugging Face).^[1] Lead author Mustafa Shukor and Dana Aubakirova are affiliated with Sorbonne University and ENS Paris-Saclay respectively, and Matthieu Cord is also at Sorbonne University; the work is a Hugging Face and academic collaboration.

Date	Milestone
June 2, 2025	SmolVLA arXiv paper (2506.01844) submitted^[1]
June 3, 2025	Official Hugging Face blog post, model weights, and code released^[2]
June 4, 2025	Press coverage of a robotics model efficient enough to "run on a MacBook"^[4]

What is SmolVLA's architecture?

SmolVLA's architecture has two main components: a pretrained vision-language backbone for perception, and a lightweight action expert that generates robot motions.^[1]

Perception module (SmolVLM-2)

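The perception module is based on SmolVLM-2, an efficient open vision-language model (~500M parameters) optimized for multi-image and video inputs. It pairs a SigLIP visual encoder with a compact language decoder built on SmolLM2.^[1] Because SmolVLA keeps only part of this backbone (see layer pruning below), the full model, including the action expert, still totals about 450 million parameters. Key features: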

Vision encoder: SigLIP for visual feature encoding.
Language decoder: SmolLM2, a compact language model.
Token efficiency: Limits visual tokens to 64 per frame using pixel-shuffle token reduction (down from over 1,000), greatly cutting compute.^[1]
Layer pruning: Uses only the first N layers (N = L/2, half the total) of the VLM's language decoder, reducing latency.^[1]
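The two reductions listed above can be illustrated with a small NumPy sketch. This is a toy illustration, not the SmolVLM-2 implementation: the 32 x 32 patch grid, the shuffle ratio of 4, the 8-dimensional embeddings, and the 32-layer decoder are all assumptions chosen so that 1,024 tokens shrink to the 64 per frame cited in the article, and the helper names are hypothetical.

```python
import numpy as np


def pixel_shuffle_tokens(tokens: np.ndarray, ratio: int) -> np.ndarray:
    """Space-to-depth: merge each ratio x ratio block of patch tokens into one.

    tokens: (H, W, C) grid of patch embeddings.
    returns: (H/ratio * W/ratio, C * ratio**2) array of merged tokens.
    """
    h, w, c = tokens.shape
    assert h % ratio == 0 and w % ratio == 0, "grid must divide evenly"
    x = tokens.reshape(h // ratio, ratio, w // ratio, ratio, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/r, W/r, r, r, C)
    return x.reshape((h // ratio) * (w // ratio), ratio * ratio * c)


def keep_first_half(layers: list) -> list:
    """Layer pruning as described above: keep the first N = L/2 layers."""
    return layers[: len(layers) // 2]


# Assumed toy sizes: 32 x 32 = 1,024 patch tokens with 8-dim embeddings.
grid = np.zeros((32, 32, 8))
reduced = pixel_shuffle_tokens(grid, ratio=4)
print(reduced.shape)  # (64, 128): fewer tokens, wider embeddings

# Assumed decoder depth of 32; the names are placeholders, not real modules.
decoder_layers = [f"decoder_layer_{i}" for i in range(32)]
print(len(keep_first_half(decoder_layers)))  # 16
```

Pixel shuffle does not discard information: it moves spatial neighbours into the channel dimension, so the sequence the decoder attends over is 16 times shorter in this toy setting while every patch value is retained.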

Action expert

The action expert is a specialized transformer module of about 100 million parameters that generates continuous robot actions:^[1]

Architecture: Alternates self-attention and cross-attention blocks with causal masking, conditioned on the VLM's hidden states.
Training objective: Uses flow matching to map noisy samples back toward ground-truth actions.
Action chunks: Produces sequences of future robot actions ("action chunks") rather than one step at a time.
Temporal consistency: Causal masking improves smoothness and temporal coherence.
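As a rough illustration of the flow-matching objective and chunked action output described above, the sketch below uses a generic straight-line (rectified-flow style) convention: interpolate between noise and the ground-truth chunk, regress the velocity, and integrate with Euler steps at inference. The sign and time conventions, the chunk length of 50, the 6 joints, and the oracle velocity function that stands in for the trained action expert are all assumptions and may differ from the paper.

```python
import numpy as np


def flow_matching_pair(actions, noise, tau):
    """Point on the straight noise-to-action path, plus its velocity target."""
    x_tau = (1.0 - tau) * noise + tau * actions
    target_velocity = actions - noise
    return x_tau, target_velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


def sample_chunk(velocity_fn, noise, num_steps=10):
    """Euler-integrate the velocity field from tau = 0 (noise) to tau = 1."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


rng = np.random.default_rng(0)
chunk = rng.normal(size=(50, 6))     # assumed: 50 future steps x 6 joints
noise = rng.normal(size=chunk.shape)


def oracle_velocity(x, tau):
    """Stand-in for a perfectly trained action expert (hypothetical)."""
    return chunk - noise


x_tau, target = flow_matching_pair(chunk, noise, tau=0.3)
print(flow_matching_loss(oracle_velocity(x_tau, 0.3), target))  # 0.0
print(np.allclose(sample_chunk(oracle_velocity, noise), chunk))  # True
```

In a real policy the velocity function would be the action expert conditioned on the VLM's hidden states. The oracle here only shows that a correct velocity field carries a noise sample to the whole action chunk in a few integration steps.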

Asynchronous inference stack

A key contribution is SmolVLA's asynchronous inference system, which splits control between a RobotClient, which runs on the robot, and a PolicyServer, which runs the model:^[2]

The robot executes the current action chunk while the server already predicts the next one.
The client keeps a queue of pending actions and requests the next chunk once the queue drains to a threshold, so the queue is refilled before it runs empty.
This enables higher control rates and low-latency control suitable for real-time applications.
It makes the system more adaptive and able to recover faster from errors.
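The effect of overlapping prediction with execution can be seen in a toy, single-threaded simulation. This is only a sketch of the queueing idea, not the LeRobot RobotClient or PolicyServer code: the `simulate` function, the tick-based latency model, and the choice to append new chunks to the queue (rather than merge them with overlapping actions) are simplifying assumptions.

```python
from collections import deque


def simulate(ticks, chunk_size, threshold, latency):
    """Return (executed, idle) control ticks for a toy client/server loop.

    threshold: fraction of chunk_size at or below which the client asks for
               the next chunk; 0.0 mimics synchronous inference (wait until
               the queue is empty).
    latency:   ticks the hypothetical server needs to return a chunk.
    """
    queue = deque()
    arrival_tick = None  # tick at which the requested chunk arrives
    executed = idle = 0
    for tick in range(ticks):
        # A previously requested chunk arrives and is queued.
        if arrival_tick is not None and tick >= arrival_tick:
            queue.extend(range(chunk_size))  # placeholder actions
            arrival_tick = None
        # Request the next chunk when the queue is low and none is pending.
        if arrival_tick is None and len(queue) <= threshold * chunk_size:
            arrival_tick = tick + latency
        # Execute one action per tick, or stall if nothing is queued.
        if queue:
            queue.popleft()
            executed += 1
        else:
            idle += 1
    return executed, idle


print(simulate(100, 10, 0.0, 3))  # (76, 24): robot stalls during each request
print(simulate(100, 10, 0.5, 3))  # (97, 3): only the initial wait is idle
```

With a threshold of zero the robot pauses for the full inference latency after every chunk. With a threshold of one half, the next chunk is requested while five actions remain, so the latency is hidden behind execution.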

Input processing

SmolVLA processes three input types:^[1]

RGB images: One or more camera views (for example a top-down view plus a wrist camera).
Language instructions: Natural-language task descriptions tokenized into text tokens.
Sensorimotor state: The robot's current joint state, projected into a token via a linear layer.
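A minimal sketch of how these three inputs might be assembled into one token sequence is shown below. Only the 64 visual tokens per frame and the single linearly projected state token come from the article. The embedding width, state dimension, number of cameras, number of text tokens, the concatenation order, and the `build_prefix` helper are assumptions made for illustration, and random arrays stand in for real encoder outputs.

```python
import numpy as np

TOKENS_PER_IMAGE = 64  # from the article: 64 visual tokens per frame
HIDDEN = 32            # assumed toy embedding width
STATE_DIM = 6          # assumed number of joint values


def build_prefix(image_tokens, text_tokens, state, w_state, b_state):
    """Concatenate image tokens, text tokens, and one projected state token."""
    state_token = (state @ w_state + b_state)[None, :]  # shape (1, HIDDEN)
    return np.concatenate([*image_tokens, text_tokens, state_token], axis=0)


rng = np.random.default_rng(0)
# Stand-ins for encoder outputs: two camera views and a 7-token instruction.
cameras = [rng.normal(size=(TOKENS_PER_IMAGE, HIDDEN)) for _ in range(2)]
text = rng.normal(size=(7, HIDDEN))
state = rng.normal(size=STATE_DIM)
w_state = rng.normal(size=(STATE_DIM, HIDDEN))  # linear projection weights
b_state = np.zeros(HIDDEN)

prefix = build_prefix(cameras, text, state, w_state, b_state)
print(prefix.shape)  # (136, 32): 2 * 64 image + 7 text + 1 state token
```

Because each camera contributes a fixed 64 tokens, adding a view lengthens the sequence by a known, small amount, which is part of what keeps multi-camera setups affordable.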

How was SmolVLA trained?

Datasets

SmolVLA was pretrained exclusively on open, community-contributed datasets from the LeRobot ecosystem. The paper reports 481 datasets totaling about 22,900 episodes and 10.6 million frames, while the Hugging Face blog cites "487 community datasets" and "10 million frames"; both sources describe the corpus as "fewer than 30k episodes."^[1]^[2] The authors note that this is far less data than state-of-the-art VLAs such as OpenVLA are trained on.^[1] The training data consists of:^[2]

Community datasets tagged "lerobot" on the Hugging Face Hub, focused largely on the affordable SO-100 / SO-101 robot arms.
Diverse task coverage: pick-and-place, stacking, sorting, and other manipulation tasks.
Natural diversity: varied lighting, suboptimal demonstrations, and heterogeneous control schemes.
Multiple environments: data collected in homes, maker spaces, and research labs.

The datasets were curated and cleaned by the community: a custom filtering tool surfaced low-quality data, and noisy task labels were standardized automatically using a vision-language model (Qwen2.5-VL) to rewrite instructions into concise, action-verb phrases.^[2]

Training process

The training methodology follows a two-stage approach inspired by large language models:^[1]^[2]

Community pretraining: General manipulation pretraining on the LeRobot community datasets.
Task-specific post-training: Fine-tuning on the target robot and tasks, including in low-data regimes with only a handful of demonstrations.

Key training facts:^[3]

Can be trained on a single GPU (consumer or A100 class).
Training for 20,000 steps takes roughly 4 hours on a single A100 GPU.^[3]
Batch size and VRAM use are adjustable to the available GPU.

How does SmolVLA perform?

Despite its size, SmolVLA matches or beats much larger VLAs on standard simulation benchmarks and on real low-cost robots.^[1]

Simulation benchmarks

On the LIBERO benchmark, the 0.45B SmolVLA reaches an average success rate of 87.3%, outperforming Octo (0.09B, 75.1%), OpenVLA (7B, 76.5%), and the robotics-pretrained pi0 (3.3B, 86.0%), as well as a Diffusion Policy baseline (72.4%).^[1]

Model	Parameters	LIBERO avg. success
SmolVLA	0.45B	87.3%
pi0	3.3B	86.0%
OpenVLA	7B	76.5%
Octo	0.09B	75.1%
Diffusion Policy	(baseline)	72.4%

On Meta-World, the 0.45B SmolVLA reaches an average success rate of 57.3%, far above a Diffusion Policy baseline at 10.5% and TinyVLA at 31.6%, and ahead of pi0 (50.5%) in the same no-pretraining setting.^[1]

Real-world performance

On real, low-cost SO-100 and SO-101 arms, a single multitask SmolVLA policy outperforms both a multitask pi0 policy and single-task ACT policies:^[1]

Model	Parameters	Avg. success (SO-100 / SO-101)
SmolVLA	0.45B	78.3%
pi0 (multitask)	3.5B	61.7%
ACT (single-task)	0.08B	48.3%

On the SO-101 "pick-place-Lego" generalization test, SmolVLA reached 90% success in-distribution and 50% out-of-distribution, versus 70% and 40% respectively for ACT.^[1]

Impact of community pretraining

Pretraining on community data has a large measured effect. On multitask SO-100 manipulation, success rose from 51.7% without pretraining to 78.3% with community pretraining, a gain of 26.6 percentage points.^[1]

Asynchronous inference benefits

The asynchronous inference stack improves real-world responsiveness:^[2]

About 30% faster task completion (roughly 9.7 seconds versus 13.75 seconds in the reported comparison).^[1]
About 2x more tasks completed in a fixed time window (19 cubes moved versus 9 in the synchronous baseline).^[1]
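The rounded headline figures follow directly from the reported raw numbers, as this short check shows. The two helper functions are hypothetical names introduced here; the inputs are the values quoted in the list above.

```python
def relative_speedup(baseline_s: float, improved_s: float) -> float:
    """Fraction of time saved relative to the baseline duration."""
    return (baseline_s - improved_s) / baseline_s


def throughput_ratio(improved_count: int, baseline_count: int) -> float:
    """How many times more tasks were completed in the same time window."""
    return improved_count / baseline_count


# Reported: 9.7 s asynchronous versus 13.75 s synchronous per task.
print(f"{relative_speedup(13.75, 9.7):.1%}")  # 29.5%
# Reported: 19 cubes moved asynchronously versus 9 synchronously.
print(f"{throughput_ratio(19, 9):.2f}x")      # 2.11x
```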

Technical specifications

Specification	Value
Total parameters	~450 million (about 10x smaller than typical VLAs)^[1]
VLM backbone	SmolVLM-2 (~500M)^[1]
Action expert	~100 million parameters^[1]
VLM layers used	First N = L/2 (half the language-decoder layers)^[1]
Visual tokens per frame	64 (pixel-shuffle reduction)^[1]
Action output	Continuous action chunks via flow matching^[1]
Training data	~481 to 487 community datasets, ~10.6M frames, <30k episodes^[1]^[2]
License	Apache-2.0 (open code and weights)^[3]

Software integration

SmolVLA is fully integrated with the LeRobot framework, so the same library is used to download a dataset, fine-tune lerobot/smolvla_base, and run inference. A minimal fine-tuning command looks like:

python lerobot/scripts/train.py \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=lerobot/your_dataset \
    --batch_size=64 \
    --steps=20000

What is SmolVLA used for?

SmolVLA targets low-cost manipulation tasks and edge deployment.^[1]^[2]

Supported tasks

Pick-and-place of objects with varied shapes and sizes.
Stacking blocks into stable structures.
Sorting objects by category or property.
Simple assembly and manipulation sequences.

Robot platforms

SO-100: the primary low-cost training platform.
SO-101: used to demonstrate generalization to a new arm.
Other LeRobot-compatible arms: including community-built and ALOHA-style rigs through the LeRobot framework.

Use cases

Education and hobby robotics: cheap enough for classroom demos and maker projects.^[4]
Research prototyping: quick fine-tuning with a handful of demonstrations using the LeRobot trainer.^[2]
Edge and offline deployment: runs on consumer GPUs or CPUs, useful for privacy-sensitive installations.^[3]
Research baseline: a reproducible small-scale reference for studying VLA design choices.^[1]

How does SmolVLA compare with other VLAs?

Model	Parameters	Training data	Hardware	Open weights
SmolVLA	~0.45B	Community LeRobot datasets	Consumer GPU / CPU / MacBook	Yes
OpenVLA	7B	Open X-Embodiment	High-end GPU	Yes
RT-2	55B	Proprietary (Google)	Enterprise GPU cluster	No
pi0	3.3B-3.5B	Mixed proprietary and open	High-end GPU	Partial
ACT	~0.08B	Task-specific demonstrations	Mid-range GPU	Yes

SmolVLA's distinguishing claim is that it reaches the performance band of models 10 times its size while being trainable and deployable on hardware most people already own.^[1]

Impact and reception

SmolVLA was widely covered as a step toward democratizing robot learning. Reporting emphasized that the model is "so efficient it can run on a MacBook," highlighting how it lowers the barrier to entry for robotics research and hobbyist use.^[4] Commentators noted three themes:^[2]^[4]

It lowers the cost and hardware barrier for robotics research.
It demonstrates the value of open, community-collected datasets over proprietary fleets.
It shows that compact models can stay competitive with much larger VLAs, pushing back on pure scaling.

Limitations

The authors acknowledge several limitations:^[1]

Dataset breadth: training data skews toward the SO-100 / SO-101 arms, limiting cross-embodiment generalization.
Dataset size: it uses far fewer episodes (<30k) than the largest VLAs, which train on much larger trajectory collections.
Long-horizon tasks: evaluation focuses on relatively short manipulation tasks.
General-purpose backbone: the VLM is not specifically pretrained for robotics.
Single-arm focus: primary evaluation is on single-arm manipulation.
Language grounding: the compact size may cost some complex language understanding relative to billion-parameter VLAs.

Future directions

The SmolVLA team and community point to several next steps:^[1]

Cross-embodiment training: extending beyond SO-100 / SO-101 to more diverse platforms.
Scaling studies: exploring how performance changes across model sizes.
Joint multimodal training: combining robotics data with general vision-language data.
Real-time optimization: further reducing inference latency.
Larger community datasets: growing the open LeRobot data pool to broaden coverage.

ELI5 (Explain Like I'm 5)

Imagine a robot arm that can see with a camera, listen to a simple instruction like "put the red block in the box," and then actually move to do it. Most robot brains that smart are huge and need expensive computers. SmolVLA is a much smaller robot brain made by Hugging Face. It is small enough to run on a regular laptop, even a MacBook, but it still does the job almost as well as the big ones. It learned by watching lots of videos that people around the world shared for free, instead of secret company data, so anyone can download it and teach their own cheap robot arm new tricks.

References

[1] Shukor, M., Aubakirova, D., Capuano, F., Kooijmans, P., Palma, S., Zouitine, A., et al. "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics." arXiv:2506.01844, June 2, 2025. https://arxiv.org/abs/2506.01844
[2] Hugging Face. "SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data." Hugging Face Blog, June 3, 2025. https://huggingface.co/blog/smolvla
[3] Hugging Face. "SmolVLA model card (lerobot/smolvla_base)" and SmolVLA documentation. https://huggingface.co/lerobot/smolvla_base and https://huggingface.co/docs/lerobot/smolvla
[4] Wiggers, K. "Hugging Face says its new robotics model is so efficient it can run on a MacBook." TechCrunch, June 4, 2025. https://techcrunch.com/2025/06/04/hugging-face-says-its-new-robotics-model-is-so-efficient-it-can-run-on-a-macbook/
