SmolLM 2
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 ยท 2,740 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
15 citations
Review status
Source-backed
Revision
v1 ยท 2,740 words
Add missing citations, update stale details, or suggest a clearer explanation.
SmolLM 2 is a family of compact open-weight language models released by Hugging Face on November 1, 2024. The family contains three sizes, 135M, 360M, and 1.7 billion parameters, and was designed to run on consumer hardware, mobile devices, and edge environments where running larger models is impractical. SmolLM 2 succeeds the original SmolLM line from July 2024 and applies a data-centric training approach across roughly 11 trillion tokens for the 1.7B flagship, with smaller checkpoints trained on 4T and 2T tokens respectively.
The series ships under the Apache 2.0 license and is distributed through the HuggingFaceTB organization on the Hugging Face Hub. Each size has a base pretrained checkpoint and an instruct variant fine-tuned with supervised fine-tuning followed by Direct Preference Optimization. According to the team's published comparisons, the 1.7B instruct model is competitive with, and in several reported benchmarks ahead of, contemporary open releases such as Llama 3.2 1B and Qwen 2.5 1.5B at similar parameter counts. A follow-up technical report titled "SmolLM2: When Smol Goes Big, Data-Centric Training of a Small Language Model" was posted on arXiv on February 4, 2025.
SmolLM 2 sits inside Hugging Face's broader push into small language models and predates the SmolLM 3 release. The project is led by Loubna Ben Allal's Smol Models Research team at Hugging Face.
The original SmolLM family was published by Hugging Face in July 2024 as a research artifact built on the SmolLM-Corpus, a curated mix of Cosmopedia v2, FineWeb-Edu, Python-Edu, and a small slice of StarCoder data. SmolLM 1 demonstrated that a 1.7B model trained on a tighter, higher-quality corpus could rival much larger general-purpose checkpoints on common-sense reasoning benchmarks like ARC and HellaSwag. The 135M and 360M checkpoints in that first release were aimed at researchers studying scaling laws for very small models and at hobbyists trying to run language models on phones and Raspberry Pi-class hardware.
The SmolLM 2 team has been candid about the limitations of the first generation. SmolLM 1 was weak on instruction following, sometimes refused well-formed requests, struggled with multi-turn chat, and had thin math coverage. Knowledge cutoff was also limited because the pretraining corpus emphasized educational web text over broader factual coverage. The second generation was framed as a direct response to those issues, with the team treating data composition as the primary lever rather than chasing larger parameter counts. The technical report frames this as "data-centric training," arguing that for small models, a careful multi-stage data mix matters more than additional capacity.
The wider context was a 2024 surge in small open models. Meta released the Llama 3.2 small variants in September 2024 with 1B and 3B sizes intended for on-device use. Alibaba shipped Qwen 2.5 at the same time, with checkpoints from 0.5B up to 72B and several sub-2B models that competed directly with the SmolLM and Llama 3.2 small variants. Microsoft's Phi-3 mini, released in April 2024, had already established that a heavily filtered corpus could push a sub-4B model past much larger competitors on reasoning tasks. SmolLM 2 entered this market from a different angle: instead of treating small models as a downstream-distillation problem from a larger teacher, the team trained from scratch on a carefully composed mix and published the recipe and the datasets in full.
The SmolLM 2 family launched with three sizes. Each size has both a base pretrained model and an instruction-tuned variant.
| Model | Parameters | Pretraining tokens | License | Release |
|---|---|---|---|---|
| SmolLM2-135M | 135 million | 2 trillion | Apache 2.0 | November 1, 2024 |
| SmolLM2-360M | 360 million | 4 trillion | Apache 2.0 | November 1, 2024 |
| SmolLM2-1.7B | 1.7 billion | 11 trillion | Apache 2.0 | November 1, 2024 |
The instruct counterparts (SmolLM2-135M-Instruct, SmolLM2-360M-Instruct, SmolLM2-1.7B-Instruct) were released alongside the base models, also under Apache 2.0. The 1.7B instruct checkpoint is the headline model and adds explicit support for function calling, text rewriting, and summarization in addition to general chat. The 135M and 360M instruct models target lighter use cases such as classification, on-device autocomplete, and short-form generation.
All three sizes were trained with bfloat16 precision using the nanotron framework on NVIDIA H100 GPUs. The 135M used 64 H100s, the 360M used 128, and the 1.7B used 256.
The SmolLM 2 training corpus is the central contribution of the project. The 1.7B model was pretrained on approximately 11 trillion tokens, which is high relative to the parameter count, a deliberate over-training choice meant to extract as much capability as possible from a small model. The data mix evolved across several stages, with the team manually adjusting the proportions between stages based on intermediate evaluation results.
The bulk of the corpus comes from two heavily filtered web datasets. FineWeb-Edu, a Hugging Face project that scores Common Crawl pages for educational value using a classifier and keeps only the high-scoring shards, provides the educational backbone. DCLM (DataComp for Language Models), released by the DCLM consortium in mid-2024, supplies a more general web corpus that was already shown to perform well at small scale. Mixing the two was meant to balance the narrow educational style of FineWeb-Edu with broader factual coverage from DCLM.
For mathematical content, the team introduced FineMath, a new dataset purpose-built for SmolLM 2 because they judged existing open math corpora to be either too small or too noisy. FineMath filters mathematical web pages using both classifier signals and rule-based heuristics and is released as part of the SmolLM 2 publication.
For code, the project introduced Stack-Edu, a filtered subset of The Stack v2 that keeps only files judged to have educational value. The reasoning is the same as for FineWeb-Edu: small models benefit from carefully filtered data because they have less capacity to learn from noisy or irrelevant examples.
The instruct variants are post-trained with a separate dataset called SmolTalk, a curated instruction-following corpus that the team also released. SmolTalk combines publicly available instruction datasets, synthetic data generated by larger models, and the Argilla Synth-APIGen-v0.1 dataset for function calling. Preference optimization uses the public UltraFeedback dataset.
The team framed the corpus design as the main reason SmolLM 2 outperformed its predecessors and several contemporary peers. In particular, they argue that introducing fresh datasets at later training stages, instead of holding the mix constant from the start, lets the model lock in basic language modeling first and then specialize. This is the practical justification for the multi-stage curriculum described in the technical report.
SmolLM 2 uses a standard decoder-only transformer architecture in the Llama family. There were no major architectural inventions in the release; the team described the work as a data and training-recipe contribution rather than a new architecture. The published configuration for the 1.7B checkpoint reports 24 transformer layers, a model dimension of 2,048, a feed-forward dimension of 8,192, and 32 attention heads. Tokenization reuses the SmolLM tokenizer, which has a 49,152-token vocabulary.
Context length for all three sizes is 8,192 tokens, achieved through long-context training in the later stages of pretraining. The base 135M and 360M models also offer 2k-context variants for resource-constrained deployments where activation memory matters.
The 1.7B model has a bfloat16 memory footprint of roughly 3.4 GB at full precision, and the 135M sits at about 720 MB. Quantized to 8-bit, the 135M shrinks to about 138 MB, which is small enough to load on a smartphone without specialized hardware. The team also published evaluation harness configurations alongside the model cards, so third parties can reproduce the reported benchmark numbers with minimal setup. The decision to keep the architecture conventional was explicit; the team wanted to make the recipe portable and easy to adopt for other groups building small models.
Hugging Face published benchmark numbers for both the base and instruct variants in the model cards on the Hub. The figures below come from those cards and are the team's own evaluations.
| Benchmark | SmolLM2-135M | SmolLM2-360M | SmolLM2-1.7B |
|---|---|---|---|
| HellaSwag | 42.1 | 54.5 | 68.7 |
| ARC (average) | 43.9 | 53.0 | 60.5 |
| PIQA | 68.4 | 71.7 | not reported in card extract |
| MMLU (cloze) | 31.5 | 35.8 | not reported in card extract |
| MMLU-Pro | not reported | not reported | 19.4 |
| CommonsenseQA | 33.9 | 38.0 | not reported |
| TriviaQA | not reported | not reported | 36.7 |
| GSM8K (5-shot) | 1.4 | not reported | 31.0 |
The 1.7B base model is the strongest of the three. HellaSwag at 68.7 and ARC at 60.5 put it ahead of the original SmolLM-1.7B and broadly in line with other open models in the 1B to 2B range.
| Benchmark | SmolLM2-135M-Instruct | SmolLM2-360M-Instruct | SmolLM2-1.7B-Instruct |
|---|---|---|---|
| IFEval (average) | 29.9 | 41.0 | 56.7 |
| MT-Bench | 1.98 | 3.66 | 6.13 |
| HellaSwag | 40.9 | 52.1 | not reported in card extract |
| ARC (average) | 37.3 | 43.7 | not reported |
| PIQA | 66.3 | not reported | not reported |
| GSM8K (5-shot) | not reported | not reported | 48.2 |
The instruct numbers show the largest gains over SmolLM 1. The 135M-Instruct's IFEval score nearly doubled from 17.2 in SmolLM 1 to 29.9 in SmolLM 2, and the 1.7B-Instruct's IFEval of 56.7 and MT-Bench of 6.13 put it in usable chat-assistant territory for a model of its size.
Each base checkpoint has a matching instruct version. The post-training pipeline is the same across sizes: supervised fine-tuning on the SmolTalk corpus, followed by Direct Preference Optimization on UltraFeedback. The 1.7B instruct variant also includes function-calling fine-tuning using Argilla's Synth-APIGen-v0.1 dataset, which gives it a usable tool-call output format compatible with OpenAI-style function schemas.
The instruct chat template uses the ChatML-style format with explicit system, user, and assistant roles. For the 1.7B variant, the team published a short cookbook of recipes for common applications: a retrieval-augmented chatbot, a summarization pipeline, a rewriting agent for grammar and tone, and a JSON-mode function caller. The 360M instruct model targets a slightly different niche: a quick local classifier or short-form generator that can run with minimal hardware, where the 1.7B is overkill but the 135M is too small to instruction-follow reliably.
Intended use cases listed in the model cards include on-device assistants, summarization, paraphrasing and rewriting, classification, and lightweight function calling. The team is explicit that the models are primarily English, can produce factual errors, and should be treated as assistive rather than authoritative. The 135M instruct in particular is described in the model card as best for narrow tasks with consistent prompts rather than open-ended chat.
The SmolLM 2 instruct models are widely available in quantized form. The Hugging Face Hub hosts dozens of community GGUF, AWQ, and GPTQ conversions, and the models are packaged in Ollama, LM Studio, and llama.cpp out of the box, making them one of the more frictionless small models to run locally. The 1.7B instruct in 4-bit GGUF form occupies roughly 1.1 GB on disk and runs at well over 20 tokens per second on a modern laptop CPU, which is part of why it became a common pick for offline-first applications and demos.
All six SmolLM 2 checkpoints (three base, three instruct) are released under the Apache 2.0 license, which permits commercial use, redistribution, and fine-tuning without revenue thresholds or use-case restrictions. The training datasets released alongside the model (FineMath, Stack-Edu, SmolTalk) are also open and live on the Hugging Face Hub.
This is in contrast to several competing small models. Llama 3.2 is released under Meta's community license, which has revenue caveats and acceptable-use restrictions. Phi-3 is MIT-licensed. Qwen 2.5 small variants are mostly Apache 2.0 as of the late-2024 releases, with the exception of the 3B which uses a separate Qwen Research license. SmolLM 2 is therefore one of the more permissive small models available, which has made it popular as a starting point for fine-tuning experiments.
The SmolLM 2 launch positioned the family against three lines of small models released around the same time: Meta's Llama 3.2 small variants, Alibaba's Qwen 2.5 small variants, and Microsoft's Phi-3 mini family. The comparisons below are from the SmolLM 2 model cards and the team's announcement post, which are the only first-party sources for these numbers; independent reviewers have reproduced parts of the comparison but not all of it.
| Model | Parameters | Pretraining tokens | License | IFEval reported by maker |
|---|---|---|---|---|
| SmolLM2-1.7B-Instruct | 1.7B | 11T | Apache 2.0 | 56.7 |
| Llama 3.2 1B Instruct | 1.24B | 9T (Llama 3.2 family) | Llama 3.2 community license | reported by Meta |
| Qwen 2.5 1.5B Instruct | 1.54B | ~18T (Qwen 2.5 family) | Apache 2.0 | reported by Alibaba |
| Phi-3 mini (3.8B) | 3.8B | 3.3T | MIT | reported by Microsoft |
The SmolLM 2 team's claim, repeated in the technical report and the announcement blog, is that SmolLM2-1.7B-Instruct outperforms Llama 3.2 1B and Qwen 2.5 1.5B on their reported instruction-following and reasoning benchmarks. Phi-3 mini is not a direct comparison because it has more than twice the parameter count of SmolLM2-1.7B; the team includes it for context rather than as a head-to-head.
It is worth flagging that benchmark rankings between these models shift substantially depending on which evaluation harness, prompt template, and decoding settings are used. Independent leaderboards have given different orderings, and several reviewers have argued that the practical differences between SmolLM 2, Llama 3.2, and Qwen 2.5 at the sub-2B scale are small. Where SmolLM 2 has stood out most clearly in third-party tests is on the smaller end (135M and 360M), where it has fewer well-resourced competitors.
Reception of SmolLM 2 was broadly positive in the open-source AI community. Simon Willison covered the release the same day, called his first impressions "really positive," and emphasized that the 135M model is small enough to run effectively on a phone. VentureBeat, MarkTechPost, and a number of other outlets framed the release as a notable step in the on-device LLM trend, noting that the 1.7B variant runs comfortably on consumer laptops without GPU acceleration.
The research reception, once the technical report appeared in February 2025, focused on the data-centric methodology. The paper became a frequent reference for follow-up work on small-model training recipes, and the SmolTalk, FineMath, and Stack-Edu datasets have been reused in other small-model projects. The combined download counts for the SmolLM 2 checkpoints on the Hugging Face Hub run into the millions per month, and the model has spawned hundreds of community fine-tunes, quantizations, and adapter releases.
Criticism has been mostly about scope rather than execution. The models are primarily English, do not have native multimodal capability, and have a relatively short 8k context window compared to larger contemporaries. The Hugging Face team has acknowledged each of these limits in the model cards and has positioned multilingual coverage, longer context, and multimodal extensions as work for follow-up releases, including the later SmolLM 3 line.