SmolLM
Last reviewed
May 16, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 3,946 words
SmolLM is the first generation of a family of small, fully open language models released by Hugging Face on July 16, 2024. The line covers three sizes (135 million, 360 million, and 1.7 billion parameters), all trained on a curated mixture called SmolLM-Corpus, and all distributed under the Apache 2.0 license. The project was led by Loubna Ben Allal, Anton Lozhkov, and Elie Bakouch as part of Hugging Face's Smol Models Research effort (the HuggingFaceTB account on the Hub), with the stated goal of showing that careful data curation can let very small models reach or beat much larger systems on common sense reasoning and world knowledge benchmarks.
The release came at a moment when small language models had become a serious research subfield rather than just a hobbyist concern. Microsoft's Phi line had demonstrated that a synthetic textbook style corpus could lift sub-2B models to competitive scores on reasoning tests. Meta's MobileLLM paper had argued that depth, embedding tying, and grouped-query attention were the main levers for sub-1B accuracy. SmolLM stitched those threads together with Hugging Face's own data infrastructure (the FineWeb pipeline and the Cosmopedia synthetic corpus) and shipped a fully open package: weights, training framework, datasets, tokenizer, and recipes.
Within a few months SmolLM was superseded by SmolLM 2 in November 2024 and later by SmolLM 3 in July 2025, but the first generation remains the reference point for the family and for a wider category of edge friendly language models from 2024.
| Field | Value |
|---|---|
| Developer | Hugging Face (Smol Models Research, HuggingFaceTB) |
| Initial release | July 16, 2024 |
| Sizes | 135M, 360M, 1.7B parameters |
| Architecture | Decoder only transformer (causal language model) |
| Context length | 2,048 tokens |
| Tokenizer | cosmo2-tokenizer, vocabulary 49,152 |
| Training framework | Nanotron |
| Training hardware | 64 NVIDIA H100 GPUs |
| Pretraining tokens | 600 billion (135M and 360M), 1 trillion (1.7B) |
| Precision | bfloat16 |
| Training dataset | SmolLM-Corpus (Cosmopedia v2, FineWeb-Edu, Python-Edu) |
| License | Apache 2.0 |
| Lead authors | Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Leandro von Werra, Thomas Wolf |
| Repository | huggingface.co/HuggingFaceTB |
| Blog post | huggingface.co/blog/smollm |
By mid 2024 the open weights landscape had two clear poles. At the top were 7 billion parameter models such as Llama 2 7B, Mistral 7B, and Qwen 2 7B, which were treated as the default "small" tier even though their memory footprints were still uncomfortable for consumer hardware. At the very bottom were sub-1B research models such as Pythia 1B, TinyLlama 1.1B, and various Cerebras and BLOOM ports, none of which were strong enough on reasoning benchmarks to be useful as general assistants. The middle ground (capable models small enough for phones, browsers, and CPUs) was occupied by releases such as Meta's MobileLLM and Microsoft's Phi-1.5, with limited data transparency in both cases.
Hugging Face's Smol Models Research team had spent the first half of 2024 building two pieces of infrastructure that turned out to be the prerequisites for a fully open small model. The first was FineWeb, a 15 trillion token open web crawl with a derived educational quality subset called FineWeb-Edu, selected by a classifier trained on Llama 3 70B annotations. The second was Cosmopedia, a 25 billion token synthetic corpus modelled on the Phi recipe but generated with Mixtral-8x7B-Instruct and released openly. Both datasets shipped under permissive licenses with full filtering scripts.
With those two ingredients in place, plus a code subset filtered from The Stack by a similar educational classifier (Python-Edu), the team had enough data to train a small model from scratch and document every step. The SmolLM blog post, published on July 16, 2024, framed the effort as a demonstration that "meticulously curated data can lead to high performance even with smaller model sizes," with the implicit argument that the gap between closed and open small models was a data engineering problem rather than a compute problem. The release came one day after Mistral AI shipped Mathstral 7B, and about a week before the launch of Llama 3.1, giving SmolLM a brief uncontested window in the on-device tier.
The SmolLM family at launch consisted of three sizes, each in a base pretrained version and an instruction tuned variant.
| Model | Parameters | Layers | Hidden size | Attention heads | Attention type | Context | Training tokens |
|---|---|---|---|---|---|---|---|
| SmolLM-135M | 135M | 30 | 576 | 9 | Grouped Query Attention | 2,048 | 600B |
| SmolLM-360M | 360M | 32 | 960 | 15 | Grouped Query Attention | 2,048 | 600B |
| SmolLM-1.7B | 1.7B | 24 | 2,048 | 32 | Multi Head Attention | 2,048 | 1T |
The two smaller models followed the MobileLLM design recipe, which prioritises depth over width for parameter efficiency at sub-1B scale, and used Grouped Query Attention with embedding tying to reduce memory. The 1.7B model used a more conventional layout closer to Llama 2 style transformers, with multi head attention and a wider hidden dimension. All three shared the same 49,152 token cosmo2 tokenizer, trained on the SmolLM-Corpus mixture so that the vocabulary was well matched to the training distribution.
Each base model was paired with a SmolLM-Instruct version produced by supervised fine tuning on a mixture of WebInstructSub (the permissive subset), StarCoder2-Self-OSS-Instruct, OpenHermes 2.5, and Everyday Conversations, followed by Direct Preference Optimization. The DPO mix differed by size, with HelpSteer used for the 135M and 1.7B instruct checkpoints and argilla/dpo-mix-7k used for 360M, a choice motivated by ablations on small scale post training stability.
The SmolLM-Corpus (also referred to as Cosmo-Corpus in the model cards) is the central artifact of the release. It combines three openly published datasets, each filtered or generated by the Hugging Face team, and is itself available on the Hub under permissive licenses.
Cosmopedia v2 is a 28 billion token synthetic dataset of textbooks, stories, articles, and code snippets generated by Mixtral-8x7B-Instruct-v0.1. It comprises around 39 million documents grouped under 34,000 topics, with the topic taxonomy derived from the BISAC book classification system. Audience targeting is fixed in the prompt for each document, with roughly 40 percent of generations aimed at a middle school reading level, 30 percent at college level, and 30 percent at mixed or other styles. The middle school subset turned out to be the strongest contributor on benchmarks other than MMLU, a finding the team reported in detail in the blog post.
Cosmopedia v2 is an evolution of the original Cosmopedia released in March 2024. The v2 release improved topic coverage, deduplicated overlapping documents, and tuned the prompt templates after the v1 run revealed gaps in coverage of mathematics, code, and current affairs. At the time of SmolLM's launch, Cosmopedia v2 was the largest fully open synthetic pretraining corpus that had been released.
FineWeb-Edu is the educational quality subset of FineWeb, Hugging Face's open Common Crawl based pretraining corpus. The full FineWeb is around 15 trillion tokens; FineWeb-Edu is the 1.3 trillion token slice that scored highly for educational value on a classifier trained on Llama 3 70B annotations. For SmolLM, the team further deduplicated FineWeb-Edu and used the resulting 220 billion token subset as the largest single ingredient in the training mix.
The choice to lean heavily on FineWeb-Edu was based on ablations in the FineWeb-Edu technical report, which showed that training on the filtered subset reached the same downstream accuracy as training on a much larger unfiltered crawl. For a small model with a fixed token budget, this kind of quality filtering was effectively a way to buy more capability without more compute.
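The filtering step described above can be pictured as a score-and-threshold pass over the crawl. The following is a minimal sketch only: `score_fn`, `toy_score`, and the threshold value are hypothetical stand-ins, and nothing here reproduces the actual FineWeb-Edu classifier or its cut-off.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> Iterator[str]:
    """Yield only the documents whose score meets the threshold."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(doc: str) -> float:
    """Hypothetical stand-in scorer: counts a few 'teaching' words.

    A real pipeline would call a trained quality classifier here.
    """
    keywords = ("theorem", "example", "explain", "definition")
    text = doc.lower()
    return float(sum(text.count(k) for k in keywords))


corpus = [
    "Buy cheap watches now!!!",
    "Definition: a prime has exactly two divisors. Example: 7.",
    "We explain the theorem with a worked example.",
]
# The threshold is illustrative, not the value used for FineWeb-Edu.
kept = list(filter_by_quality(corpus, toy_score, threshold=2.0))
print(f"kept {len(kept)} of {len(corpus)} documents")
```

Because the filter is a generator, it can stream over a corpus far larger than memory; the only expensive part in practice would be the classifier call inside `score_fn`.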
Python-Edu is a 4 billion token subset of The Stack v2, filtered by a code educational quality classifier trained on annotations from Llama 3 70B. The classifier rated Python files for their suitability as teaching material, and the Python-Edu set kept the top scoring 4 billion tokens. The Hugging Face team reported in the blog post that training on this filtered subset converged about three times faster than training on the unfiltered raw Python from The Stack, measured against HumanEval pass@1.
The final SmolLM-Corpus mixture is dominated by FineWeb-Edu (220 billion tokens) with Cosmopedia v2 contributing 28 billion and Python-Edu contributing 4 billion. The 135M and 360M models passed over this mixture for roughly 600 billion tokens of training, well beyond the Chinchilla optimal point for their sizes, while the 1.7B model trained for 1 trillion tokens.
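The figures above imply some simple arithmetic about repetition and token budgets, sketched below. Two assumptions are mine rather than the article's: the "about 20 tokens per parameter" reading of the Chinchilla result, and the idea that the mixture is sampled in proportion to component size (the actual sampling weights are not stated here, so the pass count is only an average).

```python
# Token counts for the three SmolLM-Corpus components, as stated above.
CORPUS_TOKENS = {
    "FineWeb-Edu (deduplicated)": 220e9,
    "Cosmopedia v2": 28e9,
    "Python-Edu": 4e9,
}

# model -> (nominal parameter count, training tokens)
TRAINING_RUNS = {
    "SmolLM-135M": (135e6, 600e9),
    "SmolLM-360M": (360e6, 600e9),
    "SmolLM-1.7B": (1.7e9, 1e12),
}

# Assumption: the common rule-of-thumb reading of Chinchilla scaling.
CHINCHILLA_TOKENS_PER_PARAM = 20

unique_tokens = sum(CORPUS_TOKENS.values())


def summarize(params: float, tokens: float) -> dict:
    """Derive repetition and budget ratios for one training run."""
    return {
        "passes_over_corpus": tokens / unique_tokens,
        "tokens_per_param": tokens / params,
        "multiple_of_chinchilla": tokens / (CHINCHILLA_TOKENS_PER_PARAM * params),
    }


summary = {name: summarize(p, t) for name, (p, t) in TRAINING_RUNS.items()}
for name, row in summary.items():
    print(name, {key: round(value, 1) for key, value in row.items()})
```

Under these assumptions the corpus holds about 252 billion unique tokens, so the two smaller models see it a little over twice on average and the 1.7B model roughly four times.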
All three SmolLM models are decoder only transformers with the standard pre normalisation, RoPE position embeddings, and SwiGLU feed forward blocks that had become the open weights default by 2024. The differences across sizes are mostly in depth, width, and attention layout.
The 135M model has 30 transformer layers with a hidden size of 576, intermediate FFN size of 1,536, and 9 attention heads using Grouped Query Attention with 3 key value heads. The 360M version is 32 layers deep with a hidden size of 960, 2,560 wide FFN, and 15 query heads against 5 key value heads. Both small models use embedding tying, where the input embedding matrix and the output projection share weights, which saves a noticeable fraction of total parameters at this scale.
The 1.7B model breaks with the MobileLLM template. It has 24 transformer layers, a hidden size of 2,048, an intermediate FFN width of 8,192, and 32 multi head attention heads without grouping. Embedding tying is still used. The reason for the different layout, according to the blog post, is that at 1.7B the additional inference cost of full multi head attention is acceptable on the target hardware (laptops, mid range smartphones), and the wider per-layer representation helps with downstream tasks that the smaller, deeper models struggled with.
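The layer counts and widths quoted above are enough to estimate each model's parameter count and to see what embedding tying saves. This is a rough sketch under assumptions not spelled out in the text: a Llama-style block with no bias terms, one RMSNorm weight vector before attention and one before the feed forward block, a final norm, and a head dimension equal to hidden size divided by query heads.

```python
from dataclasses import dataclass

VOCAB_SIZE = 49_152  # cosmo2 tokenizer vocabulary


@dataclass(frozen=True)
class Config:
    layers: int
    hidden: int
    ffn: int
    q_heads: int
    kv_heads: int


CONFIGS = {
    "SmolLM-135M": Config(30, 576, 1_536, 9, 3),
    "SmolLM-360M": Config(32, 960, 2_560, 15, 5),
    "SmolLM-1.7B": Config(24, 2_048, 8_192, 32, 32),
}


def count_params(cfg: Config, tied_embeddings: bool = True) -> int:
    """Estimate total parameters for an assumed Llama-style decoder."""
    head_dim = cfg.hidden // cfg.q_heads
    kv_dim = cfg.kv_heads * head_dim
    # Q and O projections are hidden x hidden; K and V shrink with fewer KV heads.
    attention = 2 * cfg.hidden * cfg.hidden + 2 * cfg.hidden * kv_dim
    mlp = 3 * cfg.hidden * cfg.ffn   # gate, up, and down projections (SwiGLU)
    norms = 2 * cfg.hidden           # two norm weight vectors per block
    embedding = VOCAB_SIZE * cfg.hidden
    total = cfg.layers * (attention + mlp + norms) + embedding + cfg.hidden  # + final norm
    if not tied_embeddings:
        total += embedding           # separate output projection matrix
    return total


for name, cfg in CONFIGS.items():
    tied = count_params(cfg)
    untied = count_params(cfg, tied_embeddings=False)
    print(f"{name}: {tied / 1e6:.1f}M tied, {untied / 1e6:.1f}M untied")
```

Under these assumptions the estimates come out at roughly 134.5M, 361.8M, and 1.71B parameters. For the 135M model, an untied output projection would add about 28M parameters, around a fifth of the tied total, which illustrates why tying matters most at the small end.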
All three models use a 2,048 token context window, which was already short compared to the 8,192 and 32,768 token contexts becoming standard in larger 2024 open releases. The choice was justified in the blog post as a deliberate trade off; longer contexts would have increased training cost per token and reduced memory headroom on edge hardware, while most expected use cases (assistants, summarisers, autocomplete) fit comfortably in 2,048 tokens.
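The memory side of this trade off can be made concrete by estimating the key/value cache for one full-length sequence. The sketch below is a back-of-envelope calculation, assuming bfloat16 cache entries and a head dimension of hidden size divided by query heads; the "full MHA" figure for the two smaller models is a hypothetical comparison, not a configuration that was released.

```python
BYTES_PER_VALUE = 2   # bfloat16
CONTEXT = 2_048       # tokens

# (layers, hidden size, query heads, key/value heads) from the text above
MODELS = {
    "SmolLM-135M": (30, 576, 9, 3),
    "SmolLM-360M": (32, 960, 15, 5),
    "SmolLM-1.7B": (24, 2_048, 32, 32),
}


def kv_cache_bytes(layers, hidden, q_heads, kv_heads, tokens=CONTEXT):
    """Bytes needed to cache keys and values for a single sequence."""
    head_dim = hidden // q_heads
    per_token = 2 * layers * kv_heads * head_dim * BYTES_PER_VALUE  # K and V
    return per_token * tokens


for name, (layers, hidden, q_heads, kv_heads) in MODELS.items():
    actual = kv_cache_bytes(layers, hidden, q_heads, kv_heads)
    ungrouped = kv_cache_bytes(layers, hidden, q_heads, q_heads)  # hypothetical
    print(f"{name}: {actual / 2**20:.0f} MiB "
          f"(full MHA would need {ungrouped / 2**20:.0f} MiB)")
```

The cache grows linearly with context length, so doubling the window would double these figures; with three query heads per key/value head, grouping cuts the cache of the two smaller models to a third of the ungrouped size.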
Training used the Nanotron framework, Hugging Face's open source 3D parallel training library, on 64 NVIDIA H100 GPUs in bfloat16 precision. The learning rate schedule was a trapezoidal warmup, constant, and cooldown shape, with the cooldown phase covering the final 20 percent of the training run. This shape, sometimes called WSD (warmup-stable-decay), had been popularised by the MiniCPM team in early 2024 and was adopted across the SmolLM family.
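The trapezoidal shape is easy to express as a function of the training step. The sketch below is illustrative only: the 20 percent cooldown comes from the description above, while the linear warmup and cooldown shapes, the warmup length, the peak learning rate, and the step counts are placeholder assumptions rather than SmolLM's actual hyperparameters.

```python
def trapezoidal_lr(step, total_steps, peak_lr, warmup_steps, decay_fraction=0.2):
    """Learning rate at `step` for a warmup / constant / cooldown schedule."""
    decay_steps = int(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start or decay_steps == 0:
        # Constant plateau at the peak learning rate.
        return peak_lr
    # Linear cooldown towards zero over the final fraction of training.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


# Placeholder values chosen only to show the shape.
for step in (0, 99, 500, 800, 900, 999):
    lr = trapezoidal_lr(step, total_steps=1_000, peak_lr=1.0, warmup_steps=100)
    print(step, round(lr, 3))
```

One practical appeal of this shape is that the plateau does not depend on the total step count, so a run can be extended or branched from a plateau checkpoint and only the cooldown needs to be rerun.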
The headline benchmarks reported in the SmolLM blog post place each model at or near the top of its respective size class on standard zero shot and few shot reasoning tasks. Scores are reported on a fixed evaluation pipeline using lighteval, with HellaSwag, PIQA, OpenBookQA, WinoGrande, ARC, MMLU (cloze formulation), and CommonsenseQA as the core suite. The team emphasised that they used the cloze MMLU formulation rather than the multiple choice version often reported by other small models, because small models with limited instruction following ability tend to score artificially well on the formatted multiple choice version through label memorisation.
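The difference between the two MMLU formulations is easiest to see in how the scoring requests are built. The sketch below is a simplified illustration: the prompt templates are invented for this example and are not lighteval's, and `loglik` is a hypothetical callable standing in for a model's log-likelihood of a continuation given a context.

```python
from typing import Callable, List, Tuple

Request = Tuple[str, str]  # (context, continuation to score)
LETTERS = "ABCD"


def cloze_requests(question: str, choices: List[str]) -> List[Request]:
    """Cloze style: the model scores the full text of each answer."""
    context = f"Question: {question}\nAnswer:"
    return [(context, f" {choice}") for choice in choices]


def multiple_choice_requests(question: str, choices: List[str]) -> List[Request]:
    """Multiple choice style: options are listed and the model scores a letter."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    context = f"Question: {question}\n{options}\nAnswer:"
    return [(context, f" {LETTERS[i]}") for i in range(len(choices))]


def predict(requests: List[Request], loglik: Callable[[str, str], float]) -> int:
    """Return the index of the continuation with the highest log-likelihood."""
    scores = [loglik(context, continuation) for context, continuation in requests]
    return max(range(len(scores)), key=scores.__getitem__)


question = "What is the capital of France?"
choices = ["Berlin", "Paris", "Rome", "Madrid"]
cloze = cloze_requests(question, choices)
mcf = multiple_choice_requests(question, choices)
print(cloze[1])
print(mcf[1])
```

In the cloze form the model never sees the option letters, so its score depends on how plausible the answer text itself is; in the multiple choice form it must map its answer to a label. Real evaluation harnesses typically add few shot examples and may length-normalise cloze scores, both of which are omitted here.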
The 135M tier was benchmarked against Meta's MobileLLM-125M and other small dense baselines. SmolLM-135M outperformed MobileLLM-125M across the suite despite training on 600 billion tokens against MobileLLM's 1 trillion, a result the team attributed to the higher quality of the SmolLM-Corpus relative to the dataset behind MobileLLM.
The 360M model was positioned against models under 500 million parameters, including MobileLLM-350M, Qwen2-500M, and Pythia-410M. SmolLM-360M outperformed all of them on the average benchmark score and was within a few points of much larger models like TinyLlama 1.1B and Pythia 1B.
The 1.7B tier was the most contested. SmolLM-1.7B was compared against Phi-1.5, MobileLLM-1.5B, and Qwen2-1.5B, and won on the Hugging Face reported average. The blog post presented the per-benchmark scores in a single chart and table. Because exact numbers are only comparable when every model is evaluated under the same lighteval configuration, this article reports only the ranking claims that Hugging Face made at launch.
| Comparison | Result reported by Hugging Face |
|---|---|
| SmolLM-135M vs MobileLLM-125M | SmolLM wins on average across the standard suite |
| SmolLM-360M vs all sub-500M models | SmolLM wins on average across MobileLLM-350M, Qwen2-500M, Pythia-410M |
| SmolLM-1.7B vs Phi-1.5 | SmolLM wins on average across the standard suite |
| SmolLM-1.7B vs Qwen2-1.5B | SmolLM wins on most reasoning tasks, trails on HumanEval (24 pass@1 vs 31.1 pass@1) |
| SmolLM-1.7B vs MobileLLM-1.5B | SmolLM wins on average |
For instruction tuned variants, Hugging Face reported IFEval results in which SmolLM-1.7B-Instruct sat below Qwen2-1.5B-Instruct (which scored 29.94 on Prompt Strict Accuracy) but described the SmolLM Instruct line as offering "a good balance between model size and performance" using only publicly available post training datasets. The team explicitly noted that Qwen 2's stronger instruct numbers came from a much larger and partially closed post training mix.
Memory footprints follow directly from parameter count: at bfloat16 precision the weights take about two bytes per parameter, or roughly 270 megabytes for the 135M model, 720 megabytes for the 360M, and 3.4 gigabytes for the 1.7B, with full 32-bit precision doubling those figures and activation overhead adding to them at run time. The team published 8-bit and 4-bit quantised versions, with the 1.7B model dropping to about 1 gigabyte at 4-bit precision, sufficient to fit on an iPhone 15 with 6 gigabytes of unified memory.
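A quick way to sanity check footprint figures like these is to multiply parameter count by bytes per parameter. The sketch below gives a weights-only lower bound using the nominal model sizes; it ignores activations, the KV cache, and runtime buffers, and quantised checkpoints often keep some tensors in higher precision, so measured footprints will be larger.

```python
# Nominal parameter counts from the model names.
PARAMS = {"SmolLM-135M": 135e6, "SmolLM-360M": 360e6, "SmolLM-1.7B": 1.7e9}
BITS = {"float32": 32, "bfloat16": 16, "int8": 8, "int4": 4}


def weight_megabytes(n_params: float, bits: int) -> float:
    """Storage for the weights alone, in decimal megabytes."""
    return n_params * bits / 8 / 1e6


for name, n_params in PARAMS.items():
    row = ", ".join(
        f"{fmt}: {weight_megabytes(n_params, bits):,.0f} MB"
        for fmt, bits in BITS.items()
    )
    print(f"{name} -> {row}")
```

Halving the bit width halves the weight storage, which is why 4-bit quantisation brings even the 1.7B model within reach of phones with a few gigabytes of memory.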
Every component of the SmolLM release is licensed under Apache 2.0. This includes the three base models, the three SmolLM-Instruct variants, the cosmo2 tokenizer, the SmolLM-Corpus (including all three sub-corpora), the Nanotron training framework, the FineWeb-Edu and Python-Edu classifiers, and the WebGPU demo Spaces published at launch.
Apache 2.0 permits commercial use, redistribution, modification, and the creation of derivative works, subject to preservation of the license notice and attribution. There are no acceptable use clauses, no field of use restrictions, and no user count thresholds, in contrast to Meta's Llama Community License which gates large commercial users behind a separate agreement. The decision to license the synthetic Cosmopedia v2 corpus permissively is particularly notable, since most synthetic pretraining datasets at the time were either not released or released under research only terms.
The permissive license is one of the main reasons SmolLM was adopted as a baseline in third party academic work on small models, including in MobileLLM follow ups from Meta and the LFM line from Liquid AI. The Apache 2.0 framing also made it straightforward to integrate SmolLM into commercial on-device products, with at least a handful of mobile keyboard apps and developer tools picking up the 135M or 360M variants for autocomplete and summarisation features in late 2024.
At launch, SmolLM was the most data transparent open release in its size class. The table below summarises the main contemporaries.
| Model | Parameters | Release | License | Data published | Training tokens |
|---|---|---|---|---|---|
| SmolLM 135M | 135M | Jul 2024 | Apache 2.0 | Yes | 600B |
| SmolLM 360M | 360M | Jul 2024 | Apache 2.0 | Yes | 600B |
| SmolLM 1.7B | 1.7B | Jul 2024 | Apache 2.0 | Yes | 1T |
| MobileLLM 125M | 125M | Jun 2024 | Meta License | No | 1T |
| MobileLLM 350M | 350M | Jun 2024 | Meta License | No | 1T |
| MobileLLM 1.5B | 1.5B | Jun 2024 | Meta License | No | 1T |
| Phi-1.5 | 1.3B | Sep 2023 | MIT | No | 150B |
| Phi-2 | 2.7B | Dec 2023 | MIT | No | 1.4T |
| Qwen2-0.5B | 500M | Jun 2024 | Apache 2.0 | No | 12T |
| Qwen2-1.5B | 1.5B | Jun 2024 | Apache 2.0 | No | 7T |
| TinyLlama 1.1B | 1.1B | Sep 2023 | Apache 2.0 | Yes | 3T |
| Pythia 410M / 1B | 410M / 1B | Apr 2023 | Apache 2.0 | Yes (The Pile) | 300B |
Against Meta's MobileLLM line, SmolLM offered comparable or better quality at smaller token budgets, with the major difference being that MobileLLM did not release its training data or filtering scripts. Against Microsoft's Phi-3 line, SmolLM was clearly less capable in absolute terms (Phi-3-mini at 3.8B parameters was the closest competitor by that month, and trained on 3.3 trillion tokens of mostly proprietary data), but SmolLM was also smaller and fully open. Against Alibaba's Qwen 2 sub-2B tier, SmolLM was competitive on common sense reasoning but lagged on coding and on Chinese language tasks, reflecting the Anglophone bias of the SmolLM-Corpus.
The closest direct comparison in terms of release philosophy was TinyLlama, which had also published its training data and code. TinyLlama, however, used the much larger 3 trillion token RedPajama and SlimPajama corpora without the educational filtering step, and the SmolLM team's ablations suggested that this hurt downstream accuracy at small scale.
SmolLM was treated by Hugging Face as the first stage of a multi-year small models program rather than a single release. Within four months it was succeeded by SmolLM 2, unveiled on November 2, 2024, which kept the same three sizes (135M, 360M, 1.7B) but trained on a much larger corpus, up to 11 trillion tokens for the 1.7B model, that incorporated FineWeb-Edu, DCLM Baseline, the original Stack, and new mathematics and code datasets such as FineMath. SmolLM 2 reported large improvements over SmolLM 1 across the same benchmark suite, particularly on math (GSM8K) and code (HumanEval), and the SmolLM 2 instruct variants used a more sophisticated post training mix.
SmolLM 3 followed on July 8, 2025, in a single 3 billion parameter size. SmolLM 3 added a 128,000 token context window, dual mode reasoning (a thinking mode and a fast mode in a single checkpoint), and explicit multilingual training covering six languages natively (English, French, Spanish, German, Italian, Portuguese) and three additional languages (Arabic, Chinese, Russian) with smaller token shares. The architecture moved away from the SmolLM 1 layout to a wider 3B configuration closer to Qwen 2.5 and Llama 3.2, and the training corpus was expanded again to incorporate web math, code, and reasoning traces.
The broader SmolLM program also spun off into multimodal work, with the SmolVLM family of small vision language models released in November 2024, and into agentic and robotics applications including SmolAgents and the later SmolVLA action model. None of those derivatives shares weights with the original SmolLM, but they reuse the data engineering and training pipeline pioneered in the first SmolLM release.
Reception of the July 2024 release was strongly positive in the open weights community. The blog post was widely shared, and the three model checkpoints together accumulated several million downloads from the Hugging Face Hub in the months after launch, with the 135M and 360M variants seeing particularly heavy use as research baselines and as test loads for inference frameworks. Loubna Ben Allal, the lead author, was widely interviewed and gave talks on the project at the 2024 Open Source AI Forum and at the AI Engineer World's Fair.
The two most cited reasons for the favourable reception were the unusual completeness of the data release (Cosmopedia v2, FineWeb-Edu, and Python-Edu were all openly published) and the demonstration that quality filtering could substitute for raw token count at small scale. The blog post's finding that middle school targeted synthetic textbooks were the strongest contributor on most reasoning benchmarks attracted particular attention as a counter-intuitive empirical result.
Independent inference framework support arrived quickly. The llama.cpp project added GGUF conversion within days, ONNX exports were published on the Hub almost immediately, and WebGPU demos using transformers.js were shipped by Hugging Face itself on launch day, making SmolLM the first small model series with a fully browser based inference path on day one. By late 2024 the smaller SmolLM variants had been picked up in third party mobile keyboard apps, browser extensions for text rewriting, and as bootstrap baselines for academic groups studying alternatives to standard transformer pretraining recipes.
Criticism focused on three points. The 2,048 token context length was short for general assistant use even by mid 2024 standards. The model card disclaimers about factual accuracy were more conservative than competing releases, reflecting the team's honest stance about how much knowledge a sub-2B parameter model can actually hold. And the SmolLM-Instruct variants, while solid, were noticeably weaker than Qwen 2's much more heavily post trained chat models, a gap the SmolLM team acknowledged in the blog post and explicitly attributed to using only publicly available SFT and DPO datasets. The subsequent SmolLM 2 release in November 2024 closed several of these gaps, particularly around math and coding ability, but the 2,048 context window persisted until the SmolLM 3 launch in mid 2025.
SmolLM's longer term influence is best seen indirectly. The FineWeb-Edu filtering approach was adopted by several subsequent open model programs, the cosmo2 tokenizer became a reference vocabulary for sub-2B research models, and the trapezoidal learning rate schedule (WSD) used in the SmolLM run became one of the standard small model schedules over the next year. The blog post itself was cited in the Falcon 3 technical report, in the OLMo 2 technical report, in Meta's MobileLLM v2 paper, and in numerous community fine tunes that used SmolLM checkpoints as a starting point.