SmolLM

AI Models Large Language Models Open Source AI Small Language Models

23 min read

Updated Jun 27, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 27, 2026

Fact-checked

In review queue

Sources

18 citations

Revision

v3 · 4,580 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

SmolLM is a family of small, fully open language models released by Hugging Face on July 16, 2024 in three sizes, 135 million, 360 million, and 1.7 billion parameters, all trained on a curated open dataset called SmolLM-Corpus and all distributed under the Apache 2.0 license. ^[1]^[2] It is "fully open" in the strongest sense: Hugging Face published the weights, the training datasets, the tokenizer, the training framework, and the recipes, so the entire model can be reproduced end to end. The project was led by Loubna Ben Allal, Anton Lozhkov, and Elie Bakouch as part of Hugging Face's Smol Models Research effort (the HuggingFaceTB account on the Hub), with the stated goal of showing that careful data curation can let very small models reach or beat much larger systems on common sense reasoning and world knowledge benchmarks. ^[1]

Hugging Face framed the release around a single empirical claim, that "meticulously curated data can lead to high performance even with smaller model sizes," arguing that the gap between closed and open small models was a data engineering problem rather than a compute problem. ^[1] The line was later extended by SmolLM 2 (November 2024) and SmolLM 3 (a 3-billion-parameter reasoning model, July 2025), but the first generation remains the reference point for the family and for a wider category of edge friendly small language models from 2024.

The release came at a moment when small language models had become a serious research subfield rather than just a hobbyist concern. Microsoft's Phi line had demonstrated that a synthetic textbook style corpus could lift sub-2B models to competitive scores on reasoning tests. Apple's MobileLLM paper had argued that depth, embedding tying, and grouped-query attention were the main levers for sub-1B accuracy. ^[17] SmolLM stitched those threads together with Hugging Face's own data infrastructure (the FineWeb pipeline and the Cosmopedia synthetic corpus) and shipped a fully open package: weights, training framework, datasets, tokenizer, and recipes. ^[1]

What is SmolLM?

SmolLM is Hugging Face's first generation of compact, on-device language models, released as three base checkpoints (135M, 360M, and 1.7B parameters) plus matching instruction-tuned variants, under a permissive Apache 2.0 license. ^[1]^[2] Each model is a decoder only transformer with a 2,048 token context window and a shared 49,152 token cosmo2 tokenizer, trained from scratch on the openly published SmolLM-Corpus. ^[1] The defining characteristic, relative to contemporaries like Apple's MobileLLM or Microsoft's Phi, is total transparency: the training data, the filtering classifiers, the Nanotron training code, and the WebGPU inference demos were all released openly alongside the weights. ^[1]^[10]

The practical pitch was on-device AI. At full bfloat16 precision the 135M checkpoint needs roughly 520 megabytes of memory, the 360M needs about 1.4 gigabytes, and the 1.7B needs roughly 3.4 gigabytes, and quantised builds shrink the 1.7B to about 1 gigabyte at 4-bit, small enough to run inside a browser or on a mid-range smartphone. ^[1] Hugging Face shipped browser based WebGPU demos using transformers.js on launch day, making SmolLM the first small model series with a fully browser based inference path on day one. ^[1]

Infobox

Field	Value
Developer	Hugging Face (Smol Models Research, HuggingFaceTB)
Initial release	July 16, 2024
Sizes	135M, 360M, 1.7B parameters
Architecture	Decoder only transformer (causal language model)
Context length	2,048 tokens
Tokenizer	cosmo2-tokenizer, vocabulary 49,152
Training framework	Nanotron
Training hardware	64 NVIDIA H100 GPUs
Pretraining tokens	600 billion (135M and 360M), 1 trillion (1.7B)
Precision	bfloat16
Training dataset	SmolLM-Corpus (Cosmopedia v2, FineWeb-Edu, Python-Edu)
License	Apache 2.0
Lead authors	Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Leandro von Werra, Thomas Wolf
Repository	huggingface.co/HuggingFaceTB
Blog post	huggingface.co/blog/smollm

Background

By mid 2024 the open weights landscape had two clear poles. At the top were 7 billion parameter models such as Llama 2 7B, Mistral 7B, and Qwen 2 7B, which were treated as the default "small" tier even though their memory footprints were still uncomfortable for consumer hardware. At the very bottom were sub-1B research models such as Pythia 1B, TinyLlama 1.1B, and various Cerebras and BLOOM ports, none of which were strong enough on reasoning benchmarks to be useful as general assistants. The middle ground (capable models small enough for phones, browsers, and CPUs) was occupied by closed releases like Apple's MobileLLM and Microsoft's Phi-1.5, with limited data transparency in both cases. ^[17]^[18]

Hugging Face's Smol Models Research team had spent the first half of 2024 building two pieces of infrastructure that turned out to be the prerequisites for a fully open small model. The first was FineWeb, a 15 trillion token open web crawl with a derived educational quality subset called FineWeb-Edu, filtered by a Llama 3 70B trained classifier. ^[12] The second was Cosmopedia, a 25 billion token synthetic corpus modelled on the Phi recipe but generated with Mixtral-8x7B-Instruct and released openly. ^[11] Both datasets shipped under permissive licenses with full filtering scripts.

With those two ingredients in place, plus a code subset filtered from The Stack by a similar educational classifier (Python-Edu), the team had enough data to train a small model from scratch and document every step. The SmolLM blog post, published on July 16, 2024, framed the effort as a demonstration that "meticulously curated data can lead to high performance even with smaller model sizes," with the implicit argument that the gap between closed and open small models was a data engineering problem rather than a compute problem. ^[1] The release came one day after Mistral AI shipped Mathstral 7B, and about a week before the launch of Llama 3.1, giving SmolLM a brief uncontested window in the on-device tier.

What sizes does SmolLM come in?

The SmolLM family at launch consisted of three sizes, each in a base pretrained version and an instruction tuned variant. ^[1]^[6]

Model	Parameters	Layers	Hidden size	Attention heads	Attention type	Context	Training tokens
SmolLM-135M	135M	30	576	9	Grouped Query Attention	2,048	600B
SmolLM-360M	360M	32	960	15	Grouped Query Attention	2,048	600B
SmolLM-1.7B	1.7B	24	2,048	32	Multi Head Attention	2,048	1T

The two smaller models followed the MobileLLM design recipe, which prioritises depth over width for parameter efficiency at sub-1B scale, and used Grouped Query Attention with embedding tying to reduce memory. ^[1]^[17] The 1.7B model used a more conventional layout closer to Llama 2 style transformers, with multi head attention and a wider hidden dimension. All three shared the same 49,152 token cosmo2 tokenizer, trained on the SmolLM-Corpus mixture so that the vocabulary was well matched to the training distribution. ^[1]

Each base model was paired with a SmolLM-Instruct version produced by supervised fine tuning on a mixture of WebInstructSub (the permissive subset), StarCoder2-Self-OSS-Instruct, OpenHermes 2.5, and Everyday Conversations, followed by Direct Preference Optimization. ^[5] The DPO mix differed by size, with HelpSteer used for the 135M and 1.7B instruct checkpoints and argilla/dpo-mix-7k used for 360M, a choice motivated by ablations on small scale post training stability.

How was SmolLM trained?

The SmolLM-Corpus (also referred to as Cosmo-Corpus in the model cards) is the central artifact of the release. Hugging Face describes it as a corpus that "includes Cosmopedia v2 (28B tokens of synthetic textbooks and stories generated by Mixtral), Python-Edu (4B tokens of educational Python samples from The Stack), and FineWeb-Edu (220B tokens of deduplicated educational web samples from FineWeb)." ^[2] It combines three openly published datasets, each filtered or generated by the Hugging Face team, and is itself available on the Hub under permissive licenses. ^[7] Training itself ran on the Nanotron framework using 64 NVIDIA H100 GPUs in bfloat16 precision. ^[2]

Cosmopedia v2

Cosmopedia v2 is a 28 billion token synthetic dataset of textbooks, stories, articles, and code snippets generated by Mixtral-8x7B-Instruct-v0.1. ^[2] It comprises around 39 million documents grouped under 34,000 topics, with the topic taxonomy derived from the BISAC book classification system. Audience targeting is fixed in the prompt for each document, with roughly 40 percent of generations aimed at a middle school reading level, 30 percent at college level, and 30 percent at mixed or other styles. The middle school subset turned out to be the strongest contributor on benchmarks other than MMLU, a finding the team reported in detail in the blog post. ^[1]

Cosmopedia v2 is an evolution of the original Cosmopedia released in March 2024. ^[11] The v2 release improved topic coverage, deduplicated overlapping documents, and tuned the prompt templates after the v1 run revealed gaps in coverage of mathematics, code, and current affairs. Cosmopedia v2 is the largest fully open synthetic pretraining corpus that had been released at the time of SmolLM's launch.

FineWeb-Edu

FineWeb-Edu is the educational quality subset of FineWeb, Hugging Face's open Common Crawl based pretraining corpus. ^[12] The full FineWeb is around 15 trillion tokens; FineWeb-Edu is the 1.3 trillion token slice that scored highly on a Llama 3 70B trained classifier for educational value. ^[8]^[12] For SmolLM, the team further deduplicated FineWeb-Edu and used the resulting 220 billion token subset as the largest single ingredient in the training mix. ^[2]

The choice to lean heavily on FineWeb-Edu was based on ablations in the FineWeb-Edu technical report, which showed that training on the filtered subset reached the same downstream accuracy as training on a much larger unfiltered crawl. ^[12] For a small model with a fixed token budget, this kind of quality filtering was effectively a way to buy more capability without more compute.

Python-Edu

Python-Edu is a 4 billion token subset of The Stack v2, filtered by a code educational quality classifier trained on annotations from Llama 3 70B. ^[2]^[9] The classifier rated Python files for their suitability as teaching material, and the Python-Edu set kept the top scoring 4 billion tokens. The Hugging Face team reported in the blog post that training on this filtered subset converged about three times faster than training on the unfiltered raw Python from The Stack, measured against HumanEval pass@1. ^[1]

Mixture

The final SmolLM-Corpus mixture is dominated by FineWeb-Edu (220 billion tokens) with Cosmopedia v2 contributing 28 billion and Python-Edu contributing 4 billion. ^[2] The 135M and 360M models passed over this mixture for roughly 600 billion tokens of training, well beyond the Chinchilla optimal point for their sizes, while the 1.7B model trained for 1 trillion tokens. ^[1]

Architecture

All three SmolLM models are decoder only transformers with the standard pre normalisation, RoPE position embeddings, and SwiGLU feed forward blocks that had become the open weights default by 2024. The differences across sizes are mostly in depth, width, and attention layout. ^[1]

The 135M model has 30 transformer layers with a hidden size of 576, intermediate FFN size of 1,536, and 9 attention heads using Grouped Query Attention with 3 key value heads. The 360M version is 32 layers deep with a hidden size of 960, 2,560 wide FFN, and 15 query heads against 5 key value heads. Both small models use embedding tying, where the input embedding matrix and the output projection share weights, which saves a noticeable fraction of total parameters at this scale. ^[1]

The 1.7B model breaks with the MobileLLM template. It has 24 transformer layers, a hidden size of 2,048, an intermediate FFN width of 8,192, and 32 multi head attention heads without grouping. Embedding tying is still used. The reason for the different layout, according to the blog post, is that at 1.7B the additional inference cost of full multi head attention is acceptable on the target hardware (laptops, mid range smartphones), and the wider per-layer representation helps with downstream tasks that the smaller, deeper models struggled with. ^[1]

All three models use a 2,048 token context window, which was already short compared to the 8,192 and 32,768 token contexts becoming standard in larger 2024 open releases. The choice was justified in the blog post as a deliberate trade off; longer contexts would have increased training cost per token and reduced memory headroom on edge hardware, while most expected use cases (assistants, summarisers, autocomplete) fit comfortably in 2,048 tokens. ^[1]

Training used the Nanotron framework, Hugging Face's open source 3D parallel training library, on 64 NVIDIA H100 GPUs in bfloat16 precision. ^[2]^[10] The learning rate schedule was a trapezoidal warmup, constant, and cooldown shape, with the cooldown phase covering the final 20 percent of the training run. ^[1] This shape, sometimes called WSD (warmup-stable-decay), had been popularised by the MiniCPM team in early 2024 and was adopted across the SmolLM family.

How does SmolLM perform?

The headline benchmarks reported in the SmolLM blog post place each model at or near the top of its respective size class on standard zero shot and few shot reasoning tasks. ^[1] Scores are reported on a fixed evaluation pipeline using lighteval, with HellaSwag, PIQA, OpenBookQA, WinoGrande, ARC, MMLU (cloze formulation), and CommonsenseQA as the core suite. The team emphasised that they used the cloze MMLU formulation rather than the multiple choice version often reported by other small models, because small models with limited instruction following ability tend to score artificially well on the formatted multiple choice version through label memorisation. ^[1]

The 135M tier was benchmarked against Apple's MobileLLM-125M and Meta's smaller dense baselines. SmolLM-135M outperformed MobileLLM-125M across the suite despite training on 600 billion tokens against MobileLLM's 1 trillion, a result the team attributed to the higher quality of the SmolLM-Corpus relative to the dataset behind MobileLLM. ^[1]

The 360M model was positioned against models under 500 million parameters, including MobileLLM-350M, Qwen2-500M, and Pythia-410M. SmolLM-360M outperformed all of them on the average benchmark score and was within a few points of much larger models like TinyLlama 1.1B and Pythia 1B. ^[1]

The 1.7B tier was the most contested. SmolLM-1.7B was compared against Phi-1.5, MobileLLM-1.5B, and Qwen2-1.5B, and won on the Hugging Face reported average. ^[1] It also posted what Hugging Face described as "strong Python coding performance with 24 pass@1" on HumanEval. ^[1] The blog post reproduced the per-benchmark scores in a single chart and table; reproducing exact numerical results here requires the model to be evaluated under the same lighteval configuration, so the article reports only the ranking claims that Hugging Face made at launch.

Comparison	Result reported by Hugging Face
SmolLM-135M vs MobileLLM-125M	SmolLM wins on average across the standard suite
SmolLM-360M vs all sub-500M models	SmolLM wins on average across MobileLLM-350M, Qwen2-500M, Pythia-410M
SmolLM-1.7B vs Phi-1.5	SmolLM wins on average across the standard suite
SmolLM-1.7B vs Qwen2-1.5B	SmolLM wins on most reasoning tasks, trails on HumanEval (24 pass@1 vs 31.1 pass@1)
SmolLM-1.7B vs MobileLLM-1.5B	SmolLM wins on average

For instruction tuned variants, Hugging Face reported IFEval results in which SmolLM-1.7B-Instruct sat below Qwen2-1.5B-Instruct (which scored 29.94 on Prompt Strict Accuracy) but described the SmolLM Instruct line as offering "a good balance between model size and performance" using only publicly available post training datasets. ^[1] The team explicitly noted that Qwen 2's stronger instruct numbers came from a much larger and partially closed post training mix.

Memory footprints at full bfloat16 precision were reported as approximately 520 megabytes for the 135M model, 1.4 gigabytes for the 360M, and roughly 3.4 to 6.5 gigabytes for the 1.7B depending on activation overhead. ^[1] The team published 8-bit and 4-bit quantised versions, with the 1.7B model dropping to about 1 gigabyte at 4-bit precision, sufficient to fit on an iPhone 15 with 6 gigabytes of unified memory.

Is SmolLM open source?

Yes. Every component of the SmolLM release is licensed under Apache 2.0. ^[2] This includes the three base models, the three SmolLM-Instruct variants, the cosmo2 tokenizer, the SmolLM-Corpus (including all three sub-corpora), the Nanotron training framework, the FineWeb-Edu and Python-Edu classifiers, and the WebGPU demo Spaces published at launch. ^[7]^[8]^[9]^[10]

Apache 2.0 permits commercial use, redistribution, modification, and the creation of derivative works, subject to preservation of the license notice and attribution. There are no acceptable use clauses, no field of use restrictions, and no user count thresholds, in contrast to Meta's Llama Community License which gates large commercial users behind a separate agreement. The decision to license the synthetic Cosmopedia v2 corpus permissively is particularly notable, since most synthetic pretraining datasets at the time were either not released or released under research only terms.

The permissive license is one of the main reasons SmolLM was adopted as a baseline in third party academic work on small models, including in MobileLLM follow ups from Apple and the LFM line from Liquid AI. The Apache 2.0 framing also made it straightforward to integrate SmolLM into commercial on-device products, with at least a handful of mobile keyboard apps and developer tools picking up the 135M or 360M variants for autocomplete and summarisation features in late 2024.

How does SmolLM compare to MobileLLM, Phi, and Qwen 2?

At launch, SmolLM was the most data transparent open release in its size class. The table below summarises the main contemporaries.

Model	Parameters	Release	License	Data published	Training tokens
SmolLM 135M	135M	Jul 2024	Apache 2.0	Yes	600B
SmolLM 360M	360M	Jul 2024	Apache 2.0	Yes	600B
SmolLM 1.7B	1.7B	Jul 2024	Apache 2.0	Yes	1T
MobileLLM 125M	125M	Jun 2024	Meta License	No	1T
MobileLLM 350M	350M	Jun 2024	Meta License	No	1T
MobileLLM 1.5B	1.5B	Jun 2024	Meta License	No	1T
Phi-1.5	1.3B	Sep 2023	MIT	No	150B
Phi-2	2.7B	Dec 2023	MIT	No	1.4T
Qwen2-0.5B	500M	Jun 2024	Apache 2.0	No	12T
Qwen2-1.5B	1.5B	Jun 2024	Apache 2.0	No	7T
TinyLlama 1.1B	1.1B	Sep 2023	Apache 2.0	Yes	3T
Pythia 410M / 1B	410M / 1B	Apr 2023	Apache 2.0	Yes (The Pile)	300B

Against Apple's MobileLLM line, SmolLM offered comparable or better quality at smaller token budgets, with the major difference being that MobileLLM did not release its training data or filtering scripts. ^[17] Against Microsoft's Phi-3 line, SmolLM was clearly less capable in absolute terms (Phi-3-mini at 3.8B parameters was the closest competitor by that month, and trained on 3.3 trillion tokens of mostly proprietary data), but SmolLM was also smaller and fully open. Against Alibaba's Qwen 2 sub-2B tier, SmolLM was competitive on common sense reasoning but lagged on coding and on Chinese language tasks, reflecting the Anglophone bias of the SmolLM-Corpus.

The closest direct comparison in terms of release philosophy was TinyLlama, which had also published its training data and code. TinyLlama, however, used the much larger 3 trillion token RedPajama and SlimPajama corpora without the educational filtering step, and the SmolLM team's ablations suggested that this hurt downstream accuracy at small scale. ^[1]

How do SmolLM 2 and SmolLM 3 differ?

SmolLM was treated by Hugging Face as the first stage of a multi-year small models program rather than a single release. Within four months it was succeeded by SmolLM 2, unveiled on November 1, 2024, which kept the same three sizes (135M, 360M, 1.7B) but trained each model on a much larger token budget: the 1.7B flagship trained on roughly 11 trillion tokens, with the 360M and 135M checkpoints trained on about 4 trillion and 2 trillion tokens respectively. ^[15] The SmolLM 2 mix incorporated FineWeb-Edu, DCLM Baseline, the Stack, and new mathematics and code datasets including the high quality FineMath and InfiWebMath sets. SmolLM 2 reported large improvements over SmolLM 1 across the same benchmark suite, particularly on math (GSM8K) and code (HumanEval), and the SmolLM 2 instruct variants used a more sophisticated post training mix. A follow-up technical report, "SmolLM2: When Smol Goes Big, Data-Centric Training of a Small Language Model," was posted to arXiv on February 4, 2025. ^[15]

SmolLM 3 followed on July 8, 2025, in a single 3 billion parameter size, pretrained on 11.2 trillion tokens. ^[16] SmolLM 3 was trained at a 4,096 token base context that extends to 128,000 tokens at inference via YARN extrapolation, and added dual mode reasoning (a thinking mode and a non-thinking mode in a single checkpoint, toggled by /think and /no_think flags in the system prompt) plus explicit multilingual training covering six languages natively (English, French, Spanish, German, Italian, Portuguese). ^[16] The architecture moved away from the SmolLM 1 layout to a wider 3B configuration using Grouped Query Attention and NoPE (selective removal of rotary position embeddings from every fourth layer), closer to Qwen 2.5 and Llama 3.2. Hugging Face reported that "our 3B model outperforms Llama-3.2-3B and Qwen2.5-3B while staying competitive with larger 4B alternatives (Qwen3 & Gemma3)," and framed the release as "a new competitive fully open 3B model." ^[16] Like the first generation, SmolLM 3 is distributed under Apache 2.0.

The broader SmolLM program also spun off into multimodal work, with the SmolVLM family of small vision language models released in November 2024, and into agentic and robotics applications including SmolAgents and the later SmolVLA action model. None of those derivatives shares weights with the original SmolLM, but they reuse the data engineering and training pipeline pioneered in the first SmolLM release.

Reception

Reception of the July 2024 release was strongly positive in the open weights community. The blog post was widely shared, and the three model checkpoints together accumulated several million downloads from the Hugging Face Hub in the months after launch, with the 135M and 360M variants seeing particularly heavy use as research baselines and as test loads for inference frameworks. Loubna Ben Allal, the lead author, was widely interviewed and gave talks on the project at the 2024 Open Source AI Forum and at the AI Engineer World's Fair.

The two most cited reasons for the favourable reception were the unusual completeness of the data release (Cosmopedia v2, FineWeb-Edu, and Python-Edu were all openly published) and the demonstration that quality filtering could substitute for raw token count at small scale. ^[13] The blog post's finding that middle school targeted synthetic textbooks were the strongest contributor on most reasoning benchmarks attracted particular attention as a counter-intuitive empirical result. ^[1]

Independent inference framework support arrived quickly. The llama.cpp project added GGUF conversion within days, ONNX exports were published on the Hub almost immediately, and WebGPU demos using transformers.js were shipped by Hugging Face itself on launch day, making SmolLM the first small model series with a fully browser based inference path on day one. ^[1] By late 2024 the smaller SmolLM variants had been picked up in third party mobile keyboard apps, browser extensions for text rewriting, and as bootstrap baselines for academic groups studying alternatives to standard transformer pretraining recipes.

Criticism focused on three points. The 2,048 token context length was short for general assistant use even by mid 2024 standards. The model card disclaimers about factual accuracy were more conservative than competing releases, reflecting the team's honest stance about how much knowledge a sub-2B parameter model can actually hold. And the SmolLM-Instruct variants, while solid, were noticeably weaker than Qwen 2's much more heavily post trained chat models, a gap the SmolLM team acknowledged in the blog post and explicitly attributed to using only publicly available SFT and DPO datasets. ^[1] The subsequent SmolLM 2 release in November 2024 closed several of these gaps, particularly around math and coding ability, but the 2,048 context window persisted until the SmolLM 3 launch in mid 2025. ^[15]^[16]

SmolLM's longer term influence is best seen indirectly. The FineWeb-Edu filtering approach was adopted by several subsequent open model programs, the cosmo2 tokenizer became a reference vocabulary for sub-2B research models, and the trapezoidal learning rate schedule (WSD) used in the SmolLM run became one of the standard small model schedules over the next year. The blog post itself was cited in the Falcon 3 technical report, in the OLMo 2 technical report, in Apple's MobileLLM v2 paper, and in numerous community fine tunes that used SmolLM checkpoints as a starting point.

ELI5: What is SmolLM?

Imagine the giant AI chatbots are like huge encyclopedias that need a whole library to hold them. SmolLM is like a tiny pocket dictionary instead: small enough to fit on your phone or even run inside a web page, but still surprisingly clever. Hugging Face made it small by feeding it really clean, well organised "study notes" (carefully chosen text and made-up textbooks) instead of just dumping the messy whole internet into it. And unlike most AIs, they gave away everything, the model, the study notes, and the instructions, so anyone can build the exact same thing themselves.

References

Ben Allal, Loubna; Lozhkov, Anton; Bakouch, Elie; von Werra, Leandro; Wolf, Thomas. "SmolLM - blazingly fast and remarkably powerful." Hugging Face Blog, July 16, 2024. https://huggingface.co/blog/smollm ↩
Hugging Face. "HuggingFaceTB/SmolLM-135M." Hugging Face model card. https://huggingface.co/HuggingFaceTB/SmolLM-135M ↩
Hugging Face. "HuggingFaceTB/SmolLM-360M." Hugging Face model card. https://huggingface.co/HuggingFaceTB/SmolLM-360M
Hugging Face. "HuggingFaceTB/SmolLM-1.7B." Hugging Face model card. https://huggingface.co/HuggingFaceTB/SmolLM-1.7B
Hugging Face. "HuggingFaceTB/SmolLM-135M-Instruct." Hugging Face model card. https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct ↩
Hugging Face. "SmolLM Models Collection." Hugging Face Collections. https://huggingface.co/collections/HuggingFaceTB/smollm-models-6695016cad7167254ce15966 ↩
Hugging Face. "HuggingFaceTB/smollm-corpus." Hugging Face dataset card. https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus ↩
Hugging Face. "HuggingFaceFW/fineweb-edu-classifier." Hugging Face model card. https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier ↩
Hugging Face. "HuggingFaceTB/python-edu-scorer." Hugging Face model card. https://huggingface.co/HuggingFaceTB/python-edu-scorer ↩
Hugging Face. "Nanotron training framework." GitHub. https://github.com/huggingface/nanotron ↩
Ben Allal, Loubna; Lozhkov, Anton; Penedo, Guilherme; Wolf, Thomas; von Werra, Leandro. "Cosmopedia: how to create large-scale synthetic data for pre-training." Hugging Face Blog, March 2024. https://huggingface.co/blog/cosmopedia ↩
Penedo, Guilherme et al. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." Hugging Face Blog and arXiv:2406.17557, June 2024. https://arxiv.org/abs/2406.17557 ↩
MarkTechPost. "Hugging Face Introduces SmolLM: Transforming On-Device AI with High-Performance Small Language Models from 135M to 1.7B Parameters." July 16, 2024. https://www.marktechpost.com/2024/07/16/hugging-face-introduces-smollm-transforming-on-device-ai-with-high-performance-small-language-models-from-135m-to-1-7b-parameters/ ↩
Willison, Simon. "SmolLM2." Simon Willison's Weblog, November 2, 2024. https://simonwillison.net/2024/Nov/2/smollm2/
Ben Allal, Loubna et al. "SmolLM2: When Smol Goes Big - Data-Centric Training of a Small Language Model." arXiv:2502.02737, February 2025. https://arxiv.org/abs/2502.02737 ↩
Bakouch, Elie; Ben Allal, Loubna et al. "SmolLM3: smol, multilingual, long-context reasoner." Hugging Face Blog, July 8, 2025. https://huggingface.co/blog/smollm3 ↩
Liu, Zechun et al. "MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases." arXiv:2402.14905, February 2024. https://arxiv.org/abs/2402.14905 ↩
Li, Yuanzhi et al. "Textbooks Are All You Need II: phi-1.5 technical report." arXiv:2309.05463, September 2023. https://arxiv.org/abs/2309.05463 ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributors · full history

Suggest edit

What links here

BitNet b1.58 Common Crawl Cosmopedia SmolLM 2 SmolLM 3

What is SmolLM?

Infobox

Background

What sizes does SmolLM come in?

How was SmolLM trained?

Cosmopedia v2

FineWeb-Edu

Python-Edu

Mixture

Architecture

How does SmolLM perform?

Is SmolLM open source?

How does SmolLM compare to MobileLLM, Phi, and Qwen 2?

How do SmolLM 2 and SmolLM 3 differ?

Reception

ELI5: What is SmolLM?

See also

References

Improve this article

Related Articles

Phi-3

Phi-4

Gemma 2

Gemma 3

Phi-4-mini

Phi-4-mini-flash-reasoning

What links here

Related Articles

Phi-3

Phi-4

Gemma 2

Gemma 3

Phi-4-mini

Phi-4-mini-flash-reasoning

What links here