SmolLM 3
Last reviewed
May 16, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 ยท 3,503 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 ยท 3,503 words
Add missing citations, update stale details, or suggest a clearer explanation.
SmolLM 3 is a 3 billion parameter open-weights language model released by Hugging Face on July 8, 2025. It is the third entry in the SmolLM line of small language models, following SmolLM from July 2024 and SmolLM 2 from November 2024. The model was trained on 11.2 trillion tokens, supports a native 64,000 token context that can be extended to 128,000 tokens through YaRN extrapolation, and ships with native multilingual coverage of six European languages. Its defining feature is a dual reasoning behaviour: a single set of weights can be switched between an extended thinking mode that produces visible chain of thought traces before answering and a faster non-thinking mode that emits direct responses. Both the base checkpoint and the post-trained instruct checkpoint are released under the Apache License 2.0.
The project was led by the Hugging Face Smol Models Research group, with Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, and Thomas Wolf among the principal authors. Hugging Face published the full recipe alongside the weights, including the pretraining data mixture, the training and post-training configurations, the synthetic reasoning data, and the model merging procedure used to combine the final checkpoints. The release blog post on the Hugging Face site framed SmolLM 3 as an attempt to close the gap between 3 billion parameter open models, which had historically lagged on reasoning and long context, and the 4 billion parameter tier represented by Qwen 3 4B and Gemma 3 4B, while remaining cheap enough to run on a single consumer GPU or on edge devices with quantisation.
The SmolLM line was started in summer 2024 as a research bet that a careful pretraining recipe could produce a competitive language model in the 100 million to 1 billion parameter range, sizes that had been treated as too small to matter for chat and instruction following. The first SmolLM release in July 2024 shipped three checkpoints at 135 million, 360 million, and 1.7 billion parameters, trained on the SmolLM-Corpus, an English language pretraining dataset put together from Cosmopedia, Python-Edu, and FineWeb-Edu. The headline finding was that the 1.7 billion parameter SmolLM outperformed several other open small models of similar size on aggregate evaluations, an unusual outcome for a first release from a team that had not previously published a frontier pretraining run.
SmolLM 2 followed in November 2024 with the same three parameter scales but a substantially expanded training budget. The 1.7 billion parameter SmolLM 2 was trained on 11 trillion tokens, an order of magnitude more than SmolLM 1.7B, and added math and code corpora on top of the SmolLM-Corpus. The model was also released with chat and instruct variants that used a new post-training recipe built on supervised fine tuning and direct preference optimisation. SmolLM 2 was widely reported as the strongest open 1.7 billion parameter chat model of its time and remained competitive into mid 2025.
By the first half of 2025 the small open model landscape had shifted in three ways. Reasoning models such as DeepSeek R1, OpenAI o1, and Qwen 3 Thinking had popularised the idea that test-time compute spent on visible chain of thought traces could lift small models well above their static benchmark scores. Long context windows of 32,000 to 128,000 tokens had become a standard expectation rather than a niche feature. Multilingual coverage in the European long tail was being added even to small open models by Mistral, Qwen, and Google's Gemma family. SmolLM 3 was conceived as the SmolLM team's response to these three trends at a slightly larger parameter scale, with the explicit goal of producing a 3 billion parameter model that could hold its own against the 4 billion parameter competition while running comfortably on a single 8 to 12 gigabyte GPU at half precision.
SmolLM 3 is a dense decoder only Transformer built in the same general family as Meta's Llama 3 and Llama 3.2. The model uses tied input and output embeddings, Grouped-Query Attention with a four to one ratio of query to key value heads, RMSNorm pre-normalisation, and the SwiGLU activation function used by most modern Llama style models. The total parameter count is 3 billion at BF16 precision.
The most distinctive architectural choice in SmolLM 3 is the use of NoPE, a no positional encoding variant introduced in earlier research on length generalisation. Rather than applying a single positional encoding scheme uniformly across all layers, SmolLM 3 keeps standard rotary position embedding (RoPE) on three quarters of its layers and strips position encoding entirely on the remaining quarter, following a 3 to 1 ratio in which every fourth layer is a NoPE layer. The Hugging Face team reports that this selective removal improves the model's ability to generalise to context lengths beyond those seen during training without harming short context accuracy. The RoPE theta is set to 1.5 million during the standard 32,000 token training stage and increased to 5 million for the 64,000 token long context stage.
The tokenizer is derived from Llama 3.2 with the beginning of sequence token removed. The vocabulary is 128,000 tokens, combining the 100,000 tiktoken3 token base used by Llama 3 with roughly 28,000 additional tokens added for non-English language coverage. Training uses intra-document masking to prevent attention bleed between documents packed into the same sequence, and the embedding layers are excluded from weight decay during the optimiser step, a small change that the team found stabilised the early phase of training. Long context support up to 128,000 tokens at inference time is achieved through YaRN rope scaling with a factor of 2.0 over the 65,536 token trained window.
| Specification | Value |
|---|---|
| Parameters | 3 billion |
| Layers | dense decoder only Transformer |
| Attention | grouped query, 4 to 1 query to key value ratio |
| Positional encoding | RoPE on 3 of every 4 layers, NoPE on the remaining layer |
| Tokenizer | Llama 3.2 base with bos removed, 128k vocabulary |
| Precision | BF16 |
| Trained context length | 64,000 tokens |
| Maximum context with YaRN | 128,000 tokens |
| Activation | SwiGLU |
| Normalisation | RMSNorm with pre normalisation |
The pretraining stack uses 11.2 trillion tokens, split across three stages with progressively more code and mathematics mixed into the diet as the run advances. The framework is Hugging Face's open source Nanotron trainer and the hardware budget is 384 NVIDIA H100 GPUs for 24 days, with a global batch size of 2.36 million tokens at a 4,096 token sequence length. The optimiser is AdamW with beta1 of 0.9, beta2 of 0.95, a peak learning rate of 2e-4, and weight decay of 0.1, on a warmup stable decay schedule with 2,000 warmup steps and a linear decay applied over the final 10 per cent of the run.
The stage one mixture, used from 0 to 8 trillion tokens, is 85 per cent web text drawn from FineWeb-Edu, DCLM, FineWeb 2, and FineWeb 2 HQ, with 12 per cent of the web portion in non-English European languages. The remaining mass is split between 12 per cent code from The Stack v2, StarCoder 2, Jupyter notebooks, Kaggle, and Stack Exchange, and 3 per cent math from FineMath 3 plus and InfiWebMath 3 plus.
The stage two mixture from 8 to 10 trillion tokens lowers the web fraction to 75 per cent, raises code to 15 per cent with the addition of the curated Stack-Edu set, and bumps math to 10 per cent with the higher quality FineMath 4 plus, InfiWebMath 4 plus, and MegaMath subsets. The multilingual share of the web data is preserved at 12 per cent throughout.
The stage three decay mixture from 10 to 11.1 trillion tokens shifts more weight onto code, at 24 per cent, and math, at 13 per cent, with the web fraction dropping to 63 per cent. The OpenMathReasoning dataset is added at this stage to expose the model to step by step mathematical reasoning traces during the final, lowest learning rate phase of the pretraining run. A separate mid-training stage of 100 billion tokens is then used to extend the context from 4,096 first to 32,000 tokens and then to 64,000 tokens. A further reasoning specific mid-training of 35 billion unique tokens, repeated for four epochs to roughly 140 billion total tokens, is sourced from the OpenThoughts3 1.2 million example set and a subset of NVIDIA's Llama-Nemotron reasoning data, all templated in the ChatML format.
SmolLM 3 is natively trained on six languages: English, French, Spanish, German, Italian, and Portuguese. Twelve per cent of the web portion of the pretraining mixture is in non-English European languages throughout all three pretraining stages, which the Hugging Face team identifies as the share that allowed the model to retain strong English benchmark scores while picking up meaningful multilingual capability. Additional smaller exposure was given during training to Arabic, Chinese, and Russian, which the team describes as supported on a best effort basis with fewer training tokens and correspondingly lower expected quality.
On the base model the team reports MLMM HellaSwag scores of 63.94 for French, 65.85 for Spanish, and 59.56 for German, putting SmolLM 3 in front of similarly sized open competitors on those languages. Flores 200 five-shot translation accuracy reaches 62.85 on French, 48.25 on Spanish, and 56.60 on German. Belebele reading comprehension is reported at 51.00 for French. The release blog identifies multilingual European coverage as one of the three main differentiators of SmolLM 3 relative to other 3 billion parameter open models, alongside long context and dual reasoning behaviour.
Unlike most other small open models, SmolLM 3 ships with a single instruct checkpoint that can switch between two behavioural modes at inference time. Extended thinking mode is enabled by default and is also activated by inserting the flag /think in the system prompt. In this mode the model first emits a visible chain of thought inside a dedicated <think>...</think> block before producing the user facing answer. Non-thinking mode is activated by the flag /no_think, in which case the model pre-fills an empty think block and proceeds directly to the answer. Both modes share the same weights; the switch is encoded entirely in the chat template and the system prompt.
The reasoning behaviour comes from the combination of three post-training stages. The first is the 140 billion token reasoning mid-training described above, which conditions the model on long form chain of thought traces. The second is supervised fine tuning on a 1.8 billion token mixture, split between 1 billion tokens of conventional non-reasoning conversational data drawn from 12 datasets and 0.8 billion tokens of reasoning data drawn from 10 datasets, all with explicit chain of thought traces, trained for four epochs to roughly 8 billion total tokens. The team uses the best-fit decreasing packing strategy and reports that a significant portion of the SFT data was generated synthetically by prompting Qwen3 32B.
The third stage is preference alignment with Anchored Preference Optimization, a variant of direct preference optimization (DPO) that adds an anchor term to stabilise training on long context preference data. The non-reasoning preference data is drawn from the public Tulu 3 preference set, while the reasoning preference pairs are synthetically constructed by treating Qwen3 32B completions as the chosen response and Qwen3 0.6B completions as the rejected response. The maximum context length seen during alignment training is 24,000 tokens. The final shipped checkpoint is a linear merge of the APO checkpoint at 0.9 weight and the mid-training checkpoint at 0.1 weight, a step the team added to recover a small amount of base model behaviour that had been over-erased by post-training.
The practical effect of the dual mode design is most visible on hard reasoning benchmarks. AIME 2025 jumps from 9.3 in non-thinking mode to 36.7 in extended thinking mode, a roughly four-fold improvement. LiveCodeBench v4 doubles from 15.2 to 30.0, and GPQA Diamond rises from 35.7 to 41.7. The trade-off is latency and token cost, since the thinking trace can run to several thousand tokens before the answer begins.
The Hugging Face team publishes evaluations against three different cohorts: base model scores against other 3 billion parameter base models, instruct model scores in non-thinking mode against other instruct models, and extended thinking scores against the Qwen 3 reasoning family at 1.7 billion and 4 billion parameters.
| Benchmark | SmolLM 3 3B | Qwen 2.5 3B | Llama 3.2 3B |
|---|---|---|---|
| HellaSwag | 76.15 | 74.19 | 75.52 |
| ARC CF | 65.61 | 59.81 | 58.58 |
| MMLU CF | 44.13 | 42.93 | 41.32 |
| HumanEval Plus | 30.48 | 34.14 | 25.00 |
| MBPP Plus | 52.91 | 52.11 | 38.88 |
| RULER 64k | 67.85 | 64.90 | 72.93 |
| Benchmark | SmolLM 3 3B | Qwen 2.5 3B Instruct | Llama 3.1 3B Instruct |
|---|---|---|---|
| AIME 2025 | 9.3 | 2.9 | 0.3 |
| GSM-Plus | 72.8 | 74.1 | 59.2 |
| LiveCodeBench v4 | 15.2 | 10.5 | 3.4 |
| GPQA Diamond | 35.7 | 32.2 | 29.4 |
| IFEval | 76.7 | 65.6 | 71.6 |
| BFCL tool calling | 92.3 | not reported | 92.3 |
| Global MMLU | 53.5 | 50.54 | 46.8 |
| Benchmark | SmolLM 3 3B | Qwen 3 1.7B | Qwen 3 4B |
|---|---|---|---|
| AIME 2025 | 36.7 | 30.7 | 58.8 |
| GSM-Plus | 83.4 | 79.4 | 88.2 |
| LiveCodeBench v4 | 30.0 | 34.4 | 52.9 |
| GPQA Diamond | 41.7 | 39.9 | 55.3 |
The pattern across these tables is consistent with the team's framing. The base 3B model is competitive with or beats both Qwen 2.5 3B and Llama 3.2 3B on English knowledge and code, while sitting behind Llama 3.2 3B on long context retrieval as measured by RULER 64k. In non-thinking mode the instruct model leads its 3 billion parameter peers on most benchmarks, with particularly large margins on IFEval and AIME 2025. In extended thinking mode the same checkpoint approaches but does not match the 4 billion parameter Qwen 3 4B, while outperforming the smaller Qwen 3 1.7B on three of the four reported benchmarks.
The full SmolLM 3 release is published under the Apache License 2.0. This applies to the base 3 billion parameter checkpoint, the instruct checkpoint, the published intermediate training checkpoints, the pretraining and post-training datasets that Hugging Face owns or co-curates, and the Nanotron training configurations released alongside the model. The Apache 2.0 grant permits commercial use, modification, and redistribution provided the licence notice and any patent grants are preserved, and it does not impose acceptable use restrictions or per-user thresholds of the kind found in Meta's Llama Community Licence or Google's Gemma Terms of Use.
The team notes in the release blog that a small number of datasets used in the mixture, including portions of OpenThoughts3 and NVIDIA's Llama-Nemotron reasoning subset, carry their own upstream licences and that users redistributing those datasets should consult the original sources. The model weights themselves are not subject to any such downstream restriction.
SmolLM 3 sits in a crowded 3 to 4 billion parameter open weight tier that includes Llama 3.2 3B from Meta, Qwen 2.5 3B and Qwen 3 4B from Alibaba, and Gemma 3 4B from Google. The table below summarises the headline specifications.
| Model | Parameters | Released | Context | Languages | Reasoning mode | Licence |
|---|---|---|---|---|---|---|
| SmolLM 3 | 3B | July 2025 | 128k via YaRN | 6 native, 3 additional | Dual think and no-think | Apache 2.0 |
| Llama 3.2 3B | 3B | September 2024 | 128k | English plus 7 official | No | Llama 3.2 Community |
| Qwen 2.5 3B | 3B | September 2024 | 32k | 29 listed | No | Qwen Research |
| Qwen 3 4B | 4B | April 2025 | 128k | 119 listed | Dual think and no-think | Apache 2.0 |
| Gemma 3 4B | 4B | March 2025 | 128k | 35 plus | No | Gemma Terms of Use |
On parameter efficiency and licensing SmolLM 3 is the more permissive choice in the 3 billion parameter slot, since Apache 2.0 has fewer restrictions than the Llama 3.2 Community licence and avoids the research only constraint of the original Qwen 2.5 3B release. On capability the team's own win-rate tables show SmolLM 3 outperforming both Llama 3.2 3B and Qwen 2.5 3B on a basket of HellaSwag, ARC, Winogrande, CommonsenseQA, MMLU, MMLU Pro, PIQA, OpenBookQA, GSM8K, MATH, HumanEval Plus, and MBPP Plus. At the 4 billion parameter tier the Qwen 3 4B model retains an edge on most reasoning benchmarks under extended thinking, while Gemma 3 4B is the strongest of the 4 billion parameter competitors on certain multilingual and image-text tasks, although Gemma 3 ships as a multimodal model and so is not a like-for-like comparison.
Reception in the open model community was broadly positive. The dual reasoning mode was singled out as the most interesting feature, since it represented one of the first times that a 3 billion parameter open model had shipped with switchable thinking on a single checkpoint rather than as two separate models. The roughly four-fold AIME 2025 lift from non-thinking to extended thinking was widely cited as evidence that the post-training recipe, in particular the 140 billion token reasoning mid-training and the APO alignment step, was doing useful work.
A second talking point was efficiency. The full pretraining run on 384 H100 GPUs for 24 days was an order of magnitude smaller than the budgets reported for some 7 to 8 billion parameter open models, and the model's strong showing against Llama 3.2 3B and Qwen 2.5 3B was read as evidence that the SmolLM team's pretraining recipe was unusually data-efficient. The release of the training configurations and the SmolTalk 2 post-training dataset on Hugging Face allowed third parties to reproduce sections of the run within hours of the announcement.
The 128k context length, while standard for late 2025 frontier models, was noted as a genuine novelty at the 3 billion parameter scale, where 32,000 token windows had still been common. Quantised versions of SmolLM 3 in GGUF and MLX formats appeared on Hugging Face within days of release, and integrations with llama.cpp, Ollama, LM Studio, and Jan were live in the first week. Several reviewers noted that the choice of a single chat template covering both reasoning and non-reasoning behaviour made the model unusually easy to deploy compared with families that ship separate base, instruct, and reasoning checkpoints.
Criticism focused on three points. The model trails the slightly larger Qwen 3 4B on most extended thinking benchmarks, which several reviewers said was the more meaningful peer comparison given that the two models share a similar dual-mode design and licence. Languages outside the six natively supported ones receive only limited exposure during pretraining, which the team itself flagged as a known limitation. And while the base 3 billion parameter model is competitive on RULER at 64,000 tokens, the gap to Llama 3.2 3B on the same benchmark suggests that Meta's long context recipe still has some advantage in needle-in-a-haystack style retrieval tasks.