SmolLM 3

AI Models Large Language Models Open Source AI Small Language Models

20 min read

Updated Jun 28, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 28, 2026

Fact-checked

In review queue

Sources

12 citations

Revision

v2 · 3,986 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

SmolLM 3 is a fully open 3 billion parameter language model released by Hugging Face on July 8, 2025, trained on 11.2 trillion tokens and designed as a small, multilingual, long-context reasoner.^[1] It is the third entry in the SmolLM line of small language models, following SmolLM from July 2024 and SmolLM 2 from November 2024. SmolLM 3 supports a native 64,000 token context that can be extended to 128,000 tokens through YaRN extrapolation, ships with native multilingual coverage of six European languages, and uniquely offers a dual reasoning behaviour: a single set of weights can be switched between an extended thinking mode that produces visible chain of thought traces before answering and a faster non-thinking mode that emits direct responses.^[1]^[2] Hugging Face positions the model as a strong fully-open entry at its size, writing in the release blog that "Our 3B model outperforms Llama-3.2-3B and Qwen2.5-3B while staying competitive with larger 4B alternatives" such as Qwen 3 4B and Gemma 3 4B.^[1] Both the base checkpoint and the post-trained instruct checkpoint are released under the Apache License 2.0.^[1]^[2]

Unlike a weights-only release, SmolLM 3 is what Hugging Face calls a "new competitive fully open 3B model," with the team stating, "We're releasing SmolLM3 with our engineering blueprint."^[1] The project was led by the Hugging Face Smol Models Research group, with Elie Bakouch, Loubna Ben Allal, Anton Lozhkov, Nouamane Tazi, Lewis Tunstall, and Thomas Wolf among the principal authors.^[1] Hugging Face published the full recipe alongside the weights, including the pretraining data mixture, the training and post-training configurations, the synthetic reasoning data, and the model merging procedure used to combine the final checkpoints.^[1]^[6] The release blog post framed SmolLM 3 as an attempt to close the gap between 3 billion parameter open models, which had historically lagged on reasoning and long context, and the 4 billion parameter tier, while remaining cheap enough to run on a single consumer GPU or on edge devices with quantisation.^[1]

What is SmolLM 3?

SmolLM 3 is a dense, decoder-only large language model of 3 billion parameters that combines four headline capabilities in one release: state-of-the-art performance for its size class, native long context up to 128k tokens, multilingual support across six languages, and switchable reasoning on a single instruct checkpoint.^[1]^[2] Hugging Face frames these as the model's main differentiators, summarising the launch as "SmolLM3: smol, multilingual, long-context reasoner."^[1] The model targets the cost-sensitive deployment niche where a 3 billion parameter footprint can run on an 8 to 12 gigabyte GPU at half precision while still handling tool use, multi-step reasoning, and several European languages.^[1]

The SmolLM line was started in summer 2024 as a research bet that a careful pretraining recipe could produce a competitive language model in the 100 million to 1 billion parameter range, sizes that had been treated as too small to matter for chat and instruction following. The first SmolLM release in July 2024 shipped three checkpoints at 135 million, 360 million, and 1.7 billion parameters, trained on the SmolLM-Corpus, an English language pretraining dataset put together from Cosmopedia, Python-Edu, and FineWeb-Edu. The headline finding was that the 1.7 billion parameter SmolLM outperformed several other open small models of similar size on aggregate evaluations, an unusual outcome for a first release from a team that had not previously published a frontier pretraining run.

SmolLM 2 followed in November 2024 with the same three parameter scales but a substantially expanded training budget. The 1.7 billion parameter SmolLM 2 was trained on 11 trillion tokens, an order of magnitude more than SmolLM 1.7B, and added math and code corpora on top of the SmolLM-Corpus.^[10] The model was also released with chat and instruct variants that used a new post-training recipe built on supervised fine tuning and direct preference optimisation. SmolLM 2 was widely reported as the strongest open 1.7 billion parameter chat model of its time and remained competitive into mid 2025.

By the first half of 2025 the small open model landscape had shifted in three ways. Reasoning models such as DeepSeek R1, OpenAI o1, and Qwen 3 Thinking had popularised the idea that test-time compute spent on visible chain of thought traces could lift small models well above their static benchmark scores. Long context windows of 32,000 to 128,000 tokens had become a standard expectation rather than a niche feature. Multilingual coverage in the European long tail was being added even to small open models by Mistral, Qwen, and Google's Gemma family. SmolLM 3 was conceived as the SmolLM team's response to these three trends at a slightly larger parameter scale, with the explicit goal of producing a 3 billion parameter model that could hold its own against the 4 billion parameter competition while running comfortably on a single 8 to 12 gigabyte GPU at half precision.^[1]

How is SmolLM 3 built?

SmolLM 3 is a dense decoder only Transformer built in the same general family as Meta's Llama 3 and Llama 3.2. The model uses tied input and output embeddings, Grouped-Query Attention with a four to one ratio of query to key value heads, RMSNorm pre-normalisation, and the SwiGLU activation function used by most modern Llama style models. The total parameter count is 3 billion at BF16 precision.^[2]

The most distinctive architectural choice in SmolLM 3 is the use of NoPE, a no positional encoding variant introduced in earlier research on length generalisation. Rather than applying a single positional encoding scheme uniformly across all layers, SmolLM 3 keeps standard rotary position embedding (RoPE) on three quarters of its layers and strips position encoding entirely on the remaining quarter, following a 3 to 1 ratio in which every fourth layer is a NoPE layer.^[1] The Hugging Face team reports that this selective removal improves the model's ability to generalise to context lengths beyond those seen during training without harming short context accuracy.^[1] The RoPE theta is set to 1.5 million during the standard 32,000 token training stage and increased to 5 million for the 64,000 token long context stage.

The tokenizer is derived from Llama 3.2 with the beginning of sequence token removed. The vocabulary is 128,000 tokens, combining the 100,000 tiktoken3 token base used by Llama 3 with roughly 28,000 additional tokens added for non-English language coverage. Training uses intra-document masking to prevent attention bleed between documents packed into the same sequence, and the embedding layers are excluded from weight decay during the optimiser step, a small change that the team found stabilised the early phase of training. Long context support up to 128,000 tokens at inference time is achieved through YaRN rope scaling with a factor of 2.0 over the 65,536 token trained window.^[2]

Specification	Value
Parameters	3 billion
Layers	dense decoder only Transformer
Attention	grouped query, 4 to 1 query to key value ratio
Positional encoding	RoPE on 3 of every 4 layers, NoPE on the remaining layer
Tokenizer	Llama 3.2 base with bos removed, 128k vocabulary
Precision	BF16
Trained context length	64,000 tokens
Maximum context with YaRN	128,000 tokens
Activation	SwiGLU
Normalisation	RMSNorm with pre normalisation

How was SmolLM 3 trained?

The pretraining stack uses 11.2 trillion tokens, split across three stages with progressively more code and mathematics mixed into the diet as the run advances.^[1] The framework is Hugging Face's open source Nanotron trainer and the hardware budget is 384 NVIDIA H100 GPUs for 24 days, with a global batch size of 2.36 million tokens at a 4,096 token sequence length.^[1] The optimiser is AdamW with beta1 of 0.9, beta2 of 0.95, a peak learning rate of 2e-4, and weight decay of 0.1, on a warmup stable decay schedule with 2,000 warmup steps and a linear decay applied over the final 10 per cent of the run.

The stage one mixture, used from 0 to 8 trillion tokens, is 85 per cent web text drawn from FineWeb-Edu, DCLM, FineWeb 2, and FineWeb 2 HQ, with 12 per cent of the web portion in non-English European languages.^[9] The remaining mass is split between 12 per cent code from The Stack v2, StarCoder 2, Jupyter notebooks, Kaggle, and Stack Exchange, and 3 per cent math from FineMath 3 plus and InfiWebMath 3 plus.

The stage two mixture from 8 to 10 trillion tokens lowers the web fraction to 75 per cent, raises code to 15 per cent with the addition of the curated Stack-Edu set, and bumps math to 10 per cent with the higher quality FineMath 4 plus, InfiWebMath 4 plus, and MegaMath subsets. The multilingual share of the web data is preserved at 12 per cent throughout.

The stage three decay mixture from 10 to 11.1 trillion tokens shifts more weight onto code, at 24 per cent, and math, at 13 per cent, with the web fraction dropping to 63 per cent. The OpenMathReasoning dataset is added at this stage to expose the model to step by step mathematical reasoning traces during the final, lowest learning rate phase of the pretraining run. A separate mid-training stage of 100 billion tokens is then used to extend the context from 4,096 first to 32,000 tokens and then to 64,000 tokens. A further reasoning specific mid-training of 35 billion unique tokens, repeated for four epochs to roughly 140 billion total tokens, is sourced from the OpenThoughts3 1.2 million example set and a subset of NVIDIA's Llama-Nemotron reasoning data, all templated in the ChatML format.^[11]^[12]

What languages does SmolLM 3 support?

SmolLM 3 is natively trained on six languages: English, French, Spanish, German, Italian, and Portuguese.^[1] Twelve per cent of the web portion of the pretraining mixture is in non-English European languages throughout all three pretraining stages, which the Hugging Face team identifies as the share that allowed the model to retain strong English benchmark scores while picking up meaningful multilingual capability.^[1] Additional smaller exposure was given during training to Arabic, Chinese, and Russian, which the team describes as supported on a best effort basis with fewer training tokens and correspondingly lower expected quality.^[1]

On the base model the team reports MLMM HellaSwag scores of 63.94 for French, 65.85 for Spanish, and 59.56 for German, putting SmolLM 3 in front of similarly sized open competitors on those languages.^[1] Flores 200 five-shot translation accuracy reaches 62.85 on French, 48.25 on Spanish, and 56.60 on German. Belebele reading comprehension is reported at 51.00 for French. The release blog identifies multilingual European coverage as one of the three main differentiators of SmolLM 3 relative to other 3 billion parameter open models, alongside long context and dual reasoning behaviour.^[1]

What are SmolLM 3's reasoning modes?

Unlike most other small open models, SmolLM 3 ships with a single instruct checkpoint that can switch between two behavioural modes at inference time.^[1]^[2] Extended thinking mode is enabled by default and is also activated by inserting the flag /think in the system prompt. In this mode the model first emits a visible chain of thought inside a dedicated <think>...</think> block before producing the user facing answer. Non-thinking mode is activated by the flag /no_think, in which case the model pre-fills an empty think block and proceeds directly to the answer.^[2] Both modes share the same weights; the switch is encoded entirely in the chat template and the system prompt.

The reasoning behaviour comes from the combination of three post-training stages.^[1] The first is the 140 billion token reasoning mid-training described above, which conditions the model on long form chain of thought traces. The second is supervised fine tuning on a 1.8 billion token mixture, split between 1 billion tokens of conventional non-reasoning conversational data drawn from 12 datasets and 0.8 billion tokens of reasoning data drawn from 10 datasets, all with explicit chain of thought traces, trained for four epochs to roughly 8 billion total tokens. The team uses the best-fit decreasing packing strategy and reports that a significant portion of the SFT data was generated synthetically by prompting Qwen3 32B.

The third stage is preference alignment with Anchored Preference Optimization, a variant of direct preference optimization (DPO) that adds an anchor term to stabilise training on long context preference data.^[1] The non-reasoning preference data is drawn from the public Tulu 3 preference set, while the reasoning preference pairs are synthetically constructed by treating Qwen3 32B completions as the chosen response and Qwen3 0.6B completions as the rejected response. The maximum context length seen during alignment training is 24,000 tokens. The final shipped checkpoint is a linear merge of the APO checkpoint at 0.9 weight and the mid-training checkpoint at 0.1 weight, a step the team added to recover a small amount of base model behaviour that had been over-erased by post-training.^[1]

The practical effect of the dual mode design is most visible on hard reasoning benchmarks. AIME 2025 jumps from 9.3 in non-thinking mode to 36.7 in extended thinking mode, a roughly four-fold improvement.^[1] LiveCodeBench v4 doubles from 15.2 to 30.0, and GPQA Diamond rises from 35.7 to 41.7. The trade-off is latency and token cost, since the thinking trace can run to several thousand tokens before the answer begins.

How does SmolLM 3 perform on benchmarks?

The Hugging Face team publishes evaluations against three different cohorts: base model scores against other 3 billion parameter base models, instruct model scores in non-thinking mode against other instruct models, and extended thinking scores against the Qwen 3 reasoning family at 1.7 billion and 4 billion parameters.^[1] All benchmark figures and the win-rate claims below are reported by Hugging Face.^[1]

Base model English benchmarks

Benchmark	SmolLM 3 3B	Qwen 2.5 3B	Llama 3.2 3B
HellaSwag	76.15	74.19	75.52
ARC CF	65.61	59.81	58.58
MMLU CF	44.13	42.93	41.32
HumanEval Plus	30.48	34.14	25.00
MBPP Plus	52.91	52.11	38.88
RULER 64k	67.85	64.90	72.93

Instruct model non-thinking mode

Benchmark	SmolLM 3 3B	Qwen 2.5 3B Instruct	Llama 3.1 3B Instruct
AIME 2025	9.3	2.9	0.3
GSM-Plus	72.8	74.1	59.2
LiveCodeBench v4	15.2	10.5	3.4
GPQA Diamond	35.7	32.2	29.4
IFEval	76.7	65.6	71.6
BFCL tool calling	92.3	not reported	92.3
Global MMLU	53.5	50.54	46.8

Instruct model extended thinking mode

Benchmark	SmolLM 3 3B	Qwen 3 1.7B	Qwen 3 4B
AIME 2025	36.7	30.7	58.8
GSM-Plus	83.4	79.4	88.2
LiveCodeBench v4	30.0	34.4	52.9
GPQA Diamond	41.7	39.9	55.3

The pattern across these tables is consistent with the team's framing. The base 3B model is competitive with or beats both Qwen 2.5 3B and Llama 3.2 3B on English knowledge and code, while sitting behind Llama 3.2 3B on long context retrieval as measured by RULER 64k. In non-thinking mode the instruct model leads its 3 billion parameter peers on most benchmarks, with particularly large margins on IFEval and AIME 2025. In extended thinking mode the same checkpoint approaches but does not match the 4 billion parameter Qwen 3 4B, while outperforming the smaller Qwen 3 1.7B on three of the four reported benchmarks.

Is SmolLM 3 open source?

The full SmolLM 3 release is published under the Apache License 2.0.^[1]^[2] This applies to the base 3 billion parameter checkpoint, the instruct checkpoint, the published intermediate training checkpoints, the pretraining and post-training datasets that Hugging Face owns or co-curates, and the Nanotron training configurations released alongside the model.^[6] The Apache 2.0 grant permits commercial use, modification, and redistribution provided the licence notice and any patent grants are preserved, and it does not impose acceptable use restrictions or per-user thresholds of the kind found in Meta's Llama Community Licence or Google's Gemma Terms of Use.

Beyond the weights, Hugging Face describes the release as a complete engineering blueprint, stating, "We're releasing SmolLM3 with our engineering blueprint. It includes our architecture details, exact data mixtures showing how we progressively boost performance across domains, and the complete methodology for building hybrid reasoning models."^[1] This full-recipe disclosure, covering the data mixture, the staged curriculum, the synthetic reasoning data, the training logs, and the model merging step, is what distinguishes a "fully open" release from the more common open-weights-only releases of comparable small models.^[1]^[6]^[7]

The team notes in the release blog that a small number of datasets used in the mixture, including portions of OpenThoughts3 and NVIDIA's Llama-Nemotron reasoning subset, carry their own upstream licences and that users redistributing those datasets should consult the original sources.^[11]^[12] The model weights themselves are not subject to any such downstream restriction.

How does SmolLM 3 compare to Llama 3.2 and Qwen?

SmolLM 3 sits in a crowded 3 to 4 billion parameter open weight tier that includes Llama 3.2 3B from Meta, Qwen 2.5 3B and Qwen 3 4B from Alibaba, and Gemma 3 4B from Google. The table below summarises the headline specifications.

Model	Parameters	Released	Context	Languages	Reasoning mode	Licence
SmolLM 3	3B	July 2025	128k via YaRN	6 native, 3 additional	Dual think and no-think	Apache 2.0
Llama 3.2 3B	3B	September 2024	128k	English plus 7 official	No	Llama 3.2 Community
Qwen 2.5 3B	3B	September 2024	32k	29 listed	No	Qwen Research
Qwen 3 4B	4B	April 2025	128k	119 listed	Dual think and no-think	Apache 2.0
Gemma 3 4B	4B	March 2025	128k	35 plus	No	Gemma Terms of Use

On parameter efficiency and licensing SmolLM 3 is the more permissive choice in the 3 billion parameter slot, since Apache 2.0 has fewer restrictions than the Llama 3.2 Community licence and avoids the research only constraint of the original Qwen 2.5 3B release. On capability the team's own win-rate tables show SmolLM 3 outperforming both Llama 3.2 3B and Qwen 2.5 3B on a basket of HellaSwag, ARC, Winogrande, CommonsenseQA, MMLU, MMLU Pro, PIQA, OpenBookQA, GSM8K, MATH, HumanEval Plus, and MBPP Plus.^[1] At the 4 billion parameter tier the Qwen 3 4B model retains an edge on most reasoning benchmarks under extended thinking, while Gemma 3 4B is the strongest of the 4 billion parameter competitors on certain multilingual and image-text tasks, although Gemma 3 ships as a multimodal model and so is not a like-for-like comparison.

Reception

Reception in the open model community was broadly positive. The dual reasoning mode was singled out as the most interesting feature, since it represented one of the first times that a 3 billion parameter open model had shipped with switchable thinking on a single checkpoint rather than as two separate models. The roughly four-fold AIME 2025 lift from non-thinking to extended thinking was widely cited as evidence that the post-training recipe, in particular the 140 billion token reasoning mid-training and the APO alignment step, was doing useful work.

A second talking point was efficiency. The full pretraining run on 384 H100 GPUs for 24 days was an order of magnitude smaller than the budgets reported for some 7 to 8 billion parameter open models, and the model's strong showing against Llama 3.2 3B and Qwen 2.5 3B was read as evidence that the SmolLM team's pretraining recipe was unusually data-efficient.^[1] The release of the training configurations and the SmolTalk 2 post-training dataset on Hugging Face allowed third parties to reproduce sections of the run within hours of the announcement.^[5]^[6]

The 128k context length, while standard for late 2025 frontier models, was noted as a genuine novelty at the 3 billion parameter scale, where 32,000 token windows had still been common. Quantised versions of SmolLM 3 in GGUF and MLX formats appeared on Hugging Face within days of release, and integrations with llama.cpp, Ollama, LM Studio, and Jan were live in the first week. Several reviewers noted that the choice of a single chat template covering both reasoning and non-reasoning behaviour made the model unusually easy to deploy compared with families that ship separate base, instruct, and reasoning checkpoints.

Criticism focused on three points. The model trails the slightly larger Qwen 3 4B on most extended thinking benchmarks, which several reviewers said was the more meaningful peer comparison given that the two models share a similar dual-mode design and licence. Languages outside the six natively supported ones receive only limited exposure during pretraining, which the team itself flagged as a known limitation.^[1] And while the base 3 billion parameter model is competitive on RULER at 64,000 tokens, the gap to Llama 3.2 3B on the same benchmark suggests that Meta's long context recipe still has some advantage in needle-in-a-haystack style retrieval tasks.

ELI5: SmolLM 3 in plain terms

SmolLM 3 is a small artificial intelligence that can read and write text. "Small" here means it has about 3 billion adjustable knobs inside it, which is tiny compared with the biggest chatbots that have hundreds of billions, so it can run on a single ordinary graphics card or even a beefy laptop instead of needing a data centre. Even though it is small, it learned from a huge amount of text (about 11 trillion words and word-pieces), so it knows a lot. It can read very long documents (up to roughly 100 pages at once), it speaks six languages, and it has a special switch: in "think" mode it works through a problem step by step out loud before answering, which makes it better at hard maths and coding puzzles, and in "no-think" mode it just answers quickly. Hugging Face, the company that made it, gave away not only the finished model but also the full instructions for how they built it, so anyone can use it for free or even rebuild their own copy.

References

Bakouch, Elie; Ben Allal, Loubna; Lozhkov, Anton; Tazi, Nouamane; Tunstall, Lewis et al. "SmolLM3: smol, multilingual, long-context reasoner." Hugging Face Blog. July 8, 2025. https://huggingface.co/blog/smollm3 ↩
Hugging Face. "HuggingFaceTB/SmolLM3-3B." Hugging Face model card. https://huggingface.co/HuggingFaceTB/SmolLM3-3B ↩
Hugging Face. "HuggingFaceTB/SmolLM3-3B-Base." Hugging Face model card. https://huggingface.co/HuggingFaceTB/SmolLM3-3B-Base
Hugging Face. "SmolLM3 pretraining datasets collection." Hugging Face. https://huggingface.co/collections/HuggingFaceTB/smollm3-pretraining-datasets-685a7353fdc01aecde51b1d9
Hugging Face. "SmolTalk 2 post-training dataset." https://huggingface.co/datasets/HuggingFaceTB/smoltalk2 ↩
Hugging Face. "smollm3-configs training configurations." https://huggingface.co/datasets/HuggingFaceTB/smollm3-configs ↩
Hugging Face. "SmolLM3 training logs on Weights and Biases." https://wandb.ai/huggingface/SmolLM3-training-logs ↩
Hugging Face. "smollm GitHub repository." https://github.com/huggingface/smollm
Penedo, Guilherme; Kydlicek, Hynek; Ben Allal, Loubna et al. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." Hugging Face, 2024. ↩
Allal, Loubna Ben; Lozhkov, Anton et al. "SmolLM2: When Smol Goes Big." Hugging Face Blog, November 2024. https://huggingface.co/blog/smollm2 ↩
NVIDIA. "Llama-Nemotron reasoning dataset." Hugging Face. https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset ↩
Open Thoughts Team. "OpenThoughts3 reasoning corpus." 2025. https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

Best Small Language Models Jamba2 Jet-Nemotron SmolLM SmolLM 2

What is SmolLM 3?

How is SmolLM 3 built?

How was SmolLM 3 trained?

What languages does SmolLM 3 support?

What are SmolLM 3's reasoning modes?

How does SmolLM 3 perform on benchmarks?

Base model English benchmarks

Instruct model non-thinking mode

Instruct model extended thinking mode

Is SmolLM 3 open source?

How does SmolLM 3 compare to Llama 3.2 and Qwen?

Reception

ELI5: SmolLM 3 in plain terms

See also

References

Improve this article

Related Articles

Phi-3

Phi-4

Gemma 2

Gemma 3

Phi-4-mini

Phi-4-mini-flash-reasoning

What links here

Related Articles

Phi-3

Phi-4

Gemma 2

Gemma 3

Phi-4-mini

Phi-4-mini-flash-reasoning

What links here