Minitron
Last reviewed: Jun 3, 2026
Sources: 8 citations
Review status: Source-backed
Revision: v1 · 1,622 words
Minitron is a family of compact language models produced by NVIDIA, together with the model compression methodology used to create them. The core idea is to take an already pretrained large language model and shrink it into one or more smaller models through structured pruning (removing parts of the network along its width, its depth, or both) followed by lightweight retraining via knowledge distillation, rather than training each smaller model from scratch. The approach was introduced in the 2024 paper "Compact Language Models via Pruning and Knowledge Distillation" and refined in a follow-up report, "LLM Pruning and Distillation in Practice: The Minitron Approach." Its headline result is that a smaller model derived this way can match or exceed a model trained conventionally while using up to roughly 40 times fewer training tokens. [1][2]
The original Minitron 8B and 4B models were derived from NVIDIA's Nemotron-4 15B base model. The same recipe was later applied to third-party open models, producing Llama-3.1-Minitron-4B from Meta's Llama 3.1 8B and Mistral-NeMo-Minitron-8B (also written MN-Minitron-8B) from the Mistral NeMo 12B model jointly built by NVIDIA and Mistral AI. [3][4][5]
Model families such as Llama, Gemma, and Nemotron are usually offered in several sizes so that users can trade off accuracy against inference cost. Producing each size by independent pretraining is extremely compute intensive, because every variant consumes its own trillions of training tokens. Minitron reframes the problem: a single large model is trained once, and the smaller members of the family are obtained by compressing that model and refreshing it on only a small slice of additional data (less than 3 percent of a full pretraining run in the original experiments). This amortizes the dominant cost of pretraining across the whole family. [1]
The Minitron recipe has three main components: importance estimation, structured pruning, and distillation-based retraining.
Importance estimation. Before anything is removed, the method runs a small calibration set (a few hundred batches) through the model and measures the contribution, or "importance," of individual structural elements. Importance is computed in a purely activation-based manner for several axes of the network at once: neurons in the multilayer perceptron (MLP) blocks, attention heads, embedding (hidden) channels, and entire transformer layers (depth). Because the scores are derived from forward-pass activations, the procedure is inexpensive and avoids the cost of gradient-based or search-based importance ranking. [1]
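The activation-based scoring described above can be illustrated with a minimal NumPy sketch. The function names, the aggregation choice (mean of absolute activations over the sequence, then an L2 norm over the batch), and the cosine-based layer score are assumptions made for illustration; they are not taken from NVIDIA's released code.

```python
import numpy as np

def mlp_neuron_importance(x, w_in):
    """Score each MLP neuron using forward-pass activations only.

    x:    calibration inputs, shape (batch, seq, hidden)
    w_in: first MLP projection, shape (hidden, ffn)
    Returns an array of shape (ffn,); larger means more important.
    """
    act = x @ w_in                          # (batch, seq, ffn)
    per_sample = np.abs(act).mean(axis=1)   # aggregate over sequence -> (batch, ffn)
    return np.linalg.norm(per_sample, axis=0)  # aggregate over batch -> (ffn,)

def layer_importance(layer_in, layer_out):
    """Assumed depth score: 1 - mean cosine similarity between a layer's
    input and output. A layer that barely changes its input scores near 0."""
    num = (layer_in * layer_out).sum(axis=-1)
    den = np.linalg.norm(layer_in, axis=-1) * np.linalg.norm(layer_out, axis=-1)
    return float(1.0 - (num / den).mean())

# Hypothetical calibration batch: 4 sequences of 6 tokens, hidden size 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8))
w_in = rng.normal(size=(8, 16))
scores = mlp_neuron_importance(x, w_in)
print(scores.shape)  # (16,)
```

No gradients are computed anywhere in this sketch, which is the property that keeps this style of importance estimation cheap.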
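Structured pruning. Using these rankings, the least important components are deleted to reach a target architecture. The method distinguishes two regimes: width pruning, which removes MLP neurons, attention heads, and embedding channels while keeping every layer, and depth pruning, which removes entire transformer layers.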
Width and depth pruning can be combined, and the paper conducts a structured search over candidate compressed architectures to choose a good configuration for a given parameter budget. [1]
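As a rough illustration of how importance rankings turn into a smaller architecture, the sketch below trims one MLP block along its width and drops whole layers along depth. The helper names and the simple top-k selection are assumptions; the architecture search described in the paper is not reproduced here.

```python
import numpy as np

def prune_mlp_width(w_in, w_out, scores, keep):
    """Keep the `keep` highest-scoring MLP neurons.

    w_in:   (hidden, ffn) input projection
    w_out:  (ffn, hidden) output projection
    scores: (ffn,) importance per neuron
    """
    if not 0 < keep <= len(scores):
        raise ValueError("keep must be between 1 and the number of neurons")
    # Top-k indices, re-sorted so surviving neurons keep their original order.
    idx = np.sort(np.argsort(scores)[-keep:])
    return w_in[:, idx], w_out[idx, :], idx

def prune_depth(layers, scores, keep):
    """Keep the `keep` highest-scoring layers, preserving their order."""
    if not 0 < keep <= len(layers):
        raise ValueError("keep must be between 1 and the number of layers")
    idx = sorted(np.argsort(scores)[-keep:].tolist())
    return [layers[i] for i in idx]

# Hypothetical toy block: hidden size 4, six MLP neurons, keep three.
w_in = np.ones((4, 6))
w_out = np.ones((6, 4))
scores = np.array([0.1, 5.0, 0.2, 3.0, 4.0, 0.05])
new_in, new_out, kept = prune_mlp_width(w_in, w_out, scores, keep=3)
print(kept.tolist())  # [1, 3, 4]
```

Because whole rows and columns are removed, the result is an ordinary dense model with smaller matrices, so it needs no sparse kernels at inference time.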
Retraining by distillation. Pruning damages accuracy, so the compressed model is retrained. Instead of standard next-token training on hard labels, Minitron uses knowledge distillation in which the original uncompressed model acts as the teacher and the pruned model as the student; the student is trained to match the teacher's output probabilities and intermediate states. Distillation lets the smaller model recover most of its lost quality using only a small fraction of the tokens that pretraining from scratch would require. A practical refinement introduced in the follow-up report is teacher correction: when the original training data is unavailable, the teacher is first lightly fine-tuned on the new distillation dataset to correct for distribution shift before it supervises the student. [1][3]
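The output-matching part of this objective can be sketched in plain Python. The example assumes a forward KL divergence between teacher and student token distributions and an optional softmax temperature; the intermediate-state terms mentioned above are left out, and none of the names come from NVIDIA's code.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student), averaged over token positions.

    Both arguments are lists of rows, one row of logits per token position.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        t_logp = log_softmax(t_row, temperature)
        s_logp = log_softmax(s_row, temperature)
        total += sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))
    return total / len(teacher_logits)

# Hypothetical logits for two token positions over a four-word vocabulary.
teacher = [[2.0, 0.5, -1.0, 0.0], [0.1, 0.2, 3.0, -2.0]]
student = [[1.5, 0.7, -0.5, 0.0], [0.0, 0.0, 2.0, -1.0]]
print(kd_loss(teacher, teacher))      # 0.0: identical distributions
print(kd_loss(teacher, student) > 0)  # True
```

Unlike a hard-label cross-entropy loss, every vocabulary entry contributes a training signal at every position, which is one commonly cited reason distillation can recover quality from fewer tokens.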
The general workflow is therefore: optionally correct the teacher, estimate importance, prune to the target size, then distill. NVIDIA released both the model weights on Hugging Face and example code (including a NeMo-based pipeline) so the procedure can be reproduced. [2][6]
In the first paper, NVIDIA compressed its Nemotron-4 15B model into 8B and 4B variants. Deriving the 8B and 4B models this way required up to 40 times fewer training tokens per model than training them from scratch, and producing the full 15B, 8B, and 4B family cost about 1.8 times less compute than training all three independently. The compressed models showed up to a 16 percent improvement in MMLU score relative to equivalently sized models trained from scratch, and performed comparably to contemporaneous community models such as Mistral 7B, Gemma 7B, and Llama-3 8B while outperforming other published compression techniques. The Minitron-4B base model uses a hidden size of 3072, 32 attention heads, an MLP intermediate dimension of 9216, grouped-query attention, and rotary position embeddings, and was retrained on roughly 94 billion distillation tokens drawn from the Nemotron-4 pretraining corpus. [1][7]
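A back-of-the-envelope calculation shows how a 40-fold token reduction for the smaller models translates into a family-level saving of roughly this size. The sketch rests on assumptions that are not from the paper: training cost is taken as proportional to parameters times tokens, every from-scratch model is given the same token budget, each pruned model uses one fortieth of that budget, and the cost of running the teacher during distillation is ignored.

```python
def family_savings(teacher_b, student_bs, token_reduction=40.0):
    """Ratio of from-scratch cost to prune-and-distill cost for a model family.

    teacher_b:       teacher size in billions of parameters
    student_bs:      sizes of the smaller models, in billions of parameters
    token_reduction: assumed factor by which student training tokens shrink

    Cost is approximated as params * tokens; the shared token budget cancels,
    so only the parameter counts appear below.
    """
    scratch = teacher_b + sum(student_bs)
    compressed = teacher_b + sum(s / token_reduction for s in student_bs)
    return scratch / compressed

print(round(family_savings(15, [8, 4]), 2))  # 1.76
```

Under these simplifications the 15B, 8B, and 4B family comes out at about 1.76 times, close to the reported figure of about 1.8; calling `family_savings(12, [8, 4])` gives about 1.95. The estimate is only a plausibility check, not the accounting used in the paper.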
The "Minitron Approach" follow-up report applied the recipe to two widely used open models, using NVIDIA's own pretraining corpus for distillation because the original training data was not available. [3]
Llama-3.1-Minitron-4B was distilled from Meta's Llama 3.1 8B. NVIDIA first ran teacher correction by fine-tuning the unpruned 8B model on 94 billion tokens, then produced two student variants. The depth-pruned variant removed 16 of the 32 layers (about 50 percent), guided by which layers least hurt downstream accuracy. The width-pruned variant kept all 32 layers but cut the hidden size from 4096 to 3072 and the MLP intermediate dimension from 14336 to 9216. Each pruned model was retrained with distillation on 94 billion tokens. The depth-pruned variant was the fastest, reaching roughly 2.7 times the inference throughput of Llama 3.1 8B on an NVIDIA H100 80 GB GPU. The width-pruned model scored about 60.5 on 5-shot MMLU and 41.2 on GSM8K, competitive with or ahead of small models such as Phi-2 2.7B, Gemma2 2.6B, and Qwen2-1.5B despite those models being trained on far more data. [3][4]
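The effect of the two pruning choices on model size can be estimated with a short parameter count. The sketch uses the layer counts, hidden sizes, and MLP dimensions given above; the remaining values (a vocabulary of 128,256 tokens, 32 query heads and 8 key-value heads of dimension 128, untied input and output embeddings, a three-matrix gated MLP, no bias terms) are assumptions based on the published Llama 3.1 8B configuration, and the sketch further assumes that width pruning leaves the attention head layout unchanged.

```python
def decoder_params(hidden, layers, ffn,
                   q_heads=32, kv_heads=8, head_dim=128, vocab=128256):
    """Approximate parameter count of a Llama-style decoder (assumed layout)."""
    attn = 2 * hidden * q_heads * head_dim    # query and output projections
    attn += 2 * hidden * kv_heads * head_dim  # key and value projections
    mlp = 3 * hidden * ffn                    # gate, up, and down projections
    norms = 2 * hidden                        # two norm weight vectors per layer
    embeddings = 2 * vocab * hidden           # untied input and output tables
    return layers * (attn + mlp + norms) + embeddings + hidden  # + final norm

teacher = decoder_params(hidden=4096, layers=32, ffn=14336)
width = decoder_params(hidden=3072, layers=32, ffn=9216)
depth = decoder_params(hidden=4096, layers=16, ffn=14336)

for name, n in [("teacher", teacher), ("width", width), ("depth", depth)]:
    print(f"{name}: {n / 1e9:.2f}B")
# teacher: 8.03B
# width: 4.51B
# depth: 4.54B
```

Under these assumptions both students land near 4.5 billion parameters. The embedding tables account for a noticeable share of that total: width pruning shrinks them along with the hidden size, while depth pruning leaves them untouched.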
Mistral-NeMo-Minitron-8B (MN-Minitron-8B) was distilled from the Mistral NeMo 12B base model. NVIDIA used width-only pruning, reducing the hidden size from 5120 to 4096 and the MLP intermediate dimension from 14336 to 11520, then retrained with knowledge distillation on 380 billion tokens. The resulting model has a hidden size of 4096, 32 attention heads, an MLP intermediate dimension of 11520, and 40 layers, with grouped-query attention and rotary position embeddings. NVIDIA stated that, for a model of its size, Mistral-NeMo-Minitron-8B leads on nine popular language-model benchmarks, and at release it was positioned as a state-of-the-art small language model. The base model scored about 69.5 on 5-shot MMLU and 58.5 on GSM8K; an instruction-tuned variant reached about 70.4 on MMLU and 87.1 on GSM8K. Optimized with TensorRT-LLM on an H100 GPU, it delivered about 1.2 times the throughput of Mistral NeMo 12B. NVIDIA reported that, across a 12B, 8B, and 4B family, the pruning-and-distillation approach yielded up to about 1.95 times compute-cost savings versus training each model from scratch. [3][5][8]
The table below lists the principal Minitron releases. Distillation tokens refer to the data used to retrain the pruned student model.
| Minitron model | Base (teacher) model | Pruning type | Params | Distillation tokens | Origin |
|---|---|---|---|---|---|
| Minitron-8B | Nemotron-4 15B | width | ~8B | not separately stated | First Minitron paper |
| Minitron-4B | Nemotron-4 15B | width | ~4B | ~94B | First Minitron paper |
| Llama-3.1-Minitron-4B-Width | Llama 3.1 8B | width | ~4B (5B on disk) | 94B | Minitron Approach report |
| Llama-3.1-Minitron-4B-Depth | Llama 3.1 8B | depth (16 of 32 layers) | ~4B | 94B | Minitron Approach report |
| Mistral-NeMo-Minitron-8B | Mistral NeMo 12B | width | 8B | 380B | Minitron Approach report |
Selected benchmark scores for the base models (5-shot MMLU and 0-shot GSM8K, as reported by NVIDIA):
| Model | MMLU (5-shot) | GSM8K |
|---|---|---|
| Llama-3.1-Minitron-4B-Width-Base | 60.5 | 41.2 |
| Mistral-NeMo-Minitron-8B-Base | 69.5 | 58.5 |
| Mistral-NeMo-Minitron-8B-Instruct | 70.4 | 87.1 |
The Minitron weights are distributed on Hugging Face. The Nemotron-derived Minitron models were released under a research-oriented license, while the later Llama-3.1-Minitron and Mistral-NeMo-Minitron base models are covered by the NVIDIA Open Model License, which permits commercial use. The compression code and recipes are published in NVIDIA's public Minitron repository and integrated with the NeMo framework. [4][5][6]
Minitron demonstrated that structured pruning combined with distillation is a practical, repeatable way to populate an entire model family from a single large checkpoint at a small fraction of the usual compute. By showing that the recipe transfers cleanly from NVIDIA's own Nemotron models to third-party models such as Llama 3.1 and Mistral NeMo, the work helped popularize "compress, then distill" as a standard tool for building efficient small language models. The techniques carried forward into NVIDIA's later efficient-model efforts, including the Nemotron-Nano line, and sit alongside related NVIDIA model programs such as Llama Nemotron. [1][3]