# Minitron

> Source: https://aiwiki.ai/wiki/minitron
> Updated: 2026-06-28
> Categories: NVIDIA, Small Language Models, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Minitron** is a family of compact language models from [NVIDIA](/wiki/nvidia), together with the model-compression method used to build them: take a single, already pretrained large language model and shrink it into smaller models by combining **structured pruning** (removing parts of the network along its width and depth) with lightweight retraining via **knowledge distillation**, instead of training each smaller model from scratch. NVIDIA introduced the technique in the 2024 paper "Compact Language Models via Pruning and Knowledge Distillation," reporting that a model derived this way needs up to 40 times fewer training tokens than an equivalent model trained from scratch, while matching or exceeding its accuracy. [1][2]

The approach was refined in a follow-up report, "LLM Pruning and Distillation in Practice: The Minitron Approach." The original Minitron 8B and 4B models were derived from NVIDIA's [Nemotron](/wiki/nemotron_3)-4 15B base model. The same recipe was later applied to third-party open models, producing Llama-3.1-Minitron-4B from Meta's [Llama 3.1](/wiki/llama_3_1) 8B and Mistral-NeMo-Minitron-8B (also written MN-Minitron-8B) from the [Mistral](/wiki/mistral_ai) NeMo 12B model jointly built by NVIDIA and Mistral AI. [3][4][5]

## What is Minitron?

Minitron is both a method and a set of models. As a method, it is a recipe for producing a small, high-accuracy language model by compressing a larger "teacher" model rather than pretraining the smaller "student" from scratch. As models, Minitron refers to the specific compressed checkpoints NVIDIA released: Minitron 8B and 4B (from Nemotron-4 15B), Llama-3.1-Minitron-4B (from Llama 3.1 8B), and Mistral-NeMo-Minitron-8B (from Mistral NeMo 12B). The name reflects the goal of a "mini" Nemotron: a much smaller model that retains most of the capability of its larger parent. [1][3]

## Why was Minitron created?

Model families such as Llama, Gemma, and Nemotron are usually offered in several sizes so that users can trade off accuracy against inference cost. Producing each size by independent pretraining is extremely compute-intensive, because every variant consumes its own trillions of training tokens. Minitron reframes the problem: a single large model is trained once, and the smaller members of the family are obtained by compressing that model and retraining it on only a small slice of additional data (less than 3 percent of a full pretraining run in the original experiments). This amortizes the dominant cost of pretraining across the whole family. [1]

## How does pruning plus distillation work?

The Minitron recipe has three main components: importance estimation, structured pruning, and distillation-based retraining.

**Importance estimation.** Before anything is removed, the method runs a small calibration set (a few hundred batches) through the model and measures the contribution, or "importance," of individual structural elements. Importance is computed in a purely activation-based manner for several axes of the network at once: neurons in the multilayer perceptron (MLP) blocks, attention heads, embedding (hidden) channels, and entire transformer layers (depth). Because the scores are derived from forward-pass activations, the procedure is inexpensive and avoids the cost of gradient-based or search-based importance ranking. [1]
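As a rough illustration of activation-based scoring, the sketch below ranks the neurons of a toy MLP layer from recorded calibration activations. The function name `neuron_importance`, the toy data, and the aggregation rule (mean absolute activation over tokens, then an L2 norm over samples) are assumptions made for this example; they show the general idea rather than the exact metric used in the paper.

```python
import math
import random

def neuron_importance(activations):
    """Score neurons from activations[b][t][n]: sample b, token t, neuron n.

    Hypothetical aggregation: mean |activation| over tokens, then L2 norm
    over calibration samples. Only forward-pass values are needed.
    """
    num_neurons = len(activations[0][0])
    scores = []
    for n in range(num_neurons):
        # Mean absolute activation over the tokens of each calibration sample.
        per_sample = [
            sum(abs(tok[n]) for tok in sample) / len(sample)
            for sample in activations
        ]
        # L2 norm across samples collapses to one score per neuron.
        scores.append(math.sqrt(sum(v * v for v in per_sample)))
    return scores

random.seed(0)
# Toy calibration activations: 8 samples, 16 tokens, 6 neurons (post-ReLU).
acts = [[[max(random.gauss(0.0, 1.0), 0.0) for _ in range(6)]
         for _ in range(16)] for _ in range(8)]
# Force neuron 2 to be dead so that it should rank last.
for sample in acts:
    for tok in sample:
        tok[2] = 0.0

scores = neuron_importance(acts)
# Neuron indices ordered from most to least important.
ranking = sorted(range(len(scores)), key=lambda n: scores[n], reverse=True)
print(ranking)
```

Because no gradients are computed, the cost is a handful of forward passes plus a reduction over the stored activations.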

**Structured pruning.** Using these rankings, the least important components are deleted to reach a target architecture. The method distinguishes two regimes:

- **Width pruning** trims the model along its hidden dimensions: it reduces the MLP intermediate size, the number of attention heads or query groups, and the embedding dimension, while keeping the number of layers fixed.
- **Depth pruning** removes whole transformer layers, leaving the per-layer width unchanged.

Width and depth pruning can be combined, and the paper conducts a structured search over candidate compressed architectures to choose a good configuration for a given parameter budget. [1]
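The two regimes can be sketched on plain nested lists. The helpers `prune_mlp_width` and `prune_depth` below are hypothetical names for this example, and the importance scores are assumed to come from a separate estimation step. Only MLP neurons and whole layers are handled here; attention heads and embedding channels would be trimmed in the same slice-by-index fashion.

```python
def top_indices(scores, keep):
    """Indices of the `keep` highest scores, returned in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

def prune_mlp_width(w_in, w_out, scores, keep):
    """Width pruning of one MLP block.

    w_in  : d_model rows x d_ff columns (input projection)
    w_out : d_ff rows x d_model columns (output projection)
    Dropping neuron n removes column n of w_in and row n of w_out.
    """
    kept = top_indices(scores, keep)
    new_w_in = [[row[n] for n in kept] for row in w_in]
    new_w_out = [w_out[n] for n in kept]
    return new_w_in, new_w_out, kept

def prune_depth(layers, layer_scores, keep):
    """Depth pruning: drop whole layers, leaving the survivors unchanged."""
    return [layers[i] for i in top_indices(layer_scores, keep)]

# Toy MLP with d_model = 2 and d_ff = 4, pruned to d_ff = 2.
w_in = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
w_out = [[1, 0], [0, 1], [1, 1], [2, 2]]
new_w_in, new_w_out, kept = prune_mlp_width(w_in, w_out, [0.9, 0.1, 0.5, 0.7], keep=2)

# Toy 4-layer stack pruned to 2 layers.
remaining = prune_depth(["L0", "L1", "L2", "L3"], [0.8, 0.2, 0.1, 0.9], keep=2)
print(kept, remaining)
```

Keeping the surviving indices in their original order matters for depth pruning, since the remaining layers must still run in sequence.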

**Retraining by distillation.** Pruning damages accuracy, so the compressed model is retrained. Instead of standard next-token training on hard labels, Minitron uses knowledge distillation in which the original uncompressed model acts as the teacher and the pruned model as the student; the student is trained to match the teacher's output probabilities and intermediate states. Distillation lets the smaller model recover most of its lost quality using only a small fraction of the tokens that pretraining from scratch would require. A practical refinement introduced in the follow-up report is **teacher correction**: when the original training data is unavailable, the teacher is first lightly fine-tuned on the new distillation dataset to correct for distribution shift before it supervises the student. [1][3]
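A minimal sketch of the output-matching part of this objective follows. It computes a forward KL divergence between teacher and student token distributions, which is one common choice of distillation loss; the function names, the temperature parameter, and the toy logits are assumptions for this example, and the intermediate-state matching term is omitted.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over one row of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student), averaged over token positions.

    Each argument is a list of per-position logit rows over the vocabulary.
    """
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        p = softmax(t_row, temperature)  # soft targets from the teacher
        q = softmax(s_row, temperature)  # student predictions
        total += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(teacher_logits)

# Two token positions over a toy vocabulary of three entries.
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

loss_mismatch = kd_loss(teacher, student)
loss_match = kd_loss(teacher, teacher)
print(loss_mismatch, loss_match)
```

Unlike a hard-label cross-entropy, this loss rewards the student for reproducing the teacher's full distribution at every position, which is why each training token carries more signal.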

The general workflow is therefore: optionally correct the teacher, estimate importance, prune to the target size, then distill. NVIDIA released both the model weights on Hugging Face and example code (including a NeMo-based pipeline) so the procedure can be reproduced. [2][6]

## How efficient is Minitron?

The central claim of the work is a large reduction in the data and compute needed to populate a model family. NVIDIA's first paper states: "Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch." [1] Producing the full 15B, 8B, and 4B family this way reduced total compute cost by a factor of about 1.8 compared with training all three models independently. [1]
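The family-level saving can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes that training compute scales with parameters times tokens, that every from-scratch model would see the same number of tokens, and that each derived model needs 1/40 of that budget; it ignores the teacher's forward passes during distillation and the cost of importance estimation. The function name is hypothetical and this is an estimate, not the paper's accounting.

```python
def family_compute_saving(teacher_b, student_bs, token_fraction):
    """Ratio of from-scratch compute to prune-and-distill compute.

    Sizes are in billions of parameters. Compute is approximated as
    parameters x tokens, in units of one full pretraining token budget.
    """
    from_scratch = teacher_b + sum(student_bs)
    compressed = teacher_b + token_fraction * sum(student_bs)
    return from_scratch / compressed

# 15B teacher with 8B and 4B students (first Minitron paper).
saving_15b = family_compute_saving(15, [8, 4], 1 / 40)
# 12B teacher with 8B and 4B students (family discussed later in this article).
saving_12b = family_compute_saving(12, [8, 4], 1 / 40)
print(round(saving_15b, 2), round(saving_12b, 2))  # prints: 1.76 1.95
```

Under these assumptions the estimate lands close to the reported figures of about 1.8x and 1.95x, which shows that almost all remaining cost is the one-time pretraining of the largest model.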

Efficiency does not come at the expense of quality. The compressed Nemotron-derived models showed up to a 16 percent improvement in MMLU score relative to equivalently sized models trained from scratch. [1] On the deployment side, the Llama-3.1-Minitron-4B depth-pruned variant reaches roughly 2.7 times the inference throughput of Llama 3.1 8B, and the width-pruned variant about 1.8 times, on a single NVIDIA H100 80 GB GPU in FP8. [3][4] The table below summarizes the headline efficiency figures NVIDIA reported.

| Efficiency metric | Reported figure | Source |
| --- | --- | --- |
| Training tokens per derived model vs from scratch | up to 40x fewer | Minitron paper [1] |
| Additional data used to retrain a pruned model | less than 3 percent of a full pretraining run | Minitron paper [1] |
| Compute to build the 15B/8B/4B family | about 1.8x less | Minitron paper [1] |
| MMLU gain vs same-size model trained from scratch | up to 16 percent | Minitron paper [1] |
| Llama-3.1-Minitron-4B (depth) inference throughput | about 2.7x vs Llama 3.1 8B | NVIDIA blog [4] |
| Mistral-NeMo-Minitron-8B inference throughput | about 1.2x vs Mistral NeMo 12B | NVIDIA blog [5] |

## What were the original Minitron models (from Nemotron-4 15B)?

In the first paper, NVIDIA compressed its Nemotron-4 15B model into 8B and 4B variants, achieving the token savings, compute savings, and MMLU gains summarized in the efficiency section above. The compressed models performed comparably to contemporaneous community models such as Mistral 7B, Gemma 7B, and Llama-3 8B while outperforming other published compression techniques. The Minitron-4B base model uses a hidden size of 3072, 32 attention heads, an MLP intermediate dimension of 9216, grouped-query attention, and rotary position embeddings, and was retrained on roughly 94 billion distillation tokens drawn from the Nemotron-4 pretraining corpus. [1][7]

## How was Minitron applied to Llama 3.1 and Mistral NeMo?

The "Minitron Approach" follow-up report applied the recipe to two widely used open models, using NVIDIA's own pretraining corpus for distillation because the original training data was not available. [3]

**Llama-3.1-Minitron-4B** was distilled from Meta's Llama 3.1 8B. NVIDIA first ran teacher correction by fine-tuning the unpruned 8B model on 94 billion tokens, then produced two student variants. The depth-pruned variant removed 16 of the 32 layers (50 percent), selecting the layers whose removal hurt downstream accuracy the least. The width-pruned variant kept all 32 layers but cut the hidden size from 4096 to 3072 and the MLP intermediate dimension from 14336 to 9216. Each pruned model was retrained with distillation on 94 billion tokens. The depth-pruned variant was the fastest, reaching roughly 2.7 times the inference throughput of Llama 3.1 8B on an NVIDIA H100 80 GB GPU. The width-pruned model scored about 60.5 on 5-shot MMLU and 41.2 on GSM8K, competitive with or ahead of small models such as Phi-2 2.7B, Gemma2 2.6B, and Qwen2-1.5B despite those models being trained on far more data. [3][4]
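To see how these dimension changes translate into model size, the sketch below counts parameters for a Llama-style decoder. Several inputs are assumptions taken from the public Llama 3.1 8B configuration rather than from this article: a vocabulary of 128,256 tokens, 32 query heads and 8 key-value heads of dimension 128, a three-matrix gated MLP, and untied input and output embeddings. It also assumes the width-pruned model leaves the attention heads unchanged. The helper `decoder_params` is a hypothetical name.

```python
def decoder_params(vocab, d_model, n_layers, n_heads, n_kv_heads, head_dim, d_ff):
    """Approximate parameter count of a Llama-style decoder (no biases)."""
    attn = d_model * head_dim * (n_heads + 2 * n_kv_heads)  # Q, K, V projections
    attn += n_heads * head_dim * d_model                     # output projection
    mlp = 3 * d_model * d_ff                                 # gate, up, down matrices
    norms = 2 * d_model                                      # two norms per layer
    per_layer = attn + mlp + norms
    embeddings = 2 * vocab * d_model                         # untied input and output
    return n_layers * per_layer + embeddings + d_model       # plus the final norm

# Assumed shared settings (not stated in this article).
common = dict(vocab=128256, n_heads=32, n_kv_heads=8, head_dim=128)

teacher = decoder_params(d_model=4096, n_layers=32, d_ff=14336, **common)
width_pruned = decoder_params(d_model=3072, n_layers=32, d_ff=9216, **common)
depth_pruned = decoder_params(d_model=4096, n_layers=16, d_ff=14336, **common)

for name, count in [("teacher", teacher), ("width", width_pruned), ("depth", depth_pruned)]:
    print(f"{name}: {count / 1e9:.2f}B parameters")
```

Under these assumptions both students come out near 4.5 billion parameters. The embedding tables are untouched by depth pruning and only partly reduced by width pruning, which is why halving the layer count does not halve the total.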

**Mistral-NeMo-Minitron-8B** (MN-Minitron-8B) was distilled from the Mistral NeMo 12B base model. NVIDIA first fine-tuned the unpruned Mistral NeMo 12B teacher on 127 billion tokens for teacher correction, then applied width-only pruning, reducing the hidden size from 5120 to 4096 and the MLP intermediate dimension from 14336 to 11520, and retrained with knowledge distillation on 380 billion tokens. The resulting model has a hidden size of 4096, 32 attention heads, an MLP intermediate dimension of 11520, and 40 layers, with grouped-query attention and rotary position embeddings. NVIDIA stated that Mistral-NeMo-Minitron-8B leads models of its size on nine popular language-model benchmarks, and at release positioned it as a state-of-the-art small language model. The base model scored about 69.5 on 5-shot MMLU and 58.5 on GSM8K; an instruction-tuned variant reached about 70.4 on MMLU and 87.1 on GSM8K. The distilled student even surpassed its 12B teacher on GSM8K (58.5 versus 55.7) and HumanEval (36.2 versus 23.8). Optimized with TensorRT-LLM on an H100 GPU, it delivered about 1.2 times the throughput of Mistral NeMo 12B. NVIDIA reported that, across a 12B, 8B, and 4B family, the pruning-and-distillation approach yielded compute-cost savings of up to about 1.95 times versus training each model from scratch. [3][5][8]

## Model summary

The table below lists the principal Minitron releases. Distillation tokens refer to the data used to retrain the pruned student model.

| Minitron model | Base (teacher) model | Pruning type | Params | Distillation tokens | Origin |
| --- | --- | --- | --- | --- | --- |
| Minitron-8B | Nemotron-4 15B | width | ~8B | not separately stated | First Minitron paper |
| Minitron-4B | Nemotron-4 15B | width | ~4B | ~94B | First Minitron paper |
| Llama-3.1-Minitron-4B-Width | Llama 3.1 8B | width | ~4B (5B on disk) | 94B | Minitron Approach report |
| Llama-3.1-Minitron-4B-Depth | Llama 3.1 8B | depth (16 of 32 layers) | ~4B | 94B | Minitron Approach report |
| Mistral-NeMo-Minitron-8B | Mistral NeMo 12B | width | 8B | 380B | Minitron Approach report |

Selected benchmark scores for the base and instruction-tuned models (5-shot MMLU and 0-shot GSM8K, as reported by NVIDIA):

| Model | MMLU (5-shot) | GSM8K |
| --- | --- | --- |
| Llama-3.1-Minitron-4B-Width-Base | 60.5 | 41.2 |
| Mistral-NeMo-Minitron-8B-Base | 69.5 | 58.5 |
| Mistral-NeMo-Minitron-8B-Instruct | 70.4 | 87.1 |

## Is Minitron open source?

The Minitron weights are distributed on Hugging Face. The Nemotron-derived Minitron models were released under a research-oriented license, while the later Llama-3.1-Minitron and Mistral-NeMo-Minitron base models are covered by the NVIDIA Open Model License, which permits commercial use. The compression code and recipes are published in NVIDIA's public Minitron repository and integrated with the [NeMo](/wiki/nvidia_nemo) framework. [4][5][6]

## Why does Minitron matter?

Minitron demonstrated that structured pruning combined with distillation is a practical, repeatable way to populate an entire model family from a single large checkpoint at a small fraction of the usual compute. By showing that the recipe transfers cleanly from NVIDIA's own Nemotron models to third-party models such as Llama 3.1 and Mistral NeMo, the work helped popularize "compress, then distill" as a standard tool for building efficient small language models. The techniques carried forward into NVIDIA's later efficient-model efforts, including the Nemotron-Nano line, and sit alongside related NVIDIA model programs such as [Llama Nemotron](/wiki/llama_nemotron). [1][3]

## References

[1] Muralidharan, S., Turuvekere Sreenivas, S., Joshi, R., Chochowski, M., Patwary, M., Shoeybi, M., Catanzaro, B., Kautz, J., Molchanov, P. "Compact Language Models via Pruning and Knowledge Distillation." arXiv:2407.14679, 2024. https://arxiv.org/abs/2407.14679

[2] NVlabs/Minitron GitHub repository. https://github.com/NVlabs/Minitron

[3] Turuvekere Sreenivas, S., Muralidharan, S., Joshi, R., et al. "LLM Pruning and Distillation in Practice: The Minitron Approach." arXiv:2408.11796, 2024. https://arxiv.org/abs/2408.11796

[4] NVIDIA. "How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model." NVIDIA Technical Blog, 2024. https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/

[5] NVIDIA. "Mistral-NeMo-Minitron 8B Model Delivers Unparalleled Accuracy." NVIDIA Technical Blog, 2024. https://developer.nvidia.com/blog/mistral-nemo-minitron-8b-foundation-model-delivers-unparalleled-accuracy/

[6] nvidia/Llama-3.1-Minitron-4B-Width-Base model card. Hugging Face. https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base

[7] nvidia/Minitron-4B-Base model card. Hugging Face. https://huggingface.co/nvidia/Minitron-4B-Base

[8] nvidia/Mistral-NeMo-Minitron-8B-Base model card. Hugging Face. https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base

