Knowledge distillation (also known as model distillation) is a model compression technique in machine learning and artificial intelligence in which a smaller model (called the student) is trained to reproduce the behavior of a larger, more complex model or ensemble of models (called the teacher). The core idea is that a well-trained teacher network encodes rich information in its output probability distributions, and this information can be transferred to a compact student network that would be difficult to train from scratch to the same level of performance. The resulting student model is typically faster, smaller, and cheaper to deploy while retaining most of the teacher's accuracy; reported results across computer vision, natural language processing, and speech recognition applications include model size reductions of roughly 40 to 90 percent with less than 5 percent performance loss.
The technique changes what a neural network learns from: rather than training solely on labeled data, a smaller student model learns from both ground-truth labels and the probability distributions produced by a larger teacher model. These soft targets contain valuable "dark knowledge" about inter-class similarities, information that standard one-hot labels cannot convey. Knowledge distillation is commonly used to reduce the size of deep neural network models (or ensembles of multiple models) so that they can be deployed on lower-power or edge devices while preserving most of the original model's performance. In recent years it has become especially important for compressing very large models such as large language models.
The term "knowledge distillation" was popularized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 paper "Distilling the Knowledge in a Neural Network," though the underlying concept of training a small model to mimic a larger one dates back to earlier work by Buciluă, Caruana, and Niculescu-Mizil in 2006. Knowledge distillation has become one of the most widely used approaches for deploying deep learning models on resource-constrained devices such as smartphones, edge sensors, and embedded systems, and it plays a central role in the creation of efficient large language models such as DistilBERT and GPT-4o mini.
The intellectual roots of knowledge distillation extend back to early work on neural network compression and teacher-student learning paradigms. In the early 1990s, research on theoretical teacher-student models in statistical mechanics explored knowledge transfer mechanisms. Notably, Jürgen Schmidhuber in 1991 described a two-network system where one recurrent neural network learned from another, representing an early precursor to modern knowledge distillation concepts. Other researchers in the early 1990s studied theoretical teacher-student configurations in committee machines, exploring the statistical mechanics of knowledge transfer.
The idea of transferring knowledge from a large model to a smaller one was first explored by Cristian Buciluă, Rich Caruana, and Alexandru Niculescu-Mizil in their 2006 paper "Model Compression," presented at the ACM SIGKDD conference. This pioneering work demonstrated that large ensembles of hundreds or thousands of classifiers could be compressed into single neural networks with little loss in accuracy. Their method involved using the large ensemble to label a dataset and then training a smaller neural network on those soft labels, achieving a model "a thousand times smaller and faster" that matched the ensemble's performance. Caruana's team introduced the fundamental concept of using a complex model's predictions as training targets for a simpler model, and developed the MUNGE method for generating synthetic training data when the original data was unavailable. Their experiments showed that the compressed model could match or nearly match the accuracy of the original ensemble while being orders of magnitude smaller and faster. This paper established the basic principle that the outputs of a strong model contain more useful training signal than raw ground-truth labels alone.
The modern formulation of knowledge distillation was introduced by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in a March 2015 paper titled "Distilling the Knowledge in a Neural Network" (arXiv:1503.02531), which was also presented at the NIPS 2014 Deep Learning Workshop. Hinton et al. drew an analogy from biology: insects have distinct larval and adult forms that are optimized for different requirements (extracting energy from the environment versus reproduction and dispersal), and similarly, machine learning models may benefit from having one form optimized for training (a large, cumbersome model that extracts structure from data) and another optimized for deployment (a smaller, faster model).
The paper's central technical contribution was the introduction of temperature scaling in the softmax function and the concept of soft targets. By raising the temperature parameter, the teacher's output probability distribution becomes softer, revealing inter-class relationships that hard labels (one-hot vectors) do not capture. Hinton called this implicit information "dark knowledge," because it resides in the small probabilities assigned to incorrect classes. For example, when a digit classifier trained on MNIST classifies an image of a "7," it might assign a small but non-negligible probability to "1" and "9," reflecting visual similarity. This relational information helps the student generalize better than it would from hard labels alone.
The seminal paper introduced three transformative concepts: the formalization of "distillation" as a knowledge transfer mechanism; the temperature-based softmax approach for creating soft targets; and the notion of "dark knowledge" residing in the probability distributions over incorrect answers. Hinton's team demonstrated dramatic results on MNIST, where a distilled student with 800 hidden units achieved 74 test errors compared to 146 errors when trained conventionally, a 49 percent error reduction purely from learning from a larger teacher. On large-scale speech recognition, they showed that a single distilled model captured 80 percent of the performance improvement of an ensemble of 10 models and produced significant improvements on the acoustic model of a heavily used commercial speech system at Google. The paper further proposed a new type of ensemble consisting of a generalist model and multiple specialist models that focus on fine-grained distinctions between easily confused classes.
Shortly after Hinton's paper, Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio proposed FitNets in their 2015 ICLR paper "FitNets: Hints for Thin Deep Nets." FitNets extended knowledge distillation beyond the teacher's final output layer by using intermediate representations (called "hints") from the teacher's hidden layers to guide the student's training. This allowed the student to be thinner and deeper than the teacher. On CIFAR-10, a deep student network with roughly 10.4 times fewer parameters outperformed a larger teacher network, demonstrating that intermediate feature matching could be a powerful complement to output-level distillation.
Knowledge distillation rapidly diversified after 2015 into multiple research directions. Attention transfer mechanisms were introduced by Sergey Zagoruyko and Nikos Komodakis in their 2017 ICLR paper "Paying More Attention to Attention," demonstrating that transferring spatial attention maps between teacher and student networks improved CNN performance. Online distillation emerged through Ying Zhang and colleagues who published "Deep Mutual Learning" at CVPR 2018, showing that multiple student networks could teach each other simultaneously without requiring a pre-trained teacher. Self-distillation was established by Tommaso Furlanello's team, who introduced "Born-Again Neural Networks" at ICML 2018, demonstrating that distilling a network into an identically sized student could actually surpass the original teacher's performance. Relation-based distillation was advanced by Wonpyo Park and colleagues with "Relational Knowledge Distillation" at CVPR 2019, showing that transferring relational knowledge about how samples relate to each other could enable students to outperform teachers in metric learning tasks. By 2020, Jianping Gou and colleagues published "Knowledge Distillation: A Survey" in the International Journal of Computer Vision, organizing the field into systematic categories by knowledge type, training schemes, and applications.
Knowledge distillation operates on the premise that the knowledge in a neural network resides not just in its learned parameters, but in the learned mapping from inputs to outputs. When a well-trained model classifies an image, it produces a probability distribution across all possible classes. These relative probabilities encode rich information about visual similarities and semantic relationships that a binary correct/incorrect label cannot capture.
The distillation framework involves two key participants: the teacher model, typically a large, complex network or ensemble trained to high accuracy but requiring substantial computational resources; and the student model, a smaller, more efficient architecture that learns to mimic the teacher's behavior while maintaining practical inference speed and memory footprint.
The standard knowledge distillation pipeline consists of three stages:
1. Train the teacher. A large, high-capacity model (or ensemble of models) is trained on the target dataset using conventional supervised learning until it achieves strong performance. The teacher may be a deep convolutional neural network like ResNet, a large Transformer like BERT, or any architecture with high accuracy.
2. Generate soft targets. The trained teacher is used to produce output probability distributions for each training example. These distributions are called "soft targets" or "soft labels" because, unlike one-hot ground-truth labels, they assign non-zero probabilities to multiple classes. The softness of these distributions is controlled by a temperature parameter T.
3. Train the student. A smaller model is trained on the same dataset using a loss function that combines two objectives: matching the teacher's soft target distribution (the distillation loss) and matching the ground-truth hard labels (the standard supervised loss). The student learns both from the teacher's knowledge and from the labeled data.
The key mechanism that enables effective knowledge transfer is temperature scaling of the softmax function. In standard classification, the softmax function converts raw logits z_i into probabilities:
p_i = exp(z_i) / sum_j(exp(z_j))
In knowledge distillation, a temperature parameter T is introduced:
p_i(T) = exp(z_i / T) / sum_j(exp(z_j / T))
When T = 1, this reduces to the standard softmax. As T increases above 1, the probability distribution becomes "softer" (more uniform), revealing the relative magnitudes of the logits more clearly. Higher temperatures expose the teacher's learned similarity structure between classes, which would be nearly invisible in the sharp distributions produced at T = 1.
For example, consider a teacher classifying images of cats. At T = 1, the output might be [cat: 0.98, dog: 0.01, tiger: 0.005, car: 0.001, ...]. At T = 5, the same logits might produce [cat: 0.45, dog: 0.20, tiger: 0.12, car: 0.02, ...]. The softened distribution makes the teacher's implicit knowledge about inter-class relationships explicit: cats are more similar to dogs and tigers than to cars. This relational information provides a richer training signal than one-hot labels.
Practitioners typically use temperature values between 2 and 20, with common values around 5 to 10, depending on the task and dataset. The optimal temperature often requires empirical tuning. Too low a temperature produces distributions that are nearly identical to hard labels, offering little additional information. Excessively high temperatures flatten the distribution so much that meaningful distinctions between classes are washed out.
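To make the effect concrete, the following minimal PyTorch sketch (with illustrative logit values for four hypothetical classes) shows how raising T softens a distribution:

```python
import torch

def softmax_with_temperature(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Temperature-scaled softmax: p_i(T) = exp(z_i / T) / sum_j exp(z_j / T)."""
    return torch.softmax(logits / T, dim=-1)

# Illustrative logits for four classes: [cat, dog, tiger, car]
logits = torch.tensor([6.0, 2.5, 2.0, -1.0])

for T in (1.0, 5.0, 20.0):
    print(T, softmax_with_temperature(logits, T))
# T = 1:  sharp distribution (cat ~0.95, everything else near zero)
# T = 5:  inter-class structure emerges (dog and tiger clearly above car)
# T = 20: nearly uniform; meaningful distinctions are washed out
```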
The total loss for training the student model combines two components:
L_total = alpha * L_distill + (1 - alpha) * L_hard
where:
L_distill is the distillation loss (also called the soft loss), typically the Kullback-Leibler (KL) divergence between the teacher's and student's softened output distributions at temperature T:
L_distill = T^2 * KL(p_teacher(T) || p_student(T))
The T^2 factor compensates for the fact that gradients produced by soft targets scale as 1/T^2 when the temperature is raised, ensuring that the distillation gradients remain at a comparable magnitude to the hard-label gradients regardless of the temperature setting.
L_hard is the standard cross-entropy loss (also called the student loss or hard loss) between the student's predictions (at T = 1) and the ground-truth labels:
L_hard = CrossEntropy(p_student(T=1), y_true)
alpha is a weighting hyperparameter that balances the two loss terms. A typical starting point is alpha = 0.5, with the value tuned empirically.
Hinton et al. noted that when using both soft targets and hard targets, the best results were generally obtained by placing a relatively high weight on the distillation loss. The KL divergence differs from the cross-entropy to the teacher's distribution only by the teacher's entropy, which is constant with respect to the student, so the two yield identical gradients; KL is nevertheless preferred for the distillation loss because it equals zero when the student's distribution exactly matches the teacher's, making it a cleaner measure of the match. Including the hard-label term was found by Hinton et al. to be beneficial, helping when the teacher model is not perfectly accurate.
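Putting the pieces together, a minimal PyTorch sketch of the combined objective following the formulas above (the temperature and alpha values are illustrative defaults, not prescribed settings):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=5.0, alpha=0.5):
    """L_total = alpha * L_distill + (1 - alpha) * L_hard."""
    # Soft loss: T^2-scaled KL divergence between softened distributions.
    # F.kl_div expects log-probabilities as input and probabilities as target;
    # reduction="batchmean" gives the mathematically correct KL value.
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    l_distill = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * T**2

    # Hard loss: standard cross-entropy against ground-truth labels at T = 1.
    l_hard = F.cross_entropy(student_logits, labels)

    return alpha * l_distill + (1 - alpha) * l_hard

# Usage inside an offline-distillation training step (teacher frozen):
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
# loss.backward()
```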
The dark knowledge concept captures information in the teacher's probability distribution over incorrect classes: the relative probabilities of wrong answers that standard cross-entropy training with hard labels cannot access. Even when the teacher assigns very low probabilities to incorrect classes, the relative magnitudes of these probabilities encode valuable information about similarities between classes (for example, recognizing that "7" is more similar to "1" than to "8" in digit recognition). This subtle information in the ratios of unlikely outcomes helps the student learn richer representations and generalize better.
Hinton et al. showed that in the high-temperature limit, minimizing the KL divergence between softened teacher and student distributions is approximately equivalent to minimizing the mean squared error between the teacher's and student's logits. Specifically, the gradient of the distillation loss with respect to the student's logits approximates:
(z_i^student - z_i^teacher) / (N * T^2)
where N is the number of classes. This connection to the earlier model compression work of Buciluă et al. (which essentially performed logit matching) provides theoretical grounding for why temperature-based distillation works: it generalizes and smooths the logit matching approach.
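This approximation can be checked numerically. The sketch below, assuming zero-mean logits as in Hinton et al.'s derivation, compares the exact gradient of the KL distillation loss at a high temperature with the logit-matching expression above:

```python
import torch

torch.manual_seed(0)
N = 5                                       # number of classes
z_t = torch.randn(N)
z_t = z_t - z_t.mean()                      # teacher logits, zero mean
z_s = torch.randn(N)
z_s = (z_s - z_s.mean()).requires_grad_()   # student logits, zero mean

T = 100.0                                   # high-temperature regime
p_t = torch.softmax(z_t / T, dim=0)
log_p_s = torch.log_softmax(z_s / T, dim=0)
kl = torch.sum(p_t * (p_t.log() - log_p_s))  # KL(p_teacher || p_student) at T
kl.backward()

print(z_s.grad)                              # exact gradient of the KL loss
print((z_s.detach() - z_t) / (N * T**2))     # approximation: (z_s - z_t)/(N*T^2)
```

For large T the two printed vectors agree closely, which is the sense in which temperature-based distillation generalizes logit matching.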
Knowledge distillation methods can be categorized by what type of knowledge is transferred from teacher to student. The choice of knowledge source represents a trade-off between simplicity and the richness of the supervisory signal.
Response-based distillation transfers knowledge from the teacher's final output layer, the probability distributions over classes or the logits themselves. The student is trained to match the teacher's output probability distribution, typically using KL divergence at a raised temperature. This is the original and simplest form of knowledge distillation, as proposed by Hinton et al. (2015), and is sometimes called logit distillation. This approach is highly versatile because it treats the teacher as a "black box"; it does not require access to the teacher's internal architecture or intermediate representations, making it applicable even when the teacher is a proprietary model accessible only through an API.
| Aspect | Details |
|---|---|
| Knowledge source | Teacher's final softmax output (soft labels) |
| Loss function | KL divergence between softened teacher and student distributions |
| Advantages | Simple to implement; architecture-agnostic; works across different model families; teacher can be a black box |
| Limitations | Only captures what the teacher "thinks" at the output; misses intermediate reasoning |
| Examples | Hinton et al. (2015), DistilBERT (Sanh et al., 2019) |
Feature-based distillation (also known as hint learning) goes beyond the output layer and transfers knowledge from the teacher's intermediate hidden layers (feature maps, activations, or representations). The rationale is that these layers learn rich, hierarchical feature representations of the data that are crucial to the teacher's performance. The student is trained so that its intermediate representations match or approximate those of the teacher by minimizing a loss function (for example, L2 loss or L1 loss) between the feature activations of a teacher's "hint" layer and a student's "guided" layer. Because the teacher and student often have different architectures and layer widths, an adapter or regressor layer is typically inserted to align the dimensions.
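A minimal sketch of this hint-based matching, with hypothetical layer shapes and a 1x1 convolution serving as the adapter described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical feature shapes: the teacher's hint layer is wider (256 channels)
# than the student's guided layer (64 channels), so a 1x1 conv adapter projects
# the student's features into the teacher's space before matching.
adapter = nn.Conv2d(in_channels=64, out_channels=256, kernel_size=1)

def hint_loss(student_feat, teacher_feat):
    """L2 loss between adapted student features and teacher 'hint' features."""
    return F.mse_loss(adapter(student_feat), teacher_feat)

student_feat = torch.randn(8, 64, 14, 14)    # a batch of student activations
teacher_feat = torch.randn(8, 256, 14, 14)   # matching teacher activations
loss = hint_loss(student_feat, teacher_feat)
```

The adapter's parameters are optimized jointly with the student and can be discarded after training, since they exist only to make the two feature spaces comparable.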
| Aspect | Details |
|---|---|
| Knowledge source | Teacher's intermediate hidden layer outputs ("hints") |
| Loss function | Mean squared error, cosine similarity, or other distance metrics between teacher and student feature maps |
| Advantages | Captures richer information than output-only distillation; can train deeper, thinner students; forces student to learn the teacher's internal problem-solving process |
| Limitations | Requires choosing which teacher and student layers to match; alignment is non-trivial when architectures differ significantly; requires access to teacher internals |
| Examples | FitNets (Romero et al., 2015), TinyBERT (Jiao et al., 2020) |
Relation-based distillation transfers knowledge about the relationships between data samples or between layers, rather than individual sample representations. The core idea, inspired by structuralism, is that the meaning of a representation is defined by its relations to other representations. This relational knowledge can be captured by modeling the correlations between feature maps, constructing similarity matrices, distance matrices, or building instance relationship graphs. For example, it might train the student to preserve the same pairwise similarity structure among a batch of inputs as the teacher, or to replicate the correlation patterns between different layers. This approach captures structural knowledge that neither output-level nor feature-level matching alone can convey and has proven particularly effective for tasks like metric learning.
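As an illustration, a simplified version of the distance-based loss from Relational Knowledge Distillation might look as follows (normalization details vary across implementations):

```python
import torch
import torch.nn.functional as F

def pairwise_distances(embeddings: torch.Tensor) -> torch.Tensor:
    """Matrix of Euclidean distances between all pairs of samples in a batch."""
    return torch.cdist(embeddings, embeddings, p=2)

def rkd_distance_loss(student_emb, teacher_emb, eps=1e-8):
    """Match the student's pairwise distance structure to the teacher's.

    Each distance matrix is normalized by its mean, so the loss is invariant
    to the absolute scale of either embedding space.
    """
    d_s = pairwise_distances(student_emb)
    d_t = pairwise_distances(teacher_emb)
    d_s = d_s / (d_s.mean() + eps)
    d_t = d_t / (d_t.mean() + eps)
    return F.smooth_l1_loss(d_s, d_t)

student_emb = torch.randn(32, 128)   # batch of 32 student embeddings
teacher_emb = torch.randn(32, 512)   # teacher embeddings may differ in dimension
loss = rkd_distance_loss(student_emb, teacher_emb)
```

Note that only the relational structure is matched: the student and teacher embedding dimensions never need to agree.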
| Aspect | Details |
|---|---|
| Knowledge source | Pairwise or higher-order relationships between samples, or between layers within the teacher |
| Loss function | Distance or angle-based losses on relational structures (e.g., pairwise distance matrices, Gram matrices) |
| Advantages | Captures structural and relational patterns; less sensitive to absolute activation magnitudes |
| Limitations | More complex to implement; may require large batch sizes to capture meaningful relationships |
| Examples | Relational Knowledge Distillation (Park et al., 2019), A Gift from Knowledge Distillation (Yim et al., 2017) |
Attention-based distillation transfers attention maps showing where the teacher focuses in the input. This proves particularly valuable for object detection, segmentation, and vision tasks where spatial relationships carry semantic meaning. By learning which regions of an image or which parts of a sequence the teacher model considers important, the student can better allocate its limited capacity to the most relevant features.
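One common construction, loosely following the activation-based attention maps of Zagoruyko and Komodakis, derives a spatial map from a convolutional feature tensor and matches it between teacher and student (the normalization shown is one of several variants):

```python
import torch
import torch.nn.functional as F

def attention_map(feat: torch.Tensor) -> torch.Tensor:
    """Spatial attention map from a CNN feature map: sum of squared channels,
    flattened and L2-normalized so maps of different scales are comparable."""
    a = feat.pow(2).sum(dim=1)                  # (B, H, W): channel-wise energy
    return F.normalize(a.flatten(start_dim=1), dim=1)

def attention_transfer_loss(student_feat, teacher_feat):
    """MSE between normalized student and teacher attention maps."""
    return F.mse_loss(attention_map(student_feat), attention_map(teacher_feat))

# The spatial sizes must match; channel counts may differ freely because the
# channel dimension is collapsed when forming the attention map.
loss = attention_transfer_loss(torch.randn(8, 64, 14, 14),
                               torch.randn(8, 256, 14, 14))
```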
| Method | Knowledge Source | Mechanism | Advantages | Disadvantages |
|---|---|---|---|---|
| Response-Based | Final layer logits/probabilities | Student mimics the teacher's final output distribution | Simple, universal, architecture-agnostic, teacher can be a black box | Ignores rich information present in the teacher's intermediate layers |
| Feature-Based | Intermediate layer feature maps (activations) | Student mimics the teacher's hidden layer representations | Provides deeper, more detailed supervision on feature extraction | Complex to match layers between heterogeneous architectures; needs teacher internals |
| Relation-Based | Relationships between data points or feature maps | Student mimics structural properties of the teacher's embedding space | Captures higher-order, structural knowledge; powerful for metric learning | Computationally intensive; defining relations can be complex |
| Attention-Based | Spatial or temporal attention maps | Student learns to focus on the same regions or features as the teacher | Particularly effective for vision and sequence tasks | Requires teacher to have explicit attention mechanisms or spatial structure |
Beyond the type of knowledge transferred, distillation methods also differ in how the teacher and student models are trained and interact with each other.
In offline distillation (the most common setup), the teacher model is fully pre-trained before any distillation takes place. During student training, the teacher's weights are frozen, and only the student's weights are updated. This is the approach used in the original Hinton et al. paper and in most practical applications.
Offline distillation is straightforward to implement and allows the teacher's soft targets to be pre-computed and cached, reducing the computational cost during student training. The main advantage is the ability to leverage powerful, publicly available pre-trained models as off-the-shelf teachers. The downside is that the student cannot influence or adapt the teacher's behavior.
In online distillation, the teacher and student models are trained simultaneously in a single, end-to-end process, with no pre-trained teacher. Instead, a group of models (peers) are trained from scratch, learning collaboratively and teaching each other. During training, the supervisory "teacher" signal for any given student is typically generated by an ensemble of the other peer models. The teacher's parameters are updated alongside the student's, which means the quality of the soft targets improves over the course of training. Online distillation is particularly useful when a pre-trained teacher is not available, or when the teacher itself benefits from the co-training process.
One common approach is mutual learning (Zhang et al., 2018), where two or more networks of similar or different sizes are trained in parallel, with each network acting as both teacher and student to the others. Deep Mutual Learning showed that even networks of the same architecture can benefit from teaching each other, with the final models outperforming independently trained counterparts.
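A two-network mutual learning step can be sketched as follows (a simplified rendering of the idea, not the authors' exact formulation):

```python
import torch.nn.functional as F

def mutual_learning_losses(logits_a, logits_b, labels):
    """Each peer's loss: supervised cross-entropy plus KL toward the other's
    current predictions. The peer's distribution is detached, so each network
    treats the other as a (moving) teacher rather than backpropagating
    through it."""
    log_p_a = F.log_softmax(logits_a, dim=-1)
    log_p_b = F.log_softmax(logits_b, dim=-1)
    loss_a = F.cross_entropy(logits_a, labels) + \
        F.kl_div(log_p_a, log_p_b.detach().exp(), reduction="batchmean")
    loss_b = F.cross_entropy(logits_b, labels) + \
        F.kl_div(log_p_b, log_p_a.detach().exp(), reduction="batchmean")
    return loss_a, loss_b

# Both networks are updated each step: loss_a.backward(); loss_b.backward()
```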
In self-distillation, a single network serves as both teacher and student. Knowledge is transferred from the network's own deeper layers to its shallower layers, or from one training run of the model to a subsequent run of an identical architecture. Another form is training a model on its own high-confidence predictions alongside ground truth. Counterintuitively, this process has been shown to improve the model's own generalization performance even without an external teacher, acting as a form of implicit regularization that encourages the model to find flatter minima in the loss landscape, which is correlated with better performance on unseen data.
Born-Again Neural Networks (Furlanello et al., 2018) demonstrated that training a student model with an architecture identical to the teacher's, using the teacher's soft targets, can yield a student that outperforms the original teacher. By repeating this process over multiple generations, performance continues to improve. This surprising result suggests that the knowledge encoded in soft targets provides a training signal that is qualitatively different from (and often better than) what ground-truth labels provide.
Be Your Own Teacher (Zhang et al., 2019) proposed a self-distillation framework where knowledge is distilled from the deeper sections of a network to its shallower sections during a single training run. Auxiliary classifiers attached to intermediate layers are trained to match the output of the network's deepest layer. This approach improves the performance of the overall network without increasing its size.
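A minimal sketch of this deep-to-shallow scheme, with a hypothetical two-block network and a single auxiliary head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistilledNet(nn.Module):
    """Sketch of 'Be Your Own Teacher'-style self-distillation: an auxiliary
    classifier on a shallow block learns to match the deepest output.
    The architecture and layer sizes here are hypothetical."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.aux_head = nn.Linear(32, num_classes)   # attached to shallow block
        self.main_head = nn.Linear(64, num_classes)  # deepest classifier

    def forward(self, x):
        feat1 = self.block1(x)                       # shallow features
        feat2 = self.block2(feat1)                   # deep features
        aux = self.aux_head(self.pool(feat1).flatten(1))
        main = self.main_head(self.pool(feat2).flatten(1))
        return main, aux

def self_distillation_loss(main_logits, aux_logits, labels, T=3.0):
    # The deepest output supervises the auxiliary head (detached, so the deep
    # layers act as the teacher); both heads also see the hard labels.
    kd = F.kl_div(F.log_softmax(aux_logits / T, dim=-1),
                  F.softmax(main_logits.detach() / T, dim=-1),
                  reduction="batchmean") * T**2
    return F.cross_entropy(main_logits, labels) + \
           F.cross_entropy(aux_logits, labels) + kd
```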
| Scheme | Teacher Status | Training Process | Key Advantage | Primary Use Case |
|---|---|---|---|---|
| Offline | Pre-trained and frozen | Two-stage: train teacher, then distill to student | Simple; can leverage powerful publicly available teachers | Standard model compression where a strong teacher exists |
| Online | Trained simultaneously with student(s) | Single-stage: all models trained together, learning from peer ensemble | No need for pre-trained teacher; single-stage pipeline | Training a group of models from scratch |
| Self-Distillation | The student is its own teacher | Single-stage: deeper layers or previous states teach shallower layers or current state | Improves generalization without external teacher; acts as regularizer | Improving performance of an architecture without changing it |
Several advanced variants extend the basic knowledge distillation framework beyond the schemes described above.
The following table summarizes the key mathematical components of knowledge distillation.
| Component | Formula | Description |
|---|---|---|
| Standard softmax | p_i = exp(z_i) / sum_j(exp(z_j)) | Converts logits to probabilities at T = 1 |
| Temperature-scaled softmax | p_i(T) = exp(z_i / T) / sum_j(exp(z_j / T)) | Softens the probability distribution; higher T produces more uniform distributions |
| Distillation loss (KL divergence) | L_distill = T^2 * sum_i(p_teacher_i * log(p_teacher_i / p_student_i)) | Measures how well the student matches the teacher's soft targets |
| Hard-label loss (cross-entropy) | L_hard = -sum_i(y_i * log(p_student_i)) | Standard supervised loss with ground-truth labels |
| Total loss | L_total = alpha * L_distill + (1 - alpha) * L_hard | Weighted combination of distillation and supervised losses |
| High-temperature approximation | L_distill approx (1 / 2N) * sum_i(z_i^s - z_i^t)^2 | At high T, distillation reduces to logit matching (MSE between logits) |
DistilBERT, developed by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf at Hugging Face, is one of the most widely cited examples of knowledge distillation applied to natural language processing. DistilBERT distills BERT-base into a model that is 40 percent smaller (66 million parameters vs. BERT's 110 million) and 60 percent faster at inference while retaining 97 percent of BERT's language understanding performance on the GLUE benchmark.
The training procedure combined three loss functions:
| Loss component | Description |
|---|---|
| Distillation loss | KL divergence between the teacher's and student's softened output distributions |
| Masked language modeling loss | The standard BERT pre-training objective applied to the student |
| Cosine embedding loss | Aligns the hidden state vectors of the student and teacher |
DistilBERT uses 6 Transformer layers (compared to BERT-base's 12), with each student layer initialized from every other layer of the teacher. On the SQuAD question answering benchmark, DistilBERT achieves 86.2 F1 and 78.1 EM (exact match), within 3 points of the full BERT model. The model's compact size (207 MB) makes it practical for on-device NLP applications.
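The three losses can be sketched in simplified form with plain PyTorch tensors standing in for the models' outputs (the actual DistilBERT training code weights and implements these terms differently in detail, and masking of padding positions is omitted here):

```python
import torch
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden,
                          mlm_labels, T=2.0):
    # 1. Distillation loss on softened masked-LM output distributions.
    l_ce = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T**2

    # 2. Masked language modeling loss against ground-truth tokens
    #    (non-masked positions carry the conventional ignore label -100).
    l_mlm = F.cross_entropy(student_logits.flatten(0, 1),
                            mlm_labels.flatten(), ignore_index=-100)

    # 3. Cosine embedding loss aligning student and teacher hidden states.
    s_h = student_hidden.flatten(0, 1)          # (batch*seq_len, hidden)
    t_h = teacher_hidden.flatten(0, 1)
    l_cos = F.cosine_embedding_loss(s_h, t_h, torch.ones(s_h.size(0)))

    return l_ce + l_mlm + l_cos   # the paper weights these terms; omitted here
```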
TinyBERT, proposed by Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu from Huawei Noah's Ark Lab, extends distillation to multiple layers of the Transformer architecture. TinyBERT introduces a two-stage learning framework that performs distillation at both the pre-training stage and the task-specific fine-tuning stage.
The distillation method targets three types of representations: the embedding-layer outputs, the attention matrices and hidden states of the Transformer layers, and the final prediction-layer logits.
The attention-based fitting is motivated by the finding that BERT's attention weights capture substantial linguistic knowledge. TinyBERT (4 layers) achieves comparable results to BERT on the GLUE benchmark while being 7.5 times smaller and 9.4 times faster, retaining 96.8 percent of BERT's accuracy.
DistilGPT2 is a distilled version of OpenAI's GPT-2 language model. DistilGPT2 was trained using the smallest GPT-2 (124 million parameters) as the teacher, resulting in a model with 82 million parameters (about 33 percent fewer) that is nearly 2x faster at inference. The distilled model does sacrifice some generation quality: on the WikiText-103 benchmark, GPT-2 achieves a test perplexity of 16.3 whereas DistilGPT2 has a perplexity of 21.1 (lower perplexity is better, indicating that the model assigns higher probability to the held-out text).
MobileBERT, developed at Google, was optimized specifically for mobile deployment with 25.3 million parameters, achieving 62 millisecond inference latency on Pixel 4 phones for 128-token sequences.
The table below summarizes these distilled NLP models:
| Teacher model (size) | Student model (size) | Compression | Student performance |
|---|---|---|---|
| BERT-base (110M parameters) | DistilBERT (66M) | ~40% fewer params, >60% faster inference | ~97% of teacher accuracy on GLUE benchmark |
| GPT-2 (124M parameters) | DistilGPT2 (82M) | ~33% fewer params, ~2x faster inference | Perplexity 21.1 vs 16.3 (teacher) on WikiText-103 |
| BERT-base (110M parameters) | TinyBERT-4 (14.5M) | 7.5x smaller, 9.4x faster inference | 96.8% of teacher accuracy |
| BERT-base (110M parameters) | MobileBERT (25.3M) | 4.3x smaller, 62ms latency on mobile | Comparable to BERT-base on downstream tasks |
Distillation has also been used for multilingual models and low-resource languages, enabling efficient deployment for tasks like question answering, machine translation, and text generation on consumer devices or for handling high request volumes.
Knowledge distillation has become increasingly important for making large language models more accessible and deployable. With the advent of extremely large LLMs like GPT-3, GPT-4, and LLaMA, two main approaches are used:
Black-Box KD is common when the teacher is a proprietary model accessible only via an API (for example, GPT-4). The student model does not have access to the teacher's internal parameters or logits. Instead, it is trained on a dataset of high-quality prompt-response pairs generated by the teacher. The Stanford Alpaca model is an example, having been fine-tuned on 52,000 instruction-following examples generated by OpenAI's text-davinci-003 model.
White-Box KD is used with open-source LLMs where the full model, including its output distributions and hidden states, is available. This allows the use of the classic distillation loss on soft targets, which provides a much richer training signal than the generated text alone. MiniLLM, a method for LLM distillation, uses reverse KL divergence rather than forward KL to prevent the student from overestimating low-probability regions of the teacher's distribution; for generative tasks, this mode-seeking behavior encourages the student to concentrate on matching the teacher's high-probability outputs, favoring correctness and faithfulness over covering the full distribution.
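The difference between the two divergences is compact in code. The sketch below shows token-level forward and reverse KL on logits (sequence-level objectives such as MiniLLM's involve additional policy-gradient machinery omitted here):

```python
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits):
    """KL(teacher || student): mean-covering. The student is pushed to place
    mass everywhere the teacher does, including low-probability regions."""
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): mode-seeking. The student is penalized for
    placing mass where the teacher does not. Note that gradients flow through
    the target argument here, i.e. through the student's softmax."""
    return F.kl_div(F.log_softmax(teacher_logits, dim=-1),
                    F.softmax(student_logits, dim=-1),
                    reduction="batchmean")
```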
Notable examples of distilled LLMs include:
| Model | Teacher | Compression | Key result |
|---|---|---|---|
| DistilBERT | BERT-base (110M) | 40% size reduction | 97% of BERT performance on GLUE |
| TinyBERT | BERT-base (110M) | 7.5x smaller | 96.8% of BERT accuracy on GLUE |
| GPT-4o mini | GPT-4o | Significantly smaller | Maintains strong performance at much lower cost |
| DistilGPT-2 | GPT-2 (124M) | 2x faster | Retains most generation quality |
| TinyR1-32B-Preview | DeepSeek-R1 | Much smaller | Near-equal performance on AIME 2024 math benchmark |
Recent advances in LLM distillation have incorporated techniques beyond simple output matching. DeepSeek integrated chain-of-thought prompting and reinforcement learning to guide smaller models in solving complex reasoning tasks. RLKD (Reinforcement Learning Knowledge Distillation) introduced a structure-aware reward model to capture reasoning patterns that distributional matching alone cannot transfer. DIVERSEDISTILL addresses teacher-student heterogeneity by using a teaching committee of multiple teachers, dynamically weighting each teacher's contributions based on the student's current understanding.
Knowledge distillation is widely used in computer vision for deploying efficient models on edge devices. Many state-of-the-art vision models (for image classification, object detection, and other tasks) are too computationally heavy for real-time use on devices.
Knowledge distillation was originally demonstrated on image classification tasks. In their 2015 paper, Hinton et al. showed that a student network with two hidden layers of 800 rectified linear units, when trained with distillation from a larger teacher, made only 74 errors on the MNIST test set, a significant improvement over the 146 errors it made when trained on the same data with hard labels, and close to the teacher's 67 errors. Similar successes have been replicated on more complex datasets such as ImageNet, CIFAR-10, and CIFAR-100. Research has applied distillation to tasks like distilling from a ResNet-152 teacher to a ResNet-50 student, often in combination with techniques like hint training (matching intermediate feature maps) to further boost the student's performance. Large teacher models like ResNet-152 or EfficientNet are also distilled into compact architectures like MobileNet for mobile deployment. Beyer et al. (2022) showed that patient and consistent distillation can transfer state-of-the-art performance from large ResNet models to more efficient architectures.
Applying knowledge distillation to object detection is more challenging than to classification because the task is more complex, involving both the classification of objects and the regression of their bounding box coordinates. Chen et al. (2017) showed that a deep detection model (teacher) could distill its knowledge to a faster student detector, achieving better accuracy-speed trade-offs. Successful techniques for object detection typically distill several components of the detector, such as the classification head, the bounding-box regression outputs, and intermediate features; a simplified sketch follows.
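The sketch below illustrates one such combination under simplified assumptions: standard softened KL on the classification head plus a teacher-bounded regression term, loosely modeled on Chen et al.'s approach (shapes and weighting are hypothetical):

```python
import torch
import torch.nn.functional as F

def detection_distillation_loss(s_cls_logits, t_cls_logits,
                                s_boxes, t_boxes, gt_boxes,
                                T=2.0, reg_weight=0.5):
    """Two-part detection distillation objective (simplified sketch).

    Classification head: softened KL toward the teacher, as in standard KD.
    Box regression: a 'teacher-bounded' loss -- the student regresses toward
    the ground truth but only incurs an extra penalty where it is worse than
    the teacher."""
    l_cls = F.kl_div(F.log_softmax(s_cls_logits / T, dim=-1),
                     F.softmax(t_cls_logits / T, dim=-1),
                     reduction="batchmean") * T**2

    # Per-box squared error against ground truth for student and teacher.
    student_err = F.mse_loss(s_boxes, gt_boxes, reduction="none").sum(-1)
    teacher_err = F.mse_loss(t_boxes, gt_boxes, reduction="none").sum(-1)
    # Penalize the student only where it lags behind the teacher.
    bounded = torch.where(student_err > teacher_err, student_err,
                          torch.zeros_like(student_err))
    return l_cls + reg_weight * bounded.mean()
```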
Distillation has enabled compact models suitable for mobile devices and embedded systems with minimal loss in accuracy, including compressed YOLO models for real-time autonomous driving and surveillance applications.
Complex image segmentation models can be distilled into lightweight models for medical imaging and mobile applications, and subsequent research has applied distillation to semantic segmentation for autonomous driving. Vision Transformers have also been distilled into efficient CNNs for easier deployment.
Automatic speech recognition and audio processing have embraced distillation extensively to adapt large acoustic models to smaller footprints. Asami et al. (2017) demonstrated that a large speech acoustic model could teach a smaller student model for a new domain, improving the student's performance in that domain via distillation. An ensemble of multilingual speech recognition models has been distilled into a single model to support under-resourced languages.
Amazon Alexa used knowledge distillation with 1 million hours of unlabeled speech to create efficient on-device acoustic models. These techniques help deploy speech recognition on devices with limited hardware, such as smartphones or IoT devices, by reducing model size and latency while maintaining accuracy. Speech emotion recognition benefits from distillation with up to 44.9 percent model size reduction and 40.2 percent faster inference.
Edge computing and IoT scenarios demand aggressive compression. The foremost application of knowledge distillation is in the field of Edge AI, which involves running AI computations directly on end-user devices like smartphones, autonomous vehicles, and IoT sensors. Large models are often impractical for these devices due to their high latency, large memory footprint, and significant power consumption. Knowledge distillation addresses this by compressing these models into lightweight student versions that can run efficiently on-device.
This approach offers several key benefits, including lower latency, a smaller memory footprint, and reduced power consumption.
Case studies in this area include deploying efficient vehicle detection models on platforms like the NVIDIA Jetson Nano, where a tiny student model is trained using knowledge from an ensemble of larger, more accurate detectors. KD is also a key enabling technology for Federated Learning on edge devices, where a global model's knowledge can be distilled into personalized on-device models, or where devices collaboratively train models without sharing raw data. For instance, it has been successfully applied to IoT traffic classification and network intrusion detection systems, where lightweight student models on IoT devices can achieve high accuracy by learning from a powerful centralized teacher model.
Cross-modal KD is an advanced application where knowledge is transferred between models trained on different data modalities. This is particularly useful in scenarios where one modality is rich in information but expensive to acquire or process (for example, LiDAR), while another is cheaper and more ubiquitous (for example, RGB cameras). The goal is to imbue the model operating on the cheaper modality with the knowledge from the more powerful one, which is only needed during the training phase.
Key applications include autonomous driving, where knowledge from LiDAR-based teacher models is transferred to camera-only student models, and similar settings in which a sensor-rich modality available at training time supervises a cheaper modality used at deployment.
Recommender systems face strict latency requirements (typically under 100 milliseconds) while handling massive user and item catalogs, making compression critical. Techniques include topology distillation for graph-based recommendations and ranking distillation for maintaining ranking quality in real-time systems.
Healthcare applications demand both accuracy and efficiency for real-time diagnosis and deployment on portable medical devices. Applications include medical image classification and segmentation, where research has demonstrated significant compression with minimal accuracy loss, enabling deployment on portable diagnostic devices.
Knowledge distillation has been applied to graph neural networks. Yang et al. (2020) distilled knowledge from a large graph convolutional network into a smaller one, enabling efficient graph analytics on non-Euclidean data. This is particularly useful for tasks involving social networks, molecular structures, and knowledge graphs.
In reinforcement learning, distillation has been used to transfer policies from large ensembles of agents to single agents, enabling more efficient deployment of learned behaviors.
While model compression is the primary use case, knowledge distillation has been applied to several other problems:
| Application | Description |
|---|---|
| Model compression | Reducing model size and inference cost for deployment on resource-constrained devices |
| Domain adaptation | Distilling knowledge from a model trained on a source domain to a student for a target domain |
| Data augmentation | Using teacher predictions on unlabeled data as pseudo-labels to expand the effective training set |
| Label smoothing | Soft targets act as a form of regularization, preventing the student from becoming overconfident |
| Federated learning | Reducing communication overhead by distilling a global model into client models |
| Multi-task learning | Distilling task-specific knowledge from separate teacher models into a single student |
| Privacy preservation | Training a student on the teacher's outputs without exposing the original training data |
| Ensemble compression | Replacing a computationally expensive ensemble of models with a single compact model |
Knowledge distillation offers distinctive advantages that make it a powerful technique for model compression and deployment: the student's architecture can differ completely from the teacher's, the soft targets act as a form of regularization, and the approach combines well with other compression techniques such as pruning and quantization.
Knowledge distillation, despite its effectiveness, has several known limitations.
When the student model is too small relative to the teacher, a capacity gap arises: the student lacks sufficient representational capacity to absorb the teacher's knowledge. Research has shown that excessively growing the teacher size eventually creates a knowledge gap that makes it harder for the student to learn from the teacher's predictions, sometimes leading to worse performance than a smaller teacher would provide. Counterintuitively, research has shown that a more accurate teacher is not always a better teacher; a slightly less accurate but "simpler" teacher may provide a more digestible learning signal for a small student. Mirzadeh et al. (2020) proposed using intermediate-sized "teacher assistant" models to bridge this gap.
The student's performance is fundamentally bound by the teacher's capabilities. A poorly trained, biased, or suboptimal teacher will inevitably pass on its flaws to students, limiting the potential of the distilled model.
In some scenarios, particularly in zero-shot cross-lingual transfer, distillation can be detrimental. The student model may perform worse than if it had been trained from scratch on the available data. This can occur if the teacher's knowledge is not well-aligned with the student's task or architecture.
The student model may not capture all the nuances, fine-grained knowledge, or complex reasoning capabilities of the teacher. There is typically a trade-off between model size and accuracy. If the student is too small, it cannot effectively learn from the teacher, and some performance degradation is unavoidable.
Knowledge distillation introduces several new hyperparameters, including the temperature T and the weight alpha that balances the soft and hard loss terms. Finding optimal values for these parameters can be a challenging and computationally expensive process that requires extensive experimentation. See hyperparameter tuning.
Empirical studies have revealed that achieving high student accuracy does not necessarily require high fidelity (i.e., a close match between the student's and teacher's predictive distributions). There can often be a large discrepancy between the teacher's and student's outputs, even when the student performs well on the downstream task. This suggests that the optimization landscape of distillation is unusually challenging and that the benefits may stem more from the regularizing effect of the soft targets' gradients rather than from perfect imitation.
Distillation requires training (or having access to) a large teacher model first, which adds an extra step and additional computational cost compared to directly training a smaller model. In online distillation, the simultaneous training of teacher and student increases memory requirements and training time.
If the teacher model contains biases, errors, or miscalibrated confidence scores, these issues are likely to be transferred to the student. The student inherits not just the teacher's knowledge but also its mistakes, making it important to verify teacher quality before distillation.
Stanton et al. (2021) and others have shown that even when the student has sufficient capacity to perfectly match the teacher, there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and student after training. Difficulties in optimization, rather than capacity limitations, can be a key reason for this gap.
Distillation works best for discriminative tasks and has more limited effectiveness for certain spatial reasoning or highly complex tasks. It can face difficulties when there are large capacity gaps in very deep networks.
Traditional knowledge distillation methods struggle with very large models because knowledge in LLMs is distributed across billions of parameters and complex attention patterns. White-box distillation (which requires access to the teacher's internal representations) is often impractical for proprietary models, and black-box distillation (using only the teacher's outputs) may not capture the full depth of the teacher's reasoning capabilities.
In feature-based distillation, selecting the appropriate layers for distillation, aligning corresponding features from the teacher and student, and matching feature sizes are all non-trivial engineering challenges. When the teacher and student use fundamentally different architectures (for example, a CNN teacher and a Transformer student), these alignment problems become even more difficult.
Knowledge distillation is one of several major paradigms for model compression. While they all aim to create more efficient models, they operate on fundamentally different principles. These techniques are often complementary and can be used in combination for maximum efficiency.
Pruning removes redundant parameters (weights, neurons, or filters) from an already trained network. This can be unstructured (removing individual weights) or structured (removing entire channels or neurons). Unlike distillation, it modifies an existing model rather than creating a new one: distillation offers architectural flexibility, while pruning provides direct parameter reduction. Pruning typically requires fine-tuning the pruned model to recover accuracy, and research shows that combining pruning with distillation yields superior results.
Quantization reduces the bit-precision of weights and activations from high-precision floating point (for example, 32-bit) to lower-precision formats (for example, 8-bit integer), achieving a 4x size reduction when going from FP32 to INT8. The architecture remains unchanged, but the data types of the parameters are modified. Quantization can be applied post-training (PTQ) or simulated during training for better accuracy (quantization-aware training, QAT). Many works distill from a full-precision teacher into a quantized student model, showing that the two techniques complement each other.
Low-rank factorization decomposes large weight matrices into products of smaller, low-rank matrices to reduce the parameter count. It modifies specific layers (for example, fully connected layers) by replacing them with factorized versions, and requires fine-tuning afterward to regain accuracy. This method works well in combination with other techniques, as in LoRA for parameter-efficient fine-tuning.
Neural architecture search (NAS) automates the discovery of efficient network architectures. One study of NAS for LLM compression reported up to a 9.85 percent average improvement on 11 diverse downstream tasks together with a 22 percent latency improvement. These techniques can combine effectively with distillation.
The most effective compression strategies combine multiple techniques. Deep Compression combined pruning, quantization, and Huffman coding to achieve 49x compression for VGG-16 with minimal accuracy loss. Successful deployment requires comprehensive analysis and careful hybrid strategy selection. Many state-of-the-art compression pipelines use knowledge distillation in conjunction with pruning, quantization, and low-rank factorization.
| Technique | Mechanism | Granularity | Impact on Architecture | Training Requirement |
|---|---|---|---|---|
| Knowledge Distillation | Trains a small student model to mimic a large teacher model | Model-level | Student architecture can be completely different from the teacher | Requires training a new student model from scratch or from a pre-trained checkpoint |
| Pruning | Removes redundant parameters (weights, neurons, or filters) from a network | Parameter/Channel-level | Reduces active parameters within the same or similar architecture | Typically requires fine-tuning the pruned model to recover accuracy |
| Quantization | Reduces the bit-precision of weights and activations | Parameter-level | Architecture unchanged, but data types modified | Can be applied post-training (PTQ) or during training (QAT) |
| Low-Rank Factorization | Decomposes large weight matrices into smaller, low-rank matrices | Layer-level | Modifies specific layers by replacing them with factorized versions | Requires fine-tuning after factorization |
The period 2023 to 2025 witnessed explosive growth in knowledge distillation research driven by large language models and foundation models.
LLM distillation emerged as a dominant research direction, organized around the two paradigms described above: black-box distillation of proprietary teachers through their APIs, and white-box distillation of open models whose logits and hidden states are accessible.
Diffusion models received focused attention for their computational intensity during image generation. Techniques include progressive distillation reducing sampling steps, consistency models enabling direct noise-to-data mapping, and score distillation for text-to-3D generation.
Major deep learning platforms provide mature implementations of knowledge distillation.
Imagine a really smart older kid who is amazing at solving math problems. A younger kid wants to learn from the older kid, but the younger kid's brain is smaller and cannot hold as much information. Instead of making the younger kid read every textbook the older kid has read, the older kid just shows the younger kid how to think about each problem: "This looks a bit like addition, a little like multiplication, and not at all like division." Those hints (which answers seem close to right and which seem far off) are more helpful than just being told "the answer is 42." That is knowledge distillation: a big, smart model teaches a small model by sharing not just the answers, but the thinking behind the answers.