Knowledge distillation (also known as model distillation) is a model compression technique in machine learning and artificial intelligence in which a smaller model (called the student) is trained to reproduce the behavior of a larger, more complex model or ensemble of models (called the teacher). The core idea is that a well-trained teacher network encodes rich information in its output probability distributions, and this information can be transferred to a compact student network that would be difficult to train from scratch to the same level of performance. The resulting student model is typically faster, smaller, and cheaper to deploy while retaining most of the teacher's accuracy; reported results across computer vision, natural language processing, and speech recognition applications include model size reductions of roughly 40 to 90 percent with less than 5 percent performance loss.
The technique changes what a neural network learns from: rather than training solely on labeled data, a smaller student model learns from both ground-truth labels and the probability distributions produced by a larger teacher model. These soft targets contain valuable "dark knowledge" about inter-class similarities, information that standard one-hot labels cannot convey. Knowledge distillation is commonly used to reduce the size of deep neural network models (or ensembles of multiple models) so that they can be deployed on lower-power or edge devices while preserving most of the original model's performance. In recent years it has become especially important for compressing very large models such as large language models.
The term "knowledge distillation" was popularized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 paper "Distilling the Knowledge in a Neural Network," though the underlying concept of training a small model to mimic a larger one dates back to earlier work by Buciluă, Caruana, and Niculescu-Mizil in 2006. Knowledge distillation has become one of the most widely used approaches for deploying deep learning models on resource-constrained devices such as smartphones, edge sensors, and embedded systems, and it plays a central role in the creation of efficient large language models such as DistilBERT and GPT-4o mini.
The intellectual roots of knowledge distillation extend back to early work on neural network compression and teacher-student learning paradigms. In the early 1990s, research on theoretical teacher-student models in statistical mechanics explored knowledge transfer mechanisms. Notably, Jürgen Schmidhuber in 1991 described a two-network system where one recurrent neural network learned from another, representing an early precursor to modern knowledge distillation concepts. Other researchers in the early 1990s studied theoretical teacher-student configurations in committee machines, exploring the statistical mechanics of knowledge transfer.
The idea of transferring knowledge from a large model to a smaller one was first explored by Cristian Buciluă, Rich Caruana, and Alexandru Niculescu-Mizil in their 2006 paper "Model Compression," presented at the ACM SIGKDD conference. This pioneering work demonstrated that large ensembles of hundreds or thousands of classifiers could be compressed into single neural networks with little loss in accuracy. Their method involved using the large ensemble to label a dataset and then training a smaller neural network on those soft labels, achieving a model "a thousand times smaller and faster" that matched the ensemble's performance. Caruana's team introduced the fundamental concept of using a complex model's predictions as training targets for a simpler model, and developed the MUNGE method for generating synthetic training data when the original data was unavailable. Their experiments showed that the compressed model could match or nearly match the accuracy of the original ensemble while being orders of magnitude smaller and faster. This paper established the basic principle that the outputs of a strong model contain more useful training signal than raw ground-truth labels alone.
The modern formulation of knowledge distillation was introduced by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in a March 2015 paper titled "Distilling the Knowledge in a Neural Network" (arXiv:1503.02531), which was also presented at the NIPS 2014 Deep Learning Workshop. Hinton et al. drew an analogy from biology: insects have distinct larval and adult forms that are optimized for different requirements (extracting energy from the environment versus reproduction and dispersal), and similarly, machine learning models may benefit from having one form optimized for training (a large, cumbersome model that extracts structure from data) and another optimized for deployment (a smaller, faster model).
The paper's central technical contribution was the introduction of temperature scaling in the softmax function and the concept of soft targets. By raising the temperature parameter, the teacher's output probability distribution becomes softer, revealing inter-class relationships that hard labels (one-hot vectors) do not capture. Hinton called this implicit information "dark knowledge," because it resides in the small probabilities assigned to incorrect classes. For example, when a digit classifier trained on MNIST classifies an image of a "7," it might assign a small but non-negligible probability to "1" and "9," reflecting visual similarity. This relational information helps the student generalize better than it would from hard labels alone.
The seminal paper introduced three transformative concepts: the formalization of "distillation" as a knowledge transfer mechanism; the temperature-based softmax approach for creating soft targets; and the notion of "dark knowledge" residing in the probability distributions over incorrect answers. Hinton's team demonstrated dramatic results on MNIST, where a distilled student with 800 hidden units achieved 74 test errors compared to 146 errors when trained conventionally, a 49 percent error reduction purely from learning from a larger teacher. On large-scale speech recognition, they showed that a single distilled model captured 80 percent of the performance improvement of an ensemble of 10 models and produced significant improvements on the acoustic model of a heavily used commercial speech system at Google. The paper further proposed a new type of ensemble consisting of a generalist model and multiple specialist models that focus on fine-grained distinctions between easily confused classes.
Shortly after Hinton's paper, Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio proposed FitNets in their 2015 ICLR paper "FitNets: Hints for Thin Deep Nets." FitNets extended knowledge distillation beyond the teacher's final output layer by using intermediate representations (called "hints") from the teacher's hidden layers to guide the student's training. This allowed the student to be thinner and deeper than the teacher. On CIFAR-10, a deep student network with roughly 10.4 times fewer parameters outperformed a larger teacher network, demonstrating that intermediate feature matching could be a powerful complement to output-level distillation.
Knowledge distillation rapidly diversified after 2015 into multiple research directions. Attention transfer mechanisms were introduced by Sergey Zagoruyko and Nikos Komodakis in their 2017 ICLR paper "Paying More Attention to Attention," demonstrating that transferring spatial attention maps between teacher and student networks improved CNN performance. Online distillation emerged through Ying Zhang and colleagues who published "Deep Mutual Learning" at CVPR 2018, showing that multiple student networks could teach each other simultaneously without requiring a pre-trained teacher. Self-distillation was established by Tommaso Furlanello's team, who introduced "Born-Again Neural Networks" at ICML 2018, demonstrating that distilling a network into an identically sized student could actually surpass the original teacher's performance. Relation-based distillation was advanced by Wonpyo Park and colleagues with "Relational Knowledge Distillation" at CVPR 2019, showing that transferring relational knowledge about how samples relate to each other could enable students to outperform teachers in metric learning tasks. By 2020, Jianping Gou and colleagues published "Knowledge Distillation: A Survey" in the International Journal of Computer Vision, organizing the field into systematic categories by knowledge type, training schemes, and applications.
Knowledge distillation operates on the premise that the knowledge in a neural network resides not just in its learned parameters, but in the learned mapping from inputs to outputs. When a well-trained model classifies an image, it produces a probability distribution across all possible classes. These relative probabilities encode rich information about visual similarities and semantic relationships that a binary correct/incorrect label cannot capture.
The distillation framework involves two key participants: the teacher model, typically a large, complex network or ensemble trained to high accuracy but requiring substantial computational resources; and the student model, a smaller, more efficient architecture that learns to mimic the teacher's behavior while maintaining practical inference speed and memory footprint.
The standard knowledge distillation pipeline consists of three stages:
1. Train the teacher. A large, high-capacity model (or ensemble of models) is trained on the target dataset using conventional supervised learning until it achieves strong performance. The teacher may be a deep convolutional neural network like ResNet, a large Transformer like BERT, or any architecture with high accuracy.
2. Generate soft targets. The trained teacher is used to produce output probability distributions for each training example. These distributions are called "soft targets" or "soft labels" because, unlike one-hot ground-truth labels, they assign non-zero probabilities to multiple classes. The softness of these distributions is controlled by a temperature parameter T.
3. Train the student. A smaller model is trained on the same dataset using a loss function that combines two objectives: matching the teacher's soft target distribution (the distillation loss) and matching the ground-truth hard labels (the standard supervised loss). The student learns both from the teacher's knowledge and from the labeled data.
The key mechanism that enables effective knowledge transfer is temperature scaling of the softmax function. In standard classification, the softmax function converts raw logits z_i into probabilities:
p_i = exp(z_i) / sum_j(exp(z_j))
In knowledge distillation, a temperature parameter T is introduced:
p_i(T) = exp(z_i / T) / sum_j(exp(z_j / T))
When T = 1, this reduces to the standard softmax. As T increases above 1, the probability distribution becomes "softer" (more uniform), revealing the relative magnitudes of the logits more clearly. Higher temperatures expose the teacher's learned similarity structure between classes, which would be nearly invisible in the sharp distributions produced at T = 1.
For example, consider a teacher classifying images of cats. At T = 1, the output might be [cat: 0.98, dog: 0.01, tiger: 0.005, car: 0.001, ...]. At T = 5, the same logits might produce [cat: 0.45, dog: 0.20, tiger: 0.12, car: 0.02, ...]. The softened distribution makes the teacher's implicit knowledge about inter-class relationships explicit: cats are more similar to dogs and tigers than to cars. This relational information provides a richer training signal than one-hot labels.
Practitioners typically use temperature values between 2 and 20, with common values around 5 to 10, depending on the task and dataset. The optimal temperature often requires empirical tuning. Too low a temperature produces distributions that are nearly identical to hard labels, offering little additional information. Excessively high temperatures flatten the distribution so much that meaningful distinctions between classes are washed out.
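To make the effect concrete, the following minimal PyTorch sketch (with illustrative logit values for four hypothetical classes) shows how raising T softens a distribution:

```python
import torch

def softmax_with_temperature(logits: torch.Tensor, T: float) -> torch.Tensor:
    """Temperature-scaled softmax: p_i(T) = exp(z_i / T) / sum_j exp(z_j / T)."""
    return torch.softmax(logits / T, dim=-1)

# Illustrative logits for four classes: [cat, dog, tiger, car]
logits = torch.tensor([6.0, 2.5, 2.0, -1.0])

for T in (1.0, 5.0, 20.0):
    print(T, softmax_with_temperature(logits, T))
# T = 1:  sharp distribution (cat ~0.95, everything else near zero)
# T = 5:  inter-class structure emerges (dog and tiger clearly above car)
# T = 20: nearly uniform; meaningful distinctions are washed out
```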
The total loss for training the student model combines two components:
L_total = alpha * L_distill + (1 - alpha) * L_hard
where:
L_distill is the distillation loss (also called the soft loss), typically the Kullback-Leibler (KL) divergence between the teacher's and student's softened output distributions at temperature T:
L_distill = T^2 * KL(p_teacher(T) || p_student(T))
The T^2 factor compensates for the fact that gradients produced by soft targets scale as 1/T^2 when the temperature is raised, ensuring that the distillation gradients remain at a comparable magnitude to the hard-label gradients regardless of the temperature setting.
L_hard is the standard cross-entropy loss (also called the student loss or hard loss) between the student's predictions (at T = 1) and the ground-truth labels:
L_hard = CrossEntropy(p_student(T=1), y_true)
alpha is a weighting hyperparameter that balances the two loss terms. A typical starting point is alpha = 0.5, with the value tuned empirically.
Hinton et al. noted that when using both soft targets and hard targets, the best results were generally obtained by placing a relatively high weight on the distillation loss. The KL divergence differs from the cross-entropy to the teacher's distribution only by the teacher's entropy, which is constant with respect to the student, so the two yield identical gradients; KL is nevertheless preferred for the distillation loss because it equals zero when the student's distribution exactly matches the teacher's, making it a cleaner measure of the match. Including the hard-label term was found by Hinton et al. to be beneficial, helping when the teacher model is not perfectly accurate.
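Putting the pieces together, a minimal PyTorch sketch of the combined objective following the formulas above (the temperature and alpha values are illustrative defaults, not prescribed settings):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=5.0, alpha=0.5):
    """L_total = alpha * L_distill + (1 - alpha) * L_hard."""
    # Soft loss: T^2-scaled KL divergence between softened distributions.
    # F.kl_div expects log-probabilities as input and probabilities as target;
    # reduction="batchmean" gives the mathematically correct KL value.
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    l_distill = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * T**2

    # Hard loss: standard cross-entropy against ground-truth labels at T = 1.
    l_hard = F.cross_entropy(student_logits, labels)

    return alpha * l_distill + (1 - alpha) * l_hard

# Usage inside an offline-distillation training step (teacher frozen):
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
# loss.backward()
```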
The dark knowledge concept captures information in the teacher's probability distribution over incorrect classes: the relative probabilities of wrong answers that standard cross-entropy training with hard labels cannot access. Even when the teacher assigns very low probabilities to incorrect classes, the relative magnitudes of these probabilities encode valuable information about similarities between classes (for example, recognizing that "7" is more similar to "1" than to "8" in digit recognition). This subtle information in the ratios of unlikely outcomes helps the student learn richer representations and generalize better.
Hinton et al. showed that in the high-temperature limit, minimizing the KL divergence between softened teacher and student distributions is approximately equivalent to minimizing the mean squared error between the teacher's and student's logits. Specifically, the gradient of the distillation loss with respect to the student's logits approximates:
(z_i^student - z_i^teacher) / (N * T^2)
where N is the number of classes. This connection to the earlier model compression work of Buciluă et al. (which essentially performed logit matching) provides theoretical grounding for why temperature-based distillation works: it generalizes and smooths the logit matching approach.
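This approximation can be checked numerically. The sketch below, assuming zero-mean logits as in Hinton et al.'s derivation, compares the exact gradient of the KL distillation loss at a high temperature with the logit-matching expression above:

```python
import torch

torch.manual_seed(0)
N = 5                                       # number of classes
z_t = torch.randn(N)
z_t = z_t - z_t.mean()                      # teacher logits, zero mean
z_s = torch.randn(N)
z_s = (z_s - z_s.mean()).requires_grad_()   # student logits, zero mean

T = 100.0                                   # high-temperature regime
p_t = torch.softmax(z_t / T, dim=0)
log_p_s = torch.log_softmax(z_s / T, dim=0)
kl = torch.sum(p_t * (p_t.log() - log_p_s))  # KL(p_teacher || p_student) at T
kl.backward()

print(z_s.grad)                              # exact gradient of the KL loss
print((z_s.detach() - z_t) / (N * T**2))     # approximation: (z_s - z_t)/(N*T^2)
```

For large T the two printed vectors agree closely, which is the sense in which temperature-based distillation generalizes logit matching.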
Knowledge distillation methods can be categorized by what type of knowledge is transferred from teacher to student. The choice of knowledge source represents a trade-off between simplicity and the richness of the supervisory signal.
Response-based distillation transfers knowledge from the teacher's final output layer, the probability distributions over classes or the logits themselves. The student is trained to match the teacher's output probability distribution, typically using KL divergence at a raised temperature. This is the original and simplest form of knowledge distillation, as proposed by Hinton et al. (2015), and is sometimes called logit distillation. This approach is highly versatile because it treats the teacher as a "black box"; it does not require access to the teacher's internal architecture or intermediate representations, making it applicable even when the teacher is a proprietary model accessible only through an API.
| Aspect | Details |
|---|---|
| Knowledge source | Teacher's final softmax output (soft labels) |
| Loss function | KL divergence between softened teacher and student distributions |
| Advantages | Simple to implement; architecture-agnostic; works across different model families; teacher can be a black box |
| Limitations | Only captures what the teacher "thinks" at the output; misses intermediate reasoning |
| Examples | Hinton et al. (2015), DistilBERT (Sanh et al., 2019) |
Feature-based distillation (also known as hint learning) goes beyond the output layer and transfers knowledge from the teacher's intermediate hidden layers (feature maps, activations, or representations). The rationale is that these layers learn rich, hierarchical feature representations of the data that are crucial to the teacher's performance. The student is trained so that its intermediate representations match or approximate those of the teacher by minimizing a loss function (for example, L2 loss or L1 loss) between the feature activations of a teacher's "hint" layer and a student's "guided" layer. Because the teacher and student often have different architectures and layer widths, an adapter or regressor layer is typically inserted to align the dimensions.
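A minimal sketch of this hint-based matching, with hypothetical layer shapes and a 1x1 convolution serving as the adapter described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical feature shapes: the teacher's hint layer is wider (256 channels)
# than the student's guided layer (64 channels), so a 1x1 conv adapter projects
# the student's features into the teacher's space before matching.
adapter = nn.Conv2d(in_channels=64, out_channels=256, kernel_size=1)

def hint_loss(student_feat, teacher_feat):
    """L2 loss between adapted student features and teacher 'hint' features."""
    return F.mse_loss(adapter(student_feat), teacher_feat)

student_feat = torch.randn(8, 64, 14, 14)    # a batch of student activations
teacher_feat = torch.randn(8, 256, 14, 14)   # matching teacher activations
loss = hint_loss(student_feat, teacher_feat)
```

The adapter's parameters are optimized jointly with the student and can be discarded after training, since they exist only to make the two feature spaces comparable.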
| Aspect | Details |
|---|---|
| Knowledge source | Teacher's intermediate hidden layer outputs ("hints") |
| Loss function | Mean squared error, cosine similarity, or other distance metrics between teacher and student feature maps |
| Advantages | Captures richer information than output-only distillation; can train deeper, thinner students; forces student to learn the teacher's internal problem-solving process |
| Limitations | Requires choosing which teacher and student layers to match; alignment is non-trivial when architectures differ significantly; requires access to teacher internals |
| Examples | FitNets (Romero et al., 2015), TinyBERT (Jiao et al., 2020) |
Relation-based distillation transfers knowledge about the relationships between data samples or between layers, rather than individual sample representations. The core idea, inspired by structuralism, is that the meaning of a representation is defined by its relations to other representations. This relational knowledge can be captured by modeling the correlations between feature maps, constructing similarity matrices, distance matrices, or building instance relationship graphs. For example, it might train the student to preserve the same pairwise similarity structure among a batch of inputs as the teacher, or to replicate the correlation patterns between different layers. This approach captures structural knowledge that neither output-level nor feature-level matching alone can convey and has proven particularly effective for tasks like metric learning.
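As an illustration, a simplified version of the distance-based loss from Relational Knowledge Distillation might look as follows (normalization details vary across implementations):

```python
import torch
import torch.nn.functional as F

def pairwise_distances(embeddings: torch.Tensor) -> torch.Tensor:
    """Matrix of Euclidean distances between all pairs of samples in a batch."""
    return torch.cdist(embeddings, embeddings, p=2)

def rkd_distance_loss(student_emb, teacher_emb, eps=1e-8):
    """Match the student's pairwise distance structure to the teacher's.

    Each distance matrix is normalized by its mean, so the loss is invariant
    to the absolute scale of either embedding space.
    """
    d_s = pairwise_distances(student_emb)
    d_t = pairwise_distances(teacher_emb)
    d_s = d_s / (d_s.mean() + eps)
    d_t = d_t / (d_t.mean() + eps)
    return F.smooth_l1_loss(d_s, d_t)

student_emb = torch.randn(32, 128)   # batch of 32 student embeddings
teacher_emb = torch.randn(32, 512)   # teacher embeddings may differ in dimension
loss = rkd_distance_loss(student_emb, teacher_emb)
```

Note that only the relational structure is matched: the student and teacher embedding dimensions never need to agree.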
| Aspect | Details |
|---|---|
| Knowledge source | Pairwise or higher-order relationships between samples, or between layers within the teacher |
| Loss function | Distance or angle-based losses on relational structures (e.g., pairwise distance matrices, Gram matrices) |
| Advantages | Captures structural and relational patterns; less sensitive to absolute activation magnitudes |
| Limitations | More complex to implement; may require large batch sizes to capture meaningful relationships |
| Examples | Relational Knowledge Distillation (Park et al., 2019), A Gift from Knowledge Distillation (Yim et al., 2017) |
Attention-based distillation transfers attention maps showing where the teacher focuses in the input. This proves particularly valuable for object detection, segmentation, and vision tasks where spatial relationships carry semantic meaning. By learning which regions of an image or which parts of a sequence the teacher model considers important, the student can better allocate its limited capacity to the most relevant features.
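One common construction, loosely following the activation-based attention maps of Zagoruyko and Komodakis, derives a spatial map from a convolutional feature tensor and matches it between teacher and student (the normalization shown is one of several variants):

```python
import torch
import torch.nn.functional as F

def attention_map(feat: torch.Tensor) -> torch.Tensor:
    """Spatial attention map from a CNN feature map: sum of squared channels,
    flattened and L2-normalized so maps of different scales are comparable."""
    a = feat.pow(2).sum(dim=1)                  # (B, H, W): channel-wise energy
    return F.normalize(a.flatten(start_dim=1), dim=1)

def attention_transfer_loss(student_feat, teacher_feat):
    """MSE between normalized student and teacher attention maps."""
    return F.mse_loss(attention_map(student_feat), attention_map(teacher_feat))

# The spatial sizes must match; channel counts may differ freely because the
# channel dimension is collapsed when forming the attention map.
loss = attention_transfer_loss(torch.randn(8, 64, 14, 14),
                               torch.randn(8, 256, 14, 14))
```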
| Method | Knowledge Source | Mechanism | Advantages | Disadvantages |
|---|---|---|---|---|
| Response-Based | Final layer logits/probabilities | Student mimics the teacher's final output distribution | Simple, universal, architecture-agnostic, teacher can be a black box | Ignores rich information present in the teacher's intermediate layers |
| Feature-Based | Intermediate layer feature maps (activations) | Student mimics the teacher's hidden layer representations | Provides deeper, more detailed supervision on feature extraction | Complex to match layers between heterogeneous architectures; needs teacher internals |
| Relation-Based | Relationships between data points or feature maps | Student mimics structural properties of the teacher's embedding space | Captures higher-order, structural knowledge; powerful for metric learning | Computationally intensive; defining relations can be complex |
| Attention-Based | Spatial or temporal attention maps | Student learns to focus on the same regions or features as the teacher | Particularly effective for vision and sequence tasks | Requires teacher to have explicit attention mechanisms or spatial structure |
Beyond the type of knowledge transferred, distillation methods also differ in how the teacher and student models are trained and interact with each other.
In offline distillation (the most common setup), the teacher model is fully pre-trained before any distillation takes place. During student training, the teacher's weights are frozen, and only the student's weights are updated. This is the approach used in the original Hinton et al. paper and in most practical applications.
Offline distillation is straightforward to implement and allows the teacher's soft targets to be pre-computed and cached, reducing the computational cost during student training. The main advantage is the ability to leverage powerful, publicly available pre-trained models as off-the-shelf teachers. The downside is that the student cannot influence or adapt the teacher's behavior.
In online distillation, the teacher and student models are trained simultaneously in a single, end-to-end process, with no pre-trained teacher. Instead, a group of models (peers) are trained from scratch, learning collaboratively and teaching each other. During training, the supervisory "teacher" signal for any given student is typically generated by an ensemble of the other peer models. The teacher's parameters are updated alongside the student's, which means the quality of the soft targets improves over the course of training. Online distillation is particularly useful when a pre-trained teacher is not available, or when the teacher itself benefits from the co-training process.
One common approach is mutual learning (Zhang et al., 2018), where two or more networks of similar or different sizes are trained in parallel, with each network acting as both teacher and student to the others. Deep Mutual Learning showed that even networks of the same architecture can benefit from teaching each other, with the final models outperforming independently trained counterparts.
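A two-network mutual learning step can be sketched as follows (a simplified rendering of the idea, not the authors' exact formulation):

```python
import torch.nn.functional as F

def mutual_learning_losses(logits_a, logits_b, labels):
    """Each peer's loss: supervised cross-entropy plus KL toward the other's
    current predictions. The peer's distribution is detached, so each network
    treats the other as a (moving) teacher rather than backpropagating
    through it."""
    log_p_a = F.log_softmax(logits_a, dim=-1)
    log_p_b = F.log_softmax(logits_b, dim=-1)
    loss_a = F.cross_entropy(logits_a, labels) + \
        F.kl_div(log_p_a, log_p_b.detach().exp(), reduction="batchmean")
    loss_b = F.cross_entropy(logits_b, labels) + \
        F.kl_div(log_p_b, log_p_a.detach().exp(), reduction="batchmean")
    return loss_a, loss_b

# Both networks are updated each step: loss_a.backward(); loss_b.backward()
```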
In self-distillation, a single network serves as both teacher and student. Knowledge is transferred from the network's own deeper layers to its shallower layers, or from one training run of the model to a subsequent run of an identical architecture. Another form is training a model on its own high-confidence predictions alongside ground truth. Counterintuitively, this process has been shown to improve the model's own generalization performance even without an external teacher, acting as a form of implicit regularization that encourages the model to find flatter minima in the loss landscape, which is correlated with better performance on unseen data.
Born-Again Neural Networks (Furlanello et al., 2018) demonstrated that training a student model with an architecture identical to the teacher's, using the teacher's soft targets, can yield a student that outperforms the original teacher. By repeating this process over multiple generations, performance continues to improve. This surprising result suggests that the knowledge encoded in soft targets provides a training signal that is qualitatively different from (and often better than) what ground-truth labels provide.
Be Your Own Teacher (Zhang et al., 2019) proposed a self-distillation framework where knowledge is distilled from the deeper sections of a network to its shallower sections during a single training run. Auxiliary classifiers attached to intermediate layers are trained to match the output of the network's deepest layer. This approach improves the performance of the overall network without increasing its size.
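A minimal sketch of this deep-to-shallow scheme, with a hypothetical two-block network and a single auxiliary head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistilledNet(nn.Module):
    """Sketch of 'Be Your Own Teacher'-style self-distillation: an auxiliary
    classifier on a shallow block learns to match the deepest output.
    The architecture and layer sizes here are hypothetical."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.aux_head = nn.Linear(32, num_classes)   # attached to shallow block
        self.main_head = nn.Linear(64, num_classes)  # deepest classifier

    def forward(self, x):
        feat1 = self.block1(x)                       # shallow features
        feat2 = self.block2(feat1)                   # deep features
        aux = self.aux_head(self.pool(feat1).flatten(1))
        main = self.main_head(self.pool(feat2).flatten(1))
        return main, aux

def self_distillation_loss(main_logits, aux_logits, labels, T=3.0):
    # The deepest output supervises the auxiliary head (detached, so the deep
    # layers act as the teacher); both heads also see the hard labels.
    kd = F.kl_div(F.log_softmax(aux_logits / T, dim=-1),
                  F.softmax(main_logits.detach() / T, dim=-1),
                  reduction="batchmean") * T**2
    return F.cross_entropy(main_logits, labels) + \
           F.cross_entropy(aux_logits, labels) + kd
```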
| Scheme | Teacher Status | Training Process | Key Advantage | Primary Use Case |
|---|---|---|---|---|
| Offline | Pre-trained and frozen | Two-stage: train teacher, then distill to student | Simple; can leverage powerful publicly available teachers | Standard model compression where a strong teacher exists |
| Online | Trained simultaneously with student(s) | Single-stage: all models trained together, learning from peer ensemble | No need for pre-trained teacher; single-stage pipeline | Training a group of models from scratch |
| Self-Distillation | The student is its own teacher | Single-stage: deeper layers or previous states teach shallower layers or current state | Improves generalization without external teacher; acts as regularizer | Improving performance of an architecture without changing it |
Several advanced variants extend the basic knowledge distillation framework beyond the schemes described above.
The following table summarizes the key mathematical components of knowledge distillation.
| Component | Formula | Description |
|---|---|---|
| Standard softmax | p_i = exp(z_i) / sum_j(exp(z_j)) | Converts logits to probabilities at T = 1 |
| Temperature-scaled softmax | p_i(T) = exp(z_i / T) / sum_j(exp(z_j / T)) | Softens the probability distribution; higher T produces more uniform distributions |
| Distillation loss (KL divergence) | L_distill = T^2 * sum_i(p_teacher_i * log(p_teacher_i / p_student_i)) | Measures how well the student matches the teacher's soft targets |
| Hard-label loss (cross-entropy) | L_hard = -sum_i(y_i * log(p_student_i)) | Standard supervised loss with ground-truth labels |
| Total loss | L_total = alpha * L_distill + (1 - alpha) * L_hard | Weighted combination of distillation and supervised losses |
| High-temperature approximation | L_distill approx (1 / 2N) * sum_i(z_i^s - z_i^t)^2 | At high T, distillation reduces to logit matching (MSE between logits) |
DistilBERT, developed by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf at Hugging Face, is one of the most widely cited examples of knowledge distillation applied to natural language processing. DistilBERT distills BERT-base into a model that is 40 percent smaller (66 million parameters vs. BERT's 110 million) and 60 percent faster at inference while retaining 97 percent of BERT's language understanding performance on the GLUE benchmark.
The training procedure combined three loss functions:
| Loss component | Description |
|---|---|
| Distillation loss | KL divergence between the teacher's and student's softened output distributions |
| Masked language modeling loss | The standard BERT pre-training objective applied to the student |
| Cosine embedding loss | Aligns the hidden state vectors of the student and teacher |
DistilBERT uses 6 Transformer layers (compared to BERT-base's 12), with each student layer initialized from every other layer of the teacher. On the SQuAD question answering benchmark, DistilBERT achieves 86.2 F1 and 78.1 EM (exact match), within 3 points of the full BERT model. The model's compact size (207 MB) makes it practical for on-device NLP applications.
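The three losses can be sketched in simplified form with plain PyTorch tensors standing in for the models' outputs (the actual DistilBERT training code weights and implements these terms differently in detail, and masking of padding positions is omitted here):

```python
import torch
import torch.nn.functional as F

def distilbert_style_loss(student_logits, teacher_logits,
                          student_hidden, teacher_hidden,
                          mlm_labels, T=2.0):
    # 1. Distillation loss on softened masked-LM output distributions.
    l_ce = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T**2

    # 2. Masked language modeling loss against ground-truth tokens
    #    (non-masked positions carry the conventional ignore label -100).
    l_mlm = F.cross_entropy(student_logits.flatten(0, 1),
                            mlm_labels.flatten(), ignore_index=-100)

    # 3. Cosine embedding loss aligning student and teacher hidden states.
    s_h = student_hidden.flatten(0, 1)          # (batch*seq_len, hidden)
    t_h = teacher_hidden.flatten(0, 1)
    l_cos = F.cosine_embedding_loss(s_h, t_h, torch.ones(s_h.size(0)))

    return l_ce + l_mlm + l_cos   # the paper weights these terms; omitted here
```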
TinyBERT, proposed by Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu from Huawei Noah's Ark Lab, extends distillation to multiple layers of the Transformer architecture. TinyBERT introduces a two-stage learning framework that performs distillation at both the pre-training stage and the task-specific fine-tuning stage.
The distillation method targets three types of representations: the embedding-layer outputs, the attention matrices and hidden states of the Transformer layers, and the final prediction-layer logits.
The attention-based fitting is motivated by the finding that BERT's attention weights capture substantial linguistic knowledge. TinyBERT (4 layers) achieves comparable results to BERT on the GLUE benchmark while being 7.5 times smaller and 9.4 times faster, retaining 96.8 percent of BERT's accuracy.
DistilGPT2 is a distilled version of OpenAI's GPT-2 language model. DistilGPT2 was trained using the smallest GPT-2 (124 million parameters) as the teacher, resulting in a model with 82 million parameters (about 33 percent fewer) that is nearly 2x faster at inference. The distilled model does sacrifice some generation quality: on the WikiText-103 benchmark, GPT-2 achieves a test perplexity of 16.3 whereas DistilGPT2 has a perplexity of 21.1 (lower perplexity is better, indicating that the model assigns higher probability to the held-out text).
MobileBERT, developed at Google, was optimized specifically for mobile deployment with 25.3 million parameters, achieving 62 millisecond inference latency on Pixel 4 phones for 128-token sequences.
The table below summarizes these distilled NLP models:
| Teacher model (size) | Student model (size) | Compression | Student performance |
|---|---|---|---|
| BERT-base (110M parameters) | DistilBERT (66M) | ~40% fewer params, >60% faster inference | ~97% of teacher accuracy on GLUE benchmark |
| GPT-2 (124M parameters) | DistilGPT2 (82M) | ~33% fewer params, ~2x faster inference | Perplexity 21.1 vs 16.3 (teacher) on WikiText-103 |
| BERT-base (110M parameters) | TinyBERT-4 (14.5M) | 7.5x smaller, 9.4x faster inference | 96.8% of teacher accuracy |
| BERT-base (110M parameters) | MobileBERT (25.3M) | 4.3x smaller, 62ms latency on mobile | Comparable to BERT-base on downstream tasks |
Distillation has also been used for multilingual models and low-resource languages, enabling efficient deployment for tasks like question answering, machine translation, and text generation on consumer devices or for handling high request volumes.
Knowledge distillation has become increasingly important for making large language models more accessible and deployable. With the advent of extremely large LLMs like GPT-3, GPT-4, and LLaMA, two main approaches are used:
Black-Box KD is common when the teacher is a proprietary model accessible only via an API (for example, GPT-4). The student model does not have access to the teacher's internal parameters or logits. Instead, it is trained on a dataset of high-quality prompt-response pairs generated by the teacher. The Stanford Alpaca model is an example, having been fine-tuned on 52,000 instruction-following examples generated by OpenAI's text-davinci-003 model.
White-Box KD is used with open-source LLMs where the full model, including its output distributions and hidden states, is available. This allows the use of the classic distillation loss on soft targets, which provides a much richer training signal than the generated text alone. MiniLLM, a method for LLM distillation, uses reverse KL divergence rather than forward KL to prevent the student from overestimating low-probability regions of the teacher's distribution; for generative tasks, this mode-seeking behavior encourages the student to concentrate on matching the teacher's high-probability outputs, favoring correctness and faithfulness over covering the full distribution.
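The difference between the two divergences is compact in code. The sketch below shows token-level forward and reverse KL on logits (sequence-level objectives such as MiniLLM's involve additional policy-gradient machinery omitted here):

```python
import torch.nn.functional as F

def forward_kl(student_logits, teacher_logits):
    """KL(teacher || student): mean-covering. The student is pushed to place
    mass everywhere the teacher does, including low-probability regions."""
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): mode-seeking. The student is penalized for
    placing mass where the teacher does not. Note that gradients flow through
    the target argument here, i.e. through the student's softmax."""
    return F.kl_div(F.log_softmax(teacher_logits, dim=-1),
                    F.softmax(student_logits, dim=-1),
                    reduction="batchmean")
```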
Notable examples of distilled LLMs include:
| Model | Teacher | Compression | Key result |
|---|---|---|---|
| DistilBERT | BERT-base (110M) | 40% size reduction | 97% of BERT performance on GLUE |
| TinyBERT | BERT-base (110M) | 7.5x smaller | 96.8% of BERT accuracy on GLUE |
| GPT-4o mini | GPT-4o | Significantly smaller | Maintains strong performance at much lower cost |
| DistilGPT-2 | GPT-2 (124M) | 2x faster | Retains most generation quality |
| TinyR1-32B-Preview | DeepSeek-R1 | Much smaller | Near-equal performance on AIME 2024 math benchmark |
Recent advances in LLM distillation have incorporated techniques beyond simple output matching. DeepSeek integrated chain-of-thought prompting and reinforcement learning to guide smaller models in solving complex reasoning tasks. RLKD (Reinforcement Learning Knowledge Distillation) introduced a structure-aware reward model to capture reasoning patterns that distributional matching alone cannot transfer. DIVERSEDISTILL addresses teacher-student heterogeneity by using a teaching committee of multiple teachers, dynamically weighting each teacher's contributions based on the student's current understanding.
Knowledge distillation is widely used in computer vision for deploying efficient models on edge devices. Many state-of-the-art vision models (for image classification, object detection, and other tasks) are too computationally heavy for real-time use on devices.
Knowledge distillation was originally demonstrated on image classification tasks. In their 2015 paper, Hinton et al. showed that a student network with two hidden layers of 800 rectified linear units, when trained with distillation from a larger teacher, made only 74 errors on the MNIST test set, a significant improvement over the 146 errors it made when trained on the same data with hard labels, and close to the teacher's 67 errors. Similar successes have been replicated on more complex datasets such as ImageNet, CIFAR-10, and CIFAR-100. Research has applied distillation to tasks like distilling from a ResNet-152 teacher to a ResNet-50 student, often in combination with techniques like hint training (matching intermediate feature maps) to further boost the student's performance. Large teacher models like ResNet-152 or EfficientNet are also distilled into compact architectures like MobileNet for mobile deployment. Beyer et al. (2022) showed that patient and consistent distillation can transfer state-of-the-art performance from large ResNet models to more efficient architectures.
Applying knowledge distillation to object detection is more challenging than to classification because the task is more complex, involving both the classification of objects and the regression of their bounding box coordinates. Chen et al. (2017) showed that a deep detection model (teacher) could distill its knowledge to a faster student detector, achieving better accuracy-speed trade-offs. Successful techniques for object detection typically distill several components of the detector, such as the classification head, the bounding-box regression outputs, and intermediate features; a simplified sketch follows.
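The sketch below illustrates one such combination under simplified assumptions: standard softened KL on the classification head plus a teacher-bounded regression term, loosely modeled on Chen et al.'s approach (shapes and weighting are hypothetical):

```python
import torch
import torch.nn.functional as F

def detection_distillation_loss(s_cls_logits, t_cls_logits,
                                s_boxes, t_boxes, gt_boxes,
                                T=2.0, reg_weight=0.5):
    """Two-part detection distillation objective (simplified sketch).

    Classification head: softened KL toward the teacher, as in standard KD.
    Box regression: a 'teacher-bounded' loss -- the student regresses toward
    the ground truth but only incurs an extra penalty where it is worse than
    the teacher."""
    l_cls = F.kl_div(F.log_softmax(s_cls_logits / T, dim=-1),
                     F.softmax(t_cls_logits / T, dim=-1),
                     reduction="batchmean") * T**2

    # Per-box squared error against ground truth for student and teacher.
    student_err = F.mse_loss(s_boxes, gt_boxes, reduction="none").sum(-1)
    teacher_err = F.mse_loss(t_boxes, gt_boxes, reduction="none").sum(-1)
    # Penalize the student only where it lags behind the teacher.
    bounded = torch.where(student_err > teacher_err, student_err,
                          torch.zeros_like(student_err))
    return l_cls + reg_weight * bounded.mean()
```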
Distillation has enabled compact models suitable for mobile devices and embedded systems with minimal loss in accuracy, including compressed YOLO models for real-time autonomous driving and surveillance applications.
Complex image segmentation models can be distilled into lightweight models for medical imaging and mobile applications, and subsequent research has applied distillation to semantic segmentation for autonomous driving. Vision Transformers have also been distilled into efficient CNNs for easier deployment.
Automatic speech recognition and audio processing have embraced distillation extensively to adapt large acoustic models to smaller footprints. Asami et al. (2017) demonstrated that a large speech acoustic model could teach a smaller student model for a new domain, improving the student's performance in that domain via distillation. An ensemble of multilingual speech recognition models has been distilled into a single model to support under-resourced languages.
Amazon Alexa used knowledge distillation with 1 million hours of unlabeled speech to create efficient on-device acoustic models. These techniques help deploy speech recognition on devices with limited hardware, such as smartphones or IoT devices, by reducing model size and latency while maintaining accuracy. Speech emotion recognition benefits from distillation with up to 44.9 percent model size reduction and 40.2 percent faster inference.
Edge computing and IoT scenarios demand aggressive compression. The foremost application of knowledge distillation is in the field of Edge AI, which involves running AI computations directly on end-user devices like smartphones, autonomous vehicles, and IoT sensors. Large models are often impractical for these devices due to their high latency, large memory footprint, and significant power consumption. Knowledge distillation addresses this by compressing these models into lightweight student versions that can run efficiently on-device.
This approach offers several key benefits, including lower latency, a smaller memory footprint, and reduced power consumption.
Case studies in this area include deploying efficient vehicle detection models on platforms like the NVIDIA Jetson Nano, where a tiny student model is trained using knowledge from an ensemble of larger, more accurate detectors. KD is also a key enabling technology for Federated Learning on edge devices, where a global model's knowledge can be distilled into personalized on-device models, or where devices collaboratively train models without sharing raw data. For instance, it has been successfully applied to IoT traffic classification and network intrusion detection systems, where lightweight student models on IoT devices can achieve high accuracy by learning from a powerful centralized teacher model.
Cross-modal KD is an advanced application where knowledge is transferred between models trained on different data modalities. This is particularly useful in scenarios where one modality is rich in information but expensive to acquire or process (for example, LiDAR), while another is cheaper and more ubiquitous (for example, RGB cameras). The goal is to imbue the model operating on the cheaper modality with the knowledge from the more powerful one, which is only needed during the training phase.
Key applications include autonomous driving, where knowledge from LiDAR-based teacher models is transferred to camera-only student models, and similar settings in which a sensor-rich modality available at training time supervises a cheaper modality used at deployment.
Recommender systems face strict latency requirements (typically under 100 milliseconds) while handling massive user and item catalogs, making compression critical. Techniques include topology distillation for graph-based recommendations and ranking distillation for maintaining ranking quality in real-time systems.
Healthcare applications demand both accuracy and efficiency for real-time diagnosis and deployment on portable medical devices. Applications include medical image classification and segmentation, where research has demonstrated significant compression with minimal accuracy loss, enabling deployment on portable diagnostic devices.
Knowledge distillation has been applied to graph neural networks. Yang et al. (2020) distilled knowledge from a large graph convolutional network into a smaller one, enabling efficient graph analytics on non-Euclidean data. This is particularly useful for tasks involving social networks, molecular structures, and knowledge graphs.
In reinforcement learning, distillation has been used to transfer policies from large ensembles of agents to single agents, enabling more efficient deployment of learned behaviors.
While model compression is the primary use case, knowledge distillation has been applied to several other problems:
| Application | Description |
|---|---|
| Model compression | Reducing model size and inference cost for deployment on resource-constrained devices |
| Domain adaptation | Distilling knowledge from a model trained on a source domain to a student for a target domain |
| Data augmentation | Using teacher predictions on unlabeled data as pseudo-labels to expand the effective training set |
| Label smoothing | Soft targets act as a form of regularization, preventing the student from becoming overconfident |
| Federated learning | Reducing communication overhead by distilling a global model into client models |
| Multi-task learning | Distilling task-specific knowledge from separate teacher models into a single student |
| Privacy preservation | Training a student on the teacher's outputs without exposing the original training data |
| Ensemble compression | Replacing a computationally expensive ensemble of models with a single compact model |
Knowledge distillation offers distinctive advantages that make it a powerful technique for model compression and deployment: the student's architecture can differ completely from the teacher's, the soft targets act as a form of regularization, and the approach combines well with other compression techniques such as pruning and quantization.
Knowledge distillation, despite its effectiveness, has several known limitations.
When the student model is too small relative to the teacher, a capacity gap arises: the student lacks sufficient representational capacity to absorb the teacher's knowledge. Research has shown that excessively growing the teacher size eventually creates a knowledge gap that makes it harder for the student to learn from the teacher's predictions, sometimes leading to worse performance than a smaller teacher would provide. Counterintuitively, research has shown that a more accurate teacher is not always a better teacher; a slightly less accurate but "simpler" teacher may provide a more digestible learning signal for a small student. Mirzadeh et al. (2020) proposed using intermediate-sized "teacher assistant" models to bridge this gap.
The student's performance is fundamentally bound by the teacher's capabilities. A poorly trained, biased, or suboptimal teacher will inevitably pass on its flaws to students, limiting the potential of the distilled model.
In some scenarios, particularly in zero-shot cross-lingual transfer, distillation can be detrimental. The student model may perform worse than if it had been trained from scratch on the available data. This can occur if the teacher's knowledge is not well-aligned with the student's task or architecture.
The student model may not capture all the nuances, fine-grained knowledge, or complex reasoning capabilities of the teacher. There is typically a trade-off between model size and accuracy. If the student is too small, it cannot effectively learn from the teacher, and some performance degradation is unavoidable.
Knowledge distillation introduces several new hyperparameters, including the temperature T and the weight alpha that balances the soft and hard loss terms. Finding optimal values for these parameters can be a challenging and computationally expensive process that requires extensive experimentation. See hyperparameter tuning.
Empirical studies have revealed that achieving high student accuracy does not necessarily require high fidelity (i.e., a close match between the student's and teacher's predictive distributions). There can often be a large discrepancy between the teacher's and student's outputs, even when the student performs well on the downstream task. This suggests that the optimization landscape of distillation is unusually challenging and that the benefits may stem more from the regularizing effect of the soft targets' gradients rather than from perfect imitation.
Distillation requires training (or having access to) a large teacher model first, which adds an extra step and additional computational cost compared to directly training a smaller model. In online distillation, the simultaneous training of teacher and student increases memory requirements and training time.
If the teacher model contains biases, errors, or miscalibrated confidence scores, these issues are likely to be transferred to the student. The student inherits not just the teacher's knowledge but also its mistakes, making it important to verify teacher quality before distillation.
Stanton et al. (2021) and others have shown that even when the student has sufficient capacity to perfectly match the teacher, there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and student after training. Difficulties in optimization, rather than capacity limitations, can be a key reason for this gap.
Distillation works best for discriminative tasks and has more limited effectiveness for certain spatial reasoning or highly complex tasks. It can face difficulties when there are large capacity gaps in very deep networks.
Traditional knowledge distillation methods struggle with very large models because knowledge in LLMs is distributed across billions of parameters and complex attention patterns. White-box distillation (which requires access to the teacher's internal representations) is often impractical for proprietary models, and black-box distillation (using only the teacher's outputs) may not capture the full depth of the teacher's reasoning capabilities.
In feature-based distillation, selecting the appropriate layers for distillation, aligning corresponding features from the teacher and student, and matching feature sizes are all non-trivial engineering challenges. When the teacher and student use fundamentally different architectures (for example, a CNN teacher and a Transformer student), these alignment problems become even more difficult.
Knowledge distillation is one of several major paradigms for model compression. While they all aim to create more efficient models, they operate on fundamentally different principles. These techniques are often complementary and can be used in combination for maximum efficiency.
Pruning removes redundant parameters (weights, neurons, or filters) from an already trained network. This can be unstructured (removing individual weights) or structured (removing entire channels or neurons). Unlike distillation, it modifies an existing model rather than creating a new one: distillation offers architectural flexibility, while pruning provides direct parameter reduction. Pruning typically requires fine-tuning the pruned model to recover accuracy, and research shows that combining pruning with distillation yields superior results.
Quantization reduces the bit-precision of weights and activations from high-precision floating point (for example, 32-bit) to lower-precision formats (for example, 8-bit integer), achieving a 4x size reduction when going from FP32 to INT8. The architecture remains unchanged, but the data types of the parameters are modified. Quantization can be applied post-training (PTQ) or simulated during training for better accuracy (quantization-aware training, QAT). Many works distill from a full-precision teacher into a quantized student model, showing that the two techniques complement each other.
Low-rank factorization decomposes large weight matrices into products of smaller, low-rank matrices to reduce the parameter count. It modifies specific layers (for example, fully connected layers) by replacing them with factorized versions, and requires fine-tuning afterward to regain accuracy. This method works well in combination with other techniques, as in LoRA for parameter-efficient fine-tuning.
Neural architecture search (NAS) automates the discovery of efficient network architectures. One study of NAS for LLM compression reported up to a 9.85 percent average improvement on 11 diverse downstream tasks together with a 22 percent latency improvement. These techniques can combine effectively with distillation.
The most effective compression strategies combine multiple techniques. Deep Compression combined pruning, quantization, and Huffman coding to achieve 49x compression for VGG-16 with minimal accuracy loss. Successful deployment requires comprehensive analysis and careful hybrid strategy selection. Many state-of-the-art compression pipelines use knowledge distillation in conjunction with pruning, quantization, and low-rank factorization.
| Technique | Mechanism | Granularity | Impact on Architecture | Training Requirement |
|---|---|---|---|---|
| Knowledge Distillation | Trains a small student model to mimic a large teacher model | Model-level | Student architecture can be completely different from the teacher | Requires training a new student model from scratch or from a pre-trained checkpoint |
| Pruning | Removes redundant parameters (weights, neurons, or filters) from a network | Parameter/Channel-level | Reduces active parameters within the same or similar architecture | Typically requires fine-tuning the pruned model to recover accuracy |
| Quantization | Reduces the bit-precision of weights and activations | Parameter-level | Architecture unchanged, but data types modified | Can be applied post-training (PTQ) or during training (QAT) |
| Low-Rank Factorization | Decomposes large weight matrices into smaller, low-rank matrices | Layer-level | Modifies specific layers by replacing them with factorized versions | Requires fine-tuning after factorization |
The period 2023 to 2025 witnessed explosive growth in knowledge distillation research driven by large language models and foundation models.
LLM distillation emerged as a dominant research direction, organized around the two paradigms described above: black-box distillation of proprietary teachers through their APIs, and white-box distillation of open models whose logits and hidden states are accessible.
Diffusion models received focused attention for their computational intensity during image generation. Techniques include progressive distillation reducing sampling steps, consistency models enabling direct noise-to-data mapping, and score distillation for text-to-3D generation.
Major deep learning platforms provide mature implementations of knowledge distillation.
Imagine a really smart older kid who is amazing at solving math problems. A younger kid wants to learn from the older kid, but the younger kid's brain is smaller and cannot hold as much information. Instead of making the younger kid read every textbook the older kid has read, the older kid just shows the younger kid how to think about each problem: "This looks a bit like addition, a little like multiplication, and not at all like division." Those hints (which answers seem close to right and which seem far off) are more helpful than just being told "the answer is 42." That is knowledge distillation: a big, smart model teaches a small model by sharing not just the answers, but the thinking behind the answers.