DistilBERT
DistilBERT is a compressed version of BERT released by Hugging Face in October 2019 that is 40% smaller and 60% faster than BERT-base while retaining 97% of its language-understanding performance on the GLUE benchmark [1]. Created by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf, it was built using knowledge distillation applied during pretraining and has 66 million parameters versus BERT-base's 110 million. It was one of the first practical demonstrations that knowledge distillation could compress a large pretrained transformer without major accuracy loss, and it became a template for a whole family of compressed language models.
The model is trained with a triple loss that combines the standard masked language modeling objective with a soft-target distillation loss and a cosine embedding loss between student and teacher hidden states [1]. Released as part of Hugging Face's open-source transformers library and Model Hub, DistilBERT became one of the most widely downloaded NLP checkpoints on the platform: in one analysis of Hub statistics, the distilbert organization accounted for 44.6% of downloads across the most-downloaded entities studied [2]. It is used heavily for sentiment analysis, question answering, named entity recognition, and embedding pipelines, especially in resource-constrained environments such as web APIs, mobile apps, and edge devices [3].
BERT, short for Bidirectional Encoder Representations from Transformers, was introduced by Jacob Devlin and colleagues at Google AI Language in October 2018 [4]. It uses a stack of transformer encoder blocks pretrained on two unsupervised objectives: masked language modeling, where 15% of input tokens are randomly masked and the model predicts them, and next sentence prediction, a binary task asking whether one sentence follows another in the original text.
BERT-base, the smaller of the two models released in the original paper, uses 12 transformer layers, a hidden size of 768, 12 self-attention heads, and roughly 110 million parameters. BERT-large doubles the depth to 24 layers and reaches about 340 million parameters. At release, BERT achieved state-of-the-art results on 11 natural language understanding benchmarks, including the GLUE benchmark, SQuAD v1.1 and v2.0, and SWAG.
Despite its accuracy, BERT-base is expensive to deploy. The attention mechanism scales quadratically with sequence length, and on commodity CPUs the model can take hundreds of milliseconds per query. A single BERT-base checkpoint occupies more than 400 MB on disk, making on-device deployment difficult. As the DistilBERT authors put it, "operating these large models on-the-edge and/or under constrained computational training or inference budgets remains challenging" [1]. The NLP community quickly began searching for ways to reduce its inference cost, exploring weight pruning, quantization, parameter sharing as in ALBERT, and knowledge distillation.
Knowledge distillation was introduced by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 paper Distilling the Knowledge in a Neural Network [5]. The basic idea is to train a smaller student network to imitate the predictions of a larger, more accurate teacher network. Rather than training the student only on hard labels, the student is also trained to match the teacher's full output distribution, which contains far richer information about the relative similarity between classes. Hinton called this extra signal the dark knowledge captured in the teacher's logits.
Formally, the teacher's logits are passed through a temperature-scaled softmax with temperature T > 1, producing softer probabilities that emphasize relative confidence between classes. The student is trained to match those softened probabilities using cross-entropy. By raising the temperature, low-probability classes still contribute meaningful gradient signal, allowing the student to learn similarity structure that one-hot labels cannot express.
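The effect of the temperature can be seen in a few lines of plain Python. This is a minimal sketch: the function name and the logit values are illustrative assumptions, not taken from any paper or library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes (illustrative values only).
teacher_logits = [6.0, 2.0, 1.0]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

At the higher temperature the ranking of the classes is unchanged, but the two unlikely classes receive a much larger share of the probability mass, which is the extra signal the student learns from.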
DistilBERT was one of the first attempts to apply the technique at scale to a pretrained transformer language model. The Hugging Face team showed that the approach could be applied during pretraining itself rather than only during fine-tuning, producing a general-purpose distilled checkpoint that could then be fine-tuned on downstream tasks just like BERT.
DistilBERT-base preserves BERT-base's structure while halving depth and removing two components.
| Property | BERT-base | DistilBERT-base |
|---|---|---|
| Transformer layers | 12 | 6 |
| Hidden size | 768 | 768 |
| Attention heads | 12 | 12 |
| Feed-forward inner size | 3072 | 3072 |
| Total parameters | 110M | 66M |
| Token type embeddings | yes | no |
| Pooler layer | yes | no |
| Vocabulary | 30,522 WordPiece | 30,522 WordPiece |
| Maximum positions | 512 | 512 |
The student keeps the same hidden dimension and attention head count as the teacher because matching these dimensions makes it possible to copy weights from the teacher during initialization and to apply the cosine alignment loss to hidden states without projection. Reducing depth gives a roughly proportional reduction in inference latency.
The Hugging Face team removed the token type embeddings used in BERT to mark sentence A and sentence B inputs, reflecting the decision to drop next sentence prediction during distillation. The pooler layer is also removed because downstream classification heads typically use the raw [CLS] representation directly.
The overall reduction is approximately 40% fewer parameters than BERT-base [1]. The disk footprint of a distilbert-base-uncased checkpoint is around 250 MB, compared to roughly 440 MB for bert-base-uncased. The tokenizer is unchanged, so any pipeline that uses BERT's WordPiece tokenizer can swap DistilBERT in without touching the preprocessing code.
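The parameter totals can be roughly reproduced from the architecture table. The sketch below assumes the usual BERT layer layout (bias terms on every linear layer, two LayerNorms per block, one LayerNorm after the embeddings) and counts the encoder without any task head; the function names are illustrative.

```python
def encoder_layer_params(hidden, ffn):
    """Parameters in one transformer encoder block (weights plus biases)."""
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)  # two LayerNorms, scale and bias each
    return attention + feed_forward + layer_norms

def model_params(layers, hidden=768, ffn=3072, vocab=30522, positions=512,
                 token_types=0, pooler=False):
    """Approximate parameter count of a BERT-style encoder without a task head."""
    # Embedding tables plus the LayerNorm applied to their sum.
    embeddings = (vocab + positions + token_types) * hidden + 2 * hidden
    total = embeddings + layers * encoder_layer_params(hidden, ffn)
    if pooler:
        total += hidden * hidden + hidden
    return total

bert_base = model_params(layers=12, token_types=2, pooler=True)
distilbert = model_params(layers=6)
reduction = 1 - distilbert / bert_base

print(bert_base, distilbert, round(100 * reduction, 1))
```

Under these assumptions the counts come out near 109.5 million and 66.4 million. Because the embedding matrix is shared by both models and is not halved, cutting the depth in two removes about 39% of the parameters rather than 50%.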
DistilBERT is trained with what the original paper calls a triple loss, described as "a triple loss combining language modeling, distillation and cosine-distance losses" [1]. The loss function combines three components, each capturing a different aspect of the teacher's behavior.
1. Distillation loss L_ce. This is the standard knowledge distillation cross-entropy applied to the masked language modeling outputs. For each masked position the student's logits and the teacher's logits are both passed through a softmax with temperature T greater than one, and the student is trained to match the teacher's softened distribution:
L_ce = - sum_i t_i * log(s_i)
where t_i = softmax(z_t / T)_i and s_i = softmax(z_s / T)_i are the temperature-scaled teacher and student probabilities. Following Hinton, the distillation term is multiplied by T^2: gradients computed from softened targets shrink by a factor of roughly 1/T^2, so the rescaling keeps them comparable in magnitude to those of a hard-label cross-entropy. DistilBERT typically uses temperatures around T = 2 during training and T = 1 at inference.
2. Masked language modeling loss L_mlm. The standard BERT masked language modeling objective is also applied, where 15% of input tokens are masked and the student is trained to predict them using the ground-truth token identities. This anchors the student to the actual training distribution, so it does not simply reproduce the teacher's mistakes.
3. Cosine embedding loss L_cos. Student and teacher hidden states are aligned with a cosine similarity loss that encourages the angle between corresponding hidden vectors to be small:
L_cos = 1 - cos(h_student, h_teacher)
This loss operates directly on the geometry of the representation space. Because the student keeps the same hidden size as the teacher, the comparison can be done without learning a projection matrix.
The total loss is a weighted sum:
L = alpha * L_ce + beta * L_mlm + gamma * L_cos
The paper uses weights emphasizing distillation cross-entropy and cosine alignment, with the masked language modeling loss as a regularizer.
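The three terms can be written out for a single masked position in plain Python. This is a sketch under stated assumptions rather than the authors' training code: the weights alpha, beta and gamma default to 1.0 purely as placeholders, the logits and hidden vectors are made-up toy values, and a real implementation would operate on batched tensors.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature):
    """L_ce = -sum_i t_i * log(s_i), rescaled by T^2."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    cross_entropy = -sum(t_i * math.log(s_i) for t_i, s_i in zip(t, s))
    return cross_entropy * temperature ** 2

def mlm_loss(student_logits, target_index):
    """Hard-label cross-entropy against the true masked token (T = 1)."""
    return -math.log(softmax(student_logits)[target_index])

def cosine_loss(h_student, h_teacher):
    """L_cos = 1 - cos(h_student, h_teacher)."""
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    norm_s = math.sqrt(sum(a * a for a in h_student))
    norm_t = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, target_index, h_student, h_teacher,
                alpha=1.0, beta=1.0, gamma=1.0, temperature=2.0):
    """L = alpha * L_ce + beta * L_mlm + gamma * L_cos (placeholder weights)."""
    return (alpha * distillation_loss(student_logits, teacher_logits, temperature)
            + beta * mlm_loss(student_logits, target_index)
            + gamma * cosine_loss(h_student, h_teacher))

# Toy values for one masked position over a three-token vocabulary.
total = triple_loss(
    student_logits=[2.0, 1.0, 0.1],
    teacher_logits=[3.0, 1.0, 0.2],
    target_index=0,
    h_student=[0.2, 0.1, -0.4],
    h_teacher=[0.25, 0.05, -0.5],
)
print(round(total, 4))
```

Note that the cosine term is zero only when the two hidden vectors point in the same direction, while the distillation term is smallest when the student's softened distribution equals the teacher's.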
The authors found that initialization mattered substantially. Instead of training the 6-layer student from scratch, they initialized it from the teacher by taking one layer out of every two: student layer 0 from teacher layer 0, student layer 1 from teacher layer 2, and so on [1]. This warm start, which exploits the common dimensionality between teacher and student, shortens the distillation budget needed to reach competitive accuracy.
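The layer selection described above can be sketched as an index mapping. The example extends the stated pattern (0 from 0, 1 from 2) with a fixed stride of two, which is an assumption about what "and so on" means; an actual implementation may select a different subset of teacher layers. The lists of numbers stand in for real weight tensors.

```python
def student_to_teacher_layers(num_student_layers=6, stride=2):
    """Map each student layer index to the teacher layer it is copied from."""
    return {s: s * stride for s in range(num_student_layers)}

def init_student_from_teacher(teacher_layers, mapping):
    """Copy the selected teacher layers into a new list of student layers."""
    return [list(teacher_layers[mapping[s]]) for s in sorted(mapping)]

# Stand-in for 12 teacher layers; each "layer" is a tiny list of fake weights.
teacher_layers = [[float(i), float(i) + 0.5] for i in range(12)]

mapping = student_to_teacher_layers()
student_layers = init_student_from_teacher(teacher_layers, mapping)

print(mapping)
print(len(student_layers))
```

The copy works without any reshaping only because the student and teacher layers have identical shapes, which is the reason the hidden size and head count were left unchanged.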
DistilBERT is trained on the same corpus as BERT-base: English Wikipedia and the Toronto BookCorpus dataset, totaling around 3.3 billion tokens. According to the paper, "DistilBERT was trained on 8 16GB V100 GPUs for approximately 90 hours" [1], an order of magnitude less compute than the original BERT pretraining.
Next sentence prediction is omitted entirely. This decision was consistent with later findings, including those in the RoBERTa paper, that next sentence prediction either has no benefit or is mildly harmful when training is otherwise well-tuned. Dropping it also matches the simpler architecture, which removes the segment embeddings.
The DistilBERT paper reports a series of comparisons against BERT-base on standard NLP tasks. The headline result is a roughly 97% retention of BERT-base's GLUE benchmark score using only 60% of the inference time and 40% fewer parameters [1]. For reference, the paper reports a GLUE macro score of 68.7 for ELMo, 79.5 for BERT-base, and 77.0 for DistilBERT [1].
| Task | BERT-base | DistilBERT-base |
|---|---|---|
| GLUE macro average (dev) | 79.5 | 77.0 |
| MNLI matched | 86.7 | 82.2 |
| QQP F1 | 89.6 | 88.5 |
| SST-2 | 92.7 | 91.3 |
| STS-B Pearson | 89.0 | 86.9 |
| QNLI | 91.8 | 89.2 |
| RTE | 69.3 | 59.9 |
| SQuAD v1.1 (EM / F1) | 81.2 / 88.5 | 77.7 / 85.8 |
| IMDb sentiment accuracy | 93.46% | 92.82% |
(SQuAD v1.1 and IMDb figures are from the paper's downstream-task table [1].)
DistilBERT holds up best on classification tasks. On IMDb the gap to the teacher is under one point, and on QQP and SST-2 it is 1.1 and 1.4 points respectively. On natural language inference and other reasoning-heavy tasks the gap is larger (4.5 points on MNLI and 9.4 on RTE), and on extractive question answering over SQuAD v1.1 the F1 drops from 88.5 to 85.8, about 2.7 points, retaining roughly 97% of the teacher's F1 [1]. The paper attributes the remaining gaps to the depth reduction, since deeper transformers tend to support more complex compositional reasoning.
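The "97% retained" figures are simple ratios of student score to teacher score. The short sketch below recomputes them from three rows of the table above; the dictionary layout and function name are illustrative.

```python
# (teacher, student) scores taken from the results table in this article.
scores = {
    "GLUE macro average": (79.5, 77.0),
    "SQuAD v1.1 F1": (88.5, 85.8),
    "IMDb accuracy": (93.46, 92.82),
}

def retention(teacher, student):
    """Student score as a percentage of the teacher score."""
    return 100.0 * student / teacher

for task, (teacher, student) in scores.items():
    gap = teacher - student
    print(f"{task}: gap {gap:.2f}, retained {retention(teacher, student):.1f}%")
```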
The paper reports a 60% speedup at inference, measuring a full pass over the STS-B development set in 410 seconds for DistilBERT versus 668 seconds for BERT-base on CPU [1]. The authors also demonstrated DistilBERT in an on-device proof-of-concept, running it on a mobile phone, showing it is fast enough for many on-device applications [1]. Memory usage at inference is roughly halved compared to BERT-base, important for browser deployments using ONNX Runtime or for edge computing environments.
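The two timing figures can be read either as a throughput gain or as a share of the original running time, and the distinction is easy to blur. A small worked calculation, using only the 410 and 668 second measurements quoted above (variable names are illustrative):

```python
bert_seconds = 668.0        # BERT-base, full pass over the STS-B dev set on CPU
distilbert_seconds = 410.0  # DistilBERT, same workload

speedup = bert_seconds / distilbert_seconds        # throughput ratio
time_fraction = distilbert_seconds / bert_seconds  # share of BERT's wall-clock time

print(f"{speedup:.2f}x throughput, {100 * (speedup - 1):.0f}% faster, "
      f"{100 * time_fraction:.0f}% of the time")
```

Both readings land near 60 for these numbers: DistilBERT processes about 1.6 times as many examples per second, and it needs a little over 60% of BERT-base's time for the same pass.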
The distilbert namespace on the Hugging Face Hub contains several official checkpoints, and the technique has been applied to other base models to produce a small family of distilled transformers.
- distilbert-base-uncased. The flagship model, distilled from bert-base-uncased. English only, lowercased input, 30,522 token WordPiece vocabulary. This is the most downloaded variant.
- distilbert-base-cased. Distilled from bert-base-cased. Preserves casing, useful for named entity recognition where capitalization is an important feature.
- distilbert-base-multilingual-cased. Distilled from bert-base-multilingual-cased, covering 104 languages. Smaller and faster than mBERT while retaining the bulk of its cross-lingual transfer ability.
- distilbert-base-uncased-finetuned-sst-2-english. A ready-to-use sentiment classifier fine-tuned on SST-2. This checkpoint is the default model behind the pipeline("sentiment-analysis") shortcut in the Hugging Face library, which made it one of the most downloaded models on the platform.
- distilbert-base-cased-distilled-squad. A SQuAD-fine-tuned variant frequently used as a default question-answering pipeline.

The Hugging Face team and the broader community produced several siblings using the same recipe:
- multi-qa-distilbert are built on DistilBERT backbones for fast retrieval and clustering.

DistilBERT was integrated into the transformers library at release. The library exposes a family of task-specific heads:
- DistilBertModel: the bare encoder.
- DistilBertForMaskedLM: encoder plus masked language modeling head.
- DistilBertForSequenceClassification: sentiment, NLI, topic, and other text classification tasks.
- DistilBertForQuestionAnswering: extractive question answering head for SQuAD-style data.
- DistilBertForTokenClassification: token-level classifier for named entity recognition and other token classification tasks.
- DistilBertForMultipleChoice: multiple choice tasks such as SWAG.

The simplest usage pattern relies on AutoTokenizer and AutoModel:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
inputs = tokenizer("DistilBERT is a smaller BERT.", return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
For a sentiment classifier, the high-level pipeline API hides the model selection behind a single call:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
classifier("This article is comprehensive and clear.")
# [{'label': 'POSITIVE', 'score': 0.9998}]
Because distilbert-base-uncased-finetuned-sst-2-english is the default for that pipeline, every user who runs the code above downloads a DistilBERT checkpoint, which contributed to its dominant share of Hugging Face download counts.
DistilBERT was also one of the first models supported by ONNX Runtime, TensorFlow Lite, and Core ML conversion paths, making it a popular choice for production deployment outside the standard PyTorch runtime.
DistilBERT became a widely deployed default in industrial NLP systems. For years after its release it was one of the top models on the Hugging Face Hub by monthly downloads, often appearing alongside BERT-base, RoBERTa-base, and sentence-transformers MiniLM checkpoints; in one Hub-statistics analysis the distilbert organization made up 44.6% of downloads among the most-downloaded entities studied [2]. The reasons are practical:
DistilBERT is also a common default in Kaggle competitions and in introductory NLP courses, where its training speed lets users iterate many ideas in a single day. The paper has been cited several thousand times since publication, and its methodology section is frequently referenced as a canonical example of knowledge distillation applied to a pretrained transformer. Its release helped establish Hugging Face as a research organization rather than only a library maintainer.
Knowledge distillation is one of several techniques for shrinking a pretrained transformer. Each approach has different tradeoffs in accuracy, speed, hardware support, and engineering complexity.
| Method | Examples | Mechanism | Strength | Weakness |
|---|---|---|---|---|
| Knowledge distillation | DistilBERT, TinyBERT, MobileBERT, MiniLM | Train smaller student to mimic teacher outputs and hidden states | Strong accuracy retention, hardware-agnostic | Requires teacher and a separate training run |
| Pruning | Movement pruning, magnitude pruning | Remove weights, heads, or layers based on importance criteria | Can be combined with other methods | Sparsity often needs special kernels for actual speedup |
| Quantization | int8, int4, dynamic, static, GPTQ | Reduce numerical precision of weights and activations | Easy to apply post hoc, large memory savings | Accuracy loss at very low bit widths |
| Parameter sharing | ALBERT | Share weights across layers | Smaller checkpoint | Same FLOPs at inference, no speedup |
| Architecture redesign | ELECTRA, DeBERTa | Change pretraining objective or attention design | More efficient pretraining | Different model family, retraining required |
TinyBERT (Jiao et al. 2019) is a smaller cousin of DistilBERT with 4 layers, a hidden size of 312, and around 14.5 million parameters [6]. It uses a two-stage distillation that first distills general knowledge and then performs task-specific distillation, giving strong accuracy at much smaller sizes at the cost of a more elaborate pipeline.
MobileBERT (Sun et al. 2020) targets mobile inference using bottleneck architectures and inverted-bottleneck feed-forward layers [7]. It has 25.3 million parameters but matches BERT-base's GLUE score within one point.
MiniLM (Wang et al. 2020) introduces deep self-attention distillation that aligns the student's attention distributions and value-relation matrices with the teacher's [8]. Its checkpoints MiniLM-L6 and MiniLM-L12 became standard backbones for fast sentence embedding models in retrieval-augmented generation pipelines.
ALBERT (Lan et al. 2019) takes a different route, sharing parameters across layers and factorizing the embedding matrix [9]. While ALBERT-base has only 12 million parameters, it executes the same FLOPs at inference, so it does not provide the speed gains DistilBERT does.
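The difference between sharing layers and removing them can be made concrete by counting stored parameters separately from executed layers. The sketch below reuses BERT-base's block dimensions (hidden size 768, feed-forward size 3072) for all three designs and ignores embeddings, so it illustrates the tradeoff rather than reproducing ALBERT's published parameter count; the layer layout with biases and two LayerNorms per block is an assumption.

```python
def encoder_layer_params(hidden=768, ffn=3072):
    """Parameters in one transformer encoder block (weights plus biases)."""
    attention = 4 * (hidden * hidden + hidden)
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms

per_layer = encoder_layer_params()

# name: (distinct layers stored on disk, layer applications per forward pass)
designs = {
    "12 distinct layers (BERT-base style)": (12, 12),
    "1 shared layer applied 12 times (cross-layer sharing)": (1, 12),
    "6 distinct layers (DistilBERT style)": (6, 6),
}

for name, (stored, applied) in designs.items():
    millions = stored * per_layer / 1e6
    print(f"{name}: {millions:.1f}M encoder parameters, {applied} layer passes")
```

Sharing shrinks what is stored but leaves the number of layer passes at 12, whereas halving the depth shrinks both, which is why only the latter shortens inference.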
DistilBERT inherits several limitations from BERT and adds a few of its own:
DistilBERT helped seed a generation of small encoder models that took the same compress-and-deploy philosophy further:
- paraphrase- and multi-qa- families of sentence-transformers extended the DistilBERT lineage into the embedding world for retrieval pipelines.

In the modern landscape, distilled small encoders coexist with large language models. LLMs handle open-ended tasks and reasoning, while distilled encoders such as DistilBERT and MiniLM handle high-volume, latency-sensitive jobs such as content classification, intent detection, retrieval, and reranking, often inside the retrieval-augmented generation stack of an LLM application.
Hugging Face was founded in 2016 by Clement Delangue, Julien Chaumond, and Thomas Wolf, originally as a chatbot company aimed at teenage users. After pivoting away from the chatbot product, the team began releasing PyTorch ports of new transformer models, starting with pytorch-pretrained-BERT in late 2018. This library was renamed pytorch-transformers and then simply transformers as it expanded to cover GPT-2, RoBERTa, XLNet, and other architectures.
DistilBERT was one of the first significant original research contributions from the Hugging Face team rather than a port of an external model. It was first described in a NeurIPS 2019 workshop paper [1] and announced in a blog post in October 2019. The release demonstrated that the company could conduct competitive applied research on top of its open-source platform, repositioning Hugging Face from a library maintainer to a research organization.
DistilBERT's popularity also fed back into the platform's growth. The default sentiment-analysis pipeline uses a fine-tuned DistilBERT, and millions of users who ran the basic example downloaded a DistilBERT checkpoint as part of their first interaction with the Hub. The recipe has continued to inform later projects, including the Zephyr, SmolLM, and SmolVLM lines.