DistilBERT
Last reviewed
Apr 28, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 3,499 words
DistilBERT is a transformer-based language model created by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf at Hugging Face in 2019 [1]. It is a distilled version of BERT-base that retains approximately 97% of BERT's performance on the GLUE benchmark while being 40% smaller (66 million parameters versus 110 million) and 60% faster at inference. DistilBERT was one of the first practical demonstrations that knowledge distillation could compress a large pretrained transformer without major accuracy loss, and it became a template for an entire family of compressed language models.
The model is trained with a triple loss that combines the standard masked language modeling objective with a soft-target distillation loss and a cosine embedding loss between student and teacher hidden states. Released as part of Hugging Face's open-source transformers library and Model Hub, DistilBERT became one of the most widely downloaded NLP checkpoints on the platform, used heavily for sentiment analysis, question answering, named entity recognition, and embedding pipelines, especially in resource-constrained environments such as web APIs, mobile apps, and edge devices [2].
BERT, short for Bidirectional Encoder Representations from Transformers, was introduced by Jacob Devlin and colleagues at Google AI Language in October 2018 [3]. It uses a stack of transformer encoder blocks pretrained on two unsupervised objectives: masked language modeling, where 15% of input tokens are randomly masked and the model predicts them, and next sentence prediction, a binary task asking whether one sentence follows another in the original text.
BERT-base, the smaller of the two models released in the original paper, uses 12 transformer layers, a hidden size of 768, 12 self-attention heads, and roughly 110 million parameters. BERT-large doubles the depth to 24 layers and reaches about 340 million parameters. At release, BERT achieved state-of-the-art results on 11 natural language understanding benchmarks, including the GLUE benchmark, SQuAD v1.1 and v2.0, and SWAG.
Despite its accuracy, BERT-base is expensive to deploy. The attention mechanism scales quadratically with sequence length, and on commodity CPUs the model can take hundreds of milliseconds per query. A single BERT-base checkpoint occupies more than 400 MB on disk, making on-device deployment difficult. The NLP community quickly began searching for ways to reduce its inference cost, exploring weight pruning, quantization, parameter sharing as in ALBERT, and knowledge distillation.
Knowledge distillation was introduced by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 paper Distilling the Knowledge in a Neural Network [4]. The basic idea is to train a smaller student network to imitate the predictions of a larger, more accurate teacher network. Rather than training the student only on hard labels, the student is also trained to match the teacher's full output distribution, which contains far richer information about the relative similarity between classes. Hinton called this extra signal the dark knowledge captured in the teacher's logits.
Formally, the teacher's logits are passed through a temperature-scaled softmax with temperature T > 1, producing softer probabilities that emphasize relative confidence between classes. The student is trained to match those softened probabilities using cross-entropy. By raising the temperature, low-probability classes still contribute meaningful gradient signal, allowing the student to learn similarity structure that one-hot labels cannot express.
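The effect of the temperature can be seen in a few lines of Python. The sketch below is only an illustration: the logits are made-up values and `softmax_with_temperature` is a hypothetical helper name, not part of any library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits over three classes (illustrative values only).
teacher_logits = [5.0, 2.0, 0.5]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=2.0)

print([round(p, 3) for p in sharp])  # top class dominates
print([round(p, 3) for p in soft])   # mass shifts toward the other classes
```

At the higher temperature the top class loses probability mass and the lower-ranked classes gain it, which is the extra signal the student learns from.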
DistilBERT was one of the first attempts to apply the technique at scale to a pretrained transformer language model. The Hugging Face team showed that the approach could be applied during pretraining itself rather than only during fine-tuning, producing a general-purpose distilled checkpoint that could then be fine-tuned on downstream tasks just like BERT.
DistilBERT-base preserves BERT-base's structure while halving depth and removing two components.
| Property | BERT-base | DistilBERT-base |
|---|---|---|
| Transformer layers | 12 | 6 |
| Hidden size | 768 | 768 |
| Attention heads | 12 | 12 |
| Feed-forward inner size | 3072 | 3072 |
| Total parameters | 110M | 66M |
| Token type embeddings | yes | no |
| Pooler layer | yes | no |
| Vocabulary | 30,522 WordPiece | 30,522 WordPiece |
| Maximum positions | 512 | 512 |
The student keeps the same hidden dimension and attention head count as the teacher because matching these dimensions makes it possible to copy weights from the teacher during initialization and to apply the cosine alignment loss to hidden states without projection. Reducing depth gives a roughly proportional reduction in inference latency.
The Hugging Face team removed the token type embeddings used in BERT to mark sentence A and sentence B inputs, reflecting the decision to drop next sentence prediction during distillation. The pooler layer is also removed because downstream classification heads typically use the raw [CLS] representation directly.
The overall reduction is approximately 40% fewer parameters than BERT-base. The disk footprint of a distilbert-base-uncased checkpoint is around 250 MB, compared to roughly 440 MB for bert-base-uncased. The tokenizer is unchanged, so any pipeline that uses BERT's WordPiece tokenizer can swap DistilBERT in without touching the preprocessing code.
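The parameter totals in the table can be approximated from the listed dimensions. The following is a rough sketch that assumes a standard BERT-style encoder layout (four attention projections, two feed-forward matrices, two LayerNorms per layer, all with biases) and no task head; `encoder_params` is a hypothetical helper written for this illustration.

```python
def encoder_params(layers, hidden=768, ffn=3072, vocab=30522, positions=512,
                   token_types=0, pooler=False):
    """Count weights and biases of a BERT-style encoder without a task head."""
    layer_norm = 2 * hidden  # scale and bias vectors
    embeddings = (vocab + positions + token_types) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler_params = hidden * hidden + hidden if pooler else 0
    return embeddings + layers * per_layer + pooler_params

# BERT-base: 12 layers, two token type embeddings, pooler layer.
bert = encoder_params(layers=12, token_types=2, pooler=True)
# DistilBERT-base: 6 layers, no token type embeddings, no pooler.
distil = encoder_params(layers=6)

print(bert, distil, round(1 - distil / bert, 3))
```

Under these assumptions the counts come out at about 109.5 million and 66.4 million, a reduction of roughly 39%. The embedding matrix is shared by both models and is not halved, which is why cutting the depth in half removes about 40% of the parameters rather than 50%.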
DistilBERT is trained with what the original paper calls a triple loss. The loss function combines three components, each capturing a different aspect of the teacher's behavior [1].
1. Distillation loss L_ce. This is the standard knowledge distillation cross-entropy applied to the masked language modeling outputs. For each masked position the student's logits and the teacher's logits are both passed through a softmax with temperature T greater than one, and the student is trained to match the teacher's softened distribution:
L_ce = - sum_i t_i * log(s_i)
where t_i = softmax(z_t / T)_i and s_i = softmax(z_s / T)_i are the temperature-scaled teacher and student probabilities. Following Hinton, the gradient with respect to the student logits is multiplied by T^2 to keep the gradient magnitude comparable to a hard-label cross-entropy. DistilBERT typically uses temperatures around T = 2 during training and T = 1 at inference.
2. Masked language modeling loss L_mlm. The standard BERT masked language modeling objective is also applied, where 15% of input tokens are masked and the student is trained to predict them using the ground-truth token identities. This anchors the student to the actual training distribution and prevents it from memorizing only the teacher's mistakes.
3. Cosine embedding loss L_cos. Student and teacher hidden states are aligned with a cosine similarity loss that encourages the angle between corresponding hidden vectors to be small:
L_cos = 1 - cos(h_student, h_teacher)
This loss operates directly on the geometry of the representation space. Because the student keeps the same hidden size as the teacher, the comparison can be done without learning a projection matrix.
The total loss is a weighted sum:
L = alpha * L_ce + beta * L_mlm + gamma * L_cos
The paper uses weights emphasizing distillation cross-entropy and cosine alignment, with the masked language modeling loss as a regularizer.
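The three terms can be combined in a small, dependency-free sketch. This is an illustration for a single masked position, not the authors' training code: the function names are hypothetical, the weights alpha, beta, and gamma default to placeholder values rather than the paper's settings, and the inputs are toy numbers.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature):
    """L_ce = -sum_i t_i * log(s_i) on softened distributions, scaled by T^2."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    cross_entropy = -sum(ti * math.log(si) for ti, si in zip(t, s))
    return temperature ** 2 * cross_entropy

def mlm_loss(student_logits, target_index):
    """L_mlm: hard-label cross-entropy against the true masked token."""
    return -math.log(softmax(student_logits)[target_index])

def cosine_loss(h_student, h_teacher):
    """L_cos = 1 - cos(h_student, h_teacher)."""
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    norm_s = math.sqrt(sum(a * a for a in h_student))
    norm_t = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, target_index,
                h_student, h_teacher,
                alpha=1.0, beta=1.0, gamma=1.0, temperature=2.0):
    """L = alpha * L_ce + beta * L_mlm + gamma * L_cos (placeholder weights)."""
    return (alpha * distillation_loss(student_logits, teacher_logits, temperature)
            + beta * mlm_loss(student_logits, target_index)
            + gamma * cosine_loss(h_student, h_teacher))

# Toy example: one masked position, 4-token vocabulary, 3-dimensional hidden state.
loss = triple_loss(
    student_logits=[2.0, 0.5, 0.1, -1.0],
    teacher_logits=[3.0, 1.0, 0.2, -2.0],
    target_index=0,
    h_student=[0.2, 0.9, -0.1],
    h_teacher=[0.3, 0.8, 0.0],
)
print(round(loss, 4))
```

In a real training loop each term would be averaged over all masked positions and tokens in a batch, and the computation would run on tensors rather than Python lists.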
The authors found that initialization mattered substantially. Instead of training the 6-layer student from scratch, they initialized it by taking every other layer of the 12-layer teacher: student layer 0 from teacher layer 0, student layer 1 from teacher layer 2, and so on. This warm start shortens the distillation budget needed to reach competitive accuracy.
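The every-other-layer mapping described above can be written out explicitly. This sketch follows the description in this article; the released training scripts may pick a different set of teacher layers, and `select_teacher_layers` and the weight names are hypothetical.

```python
def select_teacher_layers(num_teacher_layers=12, stride=2):
    """Map each student layer index to the teacher layer it is copied from."""
    teacher_indices = range(0, num_teacher_layers, stride)
    return {student: teacher for student, teacher in enumerate(teacher_indices)}

mapping = select_teacher_layers()
print(mapping)

# Toy stand-in for a teacher checkpoint: one named entry per layer.
teacher_weights = {f"layer.{i}": f"teacher-weights-{i}" for i in range(12)}

# Warm-start the student by copying the selected teacher layers.
student_weights = {
    f"layer.{student}": teacher_weights[f"layer.{teacher}"]
    for student, teacher in mapping.items()
}
print(len(student_weights))
```

Copying is only possible because the student keeps the teacher's hidden size and head count, so every weight tensor has the same shape in both models.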
DistilBERT is trained on the same corpus as BERT-base: English Wikipedia and the BookCorpus dataset, totaling around 3.3 billion tokens. Training was performed on 8 NVIDIA V100 16 GB GPUs for approximately 90 hours, an order of magnitude less compute than the original BERT pretraining.
Next sentence prediction is omitted entirely. This decision was consistent with later findings, including those in the RoBERTa paper, that next sentence prediction either has no benefit or is mildly harmful when training is otherwise well-tuned. Dropping it also matches the simpler architecture, which removes the segment embeddings.
The DistilBERT paper reports a series of comparisons against BERT-base on standard NLP tasks. The headline result is a roughly 97% retention of BERT-base's GLUE benchmark score using only 60% of the inference time and 40% fewer parameters.
| Task | BERT-base | DistilBERT-base |
|---|---|---|
| GLUE macro average (dev) | 79.5 | 77.0 |
| MNLI matched | 86.7 | 82.2 |
| QQP F1 | 89.6 | 88.5 |
| SST-2 | 92.7 | 91.3 |
| STS-B Pearson | 89.0 | 86.9 |
| QNLI | 91.8 | 89.2 |
| RTE | 69.3 | 59.9 |
| SQuAD v1.1 F1 | 88.5 | 85.8 |
| IMDb sentiment accuracy | 93.46% | 92.82% |
Classification tasks are where DistilBERT shines. On IMDb the gap to the teacher is under one point, and on QQP and SST-2 it is between one and one and a half points. On natural language inference and other reasoning-heavy tasks the gap is larger: 4.5 points on MNLI and 9.4 points on RTE. On extractive question answering over SQuAD v1.1 the F1 drops by about 2.7 points, retaining around 97% of the teacher's score. The larger gaps are commonly attributed to the depth reduction, since deeper transformers tend to support more complex compositional reasoning.
The paper measures inference speed on CPU and reports that DistilBERT is about 60% faster than BERT-base. The authors also ran a question-answering model on a smartphone (an iPhone 7 Plus), where DistilBERT was 71% faster than BERT-base excluding tokenization, with the whole model weighing 207 MB. Memory usage at inference is roughly halved compared to BERT-base, which matters for browser deployments using ONNX Runtime and for edge computing environments.
The distilbert namespace on the Hugging Face Hub contains several official checkpoints, and the technique has been applied to other base models to produce a small family of distilled transformers.
- distilbert-base-uncased. The flagship model, distilled from bert-base-uncased. English only, lowercased input, 30,522 token WordPiece vocabulary. This is the most downloaded variant.
- distilbert-base-cased. Distilled from bert-base-cased. Preserves casing, useful for named entity recognition where capitalization is an important feature.
- distilbert-base-multilingual-cased. Distilled from bert-base-multilingual-cased, covering 104 languages. Smaller and faster than mBERT while retaining the bulk of its cross-lingual transfer ability.
- distilbert-base-uncased-finetuned-sst-2-english. A ready-to-use sentiment classifier fine-tuned on SST-2. This checkpoint is the default model behind the pipeline("sentiment-analysis") shortcut in the Hugging Face library, which made it one of the most downloaded models on the platform.
- distilbert-base-cased-distilled-squad. A SQuAD-fine-tuned variant frequently used as a default question-answering pipeline.

The Hugging Face team and the broader community produced several siblings using the same recipe:

- multi-qa-distilbert are built on DistilBERT backbones for fast retrieval and clustering.

DistilBERT was integrated into the transformers library at release. The library exposes a family of task-specific heads:

- DistilBertModel: the bare encoder.
- DistilBertForMaskedLM: encoder plus masked language modeling head.
- DistilBertForSequenceClassification: sentiment, NLI, topic, and other text classification tasks.
- DistilBertForQuestionAnswering: extractive question answering head for SQuAD-style data.
- DistilBertForTokenClassification: token-level classifier for named entity recognition and other token classification tasks.
- DistilBertForMultipleChoice: multiple choice tasks such as SWAG.

The simplest usage pattern relies on AutoTokenizer and AutoModel:
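```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the bare encoder from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Tokenize one sentence into PyTorch tensors and run the encoder.
inputs = tokenizer("DistilBERT is a smaller BERT.", return_tensors="pt")
outputs = model(**inputs)

# Per-token hidden states, shape (batch, sequence_length, 768).
last_hidden_state = outputs.last_hidden_state
```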
For a sentiment classifier, the high-level pipeline API hides the model selection behind a single call:
```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("This article is comprehensive and clear.")
# [{'label': 'POSITIVE', 'score': 0.9998}]
```
Because distilbert-base-uncased-finetuned-sst-2-english is the default for that pipeline, every user who runs the code above downloads a DistilBERT checkpoint, which contributed to its dominant share of Hugging Face download counts.
DistilBERT was also one of the first models supported by ONNX Runtime, TensorFlow Lite, and Core ML conversion paths, making it a popular choice for production deployment outside the standard PyTorch runtime.
DistilBERT became a widely deployed default in industrial NLP systems. For years after its release it was one of the top models on the Hugging Face Hub by monthly downloads, often appearing in the top five along with BERT-base, RoBERTa-base, and sentence-transformers MiniLM checkpoints [2]. The reasons are practical: it is roughly 40% smaller and 60% faster than BERT-base, it uses the same WordPiece tokenizer, and it is fine-tuned through the same transformers API.
DistilBERT is also a common default in Kaggle competitions and in introductory NLP courses, where its training speed lets users iterate many ideas in a single day. The paper has been cited several thousand times since publication, and its methodology section is frequently referenced as a canonical example of knowledge distillation applied to a pretrained transformer. Its release helped establish Hugging Face as a research organization rather than only a library maintainer.
Knowledge distillation is one of several techniques for shrinking a pretrained transformer. Each approach has different tradeoffs in accuracy, speed, hardware support, and engineering complexity.
| Method | Examples | Mechanism | Strength | Weakness |
|---|---|---|---|---|
| Knowledge distillation | DistilBERT, TinyBERT, MobileBERT, MiniLM | Train smaller student to mimic teacher outputs and hidden states | Strong accuracy retention, hardware-agnostic | Requires teacher and a separate training run |
| Pruning | Movement pruning, magnitude pruning | Remove weights, heads, or layers based on importance criteria | Can be combined with other methods | Sparsity often needs special kernels for actual speedup |
| Quantization | int8, int4, dynamic, static, GPTQ | Reduce numerical precision of weights and activations | Easy to apply post hoc, large memory savings | Accuracy loss at very low bit widths |
| Parameter sharing | ALBERT | Share weights across layers | Smaller checkpoint | Same FLOPs at inference, no speedup |
| Architecture redesign | ELECTRA, DeBERTa | Change pretraining objective or attention design | More efficient pretraining | Different model family, retraining required |
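To make the quantization row concrete, the weight-only footprint at different precisions can be estimated with simple arithmetic. This is a back-of-the-envelope sketch that uses the rounded 66 million parameter figure from this article and ignores quantization scales, zero points, and activation memory; `weight_footprint_mb` is a hypothetical helper.

```python
def weight_footprint_mb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6

DISTILBERT_PARAMS = 66_000_000  # rounded parameter count quoted in the article

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(name, weight_footprint_mb(DISTILBERT_PARAMS, bits), "MB")
```

The fp32 estimate of 264 MB is in line with the checkpoint size of around 250 MB quoted earlier, and int8 storage brings the weights down to about a quarter of that.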
TinyBERT (Jiao et al. 2019) is a smaller cousin of DistilBERT with 4 layers, a hidden size of 312, and around 14.5 million parameters [5]. It uses a two-stage distillation that first distills general knowledge and then performs task-specific distillation, giving strong accuracy at much smaller sizes at the cost of a more elaborate pipeline.
MobileBERT (Sun et al. 2020) targets mobile inference using bottleneck architectures and inverted-bottleneck feed-forward layers [6]. It has 25.3 million parameters but matches BERT-base's GLUE score within one point.
MiniLM (Wang et al. 2020) introduces deep self-attention distillation that aligns the student's attention distributions and value-relation matrices with the teacher's [7]. Its checkpoints MiniLM-L6 and MiniLM-L12 became standard backbones for fast sentence embedding models in retrieval-augmented generation pipelines.
ALBERT (Lan et al. 2019) takes a different route, sharing parameters across layers and factorizing the embedding matrix. While ALBERT-base has only 12 million parameters, it executes the same FLOPs at inference, so it does not provide the speed gains DistilBERT does.
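The difference between sharing layers and removing layers can be sketched with a rough cost model. The code below is an approximation under stated assumptions: it counts only the weight matrices of a BERT-base-sized layer, estimates compute as two floating-point operations per weight per token, and ignores the attention score computation and ALBERT's embedding factorization. `layer_weights` and `model_cost` are hypothetical helpers.

```python
def layer_weights(hidden=768, ffn=3072):
    """Weight-matrix entries in one encoder layer (biases and LayerNorm omitted)."""
    attention = 4 * hidden * hidden  # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn  # the two feed-forward matrices
    return attention + feed_forward

def model_cost(num_layers, share_layers):
    """Return (stored layer weights, rough matmul FLOPs per token)."""
    per_layer = layer_weights()
    stored = per_layer if share_layers else num_layers * per_layer
    # Each layer is executed once per forward pass whether or not it is shared.
    flops_per_token = 2 * per_layer * num_layers
    return stored, flops_per_token

unshared = model_cost(12, share_layers=False)  # BERT-base-like stack
shared = model_cost(12, share_layers=True)     # ALBERT-style sharing
halved = model_cost(6, share_layers=False)     # DistilBERT-style depth cut

print("unshared:", unshared)
print("shared:  ", shared)
print("halved:  ", halved)
```

Sharing cuts the stored layer weights by a factor of 12 but leaves the estimated compute unchanged, while halving the depth halves both.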
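DistilBERT inherits several limitations from BERT, including the 512-token maximum input length and an attention cost that grows quadratically with sequence length, and adds a few of its own, most visibly the larger accuracy gap on reasoning-heavy tasks such as natural language inference.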
DistilBERT helped seed a generation of small encoder models that took the same compress-and-deploy philosophy further:
- The paraphrase- and multi-qa- families of sentence-transformers extended the DistilBERT lineage into the embedding world for retrieval pipelines.

In the modern landscape, distilled small encoders coexist with large language models. LLMs handle open-ended tasks and reasoning, while distilled encoders such as DistilBERT and MiniLM handle high-volume, latency-sensitive jobs such as content classification, intent detection, retrieval, and reranking, often inside the retrieval-augmented generation stack of an LLM application.
Hugging Face was founded in 2016 by Clement Delangue, Julien Chaumond, and Thomas Wolf, originally as a chatbot company aimed at teenage users. After pivoting away from the chatbot product, the team began releasing PyTorch ports of new transformer models, starting with pytorch-pretrained-BERT in late 2018. This library was renamed pytorch-transformers and then simply transformers as it expanded to cover GPT-2, RoBERTa, XLNet, and other architectures.
DistilBERT was one of the first significant original research contributions from the Hugging Face team rather than a port of an external model. It was announced in a blog post in August 2019 and described in a paper posted to arXiv in October 2019 and presented at a NeurIPS 2019 workshop [1]. The release demonstrated that the company could conduct competitive applied research on top of its open-source platform, repositioning Hugging Face from a library maintainer to a research organization.
DistilBERT's popularity also fed back into the platform's growth. The default sentiment-analysis pipeline uses a fine-tuned DistilBERT, and millions of users who ran the basic example downloaded a DistilBERT checkpoint as part of their first interaction with the Hub. The recipe has continued to inform later projects, including the Zephyr, SmolLM, and SmolVLM lines.