DistilBERT

DistilBERT is a compressed version of BERT from Hugging Face, described in an October 2019 paper, that is 40% smaller and 60% faster than BERT-base while retaining 97% of its language-understanding performance on the GLUE benchmark ^[1]. Created by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf, it was built with knowledge distillation applied during pretraining and has 66 million parameters versus BERT-base's 110 million. It was one of the first practical demonstrations that knowledge distillation could compress a large pretrained transformer without major accuracy loss, and it became a template for a whole family of compressed language models.

The model is trained with a triple loss that combines the standard masked language modeling objective with a soft-target distillation loss and a cosine embedding loss between student and teacher hidden states ^[1]. Released as part of Hugging Face's open-source transformers library and Model Hub, DistilBERT became one of the most widely downloaded NLP checkpoints on the platform: in one analysis of Hub statistics, the distilbert organization accounted for 44.6% of downloads across the most-downloaded entities studied ^[2]. It is used heavily for sentiment analysis, question answering, named entity recognition, and embedding pipelines, especially in resource-constrained environments such as web APIs, mobile apps, and edge devices ^[3].

What problem does DistilBERT solve?

BERT and the cost of pretrained transformers

BERT, short for Bidirectional Encoder Representations from Transformers, was introduced by Jacob Devlin and colleagues at Google AI Language in October 2018 ^[4]. It uses a stack of transformer encoder blocks pretrained on two unsupervised objectives: masked language modeling, where 15% of input tokens are randomly masked and the model predicts them, and next sentence prediction, a binary task asking whether one sentence follows another in the original text.
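
The masking step of the masked language modeling objective can be illustrated with a minimal sketch. It assumes whitespace tokens instead of WordPiece, and it always substitutes the mask token, whereas BERT replaces a selected token with the mask token only 80% of the time (10% a random token, 10% unchanged). The function name `mask_tokens` and the fixed seed are illustrative choices, not part of any library.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Mask a random subset of tokens and return (masked_tokens, labels)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    num_to_mask = round(len(tokens) * mask_rate)
    positions = rng.sample(range(len(tokens)), num_to_mask)
    masked = list(tokens)
    labels = {}
    for position in positions:
        labels[position] = tokens[position]  # target the model must recover
        masked[position] = MASK_TOKEN
    return masked, labels

# Toy 20-token input, so 15% corresponds to 3 masked positions.
tokens = ("knowledge distillation trains a small student network to imitate "
          "the predictions of a larger and more accurate teacher network .").split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
print(labels)
```

The labels are kept only for the masked positions, which is why the loss is computed on roughly 15% of the tokens in each sequence.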

BERT-base, the smaller of the two models released in the original paper, uses 12 transformer layers, a hidden size of 768, 12 self-attention heads, and roughly 110 million parameters. BERT-large doubles the depth to 24 layers and reaches about 340 million parameters. At release, BERT achieved state-of-the-art results on 11 natural language understanding benchmarks, including the GLUE benchmark, SQuAD v1.1 and v2.0, and SWAG.

Despite its accuracy, BERT-base is expensive to deploy. The attention mechanism scales quadratically with sequence length, and on commodity CPUs the model can take hundreds of milliseconds per query. A single BERT-base checkpoint occupies more than 400 MB on disk, making on-device deployment difficult. As the DistilBERT authors put it, "operating these large models on-the-edge and/or under constrained computational training or inference budgets remains challenging" ^[1]. The NLP community quickly began searching for ways to reduce its inference cost, exploring weight pruning, quantization, parameter sharing as in ALBERT, and knowledge distillation.

What is knowledge distillation?

Knowledge distillation was introduced by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in their 2015 paper Distilling the Knowledge in a Neural Network ^[5]. The basic idea is to train a smaller student network to imitate the predictions of a larger, more accurate teacher network. Rather than training the student only on hard labels, the student is also trained to match the teacher's full output distribution, which contains far richer information about the relative similarity between classes. Hinton called this extra signal the dark knowledge captured in the teacher's logits.

Formally, the teacher's logits are passed through a temperature-scaled softmax with temperature T > 1, producing softer probabilities that emphasize relative confidence between classes. The student is trained to match those softened probabilities using cross-entropy. By raising the temperature, low-probability classes still contribute meaningful gradient signal, allowing the student to learn similarity structure that one-hot labels cannot express.
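
To make the effect of the temperature concrete, the following sketch applies a temperature-scaled softmax to a small set of made-up teacher logits. The logit values and the choice of T = 4 are assumptions picked only to make the softening visible.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits over three classes.
teacher_logits = [4.0, 2.0, 0.5]

sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)

print("T=1:", [round(p, 3) for p in sharp])
print("T=4:", [round(p, 3) for p in soft])
```

At the higher temperature the top class loses probability mass and the least likely class gains it, so the student receives a stronger training signal about how the teacher ranks the non-top classes.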

DistilBERT was one of the first attempts to apply the technique at scale to a pretrained transformer language model. The Hugging Face team showed that the approach could be applied during pretraining itself rather than only during fine-tuning, producing a general-purpose distilled checkpoint that could then be fine-tuned on downstream tasks just like BERT.

How is DistilBERT built? (Architecture)

DistilBERT-base preserves BERT-base's structure while halving depth and removing two components.

Property	BERT-base	DistilBERT-base
Transformer layers	12	6
Hidden size	768	768
Attention heads	12	12
Feed-forward inner size	3072	3072
Total parameters	110M	66M
Token type embeddings	yes	no
Pooler layer	yes	no
Vocabulary	30,522 WordPiece	30,522 WordPiece
Maximum positions	512	512

The student keeps the same hidden dimension and attention head count as the teacher because matching these dimensions makes it possible to copy weights from the teacher during initialization and to apply the cosine alignment loss to hidden states without projection. Reducing depth gives a roughly proportional reduction in inference latency.

The Hugging Face team removed the token type embeddings used in BERT to mark sentence A and sentence B inputs, reflecting the decision to drop next sentence prediction during distillation. The pooler layer is also removed because downstream classification heads typically use the raw [CLS] representation directly.

The overall reduction is approximately 40% fewer parameters than BERT-base ^[1]. The disk footprint of a distilbert-base-uncased checkpoint is around 250 MB, compared to roughly 440 MB for bert-base-uncased. The tokenizer is unchanged, so any pipeline that uses BERT's WordPiece tokenizer can swap DistilBERT in without touching the preprocessing code.
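
The parameter counts in the table can be roughly reproduced from the listed dimensions. The sketch below counts weights and biases for a BERT-style encoder without any task head; the helper name `encoder_parameter_count` is hypothetical, and the size estimate assumes 4 bytes per weight (fp32) and ignores file-format overhead.

```python
def encoder_parameter_count(num_layers, hidden=768, ffn=3072, vocab=30522,
                            max_positions=512, token_types=0, pooler=False):
    """Count weights and biases of a BERT-style encoder (no task head)."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab * hidden + max_positions * hidden
                  + token_types * hidden + layer_norm)
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    total = embeddings + num_layers * per_layer
    if pooler:
        total += hidden * hidden + hidden
    return total

bert_base = encoder_parameter_count(12, token_types=2, pooler=True)
distilbert = encoder_parameter_count(6)
reduction = 1 - distilbert / bert_base

print(f"BERT-base:  {bert_base:,} parameters, ~{bert_base * 4 / 1e6:.0f} MB in fp32")
print(f"DistilBERT: {distilbert:,} parameters, ~{distilbert * 4 / 1e6:.0f} MB in fp32")
print(f"Reduction:  {reduction:.1%}")
```

Because the embedding matrix is shared by both models and is not halved, cutting the depth in half removes about 39% of the parameters rather than 50%.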

How is DistilBERT trained? (Distillation procedure)

DistilBERT is trained with what the original paper calls a triple loss, described as "a triple loss combining language modeling, distillation and cosine-distance losses" ^[1]. The loss function combines three components, each capturing a different aspect of the teacher's behavior.

Component losses

1. Distillation loss L_ce. This is the standard knowledge distillation cross-entropy applied to the masked language modeling outputs. For each masked position the student's logits and the teacher's logits are both passed through a softmax with temperature T greater than one, and the student is trained to match the teacher's softened distribution:

L_ce = - sum_i  t_i * log(s_i)

where t_i = softmax(z_t / T)_i and s_i = softmax(z_s / T)_i are the temperature-scaled teacher and student probabilities. Following Hinton, the gradient with respect to the student logits is multiplied by T^2 to keep the gradient magnitude comparable to a hard-label cross-entropy. DistilBERT typically uses temperatures around T = 2 during training and T = 1 at inference.

2. Masked language modeling loss L_mlm. The standard BERT masked language modeling objective is also applied, where 15% of input tokens are masked and the student is trained to predict them using the ground-truth token identities. This anchors the student to the actual training distribution and keeps it from simply reproducing the teacher's mistakes.

3. Cosine embedding loss L_cos. Student and teacher hidden states are aligned with a cosine similarity loss that encourages the angle between corresponding hidden vectors to be small:

L_cos = 1 - cos(h_student, h_teacher)

This loss operates directly on the geometry of the representation space. Because the student keeps the same hidden size as the teacher, the comparison can be done without learning a projection matrix.

The total loss is a weighted sum:

L = alpha * L_ce + beta * L_mlm + gamma * L_cos

The paper uses weights emphasizing distillation cross-entropy and cosine alignment, with the masked language modeling loss as a regularizer.
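
The three components and their weighted sum can be sketched for a single masked position in plain Python. This is an illustration under stated assumptions, not the authors' training code: the toy logits, the hidden vectors, and the default weights alpha = beta = gamma = 1.0 are placeholders, and the T^2 factor is applied to the loss value, which is the usual way the gradient scaling is implemented.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature):
    """L_ce: cross-entropy between softened teacher and student distributions."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    cross_entropy = -sum(t_i * math.log(s_i) for t_i, s_i in zip(t, s))
    return cross_entropy * temperature ** 2  # T^2 scaling

def mlm_loss(student_logits, true_token_id):
    """L_mlm: hard-label cross-entropy at temperature 1."""
    return -math.log(softmax(student_logits)[true_token_id])

def cosine_loss(h_student, h_teacher):
    """L_cos = 1 - cos(h_student, h_teacher)."""
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    norm_s = math.sqrt(sum(a * a for a in h_student))
    norm_t = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, true_token_id,
                h_student, h_teacher, temperature=2.0,
                alpha=1.0, beta=1.0, gamma=1.0):  # placeholder weights
    return (alpha * distillation_loss(student_logits, teacher_logits, temperature)
            + beta * mlm_loss(student_logits, true_token_id)
            + gamma * cosine_loss(h_student, h_teacher))

# Toy example: a 4-token vocabulary and 3-dimensional hidden states.
teacher_logits = [3.0, 1.0, 0.2, -1.0]
student_logits = [2.0, 1.5, 0.0, -0.5]
h_teacher = [0.3, 0.8, 0.0]
h_student = [0.2, 0.9, -0.1]

total = triple_loss(student_logits, teacher_logits, 0, h_student, h_teacher)
print(round(total, 4))
```

In a real training loop the same quantities would be averaged over all masked positions in a batch and computed with a tensor library rather than Python lists.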

Initialization and data

The authors found that initialization mattered substantially. Instead of training the 6-layer student from scratch, they initialized it from the teacher by taking one layer out of every two: student layer 0 from teacher layer 0, student layer 1 from teacher layer 2, and so on ^[1]. This warm start, which exploits the common dimensionality between teacher and student, shortens the distillation budget needed to reach competitive accuracy.
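
The "one layer out of every two" initialization described above amounts to a strided selection over the teacher's layers. The sketch below uses strings as stand-ins for per-layer weight tensors and follows the mapping stated in this section; the exact indices used by any released training script may differ.

```python
def select_teacher_layers(teacher_layers, step=2):
    """Keep every `step`-th teacher layer, starting from layer 0."""
    return teacher_layers[::step]

# Stand-ins for the teacher's 12 transformer blocks; in a real model each
# entry would hold that layer's weights rather than a string.
teacher_layers = [f"teacher_layer_{i}" for i in range(12)]
student_layers = select_teacher_layers(teacher_layers)

for student_index, source in enumerate(student_layers):
    print(f"student layer {student_index} <- {source}")
```

Copying is only possible because the student keeps the teacher's hidden size, head count, and feed-forward size, so every copied weight matrix has the shape the student expects.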

DistilBERT is trained on the same corpus as BERT-base: English Wikipedia and the Toronto BookCorpus dataset, totaling around 3.3 billion words. According to the paper, "DistilBERT was trained on 8 16GB V100 GPUs for approximately 90 hours" ^[1], an order of magnitude less compute than the original BERT pretraining.

Removed objectives

Next sentence prediction is omitted entirely. This decision was consistent with later findings, including those in the RoBERTa paper, that next sentence prediction either has no benefit or is mildly harmful when training is otherwise well-tuned. Dropping it also matches the simpler architecture, which removes the segment embeddings.

How well does DistilBERT perform? (Benchmarks)

The DistilBERT paper reports a series of comparisons against BERT-base on standard NLP tasks. The headline result is a roughly 97% retention of BERT-base's GLUE benchmark score while being 60% faster at inference and having 40% fewer parameters ^[1]. For reference, the paper reports a GLUE macro score of 68.7 for ELMo, 79.5 for BERT-base, and 77.0 for DistilBERT ^[1].

Task	BERT-base	DistilBERT-base
GLUE macro average (dev)	79.5	77.0
MNLI matched	86.7	82.2
QQP F1	89.6	88.5
SST-2	92.7	91.3
STS-B Pearson	89.0	86.9
QNLI	91.8	89.2
RTE	69.3	59.9
SQuAD v1.1 (EM / F1)	81.2 / 88.5	77.7 / 85.8
IMDb sentiment accuracy	93.46%	92.82%

(SQuAD v1.1 and IMDb figures are from the paper's downstream-task table ^[1].)

Classification tasks are where DistilBERT shines. On SST-2, QQP, and IMDb the gap to the teacher is about one and a half points or less (1.4, 1.1, and 0.64 points respectively). On natural language inference the gap is larger, at 4.5 points on MNLI and 9.4 points on RTE, and on extractive question answering over SQuAD v1.1 the F1 drops from 88.5 to 85.8, about 2.7 points, retaining roughly 97% of the teacher's F1 ^[1]. The remaining gaps are usually attributed to the depth reduction, since deeper transformers tend to support more complex compositional reasoning.

The paper reports a 60% speedup at inference, measuring a full pass over the STS-B development set in 410 seconds for DistilBERT versus 668 seconds for BERT-base on CPU ^[1]. The authors also demonstrated DistilBERT in an on-device proof-of-concept, running it on a mobile phone, showing it is fast enough for many on-device applications ^[1]. Memory usage at inference is roughly halved compared to BERT-base, important for browser deployments using ONNX Runtime or for edge computing environments.
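
The headline percentages follow directly from the figures quoted in this section. The short calculation below uses only the GLUE scores and STS-B timings reported above; the variable names are illustrative.

```python
bert_glue, distil_glue = 79.5, 77.0        # GLUE macro scores (dev)
bert_seconds, distil_seconds = 668, 410    # full pass over STS-B dev on CPU

retention = distil_glue / bert_glue                 # share of teacher score kept
throughput_gain = bert_seconds / distil_seconds - 1 # how much faster
time_fraction = distil_seconds / bert_seconds       # share of teacher's time

print(f"GLUE retention:  {retention:.1%}")
print(f"Throughput gain: {throughput_gain:.0%}")
print(f"Time fraction:   {time_fraction:.0%}")
```

The two ways of stating the speedup are consistent: processing about 63% more examples per second is the same as needing about 61% of the teacher's wall-clock time.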

What DistilBERT checkpoints and variants exist?

The distilbert namespace on the Hugging Face Hub contains several official checkpoints, and the technique has been applied to other base models to produce a small family of distilled transformers.

Official DistilBERT checkpoints

distilbert-base-uncased. The flagship model, distilled from bert-base-uncased. English only, lowercased input, 30,522 token WordPiece vocabulary. This is the most downloaded variant.
distilbert-base-cased. Distilled from bert-base-cased. Preserves casing, useful for named entity recognition where capitalization is an important feature.
distilbert-base-multilingual-cased. Distilled from bert-base-multilingual-cased, covering 104 languages. Smaller and faster than mBERT while retaining the bulk of its cross-lingual transfer ability.
distilbert-base-uncased-finetuned-sst-2-english. A ready-to-use sentiment classifier fine-tuned on SST-2. This checkpoint is the default model behind the pipeline("sentiment-analysis") shortcut in the Hugging Face library, which made it one of the most downloaded models on the platform.
distilbert-base-cased-distilled-squad. A SQuAD-fine-tuned variant frequently used as a default question-answering pipeline.

Related distilled models

The Hugging Face team and the broader community produced several siblings using the same recipe:

DistilGPT2. A distilled version of GPT-2 small with 82M parameters, used for lightweight text generation.
DistilRoBERTa. Distilled from RoBERTa-base. Inherits RoBERTa's improved tokenization and dynamic masking, and tends to outperform DistilBERT on downstream tasks at comparable size.
DistilCamemBERT. A French distilled model derived from CamemBERT, built by the community to bring DistilBERT's efficiency benefits to French NLP.
DistilBERT-based sentence-transformers. Embedding models such as multi-qa-distilbert are built on DistilBERT backbones for fast retrieval and clustering.

How do you use DistilBERT in Hugging Face Transformers?

DistilBERT was integrated into the transformers library at release. The library exposes a family of task-specific heads:

DistilBertModel: the bare encoder.
DistilBertForMaskedLM: encoder plus masked language modeling head.
DistilBertForSequenceClassification: sentiment, NLI, topic, and other text classification tasks.
DistilBertForQuestionAnswering: extractive question answering head for SQuAD-style data.
DistilBertForTokenClassification: token-level classifier for named entity recognition and other token classification tasks.
DistilBertForMultipleChoice: multiple choice tasks such as SWAG.

The simplest usage pattern relies on AutoTokenizer and AutoModel:

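from transformers import AutoTokenizer, AutoModel

# Load the WordPiece tokenizer and the bare 6-layer encoder.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

# Tokenize one sentence into PyTorch tensors and run a forward pass.
inputs = tokenizer("DistilBERT is a smaller BERT.", return_tensors="pt")
outputs = model(**inputs)
# Shape: (batch_size, sequence_length, 768)
last_hidden_state = outputs.last_hidden_state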

For a sentiment classifier, the high-level pipeline API hides the model selection behind a single call:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("This article is comprehensive and clear.")
# [{'label': 'POSITIVE', 'score': 0.9998}]

Because distilbert-base-uncased-finetuned-sst-2-english is the default for that pipeline, every user who runs the code above downloads a DistilBERT checkpoint, which contributed to its dominant share of Hugging Face download counts.

DistilBERT was also one of the first models supported by ONNX Runtime, TensorFlow Lite, and Core ML conversion paths, making it a popular choice for production deployment outside the standard PyTorch runtime.

Why was DistilBERT so widely adopted?

DistilBERT became a widely deployed default in industrial NLP systems. For years after its release it was one of the top models on the Hugging Face Hub by monthly downloads, often appearing alongside BERT-base, RoBERTa-base, and sentence-transformers MiniLM checkpoints; in one Hub-statistics analysis the distilbert organization made up 44.6% of downloads among the most-downloaded entities studied ^[2]. The reasons are practical:

It is fast enough to serve inside web request handlers without batching.
It is small enough to fit in tight container memory limits and to ship inside browser extensions or mobile apps.
Its accuracy is high enough to act as a strong baseline for classification, retrieval, and entity tagging.
Its API in the transformers library is identical to BERT, making it a drop-in replacement.

DistilBERT is also a common default in Kaggle competitions and in introductory NLP courses, where its training speed lets users iterate many ideas in a single day. The paper has been cited several thousand times since publication, and its methodology section is frequently referenced as a canonical example of knowledge distillation applied to a pretrained transformer. Its release helped establish Hugging Face as a research organization rather than only a library maintainer.

How does DistilBERT compare with other compression methods?

Knowledge distillation is one of several techniques for shrinking a pretrained transformer. Each approach has different tradeoffs in accuracy, speed, hardware support, and engineering complexity.

Method	Examples	Mechanism	Strength	Weakness
Knowledge distillation	DistilBERT, TinyBERT, MobileBERT, MiniLM	Train smaller student to mimic teacher outputs and hidden states	Strong accuracy retention, hardware-agnostic	Requires teacher and a separate training run
Pruning	Movement pruning, magnitude pruning	Remove weights, heads, or layers based on importance criteria	Can be combined with other methods	Sparsity often needs special kernels for actual speedup
Quantization	int8, int4, dynamic, static, GPTQ	Reduce numerical precision of weights and activations	Easy to apply post hoc, large memory savings	Accuracy loss at very low bit widths
Parameter sharing	ALBERT	Share weights across layers	Smaller checkpoint	Same FLOPs at inference, no speedup
Architecture redesign	ELECTRA, DeBERTa	Change pretraining objective or attention design	More efficient pretraining	Different model family, retraining required

TinyBERT (Jiao et al. 2019) is a smaller cousin of DistilBERT with 4 layers, a hidden size of 312, and around 14.5 million parameters ^[6]. It uses a two-stage distillation that first distills general knowledge and then performs task-specific distillation, giving strong accuracy at much smaller sizes at the cost of a more elaborate pipeline.

MobileBERT (Sun et al. 2020) targets mobile inference using bottleneck architectures and inverted-bottleneck feed-forward layers ^[7]. It has 25.3 million parameters but matches BERT-base's GLUE score within one point.

MiniLM (Wang et al. 2020) introduces deep self-attention distillation that aligns the student's attention distributions and value-relation matrices with the teacher's ^[8]. Its checkpoints MiniLM-L6 and MiniLM-L12 became standard backbones for fast sentence embedding models in retrieval-augmented generation pipelines.

ALBERT (Lan et al. 2019) takes a different route, sharing parameters across layers and factorizing the embedding matrix ^[9]. While ALBERT-base has only 12 million parameters, it executes the same FLOPs at inference, so it does not provide the speed gains DistilBERT does.
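
A back-of-the-envelope size comparison helps put the parameter counts quoted in this section side by side. The sketch assumes 4 bytes per weight for fp32 and 1 byte per weight for int8, treats every weight as quantized, and ignores file-format overhead, so the figures are rough lower bounds rather than measured checkpoint sizes.

```python
# Parameter counts in millions, as quoted in this article.
PARAMETERS_IN_MILLIONS = {
    "BERT-base": 110.0,
    "DistilBERT": 66.0,
    "MobileBERT": 25.3,
    "TinyBERT": 14.5,
    "ALBERT-base": 12.0,
}

def checkpoint_megabytes(millions_of_parameters, bytes_per_weight):
    """Approximate size in MB: (millions * 1e6 weights * bytes) / 1e6."""
    return millions_of_parameters * bytes_per_weight

for name, millions in PARAMETERS_IN_MILLIONS.items():
    fp32 = checkpoint_megabytes(millions, 4)
    int8 = checkpoint_megabytes(millions, 1)
    print(f"{name:<12} fp32 ~{fp32:6.1f} MB   int8 ~{int8:5.1f} MB")
```

Checkpoint size is only part of the picture: as noted for ALBERT, a small file does not imply fewer FLOPs at inference, so latency has to be measured separately.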

Limitations

DistilBERT inherits several limitations from BERT and adds a few of its own:

Fixed sequence length. The maximum input length is 512 tokens, the same as BERT-base. Long documents must be chunked, which can hurt accuracy on tasks that depend on long-range dependencies.
Encoder only. DistilBERT cannot generate text. It is suited to classification, tagging, and embedding tasks but not to free-form generation, where GPT-2 or modern large language models are appropriate.
Reasoning gap. The loss of half the encoder layers shows up as a measurable accuracy drop on tasks requiring complex multi-hop reasoning, particularly on natural language inference and span-based question answering. The drop from 88.5 to 85.8 F1 on SQuAD v1.1 is the clearest example.
Not the smallest possible BERT. At 66 million parameters DistilBERT is large for TinyML and microcontroller deployments. TinyBERT and MiniLM-L6 reach a fraction of the size, and quantized variants can fit in tens of megabytes.
English-centric defaults. Most downstream usage is on the English uncased variant; the multilingual checkpoint has been somewhat eclipsed by XLM-R for cross-lingual work.
Eclipsed for some tasks. In the LLM era, DistilBERT has been superseded for advanced tasks by larger generative models with in-context learning, but it remains competitive for classification and embedding pipelines where latency and cost matter.
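
One common workaround for the fixed sequence length is to split a long document into overlapping windows and run the model on each window. The sketch below operates on a plain list of token ids; the function name `chunk_token_ids`, the window of 510 tokens (leaving room for the [CLS] and [SEP] special tokens), and the overlap of 128 tokens are assumptions chosen for illustration.

```python
def chunk_token_ids(token_ids, max_length=510, stride=128):
    """Split token ids into windows of at most max_length that overlap by stride."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the document
        start += step
    return chunks

# A stand-in document of 1,200 token ids.
document = list(range(1200))
chunks = chunk_token_ids(document)

print([len(chunk) for chunk in chunks])
```

How per-window predictions are merged (for example averaging class scores or taking the best answer span) depends on the task and is left out here.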

Successors

DistilBERT helped seed a generation of small encoder models that took the same compress-and-deploy philosophy further:

ELECTRA-small (Clark et al. 2020) replaced masked language modeling with a discriminative replaced-token-detection objective, achieving strong accuracy at small sizes with much less compute ^[10].
MiniLM and MiniLMv2 produced extremely small encoders that became the backbone for fast sentence embedding models in the sentence-transformers library.
DeBERTa (He et al. 2020) introduced disentangled attention and improved both small and large encoder accuracy.
The paraphrase- and multi-qa- families of sentence-transformers extended the DistilBERT lineage into the embedding world for retrieval pipelines.

In the modern landscape, distilled small encoders coexist with large language models. LLMs handle open-ended tasks and reasoning, while distilled encoders such as DistilBERT and MiniLM handle high-volume, latency-sensitive jobs such as content classification, intent detection, retrieval, and reranking, often inside the retrieval-augmented generation stack of an LLM application.

Who created DistilBERT? (Hugging Face context)

Hugging Face was founded in 2016 by Clement Delangue, Julien Chaumond, and Thomas Wolf, originally as a chatbot company aimed at teenage users. After pivoting away from the chatbot product, the team began releasing PyTorch ports of new transformer models, starting with pytorch-pretrained-BERT in late 2018. This library was renamed pytorch-transformers and then simply transformers as it expanded to cover GPT-2, RoBERTa, XLNet, and other architectures.

DistilBERT was one of the first significant original research contributions from the Hugging Face team rather than a port of an external model. It was announced in a Hugging Face blog post in August 2019 and described in a NeurIPS 2019 workshop paper posted to arXiv that October ^[1]. The release demonstrated that the company could conduct competitive applied research on top of its open-source platform, repositioning Hugging Face from a library maintainer to a research organization.

DistilBERT's popularity also fed back into the platform's growth. The default sentiment-analysis pipeline uses a fine-tuned DistilBERT, and millions of users who ran the basic example downloaded a DistilBERT checkpoint as part of their first interaction with the Hub. The recipe has continued to inform later projects, including the Zephyr, SmolLM, and SmolVLM lines.

References

1. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. *5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, NeurIPS 2019*. arXiv:1910.01108. https://arxiv.org/abs/1910.01108
2. Bourdois, L. Model statistics of the 50 most downloaded entities on Hugging Face. Hugging Face blog. https://huggingface.co/blog/lbourdois/huggingface-models-stats
3. Hugging Face. distilbert-base-uncased model card. https://huggingface.co/distilbert/distilbert-base-uncased
4. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. *NAACL 2019*. arXiv:1810.04805. https://arxiv.org/abs/1810.04805
5. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. *NIPS 2014 Deep Learning Workshop*. arXiv:1503.02531. https://arxiv.org/abs/1503.02531
6. Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., & Liu, Q. (2019). TinyBERT: Distilling BERT for Natural Language Understanding. *Findings of EMNLP 2020*. arXiv:1909.10351. https://arxiv.org/abs/1909.10351
7. Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., & Zhou, D. (2020). MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. *ACL 2020*. arXiv:2004.02984. https://arxiv.org/abs/2004.02984
8. Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. *NeurIPS 2020*. arXiv:2002.10957. https://arxiv.org/abs/2002.10957
9. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. *ICLR 2020*. arXiv:1909.11942. https://arxiv.org/abs/1909.11942
10. Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. *ICLR 2020*. arXiv:2003.10555. https://arxiv.org/abs/2003.10555
11. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692. https://arxiv.org/abs/1907.11692
12. Wolf, T., Debut, L., Sanh, V., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. *EMNLP 2020 System Demonstrations*. https://aclanthology.org/2020.emnlp-demos.6/
13. Sanh, V. (2019). Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT. Hugging Face blog. https://medium.com/huggingface/distilbert-8cf3380435b5
14. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. *EMNLP 2019*. arXiv:1908.10084. https://arxiv.org/abs/1908.10084
