# DistilBERT

> Source: https://aiwiki.ai/wiki/distilbert
> Updated: 2026-06-23
> Categories: AI Models, Deep Learning, Natural Language Processing, Transformer Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**DistilBERT** is a compressed version of [BERT](/wiki/bert) released by [Hugging Face](/wiki/hugging_face) in October 2019 that is 40% smaller and 60% faster than BERT-base while retaining 97% of its language-understanding performance on the [GLUE benchmark](/wiki/glue_benchmark) [1]. Created by Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf, it was built using [knowledge distillation](/wiki/knowledge_distillation) applied during pretraining and has 66 million parameters versus BERT-base's 110 million. DistilBERT was one of the first practical demonstrations that knowledge distillation could compress a large pretrained [transformer](/wiki/transformer) without major accuracy loss, and it became a template for [model compression](/wiki/model_compression) and for a family of compressed language models.

The model is trained with a triple loss that combines the standard masked language modeling objective with a soft-target distillation loss and a cosine embedding loss between student and teacher hidden states [1]. Released as part of Hugging Face's open-source [transformers](/wiki/transformers) library and Model Hub, DistilBERT became one of the most widely downloaded NLP checkpoints on the platform: in one analysis of Hub statistics, the distilbert organization accounted for 44.6% of downloads across the most-downloaded entities studied [2]. It is used heavily for [sentiment analysis](/wiki/sentiment_analysis), [question answering](/wiki/question_answering), [named entity recognition](/wiki/named_entity_recognition), and embedding pipelines, especially in resource-constrained environments such as web APIs, mobile apps, and edge devices [3].

## What problem does DistilBERT solve?

### BERT and the cost of pretrained transformers

[BERT](/wiki/bert), short for Bidirectional Encoder Representations from Transformers, was introduced by Jacob Devlin and colleagues at Google AI Language in October 2018 [4]. It uses a stack of [transformer](/wiki/transformer) encoder blocks pretrained on two unsupervised objectives: [masked language modeling](/wiki/masked_language_model), where 15% of input tokens are randomly masked and the model predicts them, and next sentence prediction, a binary task asking whether one sentence follows another in the original text.
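As a rough illustration of the masking step, the sketch below picks about 15% of token positions at random and replaces them with `[MASK]`. The helper name `mask_tokens`, the fixed seed, and the whitespace tokenization are illustrative assumptions. The sketch is simplified: BERT's actual procedure leaves some selected tokens unchanged or swaps in random tokens, which is omitted here.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for a simplified MLM example.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model would only be scored on the positions it has to reconstruct.
    """
    rng = random.Random(seed)
    # Mask at least one token so short inputs still produce a training signal.
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

# Whitespace splitting stands in for WordPiece tokenization (an assumption).
tokens = "the quick brown fox jumps over the lazy dog near the old river bank today".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

With 15 input tokens, two positions are masked, and the labels list records the original tokens only at those positions.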

BERT-base, the smaller of the two models released in the original paper, uses 12 transformer layers, a hidden size of 768, 12 self-attention heads, and roughly 110 million parameters. BERT-large doubles the depth to 24 layers and reaches about 340 million parameters. At release, BERT achieved state-of-the-art results on 11 natural language understanding benchmarks, including the [GLUE benchmark](/wiki/glue_benchmark), [SQuAD](/wiki/squad) v1.1 and v2.0, and SWAG.

Despite its accuracy, BERT-base is expensive to deploy. The [attention mechanism](/wiki/attention_mechanism) scales quadratically with sequence length, and on commodity CPUs the model can take hundreds of milliseconds per query. A single BERT-base checkpoint occupies more than 400 MB on disk, making on-device deployment difficult. As the DistilBERT authors put it, "operating these large models on-the-edge and/or under constrained computational training or inference budgets remains challenging" [1]. The NLP community quickly began searching for ways to reduce its inference cost, exploring weight [pruning](/wiki/pruning), [quantization](/wiki/quantization), parameter sharing as in [ALBERT](/wiki/albert), and [knowledge distillation](/wiki/knowledge_distillation).

### What is knowledge distillation?

Knowledge distillation was introduced by [Geoffrey Hinton](/wiki/geoffrey_hinton), [Oriol Vinyals](/wiki/oriol_vinyals), and Jeff Dean in their 2015 paper *Distilling the Knowledge in a Neural Network* [5]. The basic idea is to train a smaller student network to imitate the predictions of a larger, more accurate teacher network. Rather than training the student only on hard labels, the student is also trained to match the teacher's full output distribution, which contains far richer information about the relative similarity between classes. Hinton called this extra signal the dark knowledge captured in the teacher's logits.

Formally, the teacher's logits are passed through a temperature-scaled [softmax](/wiki/softmax) with [temperature](/wiki/temperature) `T > 1`, producing softer probabilities that emphasize relative confidence between classes. The student is trained to match those softened probabilities using cross-entropy. By raising the temperature, low-probability classes still contribute meaningful gradient signal, allowing the student to learn similarity structure that one-hot labels cannot express.
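The effect of temperature can be seen with a small pure-Python sketch. The logits below are made-up values for three classes, and `softmax_with_temperature` is a hypothetical helper name rather than a library function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [6.0, 2.0, 1.0]  # made-up logits for three classes
for t in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top class toward the other two, so the ordering and relative spacing of the low-probability classes become visible to the student.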

DistilBERT was one of the first attempts to apply the technique at scale to a pretrained transformer language model. The Hugging Face team showed that the approach could be applied during pretraining itself rather than only during fine-tuning, producing a general-purpose distilled checkpoint that could then be fine-tuned on downstream tasks just like BERT.

## How is DistilBERT built? (Architecture)

DistilBERT-base preserves BERT-base's structure while halving depth and removing two components.

| Property | BERT-base | DistilBERT-base |
|---|---|---|
| Transformer layers | 12 | 6 |
| Hidden size | 768 | 768 |
| Attention heads | 12 | 12 |
| Feed-forward inner size | 3072 | 3072 |
| Total parameters | 110M | 66M |
| Token type embeddings | yes | no |
| Pooler layer | yes | no |
| Vocabulary | 30,522 [WordPiece](/wiki/wordpiece) | 30,522 [WordPiece](/wiki/wordpiece) |
| Maximum positions | 512 | 512 |

The student keeps the same hidden dimension and attention head count as the teacher because matching these dimensions makes it possible to copy weights from the teacher during initialization and to apply the cosine alignment loss to hidden states without projection. Reducing depth gives a roughly proportional reduction in inference latency.

The Hugging Face team removed the token type embeddings used in BERT to mark sentence A and sentence B inputs, reflecting the decision to drop next sentence prediction during distillation. The pooler layer is also removed because downstream classification heads typically use the raw `[CLS]` representation directly.

The overall reduction is approximately 40% fewer parameters than BERT-base [1]. The disk footprint of a `distilbert-base-uncased` checkpoint is around 250 MB, compared to roughly 440 MB for `bert-base-uncased`. The tokenizer is unchanged, so any pipeline that uses BERT's WordPiece tokenizer can swap DistilBERT in without touching the preprocessing code.
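The parameter counts in the table can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the standard BERT encoder layout (query, key, value, and output projections, a two-layer feed-forward block, and LayerNorms, all with biases) and counts the encoder only, without any language modeling or task head. The function names are illustrative.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

def layer_norm(dim):
    """Parameters of a LayerNorm: scale and shift vectors."""
    return 2 * dim

def encoder_params(n_layers, hidden=768, ffn=3072, vocab=30522, max_pos=512,
                   token_types=0, pooler=False):
    """Approximate parameter count of a BERT-style encoder."""
    embeddings = (vocab * hidden + max_pos * hidden
                  + token_types * hidden + layer_norm(hidden))
    attention = 4 * linear(hidden, hidden) + layer_norm(hidden)  # Q, K, V, output
    feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm(hidden)
    total = embeddings + n_layers * (attention + feed_forward)
    if pooler:
        total += linear(hidden, hidden)
    return total

bert = encoder_params(12, token_types=2, pooler=True)
distil = encoder_params(6)
print(bert, distil)                                   # 109482240 66362880
print(f"reduction: {1 - distil / bert:.1%}")          # reduction: 39.4%
print(f"fp32 size: {bert * 4 / 1e6:.0f} MB vs {distil * 4 / 1e6:.0f} MB")
```

At 4 bytes per parameter the encoder weights alone come to roughly 438 MB and 265 MB, in the same range as the checkpoint sizes quoted above. Because the embedding matrix is shared and not halved, cutting the depth in half removes about 40% of the parameters rather than 50%.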

## How is DistilBERT trained? (Distillation procedure)

DistilBERT is trained with what the original paper describes as "a triple loss combining language modeling, distillation and cosine-distance losses" [1]. Each of the three components captures a different aspect of the teacher's behavior.

### Component losses

**1. Distillation loss `L_ce`.** This is the standard knowledge distillation cross-entropy applied to the masked language modeling outputs. For each masked position the student's logits and the teacher's logits are both passed through a softmax with temperature `T` greater than one, and the student is trained to match the teacher's softened distribution:

```
L_ce = - sum_i  t_i * log(s_i)
```

where `t_i = softmax(z_t / T)_i` and `s_i = softmax(z_s / T)_i` are the temperature-scaled teacher and student probabilities. Following Hinton, the gradient with respect to the student logits is multiplied by `T^2` to keep the gradient magnitude comparable to a hard-label cross-entropy. DistilBERT typically uses temperatures around `T = 2` during training and `T = 1` at inference.
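A minimal pure-Python sketch of this loss for a single masked position follows. The logits are made-up values over a four-token vocabulary, and the function names are illustrative. Scaling the loss by `T^2` is one common way to obtain the gradient scaling described above; it is shown here as an assumption about implementation, not as the exact code used to train DistilBERT.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target cross-entropy between teacher and student, scaled by T^2."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    cross_entropy = -sum(t_i * math.log(s_i) for t_i, s_i in zip(t, s))
    return (temperature ** 2) * cross_entropy

teacher = [4.0, 1.0, 0.5, -2.0]        # made-up teacher logits
close_student = [3.8, 1.1, 0.4, -1.9]  # student that nearly agrees
far_student = [-2.0, 0.5, 1.0, 4.0]    # student that ranks tokens in reverse

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

The loss is smallest when the student's softened distribution equals the teacher's, in which case it reduces to `T^2` times the entropy of the teacher distribution.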

**2. Masked language modeling loss `L_mlm`.** The standard BERT masked language modeling objective is also applied, where 15% of input tokens are masked and the student is trained to predict them using the ground-truth token identities. This anchors the student to the actual training distribution and keeps it from simply reproducing the teacher's mistakes.

**3. Cosine embedding loss `L_cos`.** Student and teacher hidden states are aligned with a [cosine similarity](/wiki/cosine_similarity) loss that encourages the angle between corresponding hidden vectors to be small:

```
L_cos = 1 - cos(h_student, h_teacher)
```

This loss operates directly on the geometry of the representation space. Because the student keeps the same hidden size as the teacher, the comparison can be done without learning a projection matrix.
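For a single pair of hidden vectors, the loss can be computed as in the sketch below. The four-dimensional vectors are toy values (real hidden states have 768 dimensions), and `cosine_embedding_loss` is an illustrative helper name.

```python
import math

def cosine_embedding_loss(h_student, h_teacher):
    """1 - cos(h_student, h_teacher) for two vectors of equal length."""
    dot = sum(a * b for a, b in zip(h_student, h_teacher))
    norm_s = math.sqrt(sum(a * a for a in h_student))
    norm_t = math.sqrt(sum(b * b for b in h_teacher))
    return 1.0 - dot / (norm_s * norm_t)

h_teacher = [0.5, -1.0, 2.0, 0.0]  # toy teacher hidden state

same_direction = cosine_embedding_loss([1.0, -2.0, 4.0, 0.0], h_teacher)
orthogonal = cosine_embedding_loss([2.0, 1.0, 0.0, 3.0], h_teacher)
opposite = cosine_embedding_loss([-0.5, 1.0, -2.0, 0.0], h_teacher)
print(same_direction, orthogonal, opposite)
```

The loss is approximately 0 when the vectors point the same way regardless of their lengths, 1 when they are orthogonal, and 2 when they point in opposite directions, which is why it constrains direction rather than magnitude.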

The total loss is a weighted sum:

```
L = alpha * L_ce + beta * L_mlm + gamma * L_cos
```

The paper uses weights emphasizing distillation cross-entropy and cosine alignment, with the masked language modeling loss as a regularizer.

### Initialization and data

The authors found that initialization mattered substantially. Instead of training the 6-layer student from scratch, they initialized it from the teacher by taking one layer out of every two: student layer 0 from teacher layer 0, student layer 1 from teacher layer 2, and so on [1]. This warm start, which exploits the common dimensionality between teacher and student, shortens the distillation budget needed to reach competitive accuracy.
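The every-other-layer rule described above can be sketched as a simple index mapping. The dictionary keyed by `(layer_index, parameter_name)` is a hypothetical stand-in for a real checkpoint's state dict, and the exact teacher layers chosen in any released extraction script may differ from this plain stride-2 rule.

```python
def student_to_teacher_layers(n_student_layers=6, stride=2):
    """Map each student layer index to the teacher layer it is copied from."""
    return {s: s * stride for s in range(n_student_layers)}

def init_student_from_teacher(teacher_weights, layer_map):
    """Copy per-layer weights from teacher to student following layer_map.

    teacher_weights is keyed by (layer_index, parameter_name); this layout is
    a hypothetical simplification of a real state dict.
    """
    teacher_to_student = {t: s for s, t in layer_map.items()}
    student_weights = {}
    for (layer, name), value in teacher_weights.items():
        if layer in teacher_to_student:
            student_weights[(teacher_to_student[layer], name)] = value
    return student_weights

layer_map = student_to_teacher_layers()
print(layer_map)  # {0: 0, 1: 2, 2: 4, 3: 6, 4: 8, 5: 10}

# Toy "teacher" with 12 layers and one placeholder parameter per layer.
teacher = {(i, "attention.q.weight"): f"teacher_layer_{i}" for i in range(12)}
student = init_student_from_teacher(teacher, layer_map)
print(student[(1, "attention.q.weight")])  # teacher_layer_2
```

The copy works without any reshaping only because student and teacher layers have identical dimensions, which is the design constraint noted in the architecture section.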

DistilBERT is trained on the same corpus as BERT-base: English Wikipedia and the Toronto BookCorpus dataset, totaling around 3.3 billion words. According to the paper, "DistilBERT was trained on 8 16GB V100 GPUs for approximately 90 hours" [1], or roughly 720 GPU-hours. For comparison, the paper notes that RoBERTa required one day of training on 1,024 32GB V100 GPUs [1][11].

### Removed objectives

Next sentence prediction is omitted entirely. This decision is consistent with findings such as those in the [RoBERTa](/wiki/roberta) paper [11], which reported that next sentence prediction either has no benefit or is mildly harmful when training is otherwise well-tuned. Dropping it also fits the simpler architecture, which has no segment embeddings.

## How well does DistilBERT perform? (Benchmarks)

The DistilBERT paper reports a series of comparisons against BERT-base on standard NLP tasks. The headline result is a roughly 97% retention of BERT-base's [GLUE benchmark](/wiki/glue_benchmark) score using only 60% of the inference time and 40% fewer parameters [1]. For reference, the paper reports a GLUE macro score of 68.7 for ELMo, 79.5 for BERT-base, and 77.0 for DistilBERT [1].

| Task | BERT-base | DistilBERT-base |
|---|---|---|
| GLUE macro average (dev) | 79.5 | 77.0 |
| MNLI matched | 86.7 | 82.2 |
| QQP F1 | 89.6 | 88.5 |
| SST-2 | 92.7 | 91.3 |
| STS-B Pearson | 89.0 | 86.9 |
| QNLI | 91.8 | 89.2 |
| RTE | 69.3 | 59.9 |
| SQuAD v1.1 (EM / F1) | 81.2 / 88.5 | 77.7 / 85.8 |
| IMDb sentiment accuracy | 93.46% | 92.82% |

(SQuAD v1.1 and IMDb figures are from the paper's downstream-task table [1].)

Classification tasks are where DistilBERT holds up best. On SST-2, QQP, and IMDb the gap to the teacher is about 1.4 points or less. On natural language inference and other reasoning-heavy tasks the gap is larger, reaching 4.5 points on MNLI and 9.4 points on RTE, and on extractive [question answering](/wiki/question_answering) over [SQuAD](/wiki/squad) v1.1 the F1 drops from 88.5 to 85.8, about 2.7 points, retaining roughly 97% of the teacher's F1 [1]. A common explanation for the remaining gaps is the depth reduction, since deeper transformers tend to support more complex compositional reasoning.
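The "97%" retention figures follow directly from the reported scores, as this short calculation shows. The helper name `retention` is illustrative.

```python
def retention(student_score, teacher_score):
    """Fraction of the teacher's score that the student keeps."""
    return student_score / teacher_score

glue = retention(77.0, 79.5)      # GLUE macro average
squad_f1 = retention(85.8, 88.5)  # SQuAD v1.1 F1
print(f"GLUE: {glue:.1%}, SQuAD F1: {squad_f1:.1%}")  # GLUE: 96.9%, SQuAD F1: 96.9%
```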

The paper reports a 60% speedup at inference, measuring a full pass over the STS-B development set in 410 seconds for DistilBERT versus 668 seconds for BERT-base on CPU [1]. The authors also demonstrated DistilBERT in an on-device proof-of-concept, running it on a mobile phone, showing it is fast enough for many on-device applications [1]. Memory usage at inference is roughly halved compared to BERT-base, important for browser deployments using ONNX Runtime or for [edge computing](/wiki/edge_computing) environments.
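One way to relate the "60% faster" headline to the reported timings is to compare throughput rather than elapsed time. The calculation below uses only the two timings quoted above; reading the speedup as a throughput gain is an interpretation, not a definition taken from the paper.

```python
bert_seconds = 668.0
distilbert_seconds = 410.0

# Share of BERT-base's wall-clock time that DistilBERT needs.
time_fraction = distilbert_seconds / bert_seconds
# Extra examples processed per second relative to BERT-base.
throughput_gain = bert_seconds / distilbert_seconds - 1

print(f"time fraction: {time_fraction:.1%}")      # time fraction: 61.4%
print(f"throughput gain: {throughput_gain:.1%}")  # throughput gain: 62.9%
```

Under this reading, DistilBERT needs about 61% of BERT-base's time for the same pass, which corresponds to processing roughly 63% more examples per second.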

## What DistilBERT checkpoints and variants exist?

The `distilbert` namespace on the Hugging Face Hub contains several official checkpoints, and the technique has been applied to other base models to produce a small family of distilled transformers.

### Official DistilBERT checkpoints

- `distilbert-base-uncased`. The flagship model, distilled from `bert-base-uncased`. English only, lowercased input, 30,522 token WordPiece vocabulary. This is the most downloaded variant.
- `distilbert-base-cased`. Distilled from `bert-base-cased`. Preserves casing, useful for [named entity recognition](/wiki/named_entity_recognition) where capitalization is an important feature.
- `distilbert-base-multilingual-cased`. Distilled from `bert-base-multilingual-cased`, covering 104 languages. Smaller and faster than mBERT while retaining the bulk of its cross-lingual transfer ability.
- `distilbert-base-uncased-finetuned-sst-2-english`. A ready-to-use sentiment classifier fine-tuned on SST-2. This checkpoint is the default model behind the `pipeline("sentiment-analysis")` shortcut in the Hugging Face library, which made it one of the most downloaded models on the platform.
- `distilbert-base-cased-distilled-squad`. A SQuAD-fine-tuned variant frequently used as a default question-answering pipeline.

### Related distilled models

The Hugging Face team and the broader community produced several siblings using the same recipe:

- **DistilGPT2.** A distilled version of [GPT-2](/wiki/gpt_2) small with 82M parameters, used for lightweight text generation.
- **DistilRoBERTa.** Distilled from [RoBERTa](/wiki/roberta)-base. Inherits RoBERTa's improved tokenization and dynamic masking, and tends to outperform DistilBERT on downstream tasks at comparable size.
- **DistilCamemBERT.** A French distilled model derived from [CamemBERT](/wiki/camembert), built by the community to bring DistilBERT's efficiency benefits to French NLP.
- **DistilBERT-based sentence-transformers.** Embedding models such as `multi-qa-distilbert` are built on DistilBERT backbones for fast retrieval and clustering.

## How do you use DistilBERT in Hugging Face Transformers?

DistilBERT was integrated into the [transformers](/wiki/transformers) library [12] at release. The library exposes a family of task-specific heads:

- `DistilBertModel`: the bare encoder.
- `DistilBertForMaskedLM`: encoder plus masked language modeling head.
- `DistilBertForSequenceClassification`: sentiment, NLI, topic, and other [text classification](/wiki/text_classification_models) tasks.
- `DistilBertForQuestionAnswering`: extractive question answering head for SQuAD-style data.
- `DistilBertForTokenClassification`: token-level classifier for [named entity recognition](/wiki/named_entity_recognition) and other token classification tasks.
- `DistilBertForMultipleChoice`: multiple choice tasks such as SWAG.

The simplest usage pattern relies on `AutoTokenizer` and `AutoModel`:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("DistilBERT is a smaller BERT.", return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
```

For a sentiment classifier, the high-level pipeline API hides the model selection behind a single call:

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("This article is comprehensive and clear.")
# Example output (the exact score depends on the model revision):
# [{'label': 'POSITIVE', 'score': 0.9998}]
```

Because `distilbert-base-uncased-finetuned-sst-2-english` is the default for that pipeline, every user who runs the code above downloads a DistilBERT checkpoint, which contributed to its dominant share of Hugging Face download counts.

DistilBERT was also one of the first models supported by ONNX Runtime, TensorFlow Lite, and Core ML conversion paths, making it a popular choice for production deployment outside the standard PyTorch runtime.

## Why was DistilBERT so widely adopted?

DistilBERT became a widely deployed default in industrial NLP systems. For years after its release it was one of the top models on the Hugging Face Hub by monthly downloads, often appearing alongside BERT-base, RoBERTa-base, and `sentence-transformers` MiniLM checkpoints [2]. The reasons are practical:

- It is fast enough to serve inside web request handlers without batching.
- It is small enough to fit in tight container memory limits and to ship inside browser extensions or mobile apps.
- Its accuracy is high enough to act as a strong baseline for classification, retrieval, and entity tagging.
- Its API in the transformers library is identical to BERT, making it a drop-in replacement.

DistilBERT is also a common default in [Kaggle](/wiki/kaggle) competitions and in introductory NLP courses, where its training speed lets users iterate many ideas in a single day. The paper has been cited several thousand times since publication, and its methodology section is frequently referenced as a canonical example of [knowledge distillation](/wiki/knowledge_distillation) applied to a pretrained transformer. Its release helped establish [Hugging Face](/wiki/hugging_face) as a research organization rather than only a library maintainer.

## How does DistilBERT compare with other compression methods?

Knowledge distillation is one of several techniques for shrinking a pretrained transformer. Each approach has different tradeoffs in accuracy, speed, hardware support, and engineering complexity.

| Method | Examples | Mechanism | Strength | Weakness |
|---|---|---|---|---|
| Knowledge distillation | DistilBERT, TinyBERT, MobileBERT, MiniLM | Train smaller student to mimic teacher outputs and hidden states | Strong accuracy retention, hardware-agnostic | Requires teacher and a separate training run |
| Pruning | Movement pruning, magnitude pruning | Remove weights, heads, or layers based on importance criteria | Can be combined with other methods | Sparsity often needs special kernels for actual speedup |
| Quantization | int8, int4, dynamic, static, GPTQ | Reduce numerical precision of weights and activations | Easy to apply post hoc, large memory savings | Accuracy loss at very low bit widths |
| Parameter sharing | [ALBERT](/wiki/albert) | Share weights across layers | Smaller checkpoint | Same FLOPs at inference, no speedup |
| Architecture redesign | [ELECTRA](/wiki/electra), [DeBERTa](/wiki/deberta) | Change pretraining objective or attention design | More efficient pretraining | Different model family, retraining required |

TinyBERT (Jiao et al. 2019) is a smaller cousin of DistilBERT with 4 layers, a hidden size of 312, and around 14.5 million parameters [6]. It uses a two-stage distillation that first distills general knowledge and then performs task-specific distillation, giving strong accuracy at much smaller sizes at the cost of a more elaborate pipeline.

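MobileBERT (Sun et al. 2020) targets mobile inference using bottleneck structures and a rebalanced mix of self-attention and feed-forward layers, and is trained by transferring knowledge from a specially designed inverted-bottleneck BERT-large teacher [7]. It has 25.3 million parameters but matches BERT-base's GLUE score within one point.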

MiniLM (Wang et al. 2020) introduces deep self-attention distillation that aligns the student's attention distributions and value-relation matrices with the teacher's [8]. Its checkpoints `MiniLM-L6` and `MiniLM-L12` became standard backbones for fast sentence embedding models in retrieval-augmented generation pipelines.

[ALBERT](/wiki/albert) (Lan et al. 2019) takes a different route, sharing parameters across layers and factorizing the embedding matrix [9]. While ALBERT-base has only 12 million parameters, it executes the same FLOPs at inference, so it does not provide the speed gains DistilBERT does.
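The difference between sharing weights and removing layers can be made concrete with a rough count. The sketch below assumes a standard encoder layer with hidden size 768 and feed-forward size 3072, uses the number of layer passes as a crude proxy for inference compute, and ignores ALBERT's embedding factorization.

```python
def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

def layer_params(hidden=768, ffn=3072):
    """Parameters in one standard transformer encoder layer."""
    attention = 4 * linear(hidden, hidden) + 2 * hidden
    feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + 2 * hidden
    return attention + feed_forward

per_layer = layer_params()
configs = {
    # name: (layers executed per forward pass, distinct layers stored)
    "12 layers, no sharing": (12, 12),
    "12 layers, shared weights": (12, 1),
    "6 layers, no sharing": (6, 6),
}
for name, (executed, stored) in configs.items():
    print(f"{name}: stored={stored * per_layer:,} layer-passes={executed}")
```

Sharing weights cuts the stored layer parameters by a factor of 12 but still runs 12 layer passes per input, whereas halving the depth cuts both storage and the number of passes.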

## Limitations

DistilBERT inherits several limitations from BERT and adds a few of its own:

- **Fixed sequence length.** The maximum input length is 512 tokens, the same as BERT-base. Long documents must be chunked, which can hurt accuracy on tasks that depend on long-range dependencies.
- **Encoder only.** DistilBERT cannot generate text. It is suited to classification, tagging, and embedding tasks but not to free-form generation, where [GPT-2](/wiki/gpt_2) or modern [large language models](/wiki/large_language_model) are appropriate.
- **Reasoning gap.** The loss of half the encoder layers shows up as a measurable accuracy drop on tasks requiring complex multi-hop reasoning, particularly on natural language inference and span-based question answering. The drop from 88.5 to 85.8 F1 on SQuAD v1.1 is the clearest example.
- **Not the smallest possible BERT.** At 66 million parameters DistilBERT is large for TinyML and microcontroller deployments. TinyBERT and MiniLM-L6 reach a fraction of the size, and quantized variants can fit in tens of megabytes.
- **English-centric defaults.** Most downstream usage is on the English uncased variant; the multilingual checkpoint has been somewhat eclipsed by XLM-R for cross-lingual work.
- **Eclipsed for some tasks.** In the LLM era, DistilBERT has been superseded for advanced tasks by larger generative models with in-context learning, but it remains competitive for classification and embedding pipelines where latency and cost matter.
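The chunking workaround for the 512-token limit is often implemented as a sliding window with overlap. The sketch below operates on a plain list of token ids; the function name and the default overlap of 128 tokens are illustrative assumptions. In practice the window would be slightly shorter than 512 to leave room for special tokens such as `[CLS]` and `[SEP]`.

```python
def chunk_token_ids(token_ids, max_len=512, stride=128):
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens so that content near a
    boundary appears with context in at least one window.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    step = max_len - stride
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break
        start += step
    return chunks

document = list(range(1200))  # stand-in for 1,200 token ids
chunks = chunk_token_ids(document)
print([(c[0], c[-1], len(c)) for c in chunks])
# [(0, 511, 512), (384, 895, 512), (768, 1199, 432)]
```

Each chunk is encoded independently, so any dependency that spans more than one window is invisible to the model, which is the accuracy cost mentioned in the list above.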

## Successors

DistilBERT helped seed a generation of small encoder models that took the same compress-and-deploy philosophy further:

- [ELECTRA](/wiki/electra)-small (Clark et al. 2020) replaced masked language modeling with a discriminative replaced-token-detection objective, achieving strong accuracy at small sizes with much less compute [10].
- MiniLM and MiniLMv2 produced extremely small encoders that became the backbone for fast sentence embedding models in the sentence-transformers library.
- [DeBERTa](/wiki/deberta) (He et al. 2020) introduced disentangled attention and improved both small and large encoder accuracy.
- The `paraphrase-` and `multi-qa-` families of sentence-transformers extended the DistilBERT lineage into the embedding world for retrieval pipelines.

In the modern landscape, distilled small encoders coexist with [large language models](/wiki/large_language_model). LLMs handle open-ended tasks and reasoning, while distilled encoders such as DistilBERT and MiniLM handle high-volume, latency-sensitive jobs such as content classification, intent detection, retrieval, and reranking, often inside the [retrieval-augmented generation](/wiki/retrieval_augmented_generation) stack of an LLM application.

## Who created DistilBERT? (Hugging Face context)

[Hugging Face](/wiki/hugging_face) was founded in 2016 by [Clement Delangue](/wiki/clement_delangue), Julien Chaumond, and Thomas Wolf, originally as a chatbot company aimed at teenage users. After pivoting away from the chatbot product, the team began releasing PyTorch ports of new transformer models, starting with `pytorch-pretrained-BERT` in late 2018. This library was renamed `pytorch-transformers` and then simply `transformers` as it expanded to cover [GPT-2](/wiki/gpt_2), [RoBERTa](/wiki/roberta), XLNet, and other architectures.

DistilBERT was one of the first significant original research contributions from the Hugging Face team rather than a port of an external model. It was announced in a Hugging Face blog post [13] and described in a paper posted to arXiv in October 2019 and presented at a NeurIPS 2019 workshop [1]. The release demonstrated that the company could conduct competitive applied research on top of its open-source platform, repositioning Hugging Face from a library maintainer to a research organization.

DistilBERT's popularity also fed back into the platform's growth. The default sentiment-analysis pipeline uses a fine-tuned DistilBERT, and millions of users who ran the basic example downloaded a DistilBERT checkpoint as part of their first interaction with the Hub. The recipe has continued to inform later projects, including the Zephyr, SmolLM, and SmolVLM lines.

## See also

- [BERT](/wiki/bert), [RoBERTa](/wiki/roberta), [ALBERT](/wiki/albert)
- [ELECTRA](/wiki/electra), [DeBERTa](/wiki/deberta)
- [Knowledge distillation](/wiki/knowledge_distillation), [Model compression](/wiki/model_compression)
- [Pruning](/wiki/pruning), [Quantization](/wiki/quantization)
- [Hugging Face](/wiki/hugging_face), [Transformers](/wiki/transformers)
- [Edge computing](/wiki/edge_computing), [Retrieval-augmented generation](/wiki/retrieval_augmented_generation)

## References

[1] Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. *5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, NeurIPS 2019*. arXiv:1910.01108. https://arxiv.org/abs/1910.01108

[2] Bourdois, L. Model statistics of the 50 most downloaded entities on Hugging Face. Hugging Face blog. https://huggingface.co/blog/lbourdois/huggingface-models-stats

[3] Hugging Face. distilbert-base-uncased model card. https://huggingface.co/distilbert/distilbert-base-uncased

[4] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. *NAACL 2019*. arXiv:1810.04805. https://arxiv.org/abs/1810.04805

[5] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. *NIPS 2014 Deep Learning Workshop*. arXiv:1503.02531. https://arxiv.org/abs/1503.02531

[6] Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., & Liu, Q. (2019). TinyBERT: Distilling BERT for Natural Language Understanding. *Findings of EMNLP 2020*. arXiv:1909.10351. https://arxiv.org/abs/1909.10351

[7] Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., & Zhou, D. (2020). MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. *ACL 2020*. arXiv:2004.02984. https://arxiv.org/abs/2004.02984

[8] Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., & Zhou, M. (2020). MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. *NeurIPS 2020*. arXiv:2002.10957. https://arxiv.org/abs/2002.10957

[9] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. *ICLR 2020*. arXiv:1909.11942. https://arxiv.org/abs/1909.11942

[10] Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. *ICLR 2020*. arXiv:2003.10555. https://arxiv.org/abs/2003.10555

[11] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692. https://arxiv.org/abs/1907.11692

[12] Wolf, T., Debut, L., Sanh, V., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. *EMNLP 2020 System Demonstrations*. https://aclanthology.org/2020.emnlp-demos.6/

[13] Sanh, V. (2019). Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT. Hugging Face blog. https://medium.com/huggingface/distilbert-8cf3380435b5

[14] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. *EMNLP 2019*. arXiv:1908.10084. https://arxiv.org/abs/1908.10084
