DeBERTa (Decoding-enhanced BERT with Disentangled Attention) is a family of pre-trained language models developed by Microsoft Research. Introduced by Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen, DeBERTa improves upon BERT and RoBERTa through two core innovations: a disentangled attention mechanism that separately encodes content and positional information, and an enhanced mask decoder that incorporates absolute position information during pre-training. The original DeBERTa paper was published at ICLR 2021, and successive versions (DeBERTa v2 and DeBERTa v3) have pushed the boundaries of natural language understanding (NLU). In January 2021, a 1.5-billion-parameter DeBERTa model became the first single model to surpass human performance on the SuperGLUE benchmark, scoring 89.9 against the human baseline of 89.8.
Transformer-based pre-trained language models such as BERT, GPT, RoBERTa, XLNet, ALBERT, and ELECTRA have achieved state-of-the-art results on many NLU tasks. These models are first pre-trained on large text corpora using self-supervised objectives like masked language modeling (MLM) and then fine-tuned on task-specific labeled data.
BERT and its variants represent each input token as a single vector that combines both content meaning and positional information. However, this design has a limitation: it conflates two fundamentally different types of information. The word "bank" in the phrase "river bank" carries different meaning from "bank" in "bank account," and its position relative to surrounding words plays a role in disambiguating the meaning. DeBERTa was motivated by the observation that separating content and position representations allows the attention mechanism to model richer interactions between words.
Another motivation came from the way BERT handles position information during decoding. In the original BERT, the absolute position of each token is added to the input embedding at the bottom of the model. By the time the representation reaches the upper layers, the positional signal has been diluted through multiple Transformer layers. DeBERTa addresses this through an enhanced mask decoder that reintroduces absolute positions right before the final prediction layer.
The first version of DeBERTa was presented in the paper "DeBERTa: Decoding-enhanced BERT with Disentangled Attention" (He et al., 2020, published at ICLR 2021). It introduces two primary technical contributions: the disentangled attention mechanism and the enhanced mask decoder.
In standard Transformer models, the input representation of each token is a single vector that sums a content embedding (from the token itself) and a positional embedding (indicating where the token sits in the sequence). The self-attention score between any two tokens is then computed as the dot product of their combined representations after linear projections.
DeBERTa departs from this design by representing each token with two separate vectors: one for content and one for relative position. The attention score between tokens at positions i and j is decomposed into three components: content-to-content, content-to-position, and position-to-content.
A fourth component, position-to-position, is excluded: with relative position embeddings, this term depends only on the pair of positions and not on the tokens that occupy them, so it adds little useful information.
Formally, for a pair of tokens at positions i and j, the disentangled attention score A(i,j) is computed as:
A(i,j) = H_i * W_qc * (H_j * W_kc)^T + H_i * W_qc * (P_{i|j} * W_kp)^T + H_j * W_kc * (P_{j|i} * W_qp)^T
where H_i and H_j are the content vectors for tokens i and j, P_{i|j} represents the relative position embedding of position i relative to j, and W_qc, W_kc, W_qp, W_kp are learnable projection matrices for query-content, key-content, query-position, and key-position respectively.
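The three-term score above can be sketched in a few lines of NumPy. This is an illustrative single-head sketch, not DeBERTa's actual implementation; the `rel` index array (mapping each offset i - j to a shared relative-position slot) and all shapes are assumptions for the example.

```python
import numpy as np

def disentangled_scores(H, P, W_qc, W_kc, W_qp, W_kp, rel):
    """Compute the three-term disentangled attention scores.

    H:   (N, d)  content vectors for the N tokens
    P:   (2k, d) shared relative-position embeddings
    rel: (N, N)  int indices, rel[i, j] = clipped index of the offset i - j
    """
    N, d = H.shape
    Qc, Kc = H @ W_qc, H @ W_kc  # content queries / keys
    Qp, Kp = P @ W_qp, P @ W_kp  # position queries / keys
    c2c = Qc @ Kc.T                                   # content-to-content
    c2p = (Qc @ Kp.T)[np.arange(N)[:, None], rel]     # content-to-position
    p2c = (Kc @ Qp.T)[np.arange(N)[:, None], rel].T   # position-to-content
    # DeBERTa scales by 1/sqrt(3d) because three score terms are summed
    return (c2c + c2p + p2c) / np.sqrt(3 * d)
```

Note the gather steps: `c2p[i, j]` picks the position key for the offset of i relative to j, while `p2c[i, j]` picks the position query for the offset of j relative to i, matching the P_{i|j} and P_{j|i} terms in the formula.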
DeBERTa uses relative position encodings rather than absolute ones. The maximum relative distance is clipped to a fixed value k (set to 512 by default), so position embeddings are shared across all positions that differ by the same relative offset. This reduces the space complexity of position embeddings from O(N^2 * d) to O(k * d), where d is the hidden dimension.
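The clipping rule described above maps any offset i - j into one of 2k shared embedding slots. A minimal sketch (the function name is ours; the mapping follows the paper's definition of the clipped relative distance):

```python
def rel_pos_index(i: int, j: int, k: int = 512) -> int:
    """Map the offset i - j to one of 2k shared embedding slots.
    Offsets at or beyond +/-k are clipped to the boundary buckets."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k
```

All token pairs with the same (clipped) offset share one embedding row, which is what brings the position-embedding storage down to O(k * d).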
The disentangled attention captures relationships based on content and relative position, but it does not account for the absolute position of tokens. Absolute position information is important for certain predictions. For example, in the sentence "A new store opened beside the new mall," the words "store" and "mall" both appear after the word "new." The model must rely on absolute position (or broader context) to determine which "new" refers to which noun when predicting a masked token.
DeBERTa addresses this by incorporating absolute position embeddings into the decoding layer, right before the softmax layer that predicts the masked tokens. This is called the Enhanced Mask Decoder (EMD). Rather than adding position information at the input (as BERT does), DeBERTa adds it only at the output layer, where it has the most direct impact on the final prediction. This approach preserves the benefits of relative position encoding throughout the lower Transformer layers while still giving the decoder access to absolute position information when it matters.
DeBERTa also introduces a virtual adversarial training method called SiFT (Scale-invariant Fine-Tuning) for improving model generalization during fine-tuning. In adversarial training, small perturbations are added to input embeddings to make the model more robust. However, the norms of embedding vectors vary significantly across different words and model sizes, which can cause instability during training.
SiFT addresses this by first normalizing word embedding vectors into unit vectors, then applying adversarial perturbations to the normalized embeddings. This normalization makes the perturbation scale-invariant, meaning it works consistently regardless of the magnitude of the original embeddings. The improvement from SiFT is more pronounced for larger DeBERTa models.
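The normalize-then-perturb idea can be illustrated with a small sketch. This is a simplification for exposition only: SiFT applies layer normalization and computes gradient-based adversarial perturbations, whereas here plain L2 row normalization and random noise stand in for both.

```python
import numpy as np

def sift_style_perturb(emb, eps=0.02, rng=np.random.default_rng(0)):
    """Illustrative sketch: normalize each embedding row first, then add a
    perturbation of bounded norm to the *normalized* vectors, so the
    perturbation scale no longer depends on each word's embedding magnitude."""
    unit = emb / (np.linalg.norm(emb, axis=-1, keepdims=True) + 1e-12)
    noise = rng.normal(size=emb.shape)
    noise *= eps / (np.linalg.norm(noise, axis=-1, keepdims=True) + 1e-12)
    return unit + noise
```

Because every row is normalized before perturbation, a fixed eps represents the same relative perturbation for rare words with small embeddings and frequent words with large ones.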
The initial release of DeBERTa included three model sizes:
| Model | Layers | Hidden Size | Attention Heads | Parameters | Vocabulary |
|---|---|---|---|---|---|
| DeBERTa-Base | 12 | 768 | 12 | 100M | 50K (BPE) |
| DeBERTa-Large | 24 | 1,024 | 16 | 350M | 50K (BPE) |
| DeBERTa-XLarge | 48 | 1,024 | 16 | 700M | 50K (BPE) |
DeBERTa-Base and DeBERTa-Large use the same BPE tokenizer vocabulary as RoBERTa (50K tokens). The models were pre-trained using the same data as RoBERTa: English Wikipedia (12 GB), BookCorpus (6 GB), OpenWebText (38 GB), and Stories, a subset of CommonCrawl (31 GB), totaling approximately 78 GB after deduplication.
DeBERTa follows the masked language modeling objective from BERT, where 15% of input tokens are masked and the model predicts the original tokens. The pre-training hyperparameters for the large model include a batch size of 2,048, a learning rate of 2e-4, and 1 million training steps. DeBERTa-Large was trained on 6 DGX-2 machines (96 NVIDIA V100 GPUs) over the course of 20 days. DeBERTa-Base was trained on 4 DGX-2 machines (64 V100 GPUs) for 10 days.
DeBERTa v2 introduced several improvements that enabled scaling to larger model sizes and improved performance:
DeBERTa v2 replaced the GPT-2 BPE tokenizer (50K vocabulary) used in v1 with a SentencePiece tokenizer and a much larger vocabulary of 128K tokens. The larger vocabulary was built from the training data and provides better coverage of rare words and subword units.
DeBERTa v2 added a convolutional layer alongside the first Transformer layer to better capture local dependencies among input tokens. This n-gram induced input encoding helps the model learn character-level and subword-level patterns that complement the global patterns captured by self-attention.
In the disentangled attention mechanism, v2 shares the projection matrices for the relative position embeddings with the content projection matrices in all attention layers, rather than learning separate position projections, reducing the total number of parameters without a significant loss in performance.
DeBERTa v2 adopted a log-bucket scheme for encoding relative positions, similar to the approach used in T5. Instead of assigning a unique embedding to each relative distance, nearby positions share the same bucket while distant positions are grouped logarithmically. This allows the model to handle longer sequences more efficiently.
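One way to sketch such a log-bucket mapping (parameter names and exact constants are illustrative, not DeBERTa's actual API): offsets within half the bucket size keep their exact value, and larger offsets are compressed logarithmically.

```python
import math

def log_bucket(rel_pos: int, bucket_size: int = 256, max_position: int = 512) -> int:
    """Sketch of a T5-style log-bucket mapping for relative positions.
    Offsets within +/-(bucket_size // 2) keep their exact value; larger
    offsets are squeezed logarithmically up to max_position - 1."""
    if rel_pos == 0:
        return 0
    sign = 1 if rel_pos > 0 else -1
    mid = bucket_size // 2
    if abs(rel_pos) < mid:
        return rel_pos
    log_pos = mid + math.ceil(
        math.log(abs(rel_pos) / mid)
        / math.log((max_position - 1) / mid)
        * (mid - 1)
    )
    return sign * min(log_pos, max_position - 1)
```

Nearby offsets stay fully resolved while distant offsets share buckets, so a fixed number of position embeddings can cover much longer sequences.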
DeBERTa v2 scaled the architecture to much larger sizes:
| Model | Layers | Hidden Size | Attention Heads | Parameters | Vocabulary |
|---|---|---|---|---|---|
| DeBERTa-v2-XLarge | 24 | 1,536 | 24 | 710M | 128K (SPM) |
| DeBERTa-v2-XXLarge | 48 | 1,536 | 24 | 1,320M | 128K (SPM) |
The DeBERTa-v2-XXLarge model, with roughly 1.5 billion parameters once the 128K-token embedding layer is included (and sometimes referred to as DeBERTa 1.5B in Microsoft's announcements), was the model that first surpassed human performance on the SuperGLUE benchmark. For training this largest variant, CC-News data was added to the training corpus.
DeBERTa v3 was introduced in the paper "DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing" (He et al., 2021, published at ICLR 2023). It represents the most refined version of the DeBERTa family and is the variant most widely used in practice today.
The most significant change in DeBERTa v3 is the replacement of masked language modeling with replaced token detection (RTD), the training objective introduced by ELECTRA. In the RTD framework, a small generator network (trained with MLM) produces plausible replacement tokens for masked positions, and a larger discriminator network must determine whether each token in the input is original or has been replaced by the generator.
RTD is more sample-efficient than MLM because the discriminator receives a training signal from every token in the input, not just the 15% that are masked. This means the model effectively learns from more data per training step. The combined loss function is L = L_MLM + lambda * L_RTD, where lambda is set to 50.
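A minimal sketch of how the discriminator's input and per-token labels can be built from the generator's samples (function and variable names are illustrative, not ELECTRA's or DeBERTa's actual code):

```python
def build_rtd_example(input_ids, mask_positions, generator_samples):
    """Build the discriminator input and labels for replaced token detection.

    input_ids: original token ids; mask_positions: indices that were masked
    for the generator; generator_samples: the generator's sampled tokens
    for those positions, in the same order."""
    corrupted = list(input_ids)
    labels = [0] * len(input_ids)        # 0 = original, 1 = replaced
    for pos, tok in zip(mask_positions, generator_samples):
        if tok != input_ids[pos]:        # if the generator guesses the true
            corrupted[pos] = tok         # token, the position stays "original"
            labels[pos] = 1
    return corrupted, labels
```

Every position gets a label, masked or not, which is why the discriminator receives a training signal from the full sequence rather than only the masked 15%.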
In the original ELECTRA design, the generator and discriminator share their token embeddings. This sharing is beneficial because it allows the discriminator to leverage the generator's knowledge about token semantics. However, the two models have conflicting training objectives: the generator's MLM loss encourages embeddings to cluster similar tokens together (since any plausible replacement should score highly), while the discriminator's RTD loss pushes embeddings of different tokens apart (since it must distinguish original tokens from replacements). This creates a "tug-of-war" dynamic where the two losses pull the shared embeddings in opposite directions, hurting both training efficiency and final model quality.
DeBERTa v3 solves the tug-of-war problem with a technique called Gradient-Disentangled Embedding Sharing (GDES). The key idea is to let the discriminator use a copy of the generator's embeddings, but block the gradients from flowing back through the generator's embeddings during the discriminator's update.
Formally, the discriminator's embedding E_D is defined as:
E_D = sg(E_G) + E_delta
where sg() is the stop-gradient operator, E_G is the generator's embedding, and E_delta is a residual embedding that captures the difference between what the discriminator needs and what the generator provides. The stop-gradient operator ensures that the discriminator's loss does not affect the generator's embeddings, eliminating the tug-of-war dynamic. After pre-training, the discriminator's final embedding is obtained by adding E_delta to E_G.
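The gradient routing that sg() enforces can be shown as explicit update steps. This is a conceptual sketch with hand-supplied gradients; a real implementation would rely on an autograd stop-gradient such as `detach()` rather than routing gradients by hand.

```python
import numpy as np

def gdes_step(E_G, E_delta, grad_mlm, grad_rtd, lr=1e-3):
    """One conceptual GDES update: because E_D = sg(E_G) + E_delta, the
    discriminator's RTD gradient reaches only E_delta, while the
    generator's MLM gradient reaches only E_G."""
    E_G = E_G - lr * grad_mlm          # generator update (MLM loss only)
    E_delta = E_delta - lr * grad_rtd  # discriminator update (RTD loss only)
    return E_G, E_delta
```

The stop-gradient makes the two parameter sets independent during training, which is exactly what removes the tug-of-war between the MLM and RTD objectives.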
This approach preserves the benefits of embedding sharing (the discriminator still sees the generator's learned token representations) while avoiding the conflicting gradient problem. Experiments showed that GDES consistently outperforms both no sharing and vanilla sharing across all model sizes.
DeBERTa v3 was released in four sizes to cover different computational budgets:
| Model | Layers | Hidden Size | Parameters (Backbone) | Parameters (with Embedding) | Vocabulary | Training Data |
|---|---|---|---|---|---|---|
| DeBERTa-v3-XSmall | 12 | 384 | 22M | 70M | 128K (SPM) | 160 GB |
| DeBERTa-v3-Small | 6 | 768 | 44M | 142M | 128K (SPM) | 160 GB |
| DeBERTa-v3-Base | 12 | 768 | 86M | 184M | 128K (SPM) | 160 GB |
| DeBERTa-v3-Large | 24 | 1,024 | 304M | 435M | 128K (SPM) | 160 GB |
The "backbone parameters" count excludes the embedding layer. Because DeBERTa v3 uses a large 128K-token SentencePiece vocabulary, the embedding layer itself accounts for a substantial portion of the total parameters.
DeBERTa v3 also includes a multilingual variant, mDeBERTa-v3-Base, which was trained on 2.5 TB of CC100 multilingual data covering 100 languages. With 86M backbone parameters and a 250K-token vocabulary, mDeBERTa-v3-Base achieved 79.8% zero-shot cross-lingual accuracy on the XNLI benchmark, a 3.6-point improvement over XLM-R Base.
DeBERTa v3 models were pre-trained on 160 GB of data consisting of English Wikipedia, BookCorpus, OpenWebText, CC-News, and Stories. Training used the AdamW optimizer with beta_1 = 0.9 and beta_2 = 0.98. The batch size was 8,192, and models were trained for 500,000 steps with 10,000 warmup steps. Learning rates were set to 3e-4 for the large model and 5e-4 for the base and small models.
The following table summarizes all DeBERTa variants across all versions:
| Model | Version | Layers | Hidden Size | Parameters | Vocabulary | Tokenizer |
|---|---|---|---|---|---|---|
| DeBERTa-Base | v1 | 12 | 768 | 100M | 50K | BPE |
| DeBERTa-Large | v1 | 24 | 1,024 | 350M | 50K | BPE |
| DeBERTa-XLarge | v1 | 48 | 1,024 | 700M | 50K | BPE |
| DeBERTa-v2-XLarge | v2 | 24 | 1,536 | 710M | 128K | SentencePiece |
| DeBERTa-v2-XXLarge | v2 | 48 | 1,536 | 1,320M | 128K | SentencePiece |
| DeBERTa-v3-XSmall | v3 | 12 | 384 | 22M | 128K | SentencePiece |
| DeBERTa-v3-Small | v3 | 6 | 768 | 44M | 128K | SentencePiece |
| DeBERTa-v3-Base | v3 | 12 | 768 | 86M | 128K | SentencePiece |
| DeBERTa-v3-Large | v3 | 24 | 1,024 | 304M | 128K | SentencePiece |
| mDeBERTa-v3-Base | v3 | 12 | 768 | 86M | 250K | SentencePiece |
The GLUE (General Language Understanding Evaluation) benchmark consists of nine NLU tasks, of which eight are commonly reported (WNLI is typically excluded). The following table compares DeBERTa variants against other pre-trained models on the GLUE development set:
| Model | CoLA | QQP | MNLI-m/mm | SST-2 | STS-B | QNLI | RTE | MRPC | Average |
|---|---|---|---|---|---|---|---|---|---|
| BERT-Large | 60.6 | 91.3 | 86.6 | 93.2 | 90.0 | 92.3 | 70.4 | 88.0 | 84.05 |
| RoBERTa-Large | 68.0 | 92.2 | 90.2 | 96.4 | 92.4 | 93.9 | 86.6 | 90.9 | 88.82 |
| XLNet-Large | 69.0 | 92.3 | 90.8 | 97.0 | 92.5 | 94.9 | 85.9 | 90.8 | 89.15 |
| ELECTRA-Large | 69.1 | 92.4 | 90.9 | 96.9 | 92.6 | 95.0 | 88.0 | 90.8 | 89.46 |
| DeBERTa-Large (v1) | 70.5 | 92.3 | 91.1 | 96.8 | 92.8 | 95.3 | 88.3 | 91.9 | 90.00 |
| DeBERTa-v3-Large | 75.3 | 93.0 | 91.8/91.9 | 96.9 | 93.0 | 96.0 | 92.7 | 92.2 | 91.37 |
DeBERTa-v3-Large achieves a 91.37% average score, outperforming DeBERTa v1 Large by 1.37 points and ELECTRA-Large by 1.91 points. The improvements are especially large on low-resource tasks: RTE gains 4.4 points and CoLA gains 4.8 points compared to DeBERTa v1.
For base-sized models, DeBERTa-v3-Base also shows strong improvements:
| Model | Parameters | MNLI-m/mm | SQuAD v2.0 (F1/EM) |
|---|---|---|---|
| BERT-Base | 86M | 84.3/84.7 | 76.3/73.7 |
| RoBERTa-Base | 86M | 87.6 | 83.7/80.5 |
| ELECTRA-Base | 86M | 88.8 | -/80.5 |
| DeBERTa-Base (v1) | 100M | 88.8/88.5 | 86.2/83.1 |
| DeBERTa-v3-Base | 86M | 90.6/90.7 | 88.4/85.4 |
DeBERTa-v3-Base outperforms DeBERTa v1 Base by 1.8 points on MNLI-m and 2.2 points (F1) on SQuAD v2.0, while using fewer backbone parameters (86M vs. 100M).
The SuperGLUE benchmark is a more challenging successor to GLUE, consisting of eight tasks including BoolQ, CB, COPA, MultiRC, ReCoRD, RTE, WiC, and WSC. On December 29, 2020, a single DeBERTa-v2-XXLarge model (1.5B parameters) became the first individual model to surpass human performance:
| Task | DeBERTa 1.5B (Single) | Human Baseline |
|---|---|---|
| BoolQ | 90.4 | 89.0 |
| CB (F1/Acc) | 94.9/97.2 | 95.8/98.9 |
| COPA | 96.8 | 100.0 |
| MultiRC (F1a/EM) | 88.2/63.7 | 81.8/51.9 |
| ReCoRD (F1/EM) | 94.5/94.1 | 93.8/93.4 |
| RTE | 93.2 | 93.6 |
| WiC | 76.4 | 80.0 |
| WSC | 95.9 | 100.0 |
| Average | 89.9 | 89.8 |
The single model scored 89.9, narrowly exceeding the human baseline of 89.8. An ensemble DeBERTa model pushed the score further to 90.3. This result was notable because DeBERTa achieved it with 1.5 billion parameters, substantially fewer than T5-11B (11 billion parameters), which had held the previous top spot on the leaderboard.
On the Stanford Question Answering Dataset (SQuAD), DeBERTa demonstrates consistent improvements over prior models:
| Model | SQuAD v1.1 (F1/EM) | SQuAD v2.0 (F1/EM) |
|---|---|---|
| RoBERTa-Large | 94.6/88.9 | 89.4/86.5 |
| XLNet-Large | 95.1/89.7 | 90.6/87.9 |
| DeBERTa-Large (v1) | 95.5/90.1 | 90.7/88.0 |
| DeBERTa-v3-Large | - | 91.5/89.0 |
| DeBERTa-v2-XXLarge | - | 92.2/89.7 |
DeBERTa-Large outperforms RoBERTa-Large by 0.9 F1 on SQuAD v1.1 and 1.3 F1 on SQuAD v2.0. The v3-Large variant further pushes SQuAD v2.0 to 91.5 F1, and the v2-XXLarge reaches 92.2 F1.
For smaller models, DeBERTa-v3-XSmall with only 22M backbone parameters achieves 84.8 F1 on SQuAD v2.0, exceeding RoBERTa-Base (83.7 F1), which has nearly four times as many backbone parameters.
On the RACE reading comprehension benchmark, DeBERTa-Large achieved 86.8% accuracy, a 3.6-point improvement over RoBERTa-Large's 83.2%. This large gain on a challenging reading comprehension task highlights DeBERTa's strength in tasks that require understanding complex passages and reasoning about content and position.
DeBERTa belongs to the family of encoder-only pre-trained language models that descend from BERT. Understanding its relationship to these models helps clarify DeBERTa's contributions.
BERT (Bidirectional Encoder Representations from Transformers) uses absolute position embeddings added at the input layer and computes attention over combined content-position representations. DeBERTa disentangles content and position into separate vectors, computes factored attention scores, and reintroduces absolute positions only at the decoder layer. On the GLUE benchmark, DeBERTa-Large outperforms BERT-Large by nearly 6 points on average (90.00 vs. 84.05).
RoBERTa (Robustly Optimized BERT Pretraining Approach) keeps BERT's architecture but improves pre-training by removing the next-sentence prediction objective and training with larger batches, more data, and longer training schedules. DeBERTa builds on RoBERTa's training improvements and adds architectural innovations. Even when trained on half the data that RoBERTa used, DeBERTa-Large outperforms RoBERTa-Large on MNLI by 0.9 points (91.1 vs. 90.2) and on SQuAD v2.0 by 1.3 F1 points (90.7 vs. 89.4). However, DeBERTa consumes roughly twice the GPU memory of RoBERTa during training and inference because of its disentangled attention mechanism, which maintains separate projection matrices for content and position.
ELECTRA introduced the replaced token detection (RTD) pre-training objective, where a generator produces replacement tokens and a discriminator identifies which tokens have been replaced. This is more sample-efficient than MLM because every token provides a training signal. DeBERTa v1 and v2 use MLM for pre-training, while DeBERTa v3 adopts ELECTRA's RTD objective and adds the GDES method to resolve the embedding sharing problem. DeBERTa-v3-Large outperforms ELECTRA-Large by 1.91 points on the GLUE average (91.37 vs. 89.46).
ALBERT (A Lite BERT) reduces model size through cross-layer parameter sharing and factorized embedding parameterization. While ALBERT focuses on parameter efficiency, DeBERTa focuses on representational power through disentangled attention. DeBERTa achieves higher performance at similar computational costs but uses more parameters than ALBERT.
XLNet uses a permutation-based language modeling objective and integrates relative position encodings from Transformer-XL. DeBERTa also uses relative positions but goes further by fully disentangling content and position representations. On the GLUE benchmark, DeBERTa-Large outperforms XLNet-Large on six out of eight tasks, with notable gains on MRPC (+1.1 points), RTE (+2.4 points), and CoLA (+1.5 points).
All DeBERTa models are available through the Hugging Face Transformers library, making them accessible for fine-tuning on downstream tasks. The models are released under the MIT license. As of early 2026, the most popular variants by monthly downloads on Hugging Face are:
| Model | Monthly Downloads (approx.) |
|---|---|
| microsoft/deberta-v3-base | ~2,350,000 |
| microsoft/deberta-v3-small | ~1,134,000 |
| microsoft/deberta-v3-large | ~1,045,000 |
| microsoft/deberta-v3-xsmall | ~39,000 |
The v3-base variant alone receives over 2.3 million downloads per month, making it one of the most widely used encoder models on the platform. Beyond the base Microsoft checkpoints, hundreds of community-contributed fine-tuned models exist for tasks including sentiment analysis, natural language inference, prompt injection detection, and named entity recognition.
DeBERTa is well suited for a range of NLU tasks, including text classification, natural language inference, extractive question answering, named entity recognition, and sentiment analysis.
The choice of DeBERTa variant depends on the available computational resources and the requirements of the task: the v3-xsmall and v3-small checkpoints suit latency- or memory-constrained settings, v3-base is a strong default for most fine-tuning workloads, and v3-large offers the best accuracy when GPU memory and training time permit.
DeBERTa has become the dominant encoder model in competitive machine learning on Kaggle, particularly for NLP tasks. From 2021 through 2024, DeBERTa models appeared in the winning or top-placing solutions of nearly every text-related Kaggle competition.
An analysis of Kaggle text data competitions from 2021 to 2023 found that the top three solutions predominantly used five architectures: deberta-v3-large, deberta-large, roberta-base, roberta-large, and deberta-v2-xlarge. Among these, deberta-v3-large was the most frequently used model. Almost all winning NLP solutions during this period relied on some version of DeBERTa.
Competitions where DeBERTa featured prominently include text classification, essay scoring, feedback analysis, and reading comprehension challenges. The model's strength on small and medium-sized labeled datasets (a common scenario in Kaggle competitions) is one reason for its popularity: DeBERTa's disentangled attention and efficient pre-training produce strong representations that transfer well even with limited fine-tuning data.
Starting in late 2022 and accelerating through 2024, decoder-only generative models such as LLaMA, Mistral, Phi, and Gemma began appearing in competitive NLP solutions. Some competition winners used generative models exclusively, while others combined DeBERTa-based encoders with decoder models in ensemble setups. For example, in Kaggle's PII Data Detection competition, ensembles of DeBERTa models were trained on synthetic data generated by large language models.
Despite the rise of generative models, DeBERTa remained the encoder model of choice among Kaggle winners in 2024. Its smaller size, faster inference, and strong performance on classification and extraction tasks keep it relevant. The emergence of newer encoder architectures such as ModernBERT may eventually challenge DeBERTa's dominance, but as of 2025, it continues to be the standard encoder backbone for competitive NLP.
While DeBERTa's benchmark results are impressive, the authors were careful to note several limitations. In particular, surpassing the human baseline on SuperGLUE does not mean the model has reached human-level natural language understanding, and the disentangled attention mechanism increases memory consumption relative to BERT-style models of comparable size.
DeBERTa's contributions to the NLP field extend beyond benchmark scores. The disentangled attention mechanism demonstrated that separating content and position representations leads to better language understanding, influencing subsequent model designs. The GDES technique in DeBERTa v3 provided a principled solution to the embedding sharing problem in ELECTRA-style training.
DeBERTa also proved that architectural innovation can be as important as scale. The 1.5B-parameter DeBERTa outperformed the 11B-parameter T5 on SuperGLUE, showing that a well-designed model can achieve top performance at a fraction of the size. This finding has implications for making NLU models more accessible and environmentally sustainable.
The model's open-source release under the MIT license and its integration into Hugging Face Transformers have contributed to its widespread adoption in both research and industry. As of 2026, DeBERTa-v3-base and DeBERTa-v3-large remain among the most downloaded encoder models on Hugging Face, with combined monthly downloads exceeding 4 million.