ELECTRA, which stands for Efficiently Learning an Encoder that Classifies Token Replacements Accurately, is a pre-training method for natural language processing introduced by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. The paper was published at the International Conference on Learning Representations (ICLR) in 2020 as a collaboration between Stanford University and Google Brain. ELECTRA proposes a novel pre-training objective called replaced token detection (RTD) that fundamentally rethinks how language models learn during pre-training. Instead of masking tokens and predicting them (as in BERT's masked language modeling), ELECTRA corrupts the input by replacing some tokens with plausible alternatives generated by a small generator network, then trains a discriminator network to identify which tokens have been replaced. This approach allows the model to learn from all input tokens rather than just the masked subset, resulting in significantly better sample efficiency and computational savings.
Before ELECTRA, the dominant paradigm for pre-training language representations was masked language modeling (MLM), popularized by BERT in 2018. In MLM, roughly 15% of input tokens are replaced with a special [MASK] token, and the model is trained to predict the original tokens at those positions. While effective, this approach has a fundamental inefficiency: the model only receives a training signal from the 15% of tokens that were masked, leaving the remaining 85% of positions unused for learning.
This inefficiency means that MLM-based models require enormous amounts of compute and data to achieve strong performance. Models like RoBERTa and XLNet demonstrated that simply training BERT longer with more data yields improvements, but at substantial computational cost. RoBERTa, for example, used 160 GB of text data and trained for 500K steps with large batch sizes on 1,024 V100 GPUs.
The authors of ELECTRA identified this core inefficiency and asked a simple question: what if a pre-training task could learn from every single token in the input, not just a small masked subset? This question led to the development of replaced token detection.
The replaced token detection (RTD) task is the central innovation of ELECTRA. Rather than masking tokens and asking the model to predict what they were, RTD works in two stages:
Token replacement: A small generator network replaces a subset of the input tokens with plausible alternatives. The generator is trained using standard masked language modeling, so it learns to produce tokens that are contextually appropriate.
Token detection: A larger discriminator network receives the corrupted input and must predict, for every token in the sequence, whether it is the original token or a replacement from the generator.
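The two stages above can be made concrete with a toy sketch (illustrative Python only; the real generator is a Transformer trained with MLM, and the function and token names here are invented):

```python
# Toy sketch of the replaced-token-detection setup. A random stand-in plays
# the role of the generator; real ELECTRA samples from a Transformer's
# MLM output distribution.
import random

random.seed(0)

def corrupt(tokens, mask_frac=0.15, vocab=("cat", "dog", "ran", "sat", "the")):
    """Replace a random subset of tokens with samples from a stand-in
    'generator'; return the corrupted sequence plus per-position labels."""
    n_mask = max(1, int(len(tokens) * mask_frac))
    positions = random.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = random.choice(vocab)   # generator's sampled token
    # Discriminator label per position: 1 = replaced, 0 = original.
    # If the generator happens to sample the original token, the position
    # counts as original.
    labels = [0 if corrupted[i] == tokens[i] else 1 for i in range(len(tokens))]
    return corrupted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = corrupt(tokens)
# The discriminator receives `corrupted` and is trained to predict `labels`
# at every position -- not only at the masked ones.
```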
This formulation is similar in spirit to a Generative Adversarial Network (GAN), but with important distinctions. The generator produces discrete tokens via sampling rather than continuous representations, and the training uses maximum likelihood for the generator rather than an adversarial loss. The authors found that adversarial training performed poorly in this text setting: adversarially trained generators achieved only 58% accuracy at MLM compared to 65% for MLE-trained generators, and produced low-entropy output distributions with most probability mass concentrated on a single token.
The key advantage of RTD over MLM is that the discriminator receives a binary classification signal at every token position in the input sequence. In a 512-token sequence, MLM provides a training signal for roughly 76 tokens (15% of 512), while RTD provides a signal for all 512 tokens. This represents a 6.7x increase in training signal per example.
However, the per-position signal in RTD is less informative than in MLM. MLM requires predicting the identity of a token from a vocabulary of approximately 30,000 words (roughly 15 bits of information per position), whereas RTD is a binary classification task (1 bit per position). Despite this difference, the empirical results demonstrate that the increased coverage more than compensates for the reduced per-position complexity.
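These per-example figures follow from a quick back-of-the-envelope calculation (using the approximate 30,000-word vocabulary cited above):

```python
# Training-signal arithmetic for a 512-token sequence.
import math

seq_len, mask_frac, vocab_size = 512, 0.15, 30000

masked_positions = mask_frac * seq_len       # positions carrying an MLM signal
coverage_gain = seq_len / masked_positions   # RTD covers every position

bits_mlm = math.log2(vocab_size)             # information per MLM target
bits_rtd = 1.0                               # binary label per RTD position

print(int(masked_positions), round(coverage_gain, 1), round(bits_mlm, 1))
# → 76 6.7 14.9
```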
ELECTRA uses a two-network architecture during pre-training:
Generator: A small Transformer encoder trained with masked language modeling. Given an input sequence where 15% (or 25% for the Large model) of tokens are masked, the generator predicts the original tokens at those positions. The predicted tokens are then sampled from the generator's output distribution and used to replace the masked tokens in the input. If the generator happens to sample the correct original token at a given position, that position is labeled as "original" for the discriminator.
Discriminator: A larger Transformer encoder that receives the corrupted sequence (with generator-sampled tokens replacing the masked positions) and must predict whether each token is "original" or "replaced." After pre-training, only the discriminator is used for downstream tasks; the generator is discarded.
The two networks are trained jointly by minimizing a combined loss:
L = L_MLM(x, theta_G) + lambda * L_Disc(x, theta_D)
where L_MLM is the standard masked language modeling loss for the generator, L_Disc is the binary cross-entropy loss for the discriminator over all token positions, and lambda is a weighting hyperparameter set to 50. The large value of lambda is necessary because the MLM loss per position is approximately 15 times larger than the binary classification loss (due to the vocabulary size difference), so the weighting ensures that the discriminator receives sufficient gradient signal. Most of the gradient thus flows to the discriminator, which is the model used for downstream fine-tuning.
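The combined objective can be sketched on toy numbers (a minimal NumPy illustration with made-up logits; real training computes these losses over Transformer outputs):

```python
# Minimal sketch of the combined ELECTRA objective on toy values.
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab_size, lam = 8, 30000, 50.0
masked = np.array([1, 4])                    # positions the generator predicts

# Generator MLM loss: cross-entropy over the vocabulary at masked positions.
gen_logits = rng.normal(size=(len(masked), vocab_size))
targets = np.array([10, 20])                 # made-up original token ids
log_probs = gen_logits - np.log(np.exp(gen_logits).sum(axis=1, keepdims=True))
loss_mlm = -log_probs[np.arange(len(masked)), targets].mean()

# Discriminator RTD loss: binary cross-entropy at *every* position.
disc_logits = rng.normal(size=seq_len)
labels = np.zeros(seq_len)
labels[masked] = 1.0                         # replaced positions
probs = 1.0 / (1.0 + np.exp(-disc_logits))
loss_disc = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean()

loss = loss_mlm + lam * loss_disc            # lambda = 50, as in the paper
```

Note how the per-position MLM loss (around log 30,000) dwarfs the per-position binary loss, which is why the discriminator term needs the large weight.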
A critical finding of the paper is that the generator should be smaller than the discriminator; using a generator of comparable size actually hurts performance. The authors offer two explanations for this: a too-strong generator poses a too-difficult detection task, preventing the discriminator from learning effectively, and the discriminator may be forced to spend much of its capacity modeling the generator's output distribution rather than the actual data distribution.
The optimal generator size was found to be between 1/4 and 1/3 of the discriminator's hidden dimension. For ELECTRA-Base, the generator uses 1/3 of the discriminator's hidden size (256 vs. 768), while for ELECTRA-Large, it uses 1/4 (256 vs. 1,024).
The paper explores different weight-sharing strategies between the generator and discriminator:
| Sharing Strategy | GLUE Dev Score |
|---|---|
| No weight tying | 83.6 |
| Tying token embeddings only | 84.3 |
| Tying all weights | 84.4 |
Tying token embeddings captures nearly all of the benefit of full weight tying (84.3 vs. 84.4), and full tying additionally requires the generator and discriminator to be the same size. The authors explain that the generator's softmax over the entire vocabulary densely updates all token embeddings during MLM training, which benefits the discriminator. When the generator and discriminator differ in hidden size (the recommended configuration), only the token and positional embeddings are shared, and a projection layer bridges the different dimensionalities.
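A minimal sketch of shared embeddings with a projection layer (shapes and ids are illustrative; the real models apply these lookups inside Transformer encoders):

```python
# Shared token embeddings with a generator-side projection.
# Sizes are illustrative, loosely following ELECTRA-Base (768) with a
# generator hidden size of 256.
import numpy as np

rng = np.random.default_rng(1)
vocab, emb, gen_hidden = 30522, 768, 256

E = rng.normal(scale=0.02, size=(vocab, emb))          # shared embedding table
proj = rng.normal(scale=0.02, size=(emb, gen_hidden))  # generator projection

ids = np.array([101, 2054, 102])   # made-up token ids
disc_in = E[ids]                   # discriminator consumes embeddings directly
gen_in = E[ids] @ proj             # generator projects down to its hidden size
```

Both networks' losses update the shared table `E`, which is how the generator's dense vocabulary updates benefit the discriminator.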
For the Small and Base models, ELECTRA is trained on the same data as BERT: English Wikipedia and BooksCorpus, totaling approximately 3.3 billion tokens (16 GB of text). For the Large model, the training data is expanded to match the XLNet corpus, which adds ClueWeb, CommonCrawl, and Gigaword for a total of approximately 33 billion tokens.
ELECTRA does not use BERT's next sentence prediction (NSP) objective. Token masking is performed dynamically (decided on-the-fly during training) rather than during preprocessing.
ELECTRA was released in three sizes. The table below summarizes the discriminator architecture for each configuration:
| Configuration | Layers | Hidden Size | FFN Size | Attention Heads | Head Size | Embedding Size | Parameters |
|---|---|---|---|---|---|---|---|
| ELECTRA-Small | 12 | 256 | 1,024 | 4 | 64 | 128 | 14M |
| ELECTRA-Base | 12 | 768 | 3,072 | 12 | 64 | 768 | 110M |
| ELECTRA-Large | 24 | 1,024 | 4,096 | 16 | 64 | 1,024 | 335M |
The generator architectures for each size are as follows:
| Configuration | Generator Hidden Size | Generator Layers | Generator Heads | Generator Size Fraction |
|---|---|---|---|---|
| ELECTRA-Small | 64 | 12 | 1 | 1/4 |
| ELECTRA-Base | 256 | 12 | 4 | 1/3 |
| ELECTRA-Large | 256 | 24 | 4 | 1/4 |
Note that ELECTRA-Small uses a smaller embedding dimension (128) than its hidden dimension (256), with a linear projection layer bridging the two. This technique reduces the number of parameters in the embedding matrix and was also used in ALBERT.
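The parameter savings from this factorization are easy to check (assuming BERT's 30,522-token WordPiece vocabulary, which ELECTRA reuses):

```python
# Embedding factorization in ELECTRA-Small: a 128-dim embedding table plus
# a 128x256 projection, versus storing embeddings at the full hidden size.
vocab, hidden, emb = 30522, 256, 128

untied = vocab * hidden                   # embeddings at the hidden size
factorized = vocab * emb + emb * hidden   # smaller embeddings + projection

print(untied, factorized)  # → 7813632 3939584  (roughly a 2x reduction)
```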
The following table lists the key pre-training hyperparameters for each model size:
| Hyperparameter | Small | Base | Large |
|---|---|---|---|
| Mask Percentage | 15% | 15% | 25% |
| Learning Rate | 5e-4 | 2e-4 | 2e-4 |
| Adam epsilon | 1e-6 | 1e-6 | 1e-6 |
| Adam beta_1 | 0.9 | 0.9 | 0.9 |
| Adam beta_2 | 0.999 | 0.999 | 0.999 |
| Dropout | 0.1 | 0.1 | 0.1 |
| Attention Dropout | 0.1 | 0.1 | 0.1 |
| Weight Decay | 0.01 | 0.01 | 0.01 |
| Batch Size | 128 | 256 | 2,048 |
| Training Steps | 1M | 766K | 400K |
| Discriminator Loss Weight (lambda) | 50 | 50 | 50 |
All models use the Adam optimizer. The Large model uses a higher mask percentage (25%) than the Small and Base models (15%), which was found to improve performance for larger models. The Large model is also trained in two configurations: ELECTRA-400K (400K steps) for compute-matched comparisons and ELECTRA-1.75M (1.75M steps) for maximum performance.
The GLUE (General Language Understanding Evaluation) benchmark is a standard suite of nine natural language understanding tasks. ELECTRA achieves strong results across all model sizes.
The following table compares ELECTRA-Small to other models with similar or greater compute budgets:
| Model | Parameters | Train FLOPs | Training Hardware | GLUE Dev Avg |
|---|---|---|---|---|
| ELMo | 96M | 3.3e18 | 14 days, 3 GTX 1080 | 71.2 |
| BERT-Small | 14M | 1.4e18 | 4 days, 1 V100 | 75.1 |
| GPT | 117M | 4.0e19 | 25 days, 8 P6000 | 78.8 |
| ELECTRA-Small | 14M | 1.4e18 | 4 days, 1 V100 | 79.9 |
| ELECTRA-Small (50% steps) | 14M | 7.1e17 | 2 days, 1 V100 | 79.0 |
| ELECTRA-Small (25% steps) | 14M | 3.6e17 | 1 day, 1 V100 | 77.7 |
| BERT-Base | 110M | 6.4e19 | 4 days, 16 TPUv3 | 82.2 |
| ELECTRA-Base | 110M | 6.4e19 | 4 days, 16 TPUv3 | 85.1 |
ELECTRA-Small achieves a GLUE dev score of 79.9, outperforming GPT (78.8) despite having only 14M parameters compared to GPT's 117M and using roughly 1/30th the compute. It also surpasses BERT-Small by nearly 5 GLUE points under identical compute conditions. Even when trained for only half the steps, ELECTRA-Small (50%) reaches 79.0, still surpassing GPT. ELECTRA-Base (85.1) outperforms BERT-Large (84.0) while using only one-third the parameters.
The following table compares ELECTRA-Large to other large pre-trained models on the GLUE dev set with per-task breakdowns:
| Model | FLOPs | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Large | 1.9e20 | 60.6 | 93.2 | 88.0 | 90.0 | 91.3 | 86.6 | 92.3 | 70.4 | 84.0 |
| RoBERTa (100K steps) | 6.4e20 | 66.1 | 95.6 | 91.4 | 92.2 | 92.0 | 89.3 | 94.0 | 82.7 | 87.9 |
| RoBERTa (500K steps) | 3.2e21 | 68.0 | 96.4 | 90.9 | 92.1 | 92.2 | 90.2 | 94.7 | 86.6 | 88.9 |
| XLNet | 3.9e21 | 69.0 | 97.0 | 90.8 | 92.2 | 92.3 | 90.8 | 94.9 | 85.9 | 89.1 |
| ELECTRA-400K | 7.1e20 | 69.3 | 96.0 | 90.6 | 92.1 | 92.4 | 90.5 | 94.5 | 86.8 | 89.0 |
| ELECTRA-1.75M | 3.1e21 | 69.1 | 96.9 | 90.8 | 92.6 | 92.4 | 90.9 | 95.0 | 88.0 | 89.5 |
ELECTRA-400K matches RoBERTa-500K and XLNet performance while using less than 1/4 of their compute (7.1e20 vs. 3.2e21 and 3.9e21 FLOPs). When trained longer (ELECTRA-1.75M), the model achieves the highest average score of 89.5, outperforming both RoBERTa and XLNet with comparable or less compute.
On the official GLUE test set (scored via the evaluation server), ELECTRA-Large achieves a strong average:
| Model | FLOPs | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Large | 1.9e20 | 60.5 | 94.9 | 85.4 | 86.5 | 89.3 | 86.7 | 92.7 | 70.1 | 79.8 |
| RoBERTa | 3.2e21 | 67.8 | 96.7 | 89.8 | 91.9 | 90.2 | 90.8 | 95.4 | 88.2 | 88.1 |
| ALBERT | 3.1e22 | 69.1 | 97.1 | 91.2 | 92.0 | 90.5 | 91.3 | - | 89.2 | 89.0 |
| XLNet | 3.9e21 | 70.2 | 97.1 | 90.5 | 92.6 | 90.4 | 90.9 | - | 88.5 | 89.1 |
| ELECTRA | 3.1e21 | 71.7 | 97.1 | 90.7 | 92.5 | 90.8 | 91.3 | 95.8 | 89.8 | 89.5 |
ELECTRA achieves the highest CoLA score (71.7) and overall average (89.5) among all compared models on the GLUE test set, while using less compute than ALBERT (which requires roughly 10x more FLOPs).
The Stanford Question Answering Dataset (SQuAD) evaluates reading comprehension. ELECTRA was tested on both SQuAD 1.1 (all questions answerable) and SQuAD 2.0 (includes unanswerable questions).
| Model | FLOPs | Params | SQuAD 1.1 EM | SQuAD 1.1 F1 | SQuAD 2.0 Dev EM | SQuAD 2.0 Dev F1 | SQuAD 2.0 Test EM | SQuAD 2.0 Test F1 |
|---|---|---|---|---|---|---|---|---|
| BERT-Base | 6.4e19 | 110M | 80.8 | 88.5 | - | - | - | - |
| BERT-Large | 1.9e20 | 335M | 84.1 | 90.9 | 79.0 | 81.8 | 80.0 | 83.0 |
| SpanBERT | 7.1e20 | 335M | 88.8 | 94.6 | 85.7 | 88.7 | 85.7 | 88.7 |
| XLNet | 3.9e21 | 360M | 89.7 | 95.1 | 87.9 | 90.6 | 87.9 | 90.7 |
| RoBERTa (500K) | 3.2e21 | 356M | 88.9 | 94.6 | 86.5 | 89.4 | 86.8 | 89.8 |
| ALBERT | 3.1e22 | 235M | 89.3 | 94.8 | 87.4 | 90.2 | 88.1 | 90.9 |
| ELECTRA-Base | 6.4e19 | 110M | 84.5 | 90.8 | 80.5 | 83.3 | - | - |
| ELECTRA-400K | 7.1e20 | 335M | 88.7 | 94.2 | 86.9 | 89.6 | - | - |
| ELECTRA-1.75M | 3.1e21 | 335M | 89.7 | 94.9 | 88.0 | 90.6 | 88.7 | 91.4 |
ELECTRA-1.75M set a new state-of-the-art on the SQuAD 2.0 test set at the time of publication, achieving 88.7 EM and 91.4 F1. On SQuAD 1.1, ELECTRA-1.75M matches XLNet's exact match score (89.7) with comparable compute. Notably, ELECTRA-Base (110M parameters) achieves SQuAD 1.1 performance (84.5 EM, 90.8 F1) comparable to BERT-Large (84.1 EM, 90.9 F1) while using roughly 1/3 the compute and 1/3 the parameters.
The authors conducted extensive ablations to understand why ELECTRA outperforms BERT. They examined several variants to isolate the source of improvement:
| Variant | Description | GLUE Dev Score |
|---|---|---|
| BERT (baseline) | Standard masked language modeling | 82.2 |
| ELECTRA 15% | Discriminator loss on only 15% of tokens | 82.4 |
| Replace MLM | Replace [MASK] with generator tokens, then predict originals | 82.4 |
| All-Tokens MLM | Replace all tokens with generator outputs, predict originals everywhere | 84.3 |
| ELECTRA (full) | Discriminator loss on all tokens | 85.0 |
These results reveal three key insights:
Training on all tokens matters. Comparing BERT (82.2) to All-Tokens MLM (84.3) shows that simply extending the prediction task to all token positions provides a meaningful boost, accounting for most of the improvement.
The RTD objective itself provides additional benefit. ELECTRA (85.0) outperforms All-Tokens MLM (84.3), indicating that the binary detection task has advantages beyond just covering more positions. The detection task may be easier to learn than token prediction, allowing the model to extract more useful representations from each example.
Pre-train/fine-tune mismatch is not the main problem. Replace MLM (82.4) removes the [MASK] token mismatch between pre-training and fine-tuning but improves only marginally over BERT (82.2), suggesting that this mismatch is not, by itself, a major source of BERT's inefficiency.
ELECTRA's most striking result is its sample efficiency. The paper demonstrates consistent advantages across multiple compute budgets:
| Comparison | Compute Advantage |
|---|---|
| ELECTRA-Small vs. GPT | ~30x less compute, 8x fewer parameters, higher GLUE score (79.9 vs. 78.8) |
| ELECTRA-Small vs. BERT-Base | ~45x less compute, 8x fewer parameters |
| ELECTRA-Base vs. BERT-Large | ~3x less compute, 3x fewer parameters, higher GLUE score (85.1 vs. 84.0) |
| ELECTRA-400K vs. RoBERTa-500K | ~4.5x less compute, comparable GLUE score (89.0 vs. 88.9) |
| ELECTRA-400K vs. XLNet | ~5.5x less compute, comparable GLUE score (89.0 vs. 89.1) |
| ELECTRA-1.75M vs. RoBERTa-500K | Comparable compute, higher GLUE score (89.5 vs. 88.9) |
These results demonstrate that the choice of pre-training objective can matter as much as model size, training data, or training duration. ELECTRA shows that extracting more learning signal from each training example is a powerful approach to improving efficiency.
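The approximate compute ratios in the table can be recomputed directly from the FLOP counts reported earlier in this article (the table's figures are rounded):

```python
# FLOP counts as reported in the paper's comparison tables.
flops = {
    "ELECTRA-Small": 1.4e18, "GPT": 4.0e19, "BERT-Base": 6.4e19,
    "ELECTRA-400K": 7.1e20, "RoBERTa-500K": 3.2e21, "XLNet": 3.9e21,
}

print(round(flops["GPT"] / flops["ELECTRA-Small"]))             # → 29
print(round(flops["BERT-Base"] / flops["ELECTRA-Small"]))       # → 46
print(round(flops["RoBERTa-500K"] / flops["ELECTRA-400K"], 1))  # → 4.5
print(round(flops["XLNet"] / flops["ELECTRA-400K"], 1))         # → 5.5
```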
The following table summarizes how ELECTRA compares to other prominent pre-trained language models of its era:
| Model | Year | Pre-training Objective | Trains on All Tokens | Parameters (Large) | GLUE Dev Avg (Large) |
|---|---|---|---|---|---|
| ELMo | 2018 | Bidirectional LM | Yes | 96M | 71.2 |
| GPT | 2018 | Autoregressive LM | Yes | 117M | 78.8 |
| BERT | 2018 | MLM + NSP | No (15%) | 335M | 84.0 |
| XLNet | 2019 | Permutation LM | No (~15%) | 360M | 89.1 |
| RoBERTa | 2019 | MLM (optimized) | No (15%) | 356M | 88.9 |
| ALBERT | 2019 | MLM + SOP | No (15%) | 235M | - |
| ELECTRA | 2020 | Replaced Token Detection | Yes (100%) | 335M | 89.5 |
ELECTRA is unique among encoder-only models in that it provides a training signal for every token position. While autoregressive models like GPT and ELMo also train on all tokens, they do so in a unidirectional or shallow bidirectional manner. ELECTRA combines the sample efficiency of all-token training with the deep bidirectional contextualization of encoder-only architectures.
The most notable model to build on ELECTRA's replaced token detection is DeBERTa V3, published by He, Gao, and Chen at Microsoft Research in 2021. DeBERTaV3 replaces DeBERTa's original MLM pre-training objective with ELECTRA-style replaced token detection, combining RTD with DeBERTa's disentangled attention mechanism and enhanced mask decoder.
However, the DeBERTaV3 authors identified a problem with ELECTRA's weight sharing approach. When the generator and discriminator share token embeddings, the training losses pull the embeddings in opposite directions: the generator's MLM loss encourages semantically similar tokens to have similar embeddings, while the discriminator's RTD loss encourages them to be distinguishable. This conflict creates what the authors call a "tug-of-war" dynamic that reduces training efficiency.
To solve this, DeBERTaV3 introduces gradient-disentangled embedding sharing (GDES). This technique shares embedding parameters between the generator and discriminator but uses stop-gradient operations to prevent the discriminator's loss from flowing back through the shared embeddings, and vice versa. This allows both networks to benefit from shared embeddings without conflicting gradients.
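A conceptual sketch of GDES, with plain NumPy arrays and made-up gradients standing in for real training (the stop-gradient is represented here by the fact that each update step touches only one table):

```python
# Gradient-disentangled embedding sharing, conceptually: the discriminator
# sees shared embeddings through a stop-gradient plus its own residual table,
# so the RTD loss can never move the shared table.
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, lr = 100, 16, 0.1

E_shared = rng.normal(size=(vocab, dim))  # updated only by the generator's MLM loss
E_delta = np.zeros((vocab, dim))          # updated only by the discriminator's RTD loss

def discriminator_embeddings():
    # stop_gradient(E_shared) + E_delta
    return E_shared + E_delta

# One illustrative update step with made-up gradient values:
g_mlm = rng.normal(size=(vocab, dim))     # gradient from the generator's loss
g_rtd = rng.normal(size=(vocab, dim))     # gradient from the discriminator's loss

E_shared -= lr * g_mlm                    # generator step touches E_shared only
E_delta -= lr * g_rtd                     # discriminator step touches E_delta only
```

This removes the tug-of-war: the MLM loss shapes the shared semantics while the RTD loss adjusts only the residual table.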
DeBERTaV3 Large achieves a 91.37% average score on the GLUE benchmark, improving over ELECTRA by 1.91 percentage points and over the original DeBERTa by 1.37 points. The multilingual variant, mDeBERTa, achieves 79.8% zero-shot cross-lingual accuracy on XNLI, a 3.6-point improvement over XLM-R Base.
Beyond DeBERTaV3, the replaced token detection paradigm influenced several other models and research directions. The broader insight from ELECTRA, that creative pre-training objective design can yield efficiency gains as significant as architecture changes or increased scale, has become an influential theme in NLP research.
The official ELECTRA implementation is available in TensorFlow through the google-research/electra repository on GitHub. The codebase supports pre-training on custom corpora and fine-tuning on GLUE, SQuAD, and other downstream tasks. It requires Python 3, TensorFlow 1.15, NumPy, and scikit-learn.
Pre-trained ELECTRA models are also available through the Hugging Face Transformers library, which provides PyTorch and TensorFlow implementations. The Hugging Face model identifiers are:
| Model | Hugging Face Identifier | Parameters |
|---|---|---|
| ELECTRA-Small | google/electra-small-discriminator | 14M |
| ELECTRA-Base | google/electra-base-discriminator | 110M |
| ELECTRA-Large | google/electra-large-discriminator | 335M |
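A short usage sketch with the Transformers library, running the pre-trained discriminator as a replaced-token detector (this downloads model weights on first use; class names follow the Transformers API):

```python
# Run the ELECTRA-Small discriminator on a sentence containing a replacement.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# "fake" replaces "jumps" -- the discriminator should flag it as replaced.
sentence = "The quick brown fox fake over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # one logit per token position

# A positive logit means the token is predicted to be a replacement.
predictions = (logits > 0).long().squeeze().tolist()
```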
ELECTRA-Small can be pre-trained from scratch on a single NVIDIA V100 GPU (16 GB) in approximately four days for 1 million training steps. Reasonable results can be obtained after just 200K steps (approximately 10 hours of training), making ELECTRA accessible to researchers without access to large compute clusters.
Despite its efficiency advantages, ELECTRA has several limitations:
No generative capability. Because the discriminator is trained for binary classification rather than token prediction, ELECTRA cannot be directly used for text generation tasks. The generator is discarded after pre-training, and the discriminator does not have a language modeling head.
Training complexity. The generator-discriminator framework adds implementation complexity compared to single-model approaches like BERT or RoBERTa. The generator size, weight-sharing strategy, and loss weighting all require careful tuning.
Generator size sensitivity. The model's performance depends on the generator producing challenging but not impossible replacements. If the generator is too weak, the detection task becomes trivial; if too strong, the discriminator cannot learn effectively.
Embedding tug-of-war. As identified by the DeBERTaV3 authors, sharing embeddings between the generator and discriminator creates conflicting gradient dynamics, though this can be mitigated with gradient-disentangled sharing.
Encoder-only architecture. Like BERT, ELECTRA produces an encoder-only model. Models like T5, which use an encoder-decoder architecture, can handle both understanding and generation tasks within a single model.
ELECTRA demonstrated that the pre-training objective is a critical and underexplored dimension of language model design. By showing that a well-designed pre-training task can deliver performance gains equivalent to 4x or more compute, ELECTRA shifted attention toward training efficiency and objective design rather than pure scaling. The model's influence is visible in subsequent work on efficient pre-training, most notably DeBERTaV3, and replaced token detection remains a foundational technique in the NLP toolkit. ELECTRA-Small, in particular, showed that competitive NLP models could be trained on modest hardware, helping democratize access to high-quality pre-trained language representations.