ELECTRA, which stands for Efficiently Learning an Encoder that Classifies Token Replacements Accurately, is a pre-training method for natural language processing introduced by Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. The paper was published at the International Conference on Learning Representations (ICLR) in 2020 as a collaboration between Stanford University and Google Brain. ELECTRA proposes a novel pre-training objective called replaced token detection (RTD) that fundamentally rethinks how language models learn during pre-training. Instead of masking tokens and predicting them (as in BERT's masked language modeling), ELECTRA corrupts the input by replacing some tokens with plausible alternatives generated by a small generator network, then trains a discriminator network to identify which tokens have been replaced. This approach allows the model to learn from all input tokens rather than just the masked subset, resulting in significantly better sample efficiency and computational savings.
Before ELECTRA, the dominant paradigm for pre-training language representations was masked language modeling (MLM), popularized by BERT in 2018. In MLM, roughly 15% of input tokens are replaced with a special [MASK] token, and the model is trained to predict the original tokens at those positions. While effective, this approach has a fundamental inefficiency: the model only receives a training signal from the 15% of tokens that were masked, leaving the remaining 85% of positions unused for learning.
This inefficiency means that MLM-based models require enormous amounts of compute and data to achieve strong performance. Models like RoBERTa and XLNet demonstrated that simply training BERT longer with more data yields improvements, but at substantial computational cost. RoBERTa, for example, used 160 GB of text data and trained for 500K steps with large batch sizes on 1,024 V100 GPUs.
The authors of ELECTRA identified this core inefficiency and asked a simple question: what if a pre-training task could learn from every single token in the input, not just a small masked subset? This question led to the development of replaced token detection.
The replaced token detection (RTD) task is the central innovation of ELECTRA. Rather than masking tokens and asking the model to predict what they were, RTD works in two stages:
Token replacement: A small generator network replaces a subset of the input tokens with plausible alternatives. The generator is trained using standard masked language modeling, so it learns to produce tokens that are contextually appropriate.
Token detection: A larger discriminator network receives the corrupted input and must predict, for every token in the sequence, whether it is the original token or a replacement from the generator.
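The two stages above can be made concrete with a toy sketch (illustrative Python only; the real generator is a Transformer trained with MLM, and the function and token names here are invented):

```python
# Toy sketch of the replaced-token-detection setup. A random stand-in plays
# the role of the generator; real ELECTRA samples from a Transformer's
# MLM output distribution.
import random

random.seed(0)

def corrupt(tokens, mask_frac=0.15, vocab=("cat", "dog", "ran", "sat", "the")):
    """Replace a random subset of tokens with samples from a stand-in
    'generator'; return the corrupted sequence plus per-position labels."""
    n_mask = max(1, int(len(tokens) * mask_frac))
    positions = random.sample(range(len(tokens)), n_mask)
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = random.choice(vocab)   # generator's sampled token
    # Discriminator label per position: 1 = replaced, 0 = original.
    # If the generator happens to sample the original token, the position
    # counts as original.
    labels = [0 if corrupted[i] == tokens[i] else 1 for i in range(len(tokens))]
    return corrupted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = corrupt(tokens)
# The discriminator receives `corrupted` and is trained to predict `labels`
# at every position -- not only at the masked ones.
```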
This formulation is similar in spirit to a Generative Adversarial Network (GAN), but with important distinctions. The generator produces discrete tokens via sampling rather than continuous representations, and the training uses maximum likelihood for the generator rather than an adversarial loss. The authors found that adversarial training performed poorly in this text setting: adversarially trained generators achieved only 58% accuracy at MLM compared to 65% for MLE-trained generators, and produced low-entropy output distributions with most probability mass concentrated on a single token.
The key advantage of RTD over MLM is that the discriminator receives a binary classification signal at every token position in the input sequence. In a 512-token sequence, MLM provides a training signal for roughly 76 tokens (15% of 512), while RTD provides a signal for all 512 tokens. This represents a 6.7x increase in training signal per example.
However, the per-position signal in RTD is less informative than in MLM. MLM requires predicting the identity of a token from a vocabulary of approximately 30,000 words (roughly 15 bits of information per position), whereas RTD is a binary classification task (1 bit per position). Despite this difference, the empirical results demonstrate that the increased coverage more than compensates for the reduced per-position complexity.
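These per-example figures follow from a quick back-of-the-envelope calculation (using the approximate 30,000-word vocabulary cited above):

```python
# Training-signal arithmetic for a 512-token sequence.
import math

seq_len, mask_frac, vocab_size = 512, 0.15, 30000

masked_positions = mask_frac * seq_len       # positions carrying an MLM signal
coverage_gain = seq_len / masked_positions   # RTD covers every position

bits_mlm = math.log2(vocab_size)             # information per MLM target
bits_rtd = 1.0                               # binary label per RTD position

print(int(masked_positions), round(coverage_gain, 1), round(bits_mlm, 1))
# → 76 6.7 14.9
```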
ELECTRA uses a two-network architecture during pre-training:
Generator: A small Transformer encoder trained with masked language modeling. Given an input sequence where 15% (or 25% for the Large model) of tokens are masked, the generator predicts the original tokens at those positions. The predicted tokens are then sampled from the generator's output distribution and used to replace the masked tokens in the input. If the generator happens to sample the correct original token at a given position, that position is labeled as "original" for the discriminator.
Discriminator: A larger Transformer encoder that receives the corrupted sequence (with generator-sampled tokens replacing the masked positions) and must predict whether each token is "original" or "replaced." After pre-training, only the discriminator is used for downstream tasks; the generator is discarded.
The two networks are trained jointly by minimizing a combined loss:
L = L_MLM(x, theta_G) + lambda * L_Disc(x, theta_D)
where L_MLM is the standard masked language modeling loss for the generator, L_Disc is the binary cross-entropy loss for the discriminator over all token positions, and lambda is a weighting hyperparameter set to 50. The large value of lambda is necessary because the MLM loss per position is approximately 15 times larger than the binary classification loss (due to the vocabulary size difference), so the weighting ensures that the discriminator receives sufficient gradient signal. Most of the gradient thus flows to the discriminator, which is the model used for downstream fine-tuning.
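The combined objective can be sketched on toy numbers (a minimal NumPy illustration with made-up logits; real training computes these losses over Transformer outputs):

```python
# Minimal sketch of the combined ELECTRA objective on toy values.
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab_size, lam = 8, 30000, 50.0
masked = np.array([1, 4])                    # positions the generator predicts

# Generator MLM loss: cross-entropy over the vocabulary at masked positions.
gen_logits = rng.normal(size=(len(masked), vocab_size))
targets = np.array([10, 20])                 # made-up original token ids
log_probs = gen_logits - np.log(np.exp(gen_logits).sum(axis=1, keepdims=True))
loss_mlm = -log_probs[np.arange(len(masked)), targets].mean()

# Discriminator RTD loss: binary cross-entropy at *every* position.
disc_logits = rng.normal(size=seq_len)
labels = np.zeros(seq_len)
labels[masked] = 1.0                         # replaced positions
probs = 1.0 / (1.0 + np.exp(-disc_logits))
loss_disc = -(labels * np.log(probs) + (1 - labels) * np.log(1 - probs)).mean()

loss = loss_mlm + lam * loss_disc            # lambda = 50, as in the paper
```

Note how the per-position MLM loss (around log 30,000) dwarfs the per-position binary loss, which is why the discriminator term needs the large weight.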
A critical finding of the paper is that the generator should be smaller than the discriminator; using a generator of comparable size actually hurts performance. The authors offer two explanations for this: a too-strong generator poses a too-difficult detection task, preventing the discriminator from learning effectively, and the discriminator may be forced to spend much of its capacity modeling the generator's output distribution rather than the actual data distribution.
The optimal generator size was found to be between 1/4 and 1/3 of the discriminator's hidden dimension. For ELECTRA-Base, the generator uses 1/3 of the discriminator's hidden size (256 vs. 768), while for ELECTRA-Large, it uses 1/4 (256 vs. 1,024).
The paper explores different weight-sharing strategies between the generator and discriminator:
| Sharing Strategy | GLUE Dev Score |
|---|---|
| No weight tying | 83.6 |
| Tying token embeddings only | 84.3 |
| Tying all weights | 84.4 |
Tying token embeddings captures nearly all of the benefit of full weight tying (84.3 vs. 84.4), and full tying additionally requires the generator and discriminator to be the same size. The authors explain that the generator's softmax over the entire vocabulary densely updates all token embeddings during MLM training, which benefits the discriminator. When the generator and discriminator differ in hidden size (the recommended configuration), only the token and positional embeddings are shared, and a projection layer bridges the different dimensionalities.
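A minimal sketch of shared embeddings with a projection layer (shapes and ids are illustrative; the real models apply these lookups inside Transformer encoders):

```python
# Shared token embeddings with a generator-side projection.
# Sizes are illustrative, loosely following ELECTRA-Base (768) with a
# generator hidden size of 256.
import numpy as np

rng = np.random.default_rng(1)
vocab, emb, gen_hidden = 30522, 768, 256

E = rng.normal(scale=0.02, size=(vocab, emb))          # shared embedding table
proj = rng.normal(scale=0.02, size=(emb, gen_hidden))  # generator projection

ids = np.array([101, 2054, 102])   # made-up token ids
disc_in = E[ids]                   # discriminator consumes embeddings directly
gen_in = E[ids] @ proj             # generator projects down to its hidden size
```

Both networks' losses update the shared table `E`, which is how the generator's dense vocabulary updates benefit the discriminator.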
For the Small and Base models, ELECTRA is trained on the same data as BERT: English Wikipedia and BooksCorpus, totaling approximately 3.3 billion tokens (16 GB of text). For the Large model, the training data is expanded to match the XLNet corpus, which adds ClueWeb, CommonCrawl, and Gigaword for a total of approximately 33 billion tokens.
ELECTRA does not use BERT's next sentence prediction (NSP) objective. Token masking is performed dynamically (decided on-the-fly during training) rather than during preprocessing.
ELECTRA was released in three sizes. The table below summarizes the discriminator architecture for each configuration:
| Configuration | Layers | Hidden Size | FFN Size | Attention Heads | Head Size | Embedding Size | Parameters |
|---|---|---|---|---|---|---|---|
| ELECTRA-Small | 12 | 256 | 1,024 | 4 | 64 | 128 | 14M |
| ELECTRA-Base | 12 | 768 | 3,072 | 12 | 64 | 768 | 110M |
| ELECTRA-Large | 24 | 1,024 | 4,096 | 16 | 64 | 1,024 | 335M |
The generator architectures for each size are as follows:
| Configuration | Generator Hidden Size | Generator Layers | Generator Heads | Generator Size Fraction |
|---|---|---|---|---|
| ELECTRA-Small | 64 | 12 | 1 | 1/4 |
| ELECTRA-Base | 256 | 12 | 4 | 1/3 |
| ELECTRA-Large | 256 | 24 | 4 | 1/4 |
Note that ELECTRA-Small uses a smaller embedding dimension (128) than its hidden dimension (256), with a linear projection layer bridging the two. This technique reduces the number of parameters in the embedding matrix and was also used in ALBERT.
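The parameter savings from this factorization are easy to check (assuming BERT's 30,522-token WordPiece vocabulary, which ELECTRA reuses):

```python
# Embedding factorization in ELECTRA-Small: a 128-dim embedding table plus
# a 128x256 projection, versus storing embeddings at the full hidden size.
vocab, hidden, emb = 30522, 256, 128

untied = vocab * hidden                   # embeddings at the hidden size
factorized = vocab * emb + emb * hidden   # smaller embeddings + projection

print(untied, factorized)  # → 7813632 3939584  (roughly a 2x reduction)
```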
The following table lists the key pre-training hyperparameters for each model size:
| Hyperparameter | Small | Base | Large |
|---|---|---|---|
| Mask Percentage | 15% | 15% | 25% |
| Learning Rate | 5e-4 | 2e-4 | 2e-4 |
| Adam epsilon | 1e-6 | 1e-6 | 1e-6 |
| Adam beta_1 | 0.9 | 0.9 | 0.9 |
| Adam beta_2 | 0.999 | 0.999 | 0.999 |
| Dropout | 0.1 | 0.1 | 0.1 |
| Attention Dropout | 0.1 | 0.1 | 0.1 |
| Weight Decay | 0.01 | 0.01 | 0.01 |
| Batch Size | 128 | 256 | 2,048 |
| Training Steps | 1M | 766K | 400K |
| Discriminator Loss Weight (lambda) | 50 | 50 | 50 |
All models use the Adam optimizer. The Large model uses a higher mask percentage (25%) than the Small and Base models (15%), which was found to improve performance for larger models. The Large model is also trained in two configurations: ELECTRA-400K (400K steps) for compute-matched comparisons and ELECTRA-1.75M (1.75M steps) for maximum performance.
The GLUE (General Language Understanding Evaluation) benchmark is a standard suite of nine natural language understanding tasks. ELECTRA achieves strong results across all model sizes.
The following table compares ELECTRA-Small to other models with similar or greater compute budgets:
| Model | Parameters | Train FLOPs | Training Hardware | GLUE Dev Avg |
|---|---|---|---|---|
| ELMo | 96M | 3.3e18 | 14 days, 3 GTX 1080 | 71.2 |
| BERT-Small | 14M | 1.4e18 | 4 days, 1 V100 | 75.1 |
| GPT | 117M | 4.0e19 | 25 days, 8 P6000 | 78.8 |
| ELECTRA-Small | 14M | 1.4e18 | 4 days, 1 V100 | 79.9 |
| ELECTRA-Small (50% steps) | 14M | 7.1e17 | 2 days, 1 V100 | 79.0 |
| ELECTRA-Small (25% steps) | 14M | 3.6e17 | 1 day, 1 V100 | 77.7 |
| BERT-Base | 110M | 6.4e19 | 4 days, 16 TPUv3 | 82.2 |
| ELECTRA-Base | 110M | 6.4e19 | 4 days, 16 TPUv3 | 85.1 |
ELECTRA-Small achieves a GLUE dev score of 79.9, outperforming GPT (78.8) despite having only 14M parameters compared to GPT's 117M and using roughly 1/30th the compute. It also surpasses BERT-Small by nearly 5 GLUE points under identical compute conditions. Even when trained for only half the steps, ELECTRA-Small (50%) reaches 79.0, still surpassing GPT. ELECTRA-Base (85.1) outperforms BERT-Large (84.0) while using only one-third the parameters.
The following table compares ELECTRA-Large to other large pre-trained models on the GLUE dev set with per-task breakdowns:
| Model | FLOPs | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Large | 1.9e20 | 60.6 | 93.2 | 88.0 | 90.0 | 91.3 | 86.6 | 92.3 | 70.4 | 84.0 |
| RoBERTa (100K steps) | 6.4e20 | 66.1 | 95.6 | 91.4 | 92.2 | 92.0 | 89.3 | 94.0 | 82.7 | 87.9 |
| RoBERTa (500K steps) | 3.2e21 | 68.0 | 96.4 | 90.9 | 92.1 | 92.2 | 90.2 | 94.7 | 86.6 | 88.9 |
| XLNet | 3.9e21 | 69.0 | 97.0 | 90.8 | 92.2 | 92.3 | 90.8 | 94.9 | 85.9 | 89.1 |
| ELECTRA-400K | 7.1e20 | 69.3 | 96.0 | 90.6 | 92.1 | 92.4 | 90.5 | 94.5 | 86.8 | 89.0 |
| ELECTRA-1.75M | 3.1e21 | 69.1 | 96.9 | 90.8 | 92.6 | 92.4 | 90.9 | 95.0 | 88.0 | 89.5 |
ELECTRA-400K matches RoBERTa-500K and XLNet performance while using less than 1/4 of their compute (7.1e20 vs. 3.2e21 and 3.9e21 FLOPs). When trained longer (ELECTRA-1.75M), the model achieves the highest average score of 89.5, outperforming both RoBERTa and XLNet with comparable or less compute.
On the official GLUE test set (scored via the evaluation server), ELECTRA-Large achieves a strong average:
| Model | FLOPs | CoLA | SST-2 | MRPC | STS-B | QQP | MNLI | QNLI | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT-Large | 1.9e20 | 60.5 | 94.9 | 85.4 | 86.5 | 89.3 | 86.7 | 92.7 | 70.1 | 79.8 |
| RoBERTa | 3.2e21 | 67.8 | 96.7 | 89.8 | 91.9 | 90.2 | 90.8 | 95.4 | 88.2 | 88.1 |
| ALBERT | 3.1e22 | 69.1 | 97.1 | 91.2 | 92.0 | 90.5 | 91.3 | - | 89.2 | 89.0 |
| XLNet | 3.9e21 | 70.2 | 97.1 | 90.5 | 92.6 | 90.4 | 90.9 | - | 88.5 | 89.1 |
| ELECTRA | 3.1e21 | 71.7 | 97.1 | 90.7 | 92.5 | 90.8 | 91.3 | 95.8 | 89.8 | 89.5 |
ELECTRA achieves the highest CoLA score (71.7) and overall average (89.5) among all compared models on the GLUE test set, while using less compute than ALBERT (which requires roughly 10x more FLOPs).
The Stanford Question Answering Dataset (SQuAD) evaluates reading comprehension. ELECTRA was tested on both SQuAD 1.1 (all questions answerable) and SQuAD 2.0 (includes unanswerable questions).
| Model | FLOPs | Params | SQuAD 1.1 EM | SQuAD 1.1 F1 | SQuAD 2.0 Dev EM | SQuAD 2.0 Dev F1 | SQuAD 2.0 Test EM | SQuAD 2.0 Test F1 |
|---|---|---|---|---|---|---|---|---|
| BERT-Base | 6.4e19 | 110M | 80.8 | 88.5 | - | - | - | - |
| BERT-Large | 1.9e20 | 335M | 84.1 | 90.9 | 79.0 | 81.8 | 80.0 | 83.0 |
| SpanBERT | 7.1e20 | 335M | 88.8 | 94.6 | 85.7 | 88.7 | 85.7 | 88.7 |
| XLNet | 3.9e21 | 360M | 89.7 | 95.1 | 87.9 | 90.6 | 87.9 | 90.7 |
| RoBERTa (500K) | 3.2e21 | 356M | 88.9 | 94.6 | 86.5 | 89.4 | 86.8 | 89.8 |
| ALBERT | 3.1e22 | 235M | 89.3 | 94.8 | 87.4 | 90.2 | 88.1 | 90.9 |
| ELECTRA-Base | 6.4e19 | 110M | 84.5 | 90.8 | 80.5 | 83.3 | - | - |
| ELECTRA-400K | 7.1e20 | 335M | 88.7 | 94.2 | 86.9 | 89.6 | - | - |
| ELECTRA-1.75M | 3.1e21 | 335M | 89.7 | 94.9 | 88.0 | 90.6 | 88.7 | 91.4 |
ELECTRA-1.75M set a new state-of-the-art on the SQuAD 2.0 test set at the time of publication, achieving 88.7 EM and 91.4 F1. On SQuAD 1.1, ELECTRA-1.75M matches XLNet's exact match score (89.7) with comparable compute. Notably, ELECTRA-Base (110M parameters) achieves SQuAD 1.1 performance (84.5 EM, 90.8 F1) comparable to BERT-Large (84.1 EM, 90.9 F1) while using roughly 1/3 the compute and 1/3 the parameters.
The authors conducted extensive ablations to understand why ELECTRA outperforms BERT. They examined several variants to isolate the source of improvement:
| Variant | Description | GLUE Dev Score |
|---|---|---|
| BERT (baseline) | Standard masked language modeling | 82.2 |
| ELECTRA 15% | Discriminator loss on only 15% of tokens | 82.4 |
| Replace MLM | Replace [MASK] with generator tokens, then predict originals | 82.4 |
| All-Tokens MLM | Replace all tokens with generator outputs, predict originals everywhere | 84.3 |
| ELECTRA (full) | Discriminator loss on all tokens | 85.0 |
These results reveal three key insights:
Training on all tokens matters. Comparing BERT (82.2) to All-Tokens MLM (84.3) shows that simply extending the prediction task to all token positions provides a meaningful boost, accounting for most of the improvement.
The RTD objective itself provides additional benefit. ELECTRA (85.0) outperforms All-Tokens MLM (84.3), indicating that the binary detection task has advantages beyond just covering more positions. The detection task may be easier to learn than token prediction, allowing the model to extract more useful representations from each example.
Pre-train/fine-tune mismatch is not the main problem. Replace MLM (82.4) removes the [MASK] token mismatch between pre-training and fine-tuning but improves only marginally over BERT (82.2), suggesting that this mismatch is not, by itself, a major source of BERT's inefficiency.
ELECTRA's most striking result is its sample efficiency. The paper demonstrates consistent advantages across multiple compute budgets:
| Comparison | Compute Advantage |
|---|---|
| ELECTRA-Small vs. GPT | ~30x less compute, 8x fewer parameters, higher GLUE score (79.9 vs. 78.8) |
| ELECTRA-Small vs. BERT-Base | ~45x less compute, 8x fewer parameters |
| ELECTRA-Base vs. BERT-Large | ~3x less compute, 3x fewer parameters, higher GLUE score (85.1 vs. 84.0) |
| ELECTRA-400K vs. RoBERTa-500K | ~4.5x less compute, comparable GLUE score (89.0 vs. 88.9) |
| ELECTRA-400K vs. XLNet | ~5.5x less compute, comparable GLUE score (89.0 vs. 89.1) |
| ELECTRA-1.75M vs. RoBERTa-500K | Comparable compute, higher GLUE score (89.5 vs. 88.9) |
These results demonstrate that the choice of pre-training objective can matter as much as model size, training data, or training duration. ELECTRA shows that extracting more learning signal from each training example is a powerful approach to improving efficiency.
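The approximate compute ratios in the table can be recomputed directly from the FLOP counts reported earlier in this article (the table's figures are rounded):

```python
# FLOP counts as reported in the paper's comparison tables.
flops = {
    "ELECTRA-Small": 1.4e18, "GPT": 4.0e19, "BERT-Base": 6.4e19,
    "ELECTRA-400K": 7.1e20, "RoBERTa-500K": 3.2e21, "XLNet": 3.9e21,
}

print(round(flops["GPT"] / flops["ELECTRA-Small"]))             # → 29
print(round(flops["BERT-Base"] / flops["ELECTRA-Small"]))       # → 46
print(round(flops["RoBERTa-500K"] / flops["ELECTRA-400K"], 1))  # → 4.5
print(round(flops["XLNet"] / flops["ELECTRA-400K"], 1))         # → 5.5
```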
The following table summarizes how ELECTRA compares to other prominent pre-trained language models of its era:
| Model | Year | Pre-training Objective | Trains on All Tokens | Parameters (Large) | GLUE Dev Avg (Large) |
|---|---|---|---|---|---|
| ELMo | 2018 | Bidirectional LM | Yes | 96M | 71.2 |
| GPT | 2018 | Autoregressive LM | Yes | 117M | 78.8 |
| BERT | 2018 | MLM + NSP | No (15%) | 335M | 84.0 |
| XLNet | 2019 | Permutation LM | No (~15%) | 360M | 89.1 |
| RoBERTa | 2019 | MLM (optimized) | No (15%) | 356M | 88.9 |
| ALBERT | 2019 | MLM + SOP | No (15%) | 235M | - |
| ELECTRA | 2020 | Replaced Token Detection | Yes (100%) | 335M | 89.5 |
ELECTRA is unique among encoder-only models in that it provides a training signal for every token position. While autoregressive models like GPT and ELMo also train on all tokens, they do so in a unidirectional or shallow bidirectional manner. ELECTRA combines the sample efficiency of all-token training with the deep bidirectional contextualization of encoder-only architectures.
The most notable model to build on ELECTRA's replaced token detection is DeBERTa V3, published by He, Gao, and Chen at Microsoft Research in 2021. DeBERTaV3 replaces DeBERTa's original MLM pre-training objective with ELECTRA-style replaced token detection, combining RTD with DeBERTa's disentangled attention mechanism and enhanced mask decoder.
However, the DeBERTaV3 authors identified a problem with ELECTRA's weight sharing approach. When the generator and discriminator share token embeddings, the training losses pull the embeddings in opposite directions: the generator's MLM loss encourages semantically similar tokens to have similar embeddings, while the discriminator's RTD loss encourages them to be distinguishable. This conflict creates what the authors call a "tug-of-war" dynamic that reduces training efficiency.
To solve this, DeBERTaV3 introduces gradient-disentangled embedding sharing (GDES). This technique shares embedding parameters between the generator and discriminator but uses stop-gradient operations to prevent the discriminator's loss from flowing back through the shared embeddings, and vice versa. This allows both networks to benefit from shared embeddings without conflicting gradients.
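A conceptual sketch of GDES, with plain NumPy arrays and made-up gradients standing in for real training (the stop-gradient is represented here by the fact that each update step touches only one table):

```python
# Gradient-disentangled embedding sharing, conceptually: the discriminator
# sees shared embeddings through a stop-gradient plus its own residual table,
# so the RTD loss can never move the shared table.
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, lr = 100, 16, 0.1

E_shared = rng.normal(size=(vocab, dim))  # updated only by the generator's MLM loss
E_delta = np.zeros((vocab, dim))          # updated only by the discriminator's RTD loss

def discriminator_embeddings():
    # stop_gradient(E_shared) + E_delta
    return E_shared + E_delta

# One illustrative update step with made-up gradient values:
g_mlm = rng.normal(size=(vocab, dim))     # gradient from the generator's loss
g_rtd = rng.normal(size=(vocab, dim))     # gradient from the discriminator's loss

E_shared -= lr * g_mlm                    # generator step touches E_shared only
E_delta -= lr * g_rtd                     # discriminator step touches E_delta only
```

This removes the tug-of-war: the MLM loss shapes the shared semantics while the RTD loss adjusts only the residual table.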
DeBERTaV3 Large achieves a 91.37% average score on the GLUE benchmark, improving over ELECTRA by 1.91 percentage points and over the original DeBERTa by 1.37 points. The multilingual variant, mDeBERTa, achieves 79.8% zero-shot cross-lingual accuracy on XNLI, a 3.6-point improvement over XLM-R Base.
Beyond DeBERTaV3, the replaced token detection paradigm influenced several other models and research directions. The broader insight from ELECTRA, that creative pre-training objective design can yield efficiency gains as significant as architecture changes or increased scale, has become an influential theme in NLP research.
The official ELECTRA implementation is available in TensorFlow through the google-research/electra repository on GitHub. The codebase supports pre-training on custom corpora and fine-tuning on GLUE, SQuAD, and other downstream tasks. It requires Python 3, TensorFlow 1.15, NumPy, and scikit-learn.
Pre-trained ELECTRA models are also available through the Hugging Face Transformers library, which provides PyTorch and TensorFlow implementations. The Hugging Face model identifiers are:
| Model | Hugging Face Identifier | Parameters |
|---|---|---|
| ELECTRA-Small | google/electra-small-discriminator | 14M |
| ELECTRA-Base | google/electra-base-discriminator | 110M |
| ELECTRA-Large | google/electra-large-discriminator | 335M |
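A short usage sketch with the Transformers library, running the pre-trained discriminator as a replaced-token detector (this downloads model weights on first use; class names follow the Transformers API):

```python
# Run the ELECTRA-Small discriminator on a sentence containing a replacement.
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# "fake" replaces "jumps" -- the discriminator should flag it as replaced.
sentence = "The quick brown fox fake over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # one logit per token position

# A positive logit means the token is predicted to be a replacement.
predictions = (logits > 0).long().squeeze().tolist()
```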
ELECTRA-Small can be pre-trained from scratch on a single NVIDIA V100 GPU (16 GB) in approximately four days for 1 million training steps. Reasonable results can be obtained after just 200K steps (approximately 10 hours of training), making ELECTRA accessible to researchers without access to large compute clusters.
Despite its efficiency advantages, ELECTRA has several limitations:
No generative capability. Because the discriminator is trained for binary classification rather than token prediction, ELECTRA cannot be directly used for text generation tasks. The generator is discarded after pre-training, and the discriminator does not have a language modeling head.
Training complexity. The generator-discriminator framework adds implementation complexity compared to single-model approaches like BERT or RoBERTa. The generator size, weight-sharing strategy, and loss weighting all require careful tuning.
Generator size sensitivity. The model's performance depends on the generator producing challenging but not impossible replacements. If the generator is too weak, the detection task becomes trivial; if too strong, the discriminator cannot learn effectively.
Embedding tug-of-war. As identified by the DeBERTaV3 authors, sharing embeddings between the generator and discriminator creates conflicting gradient dynamics, though this can be mitigated with gradient-disentangled sharing.
Encoder-only architecture. Like BERT, ELECTRA produces an encoder-only model. Models like T5, which use an encoder-decoder architecture, can handle both understanding and generation tasks within a single model.
ELECTRA demonstrated that the pre-training objective is a critical and underexplored dimension of language model design. By showing that a well-designed pre-training task can deliver performance gains equivalent to 4x or more compute, ELECTRA shifted attention toward training efficiency and objective design rather than pure scaling. The model's influence is visible in subsequent work on efficient pre-training, most notably DeBERTaV3, and replaced token detection remains a foundational technique in the NLP toolkit. ELECTRA-Small, in particular, showed that competitive NLP models could be trained on modest hardware, helping democratize access to high-quality pre-trained language representations.