ALBERT (A Lite BERT) is a parameter-efficient variant of the BERT language model developed by researchers at Google Research and the Toyota Technological Institute at Chicago (TTIC). Published as a conference paper at ICLR 2020, ALBERT introduced two parameter-reduction techniques, factorized embedding parameterization and cross-layer parameter sharing, that together reduce the parameter count of a BERT-like model by up to 89% without proportional loss in performance. The paper also replaced BERT's Next Sentence Prediction (NSP) pre-training objective with a more challenging Sentence Order Prediction (SOP) task that forces the model to learn finer-grained inter-sentence coherence.
The ALBERT paper was authored by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Among them, Lan, Goodman, Sharma, and Soricut were affiliated with Google Research, while Chen and Gimpel were at TTIC. The paper was first posted to arXiv on September 26, 2019 (arXiv:1909.11942), and its final published version appeared at the International Conference on Learning Representations (ICLR) in April 2020.
An ALBERT configuration with the same hidden size and layer count as BERT-large has roughly 18 times fewer parameters (18M vs. 334M) and can be trained approximately 1.7 times faster. ALBERT-xxlarge, the largest configuration with 235M parameters, established new state-of-the-art results at the time of publication on the GLUE benchmark (89.4), SQuAD 1.1 and 2.0, and the RACE reading comprehension benchmark, all while using fewer parameters than BERT-large.
Pre-trained language models grew rapidly in size during 2018 and 2019. BERT-large had 334M parameters, GPT-2 reached 1.5 billion, and models continued to scale upward. While larger models generally improved downstream task accuracy, this growth brought practical problems: GPU and TPU memory limits restricted the maximum model size that could be trained on available hardware, longer training times slowed research iteration, and, somewhat unexpectedly, simply making models bigger sometimes led to performance degradation rather than improvement.
The ALBERT authors observed that existing approaches tied the size of the word embedding layer directly to the hidden layer size, which was wasteful because word embeddings are context-independent while hidden states are context-dependent. They also noted that every Transformer layer in BERT learned its own set of parameters, even though prior work had shown that Transformer layers often learn similar patterns across depths. These two observations motivated the core parameter-reduction techniques in ALBERT.
A secondary motivation was the weakness of BERT's Next Sentence Prediction (NSP) objective. Research by the RoBERTa team and others had already shown that NSP provided little or no benefit for downstream tasks. The ALBERT authors hypothesized that NSP was too easy because it conflated two separate signals: topic prediction (whether two sentences are from the same document) and coherence prediction (whether the sentences are in a logical order). Topic prediction overlaps heavily with the masked language modeling (MLM) signal, making it redundant.
ALBERT shares the same encoder-only Transformer architecture as BERT. It uses multi-head self-attention layers followed by position-wise feed-forward networks, with layer normalization and residual connections at each sub-layer. The key differences lie in how the embedding layer is structured and how parameters are shared across layers.
In BERT, the word embedding dimension E is always equal to the hidden layer dimension H. This means that increasing the hidden size (to improve the model's representational capacity) also increases the embedding matrix proportionally. Since the vocabulary size V is typically 30,000, the embedding matrix alone accounts for V x H parameters, which can be substantial.
The ALBERT authors argued that this coupling is unnecessary. Word embeddings encode context-independent representations of individual tokens, while hidden states encode context-dependent representations shaped by self-attention across the entire input sequence. The former requires less capacity than the latter. Therefore, ALBERT decouples the two sizes by setting E much smaller than H. The vocabulary is first projected into a low-dimensional embedding space of size E, and then a linear projection maps each E-dimensional embedding up to the H-dimensional hidden space.
This factorization reduces the embedding parameter count from O(V x H) to O(V x E + E x H). When H is much larger than E, the savings are substantial. For example, with V = 30,000, H = 4,096, and E = 128, the embedding parameters drop from approximately 123 million (30,000 x 4,096) to approximately 4.4 million (30,000 x 128 + 128 x 4,096). Google Research described this as "an 80% reduction in the parameters of the projection block."
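The arithmetic behind these savings is easy to verify directly, using the numbers from the example above:

```python
# Embedding parameter count before and after factorization, using the
# V = 30,000, H = 4,096, E = 128 example from the text.
V, H, E = 30_000, 4_096, 128

tied = V * H                   # BERT-style: embedding size tied to hidden size
factorized = V * E + E * H     # ALBERT: small embedding plus up-projection

print(f"tied:       {tied:>12,}")        # 122,880,000 (~123M)
print(f"factorized: {factorized:>12,}")  # 4,364,288 (~4.4M)
```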
The ALBERT paper experimented with embedding sizes of 64, 128, 256, and 768. Under the all-shared condition (where cross-layer parameter sharing is active), an embedding size of 128 yielded the best overall performance across downstream tasks. The full results are shown below.
| Embedding Size (E) | Parameters (Not Shared) | Avg (Not Shared) | Parameters (All Shared) | Avg (All Shared) |
|---|---|---|---|---|
| 64 | 87M | 81.3 | 10M | 79.0 |
| 128 | 89M | 81.7 | 12M | 80.1 |
| 256 | 93M | 81.8 | 16M | 79.6 |
| 768 | 108M | 82.3 | 31M | 79.8 |
Without parameter sharing, larger embedding sizes consistently produced better results, with E = 768 achieving the highest average. With all-shared parameters, however, E = 128 provided the best average score. This occurs because the shared parameters compress the model's capacity, and a smaller embedding space is sufficient in this regime. All subsequent ALBERT experiments use E = 128.
The second technique is cross-layer parameter sharing, in which all Transformer layers use the same set of parameters. Instead of each of the L layers learning its own attention weights and feed-forward weights independently, ALBERT defines a single set of Transformer parameters and reuses them at every layer. This means that a 12-layer ALBERT model stores parameters for only one Transformer block, while at runtime the input passes through that same block 12 times.
This approach prevents the total parameter count from growing with the depth of the network. A 12-layer and a 24-layer ALBERT model with the same hidden size have identical parameter counts; only the computation time increases with depth. Google Research reported that this achieves "a 90% parameter reduction for the attention-feedforward block (a 70% reduction overall)."
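The mechanics can be sketched with a toy example (the classes below are illustrative stand-ins, not the ALBERT implementation): a list that contains the same block object L times stores only one set of weights, mirroring how ALBERT reuses one Transformer block at every layer.

```python
class TransformerBlock:
    """Toy stand-in for one attention + feed-forward block."""
    def __init__(self, hidden_size):
        # Rough per-block weight count: 4H^2 (attention) + 8H^2 (FFN)
        self.num_params = 12 * hidden_size ** 2

def stored_params(layers):
    """Count parameters of *distinct* blocks only."""
    return sum(block.num_params for block in set(layers))

H, L = 1024, 24
bert_style = [TransformerBlock(H) for _ in range(L)]  # L independent blocks
albert_style = [TransformerBlock(H)] * L              # one block, reused L times

# Stored parameters grow with depth only in the unshared case.
print(stored_params(bert_style) // stored_params(albert_style))  # 24
```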
The paper evaluated four sharing strategies to understand which components benefit most from sharing:
| Sharing Strategy | Parameters (E=128) | Parameters (E=768) | Avg Score (E=128) | Avg Score (E=768) |
|---|---|---|---|---|
| Not shared (BERT-style) | 89M | 108M | 81.6 | 82.3 |
| Shared attention only | 64M | 83M | 81.7 | 81.6 |
| Shared FFN only | 38M | 57M | 80.2 | 79.5 |
| All shared (ALBERT-style) | 12M | 31M | 80.1 | 79.8 |
Sharing only the attention parameters had a minimal effect on performance (and even improved the average slightly with E = 128, from 81.6 to 81.7). The larger degradation came from sharing the feed-forward network (FFN) parameters. Sharing all parameters together still delivered reasonable accuracy at a fraction of the parameter count. The default ALBERT configuration shares all parameters, accepting a modest performance decrease in exchange for a dramatic reduction in parameter count.
An important side effect of cross-layer sharing is that it stabilizes the network's parameters. The authors measured the L2 distances and cosine similarities between the input and output embeddings of each layer and found that the transitions from layer to layer are much smoother for ALBERT than for BERT, indicating that weight sharing has a stabilizing effect. These metrics oscillate rather than converging to zero, however, showing that the shared network does not simply settle into a fixed point as depth increases.
ALBERT replaces BERT's Next Sentence Prediction (NSP) task with Sentence Order Prediction (SOP). In NSP, the model receives two segments and predicts whether they are consecutive sentences from the same document (positive) or whether the second segment was randomly sampled from a different document (negative). The ALBERT authors argued that the negative examples in NSP are too easy because the model can distinguish them based on topic alone (different documents are almost always about different topics), without needing to understand inter-sentence coherence.
SOP uses the same positive examples as NSP: two consecutive segments from the same document. However, its negative examples are created by simply swapping the order of those two segments. Because both the positive and negative examples come from the same document, the model cannot rely on topic differences and must instead learn the actual ordering relationship between segments. This forces the model to capture discourse-level coherence properties.
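A minimal sketch of how SOP training pairs could be constructed (a hypothetical helper, not the released data pipeline): because both labels draw their segments from the same document, topic cues carry no signal and only ordering distinguishes the classes.

```python
import random

def make_sop_example(segment_a, segment_b, rng):
    """Build one SOP example from two *consecutive* segments of the
    same document. Positive (label 1): original order kept.
    Negative (label 0): the same two segments, order swapped."""
    if rng.random() < 0.5:
        return (segment_a, segment_b), 1
    return (segment_b, segment_a), 0

rng = random.Random(0)
pair, label = make_sop_example("The sky darkened.", "Then it began to rain.", rng)
```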
The ablation results from the paper compare three settings using an ALBERT-base configuration: no sentence-level loss (as in XLNet and RoBERTa), NSP (as in BERT), and SOP (as in ALBERT).
| Pre-training Objective | MLM Accuracy | NSP Accuracy | SOP Accuracy | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|---|---|
| None | 54.9 | 52.4 | 53.3 | 88.6/81.5 | 78.1/75.3 | 81.5 | 89.9 | 61.7 | 79.0 |
| NSP | 54.5 | 90.5 | 52.0 | 88.4/81.5 | 77.2/74.6 | 81.6 | 91.1 | 62.3 | 79.2 |
| SOP | 54.0 | 78.9 | 86.5 | 89.3/82.3 | 80.0/77.1 | 82.0 | 90.3 | 64.0 | 80.1 |
A model trained with NSP achieves 90.5% accuracy on the NSP task but performs at chance (52.0%) on SOP, demonstrating that NSP learns only the easier topic-prediction signal. Conversely, a model trained with SOP can still solve the NSP task reasonably well (78.9%) while also achieving 86.5% accuracy on the harder SOP task. On downstream benchmarks, SOP consistently outperforms NSP, with gains of roughly +1 F1 on SQuAD 1.1, +3 F1 on SQuAD 2.0, and +1.7 points on RACE, for an average score of 80.1 versus 79.2 with NSP and 79.0 with no sentence-level loss.
ALBERT comes in four standard sizes. All configurations use a vocabulary of 30,000 tokens processed by a SentencePiece tokenizer and an embedding dimension E of 128.
| Model | Hidden Size (H) | Layers (L) | Attention Heads | Intermediate Size | Parameters |
|---|---|---|---|---|---|
| ALBERT-base | 768 | 12 | 12 | 3,072 | 12M |
| ALBERT-large | 1,024 | 24 | 16 | 4,096 | 18M |
| ALBERT-xlarge | 2,048 | 24 | 32 | 8,192 | 60M |
| ALBERT-xxlarge | 4,096 | 12 | 64 | 16,384 | 235M |
For comparison with BERT:
| Model | Hidden Size (H) | Embedding Size (E) | Layers | Parameters | Parameter Sharing |
|---|---|---|---|---|---|
| BERT-base | 768 | 768 | 12 | 108M | No |
| BERT-large | 1,024 | 1,024 | 24 | 334M | No |
| ALBERT-base | 768 | 128 | 12 | 12M | Yes |
| ALBERT-large | 1,024 | 128 | 24 | 18M | Yes |
| ALBERT-xlarge | 2,048 | 128 | 24 | 60M | Yes |
| ALBERT-xxlarge | 4,096 | 128 | 12 | 235M | Yes |
ALBERT-base has 12M parameters compared to BERT-base's 108M, a reduction of roughly 89%. ALBERT-large has 18M parameters compared to BERT-large's 334M, roughly 18 times fewer. Even the largest configuration, ALBERT-xxlarge with 235M parameters, is about 70% the size of BERT-large while using a hidden size four times larger (4,096 vs. 1,024).
Notably, ALBERT-xxlarge uses only 12 layers rather than 24 because the authors found that a 24-layer version achieved similar results (88.7 average on dev benchmarks for both configurations) but was computationally more expensive. Reducing the depth to 12 layers halved the computation cost without sacrificing accuracy.
ALBERT-large has the same hidden size (1,024) and layer count (24) as BERT-large, yet contains only 18M parameters compared to BERT-large's 334M. This dramatic reduction comes entirely from factorized embeddings and cross-layer parameter sharing.
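A back-of-the-envelope count makes this concrete. Counting only the dominant weight matrices (and omitting biases, layer norms, the pooler, and position/segment embeddings), the shared-block arithmetic lands close to the reported 18M:

```python
V, E, H = 30_000, 128, 1_024    # ALBERT-large configuration

embedding = V * E + E * H       # factorized embedding + up-projection
attention = 4 * H * H           # Q, K, V, and output projections
ffn = 2 * H * (4 * H)           # up- and down-projection (intermediate = 4H)
shared_block = attention + ffn  # stored once, reused by all 24 layers

total = embedding + shared_block
print(f"{total:,}")  # 16,553,984 — within rounding of the reported 18M
```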
ALBERT was pre-trained on the same data as BERT: the BookCorpus (approximately 800 million words from 11,038 unpublished books) and English Wikipedia (approximately 2,500 million words, with lists, tables, and headers removed). The combined training corpus contains roughly 16 GB of uncompressed text. For the state-of-the-art comparison experiments, ALBERT was also trained on the additional data used by XLNet and RoBERTa.
Like BERT, ALBERT uses masked language modeling (MLM) as its primary pre-training objective. Fifteen percent of input tokens are randomly selected; of those, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. The model is trained to predict the original token at each selected position.
ALBERT uses n-gram masking with a maximum n-gram length of 3. The probability of selecting an n-gram of length n decreases with n, encouraging shorter masked spans while still allowing the model to learn to predict multi-token phrases.
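The two masking rules can be combined in a short sketch (a simplified illustration; the released implementation handles span sampling and the masking budget somewhat differently):

```python
import random

def mask_for_mlm(tokens, vocab, rng, mask_rate=0.15, max_ngram=3):
    """Select ~15% of positions in n-gram spans (n <= 3, shorter spans
    more likely), then apply the 80/10/10 mask/random/keep rule.
    Returns the corrupted tokens plus, per position, the original token
    to predict (None where no prediction is made)."""
    tokens = list(tokens)
    labels = [None] * len(tokens)
    budget = max(1, int(len(tokens) * mask_rate))
    lengths = list(range(1, max_ngram + 1))
    weights = [1.0 / n for n in lengths]        # p(n) decreases with n
    while budget > 0:
        n = rng.choices(lengths, weights=weights)[0]
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + n, len(tokens))):
            if labels[i] is not None:           # already selected
                continue
            labels[i] = tokens[i]               # prediction target
            roll = rng.random()
            if roll < 0.8:
                tokens[i] = "[MASK]"            # 80%: mask
            elif roll < 0.9:
                tokens[i] = rng.choice(vocab)   # 10%: random token
            # else 10%: keep the original token
            budget -= 1
    return tokens, labels
```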
The second objective is the SOP task described above. Together, the MLM and SOP losses are combined and optimized jointly during pre-training.
ALBERT uses a maximum input length of 512 tokens and is optimized with the LAMB optimizer. The learning rate is set to 0.00176 with a batch size of 4,096. Models are trained for 125,000 steps on Cloud TPU V3 hardware, with the number of TPUs ranging from 64 (for smaller models) to 512 (for ALBERT-xxlarge).
Because ALBERT has far fewer parameters, each training step requires less cross-device communication (fewer weights to synchronize) and less memory for weight storage. The paper reports the following relative training speeds, with BERT-large as the baseline:
| Model | Parameters | Training Speedup (vs. BERT-large) |
|---|---|---|
| BERT-base | 108M | 4.7x |
| BERT-large | 334M | 1.0x (baseline) |
| ALBERT-base | 12M | 5.6x |
| ALBERT-large | 18M | 1.7x |
| ALBERT-xlarge | 60M | 0.6x |
| ALBERT-xxlarge | 235M | 0.3x |
ALBERT-large is 1.7 times faster than BERT-large per training step despite matching BERT-large's depth and hidden size. However, ALBERT-xlarge and ALBERT-xxlarge are slower than BERT-large because their much larger hidden sizes (2,048 and 4,096 respectively) increase the computation per layer significantly, even though the total stored parameters are smaller.
An important training efficiency result is that ALBERT-xxlarge reaches higher accuracy than BERT-large in less wall-clock time:
| Model | Training Steps | Training Time | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|---|
| BERT-large | 400k | 34h | 93.5/87.4 | 86.9/84.3 | 87.8 | 94.6 | 77.3 | 87.2 |
| ALBERT-xxlarge | 125k | 32h | 94.0/88.1 | 88.3/85.3 | 87.8 | 95.4 | 82.5 | 88.7 |
With similar training time, ALBERT-xxlarge outperforms BERT-large by 1.5 points on average. The improvement is especially pronounced on RACE (+5.2 points), which requires reasoning over longer passages.
The following table compares ALBERT configurations against BERT on five representative tasks, all trained on the same data (BookCorpus + Wikipedia):
| Model | Parameters | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|
| BERT-base | 108M | 90.4/83.2 | 80.4/77.6 | 84.5 | 92.8 | 68.2 | 82.3 |
| BERT-large | 334M | 92.2/85.5 | 85.0/82.2 | 86.6 | 93.0 | 73.9 | 85.2 |
| ALBERT-base | 12M | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3 | 64.0 | 80.1 |
| ALBERT-large | 18M | 90.6/83.9 | 82.3/79.4 | 83.5 | 91.7 | 68.5 | 82.4 |
| ALBERT-xlarge | 60M | 92.5/86.1 | 86.1/83.1 | 86.4 | 92.4 | 74.8 | 85.5 |
| ALBERT-xxlarge | 235M | 94.1/88.3 | 88.1/85.1 | 88.0 | 95.2 | 82.3 | 88.7 |
ALBERT-xxlarge achieves significant improvements over BERT-large across every task: +1.9 F1 on SQuAD 1.1 (94.1 vs. 92.2), +3.1 F1 on SQuAD 2.0 (88.1 vs. 85.0), +1.4 on MNLI, +2.2 on SST-2, and +8.4 on RACE. ALBERT-xlarge, with only 60M parameters (roughly 18% of BERT-large's count), already matches or exceeds BERT-large on SQuAD 1.1, SQuAD 2.0, and RACE.
ALBERT-large, with only 18M parameters, matches BERT-base (108M parameters) in average score (82.4 vs. 82.3).
On the GLUE benchmark, the authors compared ALBERT-xxlarge (trained for 1M and 1.5M steps with additional data) against single-model results from BERT-large, XLNet-large, and RoBERTa-large on the development set:
| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B |
|---|---|---|---|---|---|---|---|---|
| BERT-large | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
| XLNet-large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |
| RoBERTa-large | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4 |
| ALBERT-xxlarge (1M) | 90.4 | 95.2 | 92.0 | 88.1 | 96.8 | 90.2 | 68.7 | 92.7 |
| ALBERT-xxlarge (1.5M) | 90.8 | 95.3 | 92.2 | 89.2 | 96.9 | 90.9 | 71.4 | 93.0 |
ALBERT-xxlarge at 1.5M training steps outperformed all competing single models on every GLUE dev task. The improvements were especially large on RTE (+2.6 over RoBERTa, +5.4 over XLNet, +18.8 over BERT-large) and CoLA (+3.4 over RoBERTa, +7.8 over XLNet, +10.8 over BERT-large).
On the GLUE test set using ensemble models, ALBERT achieved an overall score of 89.4, setting a new state-of-the-art at the time of publication:
| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B | WNLI | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| XLNet | 90.2 | 98.6 | 90.3 | 86.3 | 96.8 | 93.0 | 67.8 | 91.6 | 90.4 | 88.4 |
| RoBERTa | 90.8 | 98.9 | 90.2 | 88.2 | 96.7 | 92.3 | 67.8 | 92.2 | 89.0 | 88.5 |
| ALBERT | 91.3 | 99.2 | 90.5 | 89.2 | 97.1 | 93.4 | 69.1 | 92.5 | 91.8 | 89.4 |
On the Stanford Question Answering Dataset (SQuAD), ALBERT-xxlarge set new records:
| Model | SQuAD 1.1 Dev (F1/EM) | SQuAD 2.0 Dev (F1/EM) | SQuAD 2.0 Test (F1/EM) |
|---|---|---|---|
| BERT-large | 90.9/84.1 | 81.8/79.0 | - |
| XLNet-large | 94.5/89.0 | 88.8/86.1 | 89.1/86.3 |
| RoBERTa-large | 94.6/88.9 | 89.4/86.5 | 89.8/86.8 |
| ALBERT (1M steps) | 94.8/89.2 | 89.9/87.2 | - |
| ALBERT (1.5M steps) | 94.8/89.3 | 90.2/87.4 | 90.9/88.1 |
| ALBERT (ensemble) | - | - | 92.2/89.7 |
The single-model ALBERT at 1.5M steps beat RoBERTa by 0.2 F1 on SQuAD 1.1 dev and 0.8 F1 on SQuAD 2.0 dev. The ALBERT ensemble achieved 89.7 EM / 92.2 F1 on the SQuAD 2.0 test set.
The RACE (ReAding Comprehension from Examinations) benchmark tests machine reading comprehension using English-language questions collected from English exams administered to Chinese middle school and high school students.
| Model | RACE Test Accuracy |
|---|---|
| BERT-large | 72.0 |
| XLNet-large | 81.8 |
| RoBERTa-large | 83.2 |
| ALBERT (1.5M, single model) | 86.5 |
| ALBERT (ensemble) | 89.4 |
ALBERT's single model beat RoBERTa by 3.3 points, and the ensemble beat it by 6.2 points. The RACE improvement was the largest of all benchmarks, suggesting that ALBERT's SOP objective is especially helpful for tasks that require understanding multi-sentence coherence and discourse structure.
Using an ALBERT-large configuration (which has only 18M parameters regardless of depth due to parameter sharing), the authors evaluated the impact of varying the number of layers:
| Layers | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|
| 1 | 31.1/22.9 | 50.1/50.1 | 66.4 | 80.8 | 40.1 | 52.9 |
| 3 | 79.8/69.7 | 64.4/61.7 | 77.7 | 86.7 | 54.0 | 71.2 |
| 6 | 86.4/78.4 | 73.8/71.1 | 81.2 | 88.9 | 60.9 | 77.2 |
| 12 | 89.8/83.3 | 80.7/77.9 | 83.3 | 91.7 | 66.7 | 81.5 |
| 24 | 90.3/83.3 | 81.8/79.0 | 83.3 | 91.5 | 68.7 | 82.1 |
| 48 | 90.0/83.1 | 81.8/78.9 | 83.4 | 91.9 | 66.9 | 81.8 |
Performance improves sharply from 1 to 12 layers, gains taper off between 12 and 24 layers, and a 48-layer model shows slight degradation on some tasks. This indicates diminishing returns from depth when parameters are shared, and it explains why ALBERT-xxlarge uses 12 layers rather than 24.
Using a 3-layer ALBERT-large configuration, the authors tested increasing hidden sizes:
| Hidden Size | Parameters | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|
| 1,024 | 18M | 79.8/69.7 | 64.4/61.7 | 77.7 | 86.7 | 54.0 | 71.2 |
| 2,048 | 60M | 83.3/74.1 | 69.1/66.6 | 79.7 | 88.6 | 58.2 | 74.6 |
| 4,096 | 225M | 85.0/76.4 | 71.0/68.1 | 80.3 | 90.4 | 60.4 | 76.3 |
| 6,144 | 499M | 84.7/75.8 | 67.8/65.4 | 78.1 | 89.1 | 56.0 | 74.0 |
Performance increases from H = 1,024 to H = 4,096 but drops at H = 6,144 (499M parameters). The authors noted that this degradation may reflect optimization difficulty rather than insufficient model capacity, which is why H = 4,096 was selected for ALBERT-xxlarge.
For ALBERT-xxlarge, removing dropout improved the average dev score from 90.4 to 90.7 across all benchmarks. The authors observed that the model had not overfit the training data even after 1 million training steps, and noted that ALBERT's parameter sharing already provides an implicit regularization effect, making explicit dropout unnecessary and potentially harmful.
| Configuration | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|
| With dropout | 94.7/89.2 | 89.6/86.9 | 90.0 | 96.3 | 85.7 | 90.4 |
| Without dropout | 94.8/89.5 | 89.9/87.2 | 90.4 | 96.5 | 86.1 | 90.7 |
Using ALBERT-base, the paper also evaluated the effect of adding the extra training data used by XLNet and RoBERTa:
| Data | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|
| BookCorpus + Wikipedia only | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3 | 64.0 | 80.1 |
| With additional data | 88.8/81.7 | 79.1/76.3 | 82.4 | 92.8 | 66.0 | 80.8 |
Additional data improved performance on some tasks (MNLI, SST-2, RACE) but slightly decreased performance on SQuAD, resulting in a modest overall improvement of +0.7 in average score.
The following table situates ALBERT alongside other prominent models from the same era:
| Model | Year | Parameters | Key Innovation | GLUE Test Score (Ensemble) |
|---|---|---|---|---|
| BERT-large | 2018 | 334M | Masked LM + NSP pre-training | 80.5 |
| XLNet-large | 2019 | 360M | Permutation language modeling | 88.4 |
| RoBERTa-large | 2019 | 355M | Optimized BERT pre-training (no NSP) | 88.5 |
| ALBERT-xxlarge | 2019 | 235M | Factorized embeddings, parameter sharing, SOP | 89.4 |
| ELECTRA-large | 2020 | 335M | Replaced token detection | 89.4 |
| DeBERTa-large | 2020 | 350M | Disentangled attention + enhanced mask decoder | 90.0 |
ALBERT achieved a GLUE test score of 89.4, exceeding its 2019 contemporaries XLNet and RoBERTa while using significantly fewer parameters than any competitor; later models such as ELECTRA matched and DeBERTa surpassed this score.
Despite its parameter efficiency, ALBERT has several known limitations.
Inference speed is not proportionally faster. ALBERT reduces the number of stored parameters, but the computation graph during inference remains the same size. A 12-layer ALBERT model still performs 12 forward passes through the Transformer block; the fact that each pass reuses the same weights does not reduce the number of floating-point operations. As a result, ALBERT-xxlarge (235M parameters) is actually slower at inference than BERT-large (334M parameters) because its hidden size of 4,096 is four times larger than BERT-large's 1,024, requiring more computation per layer.
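The gap between stored parameters and compute can be estimated with a rough per-pass count. FLOPs per Transformer layer scale roughly with H² at a fixed sequence length (a back-of-the-envelope figure that ignores attention's sequence-length-dependent terms):

```python
def relative_compute(hidden_size, num_layers):
    # FLOPs per forward pass scale roughly as layers * H^2 for a fixed
    # sequence length (attention's O(seq^2) terms are ignored here).
    return num_layers * hidden_size ** 2

albert_xxlarge = relative_compute(4_096, 12)   # 235M stored parameters
bert_large = relative_compute(1_024, 24)       # 334M stored parameters

print(albert_xxlarge / bert_large)  # 8.0 — far more compute, far fewer weights
```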
Training the largest configurations is expensive. Although ALBERT-large trains 1.7 times faster than BERT-large, ALBERT-xxlarge trains approximately 3.3 times slower (0.3x relative speed). The large hidden dimension drives up computation costs even though the parameter count is smaller.
Performance at small scales lags behind BERT. ALBERT-base (12M parameters) achieves an average benchmark score of 80.1 compared to BERT-base's 82.3. The aggressive parameter sharing takes a toll at smaller model sizes where the single shared Transformer block does not have enough capacity to capture all the patterns that 12 independent blocks would learn. This makes ALBERT less attractive as a lightweight model for resource-constrained deployment compared to knowledge distillation-based approaches like DistilBERT.
Same inference memory for activations. While ALBERT uses less memory to store its model weights, the activation memory (the intermediate computation results that must be kept in memory during a forward pass) is determined by the hidden size and sequence length, not the number of unique parameters. ALBERT-xxlarge with its 4,096 hidden dimension uses more activation memory than BERT-large.
No reduction in computational FLOPs. Unlike methods such as knowledge distillation (e.g., DistilBERT) or pruning, ALBERT does not reduce the actual computation required during inference. For applications where inference latency is the primary concern, ALBERT offers less benefit than distillation-based approaches that achieve faster inference by using fewer layers.
Sensitivity to width scaling. The ablation study on hidden size shows that performance drops when increasing from H = 4,096 to H = 6,144, suggesting that very wide models with shared parameters may be difficult to optimize. This limits the potential for further scaling ALBERT to even larger hidden dimensions.
The ALBERT team released a second version (v2) of all model checkpoints with several training improvements. Version 2 models employ no dropout, additional training data, and longer training schedules, with ALBERT-base v2 trained for 10 million steps and larger models trained for 3 million steps. The removal of dropout was motivated by the observation that ALBERT models do not overfit the training data as quickly as BERT models, likely because the parameter-sharing mechanism itself acts as a regularizer. The v2 checkpoints became the standard versions available through the Hugging Face Transformers library.
ALBERT demonstrated that the number of stored parameters in a model and its representational capacity are not the same thing. By decoupling these two properties through factorized embeddings and parameter sharing, ALBERT showed that much smaller models could match or exceed the performance of their larger counterparts. This insight influenced subsequent work on efficient Transformers and parameter-efficient methods.
The factorized embedding technique has been adopted in various forms by later models. The idea of decomposing large embedding matrices into smaller factors appears in ELECTRA, DeBERTa, and other efficient Transformer architectures.
Cross-layer parameter sharing, while not universally adopted due to its inference speed limitations, inspired research into other forms of weight sharing and weight tying in deep learning. Universal Transformers, which were proposed independently around the same time, explored a similar idea of applying the same Transformer block iteratively. The concept of parameter efficiency also foreshadowed later work on parameter-efficient fine-tuning methods such as LoRA and adapters.
ALBERT's SOP objective proved that more carefully designed pre-training tasks could improve downstream performance. This line of thinking influenced subsequent work on contrastive and order-aware pre-training objectives.
The ALBERT codebase was open-sourced through the google-research/albert repository on GitHub, and pre-trained checkpoints for all configurations (v1 and v2) are available through the Hugging Face Transformers library under identifiers such as albert-base-v2, albert-large-v2, albert-xlarge-v2, and albert-xxlarge-v2. The model supports PyTorch, TensorFlow, and JAX/Flax backends.