ALBERT (A Lite BERT) is a parameter-efficient variant of the BERT language model developed by researchers at Google Research and the Toyota Technological Institute at Chicago (TTIC). Published as a conference paper at ICLR 2020, ALBERT introduced two parameter-reduction techniques, factorized embedding parameterization and cross-layer parameter sharing, that together reduce the parameter count of a BERT-like model by up to 89% without proportional loss in performance. The paper also replaced BERT's Next Sentence Prediction (NSP) pre-training objective with a more challenging Sentence Order Prediction (SOP) task that forces the model to learn finer-grained inter-sentence coherence.
The ALBERT paper was authored by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Among them, Lan, Goodman, Sharma, and Soricut were affiliated with Google Research, while Chen and Gimpel were at TTIC. The paper was first posted to arXiv on September 26, 2019 (arXiv:1909.11942), and its final published version appeared at the International Conference on Learning Representations (ICLR) in April 2020.
An ALBERT configuration with the same hidden size and layer count as BERT-large has roughly 18 times fewer parameters (18M vs. 334M) and can be trained approximately 1.7 times faster. ALBERT-xxlarge, the largest configuration with 235M parameters, established new state-of-the-art results at the time of publication on the GLUE benchmark (89.4), SQuAD 1.1 and 2.0, and the RACE reading comprehension benchmark, all while using fewer parameters than BERT-large.
Pre-trained language models grew rapidly in size during 2018 and 2019. BERT-large had 334M parameters, GPT-2 reached 1.5 billion, and models continued to scale upward. While larger models generally improved downstream task accuracy, this growth brought practical problems: GPU and TPU memory limits restricted the maximum model size that could be trained on available hardware, longer training times slowed research iteration, and, somewhat unexpectedly, simply making models bigger sometimes led to performance degradation rather than improvement.
The ALBERT authors observed that existing approaches tied the size of the word embedding layer directly to the hidden layer size, which was wasteful because word embeddings are context-independent while hidden states are context-dependent. They also noted that every Transformer layer in BERT learned its own set of parameters, even though prior work had shown that Transformer layers often learn similar patterns across depths. These two observations motivated the core parameter-reduction techniques in ALBERT.
A secondary motivation was the weakness of BERT's Next Sentence Prediction (NSP) objective. Research by the RoBERTa team and others had already shown that NSP provided little or no benefit for downstream tasks. The ALBERT authors hypothesized that NSP was too easy because it conflated two separate signals: topic prediction (whether two sentences are from the same document) and coherence prediction (whether the sentences are in a logical order). Topic prediction overlaps heavily with the masked language modeling (MLM) signal, making it redundant.
ALBERT shares the same encoder-only Transformer architecture as BERT. It uses multi-head self-attention layers followed by position-wise feed-forward networks, with layer normalization and residual connections at each sub-layer. The key differences lie in how the embedding layer is structured and how parameters are shared across layers.
In BERT, the word embedding dimension E is always equal to the hidden layer dimension H. This means that increasing the hidden size (to improve the model's representational capacity) also increases the embedding matrix proportionally. Since the vocabulary size V is typically 30,000, the embedding matrix alone accounts for V x H parameters, which can be substantial.
The ALBERT authors argued that this coupling is unnecessary. Word embeddings encode context-independent representations of individual tokens, while hidden states encode context-dependent representations shaped by self-attention across the entire input sequence. The former requires less capacity than the latter. Therefore, ALBERT decouples the two sizes by setting E much smaller than H. The vocabulary is first projected into a low-dimensional embedding space of size E, and then a linear projection maps each E-dimensional embedding up to the H-dimensional hidden space.
This factorization reduces the embedding parameter count from O(V x H) to O(V x E + E x H). When H is much larger than E, the savings are substantial. For example, with V = 30,000, H = 4,096, and E = 128, the embedding parameters drop from approximately 123 million (30,000 x 4,096) to approximately 4.4 million (30,000 x 128 + 128 x 4,096). Google Research described this as "an 80% reduction in the parameters of the projection block."
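The arithmetic behind these savings is easy to verify directly, using the numbers from the example above:

```python
# Embedding parameter count before and after factorization, using the
# V = 30,000, H = 4,096, E = 128 example from the text.
V, H, E = 30_000, 4_096, 128

tied = V * H                   # BERT-style: embedding size tied to hidden size
factorized = V * E + E * H     # ALBERT: small embedding plus up-projection

print(f"tied:       {tied:>12,}")        # 122,880,000 (~123M)
print(f"factorized: {factorized:>12,}")  # 4,364,288 (~4.4M)
```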
The ALBERT paper experimented with embedding sizes of 64, 128, 256, and 768. Under the all-shared condition (where cross-layer parameter sharing is active), an embedding size of 128 yielded the best overall performance across downstream tasks. The full results are shown below.
| Embedding Size (E) | Parameters (Not Shared) | Avg (Not Shared) | Parameters (All Shared) | Avg (All Shared) |
|---|---|---|---|---|
| 64 | 87M | 81.3 | 10M | 79.0 |
| 128 | 89M | 81.7 | 12M | 80.1 |
| 256 | 93M | 81.8 | 16M | 79.6 |
| 768 | 108M | 82.3 | 31M | 79.8 |
Without parameter sharing, larger embedding sizes consistently produced better results, with E = 768 achieving the highest average. With all-shared parameters, however, E = 128 provided the best average score. This occurs because the shared parameters compress the model's capacity, and a smaller embedding space is sufficient in this regime. All subsequent ALBERT experiments use E = 128.
The second technique is cross-layer parameter sharing, in which all Transformer layers use the same set of parameters. Instead of each of the L layers learning its own attention weights and feed-forward weights independently, ALBERT defines a single set of Transformer parameters and reuses them at every layer. This means that a 12-layer ALBERT model stores parameters for only one Transformer block, while at runtime the input passes through that same block 12 times.
This approach prevents the total parameter count from growing with the depth of the network. A 12-layer and a 24-layer ALBERT model with the same hidden size have identical parameter counts; only the computation time increases with depth. Google Research reported that this achieves "a 90% parameter reduction for the attention-feedforward block (a 70% reduction overall)."
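The mechanics can be sketched with a toy example (the classes below are illustrative stand-ins, not the ALBERT implementation): a list that contains the same block object L times stores only one set of weights, mirroring how ALBERT reuses one Transformer block at every layer.

```python
class TransformerBlock:
    """Toy stand-in for one attention + feed-forward block."""
    def __init__(self, hidden_size):
        # Rough per-block weight count: 4H^2 (attention) + 8H^2 (FFN)
        self.num_params = 12 * hidden_size ** 2

def stored_params(layers):
    """Count parameters of *distinct* blocks only."""
    return sum(block.num_params for block in set(layers))

H, L = 1024, 24
bert_style = [TransformerBlock(H) for _ in range(L)]  # L independent blocks
albert_style = [TransformerBlock(H)] * L              # one block, reused L times

# Stored parameters grow with depth only in the unshared case.
print(stored_params(bert_style) // stored_params(albert_style))  # 24
```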
The paper evaluated four sharing strategies to understand which components benefit most from sharing:
| Sharing Strategy | Parameters (E=128) | Parameters (E=768) | Avg Score (E=128) | Avg Score (E=768) |
|---|---|---|---|---|
| Not shared (BERT-style) | 89M | 108M | 81.6 | 82.3 |
| Shared attention only | 64M | 83M | 81.7 | 81.6 |
| Shared FFN only | 38M | 57M | 80.2 | 79.5 |
| All shared (ALBERT-style) | 12M | 31M | 80.1 | 79.8 |
Sharing only the attention parameters had a minimal effect on performance (and even improved the average slightly with E = 128, from 81.6 to 81.7). The larger degradation came from sharing the feed-forward network (FFN) parameters. Sharing all parameters together still delivered reasonable accuracy at a fraction of the parameter count. The default ALBERT configuration shares all parameters, accepting a modest performance decrease in exchange for a dramatic reduction in parameter count.
An important side effect of cross-layer sharing is that it stabilizes the network's parameters. The authors measured the L2 distances and cosine similarities between the input and output embeddings of each layer and found that the transitions from layer to layer are much smoother for ALBERT than for BERT, indicating that weight sharing has a stabilizing effect. These metrics oscillate rather than converging to zero, however, showing that the shared network does not simply settle into a fixed point as depth increases.
ALBERT replaces BERT's Next Sentence Prediction (NSP) task with Sentence Order Prediction (SOP). In NSP, the model receives two segments and predicts whether they are consecutive sentences from the same document (positive) or whether the second segment was randomly sampled from a different document (negative). The ALBERT authors argued that the negative examples in NSP are too easy because the model can distinguish them based on topic alone (different documents are almost always about different topics), without needing to understand inter-sentence coherence.
SOP uses the same positive examples as NSP: two consecutive segments from the same document. However, its negative examples are created by simply swapping the order of those two segments. Because both the positive and negative examples come from the same document, the model cannot rely on topic differences and must instead learn the actual ordering relationship between segments. This forces the model to capture discourse-level coherence properties.
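A minimal sketch of how SOP training pairs could be constructed (a hypothetical helper, not the released data pipeline): because both labels draw their segments from the same document, topic cues carry no signal and only ordering distinguishes the classes.

```python
import random

def make_sop_example(segment_a, segment_b, rng):
    """Build one SOP example from two *consecutive* segments of the
    same document. Positive (label 1): original order kept.
    Negative (label 0): the same two segments, order swapped."""
    if rng.random() < 0.5:
        return (segment_a, segment_b), 1
    return (segment_b, segment_a), 0

rng = random.Random(0)
pair, label = make_sop_example("The sky darkened.", "Then it began to rain.", rng)
```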
The ablation results from the paper compare three settings using an ALBERT-base configuration: no sentence-level loss (as in XLNet and RoBERTa), NSP (as in BERT), and SOP (as in ALBERT).
| Pre-training Objective | MLM Accuracy | NSP Accuracy | SOP Accuracy | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|---|---|
| None | 54.9 | 52.4 | 53.3 | 88.6/81.5 | 78.1/75.3 | 81.5 | 89.9 | 61.7 | 79.0 |
| NSP | 54.5 | 90.5 | 52.0 | 88.4/81.5 | 77.2/74.6 | 81.6 | 91.1 | 62.3 | 79.2 |
| SOP | 54.0 | 78.9 | 86.5 | 89.3/82.3 | 80.0/77.1 | 82.0 | 90.3 | 64.0 | 80.1 |
A model trained with NSP achieves 90.5% accuracy on the NSP task but performs at chance (52.0%) on SOP, demonstrating that NSP learns only the easier topic-prediction signal. Conversely, a model trained with SOP can still solve the NSP task reasonably well (78.9%) while also achieving 86.5% accuracy on the harder SOP task. On downstream benchmarks, SOP consistently outperforms NSP, with gains of roughly +1 F1 on SQuAD 1.1, +3 F1 on SQuAD 2.0, and +1.7 points on RACE, for an average score of 80.1 versus 79.2 with NSP and 79.0 with no sentence-level loss.
ALBERT comes in four standard sizes. All configurations use a vocabulary of 30,000 tokens processed by a SentencePiece tokenizer and an embedding dimension E of 128.
| Model | Hidden Size (H) | Layers (L) | Attention Heads | Intermediate Size | Parameters |
|---|---|---|---|---|---|
| ALBERT-base | 768 | 12 | 12 | 3,072 | 12M |
| ALBERT-large | 1,024 | 24 | 16 | 4,096 | 18M |
| ALBERT-xlarge | 2,048 | 24 | 32 | 8,192 | 60M |
| ALBERT-xxlarge | 4,096 | 12 | 64 | 16,384 | 235M |
For comparison with BERT:
| Model | Hidden Size (H) | Embedding Size (E) | Layers | Parameters | Parameter Sharing |
|---|---|---|---|---|---|
| BERT-base | 768 | 768 | 12 | 108M | No |
| BERT-large | 1,024 | 1,024 | 24 | 334M | No |
| ALBERT-base | 768 | 128 | 12 | 12M | Yes |
| ALBERT-large | 1,024 | 128 | 24 | 18M | Yes |
| ALBERT-xlarge | 2,048 | 128 | 24 | 60M | Yes |
| ALBERT-xxlarge | 4,096 | 128 | 12 | 235M | Yes |
ALBERT-base has 12M parameters compared to BERT-base's 108M, a reduction of roughly 89%. ALBERT-large has 18M parameters compared to BERT-large's 334M, roughly 18 times fewer. Even the largest configuration, ALBERT-xxlarge with 235M parameters, is about 70% the size of BERT-large while using a hidden size four times larger (4,096 vs. 1,024).
Notably, ALBERT-xxlarge uses only 12 layers rather than 24 because the authors found that a 24-layer version achieved similar results (88.7 average on dev benchmarks for both configurations) but was computationally more expensive. Reducing the depth to 12 layers halved the computation cost without sacrificing accuracy.
ALBERT-large has the same hidden size (1,024) and layer count (24) as BERT-large, yet contains only 18M parameters compared to BERT-large's 334M. This dramatic reduction comes entirely from factorized embeddings and cross-layer parameter sharing.
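A back-of-the-envelope count makes this concrete. Counting only the dominant weight matrices (and omitting biases, layer norms, the pooler, and position/segment embeddings), the shared-block arithmetic lands close to the reported 18M:

```python
V, E, H = 30_000, 128, 1_024    # ALBERT-large configuration

embedding = V * E + E * H       # factorized embedding + up-projection
attention = 4 * H * H           # Q, K, V, and output projections
ffn = 2 * H * (4 * H)           # up- and down-projection (intermediate = 4H)
shared_block = attention + ffn  # stored once, reused by all 24 layers

total = embedding + shared_block
print(f"{total:,}")  # 16,553,984 — within rounding of the reported 18M
```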
ALBERT was pre-trained on the same data as BERT: the BookCorpus (approximately 800 million words from 11,038 unpublished books) and English Wikipedia (approximately 2,500 million words, with lists, tables, and headers removed). The combined training corpus contains roughly 16 GB of uncompressed text. For the state-of-the-art comparison experiments, ALBERT was also trained on the additional data used by XLNet and RoBERTa.
Like BERT, ALBERT uses masked language modeling (MLM) as its primary pre-training objective. Fifteen percent of input tokens are randomly selected; of those, 80% are replaced with a [MASK] token, 10% are replaced with a random token, and 10% are left unchanged. The model is trained to predict the original token at each selected position.
ALBERT uses n-gram masking with a maximum n-gram length of 3. The probability of selecting an n-gram of length n decreases with n, encouraging shorter masked spans while still allowing the model to learn to predict multi-token phrases.
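The two masking rules can be combined in a short sketch (a simplified illustration; the released implementation handles span sampling and the masking budget somewhat differently):

```python
import random

def mask_for_mlm(tokens, vocab, rng, mask_rate=0.15, max_ngram=3):
    """Select ~15% of positions in n-gram spans (n <= 3, shorter spans
    more likely), then apply the 80/10/10 mask/random/keep rule.
    Returns the corrupted tokens plus, per position, the original token
    to predict (None where no prediction is made)."""
    tokens = list(tokens)
    labels = [None] * len(tokens)
    budget = max(1, int(len(tokens) * mask_rate))
    lengths = list(range(1, max_ngram + 1))
    weights = [1.0 / n for n in lengths]        # p(n) decreases with n
    while budget > 0:
        n = rng.choices(lengths, weights=weights)[0]
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + n, len(tokens))):
            if labels[i] is not None:           # already selected
                continue
            labels[i] = tokens[i]               # prediction target
            roll = rng.random()
            if roll < 0.8:
                tokens[i] = "[MASK]"            # 80%: mask
            elif roll < 0.9:
                tokens[i] = rng.choice(vocab)   # 10%: random token
            # else 10%: keep the original token
            budget -= 1
    return tokens, labels
```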
The second objective is the SOP task described above. Together, the MLM and SOP losses are combined and optimized jointly during pre-training.
ALBERT uses a maximum input length of 512 tokens and is optimized with the LAMB optimizer. The learning rate is set to 0.00176 with a batch size of 4,096. Models are trained for 125,000 steps on Cloud TPU V3 hardware, with the number of TPUs ranging from 64 (for smaller models) to 512 (for ALBERT-xxlarge).
Because ALBERT has far fewer parameters, each training step requires less cross-device communication (fewer weights to synchronize) and less memory for weight storage. The paper reports the following relative training speeds, with BERT-large as the baseline:
| Model | Parameters | Training Speedup (vs. BERT-large) |
|---|---|---|
| BERT-base | 108M | 4.7x |
| BERT-large | 334M | 1.0x (baseline) |
| ALBERT-base | 12M | 5.6x |
| ALBERT-large | 18M | 1.7x |
| ALBERT-xlarge | 60M | 0.6x |
| ALBERT-xxlarge | 235M | 0.3x |
ALBERT-large is 1.7 times faster than BERT-large per training step despite matching BERT-large's depth and hidden size. However, ALBERT-xlarge and ALBERT-xxlarge are slower than BERT-large because their much larger hidden sizes (2,048 and 4,096 respectively) increase the computation per layer significantly, even though the total stored parameters are smaller.
An important training efficiency result is that ALBERT-xxlarge reaches higher accuracy than BERT-large in less wall-clock time:
| Model | Training Steps | Training Time | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|---|
| BERT-large | 400k | 34h | 93.5/87.4 | 86.9/84.3 | 87.8 | 94.6 | 77.3 | 87.2 |
| ALBERT-xxlarge | 125k | 32h | 94.0/88.1 | 88.3/85.3 | 87.8 | 95.4 | 82.5 | 88.7 |
With similar training time, ALBERT-xxlarge outperforms BERT-large by 1.5 points on average. The improvement is especially pronounced on RACE (+5.2 points), which requires reasoning over longer passages.
The following table compares ALBERT configurations against BERT on five representative tasks, all trained on the same data (BookCorpus + Wikipedia):
| Model | Parameters | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|
| BERT-base | 108M | 90.4/83.2 | 80.4/77.6 | 84.5 | 92.8 | 68.2 | 82.3 |
| BERT-large | 334M | 92.2/85.5 | 85.0/82.2 | 86.6 | 93.0 | 73.9 | 85.2 |
| ALBERT-base | 12M | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3 | 64.0 | 80.1 |
| ALBERT-large | 18M | 90.6/83.9 | 82.3/79.4 | 83.5 | 91.7 | 68.5 | 82.4 |
| ALBERT-xlarge | 60M | 92.5/86.1 | 86.1/83.1 | 86.4 | 92.4 | 74.8 | 85.5 |
| ALBERT-xxlarge | 235M | 94.1/88.3 | 88.1/85.1 | 88.0 | 95.2 | 82.3 | 88.7 |
ALBERT-xxlarge achieves significant improvements over BERT-large across every task: +1.9 F1 on SQuAD 1.1 (94.1 vs. 92.2), +3.1 F1 on SQuAD 2.0 (88.1 vs. 85.0), +1.4 on MNLI, +2.2 on SST-2, and +8.4 on RACE. ALBERT-xlarge, with only 60M parameters (roughly 18% of BERT-large's count), already matches or exceeds BERT-large on SQuAD 1.1, SQuAD 2.0, and RACE.
ALBERT-large, with only 18M parameters, matches BERT-base (108M parameters) in average score (82.4 vs. 82.3).
On the GLUE benchmark, the authors compared ALBERT-xxlarge (trained for 1M and 1.5M steps with additional data) against single-model results from BERT-large, XLNet-large, and RoBERTa-large on the development set:
| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B |
|---|---|---|---|---|---|---|---|---|
| BERT-large | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
| XLNet-large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |
| RoBERTa-large | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4 |
| ALBERT-xxlarge (1M) | 90.4 | 95.2 | 92.0 | 88.1 | 96.8 | 90.2 | 68.7 | 92.7 |
| ALBERT-xxlarge (1.5M) | 90.8 | 95.3 | 92.2 | 89.2 | 96.9 | 90.9 | 71.4 | 93.0 |
ALBERT-xxlarge at 1.5M training steps outperformed all competing single models on every GLUE dev task. The improvements were especially large on RTE (+2.6 over RoBERTa, +5.4 over XLNet, +18.8 over BERT-large) and CoLA (+3.4 over RoBERTa, +7.8 over XLNet, +10.8 over BERT-large).
On the GLUE test set using ensemble models, ALBERT achieved an overall score of 89.4, setting a new state-of-the-art at the time of publication:
| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B | WNLI | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| XLNet | 90.2 | 98.6 | 90.3 | 86.3 | 96.8 | 93.0 | 67.8 | 91.6 | 90.4 | 88.4 |
| RoBERTa | 90.8 | 98.9 | 90.2 | 88.2 | 96.7 | 92.3 | 67.8 | 92.2 | 89.0 | 88.5 |
| ALBERT | 91.3 | 99.2 | 90.5 | 89.2 | 97.1 | 93.4 | 69.1 | 92.5 | 91.8 | 89.4 |
On the Stanford Question Answering Dataset (SQuAD), ALBERT-xxlarge set new records:
| Model | SQuAD 1.1 Dev (F1/EM) | SQuAD 2.0 Dev (F1/EM) | SQuAD 2.0 Test (F1/EM) |
|---|---|---|---|
| BERT-large | 90.9/84.1 | 81.8/79.0 | - |
| XLNet-large | 94.5/89.0 | 88.8/86.1 | 89.1/86.3 |
| RoBERTa-large | 94.6/88.9 | 89.4/86.5 | 89.8/86.8 |
| ALBERT (1M steps) | 94.8/89.2 | 89.9/87.2 | - |
| ALBERT (1.5M steps) | 94.8/89.3 | 90.2/87.4 | 90.9/88.1 |
| ALBERT (ensemble) | - | - | 92.2/89.7 |
The single-model ALBERT at 1.5M steps beat RoBERTa by 0.2 F1 on SQuAD 1.1 dev and 0.8 F1 on SQuAD 2.0 dev. The ALBERT ensemble achieved 89.7 EM / 92.2 F1 on the SQuAD 2.0 test set.
The RACE (ReAding Comprehension from Examinations) benchmark tests machine reading comprehension using English-language questions collected from English exams administered to Chinese middle school and high school students.
| Model | RACE Test Accuracy |
|---|---|
| BERT-large | 72.0 |
| XLNet-large | 81.8 |
| RoBERTa-large | 83.2 |
| ALBERT (1.5M, single model) | 86.5 |
| ALBERT (ensemble) | 89.4 |
ALBERT's single model beat RoBERTa by 3.3 points, and the ensemble beat it by 6.2 points. The RACE improvement was the largest of all benchmarks, suggesting that ALBERT's SOP objective is especially helpful for tasks that require understanding multi-sentence coherence and discourse structure.
Using an ALBERT-large configuration (which has only 18M parameters regardless of depth due to parameter sharing), the authors evaluated the impact of varying the number of layers:
| Layers | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|
| 1 | 31.1/22.9 | 50.1/50.1 | 66.4 | 80.8 | 40.1 | 52.9 |
| 3 | 79.8/69.7 | 64.4/61.7 | 77.7 | 86.7 | 54.0 | 71.2 |
| 6 | 86.4/78.4 | 73.8/71.1 | 81.2 | 88.9 | 60.9 | 77.2 |
| 12 | 89.8/83.3 | 80.7/77.9 | 83.3 | 91.7 | 66.7 | 81.5 |
| 24 | 90.3/83.3 | 81.8/79.0 | 83.3 | 91.5 | 68.7 | 82.1 |
| 48 | 90.0/83.1 | 81.8/78.9 | 83.4 | 91.9 | 66.9 | 81.8 |
Performance improves sharply from 1 to 12 layers, gains taper off between 12 and 24 layers, and a 48-layer model shows slight degradation on some tasks. This indicates diminishing returns from depth when parameters are shared, and it explains why ALBERT-xxlarge uses 12 layers rather than 24.
Using a 3-layer ALBERT-large configuration, the authors tested increasing hidden sizes:
| Hidden Size | Parameters | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|
| 1,024 | 18M | 79.8/69.7 | 64.4/61.7 | 77.7 | 86.7 | 54.0 | 71.2 |
| 2,048 | 60M | 83.3/74.1 | 69.1/66.6 | 79.7 | 88.6 | 58.2 | 74.6 |
| 4,096 | 225M | 85.0/76.4 | 71.0/68.1 | 80.3 | 90.4 | 60.4 | 76.3 |
| 6,144 | 499M | 84.7/75.8 | 67.8/65.4 | 78.1 | 89.1 | 56.0 | 74.0 |
Performance increases from H = 1,024 to H = 4,096 but drops at H = 6,144 (499M parameters). The authors noted that this degradation may reflect optimization difficulty rather than insufficient model capacity, which is why H = 4,096 was selected for ALBERT-xxlarge.
For ALBERT-xxlarge, removing dropout improved the average dev score from 90.4 to 90.7 across all benchmarks. The authors observed that the model had not overfit the training data even after 1 million training steps, and noted that ALBERT's parameter sharing already provides an implicit regularization effect, making explicit dropout unnecessary and potentially harmful.
| Configuration | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|
| With dropout | 94.7/89.2 | 89.6/86.9 | 90.0 | 96.3 | 85.7 | 90.4 |
| Without dropout | 94.8/89.5 | 89.9/87.2 | 90.4 | 96.5 | 86.1 | 90.7 |
Using ALBERT-base, the paper also evaluated the effect of adding the extra training data used by XLNet and RoBERTa:
| Data | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|
| BookCorpus + Wikipedia only | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3 | 64.0 | 80.1 |
| With additional data | 88.8/81.7 | 79.1/76.3 | 82.4 | 92.8 | 66.0 | 80.8 |
Additional data improved performance on some tasks (MNLI, SST-2, RACE) but slightly decreased performance on SQuAD, resulting in a modest overall improvement of +0.7 in average score.
The following table situates ALBERT alongside other prominent models from the same era:
| Model | Year | Parameters | Key Innovation | GLUE Test Score (Ensemble) |
|---|---|---|---|---|
| BERT-large | 2018 | 334M | Masked LM + NSP pre-training | 80.5 |
| XLNet-large | 2019 | 360M | Permutation language modeling | 88.4 |
| RoBERTa-large | 2019 | 355M | Optimized BERT pre-training (no NSP) | 88.5 |
| ALBERT-xxlarge | 2019 | 235M | Factorized embeddings, parameter sharing, SOP | 89.4 |
| ELECTRA-large | 2020 | 335M | Replaced token detection | 89.4 |
| DeBERTa-large | 2020 | 350M | Disentangled attention + enhanced mask decoder | 90.0 |
ALBERT achieved a GLUE test score of 89.4, exceeding its 2019 contemporaries XLNet and RoBERTa while using significantly fewer parameters than any competitor; later models such as ELECTRA matched and DeBERTa surpassed this score.
Despite its parameter efficiency, ALBERT has several known limitations.
Inference speed is not proportionally faster. ALBERT reduces the number of stored parameters, but the computation graph during inference remains the same size. A 12-layer ALBERT model still performs 12 forward passes through the Transformer block; the fact that each pass reuses the same weights does not reduce the number of floating-point operations. As a result, ALBERT-xxlarge (235M parameters) is actually slower at inference than BERT-large (334M parameters) because its hidden size of 4,096 is four times larger than BERT-large's 1,024, requiring more computation per layer.
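The gap between stored parameters and compute can be estimated with a rough per-pass count. FLOPs per Transformer layer scale roughly with H² at a fixed sequence length (a back-of-the-envelope figure that ignores attention's sequence-length-dependent terms):

```python
def relative_compute(hidden_size, num_layers):
    # FLOPs per forward pass scale roughly as layers * H^2 for a fixed
    # sequence length (attention's O(seq^2) terms are ignored here).
    return num_layers * hidden_size ** 2

albert_xxlarge = relative_compute(4_096, 12)   # 235M stored parameters
bert_large = relative_compute(1_024, 24)       # 334M stored parameters

print(albert_xxlarge / bert_large)  # 8.0 — far more compute, far fewer weights
```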
Training the largest configurations is expensive. Although ALBERT-large trains 1.7 times faster than BERT-large, ALBERT-xxlarge trains approximately 3.3 times slower (0.3x relative speed). The large hidden dimension drives up computation costs even though the parameter count is smaller.
Performance at small scales lags behind BERT. ALBERT-base (12M parameters) achieves an average benchmark score of 80.1 compared to BERT-base's 82.3. The aggressive parameter sharing takes a toll at smaller model sizes where the single shared Transformer block does not have enough capacity to capture all the patterns that 12 independent blocks would learn. This makes ALBERT less attractive as a lightweight model for resource-constrained deployment compared to knowledge distillation-based approaches like DistilBERT.
Same inference memory for activations. While ALBERT uses less memory to store its model weights, the activation memory (the intermediate computation results that must be kept in memory during a forward pass) is determined by the hidden size and sequence length, not the number of unique parameters. ALBERT-xxlarge with its 4,096 hidden dimension uses more activation memory than BERT-large.
No reduction in computational FLOPs. Unlike methods such as knowledge distillation (e.g., DistilBERT) or pruning, ALBERT does not reduce the actual computation required during inference. For applications where inference latency is the primary concern, ALBERT offers less benefit than distillation-based approaches that achieve faster inference by using fewer layers.
Sensitivity to width scaling. The ablation study on hidden size shows that performance drops when increasing from H = 4,096 to H = 6,144, suggesting that very wide models with shared parameters may be difficult to optimize. This limits the potential for further scaling ALBERT to even larger hidden dimensions.
The ALBERT team released a second version (v2) of all model checkpoints with several training improvements. Version 2 models employ no dropout, additional training data, and longer training schedules, with ALBERT-base v2 trained for 10 million steps and larger models trained for 3 million steps. The removal of dropout was motivated by the observation that ALBERT models do not overfit the training data as quickly as BERT models, likely because the parameter-sharing mechanism itself acts as a regularizer. The v2 checkpoints became the standard versions available through the Hugging Face Transformers library.
ALBERT demonstrated that the number of stored parameters in a model and its representational capacity are not the same thing. By decoupling these two properties through factorized embeddings and parameter sharing, ALBERT showed that much smaller models could match or exceed the performance of their larger counterparts. This insight influenced subsequent work on efficient Transformers and parameter-efficient methods.
The factorized embedding technique has been adopted in various forms by later models. The idea of decomposing large embedding matrices into smaller factors appears in ELECTRA, DeBERTa, and other efficient Transformer architectures.
Cross-layer parameter sharing, while not universally adopted due to its inference speed limitations, inspired research into other forms of weight sharing and weight tying in deep learning. Universal Transformers, which were proposed independently around the same time, explored a similar idea of applying the same Transformer block iteratively. The concept of parameter efficiency also foreshadowed later work on parameter-efficient fine-tuning methods such as LoRA and adapters.
ALBERT's SOP objective proved that more carefully designed pre-training tasks could improve downstream performance. This line of thinking influenced subsequent work on contrastive and order-aware pre-training objectives.
The ALBERT codebase was open-sourced through the google-research/albert repository on GitHub, and pre-trained checkpoints for all configurations (v1 and v2) are available through the Hugging Face Transformers library under identifiers such as albert-base-v2, albert-large-v2, albert-xlarge-v2, and albert-xxlarge-v2. The model supports PyTorch, TensorFlow, and JAX/Flax backends.