# ALBERT

> Source: https://aiwiki.ai/wiki/albert
> Updated: 2026-06-21
> Categories: Deep Learning, Natural Language Processing, Transformer Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**ALBERT** (A Lite BERT) is a parameter-efficient variant of the [BERT](/wiki/bert) language model developed by researchers at [Google](/wiki/google) Research and the Toyota Technological Institute at Chicago (TTIC). Published as a conference paper at [ICLR](/wiki/iclr) 2020, ALBERT introduced two parameter-reduction techniques, factorized embedding parameterization and cross-layer parameter sharing, that together reduce the parameter count of a BERT-like model by up to 89% without proportional loss in performance.[1] The paper also replaced BERT's Next Sentence Prediction (NSP) pre-training objective with a more challenging Sentence Order Prediction (SOP) task that forces the model to learn finer-grained inter-sentence coherence.[1]

The ALBERT paper was authored by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Among them, Lan, Goodman, Sharma, and Soricut were affiliated with Google Research, while Chen and Gimpel were at TTIC. The paper was first posted to arXiv on September 26, 2019 (arXiv:1909.11942), and its final published version appeared at the International Conference on Learning Representations (ICLR) in April 2020.[1]

In its own words, the paper states: "we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT," and reports that "an ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster."[1] The configuration in question, ALBERT-large, matches BERT-large's hidden size and layer count but has 18M parameters against 334M. ALBERT-xxlarge, the largest configuration at 235M parameters, set new state-of-the-art results at the time of publication on the [GLUE](/wiki/glue_benchmark) benchmark (89.4), [SQuAD](/wiki/squad) 1.1 and 2.0, and the RACE reading comprehension benchmark, while still using fewer parameters than BERT-large.[1]

## Background and Motivation

Pre-trained language models grew rapidly in size during 2018 and 2019. [BERT](/wiki/bert)-large had 334M parameters, [GPT-2](/wiki/gpt-2) reached 1.5 billion, and models continued to scale upward.[2] While larger models generally improved downstream task accuracy, this growth brought practical problems: GPU and [TPU](/wiki/tpu) memory limits restricted the maximum model size that could be trained on available hardware, longer training times slowed research iteration, and, somewhat unexpectedly, simply making models bigger sometimes led to performance degradation rather than improvement.[1]

The ALBERT authors observed that existing approaches tied the size of the [word embedding](/wiki/word_embedding) layer directly to the hidden layer size, which was wasteful because word embeddings are context-independent while hidden states are context-dependent. They also noted that every [Transformer](/wiki/transformer) layer in BERT learned its own set of parameters, even though prior work had shown that Transformer layers often learn similar patterns across depths. These two observations motivated the core parameter-reduction techniques in ALBERT.[1]

A secondary motivation was the weakness of BERT's Next Sentence Prediction (NSP) objective. Research by the [RoBERTa](/wiki/roberta) team and others had already shown that NSP provided little or no benefit for downstream tasks.[3] The ALBERT authors hypothesized that NSP was too easy because it conflated two separate signals: topic prediction (whether two sentences are from the same document) and coherence prediction (whether the sentences are in a logical order). Topic prediction overlaps heavily with the [masked language modeling](/wiki/masked_language_model) (MLM) signal, making it redundant.[1]

## How is ALBERT structured?

ALBERT shares the same encoder-only [Transformer](/wiki/transformer) architecture as BERT. It uses multi-head [self-attention](/wiki/attention) layers followed by position-wise [feed-forward networks](/wiki/feedforward_neural_network_ffn), with [layer normalization](/wiki/layer_normalization) and residual connections at each sub-layer.[5] The key differences lie in how the embedding layer is structured and how parameters are shared across layers.

### What is factorized embedding parameterization?

In BERT, the word embedding dimension E is always equal to the hidden layer dimension H. This means that increasing the hidden size (to improve the model's representational capacity) also increases the embedding matrix proportionally. Since the vocabulary size V is typically 30,000, the embedding matrix alone accounts for V x H parameters, which can be substantial.[2]

The ALBERT authors argued that this coupling is unnecessary. Word embeddings encode context-independent representations of individual tokens, while hidden states encode context-dependent representations shaped by self-attention across the entire input sequence. The former requires less capacity than the latter. Therefore, ALBERT decouples the two sizes by setting E much smaller than H. The vocabulary is first projected into a low-dimensional embedding space of size E, and then a linear projection maps each E-dimensional embedding up to the H-dimensional hidden space.[1]

This factorization reduces the embedding parameter count from O(V x H) to O(V x E + E x H). When H is much larger than E, the savings are substantial. For example, with V = 30,000, H = 4,096, and E = 128, the embedding parameters drop from approximately 123 million (30,000 x 4,096) to approximately 4.4 million (30,000 x 128 + 128 x 4,096).[1] Google Research described this as "an 80% reduction in the parameters of the projection block."[13]
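The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch, not code from the ALBERT repository: the function name `embedding_params` is hypothetical, and the count covers only the token embedding (position and segment embeddings are ignored).

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count token-embedding parameters.

    With embedding_size=None the embedding width is tied to the hidden
    size (BERT-style, V x H). Otherwise the count is V x E for the lookup
    table plus E x H for the projection into the hidden space.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30_000, 4_096, 128
tied = embedding_params(V, H)            # 122,880,000
factorized = embedding_params(V, H, E)   # 4,364,288
print(f"tied: {tied:,}  factorized: {factorized:,}")
```

Because the V x E term dominates the factorized count, widening H adds only E parameters per extra hidden unit instead of V.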

The ALBERT paper experimented with embedding sizes of 64, 128, 256, and 768. Under the all-shared condition (where cross-layer parameter sharing is active), an embedding size of 128 yielded the best overall performance across downstream tasks.[1] The full results are shown below.

| Embedding Size (E) | Parameters (Not Shared) | Avg (Not Shared) | Parameters (All Shared) | Avg (All Shared) |
|---|---|---|---|---|
| 64 | 87M | 81.3 | 10M | 79.0 |
| 128 | 89M | 81.7 | 12M | 80.1 |
| 256 | 93M | 81.8 | 16M | 79.6 |
| 768 | 108M | 82.3 | 31M | 79.8 |

Without parameter sharing, larger embedding sizes consistently produced better results, with E = 768 achieving the highest average. With all-shared parameters, however, E = 128 provided the best average score. One plausible explanation is that sharing already constrains the model's capacity, so a smaller embedding space is sufficient in this regime. All subsequent ALBERT experiments use E = 128.[1]

### What is cross-layer parameter sharing?

The second technique is cross-layer parameter sharing, in which all Transformer layers use the same set of parameters. Instead of each of the L layers learning its own attention weights and feed-forward weights independently, ALBERT defines a single set of Transformer parameters and reuses them at every layer. This means that a 12-layer ALBERT model stores parameters for only one Transformer block, while at runtime the input passes through that same block 12 times.[1]

This approach prevents the total parameter count from growing with the depth of the network. A 12-layer and a 24-layer ALBERT model with the same hidden size have identical parameter counts; only the computation time increases with depth.[1] Google Research reported that this achieves "a 90% parameter reduction for the attention-feedforward block (a 70% reduction overall)."[13]
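A rough count makes the depth-independence concrete. The sketch below is an approximation under stated assumptions, not the paper's accounting: it counts only the weight matrices of each block (four H x H attention projections and a feed-forward network with intermediate size 4H) and ignores biases and layer normalization. The function name `encoder_matrix_params` is hypothetical.

```python
def encoder_matrix_params(hidden_size, num_layers, shared):
    """Approximate weight-matrix parameters in the Transformer encoder.

    Assumes four H x H attention projections (query, key, value, output)
    and two feed-forward matrices with intermediate size 4H, which gives
    12 * H^2 per block. Biases and layer-norm parameters are ignored.
    """
    per_block = 4 * hidden_size**2 + 2 * hidden_size * (4 * hidden_size)
    # With sharing, one block is stored no matter how many times it runs.
    return per_block if shared else num_layers * per_block


for layers in (12, 24):
    unshared = encoder_matrix_params(1_024, layers, shared=False)
    shared = encoder_matrix_params(1_024, layers, shared=True)
    print(f"{layers} layers: unshared={unshared:,}  shared={shared:,}")
```

With H = 1,024 the shared count stays at about 12.6M for both depths, while the unshared count doubles from 12 to 24 layers.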

The paper evaluated four sharing strategies to understand which components benefit most from sharing:

| Sharing Strategy | Parameters (E=128) | Parameters (E=768) | Avg Score (E=128) | Avg Score (E=768) |
|---|---|---|---|---|
| Not shared (BERT-style) | 89M | 108M | 81.6 | 82.3 |
| Shared attention only | 64M | 83M | 81.7 | 81.6 |
| Shared FFN only | 38M | 57M | 80.2 | 79.5 |
| All shared (ALBERT-style) | 12M | 31M | 80.1 | 79.8 |

Sharing only the attention parameters had a minimal effect on performance (and even improved the average slightly with E = 128, from 81.6 to 81.7). The larger degradation came from sharing the [feed-forward network](/wiki/feedforward_neural_network_ffn) (FFN) parameters. Sharing all parameters together still delivered reasonable accuracy at a fraction of the parameter count. The default ALBERT configuration shares all parameters, accepting a modest performance decrease in exchange for a dramatic reduction in parameter count.[1]

Cross-layer sharing also appears to stabilize the network. The authors measured the L2 distance and cosine similarity between the input and output embeddings of each layer and found that the transitions from layer to layer are much smoother in ALBERT than in BERT's independently parameterized layers. The two metrics nevertheless do not converge to zero even after 24 layers, so the embeddings oscillate rather than settling at a fixed point.[1]

### What is Sentence Order Prediction (SOP)?

ALBERT replaces BERT's Next Sentence Prediction (NSP) task with Sentence Order Prediction (SOP). In NSP, the model receives two segments and predicts whether they are consecutive sentences from the same document (positive) or whether the second segment was randomly sampled from a different document (negative).[2] The ALBERT authors argued that the negative examples in NSP are too easy because the model can distinguish them based on topic alone (different documents are almost always about different topics), without needing to understand inter-sentence coherence.[1]

SOP uses the same positive examples as NSP: two consecutive segments from the same document. However, its negative examples are created by simply swapping the order of those two segments. Because both the positive and negative examples come from the same document, the model cannot rely on topic differences and must instead learn the actual ordering relationship between segments. This forces the model to capture discourse-level coherence properties.[1]
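The construction of SOP examples can be sketched as follows. This is an illustrative sketch rather than ALBERT's data pipeline: the function name `make_sop_examples`, the 50/50 split between positives and negatives, and the label convention (1 for original order, 0 for swapped) are all assumptions.

```python
import random


def make_sop_examples(segments, rng):
    """Build SOP examples from consecutive segments of ONE document.

    Returns (first, second, label) tuples, where label 1 means the pair is
    in its original order and label 0 means the two segments were swapped.
    The 50/50 split and the label convention are assumptions.
    """
    examples = []
    for a, b in zip(segments, segments[1:]):
        if rng.random() < 0.5:
            examples.append((a, b, 1))   # positive: original order
        else:
            examples.append((b, a, 0))   # negative: same segments, swapped
    return examples


doc = ["She opened the door.", "The room was dark.", "She reached for the switch."]
for first, second, label in make_sop_examples(doc, random.Random(0)):
    print(label, "|", first, "->", second)
```

Both classes draw from the same document, so topic gives no hint about the label. An NSP negative would instead pair the first segment with text sampled from another document.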

The ablation results from the paper compare three settings using an ALBERT-base configuration: no sentence-level loss (as in [XLNet](/wiki/xlnet) and RoBERTa), NSP (as in BERT), and SOP (as in ALBERT).[1]

| Pre-training Objective | MLM Accuracy | NSP Accuracy | SOP Accuracy | SQuAD 1.1 | SQuAD 2.0 | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|---|---|
| None | 54.9 | 52.4 | 53.3 | 88.6/81.5 | 78.1/75.3 | 81.5 | 89.9 | 61.7 | 79.0 |
| NSP | 54.5 | 90.5 | 52.0 | 88.4/81.5 | 77.2/74.6 | 81.6 | 91.1 | 62.3 | 79.2 |
| SOP | 54.0 | 78.9 | 86.5 | 89.3/82.3 | 80.0/77.1 | 82.0 | 90.3 | 64.0 | 80.1 |

A model trained with NSP achieves 90.5% accuracy on the NSP task but performs at chance level (52.0%) on SOP, indicating that NSP mostly learns the easier topic-prediction signal. Conversely, a model trained with SOP can still solve the NSP task reasonably well (78.9%) while also achieving 86.5% accuracy on the harder SOP task. On downstream benchmarks, SOP outperforms NSP on four of the five tasks, with gains of +0.9 on SQuAD 1.1 (89.3 vs. 88.4), +2.8 on SQuAD 2.0 (80.0 vs. 77.2), +0.4 on MNLI, and +1.7 on RACE; SST-2 is the exception (90.3 vs. 91.1). The average score rises to 80.1, which is +0.9 over NSP (79.2) and +1.1 over training with no sentence-level loss (79.0).[1]

## What are the ALBERT model configurations?

ALBERT comes in four standard sizes. All configurations use a vocabulary of 30,000 tokens processed by a [SentencePiece](/wiki/sentencepiece) tokenizer and an embedding dimension E of 128.[1]

| Model | Hidden Size (H) | Layers (L) | Attention Heads | Intermediate Size | Parameters |
|---|---|---|---|---|---|
| ALBERT-base | 768 | 12 | 12 | 3,072 | 12M |
| ALBERT-large | 1,024 | 24 | 16 | 4,096 | 18M |
| ALBERT-xlarge | 2,048 | 24 | 32 | 8,192 | 60M |
| ALBERT-xxlarge | 4,096 | 12 | 64 | 16,384 | 235M |

For comparison with BERT:

| Model | Hidden Size (H) | Embedding Size (E) | Layers | Parameters | Parameter Sharing |
|---|---|---|---|---|---|
| BERT-base | 768 | 768 | 12 | 108M | No |
| BERT-large | 1,024 | 1,024 | 24 | 334M | No |
| ALBERT-base | 768 | 128 | 12 | 12M | Yes |
| ALBERT-large | 1,024 | 128 | 24 | 18M | Yes |
| ALBERT-xlarge | 2,048 | 128 | 24 | 60M | Yes |
| ALBERT-xxlarge | 4,096 | 128 | 12 | 235M | Yes |

ALBERT-base has 12M parameters compared to BERT-base's 108M, a reduction of roughly 89%. ALBERT-large has 18M parameters compared to BERT-large's 334M, roughly 18 times fewer. Even the largest configuration, ALBERT-xxlarge with 235M parameters, is about 70% the size of BERT-large while using a hidden size four times larger (4,096 vs. 1,024).[1]

Notably, ALBERT-xxlarge uses only 12 layers rather than 24 because the authors found that a 24-layer version achieved similar results (88.7 average on dev benchmarks for both configurations) but was computationally more expensive. Reducing the depth to 12 layers halved the computation cost without sacrificing accuracy.[1]

Because ALBERT-large matches BERT-large in both hidden size (1,024) and layer count (24), its roughly 18-fold parameter reduction comes entirely from factorized embeddings and cross-layer parameter sharing.[1]

## How was ALBERT pre-trained?

### Training Data

ALBERT was pre-trained on the same data as BERT: the [BookCorpus](/wiki/bookcorpus) (approximately 800 million words from 11,038 unpublished books) and English Wikipedia (approximately 2,500 million words, with lists, tables, and headers removed). The combined training corpus contains roughly 16 GB of uncompressed text. For the state-of-the-art comparison experiments, ALBERT was also trained on the additional data used by XLNet and RoBERTa.[1]

### Training Objectives

Like BERT, ALBERT uses masked language modeling (MLM) as its primary pre-training objective. Fifteen percent of input tokens are randomly selected; of those, 80% are replaced with a `[MASK]` token, 10% are replaced with a random token, and 10% are left unchanged. The model is trained to predict the original token at each selected position.[2]
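The 80/10/10 corruption rule can be sketched in a few lines. This is a simplified illustration, not the BERT or ALBERT preprocessing code: the function name `mask_tokens` is hypothetical, and each position is selected independently with probability 0.15 rather than from a fixed budget of masked positions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply BERT-style 80/10/10 corruption to a token list.

    Returns (inputs, labels); labels[i] holds the original token at
    selected positions and None elsewhere. Independent per-position
    selection is a simplifying assumption.
    """
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                       # position not selected
        labels[i] = token                  # model must predict the original
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(0))
print(inputs)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`, including the positions that were left unchanged.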

ALBERT uses n-gram masking with a maximum n-gram length of 3. The probability of selecting an n-gram of length n decreases with n, encouraging shorter masked spans while still allowing the model to learn to predict multi-token phrases.[1]
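The ALBERT paper specifies this length distribution as `p(n) = (1/n) / (1/1 + 1/2 + ... + 1/N)`, where N is the maximum n-gram length. The sketch below evaluates it for N = 3; the function name `ngram_length_probs` is hypothetical.

```python
from fractions import Fraction


def ngram_length_probs(max_n=3):
    """Return {n: p(n)} with p(n) proportional to 1/n for n = 1..max_n."""
    weights = {n: Fraction(1, n) for n in range(1, max_n + 1)}
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}


for n, p in ngram_length_probs().items():
    print(n, p, round(float(p), 3))
# 1 6/11 0.545
# 2 3/11 0.273
# 3 2/11 0.182
```

Under this distribution single tokens are masked a little more than half the time, and three-token spans are chosen about 18% of the time.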

The second objective is the SOP task described above. Together, the MLM and SOP losses are combined and optimized jointly during pre-training.[1]

### Training Details

ALBERT uses a maximum input length of 512 tokens and is optimized with the LAMB optimizer.[12] The learning rate is set to 0.00176 with a [batch size](/wiki/batch_size) of 4,096. Models are trained for 125,000 steps on Cloud TPU V3 hardware, with the number of TPUs ranging from 64 (for smaller models) to 512 (for ALBERT-xxlarge).[1]
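For reference, these hyperparameters can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical and do not correspond to flags in the ALBERT codebase, and the derived count is plain arithmetic on the reported values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    """Reported ALBERT pre-training settings (field names are hypothetical)."""
    max_seq_length: int = 512
    optimizer: str = "lamb"
    learning_rate: float = 0.00176
    batch_size: int = 4_096
    train_steps: int = 125_000

    @property
    def sequences_seen(self):
        # Total training sequences processed: batch size times step count.
        return self.batch_size * self.train_steps


cfg = PretrainConfig()
print(f"{cfg.sequences_seen:,} sequences")  # 512,000,000 sequences
```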

### Training Speed

Because ALBERT has far fewer parameters, each training iteration is faster in terms of communication overhead (less data needs to be synchronized across devices) and memory consumption. The paper reports the following relative training speeds measured against BERT-large as a baseline:[1]

| Model | Parameters | Training Speedup (vs. BERT-large) |
|---|---|---|
| BERT-base | 108M | 4.7x |
| BERT-large | 334M | 1.0x (baseline) |
| ALBERT-base | 12M | 5.6x |
| ALBERT-large | 18M | 1.7x |
| ALBERT-xlarge | 60M | 0.6x |
| ALBERT-xxlarge | 235M | 0.3x |

ALBERT-large is 1.7 times faster than BERT-large per training step despite matching BERT-large's depth and hidden size. However, ALBERT-xlarge and ALBERT-xxlarge are slower than BERT-large because their much larger hidden sizes (2,048 and 4,096 respectively) increase the computation per layer significantly, even though the total stored parameters are smaller.[1]

An important training efficiency result is that ALBERT-xxlarge reaches higher accuracy than BERT-large in less wall-clock time:[1]

| Model | Training Steps | Training Time | SQuAD 1.1 | SQuAD 2.0 | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|---|
| BERT-large | 400k | 34h | 93.5/87.4 | 86.9/84.3 | 87.8 | 94.6 | 77.3 | 87.2 |
| ALBERT-xxlarge | 125k | 32h | 94.0/88.1 | 88.3/85.3 | 87.8 | 95.4 | 82.5 | 88.7 |

With similar training time, ALBERT-xxlarge outperforms BERT-large by 1.5 points on average. The improvement is especially pronounced on RACE (+5.2 points), which requires reasoning over longer passages.[1]

## How did ALBERT perform on benchmarks?

### Overall Comparison with BERT

The following table compares ALBERT configurations against BERT on five representative tasks, all trained on the same data (BookCorpus + Wikipedia):[1]

| Model | Parameters | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|
| BERT-base | 108M | 90.4/83.2 | 80.4/77.6 | 84.5 | 92.8 | 68.2 | 82.3 |
| BERT-large | 334M | 92.2/85.5 | 85.0/82.2 | 86.6 | 93.0 | 73.9 | 85.2 |
| ALBERT-base | 12M | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3 | 64.0 | 80.1 |
| ALBERT-large | 18M | 90.6/83.9 | 82.3/79.4 | 83.5 | 91.7 | 68.5 | 82.4 |
| ALBERT-xlarge | 60M | 92.5/86.1 | 86.1/83.1 | 86.4 | 92.4 | 74.8 | 85.5 |
| ALBERT-xxlarge | 235M | 94.1/88.3 | 88.1/85.1 | 88.0 | 95.2 | 82.3 | 88.7 |

ALBERT-xxlarge improves on BERT-large on every task: +1.9 F1 on SQuAD 1.1 (94.1 vs. 92.2), +3.1 F1 on SQuAD 2.0 (88.1 vs. 85.0), +1.4 on MNLI, +2.2 on SST-2, and +8.4 on RACE. ALBERT-xlarge, with only 60M parameters (roughly 18% of BERT-large's count), already matches or exceeds BERT-large on SQuAD 1.1, SQuAD 2.0, and RACE.[1]

ALBERT-large, with only 18M parameters, matches BERT-base (108M parameters) in average score (82.4 vs. 82.3).[1]

### GLUE Benchmark

On the [GLUE](/wiki/glue_benchmark) benchmark, the authors compared ALBERT-xxlarge (trained for 1M and 1.5M steps with additional data) against single-model results from BERT-large, XLNet-large, and RoBERTa-large on the development set:[6]

| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B |
|---|---|---|---|---|---|---|---|---|
| BERT-large | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
| XLNet-large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |
| RoBERTa-large | 90.2 | 94.7 | 92.2 | 86.6 | 96.4 | 90.9 | 68.0 | 92.4 |
| ALBERT-xxlarge (1M) | 90.4 | 95.2 | 92.0 | 88.1 | 96.8 | 90.2 | 68.7 | 92.7 |
| ALBERT-xxlarge (1.5M) | 90.8 | 95.3 | 92.2 | 89.2 | 96.9 | 90.9 | 71.4 | 93.0 |

ALBERT-xxlarge at 1.5M training steps matched or outperformed all competing single models on every GLUE dev task, tying RoBERTa-large on QQP (92.2) and MRPC (90.9) and leading on the rest. The improvements were especially large on RTE (+2.6 over RoBERTa, +5.4 over XLNet, +18.8 over BERT-large) and CoLA (+3.4 over RoBERTa, +7.8 over XLNet, +10.8 over BERT-large).[1]

On the GLUE test set using ensemble models, ALBERT achieved an overall score of 89.4, setting a new state-of-the-art at the time of publication:[1]

| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B | WNLI | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| XLNet | 90.2 | 98.6 | 90.3 | 86.3 | 96.8 | 93.0 | 67.8 | 91.6 | 90.4 | 88.4 |
| RoBERTa | 90.8 | 98.9 | 90.2 | 88.2 | 96.7 | 92.3 | 67.8 | 92.2 | 89.0 | 88.5 |
| ALBERT | 91.3 | 99.2 | 90.5 | 89.2 | 97.1 | 93.4 | 69.1 | 92.5 | 91.8 | 89.4 |

### SQuAD Results

On the Stanford Question Answering Dataset ([SQuAD](/wiki/squad)), ALBERT-xxlarge set new records:[1]

| Model | SQuAD 1.1 Dev (F1/EM) | SQuAD 2.0 Dev (F1/EM) | SQuAD 2.0 Test (F1/EM) |
|---|---|---|---|
| BERT-large | 90.9/84.1 | 81.8/79.0 | 89.1/86.3 |
| XLNet-large | 94.5/89.0 | 88.8/86.1 | 89.1/86.3 |
| RoBERTa-large | 94.6/88.9 | 89.4/86.5 | 89.8/86.8 |
| ALBERT (1M steps) | 94.8/89.2 | 89.9/87.2 | - |
| ALBERT (1.5M steps) | 94.8/89.3 | 90.2/87.4 | 90.9/88.1 |
| ALBERT (ensemble) | - | - | 92.2/89.7 |

The single-model ALBERT at 1.5M steps beat RoBERTa by 0.2 F1 and 0.4 EM on SQuAD 1.1 dev, and by 0.8 F1 and 0.9 EM on SQuAD 2.0 dev. The ALBERT ensemble achieved 92.2 F1 / 89.7 EM on the SQuAD 2.0 test set.[1]

### RACE Results

The RACE (ReAding Comprehension from Examinations) benchmark tests machine reading comprehension using English exam questions collected from Chinese middle school and high school examinations.[9]

| Model | RACE Test Accuracy |
|---|---|
| BERT-large | 72.0 |
| XLNet-large | 81.8 |
| RoBERTa-large | 83.2 |
| ALBERT (1.5M, single model) | 86.5 |
| ALBERT (ensemble) | 89.4 |

ALBERT's single model beat RoBERTa by 3.3 points, and the ensemble beat it by 6.2 points. The RACE improvement was the largest of all benchmarks, suggesting that ALBERT's SOP objective is especially helpful for tasks that require understanding multi-sentence coherence and discourse structure.[1]

## Ablation Studies

### Effect of Network Depth

Using an ALBERT-large configuration (which has only 18M parameters regardless of depth due to parameter sharing), the authors evaluated the impact of varying the number of layers:[1]

| Layers | SQuAD 1.1 (F1/EM) | SQuAD 2.0 (F1/EM) | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|
| 1 | 31.1/22.9 | 50.1/50.1 | 66.4 | 80.8 | 40.1 | 52.9 |
| 3 | 79.8/69.7 | 64.4/61.7 | 77.7 | 86.7 | 54.0 | 71.2 |
| 6 | 86.4/78.4 | 73.8/71.1 | 81.2 | 88.9 | 60.9 | 77.2 |
| 12 | 89.8/83.3 | 80.7/77.9 | 83.3 | 91.7 | 66.7 | 81.5 |
| 24 | 90.3/83.3 | 81.8/79.0 | 83.3 | 91.5 | 68.7 | 82.1 |
| 48 | 90.0/83.1 | 81.8/78.9 | 83.4 | 91.9 | 66.9 | 81.8 |

Performance improves sharply from 1 to 12 layers, gains taper off between 12 and 24 layers, and a 48-layer model shows slight degradation on some tasks. This indicates diminishing returns from depth when parameters are shared, which is consistent with the choice of 12 layers rather than 24 for ALBERT-xxlarge.[1]

### Effect of Hidden Size

Using a 3-layer ALBERT-large configuration, the authors tested increasing hidden sizes:[1]

| Hidden Size | Parameters | SQuAD 1.1 | SQuAD 2.0 | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|---|
| 1,024 | 18M | 79.8/69.7 | 64.4/61.7 | 77.7 | 86.7 | 54.0 | 71.2 |
| 2,048 | 60M | 83.3/74.1 | 69.1/66.6 | 79.7 | 88.6 | 58.2 | 74.6 |
| 4,096 | 225M | 85.0/76.4 | 71.0/68.1 | 80.3 | 90.4 | 60.4 | 76.3 |
| 6,144 | 499M | 84.7/75.8 | 67.8/65.4 | 78.1 | 89.1 | 56.0 | 74.0 |

Performance increases from H = 1,024 to H = 4,096 but drops at H = 6,144 (499M parameters), demonstrating model degradation. The authors noted that this may be related to optimization difficulty rather than model capacity, which is why H = 4,096 was selected for ALBERT-xxlarge.[1]

### Removing Dropout

For ALBERT-xxlarge, removing [dropout](/wiki/dropout) improved the average dev score from 90.4 to 90.7 across all benchmarks. The authors observed that the model had not overfit the training data even after 1 million training steps, and noted that ALBERT's parameter sharing already provides an implicit regularization effect, making explicit dropout unnecessary and potentially harmful.[1]

| Configuration | SQuAD 1.1 | SQuAD 2.0 | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|
| With dropout | 94.7/89.2 | 89.6/86.9 | 90.0 | 96.3 | 85.7 | 90.4 |
| Without dropout | 94.8/89.5 | 89.9/87.2 | 90.4 | 96.5 | 86.1 | 90.7 |

### Additional Training Data

Using ALBERT-base, the paper also evaluated the effect of adding the extra training data used by XLNet and RoBERTa:[1]

| Data | SQuAD 1.1 | SQuAD 2.0 | MNLI | SST-2 | RACE | Avg |
|---|---|---|---|---|---|---|
| BookCorpus + Wikipedia only | 89.3/82.3 | 80.0/77.1 | 81.6 | 90.3 | 64.0 | 80.1 |
| With additional data | 88.8/81.7 | 79.1/76.3 | 82.4 | 92.8 | 66.0 | 80.8 |

Additional data improved performance on some tasks (MNLI, SST-2, RACE) but slightly decreased performance on SQuAD, resulting in a modest overall improvement of +0.7 in average score.[1]

## How does ALBERT compare with related models?

The following table situates ALBERT alongside other prominent models from the same era:

| Model | Year | Parameters | Key Innovation | GLUE Test Score (Ensemble) |
|---|---|---|---|---|
| [BERT](/wiki/bert)-large | 2018 | 334M | Masked LM + NSP pre-training | 80.5 |
| [XLNet](/wiki/xlnet)-large | 2019 | 360M | Permutation language modeling | 88.4 |
| [RoBERTa](/wiki/roberta)-large | 2019 | 355M | Optimized BERT pre-training (no NSP) | 88.5 |
| ALBERT-xxlarge | 2019 | 235M | Factorized embeddings, parameter sharing, SOP | 89.4 |
| [ELECTRA](/wiki/electra)-large | 2020 | 335M | Replaced token detection | 89.4 |
| [DeBERTa](/wiki/deberta)-large | 2020 | 350M | Disentangled attention + enhanced mask decoder | 90.0 |

ALBERT achieved a GLUE test score of 89.4, matching or exceeding all contemporaneous models while using significantly fewer parameters than any competitor.[1]

## What are ALBERT's limitations?

Despite its parameter efficiency, ALBERT has several known limitations.

**[Inference](/wiki/inference) speed is not proportionally faster.** ALBERT reduces the number of stored parameters, but the computation graph during inference remains the same size. A 12-layer ALBERT model still performs 12 forward passes through the Transformer block; the fact that each pass reuses the same weights does not reduce the number of floating-point operations. As a result, ALBERT-xxlarge (235M parameters) is actually slower at inference than BERT-large (334M parameters) because its hidden size of 4,096 is four times larger than BERT-large's 1,024, requiring more computation per layer.[1]
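A back-of-the-envelope estimate illustrates the point. The sketch below is a rough approximation, not a measurement: it assumes 12 x H^2 weights per block (attention projections plus a 4H feed-forward network), counts 2 FLOPs per weight per token, and ignores attention-score computation, embeddings, and biases. The function name `matmul_flops_per_token` is hypothetical.

```python
def matmul_flops_per_token(hidden_size, num_layers):
    """Rough forward-pass matrix-multiply cost per token, in FLOPs.

    Assumes 12 * H^2 weights per block and 2 FLOPs per weight. Weight
    sharing does not appear in the formula because a shared block is
    still executed once per layer.
    """
    return 2 * 12 * hidden_size**2 * num_layers


bert_large = matmul_flops_per_token(1_024, 24)
albert_xxlarge = matmul_flops_per_token(4_096, 12)
print(f"BERT-large:     {bert_large:,}")      # 603,979,776
print(f"ALBERT-xxlarge: {albert_xxlarge:,}")  # 4,831,838,208
print(f"ratio: {albert_xxlarge / bert_large:.1f}x")  # 8.0x
```

Under these assumptions ALBERT-xxlarge performs about eight times the matrix-multiply work of BERT-large per token despite storing fewer parameters. Measured speed differences depend on hardware and implementation, so this is an order-of-magnitude guide only.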

**Training the largest configurations is expensive.** Although ALBERT-large trains 1.7 times faster than BERT-large, ALBERT-xxlarge trains approximately 3.3 times slower (0.3x relative speed). The large hidden dimension drives up computation costs even though the parameter count is smaller.[1]

**Performance at small scales lags behind BERT.** ALBERT-base (12M parameters) achieves an average benchmark score of 80.1 compared to BERT-base's 82.3. The aggressive parameter sharing takes a toll at smaller model sizes where the single shared Transformer block does not have enough capacity to capture all the patterns that 12 independent blocks would learn. This makes ALBERT less attractive as a lightweight model for resource-constrained deployment compared to [knowledge distillation](/wiki/knowledge_distillation)-based approaches like [DistilBERT](/wiki/distilbert).[11]

**Same inference memory for activations.** While ALBERT uses less memory to store its model weights, the activation memory (the intermediate computation results that must be kept in memory during a forward pass) is determined by the hidden size and sequence length, not the number of unique parameters. ALBERT-xxlarge with its 4,096 hidden dimension uses more activation memory than BERT-large.[1]

**No reduction in computational FLOPs.** Unlike methods such as knowledge distillation (e.g., DistilBERT) or pruning, ALBERT does not reduce the actual computation required during inference. For applications where inference latency is the primary concern, ALBERT offers less benefit than distillation-based approaches that achieve faster inference by using fewer layers.[11]

**Sensitivity to width scaling.** The ablation study on hidden size shows that performance drops when increasing from H = 4,096 to H = 6,144, suggesting that very wide models with shared parameters may be difficult to optimize. This limits the potential for further scaling ALBERT to even larger hidden dimensions.[1]

## Version 2 Improvements

The ALBERT team released a second version (v2) of all model checkpoints with several training improvements. Version 2 models employ no dropout, additional training data, and longer training schedules, with ALBERT-base v2 trained for 10 million steps and larger models trained for 3 million steps. The removal of dropout was motivated by the observation that ALBERT models do not overfit the training data as quickly as BERT models, likely because the parameter-sharing mechanism itself acts as a regularizer.[1] The v2 checkpoints became the standard versions available through the [Hugging Face](/wiki/hugging_face) Transformers library.

## What is ALBERT's impact and legacy?

ALBERT demonstrated that the number of stored parameters in a model and its representational capacity are not the same thing. By decoupling these two properties through factorized embeddings and parameter sharing, ALBERT showed that much smaller models could match or exceed the performance of their larger counterparts.[1] This insight influenced subsequent work on efficient [Transformers](/wiki/transformer) and parameter-efficient methods.

The factorized embedding technique has been adopted in various forms by later models. The idea of decomposing large embedding matrices into smaller factors appears in [ELECTRA](/wiki/electra), [DeBERTa](/wiki/deberta), and other efficient Transformer architectures.[10]

Cross-layer parameter sharing, while not universally adopted due to its inference speed limitations, inspired research into other forms of weight sharing and weight tying in [deep learning](/wiki/deep_learning). The idea itself was not new: Universal Transformers (Dehghani et al., 2018), which predate ALBERT and are cited in the ALBERT paper as a related approach, had already explored applying the same Transformer block iteratively.[1] The concept of parameter efficiency also foreshadowed later work on [parameter-efficient fine-tuning](/wiki/parameter_efficient_fine_tuning) methods such as [LoRA](/wiki/lora) and adapters.

ALBERT's SOP objective proved that more carefully designed pre-training tasks could improve downstream performance. This line of thinking influenced subsequent work on contrastive and order-aware pre-training objectives.

### Is ALBERT open source?

Yes. The ALBERT codebase was open-sourced through the [google-research/albert repository](https://github.com/google-research/albert) on GitHub, and pre-trained checkpoints for all configurations (v1 and v2) are available through the Hugging Face Transformers library under identifiers such as `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, and `albert-xxlarge-v2`. The model supports [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), and [JAX](/wiki/jax)/Flax backends.

## References

1. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2020). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." *Proceedings of the International Conference on Learning Representations (ICLR 2020)*. arXiv:1909.11942.
2. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: [Pre-training](/wiki/pre-training) of Deep Bidirectional Transformers for Language Understanding." *Proceedings of NAACL-HLT 2019*, pp. 4171-4186. arXiv:1810.04805.
3. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv:1907.11692.
4. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., & Le, Q.V. (2019). "XLNet: Generalized Autoregressive Pretraining for Language Understanding." *Advances in Neural Information Processing Systems 32 ([NeurIPS](/wiki/neurips) 2019)*. arXiv:1906.08237.
5. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). "[Attention Is All You Need](/wiki/attention_is_all_you_need)." *Advances in Neural Information Processing Systems 30 (NeurIPS 2017)*. arXiv:1706.03762.
6. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S.R. (2019). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding." *Proceedings of ICLR 2019*. arXiv:1804.07461.
7. Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). "SQuAD: 100,000+ Questions for Machine Comprehension of Text." *Proceedings of EMNLP 2016*. arXiv:1606.05250.
8. Rajpurkar, P., Jia, R., & Liang, P. (2018). "Know What You Don't Know: Unanswerable Questions for SQuAD." *Proceedings of ACL 2018*. arXiv:1806.03822.
9. Lai, G., Xie, Q., Liu, H., Yang, Y., & Hovy, E. (2017). "RACE: Large-scale ReAding Comprehension Dataset From Examinations." *Proceedings of EMNLP 2017*. arXiv:1704.04683.
10. Clark, K., Luong, M.-T., Le, Q.V., & Manning, C.D. (2020). "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators." *Proceedings of ICLR 2020*. arXiv:2003.10555.
11. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv:1910.01108.
12. You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., Keutzer, K., & Hsieh, C.J. (2020). "Large Batch Optimization for Deep Learning: Training BERT in 76 Minutes." *Proceedings of ICLR 2020*. arXiv:1904.00962.
13. Google Research. "ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations." Google Research Blog. https://research.google/blog/albert-a-lite-bert-for-self-supervised-learning-of-language-representations/