ALiBi (Attention with Linear Biases)
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 6,047 words
ALiBi (Attention with Linear Biases) is a positional encoding method for transformer language models introduced by Ofir Press, Noah A. Smith, and Mike Lewis in the paper "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation," released as arXiv preprint 2108.12409 in August 2021 and presented as a poster at the International Conference on Learning Representations (ICLR) 2022.[1][2] Rather than adding positional embeddings to word vectors at the input layer, ALiBi modifies the attention computation directly by adding a static, non-learned negative bias to query-key attention scores that grows linearly with the distance between the query and key tokens. The bias is scaled by a head-specific slope, allowing different attention heads to attend over different effective ranges. The method was specifically designed to enable trained transformer language models to extrapolate at inference time to sequences considerably longer than those seen during training.
ALiBi played an important role in the practical deployment of long-context open language models between 2022 and 2024. It was adopted by the BigScience workshop's BLOOM 176B model,[3] MosaicML's MPT family (MPT-7B and MPT-30B),[4][5] BloombergGPT,[6] and Replit's code generation models.[7] However, subsequent empirical work has shown that ALiBi's extrapolation capabilities are more limited than the original paper suggested, and most modern long-context models have shifted toward Rotary Position Embedding (RoPE) combined with context-extension techniques such as positional interpolation and YaRN.[8][9]
The transformer architecture introduced by Vaswani et al. in 2017 uses attention as its only mechanism for combining information across positions in a sequence. Because attention treats its inputs as an unordered set, transformers require an explicit mechanism to provide information about token order. The original transformer paper proposed two options: a learned absolute position embedding for each position index, and a fixed sinusoidal embedding parameterised so that relative offsets are linear functions of the absolute embeddings. Both approaches add a position-dependent vector to the input token embedding before the first attention layer.[10]
These additive positional embeddings share a critical limitation: they assume a fixed maximum sequence length. A model trained with positions 0 through L-1 has not learned vectors for positions L and beyond, and its quality degrades sharply when applied to longer inputs. Even sinusoidal embeddings, which are deterministic functions of position and therefore well-defined for any position, do not extrapolate well in practice. Press and colleagues showed empirically that the WikiText-103 model of Baevski and Auli (2018), using sinusoidal embeddings and trained on subsequences of L = 512 or L = 1024 tokens, improves perplexity for only the first 20 to 50 additional tokens beyond training length, then begins to degrade.[1] This problem became increasingly pressing as language model applications demanded longer context windows than were practical to use throughout pretraining due to the quadratic compute and memory cost of self-attention.
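As a minimal sketch of why the sinusoidal scheme is well-defined at any position, the helper below follows the standard formulation from the original transformer paper. The function name and the default base of 10,000 are illustrative choices, and the code shows only that the vectors can be computed, not how well a trained model handles them.

```python
import math

def sinusoidal_embedding(pos, d_model, base=10000.0):
    """Return the sinusoidal position vector for one position (illustrative helper)."""
    vec = []
    for i in range(0, d_model, 2):
        # Each sin/cos pair shares one frequency, base^(-i / d_model).
        angle = pos / (base ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

# Defined for any position, including ones far beyond a training length of 512.
inside = sinusoidal_embedding(100, 8)
beyond = sinusoidal_embedding(5000, 8)
```

Both calls succeed because nothing in the formula depends on a maximum length; the extrapolation failure described above arises from the model never having seen those vectors during training.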
Several alternatives that move the positional signal from the input layer into the attention computation were developed in the late 2010s and early 2020s. Shaw et al. (2018) proposed relative position representations that add learned vectors to the keys based on the offset between query and key.[11] T5 (Raffel et al. 2020) introduced a simpler relative position bias: a learned scalar added to each attention score, indexed by a bucketed function of the query-key offset, shared across layers.[12] Rotary Position Embedding (RoFormer, Su et al. 2021) instead rotates query and key vectors by an angle that depends on absolute position, so that their inner product depends only on relative position.[8] ALiBi takes a similar conceptual approach of biasing attention scores, but eliminates all learned parameters from the positional mechanism.
Ofir Press completed his PhD at the Paul G. Allen School for Computer Science and Engineering at the University of Washington under Noah A. Smith and held a concurrent research role at Facebook AI Research while developing ALiBi. Press had previously co-authored the paper that introduced weight tying (a technique later used in GPT and BERT) and a series of papers on transformer architecture, including "Shortformer: Better Language Modeling Using Shorter Inputs" (ACL 2021), which introduced Position-Infused Attention and demonstrated benefits from training on shorter inputs.[13][14] Mike Lewis, the third author, is a research scientist at Facebook AI Research and lead author on BART.[15] The trio had previously collaborated on Shortformer, and ALiBi can be read as the natural continuation of that line: where Shortformer accelerated training via shorter input subsequences with a relative position scheme, ALiBi eliminates the position embedding entirely.[1][14]
For a causal language model, ALiBi modifies the attention computation so that, for the ith query vector q_i and the matrix K of keys for positions 1 through i, the attention weights are computed as:
softmax(q_i · K^T + m · [-(i-1), ..., -2, -1, 0])
where m is a scalar slope specific to the attention head and the bias vector linearly penalises attention to tokens further in the past, with the strength of the penalty controlled by m.[1] After the softmax, this static bias produces an effective exponential decay in attention weight as a function of the query-key distance, with the rate of decay controlled by m. Press et al. emphasise in the paper that the bias is added directly to the attention scores after the query-key dot product and is not multiplied by the standard 1/sqrt(d_k) scaling factor.[1]
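The computation for a single query can be sketched as follows. This is a minimal illustration, assuming the content scores q_i · k_j have already been computed; the function name and the example slope are hypothetical.

```python
import math

def alibi_attention_weights(scores, slope):
    """Causal attention weights for the last query, given content scores q_i . k_j.

    scores[j] is the content score for key j (oldest first). The query sits at
    the last position, so key j is at distance len(scores) - 1 - j.
    """
    n = len(scores)
    biased = [s - slope * (n - 1 - j) for j, s in enumerate(scores)]
    peak = max(biased)                       # subtract the max for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# With identical content scores, only the distance penalty shapes the weights.
weights = alibi_attention_weights([0.0] * 6, slope=0.5)
ratios = [weights[j] / weights[j + 1] for j in range(5)]  # each equals exp(-slope)
```

The constant ratio between neighbouring weights is the exponential decay mentioned above: each extra token of distance multiplies the weight by exp(-m) when the content scores are equal.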
The ALiBi bias is a deterministic function of distance only; it contains zero learned parameters and depends on no run-time state. The bias matrix can be precomputed once for any sequence length, and crucially it is well-defined for sequence lengths longer than those seen during training. This is the architectural feature that enables ALiBi's length extrapolation: at inference time the model can be evaluated on a context of any length simply by extending the bias matrix.[1] Press et al. report that ALiBi adds no operations to the network (the bias can be folded into the existing causal mask), incurs no runtime penalty during training compared to a sinusoidal model on the same input length, and requires only a small additional memory cost of up to roughly 100 MB to store the per-head, per-position bias mask of size n by L by L (where n is the number of heads).[1]
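A short sketch of the precomputed bias for one head, and of the memory a dense mask would occupy. The helper names are illustrative, and the 4-byte element size is an assumption rather than a figure from the paper.

```python
def alibi_bias_matrix(slope, length):
    """Bias for one head: entry [i][j] is -slope * (i - j) for j <= i.

    Entries with j > i are set to -inf, folding the bias into the causal mask.
    """
    neg_inf = float("-inf")
    return [[-slope * (i - j) if j <= i else neg_inf for j in range(length)]
            for i in range(length)]

def bias_mask_bytes(num_heads, length, bytes_per_value=4):
    """Memory for a dense num_heads x length x length mask (assumes 4-byte floats)."""
    return num_heads * length * length * bytes_per_value

short = alibi_bias_matrix(0.25, 4)
longer = alibi_bias_matrix(0.25, 8)
# The longer matrix extends the shorter one without changing existing entries.
same_prefix = all(longer[i][:4] == short[i] for i in range(4))
# Example sizing: 16 heads at length 1024 in 32-bit floats.
size_mib = bias_mask_bytes(16, 1024) / 2**20
```

Because each entry depends only on i - j, extending the matrix to a longer context never alters the values the model saw during training. Under the stated assumptions the example mask comes to 64 MiB, which is in line with the reported bound of roughly 100 MB.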
A central design choice in ALiBi is that each attention head receives a different slope, organised as a geometric sequence. For a model with H heads, the slopes form a geometric sequence whose first term and common ratio are both 2^(-8/H). The slope for head h (using 1-indexing from 1 to H) is therefore:
m_h = 2^(-8h/H)
For a model with H = 8 heads, the slopes are 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, and 1/256.[1] The head with the largest slope (m_1 = 1/2) strongly penalises distant tokens and effectively focuses on a short window of recent tokens. The head with the smallest slope (m_8 = 1/256) decays much more slowly and can attend to tokens far away in the context. For 16-head models, Press et al. interpolate the 8-head slopes by geometrically averaging every consecutive pair, producing a geometric sequence with first term 2^(-0.5) and ratio 2^(-0.5), running from 2^(-0.5) down to 2^(-8).[1] Press et al. found that this geometric schedule produced strong extrapolation across model scales without tuning per task or per scale.
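The schedule is simple enough to write in one line. The sketch below assumes the number of heads is a power of two; the function name is illustrative.

```python
def alibi_slopes(num_heads):
    """Slopes m_h = 2^(-8h/H) for h = 1..H (H assumed to be a power of two)."""
    return [2.0 ** (-8.0 * h / num_heads) for h in range(1, num_heads + 1)]

slopes_8 = alibi_slopes(8)    # 1/2, 1/4, ..., 1/256
slopes_16 = alibi_slopes(16)  # 2^-0.5, 2^-1, ..., 2^-8
```

Every second entry of the 16-head schedule reproduces the 8-head schedule, and the entries in between are geometric means of neighbouring 8-head slopes, which matches the interpolation described above.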
The choice not to learn the slopes was a considered decision. Press et al. report that they initially experimented with making the slopes trainable, but trainable slopes "did not yield strong extrapolation results" and also slowed down training speed by about 3% in their experiments.[1] A manual exploration of around ten slope sets led them to a heuristic they describe in the paper: "the slope sets that work best are those with slopes in the (0, 1) range, with the slopes' density increasing as we get closer to 0." They also observed that ALiBi is robust to slope choice: even randomly sampling slopes from an exponential distribution worked acceptably in some cases, although with higher variance.[1]
The reference implementation in the official ALiBi repository extends the schedule to numbers of heads that are not powers of two. It first computes slopes for P, the largest power of two less than or equal to H, using the standard formula, then fills in the remaining H - P slopes from the finer schedule for 2P heads (first term and ratio 2^(-4/P)), taking every other term starting with the first (2^(-4/P), 2^(-12/P), and so on) to avoid duplicating slopes already used.[16] Implementations in Databricks Mosaic Composer and Keras follow the same structure, with the Keras AlibiBias layer exposing the alibi_bias_max=8 hyperparameter that controls the slope formula.[17][18]
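The fill-in scheme can be sketched as below. This is an illustration of the structure just described, not a copy of the repository code; both function names are hypothetical.

```python
import math

def slopes_power_of_two(n):
    """Geometric schedule whose first term and ratio are both 2^(-8/n)."""
    start = 2.0 ** (-8.0 / n)
    return [start ** (i + 1) for i in range(n)]

def alibi_slopes_any(num_heads):
    """Slopes for any head count, following the fill-in scheme described above."""
    if math.log2(num_heads).is_integer():
        return slopes_power_of_two(num_heads)
    base = 2 ** math.floor(math.log2(num_heads))   # largest power of two <= H
    extra = slopes_power_of_two(2 * base)[0::2]    # every other slope of the finer schedule
    return slopes_power_of_two(base) + extra[:num_heads - base]

slopes_12 = alibi_slopes_any(12)
```

For 12 heads, the first eight slopes are the standard 8-head schedule and the remaining four are 2^-0.5, 2^-1.5, 2^-2.5, and 2^-3.5, none of which duplicates an existing slope.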
Although the post-softmax effect of ALiBi is an exponential decay in attention weight with distance, Press and colleagues chose the pre-softmax bias to be linear in distance rather than exponential. They report that during development they experimented with exponential bias growth functions and found they performed worse than linear, leading them to adopt the linear form in the final paper.[1] The intuition is that a linear pre-softmax bias produces an exponential post-softmax penalty, but with a more graceful penalty curve than would result from an exponential pre-softmax term, where attention to distant tokens could collapse abruptly to zero.
Press et al. describe ALiBi as imposing an inductive bias toward recency: heads with larger slopes care almost exclusively about nearby tokens, while heads with smaller slopes can integrate longer-range information. The use of multiple slopes across heads is essential. Without it, the network would have to choose a single attention range, and either lose local detail or fail to model long-range dependencies. The geometric spacing ensures coverage of many scales of distance with a small number of heads.[1]
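One way to make the range of each head concrete is to ask at what distance the penalty factor exp(-m · d) falls to one half. This "half-weight distance" is an illustrative metric introduced here, not a quantity from the paper, and it ignores the content scores.

```python
import math

def half_weight_distance(slope):
    """Distance at which the factor exp(-slope * d) falls to 1/2 (illustrative metric)."""
    return math.log(2.0) / slope

slopes = [2.0 ** -k for k in range(1, 9)]           # the 8-head schedule
ranges = [half_weight_distance(m) for m in slopes]  # about 1.4 tokens up to about 177 tokens
```

Each successive head doubles this distance, so eight heads span roughly two orders of magnitude of context under this measure.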
In the related-work section, Press et al. note that ALiBi shares conceptual ground with several earlier proposals. Wennberg and Henter (2021) had concurrently proposed a relative position scheme that adds a distance-based bias to attention scores, but using a radial-basis function with multiple trainable parameters. The Distance-Aware Transformer of Wu et al. (2021) multiplies (rather than adds) attention scores by a distance-dependent bias with a learned parameter per head. Both are restricted to text classification rather than language modelling and do not study extrapolation.[1]
The original ALiBi paper trained transformer language models on the WikiText-103 corpus and a subset of the CC100+RoBERTa training data, and evaluated their perplexity on contexts longer than they were trained on.[1]
For the WikiText-103 experiments, Press et al. used the language model of Baevski and Auli (2018): 16 transformer layers, hidden dimension 1024, 8 attention heads, and a 4096-dimensional feedforward inner layer, with tied input and output embeddings.[1][19] The training corpus is 103 million tokens of English Wikipedia, about half a gigabyte.
Holding all other hyperparameters constant and varying only the position method, the paper reported that the sinusoidal model trained at L = 512 only improves perplexity for the first roughly 20 additional tokens beyond training length, after which performance stagnates and then degrades. The rotary model, applied to the same Baevski and Auli baseline, improves up to about 200 additional tokens but at the cost of slower training. The T5 bias method allows extrapolation for around 600 to 800 additional tokens, but training is at least twice as slow as sinusoidal in the authors' implementation.[1] ALiBi, by contrast, continues to improve perplexity until L_valid is around 3L, and a 512-token-trained ALiBi model continues to improve perplexity until L_valid exceeds 12,000 tokens, while runtime and memory are within 1 to 3% of sinusoidal at the same training length.[1]
Press et al. also reported a concrete cross-method perplexity comparison on the WikiText-103 validation set: their L = 3072 ALiBi model reached 17.60 perplexity, compared with 18.67 perplexity for the sinusoidal baseline at the same length. Even more striking, their L = 512 ALiBi model extrapolated to length 3072 and reached 18.40 perplexity, surpassing the sinusoidal L = 3072 result by a statistically significant margin while training 1.84 times faster.[1] These results were robust across L from 512 to 3072 and transferred without modification to the Toronto Book Corpus, a domain shift away from Wikipedia.[1]
The ALiBi paper's headline result was a 1.3 billion parameter language model trained on the 461 GB CC100+RoBERTa corpus. The architecture used 25 transformer layers, hidden dimension 2048, 16 attention heads, an 8192-dimensional feedforward sublayer, and was trained for one epoch (50,000 updates) on 128 V100 GPUs.[1] The headline finding: when trained on L = 1024 and evaluated on 2048-token sequences, the ALiBi model reached lower perplexity than the sinusoidal model trained on L = 2048, while training 11% faster and using 11% (about 3.1 GB) less memory.[1] Even the more constrained L = 512 ALiBi model, evaluated on 1024-token sequences, was within 0.06 perplexity of the sinusoidal L = 1024 baseline while training 7% faster and using 1.6 GB less memory.[1]
A subtler finding from this larger-scale experiment is that ALiBi achieved its best extrapolation perplexity at roughly twice the training length. The L = 512 model achieved its best perplexity (9.3) at L_valid around 1012 tokens, and the L = 1024 model achieved its best perplexity (8.9) at L_valid around 2024 tokens. Beyond about 2L, ALiBi's perplexity stops improving and begins to degrade gracefully, though it still maintained strong performance at L_valid = 10,000 tokens.[1] One conjecture in the paper is that for L_valid up to 2L, at least half of every batch's predictions are made on subsequences that match the training-length distribution; beyond 2L, less than half of predictions are well-matched, and the model's performance suffers.[1]
| Position method | Extrapolation distance | Training speed vs sinusoidal | Memory cost | Learned parameters |
|---|---|---|---|---|
| Sinusoidal (Vaswani 2017) | Roughly 20 to 50 tokens | 1x (baseline) | Baseline | None |
| Learned absolute | None (undefined beyond L) | 1x | Baseline + L·d | L·d |
| RoPE (Su 2021) | Roughly 200 tokens | 1.02x to 1.03x slower | Baseline | None |
| T5 relative bias | Roughly 600 to 800 tokens | About 2x slower | Higher | 32 buckets x H heads |
| ALiBi (Press 2021) | Roughly 2L to 3L | Within 1 to 3% | + n·L·L mask (under 100 MB) | None |
Reported in Press, Smith, Lewis (2021), Tables 1 to 3 and Figures 1 and 2.[1]
The BigScience workshop's BLOOM model, a 176-billion-parameter open multilingual language model released on 11 July 2022 under the Responsible AI License, used ALiBi as its positional encoding. The model has 70 transformer layers, 112 attention heads per layer, a hidden dimension of 14,336, and was trained on the 1.6 TB ROOTS corpus across 46 natural languages and 13 programming languages with a training sequence length of 2048 tokens.[3] In their architecture rationale, the BLOOM authors wrote that they chose ALiBi because it "directly attenuates the attention scores based on how far away the keys and queries are," and reported that beyond its extrapolation benefits, ALiBi "led to smoother training and better downstream performance even at the original sequence length, outperforming both learned and rotary embeddings" in their preliminary experiments.[3] BLOOM was the largest open-weights language model trained with ALiBi at the time of its release. SambaNova later released a long-context variant, BLOOMChat-v2, that fine-tuned BLOOM-176B for extended sequences.[20]
MosaicML released the MPT-7B family of open, commercially usable language models on 5 May 2023, with each base model trained from scratch on 1 trillion tokens of text and code on 440 A100-40GB GPUs over about 9.5 days at an approximate cost of $200,000.[4] The MPT architecture combines FlashAttention with ALiBi positional embeddings, motivated by the goal of supporting very long contexts at inference time, and uses the Lion optimiser and the GPT-NeoX-20B tokenizer with a 50,432-token vocabulary.[4] MosaicML released several variants: MPT-7B Base (Apache 2.0), MPT-7B-Instruct (CC-BY-SA-3.0), MPT-7B-Chat (CC-BY-NC-SA-4.0), and MPT-7B-StoryWriter-65k+ (Apache 2.0). StoryWriter-65k+ was produced by finetuning the base MPT-7B model with a context length of 65,536 tokens on a filtered fiction subset of the books3 dataset.[4] MosaicML reported that thanks to ALiBi, MPT-7B-StoryWriter-65k+ could extrapolate beyond its training context, demonstrating generations as long as 84,000 tokens on a single node of eight A100-80GB GPUs.[4]
MPT-30B followed on 22 June 2023, with 30 billion parameters and a final training sequence length of 8,000 tokens: the model was first trained on 1 trillion tokens at a sequence length of 2,000 tokens and then on an additional 50 billion tokens at the longer length.[5] MPT-30B was trained for roughly 13 to 14 days on A100s, or 9 to 10 days on H100s, and the documentation reports that ALiBi allowed inference at context lengths up to about 16,000 tokens. MPT-30B-Chat achieved 37.2% on HumanEval at release.[5] Together MPT-7B and MPT-30B made ALiBi the most visible long-context positional encoding for open-weights models during 2023.
BloombergGPT, a 50-billion-parameter language model for the financial domain, was announced on 30 March 2023 by Bloomberg. The model has 70 transformer layers, 40 attention heads per layer, a hidden dimension of 7,680, and was trained on 512 A100 GPUs over 53 days, processing approximately 569 billion tokens out of a planned mixed corpus of 363 billion tokens from Bloomberg's proprietary financial data sources (FinPile) and 345 billion tokens from general-purpose datasets.[6] BloombergGPT uses ALiBi positional encodings applied additively at every self-attention sublayer, following the same rationale as BLOOM: the choice was motivated by training stability and the ability to extrapolate to longer sequences without retraining.[6] BloombergGPT's vocabulary is a custom 131,072-token unigram SentencePiece tokenizer optimised for financial text.
Replit's code generation models adopted ALiBi for support of long, variable contexts at inference time. Replit Code V1 (replit-code-v1-3b), a 2.7 billion parameter model trained on 525 billion tokens of code (the Stack Dedup v1.2 dataset repeated three epochs) on 256 A100-40GB GPUs on the MosaicML platform, used ALiBi positional embeddings, FlashAttention with a Triton BF16 implementation, the LionW optimiser, and a custom 32,768-token SentencePiece Unigram tokenizer.[7] It reported 21.9% pass@1 on HumanEval and supported 20 programming languages led by Markdown, Java, JavaScript, Python, and TypeScript.
The successor Replit Code V1.5 (replit-code-v1_5-3b), released in 2023, scaled to 3.3 billion parameters trained on roughly 1 trillion tokens of code over five epochs (approximately 200 billion unique tokens) across 128 H100-80GB GPUs.[21] V1.5 used the MosaicML LLM Foundry and Composer stacks, supports 30 programming languages, and was trained with a 4,096-token context.[21]
A number of smaller open models and research codebases adopted ALiBi during the 2022 to 2023 period. The Hugging Face Transformers library added support for ALiBi as part of its BLOOM implementation. The Databricks Mosaic Composer library provides drop-in ALiBi replacements for Hugging Face BERT, RoBERTa, GPT-2, and any subclass thereof, with documentation noting that "ALiBi is not currently implemented for sequence-classification tasks" but reporting perplexity reductions and 1.16x to 1.19x training speedups on GPT-2 models at sizes from 52M to 125M parameters when training at one-quarter the standard sequence length and evaluating at full length.[17] The reference implementation from the original authors remains maintained at the attention_with_linear_biases GitHub repository.[16] In 2025 reporting, Google's Gemini Flash-Lite was described as having replaced RoPE with an ALiBi-style attention bias, alongside related modifications to enable extrapolation to contexts of roughly one million tokens.[22]
ALiBi's design choices, particularly its lack of learned parameters and its simple distance-based bias, place it in a specific region of the design space of positional encodings, with distinct tradeoffs compared to alternatives.
The earliest transformer positional encodings, sinusoidal and learned, add a position-dependent vector to the input embeddings.[10] These methods provide rich positional information at the input but do not extrapolate to longer sequences: learned embeddings have no values for unseen positions, and sinusoidal embeddings produce out-of-distribution patterns at unseen positions even though they are mathematically defined. ALiBi's main advantage over these methods is precisely its ability to handle longer sequences than seen during training. In the original ALiBi experiments on WikiText-103, models with sinusoidal or learned embeddings showed sharp perplexity increases when the inference length exceeded training length, while ALiBi retained or improved its perplexity.[1]
T5's relative position bias is conceptually closest to ALiBi: both add a position-dependent scalar bias to the attention logits. The differences are that T5 uses learned biases looked up by a bucketed offset function, with shared parameters across attention layers, while ALiBi uses fixed, non-learned biases that are linear in the distance and scaled by a per-head slope.[12] T5 relative bias has more parameters and can in principle learn more complex offset-to-bias mappings, but the bucketing scheme caps the longest distance that can be distinguished: every offset beyond the largest bucket shares a single learned bias, so distances never seen in training are treated the same as the longest trained ones. Press et al. compared ALiBi to T5 relative bias and reported that ALiBi extrapolated further (continuing to improve perplexity beyond 3L compared to about 2L for T5) while being computationally cheaper because it requires no parameter lookups; in their Fairseq-PyTorch implementation on V100 GPUs, T5 bias trained at roughly half the speed of sinusoidal, though they note that Narang et al. found only an 8.7% slowdown on TPU under Mesh TensorFlow.[1]
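The contrast between a bucketed lookup and a closed-form bias can be sketched as follows. The bucketing function follows the causal scheme commonly used in T5 implementations; the defaults of 32 buckets and a maximum distance of 128 are assumptions drawn from common configurations, and both function names are illustrative.

```python
import math

def t5_causal_bucket(distance, num_buckets=32, max_distance=128):
    """Map a non-negative query-key distance to a bucket index (causal case)."""
    max_exact = num_buckets // 2
    if distance < max_exact:
        return distance                      # small offsets get their own bucket
    # Larger offsets share logarithmically spaced buckets, capped at the last one.
    scaled = math.log(distance / max_exact) / math.log(max_distance / max_exact)
    return min(max_exact + int(scaled * (num_buckets - max_exact)), num_buckets - 1)

def alibi_bias(distance, slope):
    """ALiBi needs no lookup: the bias is a closed-form function of distance."""
    return -slope * distance

far_buckets = {t5_causal_bucket(d) for d in (128, 1000, 10000)}    # all share one bucket
far_biases = [alibi_bias(d, 1 / 256) for d in (128, 1000, 10000)]  # still distinct
```

Under these assumed defaults, every distance of 128 or more lands in the final bucket and receives the same learned scalar, whereas the ALiBi bias keeps growing with distance.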
RoPE, introduced in the RoFormer paper by Su et al. (2021), encodes position by rotating query and key vectors in two-dimensional subspaces by an angle proportional to absolute position. The inner product of rotated queries and keys then depends only on the relative offset between them.[8] RoPE was adopted by Llama and Llama 2 and has since become the dominant positional encoding for large language models.[8][9] Compared to ALiBi, RoPE injects positional information through phase rotations rather than additive biases, and the positional signal can interact more flexibly with content because it modifies queries and keys directly rather than shifting the pre-softmax attention scores by a content-independent amount. The original ALiBi paper found that on WikiText-103, ALiBi outperformed RoPE on extrapolation, with the RoPE model improving for only about 200 extra tokens past training length while ALiBi continued to improve for thousands.[1] Subsequent analyses with techniques such as positional interpolation, NTK-aware scaling, and YaRN have made RoPE far more competitive on long-context tasks, and modern RoPE deployments since Llama 3 increase the base frequency (Llama 3 raised it from 10,000 to 500,000) at training time to enable long contexts directly.[9]
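The relative-offset property of RoPE can be checked numerically in a single two-dimensional subspace. The sketch below is a toy illustration with one rotation frequency (`theta`) and made-up vectors; a real model applies many frequencies across the head dimension.

```python
import math

def rotate(vec2, position, theta=1.0):
    """Rotate a 2-D vector by the angle position * theta (one RoPE subspace)."""
    x, y = vec2
    a = position * theta
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))

def dot(u, v):
    """Inner product of two 2-D vectors."""
    return u[0] * v[0] + u[1] * v[1]

q, k = (0.3, -1.2), (0.7, 0.5)
# The same offset (3) at two different absolute positions gives the same score.
score_a = dot(rotate(q, 10), rotate(k, 7))
score_b = dot(rotate(q, 110), rotate(k, 107))
# A different offset (4) gives a different score.
score_c = dot(rotate(q, 10), rotate(k, 6))
```

The score changes with the offset but not with the absolute positions, and the size of the change depends on the content of q and k. ALiBi's bias, by contrast, is the same for a given distance regardless of content.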
A 2023 study by Kazemnejad et al., "The Impact of Positional Encoding on Length Generalization in Transformers" (NeurIPS 2023), compared absolute position embeddings (APE), T5 relative bias, ALiBi, RoPE, and NoPE (no explicit positional encoding) on a set of synthetic reasoning and algorithmic tasks. They reported that NoPE outperformed ALiBi, RoPE, and APE on length generalisation across many of their tasks, arguing that "commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalisation in downstream tasks."[23] Their analysis showed that decoder-only transformers can encode positional information implicitly through the causal mask, and that explicit positional biases sometimes interfered with this implicit signal. In their experiments, NoPE-trained models behaved most like models using T5's relative bias in their learned attention patterns despite having no explicit position parameters.[23]
Despite its conceptual elegance and early adoption, ALiBi's empirical performance on long-context tasks has been re-evaluated, and several limitations have become clearer.
Although the original paper showed strong extrapolation on language modelling perplexity, subsequent work found that the magnitude of extrapolation is smaller than initially advertised when models are sufficiently well trained or when more demanding evaluations are used. Al-Khateeb et al. (2023) from Cerebras Systems, in "Position Interpolation Improves ALiBi Extrapolation," reported that ALiBi position embeddings "only extrapolate well to a fraction beyond the trained sequence length" for an over-trained model, motivating the application of positional interpolation techniques originally designed for RoPE to ALiBi-based models.[9] They showed that interpolation-based length extension, achieved by rescaling the linear-distance term in ALiBi, significantly improved language modelling perplexity and downstream SCROLLS performance for BLOOM-7B and ALiBi-trained Cerebras-GPT models without further training.[9]
A SambaNova analysis of BLOOM-7B on SCROLLS found that positional interpolation substantially outperformed naive ALiBi extrapolation on NarrativeQA, with an interpolation F1 of 4.17 compared with 1.63 for direct extrapolation, more than 2.5x better without fine-tuning.[24] The same analysis identified an additional ALiBi-specific issue at long sequence lengths in reduced precision: at 8,192 tokens in BF16, ALiBi's static bias values for adjacent positions collapse to the same floating-point representation, so that the last 20 positions of an 8,192-context all receive the same bias. The analysis reported that across the range from position 8,000 to 8,192, FP32 yielded 192 distinct bias values, FP16 yielded 41, and BF16 yielded only 6 distinct values, with this degradation consistent across all 32 attention heads tested.[24] In response, SambaNova recommends computing ALiBi biases in FP32 for long-context inference even when the rest of the model runs in lower precision.
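The precision effect can be reproduced in outline with the standard library alone. The sketch below rounds raw integer distances (equivalent to a slope that is a power of two) to each format; bfloat16 is emulated by rounding a float32 bit pattern, which is an assumption about the rounding mode. The exact counts depend on the slope and rounding details, so they need not match the figures reported above.

```python
import struct

def to_float32(x):
    """Round a Python float to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_float16(x):
    """Round a Python float to IEEE half precision via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bfloat16(x):
    """Emulate bfloat16 by rounding a float32 bit pattern to its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)      # round to nearest, ties to even
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

distances = range(8000, 8192)
distinct = {
    "fp32": len({to_float32(float(d)) for d in distances}),
    "fp16": len({to_float16(float(d)) for d in distances}),
    "bf16": len({to_bfloat16(float(d)) for d in distances}),
}
```

The ordering of the three counts shows the mechanism: with fewer significand bits, neighbouring distances near 8,000 round to the same value, so the model can no longer tell those positions apart through the bias.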
ALiBi's strong inductive bias toward recency is helpful for perplexity on natural text, where local context is highly predictive of the next token, but can be a liability for tasks that require attending to specific distant content. Because every head's attention to far-away tokens is exponentially suppressed by a fixed slope, ALiBi models can struggle on tasks that demand precise, content-dependent retrieval from earlier parts of a long context. Several analyses have observed that perplexity gains under ALiBi do not always translate into downstream gains on long-context reasoning tasks.[23][25]
In 2026, Palmer Schallon's analysis "Surgical Repair of Collapsed Attention Heads in ALiBi Transformers" diagnosed a related failure mode in the BLOOM family. The paper reports that across BLOOM-560M, BLOOM-1B7, BLOOM-3B, and BLOOM-7B1, between 31% and 44% of attention heads attend almost entirely to the beginning-of-sequence token, with the collapse concentrated in head indices where ALiBi's slope schedule imposes the steepest distance penalties.[26] The proposed remedy ("surgical reinitialisation," using Xavier-initialised Q/K/V matrices, zeroed output projections, and gradient masking) recovered 98.7% of operational head capacity (379 of 384 heads) in BLOOM-1B7 in two passes on a consumer GPU and produced a 25% transient improvement in training perplexity even on heads that had not been classified as collapsed, suggesting that "pretrained attention is a local minimum, not a global one" in ALiBi models.[26]
ALiBi was designed for causal (decoder-only) language modelling, where the natural total ordering of tokens makes a one-directional linear bias straightforward. Adapting ALiBi to bidirectional encoder models such as BERT and RoBERTa requires symmetrising the bias or using "nonsymmetric with offset" variants that distinguish left and right neighbours, and the benefits in this setting are smaller. An ICLR 2024 blog post on "Masked Language Model with ALiBi and CLAP head" reported that ALiBi's perplexity benefits for masked language models did not consistently translate to downstream GLUE task improvements. On roberta.base, ALiBi gave perplexity 2.93 versus the baseline 2.94, but harmed downstream performance on CoLA, MRPC, and RTE; on roberta.large, ALiBi (perplexity 2.65) was actually worse than the baseline (2.55) before any fine-tuning.[25] The author observed that "models with lower perplexity do not necessarily yield higher accuracies for downstream tasks and architectural changes beneficial for models at smaller scales do not imply the same for models at larger scales."[25]
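A symmetric bias for a bidirectional encoder can be sketched as below. The second function is a hypothetical asymmetric variant with separate left and right slopes, included only to illustrate distinguishing the two directions; it is not the exact "nonsymmetric with offset" formulation mentioned above.

```python
def symmetric_alibi_bias(slope, length):
    """Bidirectional variant: penalise by absolute distance |i - j| in both directions."""
    return [[-slope * abs(i - j) for j in range(length)] for i in range(length)]

def two_sided_alibi_bias(left_slope, right_slope, length):
    """Hypothetical asymmetric variant: separate slopes for left and right context."""
    return [[-(left_slope if j <= i else right_slope) * abs(i - j)
             for j in range(length)] for i in range(length)]

sym = symmetric_alibi_bias(0.5, 4)
asym = two_sided_alibi_bias(0.5, 0.25, 4)
```

In the symmetric form a token three positions to the left and one three positions to the right receive the same penalty, so the bias alone cannot tell the model which side a neighbour is on. The asymmetric form restores that distinction.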
The Mosaic Composer documentation similarly warns that performance significantly degrades for ALiBi models trained on sequence lengths at or below 128 tokens, and advises against training with sequences at or below 256 tokens or a train_sequence_length_scaling factor at or below 0.03125.[17]
By 2024, RoPE combined with context-extension techniques such as positional interpolation, NTK-aware interpolation, and YaRN had become the dominant approach to long-context language modelling, supplanting ALiBi in most new releases of frontier and open-source models.[9][27] The Llama family, Mistral models, Qwen, DeepSeek, and many others use RoPE. Llama 3 in particular increased its RoPE base frequency from 10,000 to 500,000 so that models could be trained directly at long context rather than relying on extrapolation.[28] ALiBi remained important in its specific historical niche of mid-2022 to 2023 open models, principally BLOOM, MPT, BloombergGPT, and Replit Code, but few large models trained after 2023 adopted it as their primary position encoding.
Several lines of work have built on ALiBi or proposed related distance-based attention biases.
Al-Khateeb, Dey, Soboleva, and Hestness (Cerebras Systems, 2023) adapted the positional interpolation idea from the RoPE literature to ALiBi by rescaling the linear distance term, allowing ALiBi models to extend their effective context length up to roughly twice the maximum training sequence length while preserving language modelling performance, with substantial gains on the SCROLLS long-context summarisation and retrieval benchmark.[9] The technique can be applied without retraining and significantly outperforms naive ALiBi extrapolation on demanding long-context tasks.
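The rescaling idea can be sketched as a change to the slope at inference time. The specific rule used here, multiplying the slope by the ratio of training length to evaluation length, is an assumed form of "rescaling the linear distance term"; the function names are illustrative.

```python
def interpolated_slope(slope, train_length, eval_length):
    """Shrink the slope when evaluating beyond the training length (assumed form)."""
    if eval_length <= train_length:
        return slope
    return slope * train_length / eval_length

def max_penalty(slope, length):
    """Largest bias magnitude, applied to the most distant key."""
    return slope * (length - 1)

m = 1 / 16
m_scaled = interpolated_slope(m, train_length=2048, eval_length=4096)
```

With the scaled slope, the largest penalty at 4,096 tokens is about the same as the largest penalty the head saw at 2,048 tokens during training, so the biases stay within the range the model was trained on.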
Schallon's 2026 paper, "Surgical Repair of Collapsed Attention Heads in ALiBi Transformers," examined the failure modes of ALiBi attention heads in the high-slope regime and proposed targeted reinitialisation interventions to restore useful attention patterns in heads whose strong slope had reduced the effective receptive field to a few tokens around the beginning-of-sequence token.[26] The technique recovers most collapsed heads at low cost and provides a diagnostic tool for inspecting ALiBi pretraining outcomes at scale.
Several papers in the long-context literature have proposed modifications to ALiBi's slope schedule, including learned slopes, slope schedules tuned per task, and hybrid combinations of distance bias and rotary embeddings.[29] These variants typically retain the core idea of biasing attention by a function of distance with no or few learned parameters at the position level. The general trend in the field, however, has been toward methods that combine RoPE with explicit context-extension procedures rather than toward further elaboration of distance-bias schemes.[9][27]
Reporting in 2025 and 2026 suggests Google's Gemini Flash-Lite replaced RoPE with an ALiBi-style attention-with-linear-biases mechanism (with additional refinements to mitigate long-distance decay) to enable context lengths near one million tokens without retraining.[22] This represents a partial return of distance-bias methods to frontier deployment after several years of RoPE dominance, although the precise architecture is not fully documented in public materials.
ALiBi made several lasting contributions to the literature on positional encoding even as it has been overtaken by other methods for state-of-the-art long-context modelling. It demonstrated empirically that a position encoding can be entirely non-learned and still be effective, that recency can be encoded as a head-specific architectural inductive bias rather than discovered from data, and that the choice of positional mechanism has a first-order impact on a model's ability to generalise to inference-time sequence lengths beyond those seen during training. The geometric slope schedule, fixed before training, became a widely used pattern for sharing capacity across attention heads that need different effective ranges.[1]
ALiBi also played an important practical role in enabling the first wave of open long-context language models. BLOOM, MPT-7B-StoryWriter, BloombergGPT, and the Replit code models gave researchers and developers access to multi-thousand-token and tens-of-thousands-token contexts at a time when the dominant frontier models were either closed or trained on shorter sequences.[3][4][6][7] The combination of ALiBi with FlashAttention in the MPT family, in particular, helped popularise the idea of training on moderate sequence lengths and extrapolating at inference time, a workflow whose intellectual legacy persists in modern context-extension techniques applied to RoPE-based models.[4][9] The paper's deliberate focus on extrapolation (rather than absolute perplexity at fixed length) also set a methodological precedent. Subsequent positional encoding research has routinely reported evaluation perplexity as a function of L_valid, and many newer methods are described primarily in terms of how they generalise beyond training length.[9][23][27]