Mixture of Block Attention (MoBA)

Updated Jun 8, 2026

Overview

Mixture of Block Attention (MoBA) is a trainable block-sparse attention mechanism introduced in February 2025 by researchers at Moonshot AI, the company that builds the Kimi family of large language models, together with collaborators at Tsinghua University. MoBA divides the key and value sequence into contiguous blocks and uses a learned gating mechanism to route each query to a small number of the most relevant blocks, instead of letting every query attend to every preceding token. The design carries the routing idea behind mixture of experts (MoE) into the attention operation itself: each block plays the role of an expert, and the gate acts as a router that selects the top-k blocks for a given query ^[1].

The central motivation is to scale Transformer attention to very long contexts without paying the quadratic O(N^2) cost of full attention, while keeping the sparse pattern trainable end to end and able to revert to dense attention at any point. Unlike fixed sparse patterns that hard-code which tokens may interact, MoBA lets the model learn where to attend, and it falls back to exact dense attention when every block is selected. The authors describe MoBA as a flexible substitute for full attention that switches seamlessly between sparse and dense modes ^[1]^[2]. MoBA has been deployed in production to serve Kimi's long-context requests ^[1]. The paper, "MoBA: Mixture of Block Attention for Long-Context LLMs," is first-authored by Enzhe Lu and lists a large group of Moonshot AI authors ^[1].

Background: the cost of long-context attention

Standard scaled dot-product attention compares every query against every key, so for a sequence of length N the computation and memory of the attention scores grow as O(N^2). Long-context models that target hundreds of thousands or millions of tokens make this quadratic term the dominant cost in both training and inference, especially during the prefill stage when the whole prompt is processed at once.
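To make the quadratic growth concrete, the following Python sketch counts the query-key pairs evaluated under causal full attention, where token t attends to tokens 1 through t. The function name `causal_pairs` is illustrative and does not come from the MoBA paper.

```python
def causal_pairs(n_tokens: int) -> int:
    """Number of (query, key) pairs when token t attends to tokens 1..t."""
    return n_tokens * (n_tokens + 1) // 2


# Compare two context lengths mentioned in long-context work.
for n in (128_000, 1_000_000):
    print(f"N={n:>9,}: {causal_pairs(n):,} query-key pairs")
```

Going from 128,000 to 1,000,000 tokens multiplies the sequence length by about 8 but the pair count by about 61, which is why the attention term dominates at long context.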

Earlier efficiency work attacked this in two broad ways. One family imposes fixed sparse patterns, such as sliding windows, dilated windows, or designated global tokens (as in Longformer and BigBird), which reduce cost but bake in assumptions about which positions matter. A second family changes the architecture more drastically, for example linear-attention and state-space models that drop softmax attention entirely. Both directions risk losing the expressiveness of full attention, and fixed patterns in particular cannot adapt their connectivity to the content of a given sequence. MoBA follows a principle the authors call "less structure": rather than prescribing a pattern, it gives the model the freedom to decide, per query, which regions of the context deserve attention ^[1].

How MoBA works

Blocks. MoBA partitions the context of length N into n equal-sized blocks of size B, where B = N / n. Block i covers the contiguous span of tokens from position (i-1)B + 1 to iB. For instance, a 32,768-token (32K) context split with B = 512 yields n = 64 blocks ^[1].
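The block layout can be sketched in a few lines of Python. The helper names `block_spans` and `block_of` are hypothetical, and the sketch assumes N is an exact multiple of B, as in the definition B = N / n.

```python
def block_spans(n_tokens: int, block_size: int) -> list[tuple[int, int]]:
    """1-indexed inclusive spans: block i covers [(i-1)B + 1, iB]."""
    if n_tokens % block_size != 0:
        raise ValueError("this sketch assumes N is a multiple of B")
    n_blocks = n_tokens // block_size
    return [((i - 1) * block_size + 1, i * block_size)
            for i in range(1, n_blocks + 1)]


def block_of(position: int, block_size: int) -> int:
    """1-indexed block id that contains a 1-indexed token position."""
    return (position - 1) // block_size + 1


spans = block_spans(32_768, 512)
print(len(spans), spans[0], spans[-1])
```

With N = 32,768 and B = 512 this gives 64 spans, the first covering positions 1 to 512 and the last covering 32,257 to 32,768.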

Gating and affinity scores. For each query token q and each block i, MoBA computes a relevance score. The score is the inner product of the query with the mean-pooled keys of that block: s_i = q dot mean_pool(K_i), where K_i are the key vectors of the tokens in block i and mean_pool averages them along the sequence dimension. The mean-pooled key acts as a cheap summary, or representative, of the whole block, so a single dot product estimates how relevant the block is to the query ^[1].
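As an illustration of the affinity score, the sketch below computes s_i for every block from plain Python lists. The function names are illustrative only, keys are assumed to arrive as a list of equal-length vectors, and the sequence length is assumed to be a multiple of the block size.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(keys):
    """Average a list of key vectors along the sequence dimension."""
    return [sum(column) / len(keys) for column in zip(*keys)]


def block_scores(q, keys, block_size):
    """s_i = <q, mean_pool(K_i)> for each contiguous block of keys."""
    return [dot(q, mean_pool(keys[start:start + block_size]))
            for start in range(0, len(keys), block_size)]


q = [1.0, 0.0]
keys = [[1.0, 0.0], [3.0, 0.0],   # block 1: mean is [2.0, 0.0]
        [0.0, 2.0], [0.0, 4.0]]   # block 2: mean is [0.0, 3.0]
scores = block_scores(q, keys, block_size=2)
```

Here the query aligns with the first block's pooled key and is orthogonal to the second, so the first block scores higher at the cost of one dot product per block rather than one per token.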

Top-k selection. Each query then keeps only the k blocks with the highest scores. Formally the gate value g_i equals 1 if s_i ranks among the top-k scores across all blocks and 0 otherwise; the query attends only to the keys and values inside the selected blocks. Because k is much smaller than n in long contexts, the number of query-key interactions drops sharply. Sparsity here is the fraction of blocks a query skips, 1 - k/n. In the authors' scaling experiments a block size of B = 512 with top-k = 3 already yields sparsity around 81 percent at 8K context (3 of 16 blocks selected) and about 95 percent at 32K context (3 of 64 blocks selected) ^[1].
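The gate and the sparsity arithmetic can be written out as a minimal sketch. The names `top_k_gate` and `sparsity` are hypothetical, and sparsity is taken to mean the fraction of blocks a query skips, 1 - k/n.

```python
def top_k_gate(scores, k):
    """Return g_i in {0, 1}: 1 for the k highest-scoring blocks, else 0."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]


def sparsity(n_tokens, block_size, k):
    """Fraction of blocks skipped per query: 1 - k / n, with n = N / B."""
    n_blocks = n_tokens // block_size
    return 1.0 - k / n_blocks


gate = top_k_gate([0.1, 0.9, 0.5, 0.7], k=2)   # blocks 2 and 4 win
print(gate, sparsity(8_192, 512, 3), sparsity(32_768, 512, 3))
```

With B = 512 and k = 3, the formula gives 0.8125 at 8,192 tokens and 0.953125 at 32,768 tokens, matching the roughly 81 and 95 percent figures quoted above.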

Causality. To preserve the autoregressive property, MoBA forbids a query from selecting any block that lies in its future: scores for future blocks are set to negative infinity so they can never enter the top-k. The block containing the query itself, the current block, is always selected, and a standard causal mask is applied inside it so a token cannot see later tokens within its own block. Together these rules guarantee that MoBA never leaks information from future positions ^[1].
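The causal rules can be sketched as follows, using 0-indexed positions and block ids for brevity. The function names are illustrative. The sketch assumes the always-selected current block counts as one of the k selections, so at most k - 1 history blocks are added; that accounting is an assumption of this example rather than something stated in the text above.

```python
import math


def causal_block_selection(scores, query_pos, block_size, k):
    """Pick block ids for one query: the current block plus top history blocks."""
    current = query_pos // block_size
    # Current and future blocks are masked out of the score-based ranking.
    masked = [s if i < current else -math.inf for i, s in enumerate(scores)]
    history = sorted(range(current), key=lambda i: masked[i], reverse=True)
    history = history[:max(k - 1, 0)]
    # The current block is always selected, regardless of its score.
    return sorted(set(history) | {current})


def visible_positions(selected, query_pos, block_size):
    """Token positions the query may read: whole history blocks, and the
    current block only up to and including the query itself."""
    current = query_pos // block_size
    positions = []
    for b in selected:
        end = query_pos + 1 if b == current else (b + 1) * block_size
        positions.extend(range(b * block_size, end))
    return positions


scores = [0.2, 0.9, 0.4, 5.0]      # block 3 scores highest but lies in the future
selected = causal_block_selection(scores, query_pos=9, block_size=4, k=2)
visible = visible_positions(selected, query_pos=9, block_size=4)
```

For a query at position 9 with B = 4, the future block is never chosen despite its high score, and inside the current block only positions 8 and 9 are visible.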

Relation to full attention. Setting k = n forces every query to select every block, which recovers exact full attention. Other familiar patterns are special cases too: a sliding window corresponds to always picking the nearest blocks, and an attention sink corresponds to always picking the first block. This is what lets MoBA move continuously between sparse and dense regimes, and what makes it practical to interleave MoBA layers with full-attention layers inside the same model ^[1].
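The reduction to full attention can be checked numerically with a small, unoptimized reference in pure Python. This is a sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, positions are 0-indexed, logits use the usual 1/sqrt(d) scaling, and the current block is counted as one of the k selections.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_attend(q, keys, values, positions):
    """Softmax attention of one query restricted to the given positions."""
    logits = [dot(q, keys[p]) / math.sqrt(len(q)) for p in positions]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [sum(w * values[p][j] for w, p in zip(weights, positions)) / z
            for j in range(len(values[0]))]


def moba_positions(q, keys, t, block_size, k):
    """Positions query t reads: top history blocks plus its causal current block."""
    current = t // block_size

    def score(i):  # s_i = <q, mean_pool(K_i)>
        block = keys[i * block_size:(i + 1) * block_size]
        return dot(q, [sum(col) / len(block) for col in zip(*block)])

    history = sorted(range(current), key=score, reverse=True)[:max(k - 1, 0)]
    positions = []
    for b in sorted(history):
        positions.extend(range(b * block_size, (b + 1) * block_size))
    positions.extend(range(current * block_size, t + 1))
    return positions


def moba_attention(queries, keys, values, block_size, k):
    return [softmax_attend(q, keys, values,
                           moba_positions(q, keys, t, block_size, k))
            for t, q in enumerate(queries)]


def full_causal_attention(queries, keys, values):
    return [softmax_attend(q, keys, values, list(range(t + 1)))
            for t, q in enumerate(queries)]


rng = random.Random(0)
N, D, B = 8, 3, 2                      # 8 tokens, 4 blocks of size 2
Q, K, V = ([[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
           for _ in range(3))
dense = full_causal_attention(Q, K, V)
moba_all = moba_attention(Q, K, V, B, k=N // B)    # k = n: every block selected
moba_sparse = moba_attention(Q, K, V, B, k=1)      # current block only
```

With k = n every query reads exactly the positions that causal full attention reads, so `moba_all` matches `dense`; with k = 1 each query is confined to its own block, which is the degenerate local-window case.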

Implementation. MoBA is built on top of FlashAttention. The reference algorithm splits the work into a self-block attention over the current block and a cross-block attention over the selected history blocks, reorders queries by their block assignment so the selected keys and values are contiguous, runs separate FlashAttention calls, and merges the partial results with online softmax. The open-source implementation reports an efficient kernel that is up to 40 times faster than a naive reference at 32K sequence length ^[2].
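The merge step relies on the fact that softmax attention over disjoint key sets can be combined exactly if each partial result carries its log-sum-exp. The sketch below shows that identity in pure Python; it is not the FlashAttention kernel or the repository's code, and the function names are hypothetical.

```python
import math


def partial_attention(q, keys, values):
    """Attention over one chunk; returns (output, log-sum-exp of the logits)."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    out = [sum(w * v[j] for w, v in zip(weights, values)) / z
           for j in range(len(values[0]))]
    return out, m + math.log(z)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results as if softmax had run over both chunks."""
    m = max(lse_a, lse_b)
    w_a, w_b = math.exp(lse_a - m), math.exp(lse_b - m)
    total = w_a + w_b
    merged = [(w_a * a + w_b * b) / total for a, b in zip(out_a, out_b)]
    return merged, m + math.log(total)


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5]]
values = [[1.0], [2.0], [3.0], [4.0]]

# "Self-block" over the last two tokens, "cross-block" over the first two.
cross_out, cross_lse = partial_attention(q, keys[:2], values[:2])
self_out, self_lse = partial_attention(q, keys[2:], values[2:])
merged_out, _ = merge(cross_out, cross_lse, self_out, self_lse)
reference_out, _ = partial_attention(q, keys, values)
```

Each partial output is reweighted by its share of the total softmax mass, so `merged_out` agrees with `reference_out` up to floating-point rounding without any chunk ever seeing the other chunk's logits.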

The mixture-of-experts analogy

MoBA's name is a deliberate echo of mixture of experts. In a standard MoE layer, a router scores a set of feed-forward experts for each token and dispatches the token to its top-k experts, so only a fraction of the network's parameters activate per token. MoBA transplants this conditional-computation pattern into the attention layer: the blocks of the key-value cache are the experts, and the top-k gate is the router that decides which blocks a query is dispatched to. The score s_i plays the role of the routing logit, computed from a mean-pooled block summary in place of an expert embedding.

The analogy is structural rather than parameter-based. Classic MoE sparsifies over distinct learned expert weights, whereas MoBA sparsifies over partitions of the context, all of which share the same projection weights; the "experts" are dynamic spans of the input, not separate modules. The shared insight is that a learned top-k router can preserve quality while activating only a small slice of the available computation, which is precisely the property that makes both techniques attractive at scale ^[1].

Use in Kimi and results

The MoBA authors validate the method at two scales. In controlled scaling-law experiments they train models from 568 million to 2.1 billion parameters and find that MoBA tracks full-attention language-model loss closely despite its high sparsity. For large-scale validation they continue training Llama-3.1-8B into a model they call Llama-8B-1M-MoBA, progressively extending the context from 128K to 256K to 512K and finally to 1 million tokens, using a larger block size of B = 4096 and top-k = 12. A layer-wise hybrid keeps the last few layers in full attention, which the authors report recovers the small remaining quality gap on the hardest long-context tasks ^[1].

The reported quality stays close to full attention while the compute savings are large:

Setting	MoBA	Full attention
RULER benchmark at 128K	0.7818	0.7849
LongBench at 32K	0.4828	0.4821
Prefill speedup at 1M tokens	up to 6.5x	1x (baseline)
Attention compute at 10M tokens	about 16x less	1x (baseline)

MoBA is essentially on par with full attention on these long-context benchmarks, marginally lower on RULER and marginally higher on LongBench, and it passes needle-in-a-haystack retrieval at the 1M-token setting ^[1]. Because the mechanism is a swap-in replacement for the attention call, Moonshot reports that MoBA has been deployed to serve Kimi's long-context traffic in production ^[1]. The open-source release, available under an MIT license, notes one important caveat: MoBA delivers its speedups only after continued training, so it is not a zero-shot drop-in for an already-trained dense model ^[2].

Relationship to NSA and other methods

MoBA appeared within days of a closely related method, Native Sparse Attention (NSA) from DeepSeek, which was posted to arXiv on February 16, 2025, two days before MoBA ^[3]. Both are natively trainable, block-oriented sparse attention schemes aimed at long context, and their near-simultaneous release marked a shift toward sparse attention that is learned end-to-end rather than applied only at inference.

The two differ in design philosophy. NSA uses a fixed three-branch architecture whose branches always run in parallel: a compressed branch over coarse pooled token blocks, a fine-grained selection branch over a few important blocks, and a sliding-window branch for local context, with a learned gate blending the three outputs. It is co-designed with hardware-aligned Triton kernels to balance arithmetic intensity. MoBA is comparatively minimal: a single top-k router over mean-pooled block keys, no extra branches, and a computation that stays close to ordinary attention. That simplicity is what lets MoBA collapse exactly to full attention and be mixed with dense layers, whereas NSA's branches are always active ^[1]^[3].

Compared with the earlier fixed-pattern methods such as Longformer and BigBird, MoBA's key advance is that connectivity is data-dependent and learned rather than static, yet it can still reproduce those patterns as special cases. It also sits alongside concurrent learned-sparsity work such as Microsoft's SeerAttention, which likewise learns block-level sparse patterns. MoBA should not be confused with Block-Attention, a separate method for parallel context encoding and cross-prompt key-value cache reuse that shares the "block" terminology but targets retrieval-augmented prefill rather than per-query routing. Within Moonshot's own stack, MoBA is the long-context attention counterpart to the company's broader efficiency and training work behind the Kimi line ^[1].

References

[1] Enzhe Lu, et al. "MoBA: Mixture of Block Attention for Long-Context LLMs." arXiv:2502.13189, February 18, 2025. https://arxiv.org/abs/2502.13189
[2] MoonshotAI. "MoBA: Mixture of Block Attention for Long-Context LLMs" (source code repository, MIT license). GitHub. https://github.com/MoonshotAI/MoBA
[3] DeepSeek-AI. "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention." arXiv:2502.11089, February 16, 2025. https://arxiv.org/abs/2502.11089
[4] Moonshot AI (Kimi). "Introducing MoBA: Mixture of Block Attention for Long-Context LLMs." Announcement, February 2025. https://github.com/MoonshotAI/MoBA
