PEER (Parameter Efficient Expert Retrieval / Mixture of a Million Experts)

Updated Jul 12, 2026

Overview

PEER, short for Parameter Efficient Expert Retrieval, is a neural network layer for Transformer models that replaces the dense feed-forward block with a sparse mixture of experts drawn from an extremely large pool of very small experts. Instead of the dozens or hundreds of experts used by conventional sparse models, a single PEER layer can choose among more than one million experts, where each expert is a tiny multilayer perceptron with a single hidden neuron. To pick a handful of these experts for each token without scoring all million of them, PEER borrows the product-key memory retrieval technique, which finds the most relevant experts in time that grows with the square root of the pool size rather than linearly. ^[1]

The method was introduced in the paper "Mixture of A Million Experts" by Xu Owen He, a researcher at Google DeepMind, posted to arXiv on July 4, 2024. ^[1] The paper's central goal is to decouple a model's total parameter count from its per-token compute cost. Because only a small, fixed number of experts is activated for any given token, a PEER layer can hold a very large number of parameters while keeping the floating-point operations (FLOPs) and activation memory per token low. On language-modeling experiments, He reports that PEER layers reach a better performance-versus-compute trade-off than dense feed-forward networks and than conventional coarse-grained mixture-of-experts layers. ^[1]

Background

PEER sits at the intersection of two earlier research lines: scaling laws for fine-grained mixture of experts, and the product-key memory layer.

Fine-grained mixture of experts

A standard mixture-of-experts (MoE) layer holds several parallel expert sub-networks and uses a learned router to send each token to only one or a few of them. This raises a model's parameter count and capacity while holding the per-token compute roughly fixed, which is why sparse MoE has become a common way to scale large language model capacity. In most production MoE designs each expert is about the size of a full feed-forward block, so the layer has only a modest number of large experts.
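
As a point of reference, a conventional router of the kind described above can be sketched in a few lines. This is a minimal illustration, not any particular production design: the function name `dense_router`, the toy sizes, and the choice to apply softmax over only the selected logits are all assumptions made for the example. The thing to notice is that one score is computed per expert, whatever the value of k.

```python
import heapq
import math
import random


def dense_router(x, router_weights, k):
    """Score every expert (cost linear in the expert count), then keep the top k."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    # Softmax over the selected logits only (an assumption; designs differ here).
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    total = sum(exps)
    gates = [(i, e / total) for i, e in zip(top, exps)]
    return gates, len(scores)


# Hypothetical toy sizes, chosen only to keep the example fast.
rng = random.Random(0)
d_model, num_experts = 8, 16
router_weights = [[rng.gauss(0.0, 1.0) for _ in range(d_model)]
                  for _ in range(num_experts)]
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

chosen, scores_computed = dense_router(token, router_weights, k=2)
print(chosen, scores_computed)
```

Here `scores_computed` always equals `num_experts`, which is the per-token routing cost that becomes the bottleneck as the pool grows.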

Work on scaling laws for MoE questioned whether that is the right design point. "Scaling Laws for Fine-Grained Mixture of Experts" by Jakub Krajewski, Jan Ludziejewski and colleagues, published at ICML 2024, introduced a hyperparameter called granularity that controls how finely the feed-forward computation is split into experts. Their fitted scaling law indicated that the common practice of making each expert as large as the dense feed-forward layer is suboptimal at almost every compute budget, and that using more, smaller experts (higher granularity) improves the loss. ^[3] This finding motivates pushing granularity much higher than usual. The obstacle is the router: a conventional MoE router computes a score for every expert, so its cost grows linearly with the number of experts, which makes pools of hundreds of thousands or millions of experts impractical. PEER is, in effect, an attempt to realize the very high granularity the scaling law favors while sidestepping that routing bottleneck. ^[1]^[3]

Product-key memory

The retrieval mechanism PEER reuses comes from "Large Memory Layers with Product Keys" by Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer and Hervé Jégou, presented at NeurIPS 2019. ^[2] A product-key memory stores a very large table of learnable value vectors, each indexed by a key. For each input the layer forms a query and retrieves the top-k value vectors whose keys best match the query, then combines them with softmax weights. The trick that makes this affordable is the product-key structure. Rather than keeping $N$ independent keys, the layer builds keys as the Cartesian product of two smaller sets of about $\sqrt{N}$ sub-keys each, and splits the query into two halves. Finding the best sub-keys in each half and then combining them recovers the overall top-k while only ever scoring on the order of $\sqrt{N}$ candidates. The 2019 paper showed that this lets a model add up to roughly a billion extra parameters with negligible extra computation, and that a memory-augmented 12-layer Transformer could match a 24-layer baseline while running about twice as fast. ^[2]
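
The two-stage search can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not code from either paper: the function names, the row-major index `i * n_b + j` for the combined key, and the toy sizes are all hypothetical. A full key is the concatenation of one sub-key from each set, so its score is the sum of the two half scores. The brute-force function is included only to check that the shortcut returns the same top-k.

```python
import heapq
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def half_scores(query, sub_keys_a, sub_keys_b):
    """Score each half of the query against its own sub-key set."""
    half = len(query) // 2
    scores_a = [dot(query[:half], key) for key in sub_keys_a]
    scores_b = [dot(query[half:], key) for key in sub_keys_b]
    return scores_a, scores_b


def product_key_top_k(query, sub_keys_a, sub_keys_b, k):
    """Top-k over all concatenated keys without enumerating them."""
    scores_a, scores_b = half_scores(query, sub_keys_a, sub_keys_b)
    best_a = heapq.nlargest(k, ((s, i) for i, s in enumerate(scores_a)))
    best_b = heapq.nlargest(k, ((s, j) for j, s in enumerate(scores_b)))
    n_b = len(sub_keys_b)
    # Only k * k combinations are formed; each score is a sum of half scores.
    candidates = ((sa + sb, i * n_b + j) for sa, i in best_a for sb, j in best_b)
    return heapq.nlargest(k, candidates)


def brute_force_top_k(query, sub_keys_a, sub_keys_b, k):
    """Reference version: scores all len(a) * len(b) combined keys."""
    scores_a, scores_b = half_scores(query, sub_keys_a, sub_keys_b)
    n_b = len(sub_keys_b)
    all_pairs = ((sa + sb, i * n_b + j)
                 for i, sa in enumerate(scores_a)
                 for j, sb in enumerate(scores_b))
    return heapq.nlargest(k, all_pairs)


# Hypothetical toy sizes: 32 x 32 = 1,024 implicit keys.
rng = random.Random(0)
side, half_dim, k = 32, 4, 4
sub_keys_a = [[rng.gauss(0.0, 1.0) for _ in range(half_dim)] for _ in range(side)]
sub_keys_b = [[rng.gauss(0.0, 1.0) for _ in range(half_dim)] for _ in range(side)]
query = [rng.gauss(0.0, 1.0) for _ in range(2 * half_dim)]

fast = product_key_top_k(query, sub_keys_a, sub_keys_b, k)
slow = brute_force_top_k(query, sub_keys_a, sub_keys_b, k)
print(fast == slow)
```

The shortcut is exact because any combined key in the overall top-k must have each of its halves in that half's own top-k; otherwise k better combinations could be built by swapping in the higher-scoring halves.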

How PEER works

A PEER layer combines three ingredients: product-key retrieval to select experts, single-neuron experts to keep each expert cheap, and multi-head retrieval to recover expressive power.

Product-key expert retrieval

Each token's hidden state is passed through a small query network to produce a query vector. Every expert in the pool has an associated product key, and the layer retrieves the $k$ experts whose keys have the highest inner product with the query. As in the 2019 memory layer, the full set of keys is formed from two sets of roughly $\sqrt{N}$ sub-keys, so the query is split in two, the top $k$ matches are found independently in each sub-key set (producing $k \times k$ candidate combinations), and a final top-$k$ is taken from those candidates. This reduces the retrieval cost from order $N$ to roughly order $\sqrt{N}$, which is what makes a pool of more than a million experts tractable. With a pool of $N = 1024 \times 1024 = 1{,}048{,}576$ experts, the layer compares each query half against only 1,024 sub-keys instead of comparing the full query against more than a million full keys. ^[1] He notes that the product-key structure functions as a learned index for routing, in the spirit of learned index structures. ^[1]
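
The saving can be made concrete by counting score evaluations per query. The sketch below is back-of-the-envelope arithmetic under simplifying assumptions: it counts one unit per sub-key comparison and one per candidate combination, ignores that sub-key comparisons use half-length vectors, and uses the hypothetical helper name `routing_score_counts`. The value $k = 16$ matches the per-head default mentioned in this article.

```python
import math


def routing_score_counts(num_experts, k):
    """Return (flat, product_key) score counts for one query.

    flat: one score per expert, as in a conventional router.
    product_key: two sub-key sets of sqrt(N) entries each, plus k * k
    candidate sums formed from the per-half top-k lists.
    """
    side = math.isqrt(num_experts)
    if side * side != num_experts:
        raise ValueError("this sketch assumes a perfect-square pool size")
    return num_experts, 2 * side + k * k


flat_scores, product_key_scores = routing_score_counts(1024 * 1024, k=16)
print(flat_scores, product_key_scores)  # 1048576 2304
```

Under these assumptions a single query touches 2,304 scores instead of 1,048,576, a reduction of more than 400 times.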

Single-neuron experts

PEER takes each expert to the smallest size possible. An expert is a multilayer perceptron with exactly one hidden neuron, computed as $e(x) = \sigma(u^\top x) v$, where $u$ and $v$ are learnable vectors and $\sigma$ is a nonlinearity such as ReLU or GELU. Setting the expert hidden dimension to one is what allows the pool to contain so many experts at a manageable total parameter cost. A PEER layer with $N$ experts holds on the order of $(2 d_{\text{model}} + 1) N$ parameters, since each expert contributes its input vector $u$, its output vector $v$ and a bias, but only the handful of retrieved experts ever perform computation for a given token. This is the mechanism by which PEER decouples total parameter count from per-token compute: inactive experts occupy memory but consume no FLOPs and no activation memory during a forward pass. ^[1]
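
A single expert and the pool's parameter count can be sketched directly from the formula above. This is a minimal illustration: ReLU is picked as the nonlinearity, the function names are hypothetical, and $d_{\text{model}} = 256$ is an arbitrary example width rather than a setting taken from the paper. The `with_bias` flag reflects the $(2 d_{\text{model}} + 1)$ count quoted in the text; the expert formula as written carries no explicit bias term, so the expert function below omits it.

```python
def relu(z):
    return max(0.0, z)


def single_neuron_expert(x, u, v, activation=relu):
    """e(x) = activation(u . x) * v, with u and v the same length as x."""
    hidden = activation(sum(ui * xi for ui, xi in zip(u, x)))
    return [hidden * vi for vi in v]


def expert_pool_params(d_model, num_experts, with_bias=True):
    """Parameters stored by the pool: u and v per expert, plus an optional scalar."""
    per_expert = 2 * d_model + (1 if with_bias else 0)
    return per_expert * num_experts


# u . x = 0.5 * 1.0 + 0.25 * 2.0 = 1.0, so the output equals v.
active = single_neuron_expert([1.0, 2.0], [0.5, 0.25], [2.0, -1.0])
# u . x = -0.5, so ReLU zeroes the expert's contribution.
silent = single_neuron_expert([1.0, 2.0], [-0.5, 0.0], [2.0, -1.0])

# Hypothetical width d_model = 256 with the million-expert pool.
total_params = expert_pool_params(256, 1024 * 1024)
print(active, silent, total_params)
```

With these example numbers the pool stores about 538 million parameters, while a token that retrieves 128 experts touches only 128 of the 1,048,576 $(u, v)$ pairs.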

Multi-head retrieval

Because a single one-neuron expert is very limited, PEER applies several independent retrievals in parallel, an idea analogous to multi-head attention. The layer uses $h$ query heads (the default is 8), and each head independently retrieves its own top-$k$ experts (the default is $k = 16$) from the same shared pool. Within each head the retrieved experts are combined using softmax weights computed from their retrieval scores, and the head outputs are summed to form the layer output. With the default settings the layer activates $h \times k = 8 \times 16 = 128$ experts per token, which corresponds to a granularity of 128 in the fine-grained MoE scaling law, even though the total pool exceeds a million. He also applies batch normalization to the query, which empirically spreads usage more evenly across the pool: with query batch normalization the experiments report about 99.98 percent of experts being used and a more even distribution of usage across experts (lower unevenness), along with lower perplexity. ^[1]
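
Putting the three ingredients together, a forward pass for one token can be sketched as below. This is an illustrative toy under several assumptions, not the paper's implementation: the query network is reduced to one linear projection per head, all heads share one set of product keys, ReLU is used as the expert nonlinearity, query batch normalization is omitted, and every name and size is hypothetical.

```python
import heapq
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve(query, keys_a, keys_b, k):
    """Product-key top-k: (score, expert_index) pairs, best first."""
    half = len(query) // 2
    best_a = heapq.nlargest(k, ((dot(query[:half], c), i) for i, c in enumerate(keys_a)))
    best_b = heapq.nlargest(k, ((dot(query[half:], c), j) for j, c in enumerate(keys_b)))
    side = len(keys_b)
    return heapq.nlargest(
        k, ((sa + sb, i * side + j) for sa, i in best_a for sb, j in best_b))


def peer_forward(x, query_weights, keys_a, keys_b, experts_u, experts_v, k):
    """Sum over heads of softmax-weighted single-neuron expert outputs."""
    out = [0.0] * len(experts_v[0])
    for w_q in query_weights:  # one query projection per head
        query = [dot(row, x) for row in w_q]
        hits = retrieve(query, keys_a, keys_b, k)
        gates = softmax([score for score, _ in hits])
        for gate, (_, idx) in zip(gates, hits):
            hidden = max(0.0, dot(experts_u[idx], x))  # ReLU(u . x)
            for d, v_d in enumerate(experts_v[idx]):
                out[d] += gate * hidden * v_d
    return out


# Hypothetical toy sizes: 16 x 16 = 256 experts, 2 heads, top-4 per head.
rng = random.Random(0)
d_model, d_key, side, heads, k = 8, 4, 16, 2, 4


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


query_weights = [rand_matrix(d_key, d_model) for _ in range(heads)]
keys_a = rand_matrix(side, d_key // 2)
keys_b = rand_matrix(side, d_key // 2)
experts_u = rand_matrix(side * side, d_model)
experts_v = rand_matrix(side * side, d_model)
token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

output = peer_forward(token, query_weights, keys_a, keys_b, experts_u, experts_v, k)
print(len(output), heads * k)
```

Only `heads * k` experts (8 in this toy setup, 128 with the defaults described above) ever read their `u` and `v` vectors for a given token; the rest of the pool is never touched.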

Results

He evaluates PEER on autoregressive language modeling using an isoFLOP methodology, in which models built from different layer types are trained under matched compute budgets and their compute-optimal configurations are compared. The experiments use two budgets, 6e18 and 2e19 FLOPs, with models pretrained on the C4 dataset and evaluated on several held-out corpora. At equal compute, PEER reaches the lowest compute-optimal perplexity, beating a dense feed-forward baseline, a coarse-grained MoE with 128 experts, and a product-key memory layer. ^[1]

The table below shows the perplexities (lower is better) reported for the PEER model across the evaluation sets at the two budgets. ^[1]

Evaluation set	6e18 FLOPs	2e19 FLOPs
C4	20.63	16.45
The Pile	19.01	14.99
Lambada	17.65	10.33
Curation Corpus	20.68	16.34
Wikitext-103	25.48	19.09

At the 2e19 FLOP budget, the head-to-head comparison on three of these sets illustrates the gap to the baselines reported in the paper. ^[1]

Layer type (2e19 FLOPs)	C4	The Pile	Lambada
Dense feed-forward	18.31	18.19	12.28
Coarse MoE (128 experts)	17.12	17.41	12.97
Product-key memory	17.36	16.34	11.18
PEER	16.45	14.99	10.33
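
To read the gaps in relative terms, the perplexities in the table above can be turned into percentage reductions. The snippet is plain arithmetic on the tabulated values; the percentages it prints are derived here for illustration and are not figures quoted from the paper, and the helper name `relative_reduction` is hypothetical.

```python
# Perplexities at the 2e19 FLOP budget, copied from the table above.
ppl_2e19 = {
    "Dense feed-forward": {"C4": 18.31, "The Pile": 18.19, "Lambada": 12.28},
    "Coarse MoE (128 experts)": {"C4": 17.12, "The Pile": 17.41, "Lambada": 12.97},
    "Product-key memory": {"C4": 17.36, "The Pile": 16.34, "Lambada": 11.18},
    "PEER": {"C4": 16.45, "The Pile": 14.99, "Lambada": 10.33},
}


def relative_reduction(baseline, model):
    """Percentage by which `model` perplexity is below `baseline` perplexity."""
    return 100.0 * (baseline - model) / baseline


reductions = {
    name: {dataset: relative_reduction(row[dataset], ppl_2e19["PEER"][dataset])
           for dataset in row}
    for name, row in ppl_2e19.items()
    if name != "PEER"
}

for name, row in reductions.items():
    print(name, {dataset: round(value, 1) for dataset, value in row.items()})
```

On C4, for example, this gives a reduction of about 10.2 percent relative to the dense baseline.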

He also reports ablations supporting the design. Increasing the total number of experts while holding the active count fixed improves the loss, consistent with the fine-grained scaling-law prediction, and increasing the number of active experts (granularity) also helps. Query batch normalization improves both expert utilization and perplexity, with the largest effect near the compute-optimal region. ^[1]

Significance and limitations

PEER is notable as a concrete demonstration that the number of experts in a sparse layer can be pushed to the million scale, far beyond the tens or hundreds typical of deployed MoE systems, and that doing so can improve the compute-performance frontier rather than merely adding inert parameters. By turning the product-key memory into a routing mechanism over tiny learnable experts, it offers one answer to how the very high granularity favored by MoE scaling laws might be implemented in practice. He also suggests that a large, expandable pool of fine-grained experts is a natural substrate for lifelong learning, since new experts could in principle be added to absorb new knowledge without retraining the whole model. ^[1]

Several caveats apply. The results come from a single-author preprint and were obtained at relatively small scale, with compute budgets up to 2e19 FLOPs and sub-billion active-parameter models, so it is not established that the advantage persists at frontier scale or under the systems constraints of large training runs. Routing to a million tiny experts adds implementation complexity, and balanced utilization depends on the query batch normalization trick. A single one-neuron expert is also extremely limited on its own; PEER's expressiveness comes from combining 128 of them per token. Finally, the method has not, as of mid-2026, been reported as the feed-forward layer of a widely deployed production model. These are reasons to treat PEER as a promising research direction rather than a settled architecture.

Relationship to other methods

PEER can be read as an extreme, fine-grained form of mixture of experts. A conventional MoE layer routes each token to one or a few experts that are each about the size of a dense feed-forward block, using a router that scores every expert; PEER instead routes to many single-neuron experts from a pool more than a thousand times larger, using sub-linear product-key retrieval in place of a linear-cost router. In relation to product-key memory, PEER directly generalizes the Lample et al. layer: where a memory layer retrieves and averages static learnable value vectors, PEER retrieves and applies small input-dependent expert functions, so the retrieved items perform computation rather than only contributing a stored vector. The fine-grained MoE scaling laws of Krajewski, Ludziejewski and colleagues supply the theoretical motivation, and PEER pushes their granularity parameter toward its practical limit. ^[1]^[2]^[3]

PEER also belongs to the broader family of techniques that decouple parameter count from active compute, alongside other sparse and conditional-computation methods such as routed MoE and parameter-sharing architectures like mixture of recursions. It shares with the original product-key memory work the goal of adding capacity at near-constant FLOPs. ^[1]^[2] It has also inspired open-source reimplementations that make the PEER block available for experimentation. ^[4]

References

1. Xu Owen He. "Mixture of A Million Experts." arXiv:2407.04153, July 4, 2024. https://arxiv.org/abs/2407.04153
2. Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou. "Large Memory Layers with Product Keys." Advances in Neural Information Processing Systems (NeurIPS) 2019. arXiv:1907.05242. https://arxiv.org/abs/1907.05242
3. Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, et al. "Scaling Laws for Fine-Grained Mixture of Experts." Proceedings of the 41st International Conference on Machine Learning (ICML) 2024. arXiv:2402.07871. https://arxiv.org/abs/2402.07871
4. Phil Wang (lucidrains). "PEER-pytorch: Pytorch implementation of the PEER block from the paper Mixture of A Million Experts." GitHub. https://github.com/lucidrains/PEER-pytorch
