Dynamic inference
Last reviewed
Apr 30, 2026
Sources
20 citations
Review status
Source-backed
Revision
v1 · 4,150 words
Dynamic inference, also called input-adaptive inference or adaptive computation, is the family of techniques that adapts a neural network's computation to each input at runtime instead of always running the full static computation graph. Standard inference executes the same set of operations for every example, regardless of how easy or hard the example is. Dynamic inference exploits the fact that many inputs do not need the entire model: an easy image can be classified by a shallow sub-network, a common token in a language model can skip several transformer layers, and a routine query can be handled by a small draft model with the large target model only validating the result. The goal is to reduce the average compute and latency of inference while preserving the accuracy of a comparable static model.
The topic is the subject of a comprehensive survey by Han, Huang, Song, Yang, Wang, and Wang (TPAMI 2022, arXiv:2102.04906), which organises the field into instance-wise, spatial-wise, and temporal-wise dynamic networks. In production large language models the dominant forms of dynamic inference are sparse Mixture of Experts routing (Shazeer et al. 2017, Switch Transformer 2021), Mixture of Depths (Raposo et al. 2024), and Speculative Decoding (Leviathan, Kalman, Matias 2023). On the deployment side, systems such as Orca (Yu et al. 2022) and vLLM (Kwon et al. 2023) extend the idea to scheduling and memory: only the active part of the request gets compute, only the live part of the KV cache gets memory.
A conventional feed-forward pass does the same amount of work for every input. A photo of a clear lemon on a white background runs through every block of a ResNet-50 just like a cluttered, occluded scene. A short, common phrase like "the cat sat on the" runs through every layer and every expert of a large language model just like a multi-step reasoning question. The first pressure behind dynamic inference is that much of this compute is empirically wasted: studies behind early-exit work such as BranchyNet (Teerapittayanon, McDanel, Kung 2016) and MSDNet (Huang et al. 2017) report that more than half of standard image-classification examples are already correctly classified, with high confidence, at intermediate depths. The same redundancy holds for token-level computation in transformers, which motivates Mixture of Depths and per-token routing.
The second pressure is economic. Frontier model serving has become dominated by inference cost rather than training cost. A single forward pass of a dense 70-billion-parameter model on a long context is too slow and too expensive to run for every user query at web scale. The third pressure is on-device inference: phones, wearables, and edge cameras have a small power envelope, so they need to spend FLOPs only on the inputs that need them.
The sibling concept is static inference, where the full graph runs every time. The contrast is sharpest at the level of compute predictability and accuracy.
| Property | Static inference | Dynamic inference |
|---|---|---|
| Compute per input | Identical for every input | Varies by input difficulty |
| Latency | Predictable, worst-case = average-case | Variable; tail latency depends on hard inputs |
| Hardware utilisation | High; easy to batch | Lower or harder to batch; needs custom kernels |
| Accuracy at fixed average compute | Lower (forced to be uniform) | Higher (compute concentrated on hard cases) |
| Implementation | Plain forward pass | Routers, gates, exit heads, schedulers |
| Determinism | Fully deterministic | Often deterministic per fixed routing seed, but routing depends on input |
Dynamic inference is complementary to static model compression such as quantization, weight pruning, and knowledge distillation. Quantization shrinks every operator uniformly. Dynamic inference instead chooses, per input, how many operators to run. The two are routinely combined: for example, a 4-bit quantised LLM served with speculative decoding and PagedAttention.
The Han et al. 2022 survey divides dynamic neural networks into three top-level categories. A complementary axis, used heavily in modern LLM serving, is whether the dynamism happens inside one forward pass (MoE, MoD, early exit) or across multiple forward passes (speculative decoding, scheduling).
| Family | What varies per input | Representative methods |
|---|---|---|
| Sample-wise (instance-wise) | Depth, width, or path through one model | BranchyNet, MSDNet, DeeBERT, FastBERT, SkipNet, BlockDrop, Slimmable Networks, Once-for-All, cascade models |
| Spatial-wise | Compute spent on different image regions or tokens | Adaptive resolution, A-ViT token pruning, region-of-interest backbones |
| Temporal-wise | Compute spent on different frames or time steps | Adaptive frame skipping in video, dynamic computation in RNNs |
| Model-component dynamic | Which sub-modules of the network fire | Mixture of Experts, Mixture of Depths |
| Multi-pass dynamic | How many forward passes are needed and which model runs them | Speculative decoding, draft-and-verify, self-speculation |
| Serving-level dynamic | Which requests share which batch and which memory | Continuous batching (Orca), PagedAttention (vLLM), KV-cache eviction |
These methods choose, per input, how much of a single model to run.
Early-exit networks attach intermediate classifiers to a backbone and let easy examples leave through one of these heads instead of going all the way to the final layer. Confidence at the early head is the usual exit criterion. BranchyNet (Teerapittayanon, McDanel, Kung 2016, ICPR) was the first widely cited paper to formalise this for vision: side-branch classifiers are added to LeNet, AlexNet, and ResNet, and a softmax-entropy threshold decides whether to exit. MSDNet (Huang, Chen, Li, Wu, van der Maaten, Weinberger 2017) addressed a failure mode of such designs, namely that classifiers attached to early layers degrade the features that later classifiers rely on. It maintains multi-scale features through a two-dimensional dense network so that each exit can use both coarse and fine features.
The same idea was ported to NLP. DeeBERT (Xin, Tang, Lee, Yu, Lin, ACL 2020) attaches early-exit classifiers to BERT layers, and at inference each example exits at the first layer whose entropy is below a tunable threshold. The authors report roughly 40% inference time saved with minimal quality loss. FastBERT and CascadeBERT extend this with self-distillation between early and late heads, and with cascade-style aggregation of multiple early predictions.
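The entropy-threshold exit rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the DeeBERT or BranchyNet code: `exit_heads` is a hypothetical list of callables standing in for "backbone up to depth i plus its classifier", and the threshold of 0.3 nats is an arbitrary choice.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; low entropy means a confident prediction.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit_predict(x, exit_heads, threshold):
    """Return (predicted_class, exit_index) for one input."""
    last = len(exit_heads) - 1
    for i, head in enumerate(exit_heads):
        probs = softmax(head(x))  # deeper heads are only evaluated if needed
        if i == last or entropy(probs) < threshold:
            return probs.index(max(probs)), i

# Hypothetical heads: the shallow head is confident only on "easy" inputs.
heads = [
    lambda x: [6.0, 0.0, 0.0] if x == "easy" else [0.2, 0.1, 0.0],
    lambda x: [7.0, 0.0, 0.0] if x == "easy" else [0.1, 3.0, 0.0],
]
easy_result = early_exit_predict("easy", heads, threshold=0.3)
hard_result = early_exit_predict("hard", heads, threshold=0.3)
```

In this toy setup the easy input leaves at the first head and the hard input falls through to the last one. Lowering the threshold trades speed for accuracy, because fewer examples are allowed to exit early.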
A cascade is more general than an early-exit head: it runs a cheap model first, and only invokes the expensive model when the cheap model is uncertain. Bolukbasi, Wang, Dekel, Saligrama (ICML 2017, "Adaptive Neural Networks for Efficient Inference") presented a clean cascade formulation in which a sequence of networks of increasing cost is trained, and at test time a per-layer gate decides whether to keep going. They reported up to 2.8x speedup on ImageNet with less than 1% top-5 accuracy loss. Cascades are widely used in production for content moderation, spam filtering, and image-quality classification, where most inputs are easy.
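A two-stage cascade can be sketched as follows. This is a simplified illustration under stated assumptions rather than the Bolukbasi et al. formulation: the two models are hypothetical stand-ins that return a `(label, confidence)` pair, the gate is a plain confidence threshold, and the cost figures in `expected_cost` are abstract units.

```python
def cascade_predict(x, cheap_model, expensive_model, min_confidence=0.9):
    """Run the cheap model first; escalate only when it is unsure."""
    label, confidence = cheap_model(x)
    if confidence >= min_confidence:
        return label, "cheap"
    label, _ = expensive_model(x)
    return label, "expensive"

def expected_cost(cheap_cost, expensive_cost, escalation_rate):
    # Every input pays for the cheap model; only escalated inputs
    # additionally pay for the expensive one.
    return cheap_cost + escalation_rate * expensive_cost

# Hypothetical stand-in models for a spam filter.
def cheap_model(text):
    return ("spam", 0.99) if "free money" in text else ("ham", 0.6)

def expensive_model(text):
    return ("ham", 0.95)

obvious = cascade_predict("free money now", cheap_model, expensive_model)
unclear = cascade_predict("meeting at noon", cheap_model, expensive_model)
```

With a cheap model costing 1 unit, an expensive model costing 10 units, and 20% of inputs escalated, `expected_cost(1.0, 10.0, 0.2)` gives an average of 3 units per input instead of 10, which is why cascades pay off when most inputs are easy.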
Layer-skipping methods learn a per-input policy for which layers or blocks to execute. SkipNet (Wang et al., ECCV 2018) and BlockDrop (Wu et al., CVPR 2018) train a gating network with reinforcement learning or Gumbel-softmax to decide which residual blocks of a ResNet to run on each example. The result is a model whose effective depth varies per input.
A different sample-wise idea is to make the network's width adjustable at runtime. Slimmable Networks (Yu, Yang, Xu, Yang, Huang, ICLR 2019) train a single network that can be executed at multiple widths through switchable batch normalisation. The same set of weights serves devices with different power budgets. Once-for-All (Cai, Gan, Wang, Zhang, Han, ICLR 2020) generalises this: train one supernet from which over 10^19 specialised sub-networks can be sampled and deployed without retraining. OFA reported 80% ImageNet top-1 accuracy under a 600M-MAC mobile constraint. While the supernet itself is static, the per-deployment sub-network choice makes inference dynamic in practice.
For images and image-like data, different regions are not equally informative. Spatial-wise dynamic networks send more compute to important regions and skip the rest. Examples include adaptive image resolution (low-resolution pass first, high-resolution refinement only where needed), patch-skipping vision transformers (A-ViT, DynamicViT, ALBEF token pruning), and region-of-interest backbones used in detection. The taxonomy is given in detail in the Han et al. 2022 survey, sections on spatial dynamism. Spatial methods plug naturally into the same training pipelines as standard ConvNets and ViTs because they only change which patches participate in each layer.
Video and streaming text contain a lot of redundancy across time. Temporal-wise dynamic networks decide on a per-frame or per-time-step basis whether to run the full model or reuse a previous prediction. Adaptive frame-skipping models for action recognition fall into this group, as do dynamic-depth recurrent networks like the Adaptive Computation Time RNN (Graves 2016). In streaming LLM serving, the same idea reappears: only the new tokens get full attention, and earlier tokens can be summarised, evicted, or partially attended via H2O (Zhang et al. 2023) or StreamingLLM (Xiao, Tian, Chen, Han, Lewis, ICLR 2024).
Mixture of Experts is the most successful dynamic-inference method in production large models. The idea is to replace a single dense feed-forward block with a set of N parallel experts and a small router that, for each token, picks k of them. Total parameter count is high (up to N times the dense block), but per-token compute is k times the dense block, where k is much smaller than N.
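The routing step can be made concrete with a small pure-Python sketch. It assumes the router logits are already given (in a real model they come from a learned linear layer applied to the token), and the experts are hypothetical toy functions. The weights are renormalised with a softmax over the selected experts only, which is one common convention.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_layer(token, experts, router_logits, k):
    """Weighted sum of the outputs of the k selected experts only."""
    out = [0.0] * len(token)
    used = []
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)  # the other N - k experts never run
        out = [o + weight * e for o, e in zip(out, expert_out)]
        used.append(idx)
    return out, used

# Hypothetical experts: each one just scales the token by a constant.
experts = [lambda v, s=s: [s * x for x in v] for s in (1.0, 2.0, 3.0, 4.0)]
output, used = moe_layer([1.0, -2.0], experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

Here N = 4 and k = 2: the layer holds four experts' worth of parameters, but each token pays for only two expert evaluations.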
The modern form was introduced by Shazeer, Mirhoseini, Maziarz, Davis, Le, Hinton, Dean in "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (ICLR 2017). Their MoE layer used noisy top-k gating, where the gate adds Gaussian noise to logits before selecting the top-k experts. They reported up to 137-billion-parameter language models with only minor compute overhead per token, achieving over 1000x more model capacity than dense baselines for the same compute. They also introduced an importance loss and a load loss to keep experts roughly balanced.
Switch Transformer (Fedus, Zoph, Shazeer, JMLR 2022, originally arXiv 2021) simplified the recipe by routing each token to exactly one expert (top-1). The authors showed that top-1 routing trains stably at scale, reduces routing overhead, and produces roughly 7x pre-training speedup over dense T5-Base/Large. They popularised the expert capacity factor: each expert can process at most (tokens per batch / number of experts) × capacity factor tokens, and tokens routed to an expert beyond that capacity are dropped (token dropping). They scaled the approach to a trillion-parameter model on TPUs.
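Expert capacity and token dropping can be illustrated with a short sketch. It uses the Switch Transformer definition of capacity, (tokens per batch / number of experts) × capacity factor; rounding up with `ceil` and dropping in arrival order are assumptions made here for simplicity, and the function names are hypothetical.

```python
import math

def expert_capacity(tokens_per_batch, num_experts, capacity_factor):
    # Rounding up is an assumption of this sketch.
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)

def dispatch_top1(assignments, num_experts, capacity_factor):
    """assignments[t] is the expert chosen by the router for token t.

    Returns (buckets, dropped): the token indices each expert processes,
    and the token indices that overflowed their expert's capacity.
    """
    capacity = expert_capacity(len(assignments), num_experts, capacity_factor)
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for token_idx, expert_idx in enumerate(assignments):
        if len(buckets[expert_idx]) < capacity:
            buckets[expert_idx].append(token_idx)
        else:
            dropped.append(token_idx)  # skips the expert via the residual path
    return buckets, dropped

# Eight tokens, four experts, and a router that favours expert 0.
assignments = [0, 0, 0, 0, 1, 2, 0, 3]
tight_buckets, tight_dropped = dispatch_top1(assignments, 4, capacity_factor=1.0)
loose_buckets, loose_dropped = dispatch_top1(assignments, 4, capacity_factor=2.0)
```

With a capacity factor of 1.0 each expert accepts two tokens, so three of the five tokens sent to expert 0 are dropped. Raising the factor to 2.0 drops only one, at the price of padding the less popular experts to a larger fixed shape.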
GShard (Lepikhin, Lee, Xu, Chen, Firat, Huang, Krikun, Shazeer, Chen 2020) provided the systems substrate. It is a set of XLA annotations that automatically shard a top-2 MoE Transformer across TPU pods. They trained a 600-billion-parameter multilingual translation model on 2048 TPU v3 chips in four days. GShard introduced the auxiliary load-balancing loss that became standard.
Left alone, the router tends to collapse onto a few experts. Standard mitigations include an auxiliary load-balancing loss that penalises imbalance between expert usage and routing probabilities, expert capacity factors that put a hard cap on tokens per expert, noise injected into the router, and a router z-loss that penalises large router logits (introduced in ST-MoE, Zoph et al. 2022). DeepSeek-V3 (Liu et al. 2024) introduced an auxiliary-loss-free strategy that adjusts a per-expert bias term during training instead of adding a loss term, claiming better quality at the same balance.
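As an illustration of the auxiliary load-balancing idea, the sketch below follows the form used in the Switch Transformer paper: α · N · Σ f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability assigned to expert i. The function name and the default α are illustrative, and the input is a plain list of per-token probability vectors rather than a tensor.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs[t][i] is the router probability of expert i for token t."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])

    # f_i: fraction of tokens whose top-1 choice is expert i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[probs.index(max(probs))] += 1
    f = [c / num_tokens for c in counts]

    # P_i: average probability mass the router puts on expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

balanced = load_balancing_loss([[0.9, 0.1], [0.1, 0.9]])
collapsed = load_balancing_loss([[0.9, 0.1], [0.9, 0.1]])
```

With perfectly uniform usage the sum equals 1/N, so the loss evaluates to α; when the router sends every token to one expert the value rises, which is the gradient signal that pushes traffic back toward the idle experts.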
MoE has moved from research to production:
| Model | Year | Architecture | Active per token |
|---|---|---|---|
| GShard | 2020 | 600B params, top-2 routing | ~50B |
| GLaM (Du et al., Google) | 2022 | 1.2T params, top-2 routing | ~97B |
| Switch Transformer | 2021 | Up to 1.6T params, top-1 | ~6B |
| Mixtral 8x7B (Mistral) | 2023 | 8 experts of 7B, top-2 | 12.9B |
| Mixtral 8x22B | 2024 | 8 experts of 22B, top-2 | 39B |
| DeepSeek-V2 | 2024 | 236B params, fine-grained experts | 21B |
| DeepSeek-V3 | 2024 | 671B params, 256 routed + 1 shared expert per layer | 37B |
| Grok-1 (xAI) | 2024 | 314B params, 8 experts, top-2 | 86B |
| Qwen2-MoE / Qwen3-MoE | 2024-2025 | Up to 235B params with shared + routed experts | varies |
DeepSeek-V2 and V3 use the shared expert pattern: in addition to N routed experts (selected by the router), every token also passes through a small set of always-on shared experts. The shared experts capture common knowledge; the routed experts specialise. DeepSeek-V3's architecture has 1 shared expert and 256 routed experts per layer, with the router picking 8 of those 256.
Mixture of Depths (Raposo, Ritter, Richards, Lillicrap, Humphreys, Santoro 2024, arXiv:2404.02258) does for layer-skipping what MoE did for feed-forward blocks. Each transformer layer has a router that selects a top-k subset of tokens for which attention and the MLP are actually computed; the remaining tokens skip the layer through a residual connection. The compute budget per layer is fixed, so total FLOPs are bounded and predictable, but each token's effective depth depends on the input. Raposo et al. report that MoD models match or beat baselines at a fraction of the FLOPs, and that the method composes with MoE (a combination the paper calls MoDE). Unlike per-token early exit, MoD makes a fresh routing decision at every layer, so a token may participate, skip, then participate again.
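A single Mixture of Depths layer can be sketched as below. This is a toy version under stated assumptions: hidden states are scalars instead of vectors, `block` is a hypothetical stand-in for attention plus MLP, and router scores are passed in directly. Scaling the block output by the router score follows the paper's description of keeping the router on the gradient path.

```python
def mod_layer(hidden, router_scores, capacity, block):
    """Apply `block` to the `capacity` highest-scoring tokens only.

    hidden: one scalar hidden state per token (toy stand-in for a vector).
    Returns (new_hidden, indices_of_processed_tokens).
    """
    order = sorted(range(len(hidden)),
                   key=lambda i: router_scores[i], reverse=True)
    chosen = set(order[:capacity])
    new_hidden = []
    for i, h in enumerate(hidden):
        if i in chosen:
            # Residual plus router-weighted block output.
            new_hidden.append(h + router_scores[i] * block(h))
        else:
            # Skipped token: pure residual, zero block FLOPs.
            new_hidden.append(h)
    return new_hidden, sorted(chosen)

hidden = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.2, 0.8]
out, processed = mod_layer(hidden, scores, capacity=2, block=lambda h: 10.0)
```

Because `capacity` is fixed per layer, the block always runs on exactly two of the four tokens here, whichever two the router prefers. Stacking several such layers with different scores gives each token its own effective depth.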
Autoregressive decoding is sequential: each token requires one forward pass through the full LLM. Speculative decoding breaks this serial bottleneck by running a small draft model that proposes K tokens, then validating those K tokens with one parallel forward pass of the target model. The longest validated prefix is accepted, the rest is discarded, and the loop continues. The output distribution is exactly that of the target model: speculative decoding is exact, not approximate, when the verification step uses rejection sampling.
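The draft-then-verify loop with rejection sampling can be sketched for a toy vocabulary. The names here are hypothetical: `draft_dist` and `target_dist` are callables that map a token prefix to a list of next-token probabilities. In a real system the target distributions for all K + 1 positions come from one batched forward pass; here they are computed by separate calls for clarity.

```python
import random

def sample(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def speculative_step(prefix, draft_dist, target_dist, k, rng):
    """Return the tokens emitted by one draft-and-verify round (1 to k + 1)."""
    # 1. The draft model proposes k tokens autoregressively.
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(k):
        q = draft_dist(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The target model scores every drafted position plus one extra.
    p_dists = [target_dist(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept left to right with probability min(1, p / q).
    emitted = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
        else:
            # On rejection, resample from the residual max(0, p - q) and stop.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            emitted.append(sample(residual, rng))
            return emitted

    # All k drafts accepted: the target's extra distribution yields a bonus token.
    emitted.append(sample(p_dists[k], rng))
    return emitted
```

The acceptance rule and the residual resampling together are what make the output distribution equal to the target model's: a token x is emitted with probability min(p, q) from acceptance plus (p − min(p, q)) from the residual, which sums to p.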
The original method appeared in two near-concurrent papers, posted to arXiv in late 2022 and early 2023:
| Paper | Authors | Year | Venue | Key idea |
|---|---|---|---|---|
| Fast Inference from Transformers via Speculative Decoding | Leviathan, Kalman, Matias (Google) | 2022 (arXiv) | ICML 2023 | Draft model proposes K, target verifies in one pass; 2x-3x on T5-XXL |
| Accelerating Large Language Model Decoding with Speculative Sampling | Chen et al. (DeepMind) | 2023 | arXiv | Same recipe, formal acceptance probability with rejection sampling |
Variants and refinements:
| Method | Year | Change vs base recipe |
|---|---|---|
| Medusa (Cai, Li, Geng, Peng, Lee, Chen, Dao) | 2024 | Adds extra prediction heads to the target model itself; tree-based attention verifies several continuations at once. 2.2-3.6x speedup. |
| EAGLE (Li, Wei, Zhang, Zhang) | 2024 | Predicts at the second-to-top hidden-feature level instead of token level; 2.7-3.5x on LLaMA2-Chat 70B |
| Lookahead Decoding | 2024 | No draft model; uses Jacobi iteration plus verification |
| ReDrafter (Apple) | 2024 | Recurrent draft head; tested on MLX |
| Self-speculative decoding (Zhang et al.) | 2024 | Skips layers of the same model to form an internal draft, no separate model needed |
In practice, speculative decoding works well when the draft model agrees often with the target model. Acceptance rates of 60-80% are typical for matched draft-target pairs. The method is now standard in production stacks: vLLM, TensorRT-LLM, and TGI all support several variants out of the box.
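The link between acceptance rate and throughput can be made concrete with the closed form from the analysis in Leviathan et al.: if each drafted token is accepted independently with probability α and the draft proposes γ tokens, one target pass yields (1 − α^(γ+1)) / (1 − α) tokens in expectation. The independence assumption is a simplification, and the function name below is illustrative.

```python
def expected_tokens_per_target_pass(acceptance_rate, draft_len):
    """Expected tokens emitted per target forward pass.

    Assumes every drafted token is accepted independently with the same
    probability, which real draft-target pairs only approximate.
    """
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)  # every draft accepted, plus the bonus token
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)

low = expected_tokens_per_target_pass(0.6, draft_len=4)
high = expected_tokens_per_target_pass(0.8, draft_len=4)
```

At an acceptance rate of 0.6 with four drafted tokens this gives about 2.3 tokens per target pass, and at 0.8 about 3.4. The wall-clock speedup is lower than these figures because the draft model's own forward passes are not free.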
For LLM serving, dynamic inference also happens between requests. Two systems are foundational:
Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu, Jeong, Kim, Kim, Chun, OSDI 2022) introduced iteration-level scheduling, also known as continuous batching. Naive batching forces the entire batch to wait until the slowest sequence in the batch finishes generating; continuous batching lets the scheduler insert a new request as soon as any sequence in the current batch finishes. Orca also introduced selective batching: operations whose shapes vary per request (attention) are run per request, while shape-agnostic operations such as feed-forward blocks are batched across requests. The reported throughput improvement over NVIDIA FasterTransformer was up to 36.9x at the same latency.
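The difference between request-level and iteration-level scheduling shows up even in a toy simulation. The sketch below is not Orca's scheduler: it counts decoding iterations only, ignores prefill and memory limits, and assumes each request is described solely by the number of tokens it will generate (at least one).

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request at once."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill free slots before every iteration.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one iteration generates one token per active request
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

# One long request mixed in with several short ones.
lengths = [2, 10, 2, 2, 2, 2]
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

In this example static batching needs 14 iterations because the first batch is held for 10 steps by its long request, while continuous batching finishes in 10 by cycling the short requests through the second slot.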
PagedAttention / vLLM (Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica, SOSP 2023) treats the KV cache like virtual memory: the cache is split into fixed-size blocks (pages), each request gets a page table, and pages can be shared across requests (for example, a shared system prompt). This removes most fragmentation and waste in GPU memory; the authors report 2-4x higher throughput than FasterTransformer and Orca at the same latency.
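The page-table idea can be sketched as a small block allocator. This is a hypothetical illustration and not vLLM's API: the class and method names are invented, the cache stores no actual keys or values, and block sharing with copy-on-write is left out.

```python
class PagedKVCache:
    """Toy block allocator: maps each request to a list of physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        # A new block is needed only when every allocated block is full.
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def free(self, request_id):
        # Blocks return to the pool as soon as the request finishes.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("a")  # 17 tokens span two 16-token blocks
cache.append_token("b")
blocks_a = len(cache.block_tables["a"])
free_before = len(cache.free_blocks)
cache.free("a")
free_after = len(cache.free_blocks)
```

Memory is claimed one block at a time as a sequence grows, so the waste per request is bounded by one partially filled block rather than by a reservation sized for the maximum sequence length.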
KV-cache compression and eviction methods take dynamic inference into the memory system itself. H2O (Zhang et al., NeurIPS 2023) keeps a small set of "heavy hitter" tokens that contribute most attention mass and evicts the rest. StreamingLLM (Xiao, Tian, Chen, Han, Lewis, ICLR 2024) keeps an attention sink (the very first tokens) plus a sliding window, and shows that with this layout LLMs can serve 4 million-token streams at up to 22.2x speedup. These are dynamic because what gets kept in cache depends on what the model has seen so far.
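The StreamingLLM cache layout, attention sinks plus a sliding window, reduces to a simple rule for which positions stay in the cache. The sketch below shows only that index selection; the function name and parameters are illustrative, and the position re-indexing that the paper applies inside the cache is not modelled.

```python
def streaming_keep_positions(seq_len, num_sink, window):
    """Positions retained in the KV cache after seeing seq_len tokens.

    Keeps the first `num_sink` tokens (attention sinks) and the most
    recent `window` tokens; everything in between is evicted.
    """
    sinks = list(range(min(num_sink, seq_len)))
    recent_start = max(len(sinks), seq_len - window)
    return sinks + list(range(recent_start, seq_len))

short = streaming_keep_positions(seq_len=4, num_sink=2, window=3)
long = streaming_keep_positions(seq_len=10, num_sink=2, window=3)
```

For a 10-token stream with two sinks and a window of three, positions 0, 1, 7, 8, and 9 are kept. The cache size is capped at `num_sink + window` entries no matter how long the stream grows, which is what makes very long streams feasible.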
Other related techniques include Skeleton-of-Thought (Ning et al. 2023), which produces a high-level outline first and then expands sections in parallel, and draft-then-verify retrieval augmentation, where an embedding-based retriever drafts candidate passages that the LLM verifies.
Latency-critical edge inference. Phones, AR/VR headsets, smart cameras, and IoT devices have a hard wall on power and a soft wall on latency. Early-exit and slimmable networks let the same weights serve a 0.5W and a 5W budget.
LLM serving at scale. This is where most of the dollar value of dynamic inference lives. Without speculative decoding and continuous batching, a 70B model serving thousands of concurrent users at sub-second latency is uneconomic. With them, the same hardware budget supports several times the throughput.
Foundation-model compute reduction. Frontier model training has become so expensive that providers want to keep the model itself sparse. MoE makes this work: Mixtral 8x7B, DeepSeek-V3, Qwen3-MoE, and Grok all use MoE because it grows effective parameter count while keeping per-token FLOPs bounded.
Energy efficiency. Data-centre operators increasingly track joules per token. MoE plus speculative decoding plus PagedAttention can drop joules per token by an order of magnitude versus a naive dense forward pass loop.
Content moderation cascades. A common production pattern is a small, fast classifier that filters obvious cases and a heavier model that handles the residual. This is a cascade in the Bolukbasi 2017 sense, applied to abuse detection, NSFW filtering, and ad-policy review.
Dynamic inference is not free. The honest accounting includes several costs.
Hardware utilisation. GPUs love dense, regular matrix multiplication. A sparse top-1 MoE layer needs a scatter-and-gather kernel that fights against this hardware preference. Kernels such as Megablocks (Gale, Narayanan, Young, Zaharia 2023) and ScatterMoE close some of the gap, but the wall-clock speedup of MoE is often less than its FLOPs reduction. Dynamic depth via early exit is even harder to batch.
Routing overhead. A router or exit head adds latency. For very small models the router can cost more than the compute it saves; for large models its cost is negligible.
Training complexity. Auxiliary balance losses, capacity factors, and special learning-rate schedules all complicate MoE training. Routing instabilities still occasionally show up, which is why DeepSeek-V3 introduced an auxiliary-loss-free balancer.
Determinism and verifiability. Static models are bit-for-bit reproducible given fixed seeds. Dynamic models that hinge on near-tied decisions (load-balanced routing, speculative-decoding acceptance) are deterministic in principle given the same input and seed, but small numerical differences caused by batch composition can flip a routing outcome.
Tail latency. A network whose easy inputs are fast and hard inputs are slow has a wider latency distribution than a static one. SLO-driven serving pipelines must size capacity to the p99 case.
Standard GPUs handle dynamic inference through software, with custom CUDA kernels closing the performance gap.
| Workload | Common kernel/runtime |
|---|---|
| MoE forward and backward | Megablocks, ScatterMoE, Tutel (Microsoft), FasterMoE |
| Speculative decoding | TensorRT-LLM, vLLM, TGI |
| PagedAttention / KV-cache paging | vLLM (origin), TGI, TensorRT-LLM |
| Continuous batching | Orca (origin), vLLM, TGI, NVIDIA Triton Inference Server |
| Early exit | Custom; PyTorch torch.compile + dynamic shapes |
TPUs were the original home of GShard and remain a strong target through the Pathways runtime. Cerebras, Groq, and SambaNova chips have advantages for sparse and dynamic patterns because their on-chip memory and dataflow architectures avoid some of the gather/scatter overhead. On NVIDIA GPUs, structured-sparsity support (introduced with Ampere) and FP8 support (introduced with Hopper) benefit static compression more than dynamic routing, but the faster interconnect on Hopper and Blackwell helps MoE expert-parallel layouts.
Most dynamic-inference techniques in modern LLM serving are bundled into a few well-known frameworks.
| Framework | Maintainer | Notable dynamic-inference features |
|---|---|---|
| vLLM | UC Berkeley / community | PagedAttention, continuous batching, speculative decoding, prefix caching |
| TensorRT-LLM | NVIDIA | In-flight batching (NVIDIA's term for continuous batching), speculative decoding, MoE kernels |
| Triton Inference Server | NVIDIA | Multi-model orchestration, dynamic batching, ensembles of cascades |
| DeepSpeed-MII | Microsoft | Dynamic batching, MoE, ZeRO-Inference |
| TGI (Text Generation Inference) | Hugging Face | Continuous batching, speculative decoding, paged attention |
| FasterTransformer | NVIDIA (legacy) | Static batching, optimised kernels |
| llama.cpp | Community | KV-cache management, speculative decoding, partial layer offload |
MoE in frontier models. Through 2024 and 2025, most newly-announced large open-weight models have been MoE: DeepSeek-V2/V3, Qwen2/3-MoE, Grok-1, Mixtral 8x22B. The pattern is now standard rather than experimental.
Speculative decoding becomes the default. By 2025, speculative decoding (often EAGLE or Medusa) is enabled by default in most production LLM serving stacks. vLLM and TensorRT-LLM ship multiple variants.
Test-time compute and reasoning models. A related sense of dynamic inference appeared with OpenAI's o1 (2024) and DeepSeek-R1 (2025): the model spends more compute (more chain-of-thought tokens) on harder problems. The mechanism differs from MoE or speculative decoding, but the philosophy is the same.
On-device dynamic inference. Apple Intelligence, Gemini Nano on Pixel, and on-device Llama variants combine static compression (4-bit weights, structured pruning) with dynamic adapters and small-then-large cascades.
Dynamic inference is one of the three main pillars of inference optimization, alongside model compression and systems-level work. A typical production stack for a 70B-class LLM uses all three at once: 4-bit weight quantisation (compression) plus a draft-target speculative pair (dynamic inference) plus PagedAttention with continuous batching (systems). The gains from the three layers compound multiplicatively rather than additively.