Dynamic inference

Updated Jun 28, 2026

Dynamic inference, also called input-adaptive inference, conditional computation, or adaptive computation, is the family of techniques that adapts a neural network's computation to each input at runtime instead of always running the full static computation graph. The goal is to reduce the average compute and latency of inference while preserving the accuracy of a comparable static model, by spending more operations on hard inputs and fewer on easy ones. Standard inference executes the same set of operations for every example, regardless of difficulty; dynamic inference exploits the fact that many inputs do not need the entire model. An easy image can be classified by a shallow sub-network, a common token in a language model can skip several transformer layers, and a routine query can be handled by a small draft model with the large target model only validating the result.

The topic is the subject of a comprehensive survey, "Dynamic Neural Networks: A Survey" by Han, Huang, Song, Yang, Wang, and Wang (IEEE TPAMI 2022, arXiv:2102.04906), which organises the field into instance-wise, spatial-wise, and temporal-wise dynamic networks.^[1] In production large language models the dominant forms of dynamic inference are sparse Mixture of Experts routing (Shazeer et al. 2017; Switch Transformer 2021),^[8]^[10] Mixture of Depths (Raposo et al. 2024),^[11] and Speculative Decoding (Leviathan, Kalman, Matias 2023).^[12] On the deployment side, systems such as Orca (Yu et al. 2022) and vLLM (Kwon et al. 2023) extend the idea to scheduling and memory: only the active part of the request gets compute, only the live part of the KV cache gets memory.^[16]^[17]

Why is dynamic inference needed?

A conventional feed-forward pass does the same amount of work for every input. A photo of a clear lemon on a white background runs through every block of a ResNet-50 just like a cluttered, occluded scene. A short, common phrase like "the cat sat on the" runs through every layer and every expert of a large language model just like a multi-step reasoning question. Empirically, much of that compute is wasted: the early-exit literature, including BranchyNet (Teerapittayanon, McDanel, Kung 2016) and MSDNet (Huang et al. 2017), reports that more than half of standard image-classification examples are already correctly classified at intermediate depths, with high confidence.^[2]^[4] The same redundancy holds for token-level computation in transformers, which motivates Mixture of Depths and per-token routing.^[11]

The second pressure is economic. Frontier model serving has become dominated by inference cost rather than training cost. A single forward pass of a dense 70-billion-parameter model on a long context is too slow and too expensive to run for every user query at web scale. The third pressure is on-device inference: phones, wearables, and edge cameras have a small power envelope, so they need to spend FLOPs only on the inputs that need them.

The idea predates modern LLMs. Alex Graves framed the core question in "Adaptive Computation Time for Recurrent Neural Networks" (2016, arXiv:1603.08983), proposing a mechanism that lets a recurrent network "learn how many computational steps to take between receiving an input and emitting an output."^[3] Graves observed that the model allocated more computation to harder-to-predict transitions, such as the spaces between words and the ends of sentences, which is exactly the behaviour every later dynamic-inference method tries to reproduce.^[3]

How does static inference differ from dynamic inference?

The sibling concept is static inference, where the full graph runs every time. The contrast is sharpest at the level of compute predictability and accuracy.

Property	Static inference	Dynamic inference
Compute per input	Identical for every input	Varies by input difficulty
Latency	Predictable, worst-case = average-case	Variable; tail latency depends on hard inputs
Hardware utilisation	High; easy to batch	Lower or harder to batch; needs custom kernels
Accuracy at fixed average compute	Lower (forced to be uniform)	Higher (compute concentrated on hard cases)
Implementation	Plain forward pass	Routers, gates, exit heads, schedulers
Determinism	Fully deterministic	Often deterministic per fixed routing seed, but routing depends on input

Dynamic inference is complementary to static model compression such as quantization, weight pruning, and knowledge distillation. Quantization shrinks every operator uniformly. Dynamic inference instead chooses, per input, how many operators to run. The two are routinely combined: for example, a 4-bit quantised LLM served with speculative decoding and PagedAttention.^[17]

What are the main types of dynamic inference?

The Han et al. 2022 survey divides dynamic neural networks into three top-level categories: instance-wise (sample-wise), spatial-wise, and temporal-wise.^[1] A complementary axis, used heavily in modern LLM serving, is whether the dynamism happens inside one forward pass (MoE, MoD, early exit) or across multiple forward passes (speculative decoding, scheduling).

Family	What varies per input	Representative methods
Sample-wise (instance-wise)	Depth, width, or path through one model	BranchyNet, MSDNet, DeeBERT, FastBERT, SkipNet, BlockDrop, Slimmable Networks, Once-for-All, cascade models
Spatial-wise	Compute spent on different image regions or tokens	Adaptive resolution, A-ViT token pruning, region-of-interest backbones
Temporal-wise	Compute spent on different frames or time steps	Adaptive frame skipping in video, Adaptive Computation Time in RNNs
Model-component dynamic	Which sub-modules of the network fire	Mixture of Experts, Mixture of Depths
Multi-pass dynamic	How many forward passes are needed and which model runs them	Speculative decoding, draft-and-verify, self-speculation
Serving-level dynamic	Which requests share which batch and which memory	Continuous batching (Orca), PagedAttention (vLLM), KV-cache eviction

How do sample-wise dynamic networks work?

These methods choose, per input, how much of a single model to run.

What are early-exit networks?

Early-exit networks attach intermediate classifiers to a backbone and let easy examples leave through one of these heads instead of going all the way to the final layer. Confidence on the early head is the usual exit criterion. BranchyNet (Teerapittayanon, McDanel, Kung 2016, ICPR) was the first widely cited paper to formalise this for vision: side-branch classifiers were added to LeNet, AlexNet, and ResNet, and a softmax-entropy threshold decides whether to exit.^[2] MSDNet (Huang, Chen, Li, Wu, van der Maaten, Weinberger 2017) addressed a real failure mode of BranchyNet, namely that early features hurt later ones, by maintaining multi-scale features through a 2-D dense network so each exit can use coarse and fine features.^[4]

The same idea was ported to NLP. DeeBERT (Xin, Tang, Lee, Yu, Lin, ACL 2020) attaches early-exit classifiers to BERT layers, and at inference each example exits at the first layer whose entropy is below a tunable threshold. The authors report up to about 40% inference time saved with minimal quality loss.^[5] FastBERT and CascadeBERT extend this with self-distillation between early and late heads, and with cascade-style aggregation of multiple early predictions.
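
The entropy-threshold exit rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the BranchyNet or DeeBERT implementation: the exit heads are modelled as hypothetical zero-argument callables that return logits, and the threshold of 0.3 and the toy logits are assumptions chosen for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats; low entropy means a confident prediction.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit_predict(exit_heads, threshold):
    """exit_heads: zero-argument callables, shallowest first; each returns logits.
    Returns (predicted_class, index_of_exit_used)."""
    last = len(exit_heads) - 1
    for depth, head in enumerate(exit_heads):
        probs = softmax(head())
        # Leave at the first confident head; the final head always answers.
        if entropy(probs) < threshold or depth == last:
            return probs.index(max(probs)), depth

calls = []  # records which heads were actually evaluated

def make_head(name, logits):
    def head():
        calls.append(name)
        return logits
    return head

easy = [make_head("easy-0", [6.0, 0.0, 0.0]), make_head("easy-1", [8.0, 0.0, 0.0])]
hard = [make_head("hard-0", [0.2, 0.1, 0.0]), make_head("hard-1", [0.3, 2.5, 0.0])]

easy_result = early_exit_predict(easy, threshold=0.3)
hard_result = early_exit_predict(hard, threshold=0.3)
print(easy_result, hard_result, calls)
```

The easy input leaves at the first head, so its deeper head is never evaluated; the hard input runs both. In a real network each head would sit on top of the backbone layers before it, so skipping a head also skips those layers.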

What is a cascade model?

A cascade is more general than an early-exit head: it runs a cheap model first, and only invokes the expensive model when the cheap model is uncertain. Bolukbasi, Wang, Dekel, Saligrama (ICML 2017, "Adaptive Neural Networks for Efficient Inference") presented a clean cascade formulation in which a sequence of networks of increasing cost is trained, and at test time a per-layer gate decides whether to keep going. They reported up to 2.8x speedup on ImageNet with less than 1% top-5 accuracy loss.^[3] Cascades are widely used in production for content moderation, spam filtering, and image-quality classification, where most inputs are easy.
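
A cascade can be sketched as a loop over models ordered by cost. The two toy classifiers, their relative costs (1 and 20), and the 0.9 confidence threshold below are hypothetical values for illustration; they are not taken from Bolukbasi et al.

```python
def cascade_predict(models, x, min_confidence):
    """models: list of (name, predict_fn, cost), cheapest first.
    predict_fn(x) returns (label, confidence).
    Returns (label, name_of_model_that_answered, total_cost_spent)."""
    spent = 0.0
    last = len(models) - 1
    for i, (name, predict_fn, cost) in enumerate(models):
        label, confidence = predict_fn(x)
        spent += cost
        # Stop as soon as a model is confident; the last model always answers.
        if confidence >= min_confidence or i == last:
            return label, name, spent

def small_model(text):
    # Hypothetical cheap filter: only sure about one obvious pattern.
    if "free money" in text:
        return "spam", 0.99
    return "ham", 0.55

def large_model(text):
    # Hypothetical expensive model: recognises more patterns.
    suspicious = "free money" in text or "wire transfer" in text
    return ("spam" if suspicious else "ham"), 0.95

models = [("small", small_model, 1.0), ("large", large_model, 20.0)]

easy = cascade_predict(models, "claim your free money now", 0.9)
hard = cascade_predict(models, "please confirm the wire transfer", 0.9)
print(easy)  # answered by the small model alone
print(hard)  # escalated to the large model
```

The average cost depends on how often the cheap model is confident, which is why cascades pay off when most inputs are easy.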

How do layer-skip and skip-connection networks work?

Skip-connection methods learn a per-input policy for which layers or blocks to execute. SkipNet (Wang et al., ECCV 2018) and BlockDrop (Wu et al., CVPR 2018) train a gating network with reinforcement learning or Gumbel-softmax to decide which residual blocks of a ResNet to run on each example. The result is a model whose effective depth varies per input.

What are slimmable and once-for-all networks?

A different sample-wise idea is to make the network's width adjustable at runtime. Slimmable Networks (Yu, Yang, Xu, Yang, Huang, ICLR 2019) train a single network that can be executed at multiple widths through switchable batch normalisation. The same set of weights serves devices with different power budgets.^[6] Once-for-All (Cai, Gan, Wang, Zhang, Han, ICLR 2020) generalises this: train one supernet from which over 10^19 specialised sub-networks can be sampled and deployed without retraining. OFA reported 80% ImageNet top-1 accuracy under a 600M-MAC mobile constraint.^[7] While the supernet itself is static, the per-deployment sub-network choice makes inference dynamic in practice.

How do spatial-wise dynamic networks work?

For images and image-like data, different regions are not equally informative. Spatial-wise dynamic networks send more compute to important regions and skip the rest. Examples include adaptive image resolution (low-resolution pass first, high-resolution refinement only where needed), token-pruning vision transformers such as A-ViT and DynamicViT, and region-of-interest backbones used in detection. The taxonomy is given in detail in the Han et al. 2022 survey, in the sections on spatial dynamism.^[1] Spatial methods plug naturally into the same training pipelines as standard ConvNets and ViTs because they only change which patches participate in each layer.

How do temporal-wise dynamic networks work?

Video and streaming text contain a lot of redundancy across time. Temporal-wise dynamic networks decide on a per-frame or per-time-step basis whether to run the full model or reuse a previous prediction. Adaptive frame-skipping models for action recognition fall into this group, as do dynamic-depth recurrent networks like the Adaptive Computation Time RNN (Graves 2016).^[3] In streaming LLM serving, the same idea reappears: only the new tokens get full attention, and earlier tokens can be summarised, evicted, or partially attended via H2O (Zhang et al. 2023) or StreamingLLM (Xiao, Tian, Chen, Han, Lewis, ICLR 2024).^[18]^[19]

What is Mixture of Experts and how does it make inference dynamic?

Mixture of Experts is the most successful dynamic-inference method in production large models. The idea is to replace a single dense feed-forward block with a set of N parallel experts and a small router that, for each token, picks k of them. The layer's total parameter count is high (up to N times the dense block), but per-token compute is only about k times the dense block plus a small routing cost, where k is much smaller than N.
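
The routing step can be sketched as follows. This is a toy illustration under stated assumptions: tokens are short Python lists, the router is a single linear map whose weights are made up, and each hypothetical expert simply scales its input. Renormalising the gate weights over the selected experts is one common choice, not the only one.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_weights, experts, k):
    """Route one token to its top-k experts and mix their outputs."""
    # Router logit for expert i is the dot product of the token with row i.
    logits = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    gates = softmax([logits[i] for i in top])  # renormalise over the chosen k
    out = [0.0] * len(token)
    for gate, i in zip(gates, top):
        y = experts[i](token)  # only k of the N experts are ever called
        out = [o + gate * v for o, v in zip(out, y)]
    return out, top

def make_expert(scale):
    return lambda x: [scale * v for v in x]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]  # N = 4
router_weights = [[0.1, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

out, chosen = moe_forward([1.0, 0.0], router_weights, experts, k=2)
print(chosen, out)
```

With N = 4 and k = 2 the layer stores four experts' worth of parameters but runs only two per token, which is the parameter-versus-compute split described above.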

What is the sparsely-gated MoE layer?

The modern form was introduced by Shazeer, Mirhoseini, Maziarz, Davis, Le, Hinton, Dean in "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer" (ICLR 2017).^[8] Their MoE layer used noisy top-k gating, where the gate adds Gaussian noise to logits before selecting the top-k experts. They reported up to 137-billion-parameter language models with only minor compute overhead per token, achieving over 1000x more model capacity than dense baselines for the same compute.^[8] They also introduced an importance loss and a load loss to keep experts roughly balanced.
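
Noisy top-k gating can be sketched as below, following the description in Shazeer et al.: Gaussian noise, scaled by a softplus of a learned noise logit, is added to the clean router logits, all but the top k are masked out, and a softmax is taken over the survivors. In this sketch the clean and noise logits are passed in directly as plain lists (in the paper they come from learned projections of the input), and the example values are assumptions.

```python
import math
import random

def softplus(x):
    return math.log1p(math.exp(x))

def noisy_top_k_gates(clean_logits, noise_logits, k, rng):
    """Return one gate weight per expert; exactly k of them are non-zero."""
    noisy = [c + rng.gauss(0.0, 1.0) * softplus(n)
             for c, n in zip(clean_logits, noise_logits)]
    keep = set(sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k])
    m = max(noisy[i] for i in keep)
    # Experts outside the top k get weight 0 (equivalent to a logit of -inf).
    exps = [math.exp(noisy[i] - m) if i in keep else 0.0 for i in range(len(noisy))]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
gates = noisy_top_k_gates([1.0, 0.5, 0.2, -0.3], [0.0, 0.0, 0.0, 0.0], k=2, rng=rng)
print(gates)
```

The noise lets lower-ranked experts win occasionally during training, which helps spread load across experts.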

How does the Switch Transformer route tokens?

Switch Transformer (Fedus, Zoph, Shazeer, JMLR 2022, originally arXiv:2101.03961) simplified the recipe by routing each token to exactly one expert (top-1).^[10] They showed top-1 routing trains stably at scale, reduces the routing overhead, and produces up to 7x pre-training speedup over dense T5-Base/Large at the same compute.^[10] They popularised the expert capacity factor, a multiplier that sets each expert's capacity (the maximum number of tokens it may process per batch) as (tokens per batch / number of experts) × capacity factor; tokens routed to an expert that is already at capacity are dropped and pass through the layer's residual connection (token dropping). They scaled to a trillion-parameter model on TPUs.^[10]
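
Expert capacity and token dropping can be worked through with a small example. In the Switch Transformer paper, expert capacity is (tokens per batch / number of experts) × capacity factor. The sketch below assumes rounding up to an integer and first-come-first-served filling; both are illustrative choices, as are the example assignments.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor):
    # Rounding up is an assumption of this sketch.
    return math.ceil(num_tokens / num_experts * capacity_factor)

def dispatch_top1(assignments, num_experts, capacity_factor):
    """assignments[t] is the expert chosen by the router for token t.
    Returns (capacity, kept_token_indices, dropped_token_indices)."""
    cap = expert_capacity(len(assignments), num_experts, capacity_factor)
    load = [0] * num_experts
    kept, dropped = [], []
    for tok, expert in enumerate(assignments):
        if load[expert] < cap:
            load[expert] += 1
            kept.append(tok)
        else:
            dropped.append(tok)  # overflow: token skips the expert layer
    return cap, kept, dropped

# 8 tokens, 4 experts, and a router that favours expert 0.
cap, kept, dropped = dispatch_top1([0, 0, 0, 0, 1, 2, 0, 3], 4, 1.25)
print(cap, kept, dropped)
```

Here each expert may take 3 tokens, so the fourth and fifth tokens sent to expert 0 overflow. A larger capacity factor drops fewer tokens at the price of more padding and compute.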

What is GShard?

GShard (Lepikhin, Lee, Xu, Chen, Firat, Huang, Krikun, Shazeer, Chen 2020) provided the systems substrate.^[9] It combines lightweight sharding annotations with an extension to the XLA compiler, which together automatically shard a top-2 MoE Transformer across TPU pods. The authors trained a 600-billion-parameter multilingual translation model on 2048 TPU v3 chips in four days.^[9] GShard's formulation of the auxiliary load-balancing loss became the standard one.

How do MoE models keep experts balanced?

Left alone, the router tends to collapse onto a few experts. Standard mitigations include an auxiliary load-balancing loss that combines each expert's share of routed tokens with its mean routing probability and penalises uneven usage, expert capacity factors that put a hard cap on tokens per expert, and regularisation of the router itself, such as the noise added to router logits in noisy top-k gating or a router z-loss (the latter comes from the follow-up ST-MoE work rather than from Switch Transformer itself).^[8]^[9]^[10] DeepSeek-V3 (Liu et al. 2024) introduced an auxiliary-loss-free strategy that adjusts a per-expert bias term during training instead of adding a loss term, claiming better quality at the same balance.^[20]
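
One widely used form of the auxiliary loss is the one given in the Switch Transformer paper: α · N · Σ f_i · P_i, where f_i is the fraction of tokens whose top-1 expert is i and P_i is the mean router probability assigned to expert i. The sketch below computes it for plain Python lists; the coefficient α = 0.01 and the example probabilities are illustrative.

```python
def load_balancing_loss(router_probs, alpha=0.01):
    """router_probs[t][i] is the router probability of expert i for token t."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    counts = [0] * num_experts
    for probs in router_probs:
        counts[probs.index(max(probs))] += 1  # top-1 dispatch
    f = [c / num_tokens for c in counts]
    p = [sum(row[i] for row in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: each of 4 tokens prefers a different expert.
balanced = [
    [0.4, 0.2, 0.2, 0.2],
    [0.2, 0.4, 0.2, 0.2],
    [0.2, 0.2, 0.4, 0.2],
    [0.2, 0.2, 0.2, 0.4],
]
# Collapsed routing: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
print(loss_balanced, loss_collapsed)
```

The loss reaches its minimum value of α under perfectly uniform routing and grows as tokens and probability mass concentrate on a few experts.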

Which production LLMs use Mixture of Experts?

MoE has moved from research to production:

Model	Year	Architecture	Active parameters per token
GShard	2020	600B params, top-2 routing	~50B
GLaM (Du et al., Google, ICML 2022)	2022	1.2T params, top-2 routing	~97B
Switch Transformer	2021	Up to 1.6T params, top-1	~6B
Mixtral 8x7B (Mistral)	2023	46.7B total, 8 experts, top-2	12.9B
Mixtral 8x22B	2024	141B total, 8 experts, top-2	39B
DeepSeek-V2	2024	236B params, fine-grained experts	21B
DeepSeek-V3	2024	671B params, 256 routed + 1 shared expert per layer	37B
Grok-1 (xAI)	2024	314B params, 8 experts, top-2	86B
Qwen2-MoE / Qwen3-MoE	2024-2025	Up to 235B params with shared + routed experts	varies

DeepSeek-V2 and V3 introduced the shared expert pattern: in addition to N routed experts (selected by the router), every token also passes through a small set of always-on shared experts. The shared expert captures common knowledge; the routed experts specialise. DeepSeek-V3's architecture has 1 shared expert and 256 routed experts per layer, with the router picking 8 of those 256, which yields 37B active parameters out of 671B total.^[20]
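
The sparsity these figures imply is easy to work out. The snippet below only does arithmetic on the total and active parameter counts quoted in this article; the helper name is hypothetical.

```python
# (total parameters, active parameters per token), in billions, as quoted above.
models = {
    "Mixtral 8x7B": (46.7, 12.9),
    "DeepSeek-V2": (236.0, 21.0),
    "DeepSeek-V3": (671.0, 37.0),
    "Grok-1": (314.0, 86.0),
}

def active_percent(total_b, active_b):
    return 100.0 * active_b / total_b

for name, (total_b, active_b) in models.items():
    print(f"{name}: {active_percent(total_b, active_b):.1f}% of parameters active per token")

# DeepSeek-V3 routing: 8 of 256 routed experts are selected per token per layer.
routed_percent = 100.0 * 8 / 256
print(f"DeepSeek-V3 routed experts selected: {routed_percent:.3f}%")
```

The active-parameter share (about 5.5% for DeepSeek-V3) is higher than the routed-expert share (about 3.1%) because attention, embeddings, and the shared expert run for every token.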

What is Mixture of Depths?

Mixture of Depths (Raposo, Ritter, Richards, Lillicrap, Humphreys, Santoro 2024, arXiv:2404.02258) does for layer-skipping what MoE did for feed-forward blocks.^[11] Each transformer layer has a router that selects a top-k subset of tokens to actually compute the attention and MLP; the rest of the tokens skip the layer through a residual connection. The compute budget per layer is fixed, so total FLOPs are bounded and predictable, but each token's effective depth depends on the input. Raposo et al. report that MoD transformers match or beat baseline performance at a fraction of the FLOPs per forward pass, and that the method composes with MoE (the paper calls the combination MoDE).^[11] Unlike per-token early exit, MoD makes a fresh routing decision at every layer, so a token may participate, skip, then participate again.
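
A single Mixture-of-Depths layer can be sketched as below. Tokens are scalars for brevity, the router scores are given rather than learned, and `block` is a hypothetical stand-in for the attention-plus-MLP computation. Scaling the block output by the router score follows the paper's description, but treat the details as a simplification.

```python
def mod_layer(tokens, router_scores, capacity, block):
    """Run `block` on the `capacity` highest-scoring tokens only.
    Returns (new_tokens, sorted indices of tokens that were processed)."""
    ranked = sorted(range(len(tokens)), key=lambda i: router_scores[i], reverse=True)
    chosen = set(ranked[:capacity])
    out = []
    for i, x in enumerate(tokens):
        if i in chosen:
            # Residual plus block output, scaled by the router score.
            out.append(x + router_scores[i] * block(x))
        else:
            out.append(x)  # skipped: only the residual path is taken
    return out, sorted(chosen)

tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.9, 0.1, 0.5, 0.2]
new_tokens, processed = mod_layer(tokens, scores, capacity=2, block=lambda x: 10.0)
print(processed, new_tokens)
```

Because `capacity` is fixed, the block always runs on exactly two tokens per layer here, which is what keeps the FLOP budget predictable even though the choice of tokens depends on the input.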

What is speculative decoding?

Autoregressive decoding is sequential: each token requires one forward pass through the full LLM. Speculative decoding breaks this serial bottleneck by running a small draft model that proposes K tokens, then validating those K tokens with one parallel forward pass of the target model. The longest validated prefix is accepted, the target model's own distribution supplies one further token (a correction at the first rejected position, or a bonus token if every draft was accepted), the rest of the draft is discarded, and the loop continues. When the verification step uses rejection sampling, the output distribution is exactly that of the target model: speculative decoding is exact, not approximate.^[12]^[13]
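
One draft-and-verify step with rejection sampling can be sketched as follows. A drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution; on rejection, a replacement is sampled from the normalised positive part of p − q. The toy "models" here are hypothetical functions that return a fixed distribution over three tokens, and the target is called once per position, whereas a real system scores all positions in one parallel pass.

```python
import random

def sample(probs, rng):
    r, acc = rng.random(), 0.0
    for tok, p in enumerate(probs):
        acc += p
        if r < acc:
            return tok
    return len(probs) - 1

def speculative_step(prefix, draft_probs, target_probs, k, rng):
    """Return the tokens emitted by one draft-and-verify step (1 to k+1 tokens)."""
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(k):  # the draft model proposes k tokens autoregressively
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)
    # Target distributions at all k+1 positions.
    p_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(k + 1)]
    emitted = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)  # draft token accepted
            continue
        # Rejected: resample from the residual max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        z = sum(residual)
        emitted.append(sample([r / z for r in residual], rng))
        return emitted
    emitted.append(sample(p_dists[k], rng))  # all accepted: bonus target token
    return emitted

target = lambda ctx: [0.7, 0.2, 0.1]
draft = lambda ctx: [0.3, 0.4, 0.3]

rng = random.Random(0)
same_model = speculative_step([0], target, target, 4, rng)
first_tokens = [speculative_step([0], draft, target, 1, rng)[0] for _ in range(20000)]
freq0 = first_tokens.count(0) / len(first_tokens)
print(len(same_model), round(freq0, 2))
```

When draft and target agree exactly, every proposal is accepted and the step emits k + 1 tokens. When they disagree, the empirical frequency of the first emitted token still tracks the target's 0.7, which is the exactness property described above.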

The original method appeared in two concurrent papers:

Paper	Authors	Year	Venue	Key idea
Fast Inference from Transformers via Speculative Decoding	Leviathan, Kalman, Matias (Google)	2022 (arXiv)	ICML 2023	Draft model proposes K tokens, target verifies in one pass; 2x-3x on T5-XXL
Accelerating Large Language Model Decoding with Speculative Sampling	Chen et al. (DeepMind)	2023	arXiv	Same recipe, formal acceptance probability with rejection sampling

Variants and refinements:

Method	Year	Change vs base recipe
Medusa (Cai, Li, Geng, Peng, Lee, Chen, Dao)	2024	Adds extra prediction heads to the target model itself; tree-based attention verifies several continuations at once. 2.2-3.6x speedup.
EAGLE (Li, Wei, Zhang, Zhang)	2024	Drafts at the feature level (the second-to-top-layer hidden features) instead of the token level; 2.7-3.5x on LLaMA2-Chat 70B
Lookahead Decoding	2024	No draft model; uses Jacobi iteration plus verification
ReDrafter (Apple)	2024	Recurrent draft head; tested on MLX
Self-speculative decoding (Zhang et al.)	2024	Skips layers of the same model to form an internal draft, no separate model needed

In practice, speculative decoding works well when the draft model agrees often with the target model. Acceptance rates of 60-80% are typical for matched draft-target pairs. The method is now standard in production stacks: vLLM, TensorRT-LLM, and TGI all support several variants out of the box, with vLLM reporting up to 2.8x throughput gains from speculative decoding.^[12]^[14]^[15]
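
The link between acceptance rate and speedup can be made concrete. Under the simplifying assumption in Leviathan et al. that each drafted token is accepted independently with probability α, a step that drafts γ tokens emits (1 − α^(γ+1)) / (1 − α) tokens on average per target-model pass. The function name and the choice of γ = 4 are illustrative.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target forward pass.
    alpha: per-token acceptance probability (assumed i.i.d.).
    gamma: number of tokens drafted per step."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

low = expected_tokens_per_step(0.6, 4)
high = expected_tokens_per_step(0.8, 4)
print(f"alpha=0.6: {low:.4f} tokens per target pass")
print(f"alpha=0.8: {high:.4f} tokens per target pass")
```

With four drafted tokens, the 60-80% acceptance range corresponds to roughly 2.3 to 3.4 tokens per target pass. Wall-clock speedup is lower than this because the draft model's own cost is not counted here.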

How does dynamic inference work at the serving level?

For LLM serving, dynamic inference also happens between requests. Two systems are foundational.

Orca: A Distributed Serving System for Transformer-Based Generative Models (Yu, Jeong, Kim, Kim, Chun, OSDI 2022) introduced iteration-level scheduling, also known as continuous batching.^[16] Naive batching forces the entire batch to wait until the slowest sequence in the batch finishes generating; continuous batching lets the scheduler insert a new request as soon as any sequence in the current batch finishes. Orca also introduced selective batching: operations whose input shapes vary per request (attention) are run per request, while shape-agnostic operations such as the feed-forward blocks are still batched across requests. The reported throughput improvement over NVIDIA FasterTransformer was up to 36.9x at the same latency.^[16]
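
The difference between request-level and iteration-level scheduling shows up even in a toy simulation. The sketch below counts decoding iterations only, assumes every request is already queued, and ignores prefill cost and memory limits; the request lengths are made up.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue after every iteration."""
    queue = deque(lengths)
    running = []  # remaining tokens to generate for each active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1  # one decoding iteration for every active request
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

lengths = [8, 2, 2, 2, 2]  # output tokens needed per request
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With one long request and four short ones, static batching needs 12 iterations while iteration-level scheduling needs 8, because the short requests take turns in the slot next to the long one instead of waiting for it.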

PagedAttention / vLLM (Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, Stoica, SOSP 2023) treats the KV cache like virtual memory: the cache is split into fixed-size blocks (pages), each request gets a page table, and pages can be shared across requests (for example, a shared system prompt).^[17] This nearly eliminates fragmentation in GPU memory, cutting cache waste from the 60-80% typical of naive allocators to under 4%. The paper reports 2-4x higher throughput than FasterTransformer and Orca at the same level of latency.^[17]
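
The page-table idea can be sketched with a small allocator. This is a bookkeeping illustration only, not vLLM's implementation: the class and method names are hypothetical, no tensors are stored, and copy-on-write is reduced to swapping in a fresh block when a shared, partially filled block is about to be written.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids
        self.ref_count = {}                  # block id -> number of users
        self.page_tables = {}                # request id -> list of block ids
        self.lengths = {}                    # request id -> tokens stored

    def _alloc(self):
        block = self.free.pop()  # raises IndexError when memory is exhausted
        self.ref_count[block] = 1
        return block

    def append_token(self, req):
        table = self.page_tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:
            table.append(self._alloc())      # last block is full (or none yet)
        elif self.ref_count[table[-1]] > 1:
            self.ref_count[table[-1]] -= 1   # copy-on-write for a shared block
            table[-1] = self._alloc()
        self.lengths[req] = n + 1

    def fork(self, parent, child):
        """Let `child` share all of `parent`'s blocks, e.g. a common prompt."""
        self.page_tables[child] = list(self.page_tables[parent])
        self.lengths[child] = self.lengths[parent]
        for block in self.page_tables[child]:
            self.ref_count[block] += 1

    def release(self, req):
        for block in self.page_tables.pop(req):
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:
                del self.ref_count[block]
                self.free.append(block)
        del self.lengths[req]

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(8):
    cache.append_token("a")      # 8 prompt tokens fill 2 blocks
cache.fork("a", "b")             # "b" reuses the same 2 blocks
cache.append_token("b")          # "b" needs 1 new block of its own
free_after_fork = len(cache.free)
shared = cache.page_tables["b"][:2] == cache.page_tables["a"]
cache.release("a")
free_after_a = len(cache.free)   # shared blocks are still in use by "b"
cache.release("b")
free_at_end = len(cache.free)
print(free_after_fork, shared, free_after_a, free_at_end)
```

Because memory is handed out one block at a time, the waste per request is bounded by one partially filled block rather than by a worst-case preallocation.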

KV-cache compression and eviction methods take dynamic inference into the memory system itself. H2O (Zhang et al., NeurIPS 2023) keeps a small set of "heavy hitter" tokens that contribute most attention mass and evicts the rest.^[18] StreamingLLM (Xiao, Tian, Chen, Han, Lewis, ICLR 2024) keeps an attention sink (the very first tokens) plus a sliding window, and shows that with this layout LLMs can serve 4 million-token streams at up to 22.2x speedup.^[19] These are dynamic because what gets kept in cache depends on what the model has seen so far.
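
Both eviction policies reduce to choosing which token positions stay in the cache. The two functions below are simplified sketches with hypothetical names: the first keeps a fixed number of initial "sink" positions plus a sliding window, in the spirit of StreamingLLM; the second keeps the most recent positions plus the older positions with the largest accumulated attention, in the spirit of H2O. The budgets and attention scores are made-up inputs.

```python
def sink_plus_window_positions(num_tokens, num_sink, window):
    """Positions kept after num_tokens tokens: first num_sink plus last window."""
    if num_tokens <= num_sink + window:
        return list(range(num_tokens))
    return list(range(num_sink)) + list(range(num_tokens - window, num_tokens))

def heavy_hitter_positions(accumulated_attention, num_heavy, num_recent):
    """Keep the num_recent newest positions plus the num_heavy older positions
    that have received the most attention so far."""
    n = len(accumulated_attention)
    recent = set(range(max(0, n - num_recent), n))
    older = [i for i in range(n) if i not in recent]
    heavy = sorted(older, key=lambda i: accumulated_attention[i], reverse=True)
    return sorted(recent | set(heavy[:num_heavy]))

kept_streaming = sink_plus_window_positions(num_tokens=10, num_sink=2, window=4)
kept_heavy = heavy_hitter_positions([5.0, 0.1, 3.0, 0.2, 0.3, 0.4],
                                    num_heavy=2, num_recent=2)
print(kept_streaming)
print(kept_heavy)
```

The first policy depends only on position, while the second depends on the attention pattern the model has actually produced.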

Other related techniques include Skeleton-of-Thought (Ning et al. 2023), which produces a high-level outline first and then expands sections in parallel, and draft-then-verify retrieval augmentation, where an embedding-based retriever drafts candidate passages that the LLM verifies.

What is dynamic inference used for?

Latency-critical edge inference. Phones, AR/VR headsets, smart cameras, and IoT devices have a hard wall on power and a soft wall on latency. Early-exit and slimmable networks let the same weights serve a 0.5W and a 5W budget.^[6]

LLM serving at scale. This is where most of the dollar value of dynamic inference lives. Without speculative decoding and continuous batching, a 70B model serving thousands of concurrent users at sub-second latency is uneconomic. With them, the same hardware budget supports several times the throughput.^[16]^[17]

Foundation-model compute reduction. Frontier model training has become so expensive that providers want to keep the model itself sparse. MoE makes this work: Mixtral 8x7B, DeepSeek-V3, Qwen3-MoE, and Grok all use MoE because it grows effective parameter count while keeping per-token FLOPs bounded.^[20]

Energy efficiency. Data-centre operators increasingly track joules per token. MoE plus speculative decoding plus PagedAttention can drop joules per token by an order of magnitude versus a naive dense forward pass loop.

Content moderation cascades. A common production pattern is a small, fast classifier that filters obvious cases and a heavier model that handles the residual. This is a cascade in the Bolukbasi 2017 sense, applied to abuse detection, NSFW filtering, and ad-policy review.^[3]

What are the trade-offs of dynamic inference?

Dynamic inference is not free. The honest accounting includes several costs.

Hardware utilisation. GPUs love dense, regular matrix multiplication. A sparse top-1 MoE layer needs a scatter-and-gather kernel that fights against this hardware preference. Kernels such as Megablocks (Gale, Narayanan, Young, Zaharia 2023) and ScatterMoE close some of the gap, but the wall-clock speedup of MoE is often less than its FLOPs reduction. Dynamic depth via early exit is even harder to batch.

Routing overhead. A router or exit head adds latency. For very small models the router can cost more than the saved compute. For large models the router is negligible.

Training complexity. Auxiliary balance losses, capacity factors, and special learning-rate schedules all complicate MoE training. Routing instabilities still occasionally show up, which is why DeepSeek-V3 introduced an auxiliary-loss-free balancer.^[20]

Determinism and verifiability. Static models are bit-for-bit reproducible given fixed seeds. Dynamic models whose decisions hinge on near-tied probabilities (load-balanced routing, speculative-decoding acceptance) are deterministic given the same input and seed, but small numerical differences caused by a change in batch composition can flip a routing outcome.

Tail latency. A network whose easy inputs are fast and hard inputs are slow has a wider latency distribution than a static one. SLO-driven serving pipelines must size capacity to the p99 case.
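
A small worked example shows why the mean and the tail diverge. The latencies below are invented for illustration: 95% of requests take an easy 10 ms path and 5% take a hard 100 ms path. The percentile helper uses the nearest-rank definition.

```python
import math

def percentile(sorted_values, q):
    """Nearest-rank percentile of an ascending list, with q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]

# Hypothetical dynamic model: 95 easy requests at 10 ms, 5 hard ones at 100 ms.
latencies_ms = sorted([10.0] * 95 + [100.0] * 5)

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

The mean falls to 14.5 ms, but p99 stays at 100 ms, the same as a static model that always takes the long path. Capacity sized to the p99 case therefore sees little of the average-case saving.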

What hardware supports dynamic inference?

Standard GPUs handle dynamic inference through software, with custom CUDA kernels closing the performance gap.

Workload	Common kernel/runtime
MoE forward and backward	Megablocks, ScatterMoE, Tutel (Microsoft), FasterMoE
Speculative decoding	TensorRT-LLM, vLLM, TGI
PagedAttention / KV-cache paging	vLLM (origin), TGI, TensorRT-LLM
Continuous batching	Orca (origin), vLLM, TGI, NVIDIA Triton Inference Server
Early exit	Custom; PyTorch torch.compile + dynamic shapes

TPUs were the original home of GShard and remain a strong target through the Pathways runtime.^[9] Cerebras, Groq, and SambaNova chips have advantages for sparse and dynamic patterns because their on-chip memory and dataflow architectures avoid some of the gather/scatter overhead. NVIDIA Hopper and Blackwell GPUs added structured-sparsity and FP8 features that benefit static compression more than dynamic routing, but their faster interconnect helps MoE expert-parallel layouts.

Which open-source frameworks implement dynamic inference?

Most dynamic-inference techniques in modern LLM serving are bundled into a few well-known frameworks.

Framework	Maintainer	Notable dynamic-inference features
vLLM	UC Berkeley / community	PagedAttention, continuous batching, speculative decoding, prefix caching
TensorRT-LLM	NVIDIA	In-flight batching (NVIDIA's term for continuous batching), speculative decoding, MoE kernels
Triton Inference Server	NVIDIA	Multi-model orchestration, dynamic batching, ensembles of cascades
DeepSpeed-MII	Microsoft	Dynamic batching, MoE, ZeRO-Inference
TGI (Text Generation Inference)	Hugging Face	Continuous batching, speculative decoding, paged attention
FasterTransformer	NVIDIA (legacy)	Static batching, optimised kernels
llama.cpp	Community	KV-cache management, speculative decoding, partial layer offload

What are the recent trends in dynamic inference (2024-2025)?

MoE in frontier models. Through 2024 and 2025, most newly announced large open-weight models have been MoE: DeepSeek-V2/V3, Qwen2/3-MoE, Grok-1, Mixtral 8x22B.^[20] The pattern is now standard rather than experimental.

Speculative decoding becomes the default. By 2025, speculative decoding (often EAGLE or Medusa) is enabled by default in most production LLM serving stacks. vLLM and TensorRT-LLM ship multiple variants.^[14]^[15]

Test-time compute and reasoning models. A related sense of dynamic inference appeared with OpenAI's o1 (September 2024) and DeepSeek-R1 (2025): the model spends more compute (more chain-of-thought tokens) on harder problems, and performance keeps improving as the inference-time compute budget grows. The mechanism differs from MoE or speculative decoding, but the philosophy is the same: harder inputs should not cost the same as easy ones. See test-time compute and inference-time scaling.

On-device dynamic inference. Apple Intelligence, Gemini Nano on Pixel, and on-device Llama variants combine static compression (4-bit weights, structured pruning) with dynamic adapters and small-then-large cascades.

How does dynamic inference relate to model compression?

Dynamic inference is one of the three main pillars of inference optimization, alongside model compression and systems-level work. A typical production stack for a 70B-class LLM uses all three at once: 4-bit weight quantisation (compression) plus a draft-target speculative pair (dynamic inference) plus PagedAttention with continuous batching (systems).^[17] Because each technique attacks a different bottleneck, their gains tend to compound multiplicatively rather than add.

ELI5: dynamic inference

Imagine a test with 100 questions, some very easy and some very hard. A static model is a student who spends exactly the same amount of time on every question, even the ones they could answer instantly. Dynamic inference is a smarter student who glances at each question, answers the easy ones in a second, and saves their effort for the hard ones. The total work goes down, but the score stays the same. Inside an AI, the "effort" is computation: easy inputs take a short path through the network, hard inputs take the long path.

References

1. Han, Y., Huang, G., Song, S., Yang, L., Wang, H., Wang, Y. "Dynamic Neural Networks: A Survey." IEEE TPAMI, 2022. arXiv:2102.04906.
2. Teerapittayanon, S., McDanel, B., Kung, H.T. "BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks." ICPR 2016. arXiv:1709.01686.
3. Bolukbasi, T., Wang, J., Dekel, O., Saligrama, V. "Adaptive Neural Networks for Efficient Inference." ICML 2017. arXiv:1702.07811. (Cascade speedups.) See also Graves, A. "Adaptive Computation Time for Recurrent Neural Networks." 2016. arXiv:1603.08983.
4. Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K. "Multi-Scale Dense Networks for Resource Efficient Image Classification." ICLR 2018. arXiv:1703.09844.
5. Xin, J., Tang, R., Lee, J., Yu, Y., Lin, J. "DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference." ACL 2020. arXiv:2004.12993.
6. Yu, J., Yang, L., Xu, N., Yang, J., Huang, T. "Slimmable Neural Networks." ICLR 2019. arXiv:1812.08928.
7. Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S. "Once-for-All: Train One Network and Specialize it for Efficient Deployment." ICLR 2020. arXiv:1908.09791.
8. Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. arXiv:1701.06538.
9. Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., Chen, Z. "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." 2020. arXiv:2006.16668.
10. Fedus, W., Zoph, B., Shazeer, N. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022. arXiv:2101.03961.
11. Raposo, D., Ritter, S., Richards, B., Lillicrap, T., Humphreys, P.C., Santoro, A. "Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models." 2024. arXiv:2404.02258.
12. Leviathan, Y., Kalman, M., Matias, Y. "Fast Inference from Transformers via Speculative Decoding." ICML 2023. arXiv:2211.17192.
13. Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., Jumper, J. (DeepMind). "Accelerating Large Language Model Decoding with Speculative Sampling." 2023. arXiv:2302.01318.
14. Cai, T., Li, Y., Geng, Z., Peng, H., Lee, J.D., Chen, D., Dao, T. "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads." 2024. arXiv:2401.10774.
15. Li, Y., Wei, F., Zhang, C., Zhang, H. "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty." 2024. arXiv:2401.15077.
16. Yu, G., Jeong, J.S., Kim, G., Kim, S., Chun, B. "Orca: A Distributed Serving System for Transformer-Based Generative Models." OSDI 2022.
17. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., Stoica, I. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv:2309.06180.
18. Zhang, Z., et al. "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models." NeurIPS 2023. arXiv:2306.14048.
19. Xiao, G., Tian, Y., Chen, B., Han, S., Lewis, M. "Efficient Streaming Language Models with Attention Sinks." ICLR 2024. arXiv:2309.17453.
20. DeepSeek-AI. "DeepSeek-V3 Technical Report." 2024. arXiv:2412.19437.
21. Du, N., Huang, Y., Dai, A.M., et al. "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022. arXiv:2112.06905.
22. Graves, A. "Adaptive Computation Time for Recurrent Neural Networks." 2016. arXiv:1603.08983.
