Inference optimization refers to a collection of techniques designed to make artificial intelligence model inference faster, more memory-efficient, and cheaper to run. While training a large language model can cost millions of dollars, the cumulative cost of serving that model to users often exceeds the training budget within months. For any organization running LLMs in production, inference optimization is where the real cost savings happen.
The field has evolved rapidly since 2023, driven by the explosive growth of LLM-powered applications. Techniques range from mathematical reformulations of the attention mechanism to systems-level innovations borrowed from operating system design. Some methods reduce the size of the model itself, while others focus on how requests are scheduled, how memory is managed, or how multiple GPUs coordinate their work.
Inference costs dominate the total cost of ownership for production AI systems. A single GPT-4-class model serving millions of users can consume thousands of GPU-hours per day. According to a 2025 analysis by Andreessen Horowitz, the cost of LLM inference has dropped by roughly 1,000x over three years, with prices declining at approximately 10x per year for equivalent performance. Even so, the sheer volume of inference requests means that organizations routinely spend more on serving than they did on training.
Several factors make LLM inference expensive. The autoregressive nature of text generation means tokens are produced one at a time, each requiring a full forward pass through the model. The KV cache (which stores intermediate attention states) grows linearly with sequence length and can consume more GPU memory than the model weights themselves for long contexts. And GPU utilization is often poor: during the token-by-token decode phase, the computation is memory-bandwidth-bound rather than compute-bound, leaving most of the GPU's arithmetic capacity idle.
These constraints have motivated a wide range of optimization techniques, each targeting a different bottleneck.
The following table summarizes the major categories of inference optimization, along with the bottleneck each one addresses and typical performance gains.
| Technique | Bottleneck addressed | Typical speedup | Trade-offs |
|---|---|---|---|
| Quantization | Memory capacity, bandwidth | 2-4x memory reduction; 1.5-2x throughput | Small accuracy loss at aggressive bit widths |
| KV cache optimization | Memory waste, long-context cost | Up to 4x memory efficiency | Added system complexity |
| Continuous batching | GPU underutilization | 3-8x throughput over static batching | Higher implementation complexity |
| Speculative decoding | Decode latency | 2-3x at low batch sizes | Diminishing returns at high batch sizes |
| Flash Attention | Memory bandwidth in attention | 2-4x faster attention; linear memory | Requires compatible hardware |
| Model distillation | Model size, compute cost | 2-8x faster inference | Accuracy gap with teacher model |
| Pruning | Model size, compute cost | 1.5-3x depending on sparsity | Accuracy loss at high sparsity |
| Tensor parallelism | Single-GPU memory limit | Scales to models too large for one GPU | Inter-GPU communication overhead |
The KV cache is one of the biggest memory bottlenecks in LLM inference. During autoregressive generation, the model stores the key and value tensors for every token in every layer so they do not need to be recomputed. For a 70-billion-parameter model processing a 128K-token context, the KV cache alone can require over 100 GB of GPU memory.
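The arithmetic behind figures like this is straightforward: two tensors (keys and values) per layer, per attention head, per token. The dimensions below are illustrative 70B-class values, not any specific model's config; note how grouped-query attention (fewer KV heads) shrinks the total dramatically, which is why quoted KV cache sizes vary so widely.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes):
    """KV cache size for one sequence: keys + values, every layer, every position."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

# Hypothetical 70B-class dimensions: 80 layers, 128-dim heads, FP16 (2 bytes),
# 128K-token context.
full_mha = kv_cache_bytes(80, 64, 128, 128 * 1024, 2)  # 64 KV heads (no GQA)
gqa      = kv_cache_bytes(80, 8, 128, 128 * 1024, 2)   # 8 KV heads (grouped-query)

print(f"full MHA: {full_mha / 2**30:.0f} GiB")  # 320 GiB
print(f"GQA:      {gqa / 2**30:.0f} GiB")       # 40 GiB
```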
Traditional serving systems allocate a contiguous block of memory for each request's KV cache based on the maximum possible sequence length. Because actual output lengths vary widely, this approach wastes 60-80% of KV cache memory on average. Two innovations have dramatically improved this situation.
PagedAttention, introduced by Kwon et al. in the vLLM project (SOSP 2023), borrows the concept of virtual memory paging from operating systems. Instead of allocating one large contiguous buffer per sequence, PagedAttention divides the KV cache into fixed-size blocks (pages) that can be allocated on demand as a sequence grows and freed when it finishes. A page table maps logical sequence positions to physical memory locations, allowing the physical storage to be non-contiguous while the model sees a contiguous logical address space.
This design reduces KV cache memory waste to under 4%, compared to 60-80% in prior systems. It also enables memory sharing between requests: if two sequences share a common prefix (for example, the same system prompt), their KV cache pages for that prefix can point to the same physical memory. The result is a roughly 4x improvement in memory utilization, which translates directly into higher batch sizes and throughput.
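The block-table idea can be sketched in a few lines. This is a toy allocator for illustration only (the class and method names are hypothetical, and vLLM's real implementation handles GPU tensors, sharing, and copy-on-write):

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator. Physical KV memory is a pool of
    fixed-size blocks; each sequence keeps a block table mapping logical
    block index -> physical block id, so physical storage need not be
    contiguous."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:          # current block full (or none yet)
            table.append(self.free_blocks.pop())   # allocate a page on demand
        self.seq_lens[seq_id] = length + 1

    def free_sequence(self, seq_id):
        # return all of the sequence's pages to the free pool
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]
```

Because pages are allocated only as tokens arrive, a request that stops after 40 tokens holds three 16-token blocks instead of a buffer sized for the maximum context.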
vLLM's Automatic Prefix Caching (APC) extends the PagedAttention concept further. Each KV block is hashed based on its token content, and a global hash table tracks all physical blocks. When a new request arrives with a prefix that matches an existing cached block, the system reuses the cached KV data instead of recomputing it. This is particularly valuable for workloads where many requests share common prefixes, such as chatbots with fixed system prompts or retrieval-augmented generation pipelines that prepend retrieved documents.
The eviction policy follows a least-recently-used (LRU) strategy: when GPU memory is full, blocks with zero active references are evicted, prioritizing those that have not been accessed recently.
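A simplified sketch of the hash-and-reuse mechanism described above (hypothetical names; the real system hashes actual KV block contents per prefix and guards against hash collisions):

```python
from collections import OrderedDict

class PrefixCache:
    """Toy automatic-prefix-caching sketch. Full KV blocks are keyed by a
    hash of the entire token prefix up to the end of the block, so a
    matching hash implies a matching prefix. Blocks with zero references
    sit in LRU order and are evicted only when space runs out."""

    def __init__(self, capacity, block_size):
        self.capacity = capacity
        self.block_size = block_size
        self.blocks = {}                # prefix hash -> refcount
        self.evictable = OrderedDict()  # zero-ref hashes, oldest first

    def _hashes(self, tokens):
        n_full = len(tokens) // self.block_size
        return [hash(tuple(tokens[:(i + 1) * self.block_size]))
                for i in range(n_full)]

    def acquire(self, tokens):
        """Register a request; returns how many cached blocks were reused."""
        reused = 0
        for h in self._hashes(tokens):
            if h in self.blocks:
                self.blocks[h] += 1
                self.evictable.pop(h, None)  # referenced again: not evictable
                reused += 1
            else:
                if len(self.blocks) >= self.capacity:
                    victim, _ = self.evictable.popitem(last=False)  # LRU evict
                    del self.blocks[victim]
                self.blocks[h] = 1
        return reused

    def release(self, tokens):
        for h in self._hashes(tokens):
            self.blocks[h] -= 1
            if self.blocks[h] == 0:
                self.evictable[h] = None
```

Two requests that share a 64-token system prompt (with 16-token blocks) would reuse all four prefix blocks on the second request, skipping the corresponding prefill computation.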
Beyond memory management, quantizing the KV cache itself has become standard practice. NVIDIA's NVFP4 KV cache quantization compresses KV data to 4-bit floating point, cutting the cache's memory footprint by up to 50% compared with FP8 while keeping accuracy loss under 1% across benchmarks. For models handling ultra-long contexts (over 1 million tokens), KV cache quantization to INT4 or even INT2 is now considered necessary to fit within GPU memory constraints.
Quantization reduces the numerical precision of model weights (and sometimes activations) from 16-bit or 32-bit floating point down to 8-bit, 4-bit, or even lower representations. The goal is to shrink the model's memory footprint and increase throughput without losing too much accuracy.
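The core mechanic shared by weight-only schemes can be sketched as group-wise symmetric 4-bit quantization: each small group of weights shares one floating-point scale, and values are rounded to integers in [-7, 7]. This is illustrative only; methods like GPTQ and AWQ layer error compensation and activation-aware scaling on top of this basic scheme.

```python
def quantize_int4(weights, group_size=8):
    """Group-wise symmetric 4-bit quantization (illustrative sketch)."""
    scales, q = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid zero scale
        scales.append(scale)
        q.extend(max(-7, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_int4(q, scales, group_size=8):
    """Reconstruct approximate FP weights from 4-bit integers and scales."""
    return [qi * scales[i // group_size] for i, qi in enumerate(q)]
```

The rounding error per weight is bounded by half the group's scale, which is why smaller groups (at the cost of storing more scales) give better accuracy.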
GPTQ (Generative Pre-trained Transformer Quantization) was among the first methods to compress LLMs to 4-bit weights while maintaining reasonable accuracy. It works by analyzing each layer's weight matrix and applying a mathematical optimization (based on the Optimal Brain Surgeon framework) to minimize quantization error. GPTQ requires a calibration dataset, and the quality of the quantized model depends on how representative that dataset is. It runs well on GPUs and has broad framework support.
AWQ (Activation-Aware Weight Quantization) takes a different approach. It observes that a small fraction of weights are disproportionately important for model accuracy (because they correspond to high-magnitude activations) and preserves those weights at higher precision while quantizing the rest more aggressively. AWQ does not rely on backpropagation or layer-by-layer reconstruction, making the quantization process faster and less data-intensive than GPTQ. In benchmarks with optimized Marlin kernels, AWQ achieves the best throughput at 741 tokens per second, slightly ahead of GPTQ's 712 tokens per second.
GGUF (GPT-Generated Unified Format) is a file format designed for the llama.cpp inference engine. Unlike GPTQ and AWQ, which target GPU inference, GGUF is optimized for CPU execution and mixed CPU-GPU offloading. This makes it the standard choice for running models on consumer hardware, Apple Silicon devices, and edge deployments. GGUF supports a range of quantization levels (from Q2 to Q8) and allows users to balance quality against resource constraints.
FP8 (8-bit floating point) has emerged as a sweet spot for data center deployments. Modern GPUs like the H100, H200, and B200 include native FP8 tensor cores that deliver roughly 2x performance improvement over FP16 operations. Because FP8 retains more dynamic range than integer quantization formats, it offers near-lossless accuracy for most models. Many recent models, including those from Meta and Mistral, ship with built-in FP8 support.
The following table compares these quantization methods.
| Method | Bit width | Target hardware | Calibration needed | Best use case |
|---|---|---|---|---|
| GPTQ | 4-bit (weights) | GPU | Yes (dataset) | GPU serving with broad compatibility |
| AWQ | 4-bit (weights) | GPU | Yes (lightweight) | GPU serving with best throughput |
| GGUF | 2-8 bit (flexible) | CPU, mixed CPU/GPU | No | Local/edge deployment, consumer hardware |
| FP8 | 8-bit (weights + activations) | Data center GPU (H100+) | No | Production serving on modern hardware |
The simplest approach to batching inference requests is static batching: collect a group of requests, process them together, and return results when the entire batch is done. The problem is that LLM requests generate variable-length outputs. A request that produces 10 tokens finishes long before one that produces 500, but in a static batch, the short request's GPU slot sits idle until the longest request completes.
Continuous batching (also called iteration-level scheduling) solves this. Introduced in the Orca system (OSDI 2022), it operates at the granularity of individual decode iterations rather than entire requests. When a sequence finishes generating, its slot is immediately filled by a new incoming request, without waiting for the rest of the batch. This keeps GPU utilization consistently high.
The impact is substantial. Orca demonstrated 2-36x throughput improvements over static batching, with typical production workloads (conversational applications where output lengths vary from 10 to 500 tokens) seeing 3-8x gains. Continuous batching is now the default scheduling strategy in all major serving frameworks, including vLLM, TensorRT-LLM, and SGLang.
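The difference between the two strategies can be shown with a toy step-counting simulation (decode iterations only; prefill, memory limits, and scheduling overhead are ignored):

```python
def static_batching_steps(lengths, batch_size):
    """Each static batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Finished sequences free their slot immediately for waiting requests."""
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:  # backfill open slots
            active.append(pending.pop(0))
        steps += 1                                   # one decode iteration
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

lengths = [10, 500, 20, 480]  # variable-length outputs, as in real traffic
print(static_batching_steps(lengths, 2))      # 980
print(continuous_batching_steps(lengths, 2))  # 510
```

With batch size 2, the short requests piggyback on slots freed mid-flight instead of forcing separate batches, nearly halving total decode steps in this contrived workload.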
Autoregressive decoding is inherently sequential: each token depends on the previous one. Speculative decoding breaks this bottleneck by introducing a small, fast "draft" model that guesses multiple tokens ahead. The larger "target" model then verifies these guesses in a single forward pass (which is parallelizable). If the draft model guessed correctly, multiple tokens are accepted at once; if not, the target model generates the correct token at the divergence point and the process restarts.
Because the verification step produces the same probability distribution as normal decoding, speculative decoding is mathematically lossless: the output is identical to what the target model would produce on its own. The speedup depends on how well the draft and target models align (measured by the acceptance rate). At batch sizes of 1-4, speculative decoding typically delivers 2-3x speedups. The benefit diminishes at higher batch sizes, dropping to 1.3-1.6x at batch size 8 and becoming negligible at batch size 32 or above, because the target model's forward pass becomes compute-bound rather than memory-bound.
By 2025, speculative decoding moved from a research curiosity to a production standard, with built-in support in vLLM, SGLang, TensorRT-LLM, and most other serious serving frameworks.
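The draft-then-verify loop can be sketched with toy deterministic "models" (greedy decoding only; real implementations verify against full probability distributions using rejection sampling, which is what makes the method lossless for sampled outputs too):

```python
def target_next(seq):
    # toy deterministic "target model": next token depends on the last token
    return (seq[-1] * 7 + 3) % 50

def draft_next(seq):
    # toy "draft model": agrees with the target except on some tokens
    t = target_next(seq)
    return (t + 1) % 50 if t % 5 == 0 else t

def greedy_decode(prompt, n_tokens):
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        # draft proposes k tokens autoregressively (cheap)
        draft, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # target checks all k positions in one "parallel" verification pass
        accepted, ctx = [], list(seq)
        for t in draft:
            correct = target_next(ctx)
            if t == correct:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(correct)  # target's correction at divergence
                break
        else:
            accepted.append(target_next(ctx))  # all k accepted: bonus token
        seq.extend(accepted)
    return seq[:len(prompt) + n_tokens]
```

Because every emitted token is either a verified draft token or the target's own correction, the output matches plain target-model decoding exactly; the speedup comes from accepting several tokens per target forward pass.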
Flash Attention, developed by Tri Dao at Stanford (later at Together AI), is an algorithm that restructures the attention computation to be IO-aware. Standard attention implementations compute the full attention matrix and store it in GPU high-bandwidth memory (HBM), which creates a quadratic memory bottleneck. Flash Attention instead tiles the computation into blocks that fit in the GPU's faster SRAM, computing attention incrementally and never materializing the full attention matrix.
Flash Attention reduces the memory complexity of attention from O(n^2) to O(n) in sequence length, while producing numerically identical results to standard attention. This enables much longer context windows without running out of memory.
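The trick that makes incremental computation possible is the "online softmax": a running maximum and running denominator let each block's contribution be rescaled as new blocks arrive. A minimal single-query sketch in plain Python (real Flash Attention operates on tiles of queries and fused GPU kernels):

```python
import math

def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax-weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(dim)]

def flash_attention(q, keys, values, block=2):
    """Online softmax over key/value blocks; never stores the full score row."""
    m = float("-inf")          # running max of scores seen so far
    z = 0.0                    # running softmax denominator
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        kb, vb = keys[start:start + block], values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in kb]
        m_new = max(m, max(scores))
        # rescale previous partial results to the new max
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vb):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

Only one block of scores exists at a time, so memory grows with the block size rather than the sequence length, yet the result matches the naive computation up to floating-point error.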
FlashAttention-2 improved on the original with better work partitioning across GPU warps, achieving up to 2x speedup over the first version. FlashAttention-3, optimized for NVIDIA Hopper GPUs (H100/H200), introduced three additional techniques: overlapping computation and data movement through warp specialization, interleaving matrix multiplication with softmax operations, and leveraging hardware FP8 support. FlashAttention-3 achieves up to 740 TFLOPS in FP16 on H100 (75% of theoretical peak) and close to 1.2 PFLOPS in FP8.
Flash Attention is now integrated into PyTorch, all major serving frameworks, and most model training pipelines.
Knowledge distillation transfers the capabilities of a large "teacher" model into a smaller "student" model. The student is trained to match the teacher's output probability distributions ("soft targets") rather than just the hard labels in the training data. This approach captures richer information than standard training because the teacher's probability distribution over all tokens encodes relationships between concepts that a binary correct/incorrect signal does not.
Distilled models typically achieve 2-8x faster inference than their teachers while retaining up to 95% of accuracy, depending on the architecture and the size gap between teacher and student. Notable examples include DistilBERT (which retained 97% of BERT's performance at 60% of the size) and various distilled versions of Llama and Mistral models.
The main challenge is that distillation can lose nuanced capabilities. Tasks requiring deep reasoning or rare factual knowledge tend to degrade more than simpler tasks. NVIDIA's NeMo framework and TensorRT Model Optimizer both provide integrated pipelines for pruning followed by distillation, which often produces better results than either technique alone.
Pruning removes redundant or low-impact parameters from a trained model to reduce its size and computational requirements. There are two main categories.
Unstructured pruning zeroes out individual weights, typically those with the smallest magnitude. This can achieve very high sparsity (90%+ of weights removed) with minimal accuracy loss, but the resulting sparse weight matrices are difficult to accelerate on standard GPU hardware because the non-zero elements are scattered irregularly.
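The baseline unstructured method is magnitude pruning, sketched below (real pipelines prune iteratively with fine-tuning between rounds; this one-shot version is for illustration):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (one-shot sketch)."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

Note that the surviving non-zeros land at arbitrary positions, which is exactly why unstructured sparsity is hard to accelerate: dense matrix-multiply hardware gains nothing from scattered zeros.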
Structured pruning removes entire rows, columns, attention heads, or feed-forward layers. The resulting model remains dense and runs efficiently on standard hardware, but accuracy tends to degrade more quickly as sparsity increases. Recent methods like STUN (Structured-Then-Unstructured) combine both approaches: first removing redundant components at a structural level, then applying fine-grained unstructured pruning within the remaining components.
In 2025, structured pruning research advanced with methods like SlimLLM and NIRVANA, which evaluate importance at the channel or head level rather than aggregating individual weight magnitudes, leading to better accuracy retention at the same sparsity levels.
When a model is too large to fit on a single GPU, it must be split across multiple devices. Tensor parallelism divides individual layers of the model across GPUs. Each GPU holds a slice of every layer and computes its portion in parallel. The partial results are then combined through all-reduce communication operations. This approach works well within a single server where GPUs are connected by fast interconnects like NVLink.
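The two standard ways to shard a single linear layer can be sketched in plain Python, with lists standing in for GPU shards (real systems use NCCL collectives for the gather/reduce steps):

```python
def matmul(x, W):
    """x: vector of length d_in; W: d_in x d_out matrix as a list of rows."""
    d_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]

def column_parallel(x, W, num_gpus):
    """Each 'GPU' owns a slice of output columns; results are concatenated
    (an all-gather in a real system)."""
    per = len(W[0]) // num_gpus
    out = []
    for g in range(num_gpus):
        shard = [row[g * per:(g + 1) * per] for row in W]
        out.extend(matmul(x, shard))
    return out

def row_parallel(x, W, num_gpus):
    """Each 'GPU' owns a slice of input rows and computes a partial output;
    partials are summed (an all-reduce in a real system)."""
    per = len(W) // num_gpus
    partials = []
    for g in range(num_gpus):
        xs = x[g * per:(g + 1) * per]
        Ws = W[g * per:(g + 1) * per]
        partials.append(matmul(xs, Ws))
    return [sum(p[j] for p in partials) for j in range(len(W[0]))]
```

Transformer implementations typically pair the two: column-parallel for the first projection of an MLP and row-parallel for the second, so only one all-reduce is needed per MLP block.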
Pipeline parallelism takes a different approach: it assigns different layers to different GPUs, with data flowing through them sequentially like an assembly line. This reduces inter-GPU communication but introduces pipeline bubbles (idle time while data propagates through stages). Pipeline parallelism is most useful across nodes, where network bandwidth is lower than intra-node interconnects.
In practice, production deployments often combine both: tensor parallelism within a node and pipeline parallelism across nodes. vLLM, TensorRT-LLM, and other frameworks support configuring both dimensions independently.
NVIDIA's TensorRT-LLM is an open-source library that applies GPU-specific optimizations including fused CUDA kernels (combining multiple operations into a single kernel launch to reduce overhead), CUDA graph capture (recording a sequence of GPU operations and replaying them without CPU intervention to eliminate launch latency), and hardware-specific quantization using FP8 or INT4 tensor cores.
TensorRT-LLM's AutoDeploy feature, introduced in 2025, automates the process of applying these optimizations, reducing the time to onboard new model architectures from weeks to days. On NVIDIA Blackwell GPUs, AutoDeploy performs on par with manually optimized baselines.
CUDA graphs are a general NVIDIA optimization that pre-records a sequence of GPU operations (kernel launches, memory copies) into a graph that can be replayed without CPU involvement. For LLM inference, where the decode phase repeats the same computation pattern for each token, CUDA graphs eliminate the CPU overhead of launching individual kernels. This is particularly impactful for latency-sensitive applications and low-batch-size scenarios where kernel launch overhead represents a significant fraction of total time.
The following table illustrates how inference costs have changed and how optimization techniques contribute to cost reductions.
| Scenario | Hardware | Cost per 1M output tokens (approx.) | Notes |
|---|---|---|---|
| GPT-4 API (late 2022) | Cloud (OpenAI) | ~$60.00 | Pre-optimization baseline |
| GPT-4-equivalent API (2025) | Cloud (OpenAI) | ~$2.50 | Provider-side optimizations |
| Self-hosted 70B, FP16 | 2x H100 | ~$1.20 | No quantization, continuous batching |
| Self-hosted 70B, AWQ 4-bit | 1x H100 | ~$0.50 | Quantization halves GPU requirement |
| Self-hosted 70B, FP8 + speculative | 1x H100 | ~$0.35 | Combined optimizations |
| Gemini Flash-Lite API (2025) | Cloud (Google) | ~$0.30 | Aggressive provider optimization |
| Self-hosted 8B distilled, Q4 | 1x consumer GPU | ~$0.05 | Distillation + aggressive quantization |
Cloud H100 prices have stabilized at $1.49-$3.90 per hour depending on the provider (as of early 2026), down from peaks of $7.00+ in 2023. H200 GPUs with 141 GB HBM3e are available at $2.15-$6.00 per hour and can serve 70B models on a single GPU that previously required two H100s.
Several trends are shaping the inference optimization landscape as of early 2026.
Reasoning model optimization is a new frontier. The rise of chain-of-thought and reasoning models (such as OpenAI o1 and DeepSeek-R1) has created demand for optimizing long internal reasoning traces. These models generate hundreds or thousands of tokens of intermediate reasoning before producing a final answer, making decode-phase efficiency even more important.
Mixture-of-experts (MoE) scheduling has become a focus area as models like Mixtral and GPT-4 use sparse MoE architectures. Efficiently routing tokens to the correct expert and batching computations across experts on different GPUs requires specialized scheduling logic that existing frameworks are actively developing.
Disaggregated inference separates the prefill phase (processing the input prompt, which is compute-bound) from the decode phase (generating output tokens, which is memory-bound) onto different hardware. This allows each phase to run on hardware optimized for its specific bottleneck.
Energy-aware inference is gaining attention as the environmental and operational costs of running GPU clusters become harder to ignore. Techniques like dynamic voltage and frequency scaling, workload-aware power management, and carbon-aware scheduling are being integrated into serving platforms.
Edge deployment continues to expand, with frameworks like llama.cpp and the GGUF format enabling capable models to run on laptops, phones, and embedded devices. Quantization to 2-4 bits combined with architecture-specific optimizations (such as Apple's MLX framework for Metal GPUs) makes local inference practical for an increasingly wide range of models.
The overall trajectory is clear: each generation of optimization brings roughly an order-of-magnitude cost reduction, making LLM inference accessible to a broader range of applications and organizations.