Inference optimization refers to a collection of techniques designed to make artificial intelligence model inference faster, more memory-efficient, and cheaper to run. While training a large language model can cost millions of dollars, the cumulative cost of serving that model to users often exceeds the training budget within months. For any organization running LLMs in production, inference optimization is where the real cost savings happen.
The field has evolved rapidly since 2023, driven by the explosive growth of LLM-powered applications. Techniques range from mathematical reformulations of the attention mechanism to systems-level innovations borrowed from operating system design. Some methods reduce the size of the model itself, while others focus on how requests are scheduled, how memory is managed, or how multiple GPUs coordinate their work.
Inference costs dominate the total cost of ownership for production AI systems. A single GPT-4-class model serving millions of users can consume thousands of GPU-hours per day. According to a 2025 analysis by Andreessen Horowitz, the cost of LLM inference has dropped by roughly 1,000x over three years, with prices declining at approximately 10x per year for equivalent performance. Even so, the sheer volume of inference requests means that organizations routinely spend more on serving than they did on training.
Several factors make LLM inference expensive. The autoregressive nature of text generation means tokens are produced one at a time, each requiring a full forward pass through the model. The KV cache (which stores intermediate attention states) grows linearly with sequence length and can consume more GPU memory than the model weights themselves for long contexts. And GPU utilization is often poor: during the token-by-token decode phase, the computation is memory-bandwidth-bound rather than compute-bound, leaving most of the GPU's arithmetic capacity idle.
These constraints have motivated a wide range of optimization techniques, each targeting a different bottleneck.
The following table summarizes the major categories of inference optimization, along with the bottleneck each one addresses and typical performance gains.
| Technique | Bottleneck addressed | Typical speedup | Trade-offs |
|---|---|---|---|
| Quantization | Memory capacity, bandwidth | 2-4x memory reduction; 1.5-2x throughput | Small accuracy loss at aggressive bit widths |
| KV cache optimization | Memory waste, long-context cost | Up to 4x memory efficiency | Added system complexity |
| Continuous batching | GPU underutilization | 3-8x throughput over static batching | Higher implementation complexity |
| Speculative decoding | Decode latency | 2-3x at low batch sizes | Diminishing returns at high batch sizes |
| Flash Attention | Memory bandwidth in attention | 2-4x faster attention; linear memory | Requires compatible hardware |
| Model distillation | Model size, compute cost | 2-8x faster inference | Accuracy gap with teacher model |
| Pruning | Model size, compute cost | 1.5-3x depending on sparsity | Accuracy loss at high sparsity |
| Tensor parallelism | Single-GPU memory limit | Scales to models too large for one GPU | Inter-GPU communication overhead |
The KV cache is one of the biggest memory bottlenecks in LLM inference. During autoregressive generation, the model stores the key and value tensors for every token in every layer so they do not need to be recomputed. For a 70-billion-parameter model processing a 128K-token context, the KV cache alone can require over 100 GB of GPU memory.
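The arithmetic behind figures like this is straightforward: two tensors (keys and values) per layer, per attention head, per token. The dimensions below are illustrative 70B-class values, not any specific model's config; note how grouped-query attention (fewer KV heads) shrinks the total dramatically, which is why quoted KV cache sizes vary so widely.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes):
    """KV cache size for one sequence: keys + values, every layer, every position."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

# Hypothetical 70B-class dimensions: 80 layers, 128-dim heads, FP16 (2 bytes),
# 128K-token context.
full_mha = kv_cache_bytes(80, 64, 128, 128 * 1024, 2)  # 64 KV heads (no GQA)
gqa      = kv_cache_bytes(80, 8, 128, 128 * 1024, 2)   # 8 KV heads (grouped-query)

print(f"full MHA: {full_mha / 2**30:.0f} GiB")  # 320 GiB
print(f"GQA:      {gqa / 2**30:.0f} GiB")       # 40 GiB
```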
Traditional serving systems allocate a contiguous block of memory for each request's KV cache based on the maximum possible sequence length. Because actual output lengths vary widely, this approach wastes 60-80% of KV cache memory on average. Two innovations have dramatically improved this situation.
PagedAttention, introduced by Kwon et al. in the vLLM project (SOSP 2023), borrows the concept of virtual memory paging from operating systems. Instead of allocating one large contiguous buffer per sequence, PagedAttention divides the KV cache into fixed-size blocks (pages) that can be allocated on demand as a sequence grows and freed when it finishes. A page table maps logical sequence positions to physical memory locations, allowing the physical storage to be non-contiguous while the model sees a contiguous logical address space.
This design reduces KV cache memory waste to under 4%, compared to 60-80% in prior systems. It also enables memory sharing between requests: if two sequences share a common prefix (for example, the same system prompt), their KV cache pages for that prefix can point to the same physical memory. The result is a roughly 4x improvement in memory utilization, which translates directly into higher batch sizes and throughput.
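The block-table idea can be sketched in a few lines. This is a toy allocator for illustration only (the class and method names are hypothetical, and vLLM's real implementation handles GPU tensors, sharing, and copy-on-write):

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator. Physical KV memory is a pool of
    fixed-size blocks; each sequence keeps a block table mapping logical
    block index -> physical block id, so physical storage need not be
    contiguous."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens written so far

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:          # current block full (or none yet)
            table.append(self.free_blocks.pop())   # allocate a page on demand
        self.seq_lens[seq_id] = length + 1

    def free_sequence(self, seq_id):
        # return all of the sequence's pages to the free pool
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]
```

Because pages are allocated only as tokens arrive, a request that stops after 40 tokens holds three 16-token blocks instead of a buffer sized for the maximum context.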
vLLM's Automatic Prefix Caching (APC) extends the PagedAttention concept further. Each KV block is hashed based on its token content, and a global hash table tracks all physical blocks. When a new request arrives with a prefix that matches an existing cached block, the system reuses the cached KV data instead of recomputing it. This is particularly valuable for workloads where many requests share common prefixes, such as chatbots with fixed system prompts or retrieval-augmented generation pipelines that prepend retrieved documents.
The eviction policy follows a least-recently-used (LRU) strategy: when GPU memory is full, blocks with zero active references are evicted, prioritizing those that have not been accessed recently.
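A simplified sketch of the hash-and-reuse mechanism described above (hypothetical names; the real system hashes actual KV block contents per prefix and guards against hash collisions):

```python
from collections import OrderedDict

class PrefixCache:
    """Toy automatic-prefix-caching sketch. Full KV blocks are keyed by a
    hash of the entire token prefix up to the end of the block, so a
    matching hash implies a matching prefix. Blocks with zero references
    sit in LRU order and are evicted only when space runs out."""

    def __init__(self, capacity, block_size):
        self.capacity = capacity
        self.block_size = block_size
        self.blocks = {}                # prefix hash -> refcount
        self.evictable = OrderedDict()  # zero-ref hashes, oldest first

    def _hashes(self, tokens):
        n_full = len(tokens) // self.block_size
        return [hash(tuple(tokens[:(i + 1) * self.block_size]))
                for i in range(n_full)]

    def acquire(self, tokens):
        """Register a request; returns how many cached blocks were reused."""
        reused = 0
        for h in self._hashes(tokens):
            if h in self.blocks:
                self.blocks[h] += 1
                self.evictable.pop(h, None)  # referenced again: not evictable
                reused += 1
            else:
                if len(self.blocks) >= self.capacity:
                    victim, _ = self.evictable.popitem(last=False)  # LRU evict
                    del self.blocks[victim]
                self.blocks[h] = 1
        return reused

    def release(self, tokens):
        for h in self._hashes(tokens):
            self.blocks[h] -= 1
            if self.blocks[h] == 0:
                self.evictable[h] = None
```

Two requests that share a 64-token system prompt (with 16-token blocks) would reuse all four prefix blocks on the second request, skipping the corresponding prefill computation.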
Beyond memory management, quantizing the KV cache itself has become standard practice. NVIDIA's NVFP4 KV cache quantization compresses KV data to 4-bit floating point, cutting the cache's memory footprint by up to 50% compared with FP8 while keeping accuracy loss under 1% across benchmarks. For models handling ultra-long contexts (over 1 million tokens), KV cache quantization to INT4 or even INT2 is now considered necessary to fit within GPU memory constraints.
Quantization reduces the numerical precision of model weights (and sometimes activations) from 16-bit or 32-bit floating point down to 8-bit, 4-bit, or even lower representations. The goal is to shrink the model's memory footprint and increase throughput without losing too much accuracy.
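The core mechanic shared by weight-only schemes can be sketched as group-wise symmetric 4-bit quantization: each small group of weights shares one floating-point scale, and values are rounded to integers in [-7, 7]. This is illustrative only; methods like GPTQ and AWQ layer error compensation and activation-aware scaling on top of this basic scheme.

```python
def quantize_int4(weights, group_size=8):
    """Group-wise symmetric 4-bit quantization (illustrative sketch)."""
    scales, q = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid zero scale
        scales.append(scale)
        q.extend(max(-7, min(7, round(w / scale))) for w in group)
    return q, scales

def dequantize_int4(q, scales, group_size=8):
    """Reconstruct approximate FP weights from 4-bit integers and scales."""
    return [qi * scales[i // group_size] for i, qi in enumerate(q)]
```

The rounding error per weight is bounded by half the group's scale, which is why smaller groups (at the cost of storing more scales) give better accuracy.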
GPTQ (Generative Pre-trained Transformer Quantization) was among the first methods to compress LLMs to 4-bit weights while maintaining reasonable accuracy. It works by analyzing each layer's weight matrix and applying a mathematical optimization (based on the Optimal Brain Surgeon framework) to minimize quantization error. GPTQ requires a calibration dataset, and the quality of the quantized model depends on how representative that dataset is. It runs well on GPUs and has broad framework support.
AWQ (Activation-Aware Weight Quantization) takes a different approach. It observes that a small fraction of weights are disproportionately important for model accuracy (because they correspond to high-magnitude activations) and preserves those weights at higher precision while quantizing the rest more aggressively. AWQ does not rely on backpropagation or layer-by-layer reconstruction, making the quantization process faster and less data-intensive than GPTQ. In benchmarks with optimized Marlin kernels, AWQ achieves the best throughput at 741 tokens per second, slightly ahead of GPTQ's 712 tokens per second.
GGUF (GPT-Generated Unified Format) is a file format designed for the llama.cpp inference engine. Unlike GPTQ and AWQ, which target GPU inference, GGUF is optimized for CPU execution and mixed CPU-GPU offloading. This makes it the standard choice for running models on consumer hardware, Apple Silicon devices, and edge deployments. GGUF supports a range of quantization levels (from Q2 to Q8) and allows users to balance quality against resource constraints.
FP8 (8-bit floating point) has emerged as a sweet spot for data center deployments. Modern GPUs like the H100, H200, and B200 include native FP8 tensor cores that deliver roughly 2x performance improvement over FP16 operations. Because FP8 retains more dynamic range than integer quantization formats, it offers near-lossless accuracy for most models. Many recent models, including those from Meta and Mistral, ship with built-in FP8 support.
The following table compares these quantization methods.
| Method | Bit width | Target hardware | Calibration needed | Best use case |
|---|---|---|---|---|
| GPTQ | 4-bit (weights) | GPU | Yes (dataset) | GPU serving with broad compatibility |
| AWQ | 4-bit (weights) | GPU | Yes (lightweight) | GPU serving with best throughput |
| GGUF | 2-8 bit (flexible) | CPU, mixed CPU/GPU | No | Local/edge deployment, consumer hardware |
| FP8 | 8-bit (weights + activations) | Data center GPU (H100+) | No | Production serving on modern hardware |
The simplest approach to batching inference requests is static batching: collect a group of requests, process them together, and return results when the entire batch is done. The problem is that LLM requests generate variable-length outputs. A request that produces 10 tokens finishes long before one that produces 500, but in a static batch, the short request's GPU slot sits idle until the longest request completes.
Continuous batching (also called iteration-level scheduling) solves this. Introduced in the Orca system (OSDI 2022), it operates at the granularity of individual decode iterations rather than entire requests. When a sequence finishes generating, its slot is immediately filled by a new incoming request, without waiting for the rest of the batch. This keeps GPU utilization consistently high.
The impact is substantial. Orca demonstrated 2-36x throughput improvements over static batching, with typical production workloads (conversational applications where output lengths vary from 10 to 500 tokens) seeing 3-8x gains. Continuous batching is now the default scheduling strategy in all major serving frameworks, including vLLM, TensorRT-LLM, and SGLang.
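The difference between the two strategies can be shown with a toy step-counting simulation (decode iterations only; prefill, memory limits, and scheduling overhead are ignored):

```python
def static_batching_steps(lengths, batch_size):
    """Each static batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Finished sequences free their slot immediately for waiting requests."""
    pending = list(lengths)
    active = []
    steps = 0
    while pending or active:
        while pending and len(active) < batch_size:  # backfill open slots
            active.append(pending.pop(0))
        steps += 1                                   # one decode iteration
        active = [r - 1 for r in active if r - 1 > 0]
    return steps

lengths = [10, 500, 20, 480]  # variable-length outputs, as in real traffic
print(static_batching_steps(lengths, 2))      # 980
print(continuous_batching_steps(lengths, 2))  # 510
```

With batch size 2, the short requests piggyback on slots freed mid-flight instead of forcing separate batches, nearly halving total decode steps in this contrived workload.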
Autoregressive decoding is inherently sequential: each token depends on the previous one. Speculative decoding breaks this bottleneck by introducing a small, fast "draft" model that guesses multiple tokens ahead. The larger "target" model then verifies these guesses in a single forward pass (which is parallelizable). If the draft model guessed correctly, multiple tokens are accepted at once; if not, the target model generates the correct token at the divergence point and the process restarts.
Because the verification step produces the same probability distribution as normal decoding, speculative decoding is mathematically lossless: the output is identical to what the target model would produce on its own. The speedup depends on how well the draft and target models align (measured by the acceptance rate). At batch sizes of 1-4, speculative decoding typically delivers 2-3x speedups. The benefit diminishes at higher batch sizes, dropping to 1.3-1.6x at batch size 8 and becoming negligible at batch size 32 or above, because the target model's forward pass becomes compute-bound rather than memory-bound.
By 2025, speculative decoding moved from a research curiosity to a production standard, with built-in support in vLLM, SGLang, TensorRT-LLM, and most other serious serving frameworks.
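The draft-then-verify loop can be sketched with toy deterministic "models" (greedy decoding only; real implementations verify against full probability distributions using rejection sampling, which is what makes the method lossless for sampled outputs too):

```python
def target_next(seq):
    # toy deterministic "target model": next token depends on the last token
    return (seq[-1] * 7 + 3) % 50

def draft_next(seq):
    # toy "draft model": agrees with the target except on some tokens
    t = target_next(seq)
    return (t + 1) % 50 if t % 5 == 0 else t

def greedy_decode(prompt, n_tokens):
    seq = list(prompt)
    for _ in range(n_tokens):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        # draft proposes k tokens autoregressively (cheap)
        draft, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # target checks all k positions in one "parallel" verification pass
        accepted, ctx = [], list(seq)
        for t in draft:
            correct = target_next(ctx)
            if t == correct:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(correct)  # target's correction at divergence
                break
        else:
            accepted.append(target_next(ctx))  # all k accepted: bonus token
        seq.extend(accepted)
    return seq[:len(prompt) + n_tokens]
```

Because every emitted token is either a verified draft token or the target's own correction, the output matches plain target-model decoding exactly; the speedup comes from accepting several tokens per target forward pass.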
Flash Attention, developed by Tri Dao at Stanford (later at Together AI), is an algorithm that restructures the attention computation to be IO-aware. Standard attention implementations compute the full attention matrix and store it in GPU high-bandwidth memory (HBM), which creates a quadratic memory bottleneck. Flash Attention instead tiles the computation into blocks that fit in the GPU's faster SRAM, computing attention incrementally and never materializing the full attention matrix.
Flash Attention reduces the memory complexity of attention from O(n^2) to O(n) in sequence length, while producing numerically identical results to standard attention. This enables much longer context windows without running out of memory.
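The trick that makes incremental computation possible is the "online softmax": a running maximum and running denominator let each block's contribution be rescaled as new blocks arrive. A minimal single-query sketch in plain Python (real Flash Attention operates on tiles of queries and fused GPU kernels):

```python
import math

def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax-weighted sum."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / z for d in range(dim)]

def flash_attention(q, keys, values, block=2):
    """Online softmax over key/value blocks; never stores the full score row."""
    m = float("-inf")          # running max of scores seen so far
    z = 0.0                    # running softmax denominator
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block):
        kb, vb = keys[start:start + block], values[start:start + block]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in kb]
        m_new = max(m, max(scores))
        # rescale previous partial results to the new max
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        z *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, vb):
            w = math.exp(s - m_new)
            z += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / z for a in acc]
```

Only one block of scores exists at a time, so memory grows with the block size rather than the sequence length, yet the result matches the naive computation up to floating-point error.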
FlashAttention-2 improved on the original with better work partitioning across GPU warps, achieving up to 2x speedup over the first version. FlashAttention-3, optimized for NVIDIA Hopper GPUs (H100/H200), introduced three additional techniques: overlapping computation and data movement through warp specialization, interleaving matrix multiplication with softmax operations, and leveraging hardware FP8 support. FlashAttention-3 achieves up to 740 TFLOPS in FP16 on H100 (75% of theoretical peak) and close to 1.2 PFLOPS in FP8.
Flash Attention is now integrated into PyTorch, all major serving frameworks, and most model training pipelines.
Knowledge distillation transfers the capabilities of a large "teacher" model into a smaller "student" model. The student is trained to match the teacher's output probability distributions ("soft targets") rather than just the hard labels in the training data. This approach captures richer information than standard training because the teacher's probability distribution over all tokens encodes relationships between concepts that a binary correct/incorrect signal does not.
Distilled models typically achieve 2-8x faster inference than their teachers while retaining up to 95% of accuracy, depending on the architecture and the size gap between teacher and student. Notable examples include DistilBERT (which retained 97% of BERT's performance at 60% of the size) and various distilled versions of Llama and Mistral models.
The main challenge is that distillation can lose nuanced capabilities. Tasks requiring deep reasoning or rare factual knowledge tend to degrade more than simpler tasks. NVIDIA's NeMo framework and TensorRT Model Optimizer both provide integrated pipelines for pruning followed by distillation, which often produces better results than either technique alone.
Pruning removes redundant or low-impact parameters from a trained model to reduce its size and computational requirements. There are two main categories.
Unstructured pruning zeroes out individual weights, typically those with the smallest magnitude. This can achieve very high sparsity (90%+ of weights removed) with minimal accuracy loss, but the resulting sparse weight matrices are difficult to accelerate on standard GPU hardware because the non-zero elements are scattered irregularly.
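The baseline unstructured method is magnitude pruning, sketched below (real pipelines prune iteratively with fine-tuning between rounds; this one-shot version is for illustration):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (one-shot sketch)."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

Note that the surviving non-zeros land at arbitrary positions, which is exactly why unstructured sparsity is hard to accelerate: dense matrix-multiply hardware gains nothing from scattered zeros.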
Structured pruning removes entire rows, columns, attention heads, or feed-forward layers. The resulting model remains dense and runs efficiently on standard hardware, but accuracy tends to degrade more quickly as sparsity increases. Recent methods like STUN (Structured-Then-Unstructured) combine both approaches: first removing redundant components at a structural level, then applying fine-grained unstructured pruning within the remaining components.
In 2025, structured pruning research advanced with methods like SlimLLM and NIRVANA, which evaluate importance at the channel or head level rather than aggregating individual weight magnitudes, leading to better accuracy retention at the same sparsity levels.
When a model is too large to fit on a single GPU, it must be split across multiple devices. Tensor parallelism divides individual layers of the model across GPUs. Each GPU holds a slice of every layer and computes its portion in parallel. The partial results are then combined through all-reduce communication operations. This approach works well within a single server where GPUs are connected by fast interconnects like NVLink.
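The two standard ways to shard a single linear layer can be sketched in plain Python, with lists standing in for GPU shards (real systems use NCCL collectives for the gather/reduce steps):

```python
def matmul(x, W):
    """x: vector of length d_in; W: d_in x d_out matrix as a list of rows."""
    d_out = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]

def column_parallel(x, W, num_gpus):
    """Each 'GPU' owns a slice of output columns; results are concatenated
    (an all-gather in a real system)."""
    per = len(W[0]) // num_gpus
    out = []
    for g in range(num_gpus):
        shard = [row[g * per:(g + 1) * per] for row in W]
        out.extend(matmul(x, shard))
    return out

def row_parallel(x, W, num_gpus):
    """Each 'GPU' owns a slice of input rows and computes a partial output;
    partials are summed (an all-reduce in a real system)."""
    per = len(W) // num_gpus
    partials = []
    for g in range(num_gpus):
        xs = x[g * per:(g + 1) * per]
        Ws = W[g * per:(g + 1) * per]
        partials.append(matmul(xs, Ws))
    return [sum(p[j] for p in partials) for j in range(len(W[0]))]
```

Transformer implementations typically pair the two: column-parallel for the first projection of an MLP and row-parallel for the second, so only one all-reduce is needed per MLP block.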
Pipeline parallelism takes a different approach: it assigns different layers to different GPUs, with data flowing through them sequentially like an assembly line. This reduces inter-GPU communication but introduces pipeline bubbles (idle time while data propagates through stages). Pipeline parallelism is most useful across nodes, where network bandwidth is lower than intra-node interconnects.
In practice, production deployments often combine both: tensor parallelism within a node and pipeline parallelism across nodes. vLLM, TensorRT-LLM, and other frameworks support configuring both dimensions independently.
NVIDIA's TensorRT-LLM is an open-source library that applies GPU-specific optimizations including fused CUDA kernels (combining multiple operations into a single kernel launch to reduce overhead), CUDA graph capture (recording a sequence of GPU operations and replaying them without CPU intervention to eliminate launch latency), and hardware-specific quantization using FP8 or INT4 tensor cores.
TensorRT-LLM's AutoDeploy feature, introduced in 2025, automates the process of applying these optimizations, reducing the time to onboard new model architectures from weeks to days. On NVIDIA Blackwell GPUs, AutoDeploy performs on par with manually optimized baselines.
CUDA graphs are a general NVIDIA optimization that pre-records a sequence of GPU operations (kernel launches, memory copies) into a graph that can be replayed without CPU involvement. For LLM inference, where the decode phase repeats the same computation pattern for each token, CUDA graphs eliminate the CPU overhead of launching individual kernels. This is particularly impactful for latency-sensitive applications and low-batch-size scenarios where kernel launch overhead represents a significant fraction of total time.
The following table illustrates how inference costs have changed and how optimization techniques contribute to cost reductions.
| Scenario | Hardware | Cost per 1M output tokens (approx.) | Notes |
|---|---|---|---|
| GPT-4 API (late 2022) | Cloud (OpenAI) | ~$60.00 | Pre-optimization baseline |
| GPT-4-equivalent API (2025) | Cloud (OpenAI) | ~$2.50 | Provider-side optimizations |
| Self-hosted 70B, FP16 | 2x H100 | ~$1.20 | No quantization, continuous batching |
| Self-hosted 70B, AWQ 4-bit | 1x H100 | ~$0.50 | Quantization halves GPU requirement |
| Self-hosted 70B, FP8 + speculative | 1x H100 | ~$0.35 | Combined optimizations |
| Gemini Flash-Lite API (2025) | Cloud (Google) | ~$0.30 | Aggressive provider optimization |
| Self-hosted 8B distilled, Q4 | 1x consumer GPU | ~$0.05 | Distillation + aggressive quantization |
Cloud H100 prices have stabilized at $1.49-$3.90 per hour depending on the provider (as of early 2026), down from peaks of $7.00+ in 2023. H200 GPUs with 141 GB HBM3e are available at $2.15-$6.00 per hour and can serve 70B models on a single GPU that previously required two H100s.
Several trends are shaping the inference optimization landscape as of early 2026.
Reasoning model optimization is a new frontier. The rise of chain-of-thought and reasoning models (such as OpenAI o1 and DeepSeek-R1) has created demand for optimizing long internal reasoning traces. These models generate hundreds or thousands of tokens of intermediate reasoning before producing a final answer, making decode-phase efficiency even more important.
Mixture-of-experts (MoE) scheduling has become a focus area as models like Mixtral and GPT-4 use sparse MoE architectures. Efficiently routing tokens to the correct expert and batching computations across experts on different GPUs requires specialized scheduling logic that existing frameworks are actively developing.
Disaggregated inference separates the prefill phase (processing the input prompt, which is compute-bound) from the decode phase (generating output tokens, which is memory-bound) onto different hardware. This allows each phase to run on hardware optimized for its specific bottleneck.
Energy-aware inference is gaining attention as the environmental and operational costs of running GPU clusters become harder to ignore. Techniques like dynamic voltage and frequency scaling, workload-aware power management, and carbon-aware scheduling are being integrated into serving platforms.
Edge deployment continues to expand, with frameworks like llama.cpp and the GGUF format enabling capable models to run on laptops, phones, and embedded devices. Quantization to 2-4 bits combined with architecture-specific optimizations (such as Apple's MLX framework for Metal GPUs) makes local inference practical for an increasingly wide range of models.
The overall trajectory is clear: each generation of optimization brings roughly an order-of-magnitude cost reduction, making LLM inference accessible to a broader range of applications and organizations.