vLLM is an open-source large language model serving engine designed for high-throughput, memory-efficient inference. Created by Woosuk Kwon and collaborators at UC Berkeley's Sky Computing Lab in 2023, vLLM introduced PagedAttention, an algorithm that borrows virtual memory concepts from operating systems to manage the key-value (KV) cache during LLM inference. The project has since grown into one of the most widely adopted LLM serving frameworks in production, with over 66,000 GitHub stars and contributions from hundreds of developers across organizations including UC Berkeley, Neural Magic, Anyscale, IBM, AMD, Intel, and NVIDIA.
vLLM powers production systems at companies including Meta, Mistral AI, Cohere, IBM, and Roblox. It also serves as the backend for Amazon Rufus and LinkedIn's AI features. The project offers an OpenAI-compatible API server, making it straightforward for teams to migrate from hosted API providers to self-hosted inference without changing application code.
vLLM originated from a research observation: existing LLM serving systems wasted enormous amounts of GPU memory. When a model generates text, it stores key and value tensors (the "KV cache") for every previous token at every layer of the transformer. Traditional systems allocated a contiguous block of GPU memory for each request's KV cache based on the maximum possible sequence length. Because actual output lengths vary widely, these systems wasted 60-80% of KV cache memory on average.
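To make the scale concrete, here is a back-of-the-envelope calculation of per-token KV cache size. The dimensions are illustrative (Llama-2-7B-like: 32 layers, hidden size 4096, fp16), not figures taken from the paper:

```python
# Per-token KV cache size = 2 (key + value) x layers x hidden_size x bytes_per_value.
# Dimensions below are illustrative (Llama-2-7B-like), not from the vLLM paper.
num_layers = 32
hidden_size = 4096       # num_heads * head_dim
bytes_per_value = 2      # fp16

per_token_bytes = 2 * num_layers * hidden_size * bytes_per_value
print(per_token_bytes // 1024, "KiB per token")   # 512 KiB per token

# A sequence slot reserved for 2048 tokens that actually generates only 500:
reserved = 2048 * per_token_bytes
used = 500 * per_token_bytes
print(f"wasted: {100 * (1 - used / reserved):.0f}%")   # wasted: 76%
```

At half a megabyte per token, a single pre-reserved 2048-token slot consumes about 1 GB of GPU memory whether or not the output ever fills it, which is why contiguous maximum-length allocation wastes so much.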
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica presented their solution in the paper "Efficient Memory Management for Large Language Model Serving with PagedAttention" at the 29th ACM Symposium on Operating Systems Principles (SOSP) in October 2023. The paper drew an analogy between the memory management problem in LLM serving and the virtual memory problem that operating systems solved decades ago.
The core insight was that KV cache memory did not need to be contiguous. Just as an operating system maps virtual memory pages to scattered physical memory frames, PagedAttention maps logical positions in a sequence to non-contiguous physical blocks of GPU memory. This seemingly simple idea had a dramatic practical impact.
PagedAttention partitions the KV cache for each request into fixed-size blocks (analogous to memory pages). Each block holds the key and value vectors for a fixed number of tokens. A block table, similar to a page table in an operating system, maps logical token positions to physical block locations in GPU memory.
This design provides three benefits.
**Near-zero memory waste.** Blocks are allocated on demand as a sequence grows and freed when it completes. Because blocks are small and independent, the only wasted memory is the partially filled last block of each sequence. In practice, this reduces KV cache waste from 60-80% to under 4%.
**Memory sharing across requests.** If two requests share a common prefix (for instance, the same system prompt), their KV cache blocks for that prefix can point to the same physical memory through copy-on-write semantics. This is analogous to how operating systems share memory pages between forked processes.
**Flexible memory management.** Blocks can be allocated, freed, and moved independently, enabling sophisticated cache management policies like LRU eviction and prefix caching without the constraints of contiguous allocation.
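The mechanics above (block tables, on-demand allocation, and copy-on-write via reference counts) can be sketched in a few dozen lines. The class and its methods are a toy illustration, not vLLM's internal API:

```python
BLOCK_SIZE = 16  # tokens per KV block; vLLM's block size is configurable

class BlockManager:
    """Toy PagedAttention-style allocator (illustrative, not vLLM's internals)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refs = {}     # physical block id -> reference count
        self.tables = {}   # seq_id -> block table (logical index -> physical id)

    def _alloc(self):
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def append(self, seq_id, token_pos):
        """Ensure a block exists for the token at `token_pos`; allocate on demand."""
        table = self.tables.setdefault(seq_id, [])
        if token_pos // BLOCK_SIZE >= len(table):
            table.append(self._alloc())
        else:
            # Copy-on-write: writing into a shared block requires a private copy.
            block = table[token_pos // BLOCK_SIZE]
            if self.refs[block] > 1:
                self.refs[block] -= 1
                table[token_pos // BLOCK_SIZE] = self._alloc()

    def fork(self, parent, child):
        """Share the parent's blocks with a child sequence (e.g. a common prefix)."""
        self.tables[child] = list(self.tables[parent])
        for block in self.tables[child]:
            self.refs[block] += 1

    def release(self, seq_id):
        for block in self.tables.pop(seq_id):
            self.refs[block] -= 1
            if self.refs[block] == 0:
                del self.refs[block]
                self.free.append(block)

mgr = BlockManager(num_blocks=8)
for pos in range(20):            # 20 tokens -> 2 blocks (16 full + 4 partial)
    mgr.append("seq-a", pos)
mgr.fork("seq-a", "seq-b")       # prefix shared: no new memory consumed
print(len(mgr.free))             # 6 blocks still free
```

Note that forking a second sequence off the first consumes no additional blocks until one of them writes into a shared block, at which point the copy-on-write path in `append` gives it a private copy.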
In benchmarks against the state of the art at the time (NVIDIA FasterTransformer and Orca), vLLM achieved 2-4x higher throughput at the same latency levels. Compared to naive serving approaches without KV cache optimization, the improvement was 14-24x.
vLLM implements continuous batching (also called iteration-level scheduling), first proposed by the Orca system. Rather than grouping requests into fixed batches and waiting for every request in a batch to finish, vLLM processes requests at the granularity of individual decode steps. When one request finishes generating, its slot is immediately filled by a new request from the queue. This keeps GPU utilization consistently high regardless of how much output lengths vary across requests.
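The scheduling policy is easy to simulate. The following sketch (a toy scheduler with made-up request lengths, not vLLM's scheduler code) shows how a finished request's slot is refilled before the next decode step:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Toy iteration-level scheduler: refill the batch after every decode step.

    `requests` maps request id -> number of tokens to generate (illustrative).
    Returns the order in which requests complete.
    """
    queue = deque(requests.items())
    running = {}        # request id -> tokens remaining
    finished = []
    while queue or running:
        # Admit new requests into any free slots before the next step.
        while queue and len(running) < max_batch:
            rid, n = queue.popleft()
            running[rid] = n
        # One decode step: every running request generates one token.
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:     # finished; its slot frees immediately
                del running[rid]
                finished.append(rid)
    return finished

order = continuous_batching({"a": 2, "b": 5, "c": 1, "d": 3, "e": 1})
print(order)  # -> ['c', 'a', 'e', 'd', 'b']
```

With static batching, request "e" would have waited for the entire first batch (including the 5-token "b") to drain; here it is admitted the moment "c" finishes.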
For models too large to fit on a single GPU, vLLM supports tensor parallelism, which splits each layer of the model across multiple GPUs. Each GPU computes its portion of the layer in parallel, and partial results are combined through all-reduce operations. vLLM also supports pipeline parallelism (assigning different layers to different GPUs) for multi-node deployments. A common production configuration uses tensor parallelism within a node (where GPUs are connected by fast NVLink) and pipeline parallelism across nodes.
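The arithmetic behind tensor parallelism can be illustrated without GPUs. Below, a linear layer's weight is split along its input dimension across two simulated "devices", each computes a partial product, and an element-wise sum stands in for the all-reduce (pure-Python sketch; real implementations use NCCL collectives over sharded tensors):

```python
# Row-parallel linear layer: the weight's input dimension is split across
# "devices"; each computes a partial product, and an all-reduce (here, a sum)
# combines them. Pure-Python stand-in for the multi-GPU computation.

def matvec(weight, x):
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]

# Full layer: 2x4 weight applied to a length-4 input.
W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 1, 2, 2]
full = matvec(W, x)

# Shard the columns of W (and the matching slice of x) across two devices.
W0, W1 = [row[:2] for row in W], [row[2:] for row in W]
x0, x1 = x[:2], x[2:]
partials = [matvec(W0, x0), matvec(W1, x1)]   # computed in parallel, one per GPU

# All-reduce: element-wise sum of the partial results.
reduced = [sum(vals) for vals in zip(*partials)]
print(full == reduced)  # True: sharded compute plus all-reduce matches the full layer
```

Each device holds only its shard of the weights, which is what lets a 70B-parameter model fit across four or eight GPUs that could not hold it individually.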
Speculative decoding uses a smaller draft model to propose multiple tokens ahead, which the main model then verifies in a single forward pass. If the draft tokens match what the main model would have generated, all of them are accepted at once. This technique produces output identical to standard decoding (it is mathematically lossless) while achieving 2-3x speedups at low batch sizes. vLLM integrates speculative decoding as a configurable feature that users can enable by specifying a draft model.
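The draft-and-verify loop is simple to state for greedy decoding. The sketch below uses lookup tables as stand-in "models" (all names are illustrative; vLLM's implementation verifies with a single batched forward pass and handles sampling, not just greedy decoding):

```python
def speculative_step(draft_next, target_next, context, k=4):
    """One round of draft-and-verify speculative decoding (greedy case).

    `draft_next` / `target_next` map a context tuple to the next token.
    The draft proposes k tokens; the target verifies them and accepts the
    longest matching prefix plus one corrected token on the first miss.
    """
    # Draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], context
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx = ctx + (tok,)

    # Target model verifies all positions (in practice, one forward pass).
    accepted, ctx = [], context
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # target's own token replaces the miss
            break
        accepted.append(tok)
        ctx = ctx + (tok,)
    return accepted

# Toy "models": the draft agrees with the target except at the last position.
target = {(): "a", ("a",): "b", ("a", "b"): "c", ("a", "b", "c"): "d"}
draft  = {(): "a", ("a",): "b", ("a", "b"): "c", ("a", "b", "c"): "x"}
out = speculative_step(draft.get, target.get, context=())
print(out)  # -> ['a', 'b', 'c', 'd']: four tokens from one verification pass
```

Because the target either confirms the draft's token or substitutes its own, the output sequence is exactly what the target model alone would have produced; the speedup comes from verifying several positions per expensive forward pass.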
vLLM's Automatic Prefix Caching (APC) hashes each KV block based on its token content and maintains a global hash table of all physical blocks. When a new request arrives with a prefix that matches cached blocks, the system reuses the existing KV data instead of recomputing it. This is particularly valuable for applications where many requests share the same system prompt, few-shot examples, or retrieved documents.
The caching system manages eviction automatically using a least-recently-used policy. Blocks with zero active references are candidates for eviction, and the system prioritizes removing those that have not been accessed recently.
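A toy version of content-hashed blocks with LRU eviction looks like this. The class, block size, and hashing scheme are illustrative (vLLM chains block hashes so a block's identity depends on everything before it, which this sketch mimics):

```python
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 4  # tokens per block (small for illustration)

class PrefixCache:
    """Toy automatic prefix cache: blocks keyed by a hash of their token
    content *and* the blocks before them, evicted in LRU order.
    Illustrative only; not vLLM's internal implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block hash -> cached KV (stubbed as tokens)

    @staticmethod
    def _hash(prefix_hash, tokens):
        data = (prefix_hash + "|" + ",".join(map(str, tokens))).encode()
        return hashlib.sha256(data).hexdigest()

    def lookup_or_insert(self, tokens):
        """Return (hits, misses) over the full blocks of a token sequence."""
        hits = misses = 0
        prefix_hash = ""
        for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
            block = tuple(tokens[i:i + BLOCK_SIZE])
            h = self._hash(prefix_hash, block)
            if h in self.blocks:
                self.blocks.move_to_end(h)            # mark as recently used
                hits += 1
            else:
                if len(self.blocks) >= self.capacity:
                    self.blocks.popitem(last=False)   # evict the LRU block
                self.blocks[h] = block
                misses += 1
            prefix_hash = h
        return hits, misses

cache = PrefixCache(capacity=16)
system_prompt = list(range(8))  # an 8-token prefix shared by every request
print(cache.lookup_or_insert(system_prompt + [101, 102, 103, 104]))  # (0, 3)
print(cache.lookup_or_insert(system_prompt + [201, 202, 203, 204]))  # (2, 1)
```

The second request hits the two blocks covering the shared system prompt and only computes KV data for its own suffix, which is exactly the saving APC provides for shared prompts and few-shot examples.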
vLLM supports serving LoRA (Low-Rank Adaptation) fine-tuned models on a per-request basis with minimal overhead. A single base model can serve dozens of LoRA adapters simultaneously, with each incoming request specifying which adapter to apply. Adapters can be loaded from local files or remote storage (such as S3) and can be dynamically added or removed at runtime through API endpoints.
This capability is important for organizations that fine-tune models for different customers or tasks but want to avoid running separate model instances for each adapter.
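The underlying idea is that a LoRA adapter is a low-rank delta applied on top of shared base weights, so selecting an adapter per request is just a lookup. The sketch below shrinks the matrices to scalars to show the dispatch pattern; the adapter names and registry are hypothetical (in vLLM, requests select adapters via a per-request parameter):

```python
# Per-request LoRA dispatch, with 1x1 "matrices" so the low-rank update
# W' = W + B @ A reduces to scalar arithmetic. Names are illustrative.

BASE_WEIGHT = 10.0

# Each adapter stores a low-rank delta (here just scalars b and a).
adapters = {
    "customer-a": (0.5, 2.0),   # (B, A) -> delta = 1.0
    "customer-b": (1.0, 3.0),   # delta = 3.0
}

def forward(x, adapter=None):
    """Apply the shared base weight plus the requested adapter's delta."""
    w = BASE_WEIGHT
    if adapter is not None:
        b, a = adapters[adapter]
        w = w + b * a           # W' = W + BA, computed without copying W
    return w * x

# Requests in the same batch can target different adapters over one base model.
print(forward(1.0))                 # 10.0 (base model only)
print(forward(1.0, "customer-a"))   # 11.0
print(forward(1.0, "customer-b"))   # 13.0
```

Because the base weights are never copied, adding an adapter costs only the (small) low-rank matrices, which is why dozens of adapters can share one model instance.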
vLLM supports vision-language models (VLMs) including Qwen2-VL and LLaVA variants, handling image and video inputs alongside text. The vLLM community released vllm-omni in November 2025 as a dedicated project for omni-modality model serving.
vLLM provides an OpenAI-compatible HTTP server that mirrors the OpenAI Chat Completions and Completions API formats. Applications built against the OpenAI API can switch to a self-hosted vLLM backend by changing only the base URL. The system also supports the OpenAI Responses API for tool-use workflows.
Performance comparisons depend heavily on the specific model, hardware, workload pattern, and optimization settings. The following numbers provide general context rather than definitive rankings, since benchmark conditions vary across sources.
vLLM's original benchmarks (SOSP 2023) showed 2-4x throughput improvement over FasterTransformer and Orca at equivalent latency. On H100 GPUs with combined quantization, speculative decoding, and PagedAttention, vLLM has been measured at over 500 tokens per second for single-request scenarios.
The V1 architecture (discussed below) demonstrated consistently lower latency than the prior architecture (referred to as V0), especially under high query rates, with particularly large improvements for vision-language models due to offloading input processing to a separate process.
The following table compares vLLM with other major LLM serving frameworks. Note that this landscape changes rapidly; feature comparisons reflect the state of these projects in early 2026.
| Feature | vLLM | TensorRT-LLM | SGLang | Ollama | TGI |
|---|---|---|---|---|---|
| Developer | UC Berkeley / Community | NVIDIA | LMSYS | Ollama Inc. | Hugging Face |
| Core innovation | PagedAttention | Fused CUDA kernels, CUDA graphs | RadixAttention | User-friendly local serving | HF model hub integration |
| Continuous batching | Yes | Yes | Yes | Limited | Yes |
| Tensor parallelism | Yes | Yes | Yes | No | Yes |
| Speculative decoding | Yes | Yes | Yes | Yes | Yes |
| Prefix caching | Yes (automatic) | Yes | Yes (RadixAttention) | No | Yes |
| LoRA serving | Yes (per-request) | Limited | Yes | Yes | Yes |
| Multi-modal | Yes | Yes | Yes | Yes | Yes |
| Hardware support | NVIDIA, AMD, Intel, CPU | NVIDIA only | NVIDIA, AMD | NVIDIA, AMD, Apple, CPU | NVIDIA, AMD |
| Quantization formats | GPTQ, AWQ, FP8, GGUF, more | FP8, INT4, INT8 | GPTQ, AWQ, FP8 | GGUF | GPTQ, AWQ, FP8 |
| OpenAI-compatible API | Yes | Yes | Yes | Yes | Yes |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | MIT | Apache 2.0 |
| Primary use case | Production GPU serving | Max throughput on NVIDIA GPUs | Complex LLM programs, agents | Local/desktop use | HF ecosystem integration |
| Status (2026) | Active development | Active development | Active development | Active development | Maintenance mode (Dec 2025) |
A few notes on this comparison. TensorRT-LLM typically achieves the highest raw throughput on NVIDIA hardware due to its deep integration with CUDA, but it only supports NVIDIA GPUs and has a more complex setup process. SGLang, which emerged from the LMSYS group, uses RadixAttention (a radix-tree-based approach to automatic KV cache reuse) and often outperforms vLLM in workloads involving multi-turn conversations and agent workflows where requests share dynamic context, with 10-20% performance improvements reported for such patterns. Ollama is designed primarily for ease of use on personal machines rather than production throughput. Hugging Face's Text Generation Inference (TGI) entered maintenance mode in December 2025, with the organization recommending vLLM or SGLang for new deployments.
vLLM has become the default serving solution for much of the open-source AI ecosystem, with production deployments at the companies noted earlier, including Meta, Mistral AI, Cohere, IBM, Roblox, Amazon, and LinkedIn.
The project has grown from a research prototype to a community with 15+ full-time contributors across 6+ organizations, 20+ active organizational stakeholders, and hundreds of individual contributors. A recent release (v0.15.1) included 440 commits from 203 contributors.
vLLM has an active collaboration with OpenAI, with several OpenAI team members (including Zhuohan Li, who co-authored the original PagedAttention paper) contributing to the project.
The most visible result of this partnership is vLLM's support for OpenAI's gpt-oss model, announced in August 2025. Through collaboration with OpenAI and NVIDIA, vLLM integrated specialized GPU kernels to run MXFP4 (microscaling 4-bit floating point) mixture-of-experts inference efficiently. For Hopper GPUs (H100, H200), vLLM uses the Triton matmul_ogs kernel, implemented by the OpenAI Triton team and optimized for Hopper architectures. vLLM also natively supports gpt-oss capabilities through integration with the OpenAI Responses API and the gpt-oss toolkit.
In January 2025, the vLLM team announced V1, a ground-up redesign of vLLM's core architecture. After 1.5 years of development on the original codebase, the team identified several architectural decisions that limited performance and maintainability. V1 revisited the scheduler, KV cache manager, worker, sampler, and API server components.
The design goals for V1 were a simple, modular, and easy-to-hack codebase; high performance with near-zero CPU overhead; a unified architecture that combines key optimizations rather than layering them on separately; and sensible defaults, with features and optimizations enabled without configuration.
V1 was initially released as an alpha behind the `VLLM_USE_V1=1` environment variable, with the existing API preserved so users could switch without code changes. Performance testing showed V1 consistently outperforming V0, especially under high query rates. The improvement was particularly notable for vision-language models, where V1 offloads input processing to a separate process and implements more flexible scheduling for multimodal queries.
At launch, V1 supported decoder-only transformers (such as Llama), mixture-of-experts models (such as Mixtral), and several VLMs (such as Qwen2-VL). All quantization methods were supported. Encoder-decoder architectures, Mamba-based models, and embedding models were not initially supported but were planned for future releases.
By mid-2025, V1 became the default engine in vLLM, with V0 retained as a fallback for architectures not yet ported to the new codebase.
vLLM's GitHub repository has accumulated over 66,000 stars, placing it among the most popular open-source AI infrastructure projects. The project hosts bi-monthly meetups that bring together contributors and users from industry and academia.
The contributor base spans a broad range of hardware vendors and cloud providers. AMD's ROCm became a first-class platform in the vLLM ecosystem, enabling vLLM to run on AMD Instinct GPUs. Intel contributed CPU and Gaudi accelerator support. Baidu contributed vLLM-Kunlun for Kunlun XPU support. Huawei contributed vLLM-Ascend for Ascend NPU support.
The project expanded beyond its core inference engine in 2025. vLLM Semantic Router, launched experimentally in September 2025 and reaching its first major release (Iris) in January 2026, addresses request routing in multi-model deployments. The project attracted over 600 pull requests and contributions from more than 50 engineers within its first few months.
In December 2025, reports indicated that the vLLM project was seeking $160 million in funding, reflecting its growing ambitions beyond the open-source project. The vllm.ai website launched in December 2025 as a central hub for the project's documentation, blog, and community resources.
vLLM can be installed via pip:

```bash
pip install vllm
```
A minimal serving example starts a server compatible with the OpenAI API:

```bash
vllm serve meta-llama/Llama-3-8B-Instruct
```
Clients can then send requests using the standard OpenAI client library by pointing it to the local server's URL. The server handles continuous batching, KV cache management, and all other optimizations automatically.
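As a sketch, this is the request such a client sends (built with the standard library and not actually transmitted, since it assumes a server running on port 8000; with the official `openai` Python client, only `base_url` and a placeholder `api_key` differ from calling the hosted API):

```python
import json
from urllib import request

# The JSON body of an OpenAI-format chat completion request, aimed at a
# locally running vLLM server's /v1/chat/completions endpoint.
body = json.dumps({
    "model": "meta-llama/Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "What is PagedAttention?"}],
    "max_tokens": 128,
}).encode()

req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(req.full_url)
# request.urlopen(req) would return an OpenAI-format JSON response; the call
# is omitted here because it requires a running server.
```

Nothing in the payload is vLLM-specific, which is what makes migration from a hosted provider a one-line configuration change.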
For multi-GPU serving with tensor parallelism:

```bash
vllm serve meta-llama/Llama-3-70B-Instruct --tensor-parallel-size 4
```
Quantized models, speculative decoding, LoRA adapters, and other features are configured through additional command-line arguments or the Python API.
vLLM continues to evolve rapidly. The V1 architecture is now the default, with ongoing work to expand model coverage. Hardware support continues to broaden, with AMD, Intel, and custom accelerator support maturing alongside NVIDIA GPU support.
The competitive landscape has intensified, with SGLang emerging as a strong alternative that sometimes outperforms vLLM on specific workloads (particularly agent and multi-turn conversation patterns). TensorRT-LLM remains the throughput leader on NVIDIA hardware for teams willing to accept its NVIDIA-only constraint. The vLLM team has responded by incorporating ideas from competing projects and continuing to improve performance.
The Q4 2025 roadmap included further V1 optimizations, expanded model support, improved disaggregated inference (separating prefill and decode onto different hardware), and deeper integration with the Kubernetes ecosystem through the vLLM Production Stack project.
For most organizations deploying open-source LLMs in production, vLLM remains the most common starting point, combining competitive performance with broad hardware support, extensive model compatibility, and an active community.