SGLang is a high-performance serving framework for large language models and multimodal models, developed at UC Berkeley's Sky Computing Lab. The project was introduced in a December 2023 paper by Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, and collaborators from Stanford University, UC Berkeley, Shanghai Jiao Tong University, and Texas A&M University [1]. The SGLang paper was accepted at NeurIPS 2024, and the system has since become one of the leading open-source LLM inference engines, competing directly with vLLM and TensorRT-LLM.
SGLang's distinguishing contributions include RadixAttention, a novel prefix caching mechanism based on a radix tree data structure, and a compressed finite state machine for efficient constrained decoding. The framework supports continuous batching, tensor parallelism, speculative decoding, structured output generation, and multi-modal inference, among other features. As of early 2026, SGLang is used in production by multiple companies and has joined the PyTorch ecosystem as an official project [2].
Serving large language models at scale presents a unique set of engineering challenges. Unlike traditional deep learning inference, where a single forward pass produces a complete output, LLM inference is autoregressive: the model generates tokens one at a time, with each token depending on all previously generated tokens. This makes LLM serving inherently sequential at the token level, even as the system must handle many concurrent requests.
Several techniques have emerged to address these challenges: KV cache management (storing and reusing key-value pairs from the attention mechanism to avoid redundant computation), continuous batching (dynamically adding and removing requests from the batch as they arrive and complete), and quantization (reducing model precision to decrease memory usage and increase throughput).
SGLang's creators observed that existing serving systems did not fully exploit opportunities for KV cache reuse, particularly in workloads that involve shared prefixes (such as few-shot prompting, multi-turn conversations, or agentic workflows where multiple generation calls share a common system prompt). They also noted that structured output generation (constraining the model to produce valid JSON, for example) was handled inefficiently by existing systems. SGLang was designed from the ground up to address both of these problems.
RadixAttention is SGLang's most distinctive technical contribution. It provides automatic and efficient KV cache reuse across multiple LLM generation calls using a radix tree (also known as a Patricia trie) data structure [3].
In a standard LLM serving system, when a request completes, its KV cache (the stored attention key-value pairs for the prompt and generated tokens) is typically discarded. If a subsequent request shares the same prefix (for example, the same system prompt or the same few-shot examples), the system must recompute the KV cache for that shared prefix from scratch. In workloads with significant prefix sharing, this represents a large amount of redundant computation.
Some systems (including vLLM) support prefix caching, but they typically require manual configuration and handle only simple cases like exact prefix matches.
RadixAttention retains KV cache data for both prompts and generation results in a radix tree. A radix tree is a compressed trie where each edge represents a sequence of tokens. The tree efficiently stores all cached token sequences, with shared prefixes stored only once.
When a new request arrives, SGLang performs a prefix search in the radix tree to find the longest matching prefix. The KV cache for the matching portion is reused, and only the new (unmatched) tokens need to be processed. After the request completes, the newly computed KV cache is inserted into the radix tree for potential reuse by future requests.
The system uses an LRU (Least Recently Used) eviction policy to manage GPU memory: when the cache is full and space is needed for new entries, the least recently used cache entries are evicted. SGLang also includes a cache-aware scheduling policy that prioritizes requests with longer cache matches, further increasing the cache hit rate [4].
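The core radix-tree operations can be sketched in plain Python. This is a toy model, not SGLang's implementation: `RadixNode` and `RadixCache` are invented names, nodes here store raw token lists where the real system stores handles to GPU KV-cache blocks, and the `last_access` timestamps merely mark where an LRU eviction pass would look first.

```python
import time

class RadixNode:
    """One node of the toy radix tree; each edge stores a run of token ids."""
    def __init__(self):
        self.children = {}       # first token of an edge -> child node
        self.tokens = []         # the edge label leading into this node
        self.last_access = 0.0   # timestamp; LRU eviction would drop oldest

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens are already cached."""
        node, matched = self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            edge, common = child.tokens, 0
            while (common < len(edge) and matched + common < len(tokens)
                   and edge[common] == tokens[matched + common]):
                common += 1
            matched += common
            child.last_access = time.monotonic()
            if common < len(edge):   # request diverged mid-edge
                break
            node = child
        return matched

    def insert(self, tokens):
        """Insert a sequence, splitting an edge where paths diverge."""
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:        # no shared path: hang a new leaf here
                leaf = RadixNode()
                leaf.tokens = list(tokens[i:])
                node.children[tokens[i]] = leaf
                return
            edge, common = child.tokens, 0
            while (common < len(edge) and i + common < len(tokens)
                   and edge[common] == tokens[i + common]):
                common += 1
            if common < len(edge):   # split the edge at the divergence
                mid = RadixNode()
                mid.tokens = edge[:common]
                mid.children[edge[common]] = child
                child.tokens = edge[common:]
                node.children[tokens[i]] = mid
                child = mid
            i += common
            node = child
```

Note how `insert` stores a shared prefix exactly once: inserting `[1,2,3,4,5]` and then `[1,2,3,7,8]` splits the first edge at the divergence point, so a later `match_prefix([1,2,3,...])` reuses the common three tokens regardless of which branch follows.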
RadixAttention provides significant performance benefits, particularly for workloads with prefix sharing:
| Workload type | Cache hit rate with RadixAttention | Cache hit rate without | Speedup |
|---|---|---|---|
| Few-shot learning | 85-95% | 15-25% | Up to 5x |
| Multi-turn chat | 70-90% | 0-30% | 2-4x |
| Agentic workflows (shared system prompt) | 80-95% | 10-20% | 3-5x |
| Single-turn, unique prompts | ~0% | ~0% | ~1x (no benefit) |
The advantage is most pronounced in scenarios where multiple requests share significant prefix content, which is common in production LLM deployments. For single-turn requests with unique prompts, RadixAttention adds minimal overhead but provides no caching benefit.
SGLang's second major technical contribution is its approach to constrained decoding, also called structured output generation or guided decoding. This feature ensures that the model's output conforms to a specified format, such as valid JSON, a regular expression pattern, or a context-free grammar.
Many LLM applications require outputs in a specific structured format. An API endpoint might need the model to return valid JSON matching a particular schema. A code generation tool might need syntactically valid code. A data extraction pipeline might need outputs that match a predefined regex pattern. Without constrained decoding, applications must parse and validate LLM outputs, retrying on failure, which wastes compute and adds latency.
SGLang supports three main approaches to constrained decoding:
Regex constraints: The system converts a regular expression into a finite state machine (FSM). During decoding, SGLang tracks the current FSM state and sets the logit (the pre-softmax score) of any token that would produce an invalid transition to negative infinity, preventing the model from generating tokens that would violate the pattern. The gen primitive accepts a regex argument for this purpose [5].
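As a toy illustration of the masking step (not SGLang's actual code), consider a three-token vocabulary and a hand-built FSM for the pattern `ab*c`; the `FSM`, `mask_logits`, and `advance` names are invented for this sketch.

```python
import math

# Toy 3-token vocabulary (0="a", 1="b", 2="c") and a hand-built FSM for
# the regex "ab*c": state -> {legal token: next state}.
FSM = {
    0: {0: 1},          # start: only "a" is legal
    1: {1: 1, 2: 2},    # after "a": "b" loops, "c" accepts
    2: {},              # accepted: nothing further is legal
}

def mask_logits(logits, state):
    """Send the logit of every illegal token to -inf before sampling."""
    legal = FSM[state]
    return [x if tok in legal else -math.inf
            for tok, x in enumerate(logits)]

def advance(state, token):
    """Follow the FSM transition for the token that was sampled."""
    return FSM[state][token]
```

Because `-inf` logits become zero probability after softmax, the model can only ever sample a token with a legal transition, so every completed generation matches the pattern by construction.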
JSON schema constraints: Given a JSON schema, SGLang generates a corresponding regex or grammar that matches valid JSON conforming to that schema. This enables applications to specify structured output requirements in terms of familiar JSON schemas rather than low-level regex patterns.
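A drastically simplified sketch of the schema-to-regex idea, assuming only flat objects with string and integer fields. `schema_to_regex` is a hypothetical helper, not SGLang's API; production converters handle nesting, optional fields, escaping, and whitespace.

```python
import re

def schema_to_regex(schema):
    """Toy converter: a flat object schema with string and integer fields
    becomes a regex matching one canonical JSON rendering (fixed key
    order, no whitespace)."""
    field = {"string": r'"[^"]*"', "integer": r"-?\d+"}
    parts = [re.escape(f'"{name}":') + field[spec["type"]]
             for name, spec in schema["properties"].items()]
    return r"\{" + ",".join(parts) + r"\}"
```

Feeding the resulting pattern to a regex-constrained decoder then guarantees schema-conforming output: for a `{"name": string, "age": integer}` schema, `{"name":"Ada","age":36}` matches while `{"name":"Ada","age":"36"}` does not.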
Context-free grammar (CFG) constraints: For output formats too complex for regular expressions (such as programming languages with nested structures), SGLang supports context-free grammars. The system maintains a parse state during decoding and restricts token generation to only those tokens that produce valid partial parses.
SGLang's key optimization for constrained decoding is the compressed finite state machine (FSM). Instead of checking token validity one state transition at a time, the compressed FSM pre-computes which tokens are valid at each state and stores this information in a compact lookup table. This reduces the per-token overhead of constrained decoding to near zero, enabling structured output generation at speeds comparable to unconstrained generation [6].
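The effect can be sketched with a toy decode loop: per-state legal-token sets are plain dictionary lookups, and states with exactly one legal continuation are emitted without consulting the model at all, which is how deterministic stretches of structured output (braces, fixed keys) cost almost nothing. All names below are invented for illustration.

```python
def constrained_decode(fsm, start, sample):
    """Decode under the FSM, skipping the model on forced transitions."""
    state, out = start, []
    while fsm[state]:                 # empty transition map = accepting end
        trans = fsm[state]
        if len(trans) == 1:
            # Forced path: exactly one legal token, so no forward pass
            # is needed -- it is emitted directly from the lookup table.
            token = next(iter(trans))
        else:
            # Free position: the model samples among legal tokens only.
            token = sample(set(trans))
        out.append(token)
        state = trans[token]
    return out

# A JSON-skeleton FSM over a 5-token toy vocabulary:
# 0 = '{'   1 = '"k":'   2 = '1'   3 = '2'   4 = '}'
JSON_FSM = {
    0: {0: 1},          # forced '{'
    1: {1: 2},          # forced '"k":'
    2: {2: 3, 3: 3},    # the model chooses the value: '1' or '2'
    3: {4: 4},          # forced '}'
    4: {},              # done
}
```

In this example only one of the four generated tokens (the value) requires a model call; the structural tokens come straight from the table.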
Benchmarks show that SGLang's constrained decoding can generate valid JSON up to an order of magnitude faster than naive approaches while guaranteeing well-formed output.
SGLang also provides specialized support for reasoning models that use special tokens to denote reasoning sections (like chain-of-thought blocks). The framework can disable grammar restrictions within reasoning sections, allowing the model to reason freely before producing a structured output. This is important for models that perform complex multi-step reasoning before arriving at a final answer [7].
Beyond RadixAttention and constrained decoding, SGLang includes a comprehensive set of features for high-performance LLM serving.
SGLang uses continuous batching (also called iteration-level batching) to maximize GPU utilization. Unlike static batching, which waits for all requests in a batch to complete before starting new ones, continuous batching adds new requests to the running batch at every decoding iteration. As soon as one request finishes generating, its slot is immediately filled by a waiting request. This eliminates idle GPU time and significantly improves throughput under concurrent load.
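A toy scheduler loop illustrates the idea (invented names; a real engine steps a GPU model and manages KV-cache memory rather than decrementing counters).

```python
from collections import deque

def continuous_batching(requests, max_batch, step_fn):
    """Iteration-level batching: refill free slots before every step."""
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Admit waiting requests into any free batch slots.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode iteration: every running request emits one token;
        # step_fn returns the requests that just finished.
        for req in step_fn(running):
            running.remove(req)        # the slot frees up immediately
            finished.append(req)
    return finished

def one_token_step(batch):
    """Toy model step: each request needs `remaining` more tokens."""
    done = []
    for r in batch:
        r["remaining"] -= 1
        if r["remaining"] == 0:
            done.append(r)
    return done
```

The key contrast with static batching is in the outer loop: a short request leaving the batch frees its slot for a waiting request on the very next iteration, instead of idling until the longest request in the batch completes.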
For models too large to fit on a single GPU, SGLang supports several parallelism strategies: tensor parallelism (splitting individual weight matrices across GPUs), pipeline parallelism (placing different layers on different GPUs), data parallelism (replicating the model to serve more concurrent requests), and expert parallelism (distributing the experts of mixture-of-experts models across devices).
SGLang supports speculative decoding, a technique that uses a smaller, faster "draft" model to propose multiple tokens at once, which are then verified in parallel by the larger "target" model. When the draft model's predictions are correct (which happens frequently), this produces multiple tokens per forward pass of the large model, reducing latency.
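One draft-and-verify step can be sketched with toy stand-in models (greedy functions rather than neural networks; all names are invented, and real systems accept or reject probabilistically rather than by exact greedy match).

```python
def speculative_step(draft, target, prefix, k):
    """One speculative-decoding step: draft k tokens, verify against the
    target, keep the longest agreed run plus one corrective target token."""
    # Draft phase: k cheap autoregressive calls to the small model.
    ctx, proposed = list(prefix), []
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # Verify phase: in a real system this is ONE batched forward pass of
    # the target model scoring all k positions; here we loop for clarity.
    ctx, accepted = list(prefix), []
    for tok in proposed:
        want = target(ctx)
        if tok == want:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(want)   # target's own token replaces the miss
            break
    return accepted

# Toy greedy "models": the target continues an increasing sequence;
# the draft agrees with it until the value 9, then diverges.
def target_model(ctx):
    return ctx[-1] + 1

def draft_model(ctx):
    return ctx[-1] + 1 if ctx[-1] < 9 else 0
```

When the draft agrees on the first few positions, the target's single verification pass yields several tokens at once, which is where the latency win comes from.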
In 2025, the SGLang team introduced SpecForge, a training framework for Eagle3-based speculative decoding that is tightly integrated with the SGLang inference engine. SpecForge enables seamless transition from training a draft model to deploying it for speculative decoding [8].
SGLang supports separating the prefill phase (processing the input prompt) from the decode phase (generating output tokens) onto different GPU resources. This is beneficial because prefill is compute-bound (lots of matrix multiplications on a large input) while decode is memory-bound (small sequential operations that are limited by memory bandwidth). Disaggregating these phases allows each to be optimized independently.
SGLang supports vision-language models and other multi-modal architectures, enabling serving of models that process both text and images (or other modalities). This includes support for models like LLaVA, Qwen-VL, and other multi-modal LLMs.
SGLang supports multiple quantization formats to reduce memory usage and increase throughput:
| Format | Precision | Typical use |
|---|---|---|
| FP8 | 8-bit floating point | Moderate compression with minimal quality loss |
| FP4 | 4-bit floating point | Aggressive compression for Blackwell GPUs |
| INT4 (GPTQ) | 4-bit integer | Weight-only quantization |
| INT4 (AWQ) | 4-bit integer | Activation-aware weight quantization |
SGLang can serve multiple LoRA adapters simultaneously, batching requests across different LoRA variants of the same base model. This is valuable for applications that use fine-tuned model variants for different tasks or users.
SGLang uses an efficient CPU-based scheduler that introduces minimal overhead when managing the request queue, batch formation, and KV cache allocation. The scheduler is designed so that scheduling decisions do not become a bottleneck even at high request rates.
SGLang competes primarily with two other major LLM serving frameworks: vLLM (an open-source project from UC Berkeley) and TensorRT-LLM (NVIDIA's inference engine, built specifically for NVIDIA GPUs). Each framework has distinct strengths.
| Feature | SGLang | vLLM | TensorRT-LLM |
|---|---|---|---|
| Developer | UC Berkeley (open-source) | UC Berkeley (open-source) | NVIDIA (source-available) |
| KV cache management | RadixAttention (radix tree) | PagedAttention (paged memory) | Custom NVIDIA implementation |
| Prefix caching | Automatic, built-in | Available, requires configuration | Available |
| Constrained decoding | Built-in (compressed FSM) | Supported via Outlines integration | Supported via Logits Processor |
| Tensor parallelism | Yes | Yes | Yes |
| Speculative decoding | Yes (Eagle3/SpecForge) | Yes | Yes |
| Hardware support | NVIDIA, AMD, TPU (via SGLang-Jax) | NVIDIA, AMD, TPU, CPU | NVIDIA only |
| Multi-modal support | Yes | Yes | Yes |
| Quantization | FP4/FP8/INT4/AWQ/GPTQ | FP8/INT4/AWQ/GPTQ/GGUF | FP8/INT4/AWQ/GPTQ |
| Ease of setup | Moderate | Easy (pip install) | Complex (NVIDIA-specific build) |
| OpenAI-compatible API | Yes | Yes | Yes |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
Performance comparisons between these frameworks depend heavily on the specific model, hardware, workload pattern, and configuration. Based on benchmarks from multiple independent sources as of 2025 [9][10]:
Throughput: At high concurrency (100+ concurrent requests), vLLM and SGLang tend to achieve the highest throughput, with vLLM reaching up to 4,741 tokens per second on Llama 3.1 70B benchmarks. SGLang shows strong performance at moderate concurrency (50 requests). TensorRT-LLM demonstrates the best single-request throughput but scales less well at extreme concurrency levels.
Latency: vLLM is consistently the fastest to first token (TTFT) across concurrency levels, with excellent scaling characteristics. SGLang has the most stable per-token latency, staying within roughly 4-21 ms across different loads. TensorRT-LLM shows the slowest time to first token but maintains competitive per-token performance at lower concurrency.
Multi-turn and prefix-sharing workloads: SGLang's RadixAttention provides approximately 10-20% better performance on multi-turn workloads with shared context, because automatic prefix caching eliminates redundant computation that vLLM avoids only with manual configuration.
NVIDIA-optimized hardware: When deployed on NVIDIA's latest hardware (B200 GPUs), TensorRT-LLM consistently outperforms both SGLang and vLLM across all metrics, thanks to its deeper optimization for NVIDIA's hardware architecture.
| Use case | Recommended framework | Rationale |
|---|---|---|
| Interactive chat with high concurrency | vLLM | Best TTFT and throughput scaling |
| Multi-turn/agentic workloads with shared prefixes | SGLang | RadixAttention excels at prefix reuse |
| Structured output (JSON, regex) at scale | SGLang | Compressed FSM provides fastest constrained decoding |
| Maximum single-request performance on NVIDIA hardware | TensorRT-LLM | Deepest NVIDIA-specific optimizations |
| AMD GPU deployment | SGLang or vLLM | Both support ROCm; TensorRT-LLM is NVIDIA-only |
| TPU deployment | SGLang (via SGLang-Jax) | Native TPU support via JAX backend |
| Quick setup and broad model support | vLLM | Simplest installation and largest model compatibility list |
SGLang's architecture consists of two main components: a frontend domain-specific language (DSL) embedded in Python, and a high-performance serving backend.
The SGLang frontend provides a Python-embedded DSL for expressing complex LLM programs. Rather than treating each LLM call as an independent request, SGLang programs can express multi-step generation workflows, branching logic, and constrained generation within a single program. The DSL includes primitives like gen (generate text), select (choose from a set of options), and fork (create parallel generation branches).
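A minimal program in this style, following the shapes shown in the SGLang paper (exact signatures vary across versions, and running it requires a live backend, e.g. via `sgl.set_default_backend`):

```python
import sglang as sgl

@sgl.function
def tool_use(s, question):
    s += "Question: " + question + "\n"
    # Constrained choice: the model must pick one of the listed options.
    s += "Tool: " + sgl.gen("tool", choices=["calculator", "search engine"]) + "\n"
    # Free-form generation with a token budget (a regex= argument could
    # constrain the output format here as well).
    s += "Answer: " + sgl.gen("answer", max_tokens=64)

# Typical usage (assumes a running SGLang server on localhost:30000):
# sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
# state = tool_use.run(question="What is the capital of France?")
# print(state["tool"], state["answer"])
```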
This programming model is particularly useful for agentic applications, where an LLM program might involve multiple generation steps with structured intermediate outputs.
The backend handles the actual model execution, including continuous batching, RadixAttention-based KV cache management, constrained decoding, tensor parallelism, and quantized inference.
SGLang has seen rapid adoption since its introduction.
SGLang is used in production by multiple organizations. It powers part of the LMSYS Chatbot Arena infrastructure, one of the largest public LLM evaluation platforms. The framework provides day-one support for major new model releases, including DeepSeek V3 and R1 models on both NVIDIA and AMD GPUs with model-specific optimizations [11].
In 2025, SGLang joined the PyTorch ecosystem as an official project, reflecting its maturity and community adoption. This integration means SGLang benefits from PyTorch's governance, community, and testing infrastructure [2].
In October 2025, SGLang-Jax was released, enabling SGLang to run natively on Google TPUs. Built on JAX and XLA, SGLang-Jax delivers fast native TPU inference while maintaining support for advanced features like continuous batching, prefix caching, tensor and expert parallelism, speculative decoding, kernel fusion, and highly optimized TPU kernels [12].
SGLang maintains rapid support for newly released models, adding day-one or near-day-one support for a steady stream of releases through late 2025 and early 2026.
In January 2026, SGLang Diffusion was released to accelerate video and image generation workloads, extending the framework beyond text generation [13].
As of early 2026, SGLang is one of the three leading LLM serving frameworks alongside vLLM and TensorRT-LLM. The project continues to develop rapidly, with active contributions from both the academic team at UC Berkeley and a growing open-source community.
The LLM serving landscape has matured considerably since SGLang's introduction. Techniques that were once research contributions, such as continuous batching, prefix caching, and speculative decoding, are now standard features across all major serving frameworks. Competition has shifted toward implementation quality, breadth of hardware and model support, and workload-specific optimization.
SGLang's core strengths remain its RadixAttention prefix caching (which provides automatic, zero-configuration KV cache reuse) and its compressed FSM constrained decoding (which is among the fastest implementations available). For workloads that involve significant prefix sharing or structured output requirements, SGLang offers clear performance advantages over alternatives. For maximum raw throughput on NVIDIA hardware without these specific requirements, vLLM and TensorRT-LLM remain strong competitors.
The broader trend in LLM serving is toward convergence: all major frameworks now support the same core features, and the differences are increasingly about implementation quality, hardware support breadth, and specific optimization choices rather than fundamental architectural differences. SGLang's academic roots and continued research output (such as SpecForge for speculative decoding training) position it well to continue contributing novel techniques that push the state of the art in LLM serving.