LMDeploy
Last reviewed
May 16, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,499 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 3,499 words
Add missing citations, update stale details, or suggest a clearer explanation.
LMDeploy is an open-source toolkit for compressing, deploying, and serving large language models, developed primarily by the InternLM team at the Shanghai AI Laboratory. The project is structured around two interchangeable inference engines: TurboMind, a C++ runtime built on a fork of NVIDIA's FasterTransformer that emphasizes raw throughput and quantized inference, and a PyTorch engine, a pure Python implementation that prioritizes broad model coverage and ease of contribution. LMDeploy is distributed under the Apache 2.0 license and is hosted at github.com/InternLM/lmdeploy.
LMDeploy is one of the serving frameworks that dominate open-source LLM inference in 2026, alongside vLLM, SGLang, and NVIDIA's TensorRT-LLM. It is distinguished by its tight integration with the InternLM and InternVL model families, by its early and aggressive investment in 4-bit weight quantization via the AWQ algorithm, and by the TurboMind C++ backend, which removes the Python interpreter from the inference hot path on NVIDIA hardware. Benchmarks commonly place TurboMind among the fastest open-source runtimes on NVIDIA Hopper-class GPUs, with throughput numbers in 2026 cited near 16,200 tokens per second on a single H100 for popular dense models in the 7 to 13 billion parameter range, compared with roughly 12,500 tokens per second for a comparable vLLM configuration.
Beyond the InternLM ecosystem, LMDeploy supports more than fifty LLM architectures, including Llama 2 and 3, Qwen through Qwen3, Mistral, Mixtral and other mixture of experts variants, DeepSeek-V3 and its derivatives, Gemma, Baichuan, ChatGLM, and Yi. It also supports roughly forty vision-language models, including the LLaVA and InternVL families, Qwen-VL, DeepSeek-VL, MiniCPM-V, CogVLM, Phi-3 Vision, and Llama 3.2 Vision.
LMDeploy was created by engineers inside the Shanghai AI Laboratory who had previously worked on the MMRazor model compression toolkit and the MMDeploy model deployment toolkit, both part of the OpenMMLab computer vision ecosystem. When Shanghai AI Lab launched the InternLM language model series in June 2023, the lab needed a serving stack that could run InternLM efficiently on Chinese cloud infrastructure. The group that had built MMDeploy for vision models was tasked with the LLM equivalent.
The project's first public release on GitHub appeared in mid 2023, only a few months after vLLM's June 2023 announcement and well before SGLang's January 2024 launch. The initial codebase exposed a single inference engine, TurboMind, which started as a fork of NVIDIA's open-source FasterTransformer library. FasterTransformer had been the de facto high-performance C++ implementation of transformer inference for several years, and the team chose to adapt it rather than write a new engine from scratch, since the kernels were already well tuned for NVIDIA hardware and a C++ backend would avoid Python overhead.
LMDeploy was open sourced under the Apache 2.0 license and developed in lockstep with the rest of the InternLM toolchain. The InternLM organization on GitHub hosts a coordinated set of projects including the InternLM and InternLM2/2.5/3 base models, the InternVL vision-language model family, the XTuner fine-tuning library, the OpenCompass evaluation harness, the MindSearch agent system, and LMDeploy as the serving and quantization layer. New InternLM and InternVL models almost always ship with day-zero LMDeploy support.
Shanghai AI Laboratory is a government-affiliated research institute headquartered in Shanghai, founded in 2020 with backing from the Shanghai municipal government and academic partners including Shanghai Jiao Tong University and Fudan University. It has become one of the most prolific producers of open-source foundation models in China.
The TurboMind engine is what most users encounter first when they install LMDeploy, because it is the default backend whenever the loaded model architecture is supported. TurboMind is implemented almost entirely in C++ and CUDA, with a thin Python wrapper that exposes a request scheduler and an OpenAI-compatible HTTP API. The Python layer's only job is to accept requests, marshal them into C++ data structures, and stream tokens back; all heavy work happens in the C++ runtime.
This design choice is the engine's main architectural distinction. vLLM and SGLang both run their schedulers, batchers, and tokenizers in Python, calling out to C++ or CUDA only for the underlying kernels. For long-running, high-concurrency workloads this Python orchestration adds measurable overhead, particularly on the fastest GPUs where each kernel launch is short relative to the time it takes the Python interpreter to prepare the next batch. TurboMind avoids this overhead entirely by keeping the scheduler and the batcher in C++, which is a major reason the engine often outpaces Python-orchestrated runtimes on H100 and H200 hardware.
TurboMind inherits several optimizations from its FasterTransformer ancestry, including fused multi-head attention kernels, fused MLP and LayerNorm operations, NCCL-based tensor parallelism, and an efficient implementation of grouped-query attention. The LMDeploy team has added persistent batching (their term for continuous batching), a blocked KV cache modeled on the paging idea popularized by PagedAttention, dynamic split and fuse scheduling that breaks large prefill requests into smaller chunks for better interleaving with decode requests, automatic prefix caching, FlashAttention and split-K decoding kernels, and a custom path for AWQ-quantized 4-bit weights that uses dedicated CUTLASS-based GEMMs.
For mixture-of-experts models, TurboMind integrates with DeepGemm, the FP8 GEMM library released by DeepSeek, and with FlashMLA, the multi-head latent attention kernel DeepSeek developed for the V2 and V3 architectures. These additions allow TurboMind to serve the 685B-parameter DeepSeek-V3 model efficiently on multi-node H100 deployments. A practical limitation is that adding support for a new model architecture requires writing C++ code, which is slower than adding a Python class in vLLM or SGLang. This is the reason LMDeploy maintains a second engine.
The PyTorch engine, sometimes called pytorch_engine in the code, was added in 2024 to address the model coverage gap. It is a pure Python and PyTorch implementation of the LMDeploy inference stack, with no dependency on the TurboMind C++ runtime. New architectures can be added simply by writing a Python module that defines the model's layers using standard PyTorch primitives, with the engine's scheduler, batcher, and KV cache manager handling the rest.
The PyTorch engine is slower than TurboMind for the model families that both engines support, sometimes by 20 to 40 percent depending on hardware and workload, but it serves a critical purpose: it lets users run models not yet ported to TurboMind, and it gives community contributors a low-barrier path to adding new architectures. Many models are first added to the PyTorch engine and then ported to TurboMind once their popularity justifies the engineering effort.
The PyTorch engine supports the same OpenAI-compatible API, quantization workflow, persistent batching scheduler, and most of the same kernel-level optimizations (FlashAttention, prefix caching, paged KV cache) as TurboMind. It also adds support for hardware backends that TurboMind does not target directly, including AMD ROCm GPUs and Huawei Ascend NPUs, the latter via a graph compilation mode that was reported in late 2025 to roughly double throughput on Ascend hardware.
The table below summarizes the features that LMDeploy advertises on its README and in its documentation.
| Feature | Description |
|---|---|
| TurboMind engine | C++ inference runtime forked from FasterTransformer; default for supported models |
| PyTorch engine | Pure Python and PyTorch runtime for broader model coverage |
| Persistent batching | Continuous batching of incoming requests at every decode step |
| Blocked KV cache | Paged key-value cache modeled on PagedAttention |
| Dynamic split and fuse | Chunked prefill that interleaves prefill and decode work |
| Automatic prefix caching | Reuse of KV cache across requests that share a prompt prefix |
| Tensor parallelism | Multi-GPU sharding via NCCL |
| Pipeline parallelism | Cross-node sharding for very large models |
| OpenAI-compatible API | Drop-in replacement for OpenAI chat and completions endpoints |
| Quantization toolkit | AWQ, SmoothQuant W8A8, KV cache INT4/INT8, MXFP4 |
| Vision-language support | Native serving for forty-plus VLM architectures |
| Multi-model serving | Ability to host multiple models on the same server |
| Speculative decoding | Draft model and self-speculative paths (added 2025) |
| Distributed deployment | Multi-node, multi-card service mode with built-in router |
| Hardware support | NVIDIA CUDA (primary), AMD ROCm, Huawei Ascend NPU |
Quantization is one of LMDeploy's strongest areas, and the project was an early adopter of 4-bit weight-only quantization for production inference. The toolkit exposes several quantization workflows through its lmdeploy lite command-line interface, and the TurboMind engine includes specialized fused kernels that read the quantized weights directly without dequantizing them into higher precision first.
| Method | Precision | Description |
|---|---|---|
| AWQ | W4A16 | Activation-aware weight quantization, ported from the MIT AWQ paper; 4-bit weights, 16-bit activations |
| GPTQ | W4A16 | Alternative 4-bit weight quantization scheme; TurboMind can also infer GPTQ-quantized checkpoints |
| SmoothQuant | W8A8 | 8-bit weights and 8-bit activations using the SmoothQuant smoothing transform |
| KV INT8 | KV cache | Online per-head, per-token 8-bit quantization of the key-value cache |
| KV INT4 | KV cache | Online 4-bit asymmetric quantization of the key-value cache |
| MXFP4 | W4 | Microscaling FP4 format for select NVIDIA GPUs that support it |
| FP8 | W8A8 | FP8 weights and activations for Hopper and Blackwell hardware |
The AWQ pipeline is the most commonly used. A user runs lmdeploy lite auto_awq with a model path and a calibration dataset; the tool produces a quantized checkpoint that TurboMind can then load directly. The resulting 4-bit model uses roughly one quarter of the GPU memory of an FP16 baseline, and LMDeploy documents a 2.4x throughput improvement on AWQ-quantized weights compared with FP16 inference on the same hardware.
The SmoothQuant workflow targets INT8 deployments where activations also need quantization to use integer tensor cores. The KV cache quantization is applied online rather than offline: the engine quantizes the keys and values into INT8 or INT4 as it writes them into the cache, using a per-head and per-token asymmetric scheme that LMDeploy reports preserves accuracy more reliably than per-tensor schemes. Later additions include MXFP4 support for Blackwell hardware, reported to give approximately 1.5x the throughput of vLLM on H800 GPUs for compatible models, and integration with the LLM Compressor project from the vLLM ecosystem.
LMDeploy's model coverage spans dense decoder-only LLMs, vision-language models, and sparse mixture-of-experts models. TurboMind and the PyTorch engine have overlapping but not identical coverage; the PyTorch engine supports more architectures, while TurboMind generally provides higher throughput for the models it does support.
| Model family | Sizes supported | Engine |
|---|---|---|
| InternLM, InternLM2, InternLM2.5, InternLM3 | 1.8B to 20B | TurboMind, PyTorch |
| InternVL, InternVL2, InternVL3 | 1B to 78B | TurboMind, PyTorch |
| Llama 2 | 7B to 70B | TurboMind, PyTorch |
| Llama 3, Llama 3.1, Llama 3.2, Llama 3.3 | 1B to 405B | TurboMind, PyTorch |
| Llama 4 (Scout, Maverick) | MoE | PyTorch |
| Qwen, Qwen1.5, Qwen2, Qwen2.5, Qwen3 | 0.5B to 110B | TurboMind, PyTorch |
| Qwen-VL, Qwen2-VL | 2B to 72B | TurboMind, PyTorch |
| Mistral 7B, Mistral Nemo | 7B to 12B | TurboMind, PyTorch |
| Mixtral | 8x7B, 8x22B | TurboMind, PyTorch |
| DeepSeek, DeepSeek-V2, DeepSeek-V3 | up to 685B (MoE) | TurboMind, PyTorch |
| DeepSeek-VL, DeepSeek-VL2 | up to 27B | PyTorch |
| Gemma, Gemma2, Gemma3 | 2B to 27B | TurboMind, PyTorch |
| Baichuan, Baichuan2 | 7B, 13B | TurboMind |
| ChatGLM3, GLM-4, GLM-4.5 | 6B to 9B | TurboMind, PyTorch |
| Yi, Yi-1.5 | 6B to 34B | TurboMind |
| LLaVA, LLaVA-Next | 7B to 34B | TurboMind, PyTorch |
| MiniCPM, MiniCPM-V | 1B to 8B | PyTorch |
| Phi-3, Phi-3 Vision, Phi-4 | 3.8B to 14B | PyTorch |
| CogVLM, CogVLM2 | 17B to 19B | PyTorch |
| InternLM-XComposer, InternLM-XComposer2 | 7B | TurboMind |
LMDeploy generally adds support for flagship Chinese open-source models on or near their release day, including new Qwen and DeepSeek releases, while support for new Western models lands in the following weeks. The project has averaged roughly one new supported architecture per week through 2025 and into 2026.
LMDeploy publishes its own benchmark numbers on the project README, and several third-party benchmarks have appeared in the past year. The headline number LMDeploy itself cites is that TurboMind delivers up to 1.8x the request throughput of vLLM under high-concurrency workloads, with a 2.4x improvement when using AWQ 4-bit quantization compared with FP16 inference.
Independent comparisons published in late 2025 and early 2026 paint a consistent picture, although the exact numbers depend heavily on the workload and the GPU. The table below summarizes throughput figures commonly cited for Llama 3.1 8B on a single H100, decode-heavy chat workload, measured in output tokens per second across many concurrent requests.
| Engine | Throughput on H100 (tokens/s) | Notes |
|---|---|---|
| LMDeploy TurboMind | approximately 16,200 | Default settings; C++ backend |
| SGLang | approximately 16,200 | With RadixAttention prefix caching |
| TensorRT-LLM | approximately 15,500 to 16,000 | Requires per-model engine build |
| vLLM with FlashInfer | approximately 12,500 | Python-orchestrated; widely deployed |
| Hugging Face TGI | approximately 8,000 | Lower-effort production stack |
The close tie between LMDeploy TurboMind and SGLang at the top of the table reflects the fact that both engines reduce Python-side overhead aggressively, although by different routes: LMDeploy moves the scheduler into C++, while SGLang keeps a Python scheduler but spends substantial engineering effort on zero-overhead batch scheduling. vLLM trails the leaders by roughly 25 to 30 percent on this workload, a gap that vLLM contributors have been actively closing through FlashInfer, CUDA graphs, and partial C++ rewrites of the scheduler.
For lower-concurrency workloads with longer prompts and significant prefix sharing, SGLang's RadixAttention often pulls ahead. For inference of AWQ-quantized 4-bit models, LMDeploy TurboMind typically leads by a wider margin, because its 4-bit GEMM kernels are among the most heavily optimized in the open-source ecosystem. LMDeploy also reports approximately 2x the throughput of the eager-mode PyTorch baseline on Huawei Ascend NPUs running in graph mode, and parity with NVIDIA A100 on AMD MI300 GPUs for several mid-size dense models.
The four major open-source serving frameworks for LLMs in 2026 are vLLM, SGLang, LMDeploy, and TensorRT-LLM. They overlap heavily in feature set, since techniques such as continuous batching, prefix caching, paged KV cache, and tensor parallelism are now standard. They differ in implementation choices, hardware coverage, and ecosystem positioning.
| Aspect | LMDeploy | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|---|
| Primary developer | Shanghai AI Lab / InternLM | UC Berkeley / Inferact | UC Berkeley / LMSYS | NVIDIA |
| First public release | mid 2023 | June 2023 | January 2024 | October 2023 |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 | Apache 2.0 |
| Scheduler language | C++ (TurboMind) / Python (PyTorch engine) | Python | Python | C++ |
| Default attention | Paged + FlashAttention | PagedAttention | RadixAttention | TensorRT kernels |
| Quantization breadth | Very broad: AWQ, GPTQ, SmoothQuant, KV INT8/4, MXFP4, FP8 | Broad: AWQ, GPTQ, FP8, INT8, KV cache | Broad: AWQ, FP8, KV cache | Broad: AWQ, FP8, INT8 |
| Hardware coverage | NVIDIA, AMD ROCm, Huawei Ascend | NVIDIA, AMD, Intel, TPU, Inferentia | NVIDIA, AMD, TPU (via SGLang-Jax) | NVIDIA only |
| Model coverage breadth | Approximately 50 LLMs, 40 VLMs | 100+ architectures | Approximately 60 LLMs | NVIDIA-curated set |
| Notable strength | TurboMind throughput, AWQ kernels, InternLM/InternVL native support | Largest community, most production deployments, OpenAI compatibility | RadixAttention prefix sharing, structured outputs, DeepSeek leadership | Peak performance on NVIDIA-only deployments |
| Notable weakness | Smaller community than vLLM; adding TurboMind models requires C++ | Slightly lower peak throughput than C++-scheduled engines | Smaller ecosystem than vLLM | Limited to NVIDIA hardware; per-model engine build step |
| Governance | Shanghai AI Lab open-source program | PyTorch Foundation (since May 2025) | PyTorch ecosystem project (since 2025) | NVIDIA-owned |
In practice, choice often comes down to deployment context. Users running InternLM, InternVL, or other Chinese open-source models tend to pick LMDeploy for its native model support and Chinese-language documentation. Users on Western public clouds with heterogeneous fleets typically pick vLLM. Workloads with heavy prefix sharing or structured outputs, or DeepSeek deployments, tend to favor SGLang. Pure NVIDIA fleets at hyperscale willing to invest in per-model engine builds tend toward TensorRT-LLM.
LMDeploy is installed from PyPI with pip install lmdeploy. Default wheels in 2026 ship with CUDA 12.8 support, targeting NVIDIA Ampere, Hopper, and Blackwell GPUs. The package exposes three primary entry points: an interactive command-line chat tool, an offline batch inference API in Python, and an OpenAI-compatible HTTP server.
The HTTP server is the most common production deployment mode. The command lmdeploy serve api_server <model> starts a service on port 23333 that accepts both legacy completions and modern chat completions requests, with streaming responses, tool calls, and function-calling extensions. Vision-language models accept image inputs through the same API using the OpenAI multimodal message format.
For large models that do not fit on a single GPU, LMDeploy supports tensor parallelism within a node via the tp argument, and pipeline parallelism across nodes via a distributed runner. A built-in proxy server can load-balance requests across multiple LMDeploy worker instances and host multiple models simultaneously, exposing them through the OpenAI API's model selection field. The cache_max_entry_count parameter defaults to 0.8 and controls the fraction of free GPU memory after model loading reserved for the KV cache, a deliberate departure from vLLM's default of approximately 0.9 to leave more headroom against out-of-memory errors.
LMDeploy is hosted on GitHub under the InternLM organization. The project has accumulated approximately 7,900 stars and 700 forks as of mid 2026, with a contributor base of more than 100 developers. Core maintainers are employees of Shanghai AI Laboratory, with significant external contributions from Chinese cloud providers including Alibaba Cloud, Huawei, and Baidu Smart Cloud.
The project does not have an independent foundation home as vLLM and SGLang do under the PyTorch Foundation umbrella. Governance remains with the InternLM organization, with releases driven by a maintainer team roughly every two to four weeks; version 0.13.0 was the current stable release as of May 2026. Documentation is maintained in both English and Chinese on Read the Docs.
LMDeploy in 2026 sits in the top tier of open-source LLM serving frameworks. Its TurboMind backend provides among the highest throughput numbers on NVIDIA Hopper hardware, its quantization toolkit is broader than most competitors, and its model coverage is unusually deep for the InternLM, InternVL, Qwen, and DeepSeek families that dominate Chinese open-source AI. The PyTorch backend has steadily closed the model-coverage gap with vLLM, although vLLM still leads on the long tail of less popular architectures.
Competitive pressure comes from two directions. TensorRT-LLM continues to lead on raw NVIDIA-only throughput when an organization is willing to do the per-model engine-build work. SGLang has overtaken LMDeploy in mindshare for prefix-caching-heavy workloads, and vLLM remains the default choice for new production deployments in Western enterprises because of its larger community.
Recent releases have added DeepSeek-V3 support with the DeepGemm and FlashMLA kernels on day one, brought MXFP4 quantization to Blackwell hardware, extended the PyTorch engine to AMD ROCm and Huawei Ascend, integrated with the LLM Compressor toolchain, and added speculative decoding paths in both engines. LMDeploy is a credible alternative to vLLM and SGLang on NVIDIA hardware, the most natural choice for serving InternLM and InternVL models anywhere, and an interesting option for Ascend NPU deployments where alternatives are much weaker.