LMDeploy

AI Infrastructure Developer Tools Open Source AI

20 min read

Updated Jun 27, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 27, 2026

Fact-checked

In review queue

Sources

17 citations

Revision

v3 · 4,027 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

LMDeploy is an open-source toolkit for compressing, deploying, and serving large language models, developed by the MMRazor and MMDeploy teams associated with the InternLM project at the Shanghai AI Laboratory.^[1]^[2] Its official one-line description is "a toolkit for compressing, deploying, and serving LLM," and its central performance claim is that it "delivers up to 1.8x higher request throughput than vLLM."^[1] The project is structured around two interchangeable inference engines: TurboMind, a C++ runtime built on a fork of NVIDIA's FasterTransformer that emphasizes raw throughput and quantized inference, and a PyTorch engine, a pure Python implementation that prioritizes broad model coverage and ease of contribution.^[1]^[2] LMDeploy is distributed under the Apache 2.0 license and is hosted at github.com/InternLM/lmdeploy.^[1]

LMDeploy is one of the serving frameworks that dominate open-source LLM inference in 2026, alongside vLLM, SGLang, and NVIDIA's TensorRT-LLM. It is distinguished by its tight integration with the InternLM and InternVL model families, by its early and aggressive investment in 4-bit weight quantization via the AWQ algorithm, and by the TurboMind C++ backend, which removes the Python interpreter from the inference hot path on NVIDIA hardware.^[1]^[17] LMDeploy reports that 4-bit inference performance is 2.4x higher than FP16 on the same hardware.^[1] Benchmarks commonly place TurboMind among the fastest open-source runtimes on NVIDIA Hopper-class GPUs, with throughput numbers in 2026 cited near 16,200 tokens per second on a single H100 for popular dense models in the 7 to 13 billion parameter range, compared with roughly 12,500 tokens per second for a comparable vLLM configuration.^[6]^[7]

Beyond the InternLM ecosystem, LMDeploy supports more than fifty LLM architectures, including Llama 2 and 3, Qwen through Qwen3, Mistral, Mixtral and other mixture of experts variants, DeepSeek-V3 and its derivatives, Gemma, Baichuan, ChatGLM, and Yi.^[1]^[2] It also supports roughly forty vision-language models, including the LLaVA and InternVL families, Qwen-VL, DeepSeek-VL, MiniCPM-V, CogVLM, Phi-3 Vision, and Llama 3.2 Vision.^[2]

What is LMDeploy?

LMDeploy is a production inference and quantization stack for large language models and vision-language models. According to the project README, "LMDeploy is a toolkit for compressing, deploying, and serving LLM, developed by the MMRazor and MMDeploy teams."^[1] It packages three things that are usually separate in the open-source ecosystem: a quantization toolkit (weight-only and KV-cache quantization), a pair of high-throughput inference engines, and a distributed, OpenAI-compatible serving layer with a built-in request router.^[1]^[2]

The README groups its capabilities under four headings: Efficient Inference, Effective Quantization, Effortless Distribution Server, and Excellent Compatibility.^[1] In practice this means a user can take an unquantized checkpoint from Hugging Face, quantize it to 4-bit with one command, and serve it behind an OpenAI-compatible HTTP endpoint with continuous batching and a paged KV cache, all from a single pip install lmdeploy.^[1]^[15]

What is the origin of LMDeploy and the InternLM toolchain?

LMDeploy was created by engineers inside the Shanghai AI Laboratory who had previously worked on the MMRazor model compression toolkit and the MMDeploy model deployment toolkit, both part of the OpenMMLab computer vision ecosystem.^[1]^[4] When Shanghai AI Lab launched the InternLM language model series in June 2023, the lab needed a serving stack that could run InternLM efficiently on Chinese cloud infrastructure. The group that had built MMDeploy for vision models was tasked with the LLM equivalent.

The project's first public release on GitHub appeared in mid 2023, only a few months after vLLM's June 2023 announcement and well before SGLang's January 2024 launch. The initial codebase exposed a single inference engine, TurboMind, which started as a fork of NVIDIA's open-source FasterTransformer library.^[1]^[14] FasterTransformer had been the de facto high-performance C++ implementation of transformer inference for several years, and the team chose to adapt it rather than write a new engine from scratch, since the kernels were already well tuned for NVIDIA hardware and a C++ backend would avoid Python overhead.

LMDeploy was open sourced under the Apache 2.0 license and developed in lockstep with the rest of the InternLM toolchain.^[1] The InternLM organization on GitHub hosts a coordinated set of projects including the InternLM and InternLM2/2.5/3 base models, the InternVL vision-language model family, the XTuner fine-tuning library, the OpenCompass evaluation harness, the MindSearch agent system, and LMDeploy as the serving and quantization layer.^[4] New InternLM and InternVL models almost always ship with day-zero LMDeploy support.

Shanghai AI Laboratory is a government-affiliated research institute headquartered in Shanghai, founded in 2020 with backing from the Shanghai municipal government and academic partners including Shanghai Jiao Tong University and Fudan University.^[5] It has become one of the most prolific producers of open-source foundation models in China.

What is TurboMind?

TurboMind is the C++ inference engine at the heart of LMDeploy. The 2025 LMDeploy paper describes it as "a generalizable and efficient mixed-precision LLM inference engine of LMDeploy."^[17] It is the engine most users encounter first when they install LMDeploy, because it is the default backend whenever the loaded model architecture is supported. TurboMind is implemented almost entirely in C++ and CUDA, with a thin Python wrapper that exposes a request scheduler and an OpenAI-compatible HTTP API.^[2] The Python layer's only job is to accept requests, marshal them into C++ data structures, and stream tokens back; all heavy work happens in the C++ runtime.

This design choice is the engine's main architectural distinction. vLLM and SGLang both run their schedulers, batchers, and tokenizers in Python, calling out to C++ or CUDA only for the underlying kernels. For long-running, high-concurrency workloads this Python orchestration adds measurable overhead, particularly on the fastest GPUs where each kernel launch is short relative to the time it takes the Python interpreter to prepare the next batch. TurboMind avoids this overhead entirely by keeping the scheduler and the batcher in C++, which is a major reason the engine often outpaces Python-orchestrated runtimes on H100 and H200 hardware.^[7]

TurboMind inherits several optimizations from its FasterTransformer ancestry, including fused multi-head attention kernels, fused MLP and LayerNorm operations, NCCL-based tensor parallelism, and an efficient implementation of grouped-query attention.^[14] The LMDeploy team has added persistent batching (their term for continuous batching), a blocked KV cache modeled on the paging idea popularized by PagedAttention, dynamic split and fuse scheduling that breaks large prefill requests into smaller chunks for better interleaving with decode requests, automatic prefix caching, FlashAttention and split-K decoding kernels, and a custom path for AWQ-quantized 4-bit weights that uses dedicated CUTLASS-based GEMMs.^[1]^[2]

The 2025 TurboMind paper also reports a concrete gain from aggressive KV-cache compression: with 4-bit KV cache quantization, the engine "delivers an average improvement of 18.3% over the 16-bit baseline, with maximum speedups of 57.9% observed in long sequence scenarios."^[17]

For mixture-of-experts models, TurboMind integrates with DeepGemm, the FP8 GEMM library released by DeepSeek, and with FlashMLA, the multi-head latent attention kernel DeepSeek developed for the V2 and V3 architectures.^[2] These additions allow TurboMind to serve the 685B-parameter DeepSeek-V3 model efficiently on multi-node H100 deployments. A practical limitation is that adding support for a new model architecture requires writing C++ code, which is slower than adding a Python class in vLLM or SGLang. This is the reason LMDeploy maintains a second engine.

What is the PyTorch engine?

The PyTorch engine, sometimes called pytorch_engine in the code, was added in 2024 to address the model coverage gap. It is a pure Python and PyTorch implementation of the LMDeploy inference stack, with no dependency on the TurboMind C++ runtime. New architectures can be added simply by writing a Python module that defines the model's layers using standard PyTorch primitives, with the engine's scheduler, batcher, and KV cache manager handling the rest.^[2]

The PyTorch engine is slower than TurboMind for the model families that both engines support, sometimes by 20 to 40 percent depending on hardware and workload, but it serves a critical purpose: it lets users run models not yet ported to TurboMind, and it gives community contributors a low-barrier path to adding new architectures. Many models are first added to the PyTorch engine and then ported to TurboMind once their popularity justifies the engineering effort.

The PyTorch engine supports the same OpenAI-compatible API, quantization workflow, persistent batching scheduler, and most of the same kernel-level optimizations (FlashAttention, prefix caching, paged KV cache) as TurboMind.^[2] It also adds support for hardware backends that TurboMind does not target directly, including AMD ROCm GPUs and Huawei Ascend NPUs, the latter via a graph compilation mode that was reported in late 2025 to roughly double throughput on Ascend hardware.

What are the key features of LMDeploy?

The table below summarizes the features that LMDeploy advertises on its README and in its documentation.^[1]^[2]

Feature	Description
TurboMind engine	C++ inference runtime forked from FasterTransformer; default for supported models
PyTorch engine	Pure Python and PyTorch runtime for broader model coverage
Persistent batching	Continuous batching of incoming requests at every decode step
Blocked KV cache	Paged key-value cache modeled on PagedAttention
Dynamic split and fuse	Chunked prefill that interleaves prefill and decode work
Automatic prefix caching	Reuse of KV cache across requests that share a prompt prefix
Tensor parallelism	Multi-GPU sharding via NCCL
Pipeline parallelism	Cross-node sharding for very large models
OpenAI-compatible API	Drop-in replacement for OpenAI chat and completions endpoints
Quantization toolkit	AWQ, SmoothQuant W8A8, KV cache INT4/INT8, MXFP4
Vision-language support	Native serving for forty-plus VLM architectures
Multi-model serving	Ability to host multiple models on the same server
Speculative decoding	Draft model and self-speculative paths (added 2025)
Distributed deployment	Multi-node, multi-card service mode with built-in router
Hardware support	NVIDIA CUDA (primary), AMD ROCm, Huawei Ascend NPU

How does LMDeploy handle quantization?

Quantization is one of LMDeploy's strongest areas, and the project was an early adopter of 4-bit weight-only quantization for production inference. The toolkit exposes several quantization workflows through its lmdeploy lite command-line interface, and the TurboMind engine includes specialized fused kernels that read the quantized weights directly without dequantizing them into higher precision first.^[12]

Method	Precision	Description
AWQ	W4A16	Activation-aware weight quantization, ported from the MIT AWQ paper; 4-bit weights, 16-bit activations
GPTQ	W4A16	Alternative 4-bit weight quantization scheme; TurboMind can also infer GPTQ-quantized checkpoints
SmoothQuant	W8A8	8-bit weights and 8-bit activations using the SmoothQuant smoothing transform
KV INT8	KV cache	Online per-head, per-token 8-bit quantization of the key-value cache
KV INT4	KV cache	Online 4-bit asymmetric quantization of the key-value cache
MXFP4	W4	Microscaling FP4 format for select NVIDIA GPUs that support it
FP8	W8A8	FP8 weights and activations for Hopper and Blackwell hardware

The AWQ pipeline is the most commonly used.^[12]^[13] A user runs lmdeploy lite auto_awq with a model path and a calibration dataset; the tool produces a quantized checkpoint that TurboMind can then load directly. The resulting 4-bit model uses roughly one quarter of the GPU memory of an FP16 baseline, and LMDeploy documents a 2.4x throughput improvement on AWQ-quantized weights compared with FP16 inference on the same hardware.^[1]

The SmoothQuant workflow targets INT8 deployments where activations also need quantization to use integer tensor cores.^[11] The KV cache quantization is applied online rather than offline: the engine quantizes the keys and values into INT8 or INT4 as it writes them into the cache, using a per-head and per-token asymmetric scheme that LMDeploy reports preserves accuracy more reliably than per-tensor schemes.^[10] As measured in the 2025 TurboMind paper, 4-bit KV cache quantization yields an average 18.3 percent throughput improvement over the 16-bit baseline and up to 57.9 percent in long-sequence scenarios.^[17] Later additions include MXFP4 support for Blackwell hardware, reported to give approximately 1.5x the throughput of vLLM on H800 GPUs for compatible models, and integration with the LLM Compressor project from the vLLM ecosystem.

Which models does LMDeploy support?

LMDeploy's model coverage spans dense decoder-only LLMs, vision-language models, and sparse mixture-of-experts models. TurboMind and the PyTorch engine have overlapping but not identical coverage; the PyTorch engine supports more architectures, while TurboMind generally provides higher throughput for the models it does support.^[2]

Model family	Sizes supported	Engine
InternLM, InternLM2, InternLM2.5, InternLM3	1.8B to 20B	TurboMind, PyTorch
InternVL, InternVL2, InternVL3	1B to 78B	TurboMind, PyTorch
Llama 2	7B to 70B	TurboMind, PyTorch
Llama 3, Llama 3.1, Llama 3.2, Llama 3.3	1B to 405B	TurboMind, PyTorch
Llama 4 (Scout, Maverick)	MoE	PyTorch
Qwen, Qwen1.5, Qwen2, Qwen2.5, Qwen3	0.5B to 110B	TurboMind, PyTorch
Qwen-VL, Qwen2-VL	2B to 72B	TurboMind, PyTorch
Mistral 7B, Mistral Nemo	7B to 12B	TurboMind, PyTorch
Mixtral	8x7B, 8x22B	TurboMind, PyTorch
DeepSeek, DeepSeek-V2, DeepSeek-V3	up to 685B (MoE)	TurboMind, PyTorch
DeepSeek-VL, DeepSeek-VL2	up to 27B	PyTorch
Gemma, Gemma2, Gemma3	2B to 27B	TurboMind, PyTorch
Baichuan, Baichuan2	7B, 13B	TurboMind
ChatGLM3, GLM-4, GLM-4.5	6B to 9B	TurboMind, PyTorch
Yi, Yi-1.5	6B to 34B	TurboMind
LLaVA, LLaVA-Next	7B to 34B	TurboMind, PyTorch
MiniCPM, MiniCPM-V	1B to 8B	PyTorch
Phi-3, Phi-3 Vision, Phi-4	3.8B to 14B	PyTorch
CogVLM, CogVLM2	17B to 19B	PyTorch
InternLM-XComposer, InternLM-XComposer2	7B	TurboMind

LMDeploy generally adds support for flagship Chinese open-source models on or near their release day, including new Qwen and DeepSeek releases, while support for new Western models lands in the following weeks. The project has averaged roughly one new supported architecture per week through 2025 and into 2026.

How fast is LMDeploy in benchmarks?

LMDeploy publishes its own benchmark numbers on the project README, and several third-party benchmarks have appeared in the past year. The headline number LMDeploy itself cites is that TurboMind delivers up to 1.8x the request throughput of vLLM under high-concurrency workloads, with a 2.4x improvement when using AWQ 4-bit quantization compared with FP16 inference.^[1]

Independent comparisons published in late 2025 and early 2026 paint a consistent picture, although the exact numbers depend heavily on the workload and the GPU.^[6]^[7]^[16] The table below summarizes throughput figures commonly cited for Llama 3.1 8B on a single H100, decode-heavy chat workload, measured in output tokens per second across many concurrent requests.

Engine	Throughput on H100 (tokens/s)	Notes
LMDeploy TurboMind	approximately 16,200	Default settings; C++ backend
SGLang	approximately 16,200	With RadixAttention prefix caching
TensorRT-LLM	approximately 15,500 to 16,000	Requires per-model engine build
vLLM with FlashInfer	approximately 12,500	Python-orchestrated; widely deployed
Hugging Face TGI	approximately 8,000	Lower-effort production stack

The close tie between LMDeploy TurboMind and SGLang at the top of the table reflects the fact that both engines reduce Python-side overhead aggressively, although by different routes: LMDeploy moves the scheduler into C++, while SGLang keeps a Python scheduler but spends substantial engineering effort on zero-overhead batch scheduling.^[7] vLLM trails the leaders by roughly 25 to 30 percent on this workload, a gap that vLLM contributors have been actively closing through FlashInfer, CUDA graphs, and partial C++ rewrites of the scheduler.^[7]

For lower-concurrency workloads with longer prompts and significant prefix sharing, SGLang's RadixAttention often pulls ahead. For inference of AWQ-quantized 4-bit models, LMDeploy TurboMind typically leads by a wider margin, because its 4-bit GEMM kernels are among the most heavily optimized in the open-source ecosystem.^[1] LMDeploy also reports approximately 2x the throughput of the eager-mode PyTorch baseline on Huawei Ascend NPUs running in graph mode, and parity with NVIDIA A100 on AMD MI300 GPUs for several mid-size dense models.

How does LMDeploy compare to vLLM, SGLang, and TensorRT-LLM?

The four major open-source serving frameworks for LLMs in 2026 are vLLM, SGLang, LMDeploy, and TensorRT-LLM. They overlap heavily in feature set, since techniques such as continuous batching, prefix caching, paged KV cache, and tensor parallelism are now standard. They differ in implementation choices, hardware coverage, and ecosystem positioning.^[7]^[8]^[9]

Aspect	LMDeploy	vLLM	SGLang	TensorRT-LLM
Primary developer	Shanghai AI Lab / InternLM	UC Berkeley / Inferact	UC Berkeley / LMSYS	NVIDIA
First public release	mid 2023	June 2023	January 2024	October 2023
License	Apache 2.0	Apache 2.0	Apache 2.0	Apache 2.0
Scheduler language	C++ (TurboMind) / Python (PyTorch engine)	Python	Python	C++
Default attention	Paged + FlashAttention	PagedAttention	RadixAttention	TensorRT kernels
Quantization breadth	Very broad: AWQ, GPTQ, SmoothQuant, KV INT8/4, MXFP4, FP8	Broad: AWQ, GPTQ, FP8, INT8, KV cache	Broad: AWQ, FP8, KV cache	Broad: AWQ, FP8, INT8
Hardware coverage	NVIDIA, AMD ROCm, Huawei Ascend	NVIDIA, AMD, Intel, TPU, Inferentia	NVIDIA, AMD, TPU (via SGLang-Jax)	NVIDIA only
Model coverage breadth	Approximately 50 LLMs, 40 VLMs	100+ architectures	Approximately 60 LLMs	NVIDIA-curated set
Notable strength	TurboMind throughput, AWQ kernels, InternLM/InternVL native support	Largest community, most production deployments, OpenAI compatibility	RadixAttention prefix sharing, structured outputs, DeepSeek leadership	Peak performance on NVIDIA-only deployments
Notable weakness	Smaller community than vLLM; adding TurboMind models requires C++	Slightly lower peak throughput than C++-scheduled engines	Smaller ecosystem than vLLM	Limited to NVIDIA hardware; per-model engine build step
Governance	Shanghai AI Lab open-source program	PyTorch Foundation (since May 2025)	PyTorch ecosystem project (since 2025)	NVIDIA-owned

In practice, choice often comes down to deployment context. Users running InternLM, InternVL, or other Chinese open-source models tend to pick LMDeploy for its native model support and Chinese-language documentation. Users on Western public clouds with heterogeneous fleets typically pick vLLM. Workloads with heavy prefix sharing or structured outputs, or DeepSeek deployments, tend to favor SGLang. Pure NVIDIA fleets at hyperscale willing to invest in per-model engine builds tend toward TensorRT-LLM.

How is LMDeploy deployed and what is its API surface?

LMDeploy is installed from PyPI with pip install lmdeploy.^[3]^[15] Default wheels in 2026 ship with CUDA 12.8 support, targeting NVIDIA Ampere, Hopper, and Blackwell GPUs. The package exposes three primary entry points: an interactive command-line chat tool, an offline batch inference API in Python, and an OpenAI-compatible HTTP server.^[15]

The HTTP server is the most common production deployment mode. The command lmdeploy serve api_server <model> starts a service on port 23333 that accepts both legacy completions and modern chat completions requests, with streaming responses, tool calls, and function-calling extensions.^[15] Vision-language models accept image inputs through the same API using the OpenAI multimodal message format.

For large models that do not fit on a single GPU, LMDeploy supports tensor parallelism within a node via the tp argument, and pipeline parallelism across nodes via a distributed runner. A built-in proxy server can load-balance requests across multiple LMDeploy worker instances and host multiple models simultaneously, exposing them through the OpenAI API's model selection field. The cache_max_entry_count parameter defaults to 0.8 and controls the fraction of free GPU memory after model loading reserved for the KV cache, a deliberate departure from vLLM's default of approximately 0.9 to leave more headroom against out-of-memory errors.^[15]

Is LMDeploy open source, and who governs it?

Yes. LMDeploy is open source under the Apache 2.0 license and is hosted on GitHub under the InternLM organization.^[1] The project has accumulated approximately 7,900 stars and 700 forks as of mid 2026, with a contributor base of more than 100 developers. Core maintainers are employees of Shanghai AI Laboratory, with significant external contributions from Chinese cloud providers including Alibaba Cloud, Huawei, and Baidu Smart Cloud.

The project does not have an independent foundation home as vLLM and SGLang do under the PyTorch Foundation umbrella. Governance remains with the InternLM organization, with releases driven by a maintainer team roughly every two to four weeks; version 0.13.0 was the current stable release as of mid 2026, built against CUDA 12.8 and adding features such as Qwen3.5 dense and MoE support and Anthropic-compatible serving endpoints.^[2]^[3] Documentation is maintained in both English and Chinese on Read the Docs.

What is the current state of LMDeploy in 2026?

LMDeploy in 2026 sits in the top tier of open-source LLM serving frameworks. Its TurboMind backend provides among the highest throughput numbers on NVIDIA Hopper hardware, its quantization toolkit is broader than most competitors, and its model coverage is unusually deep for the InternLM, InternVL, Qwen, and DeepSeek families that dominate Chinese open-source AI.^[6]^[7] The PyTorch backend has steadily closed the model-coverage gap with vLLM, although vLLM still leads on the long tail of less popular architectures.

Competitive pressure comes from two directions. TensorRT-LLM continues to lead on raw NVIDIA-only throughput when an organization is willing to do the per-model engine-build work. SGLang has overtaken LMDeploy in mindshare for prefix-caching-heavy workloads, and vLLM remains the default choice for new production deployments in Western enterprises because of its larger community.

Recent releases have added DeepSeek-V3 support with the DeepGemm and FlashMLA kernels on day one, brought MXFP4 quantization to Blackwell hardware (with the official announcement reporting that LMDeploy outperforms vLLM across all scenarios on H800 and A100 for OpenAI's GPT-OSS models), extended the PyTorch engine to AMD ROCm and Huawei Ascend, integrated with the LLM Compressor toolchain, and added speculative decoding paths in both engines.^[2] LMDeploy is a credible alternative to vLLM and SGLang on NVIDIA hardware, the most natural choice for serving InternLM and InternVL models anywhere, and an interesting option for Ascend NPU deployments where alternatives are much weaker.

ELI5: What does LMDeploy do?

A large language model is a giant program that predicts text, but running one quickly and cheaply for many users at once is hard. LMDeploy is a free toolkit that does two jobs. First, it shrinks the model (quantization), squeezing the numbers inside it down to 4 bits so the model uses about a quarter of the memory and runs faster.^[1] Second, it serves the model: it answers many people's questions at the same time, packing their requests together efficiently (continuous batching) and reusing earlier work (KV cache) so nobody waits long. Its fast engine, TurboMind, is written in C++ so the computer spends its time doing math instead of bookkeeping, and the team says this lets it answer up to 1.8 times more requests per second than the popular vLLM tool.^[1]

References

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributors · full history

Suggest edit

What links here

AWQ (Activation-aware Weight Quantization)BentoML InternVL NVIDIA TensorRT-LLM Text Generation Inference (TGI)

What is LMDeploy?

What is the origin of LMDeploy and the InternLM toolchain?

What is TurboMind?

What is the PyTorch engine?

What are the key features of LMDeploy?

How does LMDeploy handle quantization?

Which models does LMDeploy support?

How fast is LMDeploy in benchmarks?

How does LMDeploy compare to vLLM, SGLang, and TensorRT-LLM?

How is LMDeploy deployed and what is its API surface?

Is LMDeploy open source, and who governs it?

What is the current state of LMDeploy in 2026?

ELI5: What does LMDeploy do?

See also

References

Improve this article

Related Articles

Ray (framework)

XLA (Accelerated Linear Algebra)

Supabase

Apache MXNet

Horovod

LanceDB

What links here

Related Articles

Ray (framework)

XLA (Accelerated Linear Algebra)

Supabase

Apache MXNet

Horovod

LanceDB

What links here