Tokens per second (TPS) is a key performance metric for measuring the speed of large language model (LLM) inference. It quantifies how many tokens a model can generate or process in one second. Higher TPS values indicate faster generation, which translates to shorter wait times for users and lower costs for providers. TPS is one of the most commonly cited metrics when evaluating and comparing LLM serving infrastructure, and it plays a central role in benchmarks published by organizations such as Artificial Analysis, NVIDIA, and cloud service providers.
As LLMs have grown in capability and deployment scale, TPS has become a critical factor in hardware procurement decisions, model selection, and infrastructure design. The metric is closely related to other inference performance measures including time to first token (TTFT), inter-token latency (ITL), and throughput.
At its most basic, tokens per second measures the rate at which tokens are produced during inference. However, the precise definition varies depending on what phase of inference is being measured and whether the metric refers to a single request or the entire system.
Output TPS measures how many tokens a model generates per second for a single request, excluding the initial prompt processing time. This is the metric most relevant to the end-user experience during streaming, as it determines how quickly text appears on screen after generation begins.
Output TPS = (number of output tokens) / (time spent generating output tokens)
Total TPS accounts for the complete request lifecycle, including both prompt processing (prefill) and token generation (decode):
Total TPS = (number of output tokens) / (total request time)
Total TPS is always lower than output TPS because it includes the prefill phase, during which no output tokens are produced.
System throughput TPS measures the total number of tokens generated per second across all concurrent requests being served by the system:
System throughput TPS = (total output tokens across all requests) / (time period)
This metric is most relevant for infrastructure planning and cost analysis, as it determines how many requests a deployment can handle simultaneously.
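The three definitions can be sketched as small helper functions operating on per-request timestamps. The field names here are illustrative, not any particular serving framework's schema:

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    # Illustrative fields; real serving logs may name these differently.
    sent_at: float          # when the request was sent (seconds)
    first_token_at: float   # when the first output token arrived
    finished_at: float      # when the last output token arrived
    output_tokens: int      # number of generated tokens

def output_tps(r: RequestTiming) -> float:
    # Excludes prefill: the clock starts at the first output token.
    return r.output_tokens / (r.finished_at - r.first_token_at)

def total_tps(r: RequestTiming) -> float:
    # Includes prefill: the clock starts when the request was sent.
    return r.output_tokens / (r.finished_at - r.sent_at)

def system_throughput_tps(requests: list, period_seconds: float) -> float:
    # Total tokens generated across all requests in a measurement window.
    return sum(r.output_tokens for r in requests) / period_seconds
```

For a request that spends 0.5 s in prefill and generates 100 tokens over the following 2 s, output TPS is 50 while total TPS is 40, matching the rule that total TPS is always the lower of the two.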
| TPS metric | Scope | Includes prefill? | Most relevant for |
|---|---|---|---|
| Output TPS | Single request | No | User experience during streaming |
| Total TPS | Single request | Yes | End-to-end latency assessment |
| System throughput TPS | All concurrent requests | Varies | Infrastructure planning, cost analysis |
Time to first token (TTFT) measures the time between sending a request and receiving the first output token. It is distinct from TPS but closely related, as both contribute to the overall user experience.
TTFT = time when first token is received - time when request is sent
TTFT includes several components:
| Component | Description |
|---|---|
| Network latency | Time for the request to travel from client to server and the first token to travel back |
| Queue wait time | Time the request spends waiting in the serving queue before processing begins |
| Prefill time | Time to process the input prompt through the model and populate the KV cache |
TTFT is particularly important for interactive applications like chatbots and AI coding assistants, where users expect near-instant acknowledgment that the model has started generating. A TTFT of under 500 milliseconds is generally considered acceptable for real-time conversational applications, while batch processing pipelines are less sensitive to TTFT.
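A minimal sketch of measuring TTFT and output TPS from a streaming response, assuming a generic iterator that yields decoded tokens; `fake_stream` is a stand-in that simulates prefill and per-token delays:

```python
import time

def measure_streaming(stream, send_time):
    # TTFT: time from sending the request to the first token arriving.
    # Output TPS: tokens after the first, divided by the decode window.
    first_token_time = None
    n_tokens = 0
    for _token in stream:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_time - send_time
    tps = (n_tokens - 1) / (end - first_token_time) if n_tokens > 1 else 0.0
    return ttft, tps

def fake_stream():
    # Simulated response: 50 ms "prefill", then 10 tokens at ~10 ms each.
    time.sleep(0.05)
    for _ in range(10):
        time.sleep(0.01)
        yield "tok"

send = time.perf_counter()
ttft, tps = measure_streaming(fake_stream(), send)
```

Note that TPS here is computed only over the decode window, so the 50 ms prefill delay inflates TTFT but leaves output TPS unaffected, mirroring the distinction drawn above.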
TTFT and output TPS measure different phases of inference and are affected by different factors:
| Metric | Phase | Affected by prompt length? | Affected by output length? | User impact |
|---|---|---|---|---|
| TTFT | Prefill | Strongly: longer prompts increase prefill time | No | How long the user waits before seeing any response |
| Output TPS | Decode | Weakly (via KV cache size) | Weakly | How fast text appears during streaming |
For reasoning models that produce internal chain-of-thought tokens before the visible response, TTFT may include the time to generate reasoning tokens, which can be substantial (seconds to minutes for complex queries).
Many factors influence tokens per second, spanning hardware, model architecture, software optimization, and serving configuration.
Model size (measured in parameters) is the most fundamental determinant of TPS. Larger models require more computation per token and more memory bandwidth to load weights, directly reducing generation speed.
| Model size | Typical output TPS (single A100 GPU, FP16) | Notes |
|---|---|---|
| 1-3B parameters | 100-200+ | Small models run very fast |
| 7-8B parameters | 50-100 | The most popular size for local deployment |
| 13-14B parameters | 30-60 | Good quality-speed balance |
| 30-34B parameters | 15-30 | Requires more VRAM or quantization |
| 65-70B parameters | 8-20 | Usually requires multi-GPU or quantization |
| 180B+ parameters | 3-10 | Requires multi-node deployment |
These are rough estimates and vary significantly depending on hardware, software stack, quantization, and batch size.
Quantization reduces the precision of model weights from higher-precision formats (FP32, FP16, BF16) to lower-precision formats (INT8, INT4, FP8). This reduces memory usage and increases TPS, often with minimal impact on output quality.
| Quantization format | Bits per weight | Memory reduction vs. FP16 | Typical speedup | Quality impact |
|---|---|---|---|---|
| FP16 / BF16 | 16 | Baseline | Baseline | None |
| FP8 | 8 | ~2x | ~1.5-2x | Minimal |
| INT8 (W8A8) | 8 | ~2x | ~1.5-2x | Minimal |
| INT4 (GPTQ, AWQ) | 4 | ~4x | ~2-3x | Small; model-dependent |
| 2-3 bit | 2-3 | ~5-8x | ~3-4x | Noticeable degradation |
Quantization improves TPS primarily by reducing memory bandwidth requirements. During the decode phase, LLM inference is typically memory-bandwidth bound: the bottleneck is loading model weights from GPU memory, not performing computations. Smaller weights mean less data to transfer, which translates directly to higher TPS.
Notable quantization methods include GPTQ (post-training quantization using Hessian-based optimization), AWQ (activation-aware weight quantization), and GGUF (the format used by llama.cpp for CPU and mixed CPU-GPU inference).
The choice of hardware profoundly affects TPS. The two most important hardware characteristics for LLM inference are memory bandwidth and compute throughput.
| GPU | Memory | Memory bandwidth | FP16 TFLOPS | Relative LLM inference speed |
|---|---|---|---|---|
| NVIDIA A100 80GB | 80 GB HBM2e | 2.0 TB/s | 312 | 1x (baseline) |
| NVIDIA H100 SXM | 80 GB HBM3 | 3.35 TB/s | 989 | ~2-3x |
| NVIDIA H200 | 141 GB HBM3e | 4.8 TB/s | 989 | ~3-4x |
| NVIDIA B200 | 192 GB HBM3e | 8.0 TB/s | 2,250 | ~5-8x (projected) |
| Apple M2 Ultra | 192 GB unified | 0.8 TB/s | ~27 | ~0.3-0.5x |
| AMD MI300X | 192 GB HBM3 | 5.3 TB/s | 1,307 | ~2.5-3.5x |
For the memory-bandwidth-bound decode phase, memory bandwidth is the primary determinant of TPS. The NVIDIA H100 achieves roughly 2-3 times the inference throughput of the A100, driven by its ~1.7x advantage in memory bandwidth combined with architectural improvements such as the Transformer Engine and native FP8 support.
Batch size (the number of requests processed simultaneously) significantly affects both per-request TPS and system throughput TPS, but in opposite directions.
| Batch size | Per-request output TPS | System throughput TPS | Hardware utilization |
|---|---|---|---|
| 1 | Highest | Lowest | Low (memory-bandwidth bound) |
| 8-16 | Moderate | Moderate | Moderate |
| 32-64 | Lower | Higher | High |
| 128+ | Lowest per-request | Highest total | Maximum (compute bound) |
At small batch sizes, inference is memory-bandwidth bound, and increasing the batch size improves total throughput without proportionally increasing latency. At large batch sizes, inference becomes compute bound, and further increases in batch size begin to slow down individual requests.
Continuous batching (also called iteration-level batching), used by serving frameworks like vLLM and TensorRT-LLM, dynamically adds new requests to the batch as existing requests complete. This maintains high throughput without the latency penalty of waiting for a full batch to accumulate.
The inference software stack has a large impact on TPS. Key optimizations include:
| Optimization | Description | TPS impact |
|---|---|---|
| KV cache management | Reuses computed attention key-value pairs across generation steps | Essential; without it, generation would be impractically slow |
| Flash Attention | Memory-efficient attention algorithm that reduces memory reads/writes | 1.5-3x speedup for long sequences |
| PagedAttention (vLLM) | Manages KV cache like virtual memory pages, reducing fragmentation | Up to 2-4x throughput improvement |
| Tensor parallelism | Splits model layers across multiple GPUs | Enables larger models; near-linear scaling with good interconnect |
| Speculative decoding | Uses a small draft model to predict multiple tokens, verified in parallel | 2-3x speedup without quality loss |
| Kernel fusion | Combines multiple operations into single GPU kernels | 10-30% improvement |
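The speculative decoding row can be illustrated with a toy draft-and-verify loop. Both `draft_model` and `target_model` are stand-in functions invented for this sketch; real systems verify all draft tokens in a single parallel forward pass of the target model:

```python
def target_model(context):
    # Stand-in "target" model: deterministic +1 rule for demonstration.
    return context[-1] + 1

def draft_model(context, k):
    # Stand-in "draft" model: matches the target except on its last
    # guess, which is deliberately wrong to illustrate rejection.
    out, cur = [], context[-1]
    for i in range(k):
        cur += 1
        out.append(cur if i < k - 1 else cur + 4)
    return out

def speculative_step(context, k=4):
    # Accept draft tokens greedily until the first mismatch with the
    # target; on mismatch, substitute the target's own token.
    proposed = draft_model(context, k)
    ctx, accepted = list(context), []
    for tok in proposed:
        if target_model(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    if len(accepted) < k:
        accepted.append(target_model(ctx))
    return accepted

tokens = speculative_step([1], k=4)   # draft proposes [2, 3, 4, 9]
```

One step yields four tokens for the price of one target verification pass plus four cheap draft calls, which is where the 2-3x speedup comes from when the draft's acceptance rate is high.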
The length of the input prompt (context) affects both TTFT and, to a lesser extent, output TPS. Longer prompts require more time for the prefill phase and produce larger KV caches, which increase memory pressure during the decode phase.
| Context length | Effect on TTFT | Effect on output TPS |
|---|---|---|
| Short (< 1K tokens) | Minimal | Negligible |
| Medium (1K-8K tokens) | Moderate increase | Small decrease |
| Long (8K-32K tokens) | Significant increase | Moderate decrease (due to KV cache size) |
| Very long (32K-128K+ tokens) | Large increase (seconds) | Noticeable decrease |
Accurate TPS benchmarking requires careful methodology to produce meaningful, reproducible results.
Standardized tokenization. Different models use different tokenizers, so the same text may produce different token counts depending on the model. Artificial Analysis standardizes all measurements to OpenAI GPT-4 tokens (o200k_base tokenizer) to allow fair cross-model comparisons. Other benchmarks may use each model's native tokenizer, which can make comparisons misleading.
Warm-up runs. Initial requests may be slower due to model loading, kernel compilation (for JIT-compiled frameworks), and cache population. Benchmarks should include warm-up requests that are excluded from measurements.
Statistical reporting. TPS can vary between requests due to server load, network conditions, and other factors. Robust benchmarks report statistical aggregates. Artificial Analysis reports the median (P50) over the past 72 hours to reflect sustained performance. Other benchmarks may report P50, P90, P95, or P99 latencies.
Fixed prompt and output lengths. For comparable results, benchmarks should use standardized prompt lengths and target output lengths. Artificial Analysis uses several standardized workloads including short prompts and 100-token outputs.
Concurrent requests. Benchmarks should specify the level of concurrency. Single-request benchmarks measure per-user speed; high-concurrency benchmarks measure system throughput.
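These practices can be combined into a minimal benchmarking harness; `run_request` is a hypothetical caller-supplied function that performs one request and returns its output TPS:

```python
import math
import statistics

def benchmark_tps(run_request, n_warmup=3, n_runs=20):
    # Warm-up requests are issued but excluded from the measurements.
    for _ in range(n_warmup):
        run_request()
    samples = sorted(run_request() for _ in range(n_runs))
    p95_index = math.ceil(0.95 * n_runs) - 1   # nearest-rank percentile
    return {
        "p50": statistics.median(samples),
        "p95": samples[p95_index],
        "min": samples[0],
        "max": samples[-1],
    }
```

Reporting P50 alongside a tail percentile (P95 here) matters because a provider with a good median but a long tail delivers an inconsistent user experience that the median alone hides.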
Artificial Analysis is the most widely referenced independent benchmarking platform for LLM API performance. Founded in 2023, it continuously benchmarks over 300 models and 500 API endpoints across providers including OpenAI, Anthropic, Google, Mistral, and numerous open-model hosting providers.
Key aspects of their methodology include standardizing all token counts to the OpenAI o200k_base tokenizer, reporting the median over the past 72 hours to reflect sustained performance, and using fixed prompt and output lengths across providers.
As of early 2026, some of the fastest models on the Artificial Analysis leaderboard include Mercury Coder (835+ tokens/second), NVIDIA Nemotron models (400+ tokens/second), and various optimized 7-8B parameter models (300+ tokens/second).
NVIDIA provides benchmarking tools through its NIM (NVIDIA Inference Microservices) platform and TensorRT-LLM library. Their AIPerf benchmark measures throughput, TTFT, and inter-token latency under controlled conditions. NVIDIA's benchmarks are particularly useful for evaluating the impact of hardware choices and optimization settings.
| Framework | Provider | Focus |
|---|---|---|
| LLMPerf | Anyscale | API endpoint benchmarking with configurable concurrency |
| vLLM benchmarks | vLLM project | Serving throughput with PagedAttention |
| llama.cpp benchmarks | llama.cpp community | Local inference on consumer hardware |
| MLPerf Inference | MLCommons | Standardized ML inference benchmarking across hardware |
TPS requirements and achievable values differ dramatically across deployment scenarios.
| Scenario | Typical TPS requirement | Typical hardware | Key optimization priority |
|---|---|---|---|
| Cloud API (consumer chatbot) | 30-80 output TPS per user | H100/H200 clusters | Low TTFT, consistent speed |
| Cloud API (batch processing) | Maximize system throughput | H100/H200 clusters | High concurrent throughput, cost per token |
| On-premise enterprise | 20-60 output TPS per user | A100/H100 | Balance of speed and cost |
| Edge/mobile deployment | 5-20 output TPS | Apple Silicon, mobile NPUs | Low memory usage, energy efficiency |
| Local hobbyist | 10-40 output TPS | Consumer GPUs (RTX 4090), Apple M-series | Quantized models, llama.cpp/Ollama |
| Research/experimentation | Variable | Mixed | Flexibility and reproducibility |
The relationship between TPS and user experience follows a pattern of diminishing returns. Research and user studies suggest the following thresholds:
| TPS range | User experience |
|---|---|
| < 5 TPS | Noticeably slow; users report frustration |
| 5-15 TPS | Acceptable for non-interactive use |
| 15-30 TPS | Comfortable reading speed; most users find this satisfactory for chat |
| 30-60 TPS | Fast; exceeds typical reading speed for most users |
| 60+ TPS | Diminishing perceptual benefit; text appears nearly instantaneous |
The average adult reads at approximately 200-300 words per minute, which corresponds to roughly 3-5 words per second, or about 4-7 tokens per second given typical tokenization ratios of around 1.3 tokens per English word. Output speeds of 30+ TPS therefore significantly exceed human reading speed, and further improvements primarily benefit batch processing, programmatic use cases, and TTFT-sensitive applications rather than direct user perception.
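The conversion is straightforward, assuming roughly 1.3 tokens per English word:

```python
def reading_speed_tps(words_per_minute, tokens_per_word=1.3):
    # Converts a human reading speed into an equivalent token rate.
    return words_per_minute / 60 * tokens_per_word

slow_reader = reading_speed_tps(200)   # ~4.3 tokens/s
fast_reader = reading_speed_tps(300)   # 6.5 tokens/s
```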
TPS is directly related to the cost of serving LLM requests. Higher TPS means each GPU-second produces more tokens, reducing the cost per token.
Cost per million tokens (approx.) = (GPU cost per second * 1,000,000) / (system throughput TPS)
For example, if an H100 server costs $3 per hour ($0.000833 per second) and achieves a system throughput of 1,000 TPS, the cost per million output tokens is approximately $0.83. Improving throughput to 2,000 TPS would halve this cost to approximately $0.42 per million tokens.
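The formula above as a small helper, reproducing the worked example:

```python
def cost_per_million_tokens(gpu_cost_per_hour, system_tps):
    # Cost of generating one million output tokens at a sustained
    # system throughput, given an hourly GPU rental price.
    cost_per_second = gpu_cost_per_hour / 3600
    return cost_per_second * 1_000_000 / system_tps

at_1000_tps = cost_per_million_tokens(3.0, 1000)   # ~$0.83
at_2000_tps = cost_per_million_tokens(3.0, 2000)   # ~$0.42
```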
This economic relationship drives the intense industry focus on inference optimization, as even small improvements in TPS translate to significant cost savings at scale.
TPS for frontier models has improved steadily due to advances in hardware, software optimization, and model architecture.
| Year | Milestone |
|---|---|
| 2020 | GPT-3 (175B) served at ~10-20 TPS per request on A100 clusters |
| 2022 | Introduction of Flash Attention significantly improves throughput for long contexts |
| 2023 | vLLM introduces PagedAttention, improving serving throughput by 2-4x; speculative decoding gains traction |
| 2024 | FP8 quantization on H100 becomes standard; specialized inference chips (Groq LPU, Cerebras) achieve 300-500+ TPS |
| 2025 | NVIDIA Blackwell (B200) begins deployment; Groq and competitors push toward 1,000+ TPS for medium-sized models |
| 2026 | Mercury Coder achieves 835+ TPS on Artificial Analysis leaderboard; inference optimization continues to accelerate |
| Misconception | Reality |
|---|---|
| Higher TPS always means better model | TPS measures speed, not quality. A small, fast model may generate worse text than a larger, slower one. |
| TPS is constant throughout generation | TPS can vary during generation due to KV cache growth, batch scheduling, and attention computation scaling with sequence length. |
| Comparing TPS across models is straightforward | Different tokenizers mean the same text produces different token counts. Cross-model comparisons require standardized tokenization. |
| Doubling GPU count doubles TPS | Tensor parallelism introduces communication overhead. Scaling is sub-linear, especially beyond 4-8 GPUs. |
| Quantization always reduces quality significantly | Modern quantization methods (GPTQ, AWQ) at 4-bit precision often produce negligible quality degradation for most tasks. |