Tokens per second (TPS) is a key performance metric for measuring the speed of large language model (LLM) inference. It quantifies how many tokens a model can generate or process in one second. Higher TPS values indicate faster generation, which translates to shorter wait times for users and lower costs for providers. TPS is one of the most commonly cited metrics when evaluating and comparing LLM serving infrastructure, and it plays a central role in benchmarks published by organizations such as Artificial Analysis, NVIDIA, and cloud service providers.
As LLMs have grown in capability and deployment scale, TPS has become a critical factor in hardware procurement decisions, model selection, and infrastructure design. The metric is closely related to other inference performance measures including time to first token (TTFT), inter-token latency (ITL), and throughput.
At its most basic, tokens per second measures the rate at which tokens are produced during inference. However, the precise definition varies depending on what phase of inference is being measured and whether the metric refers to a single request or the entire system.
Output TPS measures how many tokens a model generates per second for a single request, excluding the initial prompt processing time. This is the metric most relevant to the end-user experience during streaming, as it determines how quickly text appears on screen after generation begins.
Output TPS = (number of output tokens) / (time spent generating output tokens)
Total TPS accounts for the complete request lifecycle, including both prompt processing (prefill) and token generation (decode):
Total TPS = (number of output tokens) / (total request time)
Total TPS is always lower than output TPS because it includes the prefill phase, during which no output tokens are produced.
System throughput TPS measures the total number of tokens generated per second across all concurrent requests being served by the system:
System throughput TPS = (total output tokens across all requests) / (time period)
This metric is most relevant for infrastructure planning and cost analysis, as it determines how many requests a deployment can handle simultaneously.
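The three definitions can be sketched as small helper functions operating on per-request timestamps. The field names here are illustrative, not any particular serving framework's schema:

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    # Illustrative fields; real serving logs may name these differently.
    sent_at: float          # when the request was sent (seconds)
    first_token_at: float   # when the first output token arrived
    finished_at: float      # when the last output token arrived
    output_tokens: int      # number of generated tokens

def output_tps(r: RequestTiming) -> float:
    # Excludes prefill: the clock starts at the first output token.
    return r.output_tokens / (r.finished_at - r.first_token_at)

def total_tps(r: RequestTiming) -> float:
    # Includes prefill: the clock starts when the request was sent.
    return r.output_tokens / (r.finished_at - r.sent_at)

def system_throughput_tps(requests: list, period_seconds: float) -> float:
    # Total tokens generated across all requests in a measurement window.
    return sum(r.output_tokens for r in requests) / period_seconds
```

For a request that spends 0.5 s in prefill and generates 100 tokens over the following 2 s, output TPS is 50 while total TPS is 40, matching the rule that total TPS is always the lower of the two.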
| TPS metric | Scope | Includes prefill? | Most relevant for |
|---|---|---|---|
| Output TPS | Single request | No | User experience during streaming |
| Total TPS | Single request | Yes | End-to-end latency assessment |
| System throughput TPS | All concurrent requests | Varies | Infrastructure planning, cost analysis |
Time to first token (TTFT) measures the time between sending a request and receiving the first output token. It is distinct from TPS but closely related, as both contribute to the overall user experience.
TTFT = time when first token is received - time when request is sent
TTFT includes several components:
| Component | Description |
|---|---|
| Network latency | Time for the request to travel from client to server and the first token to travel back |
| Queue wait time | Time the request spends waiting in the serving queue before processing begins |
| Prefill time | Time to process the input prompt through the model and populate the KV cache |
TTFT is particularly important for interactive applications like chatbots and AI coding assistants, where users expect near-instant acknowledgment that the model has started generating. A TTFT of under 500 milliseconds is generally considered acceptable for real-time conversational applications, while batch processing pipelines are less sensitive to TTFT.
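A minimal sketch of measuring TTFT and output TPS from a streaming response, assuming a generic iterator that yields decoded tokens; `fake_stream` is a stand-in that simulates prefill and per-token delays:

```python
import time

def measure_streaming(stream, send_time):
    # TTFT: time from sending the request to the first token arriving.
    # Output TPS: tokens after the first, divided by the decode window.
    first_token_time = None
    n_tokens = 0
    for _token in stream:
        now = time.perf_counter()
        if first_token_time is None:
            first_token_time = now
        n_tokens += 1
    end = time.perf_counter()
    ttft = first_token_time - send_time
    tps = (n_tokens - 1) / (end - first_token_time) if n_tokens > 1 else 0.0
    return ttft, tps

def fake_stream():
    # Simulated response: 50 ms "prefill", then 10 tokens at ~10 ms each.
    time.sleep(0.05)
    for _ in range(10):
        time.sleep(0.01)
        yield "tok"

send = time.perf_counter()
ttft, tps = measure_streaming(fake_stream(), send)
```

Note that TPS here is computed only over the decode window, so the 50 ms prefill delay inflates TTFT but leaves output TPS unaffected, mirroring the distinction drawn above.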
TTFT and output TPS measure different phases of inference and are affected by different factors:
| Metric | Phase | Affected by prompt length? | Affected by output length? | User impact |
|---|---|---|---|---|
| TTFT | Prefill | Strongly: longer prompts increase prefill time | No | How long the user waits before seeing any response |
| Output TPS | Decode | Weakly (via KV cache size) | Weakly | How fast text appears during streaming |
For reasoning models that produce internal chain-of-thought tokens before the visible response, TTFT may include the time to generate reasoning tokens, which can be substantial (seconds to minutes for complex queries).
Many factors influence tokens per second, spanning hardware, model architecture, software optimization, and serving configuration.
Model size (measured in parameters) is the most fundamental determinant of TPS. Larger models require more computation per token and more memory bandwidth to load weights, directly reducing generation speed.
| Model size | Typical output TPS (single A100 GPU, FP16) | Notes |
|---|---|---|
| 1-3B parameters | 100-200+ | Small models run very fast |
| 7-8B parameters | 50-100 | The most popular size for local deployment |
| 13-14B parameters | 30-60 | Good quality-speed balance |
| 30-34B parameters | 15-30 | Requires more VRAM or quantization |
| 65-70B parameters | 8-20 | Usually requires multi-GPU or quantization |
| 180B+ parameters | 3-10 | Requires multi-node deployment |
These are rough estimates and vary significantly depending on hardware, software stack, quantization, and batch size.
Quantization reduces the precision of model weights from higher-precision formats (FP32, FP16, BF16) to lower-precision formats (INT8, INT4, FP8). This reduces memory usage and increases TPS, often with minimal impact on output quality.
| Quantization format | Bits per weight | Memory reduction vs. FP16 | Typical speedup | Quality impact |
|---|---|---|---|---|
| FP16 / BF16 | 16 | Baseline | Baseline | None |
| FP8 | 8 | ~2x | ~1.5-2x | Minimal |
| INT8 (W8A8) | 8 | ~2x | ~1.5-2x | Minimal |
| INT4 (GPTQ, AWQ) | 4 | ~4x | ~2-3x | Small; model-dependent |
| 2-3 bit | 2-3 | ~5-8x | ~3-4x | Noticeable degradation |
Quantization improves TPS primarily by reducing memory bandwidth requirements. During the decode phase, LLM inference is typically memory-bandwidth bound: the bottleneck is loading model weights from GPU memory, not performing computations. Smaller weights mean less data to transfer, which translates directly to higher TPS.
Notable quantization methods include GPTQ (post-training quantization using Hessian-based optimization), AWQ (activation-aware weight quantization), and GGUF (the format used by llama.cpp for CPU and mixed CPU-GPU inference).
The choice of hardware profoundly affects TPS. The two most important hardware characteristics for LLM inference are memory bandwidth and compute throughput.
| GPU | Memory | Memory bandwidth | FP16 TFLOPS | Relative LLM inference speed |
|---|---|---|---|---|
| NVIDIA A100 80GB | 80 GB HBM2e | 2.0 TB/s | 312 | 1x (baseline) |
| NVIDIA H100 SXM | 80 GB HBM3 | 3.35 TB/s | 989 | ~2-3x |
| NVIDIA H200 | 141 GB HBM3e | 4.8 TB/s | 989 | ~3-4x |
| NVIDIA B200 | 192 GB HBM3e | 8.0 TB/s | 2,250 | ~5-8x (projected) |
| Apple M2 Ultra | 192 GB unified | 0.8 TB/s | ~27 | ~0.3-0.5x |
| AMD MI300X | 192 GB HBM3 | 5.3 TB/s | 1,307 | ~2.5-3.5x |
For the memory-bandwidth-bound decode phase, memory bandwidth is the primary determinant of TPS. The NVIDIA H100 achieves roughly 2-3 times the inference throughput of the A100, driven by its ~1.7x advantage in memory bandwidth combined with architectural improvements such as the Transformer Engine and native FP8 support.
Batch size (the number of requests processed simultaneously) significantly affects both per-request TPS and system throughput TPS, but in opposite directions.
| Batch size | Per-request output TPS | System throughput TPS | Hardware utilization |
|---|---|---|---|
| 1 | Highest | Lowest | Low (memory-bandwidth bound) |
| 8-16 | Moderate | Moderate | Moderate |
| 32-64 | Lower | Higher | High |
| 128+ | Lowest per-request | Highest total | Maximum (compute bound) |
At small batch sizes, inference is memory-bandwidth bound, and increasing the batch size improves total throughput without proportionally increasing latency. At large batch sizes, inference becomes compute bound, and further increases in batch size begin to slow down individual requests.
Continuous batching (also called iteration-level batching), used by serving frameworks like vLLM and TensorRT-LLM, dynamically adds new requests to the batch as existing requests complete. This maintains high throughput without the latency penalty of waiting for a full batch to accumulate.
The inference software stack has a large impact on TPS. Key optimizations include:
| Optimization | Description | TPS impact |
|---|---|---|
| KV cache management | Reuses computed attention key-value pairs across generation steps | Essential; without it, generation would be impractically slow |
| Flash Attention | Memory-efficient attention algorithm that reduces memory reads/writes | 1.5-3x speedup for long sequences |
| PagedAttention (vLLM) | Manages KV cache like virtual memory pages, reducing fragmentation | Up to 2-4x throughput improvement |
| Tensor parallelism | Splits model layers across multiple GPUs | Enables larger models; near-linear scaling with good interconnect |
| Speculative decoding | Uses a small draft model to predict multiple tokens, verified in parallel | 2-3x speedup without quality loss |
| Kernel fusion | Combines multiple operations into single GPU kernels | 10-30% improvement |
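The speculative decoding row can be illustrated with a toy draft-and-verify loop. Both `draft_model` and `target_model` are stand-in functions invented for this sketch; real systems verify all draft tokens in a single parallel forward pass of the target model:

```python
def target_model(context):
    # Stand-in "target" model: deterministic +1 rule for demonstration.
    return context[-1] + 1

def draft_model(context, k):
    # Stand-in "draft" model: matches the target except on its last
    # guess, which is deliberately wrong to illustrate rejection.
    out, cur = [], context[-1]
    for i in range(k):
        cur += 1
        out.append(cur if i < k - 1 else cur + 4)
    return out

def speculative_step(context, k=4):
    # Accept draft tokens greedily until the first mismatch with the
    # target; on mismatch, substitute the target's own token.
    proposed = draft_model(context, k)
    ctx, accepted = list(context), []
    for tok in proposed:
        if target_model(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    if len(accepted) < k:
        accepted.append(target_model(ctx))
    return accepted

tokens = speculative_step([1], k=4)   # draft proposes [2, 3, 4, 9]
```

One step yields four tokens for the price of one target verification pass plus four cheap draft calls, which is where the 2-3x speedup comes from when the draft's acceptance rate is high.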
The length of the input prompt (context) affects both TTFT and, to a lesser extent, output TPS. Longer prompts require more time for the prefill phase and produce larger KV caches, which increase memory pressure during the decode phase.
| Context length | Effect on TTFT | Effect on output TPS |
|---|---|---|
| Short (< 1K tokens) | Minimal | Negligible |
| Medium (1K-8K tokens) | Moderate increase | Small decrease |
| Long (8K-32K tokens) | Significant increase | Moderate decrease (due to KV cache size) |
| Very long (32K-128K+ tokens) | Large increase (seconds) | Noticeable decrease |
Accurate TPS benchmarking requires careful methodology to produce meaningful, reproducible results.
Standardized tokenization. Different models use different tokenizers, so the same text may produce different token counts depending on the model. Artificial Analysis standardizes all measurements to OpenAI GPT-4 tokens (o200k_base tokenizer) to allow fair cross-model comparisons. Other benchmarks may use each model's native tokenizer, which can make comparisons misleading.
Warm-up runs. Initial requests may be slower due to model loading, kernel compilation (for JIT-compiled frameworks), and cache population. Benchmarks should include warm-up requests that are excluded from measurements.
Statistical reporting. TPS can vary between requests due to server load, network conditions, and other factors. Robust benchmarks report statistical aggregates. Artificial Analysis reports the median (P50) over the past 72 hours to reflect sustained performance. Other benchmarks may report P50, P90, P95, or P99 latencies.
Fixed prompt and output lengths. For comparable results, benchmarks should use standardized prompt lengths and target output lengths. Artificial Analysis uses several standardized workloads including short prompts and 100-token outputs.
Concurrent requests. Benchmarks should specify the level of concurrency. Single-request benchmarks measure per-user speed; high-concurrency benchmarks measure system throughput.
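These practices can be combined into a minimal benchmarking harness; `run_request` is a hypothetical caller-supplied function that performs one request and returns its output TPS:

```python
import math
import statistics

def benchmark_tps(run_request, n_warmup=3, n_runs=20):
    # Warm-up requests are issued but excluded from the measurements.
    for _ in range(n_warmup):
        run_request()
    samples = sorted(run_request() for _ in range(n_runs))
    p95_index = math.ceil(0.95 * n_runs) - 1   # nearest-rank percentile
    return {
        "p50": statistics.median(samples),
        "p95": samples[p95_index],
        "min": samples[0],
        "max": samples[-1],
    }
```

Reporting P50 alongside a tail percentile (P95 here) matters because a provider with a good median but a long tail delivers an inconsistent user experience that the median alone hides.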
Artificial Analysis is the most widely referenced independent benchmarking platform for LLM API performance. Founded in 2023, it continuously benchmarks over 300 models and 500 API endpoints across providers including OpenAI, Anthropic, Google, Mistral, and numerous open-model hosting providers.
Key aspects of their methodology include standardizing all token counts to the OpenAI o200k_base tokenizer, reporting the median over the past 72 hours to reflect sustained performance, and using fixed prompt and output lengths across providers.
As of early 2026, some of the fastest models on the Artificial Analysis leaderboard include Mercury Coder (835+ tokens/second), NVIDIA Nemotron models (400+ tokens/second), and various optimized 7-8B parameter models (300+ tokens/second).
NVIDIA provides benchmarking tools through its NIM (NVIDIA Inference Microservices) platform and TensorRT-LLM library. Their AIPerf benchmark measures throughput, TTFT, and inter-token latency under controlled conditions. NVIDIA's benchmarks are particularly useful for evaluating the impact of hardware choices and optimization settings.
| Framework | Provider | Focus |
|---|---|---|
| LLMPerf | Anyscale | API endpoint benchmarking with configurable concurrency |
| vLLM benchmarks | vLLM project | Serving throughput with PagedAttention |
| llama.cpp benchmarks | llama.cpp community | Local inference on consumer hardware |
| MLPerf Inference | MLCommons | Standardized ML inference benchmarking across hardware |
TPS requirements and achievable values differ dramatically across deployment scenarios.
| Scenario | Typical TPS requirement | Typical hardware | Key optimization priority |
|---|---|---|---|
| Cloud API (consumer chatbot) | 30-80 output TPS per user | H100/H200 clusters | Low TTFT, consistent speed |
| Cloud API (batch processing) | Maximize system throughput | H100/H200 clusters | High concurrent throughput, cost per token |
| On-premise enterprise | 20-60 output TPS per user | A100/H100 | Balance of speed and cost |
| Edge/mobile deployment | 5-20 output TPS | Apple Silicon, mobile NPUs | Low memory usage, energy efficiency |
| Local hobbyist | 10-40 output TPS | Consumer GPUs (RTX 4090), Apple M-series | Quantized models, llama.cpp/Ollama |
| Research/experimentation | Variable | Mixed | Flexibility and reproducibility |
The relationship between TPS and user experience follows a pattern of diminishing returns. Research and user studies suggest the following thresholds:
| TPS range | User experience |
|---|---|
| < 5 TPS | Noticeably slow; users report frustration |
| 5-15 TPS | Acceptable for non-interactive use |
| 15-30 TPS | Comfortable reading speed; most users find this satisfactory for chat |
| 30-60 TPS | Fast; exceeds typical reading speed for most users |
| 60+ TPS | Diminishing perceptual benefit; text appears nearly instantaneous |
The average adult reads at approximately 200-300 words per minute, which corresponds to roughly 3-5 words per second, or about 4-7 tokens per second given typical tokenization ratios of around 1.3 tokens per English word. Output speeds of 30+ TPS therefore significantly exceed human reading speed, and further improvements primarily benefit batch processing, programmatic use cases, and TTFT-sensitive applications rather than direct user perception.
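The conversion is straightforward, assuming roughly 1.3 tokens per English word:

```python
def reading_speed_tps(words_per_minute, tokens_per_word=1.3):
    # Converts a human reading speed into an equivalent token rate.
    return words_per_minute / 60 * tokens_per_word

slow_reader = reading_speed_tps(200)   # ~4.3 tokens/s
fast_reader = reading_speed_tps(300)   # 6.5 tokens/s
```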
TPS is directly related to the cost of serving LLM requests. Higher TPS means each GPU-second produces more tokens, reducing the cost per token.
Cost per million tokens (approx.) = (GPU cost per second * 1,000,000) / (system throughput TPS)
For example, if an H100 server costs $3 per hour ($0.000833 per second) and achieves a system throughput of 1,000 TPS, the cost per million output tokens is approximately $0.83. Improving throughput to 2,000 TPS would halve this cost to approximately $0.42 per million tokens.
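The formula above as a small helper, reproducing the worked example:

```python
def cost_per_million_tokens(gpu_cost_per_hour, system_tps):
    # Cost of generating one million output tokens at a sustained
    # system throughput, given an hourly GPU rental price.
    cost_per_second = gpu_cost_per_hour / 3600
    return cost_per_second * 1_000_000 / system_tps

at_1000_tps = cost_per_million_tokens(3.0, 1000)   # ~$0.83
at_2000_tps = cost_per_million_tokens(3.0, 2000)   # ~$0.42
```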
This economic relationship drives the intense industry focus on inference optimization, as even small improvements in TPS translate to significant cost savings at scale.
TPS for frontier models has improved steadily due to advances in hardware, software optimization, and model architecture.
| Year | Milestone |
|---|---|
| 2020 | GPT-3 (175B) served at ~10-20 TPS per request on A100 clusters |
| 2022 | Introduction of Flash Attention significantly improves throughput for long contexts |
| 2023 | vLLM introduces PagedAttention, improving serving throughput by 2-4x; speculative decoding gains traction |
| 2024 | FP8 quantization on H100 becomes standard; specialized inference chips (Groq LPU, Cerebras) achieve 300-500+ TPS |
| 2025 | NVIDIA Blackwell (B200) begins deployment; Groq and competitors push toward 1,000+ TPS for medium-sized models |
| 2026 | Mercury Coder achieves 835+ TPS on Artificial Analysis leaderboard; inference optimization continues to accelerate |
| Misconception | Reality |
|---|---|
| Higher TPS always means better model | TPS measures speed, not quality. A small, fast model may generate worse text than a larger, slower one. |
| TPS is constant throughout generation | TPS can vary during generation due to KV cache growth, batch scheduling, and attention computation scaling with sequence length. |
| Comparing TPS across models is straightforward | Different tokenizers mean the same text produces different token counts. Cross-model comparisons require standardized tokenization. |
| Doubling GPU count doubles TPS | Tensor parallelism introduces communication overhead. Scaling is sub-linear, especially beyond 4-8 GPUs. |
| Quantization always reduces quality significantly | Modern quantization methods (GPTQ, AWQ) at 4-bit precision often produce negligible quality degradation for most tasks. |