Performance
Last reviewed
May 11, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 2,298 words
See also: Machine learning terms
Performance in machine learning is an overloaded word. It refers to two related but distinct ideas. The first is the quality of a model's predictions: how accurate or useful its outputs are for a given task. The second is computational performance: how fast and cost-efficient the model is to train and serve, measured in latency, throughput, memory bandwidth, and floating point operations per second.
Both senses matter in practice. A model that scores well on benchmarks is useless if it takes ten seconds per token to respond, and a fast model that returns wrong answers is just wrong faster. Engineers usually have to trade one against the other, picking a smaller model, a lower numeric precision, or a different batch size to land on a better point on the quality versus speed curve. This article covers both meanings, starting with prediction quality and then moving to the systems side, including MLPerf and modern accelerator metrics.
When a paper or a model card reports "performance," it usually means how well the model does on a held-out test set. Different tasks call for different metrics, and choosing the wrong metric is an easy way to ship a model that looks good on paper but behaves badly in production.
In a classification task a model assigns each input to one of several predefined classes.
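As a minimal sketch of how common classification metrics relate to one another, the following computes accuracy, precision, recall, and F1 for a single positive class. The function name `classification_metrics` and the toy labels are illustrative assumptions, not part of any particular library.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced data: 8 negatives, 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

On this toy data accuracy is 0.9 while recall is only 0.5, which illustrates why accuracy alone can flatter a model on imbalanced classes.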
In regression, the model predicts a continuous value rather than a class label.
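A short sketch of three widely used regression metrics follows: mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R²). The function name and the house-price numbers are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, RMSE, and R^2 for paired true and predicted values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical house prices in thousands.
prices_true = [200.0, 300.0, 400.0, 500.0]
prices_pred = [210.0, 290.0, 420.0, 480.0]
reg = regression_metrics(prices_true, prices_pred)
print(reg)
```

RMSE penalizes large errors more heavily than MAE, so the two can rank models differently when a few predictions are far off.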
For large language models, traditional accuracy is not enough. Researchers evaluate these models on standardized benchmarks:
Reporting numbers on the data you trained on is one of the oldest ways to lie with statistics. Real measurement requires a separate dataset.
Good practice is to keep a true holdout that is touched only once, after model selection and hyperparameter tuning are done. Reusing the test set for those decisions leaks information about it into the model and inflates reported numbers.
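One way to set up such a holdout is a three-way split made once, up front, with a fixed seed. The sketch below is an assumption about how this might be done with the standard library; the function name `three_way_split` and the 80/10/10 proportions are illustrative choices.

```python
import random


def three_way_split(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once with a fixed seed, then cut into train, validation, test."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test = items[:n_test]                 # evaluate on this once, at the very end
    val = items[n_test:n_test + n_val]    # use this for tuning and model selection
    train = items[n_test + n_val:]
    return train, val, test


train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Tuning against the validation split keeps the test split untouched until the final measurement.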
Several factors shape how well a model predicts:
The other sense of performance is the systems sense: how fast does the model run, on what hardware, at what cost. This has become the dominant sense in the LLM era, where serving a chatbot can cost more than training it.
Floating point operations per second (FLOPS) measures raw arithmetic throughput. One teraFLOPS (TFLOPS) is one trillion floating point operations per second; one petaFLOPS (PFLOPS) is one thousand TFLOPS. Accelerators are rated separately for each numeric precision (FP64, FP32, BF16, FP16, FP8, FP4).
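To make the units concrete, the sketch below counts the operations in a dense matrix multiplication (an m×k matrix times a k×n matrix takes about 2·m·k·n floating point operations, one multiply and one add per term) and converts a measured time into achieved TFLOPS. The 2 ms timing is a made-up number, not a measurement.

```python
def matmul_flops(m, k, n):
    """Floating point operations for an (m x k) @ (k x n) matrix multiply."""
    return 2 * m * k * n  # one multiply and one add per inner-product term


def achieved_tflops(flops, seconds):
    """Convert an operation count and wall-clock time into TFLOPS."""
    return flops / seconds / 1e12


flops = matmul_flops(8192, 8192, 8192)
hypothetical_seconds = 0.002  # assumed timing, for illustration only
rate = achieved_tflops(flops, hypothetical_seconds)
print(f"{flops:.3e} FLOPs in {hypothetical_seconds} s = {rate:.1f} TFLOPS")
```

Comparing an achieved figure like this against the rated peak for the same precision shows how much of the hardware a kernel actually uses.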
The NVIDIA H100 (Hopper architecture, announced 2022) delivers up to about 989 TFLOPS of dense FP16 tensor throughput in its SXM form, doubling to about 1,979 TFLOPS with structured sparsity. It also supports FP8, which doubles those figures again (about 1,979 TFLOPS dense and 3,958 TFLOPS sparse). The B200 (Blackwell architecture, announced 2024) goes significantly higher and adds FP6 and FP4 precisions, reporting up to 9 PFLOPS of dense FP4 tensor throughput. The B200 also more than doubles memory capacity, from 80 GB on the H100 to 180 GB of HBM3e, and pushes aggregate memory bandwidth to roughly 7.7 TB/s, about 2.3 times the H100.
Raw FLOPS is a useful headline number but rarely tells the whole story, because real workloads are often limited by memory rather than arithmetic.
Arithmetic intensity is the ratio of arithmetic operations to bytes moved from memory. Kernels with high arithmetic intensity are compute-bound and benefit from more FLOPS. Kernels with low arithmetic intensity are memory-bound and benefit from higher memory bandwidth, not more compute.
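The compute-bound versus memory-bound distinction can be sketched with a simple roofline calculation: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The hardware numbers below are round, hypothetical values chosen for illustration, not the specification of any real accelerator.

```python
def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity (FLOPs/byte) at which a kernel stops being memory-bound."""
    return peak_flops / mem_bandwidth


def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: limited by either compute or memory traffic."""
    return min(peak_flops, intensity * mem_bandwidth)


def regime(intensity, peak_flops, mem_bandwidth):
    """Label a kernel by which resource limits it."""
    if intensity >= ridge_point(peak_flops, mem_bandwidth):
        return "compute-bound"
    return "memory-bound"


# Hypothetical accelerator: 1 PFLOPS peak, 3 TB/s memory bandwidth.
PEAK = 1e15
BANDWIDTH = 3e12

for intensity in (2, 1000):
    bound = attainable_flops(intensity, PEAK, BANDWIDTH)
    print(intensity, regime(intensity, PEAK, BANDWIDTH), f"{bound:.2e} FLOPS")
```

With these assumed numbers the ridge point is about 333 FLOPs per byte, so a kernel at 2 FLOPs per byte can reach only a small fraction of peak compute no matter how many arithmetic units are added.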
LLM inference splits cleanly into two regimes. The prefill phase processes the input prompt in parallel and tends to be compute-bound. The decode phase generates one token at a time and is heavily memory-bound, because every step requires loading the model weights and KV cache from memory. At batch size 1, decode arithmetic intensity is on the order of 1 to 2 FLOPs per byte, so memory bandwidth caps performance long before the compute units saturate. Larger batches raise arithmetic intensity (the same weights serve more requests per load) and let the GPU approach its theoretical peak. This is why "memory wall" is a recurring phrase in modern AI hardware discussions.
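A back-of-the-envelope version of this argument: if every decode step must read all weights once, the step rate cannot exceed memory bandwidth divided by the weight size, and each weight contributes about 2 FLOPs per sequence in the batch. The sketch below deliberately ignores KV cache and activation traffic, and the model size and bandwidth are hypothetical.

```python
def decode_steps_per_second_bound(n_params, bytes_per_param, mem_bandwidth):
    """Upper bound on decode steps per second when weight reads dominate."""
    weight_bytes = n_params * bytes_per_param
    return mem_bandwidth / weight_bytes


def decode_arithmetic_intensity(batch_size, bytes_per_param):
    """Approximate FLOPs per byte of weights read in one decode step.

    Each parameter is used in one multiply and one add per sequence,
    while the weights are loaded once for the whole batch.
    """
    return 2 * batch_size / bytes_per_param


# Hypothetical setup: 7B parameters in FP16 (2 bytes each), 3 TB/s bandwidth.
steps = decode_steps_per_second_bound(7e9, 2, 3e12)
print(f"at most {steps:.0f} decode steps per second")

for batch in (1, 64):
    print(batch, decode_arithmetic_intensity(batch, 2), "FLOPs/byte")
```

At batch size 1 each step yields one token, so the bound is also a tokens-per-second ceiling for a single user; at batch size 64 the same weight read serves 64 sequences and the intensity rises proportionally.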
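For real-time applications the key numbers are about how long a user waits: time to first token (TTFT), the delay before the first output token appears, and the time between subsequent tokens.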
When the goal is to serve many users rather than minimize one user's latency, throughput matters more, typically measured in tokens per second or requests per second across all users.
There is a fundamental tradeoff between latency and throughput. Larger batches amortize fixed costs over more tokens and raise throughput, but they force individual requests to wait for the batch to fill, raising TTFT. Continuous batching, speculative decoding, and paged attention are common techniques for moving the frontier outward.
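As an illustration of how TTFT and per-request throughput might be derived from a streamed response, the sketch below works from a request timestamp and the arrival time of each token. The function name and the timestamps are invented; "time per output token" here means the average gap between consecutive tokens after the first.

```python
def latency_metrics(request_time, token_times):
    """Derive TTFT, time per output token, and tokens/s for one request."""
    n = len(token_times)
    ttft = token_times[0] - request_time
    # Average gap between consecutive tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    total = token_times[-1] - request_time
    return {"ttft": ttft, "tpot": tpot, "tokens_per_second": n / total}


# Hypothetical stream: first token after 0.5 s, then one token every 50 ms.
stream = latency_metrics(0.0, [0.5, 0.55, 0.6, 0.65, 0.7])
print(stream)
```

Batching more requests together tends to raise aggregate tokens per second across users while lengthening the TTFT that each individual request sees.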
Training is run with large batches, mixed precision (BF16 or FP8), and many GPUs working together for days or weeks. The bottleneck is usually a mix of compute, memory, and inter-GPU communication. Common metrics include time to a target loss, total FLOPs consumed, and tokens per second per GPU.
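Two of these training metrics can be sketched in a few lines. Tokens per second per GPU is a direct ratio. For total FLOPs, the sketch uses the common rule of thumb of about 6 FLOPs per parameter per training token for dense transformers, which is an outside approximation rather than something this article defines. All inputs are hypothetical.

```python
def tokens_per_second_per_gpu(tokens_processed, seconds, n_gpus):
    """Training throughput normalized by the number of GPUs."""
    return tokens_processed / seconds / n_gpus


def approx_training_flops(n_params, n_tokens):
    """Rough total training compute for a dense transformer.

    Assumes ~6 FLOPs per parameter per token (forward plus backward),
    a rule of thumb rather than an exact count.
    """
    return 6 * n_params * n_tokens


# Hypothetical run: 1e9 tokens processed in 1,000 s on 64 GPUs.
throughput = tokens_per_second_per_gpu(1e9, 1000, 64)
# Hypothetical budget: an 8B-parameter model trained on 1e12 tokens.
total_flops = approx_training_flops(8e9, 1e12)
print(f"{throughput:.0f} tokens/s/GPU, about {total_flops:.1e} total FLOPs")
```

Normalizing by GPU count makes runs on different cluster sizes comparable, though it hides communication overhead that grows with scale.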
Inference often runs at much smaller batches, sometimes a single user at a time, under hard latency constraints. The bottleneck shifts toward memory bandwidth and KV cache size, especially during decode. Inference is also where lower precisions like INT8, FP8, and FP4 pay off: the accuracy hit is often small, while peak tensor throughput roughly doubles with each halving of bit width on Blackwell-class hardware.
MLPerf is the de facto industry benchmark for AI hardware and software systems. It was started in 2018 by a coalition of academic and industry researchers and is now maintained by MLCommons, a non-profit consortium with over 100 members that launched in December 2020. MLPerf has separate suites for training and inference, and for different deployment scenarios (datacenter, edge, mobile, tiny). Results are submitted under controlled rules and reviewed by other participants before release.
MLPerf Inference v5.1, released in September 2025, set a participation record with 27 organizations submitting systems. It introduced three new benchmarks: DeepSeek-R1 (the first reasoning model in the suite), Llama 3.1 8B for summarization, and Whisper Large V3 for speech recognition. Top systems improved by as much as 50% over v5.0 six months earlier. Llama 2 70B remained the most popular benchmark with 24 submitters, and new accelerators included the AMD Instinct MI355X and NVIDIA's GB300 and RTX Pro 6000 Blackwell.
MLPerf Training v5.1, released in November 2025, included 65 unique systems across 12 different hardware accelerators. Generative AI benchmark improvements outpaced Moore's Law, and Llama 3.1 8B replaced BERT while Flux.1 replaced Stable Diffusion v2 to reflect current workloads.
You almost never optimize one sense of performance in isolation. Quantizing a model to INT8 or FP8 may give 2x throughput at the cost of a fraction of a point on accuracy. Distilling a 70B model into a 7B student fits smaller hardware but rarely matches the teacher on every benchmark. Speculative decoding speeds up generation without quality loss but requires a smaller draft model that uses extra memory.
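To show where the quantization accuracy cost comes from, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. Production systems use more elaborate schemes (per-channel scales, calibration data, and so on); the function names and weight values here are illustrative assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map integers back to approximate floating point values."""
    return [q * scale for q in quantized]


# Hypothetical weights.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in one byte instead of two or four, and the rounding error per value is at most half the scale, which is the source of the small accuracy loss.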
The right point on the quality-cost-latency curve depends on the application. A coding assistant can spend a few hundred milliseconds on a heavyweight model. A real-time speech assistant cannot. Picking the right metric, measuring it honestly, and being clear about which sense of "performance" you mean is most of the job.
Performance in machine learning is two different things wearing the same name.
The first kind is like getting a grade on a test. It tells us how often the computer is right when it guesses what's in a picture or how much a house costs. Different grading systems exist, like accuracy (how many it got right) or F1 score (a balance between being careful and being thorough).
The second kind is like how fast you can finish a race. Even if you know all the answers, you need to give them quickly. A model that takes ten seconds to reply is annoying even if it is correct. So engineers also measure how many words per second a model produces and how long you wait before the first word appears.
The tricky part is that being more accurate often makes you slower, and being faster often makes you a little less accurate. Choosing the right balance is most of the work.