# Performance

> Source: https://aiwiki.ai/wiki/performance
> Updated: 2026-07-17
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

## Introduction

Performance in [machine learning](/wiki/machine_learning) is an overloaded word. It refers to two related but distinct ideas. The first is the quality of a model's predictions: how accurate or useful its outputs are for a given task. The second is computational performance: how fast and cost-efficient the model is to train and serve, measured in latency, throughput, memory bandwidth, and floating point operations per second.

Both senses matter in practice. A model that scores well on benchmarks is useless if it takes ten seconds per token to respond, and a fast model that returns wrong answers is just wrong faster. Engineers usually have to trade one against the other, picking a smaller model, a lower numeric precision, or a different batch size to land on a better point on the quality versus speed curve. This article covers both meanings, starting with prediction quality and then moving to the systems side, including [MLPerf](/wiki/mlperf) and modern accelerator metrics.

## Performance as prediction quality

When a paper or a model card reports "performance," it usually means how well the model does on a held-out test set. Different tasks call for different metrics, and choosing the wrong metric is an easy way to ship a model that looks good on paper but behaves badly in production.

### Classification metrics

In a classification task a model assigns each input to one of several predefined classes.

- [Accuracy](/wiki/accuracy): correct predictions divided by total predictions, (TP + TN) / (TP + TN + FP + FN). Intuitive but misleading on imbalanced datasets. A model that always predicts the majority class can hit 99% accuracy on a problem where 1% of cases are positive.[1]
- [Precision](/wiki/precision): TP / (TP + FP). High precision means the model is rarely wrong when it predicts positive.[1]
- [Recall](/wiki/recall): also called sensitivity or true positive rate, TP / (TP + FN). High recall means the model catches most actual positives, which is what you want in medical screening even at the cost of more false alarms.[1]
- F1 score: the harmonic mean of precision and recall, 2TP / (2TP + FP + FN). Ranges from 0 to 1 and is the standard summary when you care about both false positives and false negatives.[1]
- ROC-AUC: the area under the receiver operating characteristic curve, which plots true positive rate against false positive rate as the threshold varies. 1.0 is perfect separation, 0.5 is random guessing.[3]
- PR-AUC: the area under the precision-recall curve, generally preferred over ROC-AUC on heavily imbalanced data.[3]
- Log loss (cross-entropy): penalizes confident wrong predictions more than uncertain ones, and is the loss function for most probabilistic classifiers.[3]
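
The confusion-matrix formulas above are easy to check by hand. The following is a minimal sketch in plain Python, not a replacement for a library implementation; the function names and the example counts are hypothetical and chosen only to illustrate the imbalanced-data caveat and the way log loss punishes confident mistakes.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Return 0.0 instead of dividing by zero when a denominator is empty
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy averaged over examples (natural log)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical dataset: 1,000 examples, 10 of them positive.
# A model that always predicts "negative" misses every positive.
majority = classification_metrics(tp=0, fp=0, tn=990, fn=10)
print(majority)  # accuracy is 0.99, but recall and F1 are 0.0

# A confident wrong prediction costs far more than an uncertain one.
confident_wrong = log_loss([1], [0.01])  # about 4.61
uncertain_wrong = log_loss([1], [0.40])  # about 0.92
print(confident_wrong, uncertain_wrong)
```

The zero-division guards are a design choice for this sketch; some libraries raise a warning or let the caller pick the fallback value instead.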

### Regression metrics

In regression, the model predicts a continuous value rather than a class label.

- [Mean Absolute Error (MAE)](/wiki/mean_absolute_error_mae): the average absolute difference between predicted and actual values. Less sensitive to outliers than MSE and easy to interpret, since it is in the same units as the target.[3]
- [Mean Squared Error (MSE)](/wiki/mean_squared_error_mse): the average squared difference. Penalizes large errors more than small ones.[3]
- [Root Mean Squared Error (RMSE)](/wiki/root_mean_squared_error_rmse): the square root of MSE, in the same units as the target.[3]
- R-squared: the proportion of variance the model explains. Reaches 1.0 only for perfect predictions and can be negative for very bad models.[3]
- Mean Absolute Percentage Error (MAPE): average relative error as a percentage, useful when relative scale matters more than absolute size.[3]
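
As a rough illustration of how these metrics react to the same predictions, the sketch below computes all five in plain Python. The function name and the toy values are hypothetical; the code assumes the targets are non-zero (MAPE is undefined otherwise) and not all identical (R-squared would divide by zero).

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, R-squared and MAPE for paired lists of numbers."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "mape": mape}

# Toy data: three small errors and one large miss on the last point.
y_true = [100, 200, 300, 400]
y_pred = [110, 190, 310, 300]
scores = regression_metrics(y_true, y_pred)
print(scores)  # MAE 32.5, MSE 2575.0, RMSE about 50.7, R2 about 0.794, MAPE about 10.8
```

The single large miss contributes 100 of the 130 units of total absolute error but 10,000 of the 10,300 units of squared error, which is why RMSE ends up well above MAE here.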

### Generative model benchmarks

For large language models, traditional accuracy is not enough. Researchers evaluate these models on standardized benchmarks:

- MMLU (Massive Multitask Language Understanding): 57 subjects from high school math to professional law. By 2025 top frontier models score above 88%, close to saturation.[13]
- GPQA (Graduate-Level Google-Proof Q&A): 448 hard physics, chemistry, and biology questions written by domain experts. The Diamond subset is one of the most discriminating reasoning benchmarks for frontier models.
- SWE-bench: drops a model into a real GitHub repository and asks it to fix actual bugs. Top systems solved only 4.4% of issues in 2023 but the leading models exceeded 70% by 2024.[13]
- HumanEval: 164 hand-written Python code generation problems.
- Perplexity: a token-level measure of how well a language model predicts a held-out corpus. Lower is better.
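
Of the entries above, perplexity is the one that reduces to a short formula. A minimal sketch, assuming the usual definition (the exponential of the mean negative log-likelihood per token) and natural-log token probabilities as input; the function name and sample values are illustrative only.

```python
import math

def perplexity(token_log_probs):
    """exp(mean negative log-likelihood); expects natural-log probabilities."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# If a model gives every token probability 0.25, it is as uncertain as a
# uniform choice among 4 options, so perplexity comes out at about 4.
uniform_guess = [math.log(0.25)] * 8
ppl_uniform = perplexity(uniform_guess)
print(ppl_uniform)

# A model that is more confident about the observed tokens scores lower.
ppl_confident = perplexity([math.log(0.9)] * 8)
print(ppl_confident)
```

Perplexity values are only comparable between models that share a tokenizer, because the average is taken per token.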

## Validation techniques

Reporting numbers on the data you trained on is one of the oldest ways to lie with statistics. Real measurement requires a separate dataset.

- Holdout validation: a simple training versus test split, typically 70-30 or 80-20. Quick but noisy on small datasets.[2]
- K-fold cross-validation: split into K equal folds, train on K-1 and test on the remaining fold, rotate, and average. K = 5 and K = 10 are the usual choices.[2]
- Stratified K-fold: each fold preserves the class distribution of the full dataset, which matters for imbalanced classification.[2]
- Leave-one-out cross-validation: K equals the number of data points. Almost unbiased but expensive.[2]
- Time series cross-validation: the test fold must come after the training fold so future information does not leak backward.[2]
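
The splitting schemes above differ only in which indices end up in the training and test sets. The sketch below generates those indices in plain Python for plain K-fold and for an expanding-window time series split. Both helper names are hypothetical, the K-fold version does no shuffling or stratification, and the time series version is one of several reasonable layouts rather than a standard.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous folds, without shuffling."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield time-ordered splits where training data always precedes test data."""
    for i in range(n_splits):
        test_start = n_samples - (n_splits - i) * test_size
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(10, 3):
    print("k-fold test:", test_idx)       # folds of size 4, 3, 3

for train_idx, test_idx in expanding_window_splits(10, 3, 2):
    print("time series:", train_idx, "->", test_idx)
```

In the expanding-window output every test index is larger than every training index, which is the property that prevents future information from leaking into training.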

Good practice is to keep a true holdout that is touched only once, after model selection and hyperparameter tuning are done. Reusing the test set inflates reported numbers.

## Factors influencing prediction quality

Several factors shape how well a model predicts:

- Data quality and quantity. More representative data usually beats a fancier model. Mislabeled examples, biased samples, and distribution shift between training and deployment are common reasons a strong test score does not survive in production.
- Feature engineering. For tabular problems, picking and constructing the right features is often the biggest single lever.
- Model complexity. Too simple and the model underfits with high bias. Too complex and it overfits with high variance. The [bias-variance tradeoff](/wiki/bias_variance_tradeoff) describes this balance.[4]
- Hyperparameter tuning. Learning rate, regularization, tree depth, batch size, and similar knobs are not learned by the model. Grid search, random search, and Bayesian optimization automate the search.
- Class imbalance. Skewed datasets often need resampling, class weights, or threshold tuning to avoid trivially predicting the majority class.

## Performance as computational efficiency

The other sense of performance is the systems sense: how fast does the model run, on what hardware, at what cost. This sense has grown in importance in the LLM era, where the cumulative cost of serving a widely used chatbot can exceed the cost of training it.

### Compute throughput: FLOPS and TFLOPS

Floating point operations per second (FLOPS) measures raw arithmetic throughput. One teraFLOPS (TFLOPS) is one trillion floating point operations per second; one petaFLOPS (PFLOPS) is one thousand TFLOPS. Accelerators are rated separately for each numeric precision (FP64, FP32, BF16, FP16, FP8, FP4).

The NVIDIA H100 (Hopper architecture, 2022) is rated at up to 1,979 TFLOPS of FP16 tensor throughput with structured sparsity, or roughly half that (about 989 TFLOPS) for dense math, and reaches about 3,958 TFLOPS with sparsity in FP8.[12] The B200 (Blackwell architecture, 2025) goes significantly higher and adds FP6 and FP4 precisions, reporting up to 9 PFLOPS of dense FP4 tensor throughput.[12] The B200 also more than doubles memory capacity, from 80 GB on the H100 to 180 GB of HBM3e, and pushes aggregate memory bandwidth to roughly 7.7 TB/s, about 2.3 times the H100.[12]

Raw FLOPS is a useful headline number but rarely tells the whole story, because real workloads are often limited by memory rather than arithmetic.

### Memory bandwidth and arithmetic intensity

Arithmetic intensity is the ratio of arithmetic operations to bytes moved from memory. Kernels with high arithmetic intensity are compute-bound and benefit from more FLOPS. Kernels with low arithmetic intensity are memory-bound and benefit from higher memory bandwidth, not more compute.

LLM inference splits cleanly into two regimes. The prefill phase processes the input prompt in parallel and tends to be compute-bound. The decode phase generates one token at a time and is heavily memory-bound, because every step requires loading the model weights and KV cache from memory. At batch size 1, decode arithmetic intensity is on the order of 1 to 2 FLOPs per byte, so memory bandwidth caps performance long before the compute units saturate.[11] Larger batches raise arithmetic intensity (the same weights serve more requests per load) and let the GPU approach its theoretical peak. This is why "memory wall" is a recurring phrase in modern AI hardware discussions.
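
The compute-bound versus memory-bound distinction can be sketched with a simple roofline calculation: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The code below is a back-of-the-envelope illustration only. The accelerator numbers are round hypothetical values rather than any product's specification, the function names are made up for this example, and the decode bound ignores KV cache traffic and assumes the step stays memory-bound at the given batch size.

```python
def attainable_tflops(arithmetic_intensity, peak_tflops, bandwidth_tb_s):
    """Roofline bound: min(peak compute, intensity * memory bandwidth).

    arithmetic_intensity is in FLOPs per byte; TB/s * FLOPs/byte = TFLOPS.
    """
    return min(peak_tflops, arithmetic_intensity * bandwidth_tb_s)

def decode_tokens_per_second_bound(n_params, bytes_per_param,
                                   bandwidth_tb_s, batch_size=1):
    """Upper bound on decode speed if every step streams all weights once."""
    bytes_per_step = n_params * bytes_per_param
    steps_per_second = bandwidth_tb_s * 1e12 / bytes_per_step
    # Each step emits one token per sequence in the batch
    return steps_per_second * batch_size

# Hypothetical accelerator: 1,000 TFLOPS peak, 3 TB/s memory bandwidth.
PEAK_TFLOPS = 1000
BANDWIDTH_TB_S = 3

# Batch-1 decode at about 1 FLOP per byte uses a tiny slice of peak compute.
print(attainable_tflops(1, PEAK_TFLOPS, BANDWIDTH_TB_S))    # 3 TFLOPS
# A kernel needs more than ~333 FLOPs per byte here to become compute-bound.
print(attainable_tflops(500, PEAK_TFLOPS, BANDWIDTH_TB_S))  # 1000 TFLOPS

# A 70B-parameter model in 16-bit weights: about 21 tokens/s at batch size 1.
single = decode_tokens_per_second_bound(70e9, 2, BANDWIDTH_TB_S)
batched = decode_tokens_per_second_bound(70e9, 2, BANDWIDTH_TB_S, batch_size=32)
print(single, batched)
```

The batched figure scales linearly only because the sketch assumes the weights are still the dominant memory traffic; in practice the KV cache grows with batch size and sequence length and eats into that gain.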

### Latency metrics for inference

For real-time applications the key numbers are about how long a user waits:

- Time to first token (TTFT): the delay between sending a request and seeing the first output token. What users perceive as responsiveness.[8]
- Inter-token latency (ITL) or time per output token (TPOT): the average gap between successive tokens. Determines how fast a response streams.[8]
- End-to-end latency: total time from request to last token. Dominated by output length for long generations.[8]
- Tail latency, often reported as p95 or p99: the slowest few percent of requests, which is what users actually remember.[8]
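
These latency metrics can all be derived from timestamps recorded on the client. A minimal sketch, assuming each request logs the time it was sent and the arrival time of every output token, in seconds; the function names and sample timings are hypothetical, and the percentile uses the simple nearest-rank method rather than interpolation.

```python
import math

def request_latency(request_time, token_times):
    """TTFT, mean inter-token latency and end-to-end latency for one request."""
    ttft = token_times[0] - request_time
    e2e = token_times[-1] - request_time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "itl": itl, "e2e": e2e}

def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# One request: first token after 300 ms, then a token every 50 ms.
one = request_latency(0.0, [0.30, 0.35, 0.40, 0.45])
print(one)  # ttft about 0.30 s, itl about 0.05 s, e2e about 0.45 s

# Tail latency over many requests (hypothetical end-to-end times in ms).
e2e_ms = list(range(1, 101))
print(percentile(e2e_ms, 50), percentile(e2e_ms, 95), percentile(e2e_ms, 99))
```

Note that the inter-token average here excludes the first token, so TTFT and ITL measure separate parts of the response.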

### Throughput metrics

When the goal is to serve many users rather than minimize one user's latency, throughput matters more:

- Tokens per second (TPS): aggregate output tokens generated per second across all concurrent requests.[8]
- Requests per second (RPS): how many full requests the system completes per second.[8]
- Model FLOPS utilization (MFU): the fraction of theoretical peak FLOPS actually delivered. Well-tuned training runs on large clusters reach 40 to 60% MFU; production inference is often much lower.[8]
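
MFU is straightforward to estimate once achieved throughput is known. The sketch below assumes the widely used approximation of about 6 FLOPs per parameter per training token for a dense transformer (forward plus backward pass), which ignores attention FLOPs and so slightly underestimates the true count. The function name and the cluster figures are hypothetical.

```python
def training_mfu(n_params, tokens_per_second, peak_flops):
    """Estimate MFU as achieved FLOPS divided by theoretical peak FLOPS.

    Assumes ~6 FLOPs per parameter per token for dense transformer training.
    """
    achieved_flops = 6 * n_params * tokens_per_second
    return achieved_flops / peak_flops

# Hypothetical run: 7B-parameter model, 80,000 tokens/s across 8 accelerators
# that are each rated at 1 PFLOPS (1e15 FLOPS) for the precision in use.
cluster_peak = 8 * 1e15
mfu = training_mfu(7e9, 80_000, cluster_peak)
print(f"MFU = {mfu:.0%}")  # MFU = 42%
```

The peak figure must match the precision and sparsity mode the run actually uses; dividing by a sparse or lower-precision peak makes MFU look worse than it is.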

There is a fundamental tradeoff between latency and throughput. Larger batches amortize fixed costs over more tokens and raise throughput, but they force individual requests to wait for the batch to fill, raising TTFT. Continuous batching, speculative decoding, and paged attention are common techniques for moving the frontier outward.

## Training versus inference performance

Training is run with large batches, mixed precision (BF16 or FP8), and many GPUs working together for days or weeks. The bottleneck is usually a mix of compute, memory, and inter-GPU communication. Common metrics include time to a target loss, total FLOPs consumed, and tokens per second per GPU.

Inference often runs at much smaller batches, sometimes a single user at a time, under hard latency constraints. The bottleneck shifts toward memory bandwidth and KV cache size, especially during decode. Inference is also where lower precisions like INT8, FP8, and FP4 pay off: the accuracy hit is often small, while peak tensor throughput roughly doubles each time the bit width halves (16-bit to 8-bit to 4-bit) on Blackwell-class hardware.
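
Lower precision also shrinks the weights that decode has to stream from memory on every step. A quick sketch of the arithmetic, counting weights only; it ignores the KV cache, activations, and the small overhead of quantization scale factors, and the function name is illustrative.

```python
BITS_PER_PARAM = {"FP32": 32, "BF16": 16, "FP16": 16,
                  "FP8": 8, "INT8": 8, "FP4": 4}

def weight_memory_gb(n_params, precision):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BITS_PER_PARAM[precision] / 8 / 1e9

# A 70B-parameter model at three precisions.
for precision in ("BF16", "FP8", "FP4"):
    print(precision, weight_memory_gb(70e9, precision), "GB")
# BF16 140.0 GB, FP8 70.0 GB, FP4 35.0 GB
```

At 16-bit precision such a model does not fit on a single 80 GB accelerator, while the 8-bit and 4-bit versions do, which is often the deciding factor in how many devices a deployment needs.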

## Industry benchmarks: MLPerf

MLPerf is the de facto industry benchmark for AI hardware and software systems. It was started in 2018 by a coalition of academic and industry researchers and is now maintained by [MLCommons](/wiki/mlcommons), a non-profit consortium with over 100 members that launched in December 2020.[7] MLPerf has separate suites for training and inference, and for different deployment scenarios (datacenter, edge, mobile, tiny). Results are submitted under controlled rules and reviewed by other participants before release.

MLPerf Inference v5.1, released in September 2025, set a participation record with 27 organizations submitting systems.[5] It introduced three new benchmarks: DeepSeek-R1 (the first reasoning model in the suite), Llama 3.1 8B for summarization, and Whisper Large V3 for speech recognition.[5] Top systems improved by as much as 50% over v5.0 six months earlier.[5] Llama 2 70B remained the most popular benchmark with 24 submitters, and new accelerators included the AMD Instinct MI355X and NVIDIA's GB300 and RTX Pro 6000 Blackwell.[5]

MLPerf Training v5.1, released in November 2025, included 65 unique systems across 12 different hardware accelerators.[6] Generative AI benchmark improvements outpaced Moore's Law, and Llama 3.1 8B replaced BERT while Flux.1 replaced Stable Diffusion v2 to reflect current workloads.[6]

## Quality versus speed tradeoffs

You almost never optimize one sense of performance in isolation. Quantizing a model to INT8 or FP8 may give 2x throughput at the cost of a fraction of a point on accuracy. Distilling a 70B model into a 7B student fits smaller hardware but rarely matches the teacher on every benchmark. Speculative decoding speeds up generation without quality loss but requires a smaller draft model that uses extra memory.

The right point on the quality-cost-latency curve depends on the application. A coding assistant can spend a few hundred milliseconds on a heavyweight model. A real-time speech assistant cannot. Picking the right metric, measuring it honestly, and being clear about which sense of "performance" you mean is most of the job.

## Explain like I'm 5 (ELI5)

Performance in machine learning is two different things wearing the same name.

The first kind is like getting a grade on a test. It tells us how often the computer is right when it guesses what's in a picture or how much a house costs. Different grading systems exist, like accuracy (how many it got right) or F1 score (a balance between being careful and being thorough).

The second kind is like how fast you can finish a race. Even if you know all the answers, you need to give them quickly. A model that takes ten seconds to reply is annoying even if it is correct. So engineers also measure how many words per second a model produces and how long you wait before the first word appears.

The tricky part is that being more accurate often makes you slower, and being faster often makes you a little less accurate. Choosing the right balance is most of the work.

## References

[1] Google for Developers, Machine Learning Crash Course, Classification: Accuracy, recall, precision, and related metrics. https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall
[2] scikit-learn documentation, Cross-validation: evaluating estimator performance. https://scikit-learn.org/stable/modules/cross_validation.html
[3] scikit-learn documentation, Metrics and scoring: quantifying the quality of predictions. https://scikit-learn.org/stable/modules/model_evaluation.html
[4] Wikipedia, Bias-variance tradeoff. https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff
[5] MLCommons, MLPerf Inference v5.1 Results (September 2025). https://mlcommons.org/2025/09/mlperf-inference-v5-1-results/
[6] MLCommons, MLPerf Training v5.1 Results (November 2025). https://mlcommons.org/2025/11/training-v5-1-results/
[7] MLCommons, About MLCommons. https://mlcommons.org/about-us/
[8] NVIDIA Developer Blog, LLM Inference Benchmarking: Fundamental Concepts. https://developer.nvidia.com/blog/llm-benchmarking-fundamental-concepts/
[9] NVIDIA NIM LLMs Benchmarking, Metrics. https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html
[10] BentoML, Key metrics for LLM inference. https://bentoml.com/llm/inference-optimization/llm-inference-metrics
[11] arXiv, Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference (2503.08311). https://arxiv.org/abs/2503.08311
[12] NVIDIA, H100 and B200 product specifications. https://www.nvidia.com/en-us/data-center/h100/
[13] Stanford HAI, 2025 AI Index Report, Technical Performance chapter. https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance