MLPerf is an industry-standard suite of benchmarks for measuring the performance of machine learning systems across training, inference, and specialized computing environments. Developed and maintained by MLCommons, a global engineering consortium of over 125 organizations, MLPerf provides a transparent and reproducible methodology for comparing hardware and software platforms on real-world AI workloads. Since its founding in 2018, MLPerf has become the most widely cited benchmark in the AI hardware industry, with results used by chip manufacturers, cloud providers, and system integrators to validate performance claims and guide purchasing decisions.
The MLPerf project originated in February 2018 from a series of meetings between engineers and researchers at Baidu, Google, Harvard University, Stanford University, and UC Berkeley. The founding team included Peter Mattson (Google Brain), David Patterson (UC Berkeley), Greg Diamos (Baidu), Cliff Young (Google), Peter Bailis (Stanford), and Gu-Yeon Wei (Harvard). Their goal was to create an equivalent of the SPEC and TPC benchmarks for machine learning, providing standardized, fair comparisons across rapidly proliferating hardware and software platforms.
The first MLPerf Training benchmark results were published in late 2018, and the MLPerf Inference benchmark followed in 2019. The formal academic paper, "MLPerf Training Benchmark," appeared on arXiv in October 2019 with contributions from over 38 authors spanning industry and academia.
In December 2020, the MLPerf effort was formally incorporated into MLCommons, a nonprofit engineering consortium headquartered in Dover, Delaware. MLCommons expanded the scope beyond benchmarking to include open datasets, safety evaluations, and research initiatives. The founding board included representatives from Alibaba, Meta (then Facebook AI), Google, Intel, NVIDIA, and Harvard University.
As of 2025, MLCommons has grown to include over 125 member organizations ranging from startups to Fortune 500 companies, universities, and nonprofits across the globe. The MLPerf benchmark suite has produced over 90,000 individual results across all its categories.
MLCommons organizes its work through several working groups:
| Working Group Area | Key Initiatives |
|---|---|
| Benchmarks | MLPerf Training, MLPerf Inference, MLPerf HPC, MLPerf Tiny, MLPerf Storage, MLPerf Client, MLPerf Automotive, AlgoPerf, AILuminate |
| Data | Croissant metadata format, open datasets, medical AI data, MLCube |
| Research | Algorithms, Chakra execution traces, data-centric ML, science |
| AI Risk and Reliability | AI safety evaluation, responsible AI development |
The MLPerf benchmarks follow a twice-yearly release cadence, with new results published roughly every six months (e.g., Training v5.0 in June 2025 and v5.1 in November 2025). Each release can include updated workloads, new models, and retired benchmarks to keep pace with the fast-moving field.
MLPerf is not a single benchmark but a family of benchmark suites, each targeting a different aspect of the ML lifecycle or a different class of hardware. The major suites are:
| Suite | Target | First Release | Latest Version (as of late 2025) |
|---|---|---|---|
| MLPerf Training | Datacenter GPU/accelerator clusters | 2018 | v5.1 (November 2025) |
| MLPerf Inference | Datacenter and edge inference systems | 2019 | v5.1 (September 2025) |
| MLPerf HPC | Supercomputers for scientific ML | 2019 | v3.0 (November 2023) |
| MLPerf Tiny | Microcontrollers and edge devices | 2021 | v1.3 (September 2025) |
| MLPerf Storage | Storage systems for ML training | 2023 | v2.0 (August 2025) |
| MLPerf Client | Laptops, desktops, workstations | 2024 | v1.5 (November 2025) |
| MLPerf Automotive | Automotive inference systems | 2025 | v0.5 (August 2025) |
All MLPerf benchmark suites use a common framework of submission divisions and availability categories to ensure fair comparisons while still allowing innovation.
Closed Division: Submitters must use the reference model architecture and adhere to strict rules about preprocessing, training procedures, and hyperparameters. This division enables direct, apples-to-apples comparisons of hardware and system software. Most industry submissions fall into this category.
Open Division: Submitters may modify the model architecture, use different optimizers, or apply novel training techniques. This division encourages algorithmic innovation and research exploration. Results in the Open Division cannot be directly compared to Closed Division results.
| Category | Description |
|---|---|
| Available | All hardware and software components can be purchased or rented by the public at the time of submission |
| Preview | Components must become Available by the next submission round |
| RDI (Research, Development, Internal) | Experimental or pre-production systems not yet commercially available |
The MLPerf Training benchmark measures the wall-clock time required to train a model on a specified dataset to a defined quality target. Because results vary from run to run (typically around plus or minus 2.5% for vision benchmarks and plus or minus 5% for other workloads), each benchmark is run a specified number of times; the highest and lowest results are discarded and the remainder averaged.
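The discard-and-average rule can be sketched in a few lines of Python. This is an illustrative helper (the exact run counts and per-benchmark rules are set by the MLPerf Training rules; `benchmark_score` is not MLCommons code):

```python
def benchmark_score(run_times):
    """Compute an 'olympic' average of benchmark run times:
    drop the fastest and slowest runs, then average the rest.
    Requires at least 3 runs so both extremes can be trimmed."""
    if len(run_times) < 3:
        raise ValueError("need at least 3 runs to trim both extremes")
    trimmed = sorted(run_times)[1:-1]   # discard min and max
    return sum(trimmed) / len(trimmed)
```

For example, runs of 27, 28, 29, 30, and 100 minutes (the last an outlier) score 29.0, since only the middle three values contribute.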
The v5.1 suite includes the following workloads:
| Benchmark | Model | Dataset | Domain | Notes |
|---|---|---|---|---|
| LLM Pretraining (large) | Llama 3.1 405B | C4 (v3.0.1) | Natural language processing | 405 billion parameters, sequence length 8,192. Replaced GPT-3 in v5.0. |
| LLM Pretraining (small) | Llama 3.1 8B | C4 | Natural language processing | New in v5.1, replaces BERT. Enables single-node benchmarking. |
| LLM Fine-tuning | Llama 2 70B (LoRA) | SCROLLS GovReport | Natural language processing | Sequence length 8,192. Uses Low-Rank Adaptation. |
| Object Detection | RetinaNet | Open Images | Computer vision | Lightweight object detection |
| Image Classification | ResNet-50 | ImageNet | Computer vision | Classic vision benchmark |
| Medical Image Segmentation | 3D U-Net | KiTS19 | Medical imaging | Kidney tumor segmentation |
| Text-to-Image Generation | Flux.1 | CC12M (train), COCO 2014 (eval) | Generative AI | 11.9B parameter transformer model. New in v5.1, replaces Stable Diffusion v2. |
| Recommendation | DLRMv2 | Criteo 4TB | Commerce | Deep Learning Recommendation Model |
| Graph Neural Network | R-GAT | IGBH-Full | Graph learning | Relational Graph Attention Network |
| Speech Recognition | RNN-T | LibriSpeech | Audio | Recurrent Neural Network Transducer |
MLPerf Training v5.0 (June 2025) introduced the Llama 3.1 405B benchmark, the largest model ever included in the training suite. In a landmark result, CoreWeave, NVIDIA, and IBM completed the Llama 3.1 405B training benchmark in just 27.3 minutes on 2,496 NVIDIA GB200 GPUs deployed in GB200 NVL72 rack-scale systems. This was the largest cluster ever benchmarked under MLPerf, and it achieved 91% scaling efficiency when scaling from 512 to 2,496 GPUs.
The NVIDIA Blackwell architecture delivered up to 2.6x higher performance per GPU compared to the previous-generation Hopper architecture, with at least 2x faster training at equivalent cluster sizes.
MLPerf Training v5.1 (November 2025) included results from 20 organizations submitting 65 unique systems across 12 different hardware accelerators. Nearly half of all submissions were multi-node configurations, an 86% increase over v4.1. The generative AI workloads (Llama 2 70B LoRA and Llama 3.1 8B) saw 24% and 15% submission increases respectively.
Version 4.0 (2024) was notable for introducing system-wide power draw and energy consumption measurements during training, reflecting the growing importance of energy efficiency in large-scale AI systems.
The MLPerf Inference benchmark measures how quickly systems can execute trained models across various deployment scenarios. It covers both datacenter and edge deployments, with different scenarios reflecting real-world usage patterns.
| Scenario | Description | Primary Use Case |
|---|---|---|
| Offline | Processes the entire dataset as a batch; measures raw throughput | Batch processing, analytics |
| Server | Queries arrive following a Poisson distribution; measures throughput under latency constraints | Cloud API endpoints, web services |
| SingleStream | One query at a time; measures latency per query | Mobile apps, single-user devices |
| MultiStream | Multiple concurrent query streams | Multi-camera systems, autonomous vehicles |
| Interactive | Tight latency constraints for time-to-first-token (TTFT) and time-per-output-token (TPOT) | Chatbots, agentic AI applications |
Datacenter submissions require the Offline and Server scenarios (plus Interactive for LLM benchmarks). Edge submissions require SingleStream, MultiStream, and Offline scenarios.
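The Server scenario's Poisson arrival process can be illustrated by generating exponentially distributed inter-arrival gaps, which is how a Poisson process is constructed. This is a sketch of the idea only; MLPerf's actual load generator (LoadGen) is a separate C++ library with its own implementation:

```python
import random

def server_arrival_times(target_qps, num_queries, seed=0):
    """Generate query issue times (seconds) for a Server-style scenario.
    A Poisson process with rate target_qps has exponentially distributed
    inter-arrival gaps with mean 1/target_qps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_queries):
        t += rng.expovariate(target_qps)  # draw the next gap
        times.append(t)
    return times
```

At 100 queries per second, the mean gap between issued queries converges to 10 ms, but individual gaps vary widely, which is what stresses a system's ability to meet latency constraints under bursty load.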
| Benchmark | Model | Dataset | Scenarios |
|---|---|---|---|
| Image Classification | ResNet-50 | ImageNet | Datacenter (Offline, Server), Edge (Offline, SingleStream, MultiStream) |
| Object Detection | RetinaNet | Open Images | Datacenter, Edge |
| Medical Image Segmentation | 3D U-Net | KiTS19 | Datacenter, Edge |
| NLP | BERT-Large | SQuAD v1.1 | Datacenter, Edge |
| Speech Recognition | RNN-T | LibriSpeech | Datacenter, Edge |
| Recommendation | DLRMv2 | Criteo | Datacenter only |
| Text Summarization | GPT-J | CNN DailyMail | Datacenter, Edge |
| Large Language Model | Llama 2 70B | Multiple | Datacenter, Edge |
| Large Language Model (Interactive) | Llama 2 70B Interactive | Multiple | Datacenter (tighter latency: 450ms TTFT, 40ms TPOT) |
| Mixture of Experts | Mixtral 8x7B | Multiple | Datacenter |
| Large Language Model | Llama 3.1 405B | Multiple | Datacenter |
| Small LLM | Llama 3.1 8B | CNN DailyMail | Datacenter, Edge. New in v5.1; replaces GPT-J. 128K token context. |
| Reasoning Model | DeepSeek-R1 | Math, Q&A, Code | Datacenter. New in v5.1; first reasoning model in the suite. |
| Speech Recognition | Whisper Large V3 | Multiple | Datacenter, Edge. New in v5.1; multilingual. |
| Graph Neural Network | R-GAT | IGBH | Datacenter |
| Autonomous Driving | PointPainting | nuScenes, Cognata | Edge (automotive) |
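The Interactive entries in the table above bound time-to-first-token and time-per-output-token, both of which can be computed from per-token completion timestamps. A sketch (the helper name and signature are illustrative, not LoadGen's API):

```python
def llm_latency_metrics(token_timestamps, query_issue_time):
    """Compute TTFT and TPOT (seconds) for one LLM query.
    token_timestamps: completion time of each generated token, in order.
    TPOT is the average gap between tokens after the first."""
    ttft = token_timestamps[0] - query_issue_time
    if len(token_timestamps) > 1:
        tpot = (token_timestamps[-1] - token_timestamps[0]) / (len(token_timestamps) - 1)
    else:
        tpot = 0.0
    return ttft, tpot
```

A query issued at t=0 whose tokens complete at 0.45 s, 0.49 s, 0.53 s, and 0.57 s has a TTFT of 450 ms and a TPOT of 40 ms, exactly the Llama 2 70B Interactive limits.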
Most benchmarks require the submitted system to achieve at least 99% of the accuracy of the FP32 reference model. Language processing and medical imaging benchmarks have a stricter threshold of 99.9% accuracy relative to the reference.
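The accuracy gate is a simple relative threshold against the FP32 reference; a hypothetical checker:

```python
def passes_accuracy(measured, reference_fp32, strict=False):
    """MLPerf-style accuracy gate: submissions must reach 99% of the
    FP32 reference accuracy, or 99.9% for the stricter targets used by
    language processing and medical imaging benchmarks."""
    threshold = 0.999 if strict else 0.99
    return measured >= threshold * reference_fp32
```

For example, a quantized model scoring 0.762 against a 0.768 reference clears the 99% gate but fails the 99.9% gate, which is precisely the kind of low-precision shortcut the stricter threshold is designed to catch.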
MLPerf Inference v5.1 (September 2025) set a record for participation, with 27 organizations submitting results. Six were first-time submitters: MiTac, Nebius, Amitash Nanda (academic), TheStage AI, University of Florida, and Vultr. Five new accelerators debuted in this round:
| Accelerator | Vendor |
|---|---|
| Instinct MI355X | AMD |
| Arc Pro B60 48GB Turbo | Intel |
| GB300 | NVIDIA |
| RTX 4000 Ada PCIe 20GB | NVIDIA |
| RTX Pro 6000 Blackwell Server Edition | NVIDIA |
Performance on the Llama 2 70B benchmark improved by up to 50% over v5.0 results from just six months prior. NVIDIA's Blackwell Ultra architecture topped the leaderboard across multiple benchmarks. A notable milestone was the first heterogeneous system submission, which used software load-balancing across different accelerator types.
MLPerf Inference v5.0 (April 2025) included 17,457 results from 23 organizations. Five newly available processors were benchmarked: AMD Instinct MI325X, Intel Xeon 6980P, Google TPU Trillium (v6e), NVIDIA B200, and NVIDIA Jetson AGX Thor 128.
MLPerf HPC targets supercomputer-scale systems used for scientific machine learning. Unlike the standard Training benchmark, HPC benchmarks focus on scientific applications and include an optional throughput metric for large multi-user systems.
| Benchmark | Model | Application | Description |
|---|---|---|---|
| DeepCAM | Convolutional neural network | Climate science | Identifies weather patterns in climate simulation data |
| CosmoFlow | 3D CNN | Cosmology | Predicts cosmological parameters from 3D dark matter distributions |
| OpenCatalyst | GNN | Chemistry | Predicts quantum mechanical properties of catalyst materials for energy storage |
| OpenFold | Generative AI | Structural biology | Predicts 3D protein structures from amino acid sequences. Added in v3.0. |
MLPerf HPC v3.0 results demonstrated performance gains of up to 2.8x compared to just five months prior and 49x over the first HPC results. The DeepCAM weather modeling benchmark ran 14x faster than when it debuted, illustrating how innovations in ML hardware and software benefit scientific computing.
MLPerf Tiny benchmarks inference performance on extremely low-power embedded devices such as microcontrollers, DSPs, and tiny neural network accelerators. These devices typically operate at clock speeds between 10 MHz and 250 MHz and consume less than 50 mW of power. The neural networks tested are small, typically 100 KB and below, processing sensor data like audio and images.
| Benchmark | Task | Description |
|---|---|---|
| Keyword Spotting (KWS) | Audio classification | Detects spoken wake words or commands. Used in earbuds and virtual assistants. |
| Visual Wake Words (VWW) | Image classification | Detects presence of a person in an image. Used in security cameras and smart home devices. |
| Image Classification (IC) | Image classification | Classifies small images into 10 categories using a compact CNN. |
| Anomaly Detection (AD) | Audio anomaly detection | Identifies abnormal sounds in machine operating environments for predictive maintenance. |
| Streaming KWS | Audio streaming | New in v1.3. Uses a 1D depthwise separable CNN on continuous audio streams to detect wake words in real time. |
The reference implementation uses TensorFlow Lite for Microcontrollers (TFLM). Three metrics are measured for each benchmark: accuracy, latency, and energy consumption. MLPerf Tiny v1.3 included 70 results across five benchmarks from four participants: Kai Jiang, Qualcomm, STMicroelectronics, and Syntiant.
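Of these metrics, energy follows directly from the other measurements: since 1 mW sustained for 1 ms is exactly 1 µJ, energy per inference is the product of average power and latency. A trivial illustrative helper:

```python
def energy_per_inference_uj(power_mw, latency_ms):
    """Energy per inference in microjoules.
    E = P * t, and mW * ms = uJ, so no unit conversion is needed."""
    return power_mw * latency_ms
```

A device drawing 50 mW that completes an inference in 10 ms therefore spends 500 µJ per inference, the kind of budget that determines battery life in always-on keyword spotting.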
MLPerf Storage measures how quickly storage systems can feed training data to accelerators during model training. Storage throughput is a critical bottleneck in large-scale training, especially when datasets are too large to fit in memory.
| Version | Workloads | Key Feature |
|---|---|---|
| v1.0 (September 2024) | 3D U-Net, ResNet-50, CosmoFlow | Added distributed training support. 55 submissions from 12 companies. |
| v2.0 (August 2025) | 3D U-Net, ResNet-50, CosmoFlow, Llama 3 checkpointing | Added real-world checkpointing tests. Over 200 results from 26 organizations across 7 countries. |
The v2.0 results showed that submitted storage systems could support roughly twice the number of simultaneous accelerators compared to v1.0, reflecting the scaling demands of modern training clusters.
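The bandwidth demand that these benchmarks measure can be estimated with back-of-envelope arithmetic: total samples consumed per second times sample size. The numbers below are illustrative, not taken from any MLPerf Storage workload:

```python
def required_storage_throughput_gbps(num_accelerators,
                                     samples_per_sec_per_accel,
                                     sample_mb):
    """Rough storage bandwidth (GB/s) needed to keep accelerators fed
    with training data, assuming no caching or prefetch headroom."""
    total_mb_per_sec = num_accelerators * samples_per_sec_per_accel * sample_mb
    return total_mb_per_sec / 1000.0
```

For instance, 64 accelerators each consuming 100 samples/s of 2 MB samples need about 12.8 GB/s of sustained read throughput, which shows why doubling supported accelerator counts between v1.0 and v2.0 is a meaningful storage result.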
MLPerf Client evaluates AI performance on personal computers, including laptops, desktops, and workstations. Announced in January 2024, this benchmark suite targets the growing "AI PC" market segment where large language models run locally on consumer hardware.
| Version | Release Date | Models Tested | Key Features |
|---|---|---|---|
| v0.5 | December 2024 | Llama 2 7B (4-bit quantized) | Tasks: content generation, creative writing, text summarization. Metrics: TTFT, tokens per second. |
| v1.0 | July 2025 | Llama 2 7B Chat, Llama 3.1 8B Instruct, Phi 3.5 Mini Instruct | AMD NPU/GPU, Intel NPU/GPU (OpenVINO), NVIDIA GPU, Qualcomm NPU, Apple GPU (MLX) support |
| v1.5 | November 2025 | Same as v1.0 plus Phi 4 Reasoning 14B (experimental) | Windows ML integration, iPad support, experimental power measurement, Linux CLI |
MLPerf Client v1.5 supports Windows x64, Windows on Arm, macOS, Linux, and iPad, making it the most cross-platform MLPerf benchmark. It is developed collaboratively by AMD, Intel, Microsoft, NVIDIA, Qualcomm, and major PC OEMs.
MLPerf Automotive is jointly developed by MLCommons and the Autonomous Vehicle Computing Consortium (AVCC) to evaluate inference performance for automotive AI systems. The benchmark covers Advanced Driver Assistance Systems (ADAS), autonomous driving (AD), and in-vehicle infotainment (IVI).
| Benchmark | Task | Model/Method |
|---|---|---|
| 2D Object Detection | Detect vehicles, pedestrians, etc. in camera images | Various detectors on Cognata 8MP imagery |
| 2D Semantic Segmentation | Pixel-level scene classification | Segmentation models on Cognata dataset |
| 3D Object Detection | Detect objects in 3D space from lidar and camera fusion | PointPainting on nuScenes dataset |
The 3D object detection benchmark uses PointPainting, a sensor-fusion technique that combines image-based semantic segmentation with lidar point cloud data. Future versions are expected to incorporate vision-language-action (VLA) models for end-to-end self-driving evaluation.
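The core of PointPainting can be sketched with NumPy: project each lidar point into the image plane and append that pixel's segmentation class scores to the point. This is a simplified illustration assuming points are already in the camera frame; real pipelines also handle extrinsics, lens distortion, and multiple cameras, and `paint_points` is a hypothetical helper:

```python
import numpy as np

def paint_points(points, seg_scores, K):
    """Append per-pixel class scores to lidar points (PointPainting-style).
    points:     (N, 3) lidar points in the camera coordinate frame.
    seg_scores: (H, W, C) softmax output of an image segmentation network.
    K:          (3, 3) camera intrinsic matrix.
    Returns:    (N, 3 + C) 'painted' points."""
    H, W, C = seg_scores.shape
    uvw = points @ K.T                              # project to image plane
    z = uvw[:, 2]
    u = np.round(uvw[:, 0] / np.maximum(z, 1e-6)).astype(int)
    v = np.round(uvw[:, 1] / np.maximum(z, 1e-6)).astype(int)
    # Keep only points in front of the camera that land inside the image.
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    scores = np.zeros((points.shape[0], C))         # zeros for unpainted points
    scores[valid] = seg_scores[v[valid], u[valid]]
    return np.hstack([points, scores])
```

The painted point cloud is then fed to a standard lidar detector, letting it exploit image semantics without changing its 3D architecture.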
MLPerf results serve several purposes across the AI industry:
Hardware Vendors: Chip makers like NVIDIA, AMD, Intel, and Google use MLPerf to demonstrate the performance advantages of new processor generations. NVIDIA has been a particularly dominant and consistent top performer, using MLPerf results prominently in product launches for its Hopper and Blackwell GPU architectures.
Cloud Providers: Companies like CoreWeave, Oracle Cloud, Google Cloud, Microsoft Azure, and Lambda use MLPerf to validate their infrastructure offerings. Cloud-based submissions allow potential customers to see performance on rentable systems.
System Integrators: Server manufacturers like Dell, HPE, Lenovo, Supermicro, and Cisco submit results to showcase the performance of their configured systems, helping enterprise buyers compare turnkey solutions.
AI Startups and Chip Designers: Newer entrants like Cerebras, SambaNova, Groq, and others have used MLPerf submissions (or plan to) to establish credibility and benchmark their novel architectures against established players.
Researchers: Academic institutions use MLPerf HPC and other suites to measure the ML capabilities of research clusters and supercomputers, informing procurement and system design decisions.
MLPerf employs several mechanisms to ensure fair and meaningful comparisons, including the Closed/Open division rules, peer review of submitted results by other submitters, and a regularly refreshed workload set. The twice-yearly cadence and evolving workload selection keep the benchmarks relevant as the field progresses: retired workloads (such as the MiniGo reinforcement learning benchmark, the original NMT machine translation task, GPT-3, and BERT) are replaced with newer, more representative models.
While MLPerf is widely respected, some criticisms have been raised: vendors tend to submit results only for benchmarks on which their systems perform well, the engineering effort required for a competitive submission favors large companies, and the fixed models of the Closed Division can lag behind the techniques actually used in production. The following table summarizes the release history of the major suites:
| Suite | Version | Date | Key Changes |
|---|---|---|---|
| Training | v0.5 | December 2018 | Initial release with ResNet-50, Transformer, NMT, MiniGo, Mask R-CNN, SSD |
| Training | v0.7 | July 2020 | Added BERT, DLRM |
| Training | v1.0 | June 2021 | Updated models and datasets |
| Training | v3.0 | June 2023 | GPT-3 LLM pretraining added |
| Training | v3.1 | November 2023 | Stable Diffusion added |
| Training | v4.0 | June 2024 | Power measurements introduced; Llama 2 70B LoRA fine-tuning and R-GAT added |
| Training | v5.0 | June 2025 | Llama 3.1 405B replaces GPT-3; record submissions |
| Training | v5.1 | November 2025 | Llama 3.1 8B replaces BERT; Flux.1 replaces Stable Diffusion |
| Inference | v0.5 | June 2019 | Initial release |
| Inference | v4.0 | March 2024 | Llama 2 70B, SDXL added |
| Inference | v4.1 | August 2024 | Mixtral 8x7B added |
| Inference | v5.0 | April 2025 | Llama 3.1 405B, R-GAT, Automotive PointPainting, Interactive scenario |
| Inference | v5.1 | September 2025 | DeepSeek-R1, Llama 3.1 8B, Whisper Large V3 added; 27 submitters |
| HPC | v0.5 | November 2019 | DeepCAM, CosmoFlow |
| HPC | v3.0 | November 2023 | OpenFold protein folding added; 49x gains over first results |
| Tiny | v0.5 | June 2021 | KWS, VWW, IC, AD |
| Tiny | v1.3 | September 2025 | Streaming KWS benchmark added |
| Storage | v0.5 | 2023 | Initial preview |
| Storage | v1.0 | September 2024 | Distributed training support |
| Storage | v2.0 | August 2025 | Llama 3 checkpointing; 200+ results |
| Client | v0.5 | December 2024 | Initial release for AI PCs |
| Client | v1.5 | November 2025 | Windows ML, iPad, power measurement |
| Automotive | v0.5 | August 2025 | 2D/3D object detection, semantic segmentation |