MLPerf is an industry-standard suite of benchmarks for measuring the performance of machine learning systems across training, inference, and specialized computing environments. Developed and maintained by MLCommons, a global engineering consortium of over 125 organizations, MLPerf provides a transparent and reproducible methodology for comparing hardware and software platforms on real-world AI workloads. Since its founding in 2018, MLPerf has become the most widely cited benchmark in the AI hardware industry, with results used by chip manufacturers, cloud providers, and system integrators to validate performance claims and guide purchasing decisions.
The MLPerf project originated in February 2018 from a series of meetings between engineers and researchers at Baidu, Google, Harvard University, Stanford University, and UC Berkeley. The founding team included Peter Mattson (Google Brain), David Patterson (UC Berkeley), Greg Diamos (Baidu), Cliff Young (Google), Peter Bailis (Stanford), and Gu-Yeon Wei (Harvard). Their goal was to create an equivalent of the SPEC and TPC benchmarks for machine learning, providing standardized, fair comparisons across rapidly proliferating hardware and software platforms.
The first MLPerf Training benchmark results were published in late 2018, and the MLPerf Inference benchmark followed in 2019. The formal academic paper, "MLPerf Training Benchmark," appeared on arXiv in October 2019 with contributions from over 38 authors spanning industry and academia.
In December 2020, the MLPerf effort was formally incorporated into MLCommons, a nonprofit engineering consortium headquartered in Dover, Delaware. MLCommons expanded the scope beyond benchmarking to include open datasets, safety evaluations, and research initiatives. The founding board included representatives from Alibaba, Meta (then Facebook AI), Google, Intel, NVIDIA, and Harvard University.
As of 2025, MLCommons has grown to include over 125 member organizations ranging from startups to Fortune 500 companies, universities, and nonprofits across the globe. The MLPerf benchmark suite has produced over 90,000 individual results across all its categories.
MLCommons organizes its work through several working groups:
| Working Group Area | Key Initiatives |
|---|---|
| Benchmarks | MLPerf Training, MLPerf Inference, MLPerf HPC, MLPerf Tiny, MLPerf Storage, MLPerf Client, MLPerf Automotive, AlgoPerf, AILuminate |
| Data | Croissant metadata format, open datasets, medical AI data, MLCube |
| Research | Algorithms, Chakra execution traces, data-centric ML, science |
| AI Risk and Reliability | AI safety evaluation, responsible AI development |
The MLPerf benchmarks follow a twice-yearly release cadence, with new results published roughly every six months (e.g., Training v5.0 in June 2025 and v5.1 in November 2025). Each release can include updated workloads, new models, and retired benchmarks to keep pace with the fast-moving field.
MLPerf is not a single benchmark but a family of benchmark suites, each targeting a different aspect of the ML lifecycle or a different class of hardware. The major suites are:
| Suite | Target | First Release | Latest Version (as of late 2025) |
|---|---|---|---|
| MLPerf Training | Datacenter GPU/accelerator clusters | 2018 | v5.1 (November 2025) |
| MLPerf Inference | Datacenter and edge inference systems | 2019 | v5.1 (September 2025) |
| MLPerf HPC | Supercomputers for scientific ML | 2019 | v3.0 (November 2023) |
| MLPerf Tiny | Microcontrollers and edge devices | 2021 | v1.3 (September 2025) |
| MLPerf Storage | Storage systems for ML training | 2023 | v2.0 (August 2025) |
| MLPerf Client | Laptops, desktops, workstations | 2024 | v1.5 (November 2025) |
| MLPerf Automotive | Automotive inference systems | 2025 | v0.5 (August 2025) |
All MLPerf benchmark suites use a common framework of submission divisions and availability categories to ensure fair comparisons while still allowing innovation.
Closed Division: Submitters must use the reference model architecture and adhere to strict rules about preprocessing, training procedures, and hyperparameters. This division enables direct, apples-to-apples comparisons of hardware and system software. Most industry submissions fall into this category.
Open Division: Submitters may modify the model architecture, use different optimizers, or apply novel training techniques. This division encourages algorithmic innovation and research exploration. Results in the Open Division cannot be directly compared to Closed Division results.
| Category | Description |
|---|---|
| Available | All hardware and software components can be purchased or rented by the public at the time of submission |
| Preview | Components must become Available by the next submission round |
| RDI (Research, Development, Internal) | Experimental or pre-production systems not yet commercially available |
The MLPerf Training benchmark measures the wall-clock time required to train a model on a specified dataset to a defined quality target. Because results vary from run to run (typically around plus or minus 2.5% for vision benchmarks and plus or minus 5% for other workloads), each benchmark is run a specified number of times; the highest and lowest results are discarded and the remainder averaged.
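The discard-and-average rule can be sketched in a few lines of Python. This is an illustrative helper (the exact run counts and per-benchmark rules are set by the MLPerf Training rules; `benchmark_score` is not MLCommons code):

```python
def benchmark_score(run_times):
    """Compute an 'olympic' average of benchmark run times:
    drop the fastest and slowest runs, then average the rest.
    Requires at least 3 runs so both extremes can be trimmed."""
    if len(run_times) < 3:
        raise ValueError("need at least 3 runs to trim both extremes")
    trimmed = sorted(run_times)[1:-1]   # discard min and max
    return sum(trimmed) / len(trimmed)
```

For example, runs of 27, 28, 29, 30, and 100 minutes (the last an outlier) score 29.0, since only the middle three values contribute.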
The v5.1 suite includes the following workloads:
| Benchmark | Model | Dataset | Domain | Notes |
|---|---|---|---|---|
| LLM Pretraining (large) | Llama 3.1 405B | C4 (v3.0.1) | Natural language processing | 405 billion parameters, sequence length 8,192. Replaced GPT-3 in v5.0. |
| LLM Pretraining (small) | Llama 3.1 8B | C4 | Natural language processing | New in v5.1, replaces BERT. Enables single-node benchmarking. |
| LLM Fine-tuning | Llama 2 70B (LoRA) | SCROLLS GovReport | Natural language processing | Sequence length 8,192. Uses Low-Rank Adaptation. |
| Object Detection | RetinaNet | Open Images | Computer vision | Lightweight object detection |
| Image Classification | ResNet-50 | ImageNet | Computer vision | Classic vision benchmark |
| Medical Image Segmentation | 3D U-Net | KiTS19 | Medical imaging | Kidney tumor segmentation |
| Text-to-Image Generation | Flux.1 | CC12M (train), COCO 2014 (eval) | Generative AI | 11.9B parameter transformer model. New in v5.1, replaces Stable Diffusion v2. |
| Recommendation | DLRMv2 | Criteo 4TB | Commerce | Deep Learning Recommendation Model |
| Graph Neural Network | R-GAT | IGBH-Full | Graph learning | Relational Graph Attention Network |
| Speech Recognition | RNN-T | LibriSpeech | Audio | Recurrent Neural Network Transducer |
MLPerf Training v5.0 (June 2025) introduced the Llama 3.1 405B benchmark, the largest model ever included in the training suite. In a landmark result, CoreWeave, NVIDIA, and IBM completed the Llama 3.1 405B training benchmark in just 27.3 minutes on 2,496 NVIDIA GB200 GPUs deployed in GB200 NVL72 rack-scale systems. This was the largest cluster ever benchmarked under MLPerf, and it achieved 91% scaling efficiency when scaling from 512 to 2,496 GPUs.
The NVIDIA Blackwell architecture delivered up to 2.6x higher performance per GPU compared to the previous-generation Hopper architecture, with at least 2x faster training at equivalent cluster sizes.
MLPerf Training v5.1 (November 2025) included results from 20 organizations submitting 65 unique systems across 12 different hardware accelerators. Nearly half of all submissions were multi-node configurations, an 86% increase over v4.1. The generative AI workloads (Llama 2 70B LoRA and Llama 3.1 8B) saw 24% and 15% submission increases respectively.
Version 4.0 (2024) was notable for introducing system-wide power draw and energy consumption measurements during training, reflecting the growing importance of energy efficiency in large-scale AI systems.
The MLPerf Inference benchmark measures how quickly systems can execute trained models across various deployment scenarios. It covers both datacenter and edge deployments, with different scenarios reflecting real-world usage patterns.
| Scenario | Description | Primary Use Case |
|---|---|---|
| Offline | Processes the entire dataset as a batch; measures raw throughput | Batch processing, analytics |
| Server | Queries arrive following a Poisson distribution; measures throughput under latency constraints | Cloud API endpoints, web services |
| SingleStream | One query at a time; measures latency per query | Mobile apps, single-user devices |
| MultiStream | Multiple concurrent query streams | Multi-camera systems, autonomous vehicles |
| Interactive | Tight latency constraints for time-to-first-token (TTFT) and time-per-output-token (TPOT) | Chatbots, agentic AI applications |
Datacenter submissions require the Offline and Server scenarios (plus Interactive for LLM benchmarks). Edge submissions require SingleStream, MultiStream, and Offline scenarios.
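The Server scenario's Poisson arrival process can be illustrated by generating exponentially distributed inter-arrival gaps, which is how a Poisson process is constructed. This is a sketch of the idea only; MLPerf's actual load generator (LoadGen) is a separate C++ library with its own implementation:

```python
import random

def server_arrival_times(target_qps, num_queries, seed=0):
    """Generate query issue times (seconds) for a Server-style scenario.
    A Poisson process with rate target_qps has exponentially distributed
    inter-arrival gaps with mean 1/target_qps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_queries):
        t += rng.expovariate(target_qps)  # draw the next gap
        times.append(t)
    return times
```

At 100 queries per second, the mean gap between issued queries converges to 10 ms, but individual gaps vary widely, which is what stresses a system's ability to meet latency constraints under bursty load.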
| Benchmark | Model | Dataset | Scenarios |
|---|---|---|---|
| Image Classification | ResNet-50 | ImageNet | Datacenter (Offline, Server), Edge (Offline, SingleStream, MultiStream) |
| Object Detection | RetinaNet | Open Images | Datacenter, Edge |
| Medical Image Segmentation | 3D U-Net | KiTS19 | Datacenter, Edge |
| NLP | BERT-Large | SQuAD v1.1 | Datacenter, Edge |
| Speech Recognition | RNN-T | LibriSpeech | Datacenter, Edge |
| Recommendation | DLRMv2 | Criteo | Datacenter only |
| Text Summarization | GPT-J | CNN DailyMail | Datacenter, Edge |
| Large Language Model | Llama 2 70B | Multiple | Datacenter, Edge |
| Large Language Model (Interactive) | Llama 2 70B Interactive | Multiple | Datacenter (tighter latency: 450ms TTFT, 40ms TPOT) |
| Mixture of Experts | Mixtral 8x7B | Multiple | Datacenter |
| Large Language Model | Llama 3.1 405B | Multiple | Datacenter |
| Small LLM | Llama 3.1 8B | CNN DailyMail | Datacenter, Edge. New in v5.1; replaces GPT-J. 128K token context. |
| Reasoning Model | DeepSeek-R1 | Math, Q&A, Code | Datacenter. New in v5.1; first reasoning model in the suite. |
| Speech Recognition | Whisper Large V3 | Multiple | Datacenter, Edge. New in v5.1; multilingual. |
| Graph Neural Network | R-GAT | IGBH | Datacenter |
| Autonomous Driving | PointPainting | nuScenes, Cognata | Edge (automotive) |
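The Interactive entries in the table above bound time-to-first-token and time-per-output-token, both of which can be computed from per-token completion timestamps. A sketch (the helper name and signature are illustrative, not LoadGen's API):

```python
def llm_latency_metrics(token_timestamps, query_issue_time):
    """Compute TTFT and TPOT (seconds) for one LLM query.
    token_timestamps: completion time of each generated token, in order.
    TPOT is the average gap between tokens after the first."""
    ttft = token_timestamps[0] - query_issue_time
    if len(token_timestamps) > 1:
        tpot = (token_timestamps[-1] - token_timestamps[0]) / (len(token_timestamps) - 1)
    else:
        tpot = 0.0
    return ttft, tpot
```

A query issued at t=0 whose tokens complete at 0.45 s, 0.49 s, 0.53 s, and 0.57 s has a TTFT of 450 ms and a TPOT of 40 ms, exactly the Llama 2 70B Interactive limits.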
Most benchmarks require the submitted system to achieve at least 99% of the accuracy of the FP32 reference model. Language processing and medical imaging benchmarks have a stricter threshold of 99.9% accuracy relative to the reference.
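The accuracy gate is a simple relative threshold against the FP32 reference; a hypothetical checker:

```python
def passes_accuracy(measured, reference_fp32, strict=False):
    """MLPerf-style accuracy gate: submissions must reach 99% of the
    FP32 reference accuracy, or 99.9% for the stricter targets used by
    language processing and medical imaging benchmarks."""
    threshold = 0.999 if strict else 0.99
    return measured >= threshold * reference_fp32
```

For example, a quantized model scoring 0.762 against a 0.768 reference clears the 99% gate but fails the 99.9% gate, which is precisely the kind of low-precision shortcut the stricter threshold is designed to catch.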
MLPerf Inference v5.1 (September 2025) set a record for participation, with 27 organizations submitting results. Six were first-time submitters: MiTac, Nebius, Amitash Nanda (academic), TheStage AI, University of Florida, and Vultr. Five new accelerators debuted in this round:
| Accelerator | Vendor |
|---|---|
| Instinct MI355X | AMD |
| Arc Pro B60 48GB Turbo | Intel |
| GB300 | NVIDIA |
| RTX 4000 Ada PCIe 20GB | NVIDIA |
| RTX Pro 6000 Blackwell Server Edition | NVIDIA |
Performance on the Llama 2 70B benchmark improved by up to 50% over v5.0 results from just six months prior. NVIDIA's Blackwell Ultra architecture topped the leaderboard across multiple benchmarks. A notable milestone was the first heterogeneous system submission, which used software load-balancing across different accelerator types.
MLPerf Inference v5.0 (April 2025) included 17,457 results from 23 organizations. Five newly available processors were benchmarked: AMD Instinct MI325X, Intel Xeon 6980P, Google TPU Trillium (v6e), NVIDIA B200, and NVIDIA Jetson AGX Thor 128.
MLPerf HPC targets supercomputer-scale systems used for scientific machine learning. Unlike the standard Training benchmark, HPC benchmarks focus on scientific applications and include an optional throughput metric for large multi-user systems.
| Benchmark | Model | Application | Description |
|---|---|---|---|
| DeepCAM | Convolutional neural network | Climate science | Identifies weather patterns in climate simulation data |
| CosmoFlow | 3D CNN | Cosmology | Predicts cosmological parameters from 3D dark matter distributions |
| OpenCatalyst | GNN | Chemistry | Predicts quantum mechanical properties of catalyst materials for energy storage |
| OpenFold | Generative AI | Structural biology | Predicts 3D protein structures from amino acid sequences. Added in v3.0. |
MLPerf HPC v3.0 results demonstrated performance gains of up to 2.8x compared to just five months prior and 49x over the first HPC results. The DeepCAM weather modeling benchmark ran 14x faster than when it debuted, illustrating how innovations in ML hardware and software benefit scientific computing.
MLPerf Tiny benchmarks inference performance on extremely low-power embedded devices such as microcontrollers, DSPs, and tiny neural network accelerators. These devices typically operate at clock speeds between 10 MHz and 250 MHz and consume less than 50 mW of power. The neural networks tested are small, typically 100 KB and below, processing sensor data like audio and images.
| Benchmark | Task | Description |
|---|---|---|
| Keyword Spotting (KWS) | Audio classification | Detects spoken wake words or commands. Used in earbuds and virtual assistants. |
| Visual Wake Words (VWW) | Image classification | Detects presence of a person in an image. Used in security cameras and smart home devices. |
| Image Classification (IC) | Image classification | Classifies small images into 10 categories using a compact CNN. |
| Anomaly Detection (AD) | Audio anomaly detection | Identifies abnormal sounds in machine operating environments for predictive maintenance. |
| Streaming KWS | Audio streaming | New in v1.3. Uses a 1D depthwise separable CNN on continuous audio streams to detect wake words in real time. |
The reference implementation uses TensorFlow Lite for Microcontrollers (TFLM). Three metrics are measured for each benchmark: accuracy, latency, and energy consumption. MLPerf Tiny v1.3 included 70 results across five benchmarks from four participants: Kai Jiang, Qualcomm, STMicroelectronics, and Syntiant.
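Of these metrics, energy follows directly from the other measurements: since 1 mW sustained for 1 ms is exactly 1 µJ, energy per inference is the product of average power and latency. A trivial illustrative helper:

```python
def energy_per_inference_uj(power_mw, latency_ms):
    """Energy per inference in microjoules.
    E = P * t, and mW * ms = uJ, so no unit conversion is needed."""
    return power_mw * latency_ms
```

A device drawing 50 mW that completes an inference in 10 ms therefore spends 500 µJ per inference, the kind of budget that determines battery life in always-on keyword spotting.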
MLPerf Storage measures how quickly storage systems can feed training data to accelerators during model training. Storage throughput is a critical bottleneck in large-scale training, especially when datasets are too large to fit in memory.
| Version | Workloads | Key Feature |
|---|---|---|
| v1.0 (September 2024) | 3D U-Net, ResNet-50, CosmoFlow | Added distributed training support. 55 submissions from 12 companies. |
| v2.0 (August 2025) | 3D U-Net, ResNet-50, CosmoFlow, Llama 3 checkpointing | Added real-world checkpointing tests. Over 200 results from 26 organizations across 7 countries. |
The v2.0 results showed that submitted storage systems could support roughly twice the number of simultaneous accelerators compared to v1.0, reflecting the scaling demands of modern training clusters.
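The bandwidth demand that these benchmarks measure can be estimated with back-of-envelope arithmetic: total samples consumed per second times sample size. The numbers below are illustrative, not taken from any MLPerf Storage workload:

```python
def required_storage_throughput_gbps(num_accelerators,
                                     samples_per_sec_per_accel,
                                     sample_mb):
    """Rough storage bandwidth (GB/s) needed to keep accelerators fed
    with training data, assuming no caching or prefetch headroom."""
    total_mb_per_sec = num_accelerators * samples_per_sec_per_accel * sample_mb
    return total_mb_per_sec / 1000.0
```

For instance, 64 accelerators each consuming 100 samples/s of 2 MB samples need about 12.8 GB/s of sustained read throughput, which shows why doubling supported accelerator counts between v1.0 and v2.0 is a meaningful storage result.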
MLPerf Client evaluates AI performance on personal computers, including laptops, desktops, and workstations. Announced in January 2024, this benchmark suite targets the growing "AI PC" market segment where large language models run locally on consumer hardware.
| Version | Release Date | Models Tested | Key Features |
|---|---|---|---|
| v0.5 | December 2024 | Llama 2 7B (4-bit quantized) | Tasks: content generation, creative writing, text summarization. Metrics: TTFT, tokens per second. |
| v1.0 | July 2025 | Llama 2 7B Chat, Llama 3.1 8B Instruct, Phi 3.5 Mini Instruct | AMD NPU/GPU, Intel NPU/GPU (OpenVINO), NVIDIA GPU, Qualcomm NPU, Apple GPU (MLX) support |
| v1.5 | November 2025 | Same as v1.0 plus Phi 4 Reasoning 14B (experimental) | Windows ML integration, iPad support, experimental power measurement, Linux CLI |
MLPerf Client v1.5 supports Windows x64, Windows on Arm, macOS, Linux, and iPad, making it the most cross-platform MLPerf benchmark. It is developed collaboratively by AMD, Intel, Microsoft, NVIDIA, Qualcomm, and major PC OEMs.
MLPerf Automotive is jointly developed by MLCommons and the Autonomous Vehicle Computing Consortium (AVCC) to evaluate inference performance for automotive AI systems. The benchmark covers Advanced Driver Assistance Systems (ADAS), autonomous driving (AD), and in-vehicle infotainment (IVI).
| Benchmark | Task | Model/Method |
|---|---|---|
| 2D Object Detection | Detect vehicles, pedestrians, etc. in camera images | Various detectors on Cognata 8MP imagery |
| 2D Semantic Segmentation | Pixel-level scene classification | Segmentation models on Cognata dataset |
| 3D Object Detection | Detect objects in 3D space from lidar and camera fusion | PointPainting on nuScenes dataset |
The 3D object detection benchmark uses PointPainting, a sensor-fusion technique that combines image-based semantic segmentation with lidar point cloud data. Future versions are expected to incorporate vision-language-action (VLA) models for end-to-end self-driving evaluation.
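The core of PointPainting can be sketched with NumPy: project each lidar point into the image plane and append that pixel's segmentation class scores to the point. This is a simplified illustration assuming points are already in the camera frame; real pipelines also handle extrinsics, lens distortion, and multiple cameras, and `paint_points` is a hypothetical helper:

```python
import numpy as np

def paint_points(points, seg_scores, K):
    """Append per-pixel class scores to lidar points (PointPainting-style).
    points:     (N, 3) lidar points in the camera coordinate frame.
    seg_scores: (H, W, C) softmax output of an image segmentation network.
    K:          (3, 3) camera intrinsic matrix.
    Returns:    (N, 3 + C) 'painted' points."""
    H, W, C = seg_scores.shape
    uvw = points @ K.T                              # project to image plane
    z = uvw[:, 2]
    u = np.round(uvw[:, 0] / np.maximum(z, 1e-6)).astype(int)
    v = np.round(uvw[:, 1] / np.maximum(z, 1e-6)).astype(int)
    # Keep only points in front of the camera that land inside the image.
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    scores = np.zeros((points.shape[0], C))         # zeros for unpainted points
    scores[valid] = seg_scores[v[valid], u[valid]]
    return np.hstack([points, scores])
```

The painted point cloud is then fed to a standard lidar detector, letting it exploit image semantics without changing its 3D architecture.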
MLPerf results serve several purposes across the AI industry:
Hardware Vendors: Chip makers like NVIDIA, AMD, Intel, and Google use MLPerf to demonstrate the performance advantages of new processor generations. NVIDIA has been a particularly dominant and consistent top performer, using MLPerf results prominently in product launches for its Hopper and Blackwell GPU architectures.
Cloud Providers: Companies like CoreWeave, Oracle Cloud, Google Cloud, Microsoft Azure, and Lambda use MLPerf to validate their infrastructure offerings. Cloud-based submissions allow potential customers to see performance on rentable systems.
System Integrators: Server manufacturers like Dell, HPE, Lenovo, Supermicro, and Cisco submit results to showcase the performance of their configured systems, helping enterprise buyers compare turnkey solutions.
AI Startups and Chip Designers: Newer entrants like Cerebras, SambaNova, Groq, and others have used MLPerf submissions (or plan to) to establish credibility and benchmark their novel architectures against established players.
Researchers: Academic institutions use MLPerf HPC and other suites to measure the ML capabilities of research clusters and supercomputers, informing procurement and system design decisions.
MLPerf employs several mechanisms to ensure fair and meaningful comparisons, including the Closed/Open division rules, peer review of submitted results by other submitters, and a regularly refreshed workload set. The twice-yearly cadence and evolving workload selection keep the benchmarks relevant as the field progresses: retired workloads (such as the MiniGo reinforcement learning benchmark, the original NMT machine translation task, GPT-3, and BERT) are replaced with newer, more representative models.
While MLPerf is widely respected, some criticisms have been raised: vendors tend to submit results only for benchmarks on which their systems perform well, the engineering effort required for a competitive submission favors large companies, and the fixed models of the Closed Division can lag behind the techniques actually used in production. The following table summarizes the release history of the major suites:
| Suite | Version | Date | Key Changes |
|---|---|---|---|
| Training | v0.5 | December 2018 | Initial release with ResNet-50, Transformer, NMT, MiniGo, Mask R-CNN, SSD |
| Training | v0.7 | July 2020 | Added BERT, DLRM |
| Training | v1.0 | June 2021 | Updated models and datasets |
| Training | v3.0 | June 2023 | GPT-3 LLM pretraining added |
| Training | v3.1 | November 2023 | Stable Diffusion added |
| Training | v4.0 | June 2024 | Power measurements introduced; Llama 2 70B LoRA fine-tuning and R-GAT added |
| Training | v5.0 | June 2025 | Llama 3.1 405B replaces GPT-3; record submissions |
| Training | v5.1 | November 2025 | Llama 3.1 8B replaces BERT; Flux.1 replaces Stable Diffusion |
| Inference | v0.5 | June 2019 | Initial release |
| Inference | v4.0 | March 2024 | Llama 2 70B, SDXL added |
| Inference | v4.1 | August 2024 | Mixtral 8x7B added |
| Inference | v5.0 | April 2025 | Llama 3.1 405B, R-GAT, Automotive PointPainting, Interactive scenario |
| Inference | v5.1 | September 2025 | DeepSeek-R1, Llama 3.1 8B, Whisper Large V3 added; 27 submitters |
| HPC | v0.5 | November 2019 | DeepCAM, CosmoFlow |
| HPC | v3.0 | November 2023 | OpenFold protein folding added; 49x gains over first results |
| Tiny | v0.5 | June 2021 | KWS, VWW, IC, AD |
| Tiny | v1.3 | September 2025 | Streaming KWS benchmark added |
| Storage | v0.5 | 2023 | Initial preview |
| Storage | v1.0 | September 2024 | Distributed training support |
| Storage | v2.0 | August 2025 | Llama 3 checkpointing; 200+ results |
| Client | v0.5 | December 2024 | Initial release for AI PCs |
| Client | v1.5 | November 2025 | Windows ML, iPad, power measurement |
| Automotive | v0.5 | August 2025 | 2D/3D object detection, semantic segmentation |