MLCommons
Last reviewed
Sources
30 citations
Review status
Source-backed
Revision
v2 · 2,351 words
MLCommons is a nonprofit artificial intelligence engineering consortium that develops industry-standard benchmarks, open datasets, and measurement tools for machine learning systems. It is best known for the MLPerf benchmark suites, which measure how fast hardware and software can train and serve AI models, and more recently for AILuminate, a benchmark for AI safety. The organization grew out of the MLPerf effort launched in May 2018 and was formally incorporated as the MLCommons Association, launching publicly on December 3, 2020 with more than 50 founding members [1][2]. As of June 2026 it counts more than 125 member companies and affiliates, and its twice-yearly MLPerf results rounds have become the de facto scoreboard for the AI hardware industry [3].
MLCommons describes its mission as "to accelerate artificial intelligence innovation and increase its positive impact on society," and organizes its work into three broad areas: performance benchmarks and metrics (the MLPerf family), open public datasets, and measurement of AI risk and reliability (AILuminate) [3]. The association is registered as a 501(c)(6) nonprofit in Delaware [4]. Technical work happens in open working groups whose members span chipmakers, cloud providers, server OEMs, AI startups, and universities; benchmark results are peer reviewed by submitters before each publication round. The consortium positions itself, in its own words, as building benchmarks and data to "make AI better for everyone" [3].
MLPerf was announced on May 2, 2018 by a group of researchers and engineers spanning Baidu, Google, Harvard University, Stanford University, and the University of California, Berkeley, with backing from chip vendors including Intel and AMD [5][6]. The stated ambition was to create a SPEC-like standard benchmark for machine learning, covering both training and inference from mobile devices to cloud systems [6]. The first MLPerf Training results were published in December 2018, with submissions from Google, Intel, and NVIDIA; the first MLPerf Inference results followed in November 2019.
On December 3, 2020 the effort was reorganized as MLCommons, a parent consortium with a broader mandate spanning benchmarks, datasets, and best practices. The founding board included representatives of Alibaba, Facebook AI (now Meta AI), Google, Intel, and NVIDIA, along with Harvard professor Vijay Janapa Reddi, and the organization launched with more than 50 founding members [1]. David Kanter, a semiconductor analyst and MLPerf co-founder, served as founding executive director. Under a leadership evolution announced in 2024, data scientist Rebecca Weiss (formerly head of research and innovation at Mozilla) became executive director, while Kanter took the dedicated role of Head of MLPerf; Peter Mattson of Google, who founded the MLPerf consortium, serves as president [7][8].
MLPerf is a family of full-system benchmark suites. Training benchmarks measure the time to train reference models to a target quality, while inference benchmarks measure throughput and latency across scenarios such as server, offline, and single-stream. Submissions fall into a Closed division, which requires mathematically equivalent models for direct apples-to-apples comparison, and an Open division that permits model modifications. Optional power measurement was added to MLPerf Inference in April 2021 [9]. MLCommons requires that numbers advertised as MLPerf results be formally submitted and peer reviewed; unofficial measurements must be labeled as unverified.
| Suite | What it measures | First results | Latest round (as of June 2026) |
|---|---|---|---|
| MLPerf Training | Time to train models to target quality | December 2018 | v5.1 (November 2025) |
| MLPerf Inference | Datacenter and edge serving throughput and latency | November 2019 | v6.0 (April 2026) |
| MLPerf Storage | Storage performance under AI training I/O and checkpointing | September 2023 | v2.0 (August 2025) |
| MLPerf Client | LLM performance on PCs and consumer devices | v1.0 (July 2025) | v1.5 (November 2025) |
| MLPerf Automotive | In-vehicle AI inference, jointly with AVCC | v0.5 (August 2025) | v0.5 (August 2025) |
| MLPerf Endpoints | Generative AI cloud and on-prem serving endpoints (API-centric) | v0.5 (March 2026) | v0.5 (March 2026) |
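The throughput and latency metrics that the inference suite reports can be made concrete with a small sketch. The Python below is illustrative only and is not MLPerf's actual measurement harness: the function names, the nearest-rank percentile method, the choice of the 99th percentile, and the sample latencies are all assumptions made for this example.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize_run(latencies_ms, wall_clock_s):
    """Summarize one hypothetical inference run (names are illustrative)."""
    return {
        # Completed queries divided by elapsed wall-clock time.
        "throughput_qps": len(latencies_ms) / wall_clock_s,
        # Tail latency: the value that 99% of queries did not exceed.
        "p99_latency_ms": percentile_nearest_rank(latencies_ms, 99),
    }


# Made-up data: 100 queries with latencies of 1..100 ms, finished in 10 s.
latencies_ms = list(range(1, 101))
summary = summarize_run(latencies_ms, wall_clock_s=10.0)
print(summary)
```

Reporting a tail percentile rather than a mean reflects why latency-sensitive scenarios are treated separately from throughput-oriented ones: a system can post high average throughput while still serving a small share of queries slowly.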
The suites have tracked the field's shift toward generative AI. MLPerf Training v5.1, released November 12, 2025, drew 20 submitting organizations and a record 65 unique systems using 12 different accelerator types; its workloads center on Llama 3.1 405B pretraining, a new Llama 3.1 8B benchmark that replaced BERT, Llama 2 70B LoRA fine-tuning, and the Flux.1 image generator, which replaced Stable Diffusion v2 [10]. On the Llama 3.1 405B test, NVIDIA reported a 10-minute time-to-train using 5,120 Blackwell GPUs, a roughly 2.7x speedup over its fastest Blackwell submission in the prior round [10]. MLPerf Inference v5.1 (September 9, 2025) set a participation record with 27 submitters and added DeepSeek-R1 (the suite's first reasoning model), Llama 3.1 8B, and Whisper Large V3 [11]. MLPerf Inference v6.0, published April 1, 2026 with 24 submitters, was described by working group co-chair Frank Han, a Dell Technologies engineer, as "the most significant revision of the Inference benchmark suite that we've ever done" [12]. Five of eleven datacenter tests were new or updated, including gpt-oss 120B, an interactive DeepSeek-R1 scenario, Meta's DLRMv3 recommender, a text-to-video benchmark, and a vision-language benchmark built on Shopify product catalog data, plus YOLOv11 object detection for edge systems. The round also reflected the industry's move to rack-scale inference, with multi-node submissions up 30 percent and the largest entry spanning 72 nodes and 288 accelerators [12].
Beyond the flagship suites, MLPerf Storage v2.0 (August 2025) added tests replicating real-world checkpointing for large training clusters [13]; MLPerf Automotive, developed with the Autonomous Vehicle Computing Consortium, published its first v0.5 results on August 27, 2025 [14]; and MLPerf Client v1.0 (July 30, 2025) benchmarks LLMs such as Llama 3.1 8B and Phi 3.5 Mini on AI PCs across NPUs and GPUs from AMD, Intel, NVIDIA, Qualcomm, and Apple, with v1.5 (November 18, 2025) adding Windows ML support [15][16]. In March 2026 MLCommons introduced MLPerf Endpoints, an API-centric benchmark for generative AI services in which, as the consortium put it, "the benchmark client is lightweight and production-ready; the system under test is simply a URL"; the v0.5 demonstration included submissions from AMD, Google, Intel, KRAI, and NVIDIA measuring metrics such as time-to-first-token and tokens per second across models including DeepSeek-R1, Llama 3.1 8B, and Qwen3 Coder 480B [30]. Earlier spin-offs include MLPerf Mobile, a smartphone benchmark app; MLPerf Tiny, for microcontroller-class devices, first published in June 2021; and MLPerf HPC, for scientific machine learning on supercomputers, introduced in November 2020.
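The endpoint metrics named above, time-to-first-token and tokens per second, can be sketched from the timestamps at which a streaming response delivers its tokens. This is a minimal illustration under stated assumptions, not the MLPerf Endpoints client: the function name and the timestamps are hypothetical, and definitions of tokens per second vary (some exclude the first token), so the formula here is only one plausible choice.

```python
def streaming_metrics(request_sent_s, token_times_s):
    """Compute two streaming metrics from arrival timestamps (in seconds).

    request_sent_s: time the request was sent.
    token_times_s:  arrival time of each generated token, in order.
    """
    if not token_times_s:
        raise ValueError("at least one token timestamp is required")
    time_to_first_token_s = token_times_s[0] - request_sent_s
    total_time_s = token_times_s[-1] - request_sent_s
    return {
        "time_to_first_token_s": time_to_first_token_s,
        # Assumption: count every token over the full request duration.
        "tokens_per_second": len(token_times_s) / total_time_s,
    }


# Hypothetical response: four tokens arriving every half second.
metrics = streaming_metrics(0.0, [0.5, 1.0, 1.5, 2.0])
print(metrics)
```

Because both values are derived purely from what a client observes over the network, measurements of this kind fit a setting where the system under test is exposed only as a URL.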
In December 2024 MLCommons released AILuminate v1.0, a safety benchmark for general-purpose chat systems developed by its AI Risk and Reliability working group with input from AI companies, academics, and civil society organizations, including a design collaboration with Singapore's AI Verify Foundation [17]. The benchmark tests how well a model resists prompts designed to elicit harmful responses across twelve hazard categories, including violent crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, hate, privacy, defamation, and unqualified specialized advice. Each language version uses 24,000 test prompts, split between 12,000 public practice prompts and 12,000 private prompts held out for official testing, with responses scored by a tuned ensemble of evaluator models; systems receive grades on a five-step scale from Poor to Excellent [17][18]. Announcing the launch, MLCommons founder and president Peter Mattson said, "Companies are increasingly incorporating AI into their products, but they have no standardized way of evaluating product safety," while executive director Rebecca Weiss called the release "a major milestone in our work to build a harmonized approach to safer AI" [17].
AILuminate launched in English with public grades for widely used chat systems. A French version followed at the Paris AI Action Summit in February 2025 as part of an AILuminate v1.1 update, with Chinese and Hindi versions in development [17][19]. In October 2025 MLCommons added an AILuminate Jailbreak benchmark (v0.5) that quantifies a "Resilience Gap" between a system's baseline safety and its behavior under adversarial jailbreak attacks [20].
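The idea of a gap between baseline safety and safety under attack can be illustrated with simple arithmetic. The exact formula the Jailbreak benchmark uses is not given here, so the sketch below assumes the gap is the plain difference between two safe-response rates; the function names and the sample judgments are hypothetical.

```python
def safe_rate(is_safe_flags):
    """Fraction of responses judged safe (True) by an evaluator."""
    if not is_safe_flags:
        raise ValueError("at least one judgment is required")
    return sum(is_safe_flags) / len(is_safe_flags)


def resilience_gap(baseline_flags, attacked_flags):
    """Assumed definition: baseline safe rate minus safe rate under attack."""
    return safe_rate(baseline_flags) - safe_rate(attacked_flags)


# Made-up judgments for four prompts, without and with jailbreak attacks.
baseline = [True, True, True, False]   # 75% safe
attacked = [True, False, True, False]  # 50% safe
gap = resilience_gap(baseline, attacked)
print(gap)
```

Under this assumed definition, a larger value means the system's safety degrades more when it is attacked, and a value of zero means attacks made no measurable difference.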
MLCommons builds large open datasets aimed at lowering barriers to ML research. In December 2021 it released the People's Speech, a roughly 30,000-hour supervised English speech recognition dataset under Creative Commons licensing, alongside the Multilingual Spoken Words Corpus, which contains over 340,000 spoken keywords across 50 languages with more than 23 million audio examples [21]. "Speech technology can empower billions of people across the planet, but there's a real need for large, open, and diverse datasets to catalyze innovation," David Kanter said when the datasets were announced [21]. In January 2025 the datasets working group, in collaboration with Hugging Face, released the Unsupervised People's Speech, more than one million hours of multilingual audio drawn from permissively licensed Archive.org material [22].
The consortium also develops dataset infrastructure. Croissant, announced in March 2024, is a machine-readable metadata format for ML datasets that builds on schema.org and is supported by repositories including Hugging Face, Kaggle, and OpenML, allowing described datasets to be loaded directly into frameworks such as PyTorch, TensorFlow, and JAX [23]. Other efforts include MedPerf, an open platform for benchmarking medical AI models on distributed clinical data via federated evaluation, described in a 2023 Nature Machine Intelligence paper [24], and stewardship of Dynabench, a platform for dynamic, human-in-the-loop benchmarking originally created at Facebook AI Research.
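To give a sense of what machine-readable dataset metadata built on schema.org looks like, the sketch below assembles a simplified, Croissant-style JSON-LD description of a single CSV file using only the Python standard library. It is an illustration rather than a conformant Croissant document: the context is heavily trimmed, the output is not validated against the specification, and the helper name, dataset name, URL, and column names are all hypothetical.

```python
import json


def build_croissant_like_metadata(name, csv_url, columns):
    """Build a simplified, Croissant-style description of one CSV file.

    columns maps each column name to a schema.org data type string.
    """
    file_id = "data.csv"
    fields = [
        {
            "@type": "cr:Field",
            "@id": f"records/{column}",
            "name": column,
            "dataType": data_type,
            # Each field points back to the file and column it comes from.
            "source": {
                "fileObject": {"@id": file_id},
                "extract": {"column": column},
            },
        }
        for column, data_type in columns.items()
    ]
    return {
        # Trimmed context; a real document declares many more terms.
        "@context": {
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        "recordSet": [
            {"@type": "cr:RecordSet", "@id": "records", "field": fields}
        ],
    }


# Hypothetical dataset with two columns.
metadata = build_croissant_like_metadata(
    name="example-reviews",
    csv_url="https://example.org/reviews.csv",
    columns={"text": "sc:Text", "label": "sc:Integer"},
)
print(json.dumps(metadata, indent=2))
```

Separating the description of files (the distribution) from the description of records and fields is what lets a loader map raw files onto typed examples without dataset-specific code.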
The MLCommons Association is a member-funded 501(c)(6) industry consortium, with more than 125 members and affiliates including major chipmakers (NVIDIA, AMD, Intel, Qualcomm, Arm), hyperscalers and cloud providers (Google, Microsoft, Oracle), server OEMs (Dell, HPE, Lenovo, Supermicro), AI startups, and academic affiliates [3][4]. It is governed by a board of directors; as of 2026 Peter Mattson (Google) serves as president, Vijay Janapa Reddi (Harvard) and Carole-Jean Wu (Meta) as vice presidents, and the board includes representatives of NVIDIA, Intel, Qualcomm, Graphcore, and Myrtle.ai, with David Kanter as Head of MLPerf and Rebecca Weiss as executive director [7][8]. Day-to-day technical work is organized in open working groups for each benchmark suite, datasets, and AI risk and reliability.
MLPerf functions as the AI hardware industry's most visible neutral scoreboard, and its results rounds are a fixture of chip marketing. NVIDIA has submitted to every round since 2018 and regularly tops the charts; press coverage of MLPerf rounds in the early 2020s often led with variations of "NVIDIA wins again" [9], and HPCwire calculated that roughly 90 percent of systems submitted to Training v2.0 in June 2022 used NVIDIA accelerators [25]. Vendor participation is itself a competitive signal. AMD did not submit datacenter GPU results until MLPerf Inference v4.1 in August 2024, when its Instinct MI300X posted Llama 2 70B results roughly in line with NVIDIA's H100, and it made its first MLPerf Training submission in June 2025 [26][27]. Google submits TPU results selectively, while several prominent accelerator startups, including Cerebras, Groq, and SambaNova, have historically declined to participate; analysts attribute this to the substantial engineering cost of producing optimized, peer-reviewed submissions and to startups' need to focus resources on customers rather than benchmarks [28]. Critics have also noted that MLPerf reports neither prices nor, in most submissions, power consumption, limiting its usefulness for real-world price-performance comparisons [29].
Despite these critiques, participation has grown steadily, with record submitter counts and system diversity in the 2025 and 2026 rounds [11][12], and MLPerf results are widely used in procurement evaluations, academic systems research, and public claims about AI hardware progress. With AILuminate, MLCommons is attempting to extend the same neutral, consensus-based measurement model from performance to safety.