MLCommons

Updated Jun 27, 2026

MLCommons is a nonprofit artificial intelligence engineering consortium that develops industry-standard benchmarks, open datasets, and measurement tools for machine learning systems. It is best known for the MLPerf benchmark suites, which measure how fast hardware and software can train and serve AI models, and more recently for AILuminate, a benchmark for AI safety. The organization grew out of the MLPerf effort launched in May 2018 and was formally incorporated as the MLCommons Association, launching publicly on December 3, 2020 with more than 50 founding members ^[1]^[2]. As of June 2026 it counts more than 125 member companies and affiliates, and its twice-yearly MLPerf results rounds have become the de facto scoreboard for the AI hardware industry ^[3].

What is MLCommons?

MLCommons describes its mission as "to accelerate artificial intelligence innovation and increase its positive impact on society," and organizes its work into three broad areas: performance benchmarks and metrics (the MLPerf family), open public datasets, and measurement of AI risk and reliability (AILuminate) ^[3]. The association is registered as a 501(c)(6) nonprofit in Delaware ^[4]. Technical work happens in open working groups whose members span chipmakers, cloud providers, server OEMs, AI startups, and universities; benchmark results are peer reviewed by submitters before each publication round. The consortium positions itself, in its own words, as building benchmarks and data to "make AI better for everyone" ^[3].

When was MLCommons founded?

MLPerf was announced on May 2, 2018 by a group of researchers and engineers spanning Baidu, Google, Harvard University, Stanford University, and the University of California, Berkeley, with backing from chip vendors including Intel and AMD ^[5]^[6]. The stated ambition was to create a SPEC-like standard benchmark for machine learning, covering both training and inference from mobile devices to cloud systems ^[6]. The first MLPerf Training results were published in December 2018, with submissions from Google, Intel, and NVIDIA; the first MLPerf Inference results followed in November 2019.

On December 3, 2020 the effort was reorganized as MLCommons, a parent consortium with a broader mandate spanning benchmarks, datasets, and best practices. The founding board included representatives of Alibaba, Facebook AI (now Meta AI), Google, Intel, and NVIDIA, along with Harvard professor Vijay Janapa Reddi, and the organization launched with more than 50 founding members ^[1]. David Kanter, a semiconductor analyst and MLPerf co-founder, served as founding executive director. Under a leadership evolution announced in 2024, data scientist Rebecca Weiss (formerly head of research and innovation at Mozilla) became executive director, while Kanter took the dedicated role of Head of MLPerf; Peter Mattson of Google, who founded the MLPerf consortium, serves as president ^[7]^[8].

What is MLPerf?

MLPerf is a family of full-system benchmark suites. Training benchmarks measure the time to train reference models to a target quality, while inference benchmarks measure throughput and latency across scenarios such as server, offline, and single-stream. Submissions fall into a Closed division, which requires mathematically equivalent models for direct apples-to-apples comparison, and an Open division that permits model modifications. Optional power measurement was added to MLPerf Inference in April 2021 ^[9]. MLCommons requires that numbers advertised as MLPerf results be formally submitted and peer reviewed; unofficial measurements must be labeled as unverified.
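The two kinds of measurement described above can be illustrated with a simplified sketch. It is not the official MLPerf harness and does not reproduce the benchmark rules; the function names, the nearest-rank percentile method, and all numbers are assumptions chosen for illustration.

```python
import math


def time_to_train(eval_log, target_quality):
    """Return elapsed seconds at the first evaluation that reaches the target.

    eval_log is a list of (elapsed_seconds, quality) pairs in time order.
    Returns None if the target quality is never reached.
    """
    for elapsed_s, quality in eval_log:
        if quality >= target_quality:
            return elapsed_s
    return None


def throughput(num_samples, wall_time_s):
    """Samples processed per second over a whole run."""
    return num_samples / wall_time_s


def percentile_latency(latencies_s, pct):
    """Nearest-rank percentile of per-query latencies, in seconds."""
    ordered = sorted(latencies_s)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up numbers, for illustration only.
eval_log = [(600, 0.70), (1200, 0.74), (1800, 0.76)]
print(time_to_train(eval_log, target_quality=0.75))
print(throughput(num_samples=5000, wall_time_s=20.0))
print(percentile_latency([0.010, 0.012, 0.011, 0.050, 0.013], pct=90))
```

A training score in this sketch is a single elapsed time, so lower is better, while the inference side reports a rate and a tail latency that have to be read together.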

Suite	What it measures	First results	Latest round (as of June 2026)
MLPerf Training	Time to train models to target quality	December 2018	v5.1 (November 2025)
MLPerf Inference	Datacenter and edge serving throughput and latency	November 2019	v6.0 (April 2026)
MLPerf Storage	Storage performance under AI training I/O and checkpointing	September 2023	v2.0 (August 2025)
MLPerf Client	LLM performance on PCs and consumer devices	v1.0 (July 2025)	v1.5 (November 2025)
MLPerf Automotive	In-vehicle AI inference, jointly with AVCC	v0.5 (August 2025)	v0.5 (August 2025)
MLPerf Endpoints	Generative AI cloud and on-prem serving endpoints (API-centric)	v0.5 (March 2026)	v0.5 (March 2026)

The suites have tracked the field's shift toward generative AI. MLPerf Training v5.1, released November 12, 2025, drew 20 submitting organizations and a record 65 unique systems using 12 different accelerator types; its workloads center on Llama 3.1 405B pretraining, a new Llama 3.1 8B benchmark that replaced BERT, Llama 2 70B LoRA fine-tuning, and the Flux.1 image generator, which replaced Stable Diffusion v2 ^[10]. On the Llama 3.1 405B test, NVIDIA reported a 10-minute time-to-train using 5,120 Blackwell GPUs, a roughly 2.7x speedup over its fastest Blackwell submission in the prior round ^[10].

MLPerf Inference v5.1 (September 9, 2025) set a participation record with 27 submitters and added DeepSeek-R1 (the suite's first reasoning model), Llama 3.1 8B, and Whisper Large V3 ^[11]. MLPerf Inference v6.0, published April 1, 2026 with 24 submitters, was described by working group co-chair Frank Han, a Dell Technologies engineer, as "the most significant revision of the Inference benchmark suite that we've ever done" ^[12]. Five of eleven datacenter tests were new or updated, including gpt-oss 120B, an interactive DeepSeek-R1 scenario, Meta's DLRMv3 recommender, a text-to-video benchmark, and a vision-language benchmark built on Shopify product catalog data, plus YOLOv11 object detection for edge systems. The round also reflected the industry's move to rack-scale inference, with multi-node submissions up 30 percent and the largest entry spanning 72 nodes and 288 accelerators ^[12].

Beyond the flagship suites, MLPerf Storage v2.0 (August 2025) added tests replicating real-world checkpointing for large training clusters ^[13]; MLPerf Automotive, developed with the Autonomous Vehicle Computing Consortium, published its first v0.5 results on August 27, 2025 ^[14]; and MLPerf Client v1.0 (July 30, 2025) benchmarks LLMs such as Llama 3.1 8B and Phi 3.5 Mini on AI PCs across NPUs and GPUs from AMD, Intel, NVIDIA, Qualcomm, and Apple, with v1.5 (November 18, 2025) adding Windows ML support ^[15]^[16]. In March 2026 MLCommons introduced MLPerf Endpoints, an API-centric benchmark for generative AI services in which, as the consortium put it, "the benchmark client is lightweight and production-ready; the system under test is simply a URL"; the v0.5 demonstration included submissions from AMD, Google, Intel, KRAI, and NVIDIA measuring metrics such as time-to-first-token and tokens per second across models including DeepSeek-R1, Llama 3.1 8B, and Qwen3 Coder 480B ^[30]. Earlier spin-offs include MLPerf Mobile, a smartphone benchmark app; MLPerf Tiny, for microcontroller-class devices, first published in June 2021; and MLPerf HPC, for scientific machine learning on supercomputers, introduced in November 2020.
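The two serving metrics named for MLPerf Endpoints, time-to-first-token and tokens per second, can be derived from the arrival times of streamed tokens. The sketch below is a minimal illustration under assumed definitions (tokens per second is taken over the whole request, including the wait for the first token); the function name and timestamps are hypothetical, and the benchmark's own definitions may differ.

```python
def streaming_metrics(request_sent_s, token_arrival_s):
    """Compute (time_to_first_token, tokens_per_second) for one request.

    request_sent_s: time the request was sent, in seconds.
    token_arrival_s: arrival time of each streamed token, in order.
    """
    if not token_arrival_s:
        raise ValueError("no tokens were received")
    ttft = token_arrival_s[0] - request_sent_s
    total_time = token_arrival_s[-1] - request_sent_s
    tokens_per_second = len(token_arrival_s) / total_time
    return ttft, tokens_per_second


# Made-up timestamps: four tokens arriving half a second apart.
ttft, tps = streaming_metrics(0.0, [0.5, 1.0, 1.5, 2.0])
print(ttft, tps)
```

Because both values come from client-side timestamps alone, a measurement of this kind needs nothing from the system under test beyond a reachable endpoint.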

What is AILuminate?

In December 2024 MLCommons released AILuminate v1.0, a safety benchmark for general-purpose chat systems developed by its AI Risk and Reliability working group with input from AI companies, academics, and civil society organizations, including a design collaboration with Singapore's AI Verify Foundation ^[17]. The benchmark tests how well a system resists prompts designed to elicit harmful responses across twelve hazard categories, including violent crimes, child sexual exploitation, indiscriminate weapons, suicide and self-harm, hate, privacy, defamation, and unqualified specialized advice. Each language version uses 24,000 test prompts, split between 12,000 public practice prompts and 12,000 private prompts held out for official testing, with responses scored by a tuned ensemble of evaluator models; systems receive grades on a five-step scale from Poor to Excellent ^[17]^[18]. Announcing the launch, MLCommons founder and president Peter Mattson said, "Companies are increasingly incorporating AI into their products, but they have no standardized way of evaluating product safety," while executive director Rebecca Weiss called the release "a major milestone in our work to build a harmonized approach to safer AI" ^[17].

AILuminate launched in English with public grades for widely used chat systems. A French version followed at the Paris AI Action Summit in February 2025 as part of an AILuminate v1.1 update, with Chinese and Hindi versions in development ^[17]^[19]. In October 2025 MLCommons added an AILuminate Jailbreak benchmark (v0.5) that quantifies a "Resilience Gap" between a system's baseline safety and its behavior under adversarial jailbreak attacks ^[20].
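The scoring ideas in this section, an ensemble of evaluators judging each response and a gap between baseline and under-attack safety, can be sketched in a few lines. Everything here is an assumption made for illustration: the majority-vote rule, the treatment of ties as unsafe, the definition of the gap as a simple difference in safe-response rates, and all function names. The source does not specify how AILuminate combines evaluators or computes the Resilience Gap.

```python
from collections import Counter


def ensemble_verdict(votes):
    """Majority vote over evaluator labels ('safe' or 'unsafe').

    Assumption: ties are treated as unsafe.
    """
    counts = Counter(votes)
    return "safe" if counts["safe"] > counts["unsafe"] else "unsafe"


def safe_rate(per_prompt_votes):
    """Fraction of prompts whose ensemble verdict is 'safe'."""
    verdicts = [ensemble_verdict(votes) for votes in per_prompt_votes]
    return verdicts.count("safe") / len(verdicts)


def resilience_gap(baseline_votes, jailbreak_votes):
    """Assumed definition: baseline safe rate minus safe rate under attack."""
    return safe_rate(baseline_votes) - safe_rate(jailbreak_votes)


# Made-up evaluator votes for four prompts, three evaluators each.
baseline = [
    ["safe", "safe", "unsafe"],
    ["safe", "safe", "safe"],
    ["safe", "unsafe", "safe"],
    ["unsafe", "unsafe", "safe"],
]
jailbreak = [
    ["unsafe", "unsafe", "safe"],
    ["safe", "safe", "safe"],
    ["unsafe", "safe", "unsafe"],
    ["unsafe", "unsafe", "unsafe"],
]
print(safe_rate(baseline), safe_rate(jailbreak), resilience_gap(baseline, jailbreak))
```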

Datasets and other work

MLCommons builds large open datasets aimed at lowering barriers to ML research. In December 2021 it released the People's Speech, a roughly 30,000-hour supervised English speech recognition dataset under Creative Commons licensing, alongside the Multilingual Spoken Words Corpus, which contains over 340,000 spoken keywords across 50 languages with more than 23 million audio examples ^[21]. "Speech technology can empower billions of people across the planet, but there's a real need for large, open, and diverse datasets to catalyze innovation," David Kanter said when the datasets were announced ^[21]. In January 2025 the datasets working group, in collaboration with Hugging Face, released the Unsupervised People's Speech, more than one million hours of multilingual audio drawn from permissively licensed Archive.org material ^[22].

The consortium also develops dataset infrastructure. Croissant, announced in March 2024, is a machine-readable metadata format for ML datasets that builds on schema.org and is supported by repositories including Hugging Face, Kaggle, and OpenML, allowing described datasets to be loaded directly into frameworks such as PyTorch, TensorFlow, and JAX ^[23]. Other efforts include MedPerf, an open platform for benchmarking medical AI models on distributed clinical data via federated evaluation, described in a 2023 Nature Machine Intelligence paper ^[24], and stewardship of Dynabench, a platform for dynamic, human-in-the-loop benchmarking originally created at Facebook AI Research.
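To make the idea of machine-readable dataset metadata concrete, the sketch below builds a small Croissant-style JSON-LD record as a Python dictionary. The property names are modeled on the Croissant 1.0 vocabulary, but the record is abbreviated and unvalidated, and the dataset name, URL, and column names are hypothetical placeholders.

```python
import json

# Hypothetical dataset: the name, URL, and columns are placeholders.
metadata = {
    # Abbreviated context; the Croissant specification prescribes a longer one.
    "@context": {
        "@vocab": "https://schema.org/",
        "sc": "https://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
    },
    "@type": "sc:Dataset",
    "name": "toy_reviews",
    "conformsTo": "http://mlcommons.org/croissant/1.0",
    # Files that make up the dataset.
    "distribution": [
        {
            "@type": "cr:FileObject",
            "@id": "reviews.csv",
            "contentUrl": "https://example.org/reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
    # Logical records and the file columns they are read from.
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "reviews",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": "reviews/text",
                    "dataType": "sc:Text",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "text"},
                    },
                },
                {
                    "@type": "cr:Field",
                    "@id": "reviews/rating",
                    "dataType": "sc:Integer",
                    "source": {
                        "fileObject": {"@id": "reviews.csv"},
                        "extract": {"column": "rating"},
                    },
                },
            ],
        }
    ],
}


def field_ids(meta):
    """List the field identifiers declared across all record sets."""
    return [f["@id"] for rs in meta["recordSet"] for f in rs["field"]]


print(json.dumps(metadata, indent=2))
print(field_ids(metadata))
```

Separating the file description (`distribution`) from the record description (`recordSet`) is what lets a loader map raw files onto typed fields without dataset-specific code.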

Membership and governance

The MLCommons Association is a member-funded 501(c)(6) industry consortium, with more than 125 members and affiliates including major chipmakers (NVIDIA, AMD, Intel, Qualcomm, Arm), hyperscalers and cloud providers (Google, Microsoft, Oracle), server OEMs (Dell, HPE, Lenovo, Supermicro), AI startups, and academic affiliates ^[3]^[4]. It is governed by a board of directors; as of 2026 Peter Mattson (Google) serves as president, Vijay Janapa Reddi (Harvard) and Carole-Jean Wu (Meta) as vice presidents, and the board includes representatives of NVIDIA, Intel, Qualcomm, Graphcore, and Myrtle.ai, with David Kanter as Head of MLPerf and Rebecca Weiss as executive director ^[7]^[8]. Day-to-day technical work is organized in open working groups for each benchmark suite, datasets, and AI risk and reliability.

Why does MLPerf matter?

MLPerf functions as the AI hardware industry's most visible neutral scoreboard, and its results rounds are a fixture of chip marketing. NVIDIA has submitted to every round since 2018 and regularly tops the charts; press coverage of MLPerf rounds in the early 2020s often led with variations of "NVIDIA wins again" ^[9], and HPCwire calculated that roughly 90 percent of systems submitted to Training v2.0 in June 2022 used NVIDIA accelerators ^[25]. Vendor participation is itself a competitive signal. AMD did not submit datacenter GPU results until MLPerf Inference v4.1 in August 2024, when its Instinct MI300X posted Llama 2 70B results roughly in line with NVIDIA's H100, and it made its first MLPerf Training submission in June 2025 ^[26]^[27]. Google submits TPU results selectively, while several prominent accelerator startups, including Cerebras, Groq, and SambaNova, have historically declined to participate; analysts attribute this to the substantial engineering cost of producing optimized, peer-reviewed submissions and to startups' need to focus resources on customers rather than benchmarks ^[28]. Critics have also noted that MLPerf reports neither prices nor, in most submissions, power consumption, limiting its usefulness for real-world price-performance comparisons ^[29].

Despite these critiques, participation remains broad: MLPerf Inference v5.1 set a record with 27 submitters in 2025, and multi-node submissions rose 30 percent in Inference v6.0 in 2026 ^[11]^[12]. MLPerf results are widely used in procurement evaluations, academic systems research, and public claims about AI hardware progress. With AILuminate, MLCommons is attempting to extend the same neutral, consensus-based measurement model from performance to safety.

References

1. MLCommons. "MLCommons Launches." December 3, 2020. https://mlcommons.org/2020/05/mlcommons-launches-2/
2. Crunchbase. "David Kanter - Executive Director and Founder, MLCommons." https://www.crunchbase.com/person/david-kanter
3. MLCommons. "About MLCommons." https://mlcommons.org/about-us/
4. ProPublica Nonprofit Explorer. "MLCommons Association (EIN 85-0546914)." https://projects.propublica.org/nonprofits/organizations/850546914
5. The Register. "Hooray for MLPerf, another AI benchmark competition backed by Google, Baidu, etc." May 2, 2018. https://www.theregister.com/2018/05/02/mlperf_ai_benchmark/
6. RISE Lab, UC Berkeley. "MLPerf: SPEC for ML." 2018. https://rise.cs.berkeley.edu/blog/mlperf-spec-for-ml/
7. MLCommons. "Leadership." https://mlcommons.org/about-us/leadership/
8. MLCommons. "MLCommons Leadership Evolution." June 2024. https://mlcommons.org/2024/06/mlcommons-leadership-evolution/
9. HPCwire. "MLPerf Issues New Inferencing Results, Adds Power Metrics, Nvidia Wins (Again)." April 21, 2021. https://www.hpcwire.com/2021/04/21/mlperf-issues-new-inferencing-results-adds-power-metrics-nvidia-wins-again/
10. MLCommons. "MLCommons Releases MLPerf Training v5.1 Results." November 12, 2025. https://mlcommons.org/2025/11/training-v5-1-results/
11. MLCommons. "MLCommons Releases New MLPerf Inference v5.1 Benchmark Results." September 9, 2025. https://mlcommons.org/2025/09/mlperf-inference-v5-1-results/
12. MLCommons. "MLCommons Releases New MLPerf Inference v6.0 Benchmark Results." April 1, 2026. https://mlcommons.org/2026/04/mlperf-inference-v6-0-results/
13. MLCommons. "New MLPerf Storage v2.0 Benchmark Results Demonstrate the Critical Role of Storage Performance in AI Training Systems." August 2025. https://mlcommons.org/2025/08/mlperf-storage-v2-0-results/
14. MLCommons. "AVCC and MLCommons Release New MLPerf Automotive v0.5 Benchmark Results." August 27, 2025. https://mlcommons.org/2025/08/mlperf-auto-v0-5-results/
15. MLCommons. "MLCommons Releases MLPerf Client v1.0: A New Standard for AI PC and Client LLM Benchmarking." July 30, 2025. https://mlcommons.org/2025/07/mlperf-client-v1-0/
16. MLCommons. "MLPerf Client v1.5 Advances AI PC Benchmarking with Windows ML Integration." November 18, 2025. https://mlcommons.org/2025/11/mlperf-client-1-5-release/
17. MLCommons. "MLCommons Launches AILuminate, First-of-its-Kind Benchmark to Measure the Safety of Large Language Models." December 4, 2024. https://mlcommons.org/2024/12/mlcommons-ailuminate-v1-0-release/
18. Ghosh, S. et al. "AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons." arXiv, March 2025. https://arxiv.org/abs/2503.05731
19. MLCommons. "MLCommons Releases French AILuminate Benchmark Demo Prompt Dataset to GitHub." April 2025. https://mlcommons.org/2025/04/ailuminate-french-datasets/
20. MLCommons. "MLCommons Unveils New Jailbreak Benchmark, Quantifying AI's 'Resilience Gap' to Adversarial Attacks." October 2025. https://mlcommons.org/2025/10/ailuminate-jailbreak-v05/
21. MLCommons. "MLCommons Unveils Open Datasets and Tools to Drive Democratization of Machine Learning." December 14, 2021. https://mlcommons.org/2021/12/mlcommons-unveils-open-datasets-and-tools-to-drive-democratization-of-machine-learning/
22. MLCommons. "Unsupervised People's Speech: A Massive Multilingual Audio Dataset." January 2025. https://mlcommons.org/2025/01/new-unsupervised-peoples-speech/
23. MLCommons. "New Croissant Metadata Format Helps Standardize ML Datasets." March 2024. https://mlcommons.org/2024/03/croissant_metadata_announce/
24. Karargyris, A., Umeton, R. et al. "Federated benchmarking of medical artificial intelligence with MedPerf." Nature Machine Intelligence, July 2023. https://www.nature.com/articles/s42256-023-00652-2
25. HPCwire. "The Mainstreaming of MLPerf? Nvidia Dominates Training v2.0 but Challengers Are Rising." June 29, 2022. https://www.hpcwire.com/2022/06/29/the-mainstreaming-of-mlperf-nvidia-dominates-training-v2-0-but-challengers-are-rising/
26. Tom's Hardware. "AMD posts first Instinct MI300X MLPerf benchmark results, roughly in line with Nvidia H100 performance." August 2024. https://www.tomshardware.com/tech-industry/artificial-intelligence/amd-posts-first-instinct-mi300x-mlperf-benchmark-results-roughly-in-line-with-nvidia-h100-performance
27. AMD. "AMD Expands AI Momentum with First MLPerf Training Submission." June 2025. https://www.amd.com/en/blogs/2025/amd-drives-ai-gains-with-mlperf-training-results.html
28. Moor Insights & Strategy. "Why Can't NVIDIA Be Bested In MLPerf?" https://moorinsightsstrategy.com/why-cant-nvidia-be-bested-in-mlperf/
29. The Next Platform. "The Performance Of MLPerf As A Ubiquitous Benchmark Is Lacking." April 8, 2022. https://www.nextplatform.com/2022/04/08/the-performance-of-mlperf-as-a-ubiquitous-benchmark-is-lacking/
30. MLCommons. "Standardizing Generative AI Service Evaluation: An API-Centric Benchmarking Approach." March 19, 2026. https://mlcommons.org/2026/03/mlperf-endpoints-gen-ai-benchmarking/
