MLPerf
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v5 · 4,560 words
MLPerf is an industry-standard suite of benchmarks for measuring the performance of machine learning systems across training, inference, and specialized computing environments. Developed and maintained by MLCommons, a global engineering consortium of over 125 organizations, MLPerf provides a transparent and reproducible methodology for comparing hardware and software platforms on real-world AI workloads. Since its founding in 2018, MLPerf has become the most widely cited benchmark in the AI hardware industry, with results used by chip manufacturers, cloud providers, and system integrators to validate performance claims and guide purchasing decisions.[1][5]
The MLPerf project originated in February 2018 from a series of meetings between engineers and researchers at Baidu, Google, Harvard University, Stanford University, and UC Berkeley. The founding team included Peter Mattson (Google Brain), David Patterson (UC Berkeley), Greg Diamos (Baidu), Cliff Young (Google), Peter Bailis (Stanford), and Gu-Yeon Wei (Harvard). Their goal was to create an equivalent of SPEC benchmarks or TPC benchmarks for the machine learning field, providing standardized, fair comparisons across rapidly proliferating hardware and software platforms.[1]
The first MLPerf Training benchmark results were published in late 2018, and the MLPerf Inference benchmark followed in 2019. The formal academic paper, "MLPerf Training Benchmark," appeared on arXiv in October 2019 with contributions from over 38 authors spanning industry and academia.[1]
In December 2020, the MLPerf effort was formally incorporated into MLCommons, a nonprofit engineering consortium headquartered in Dover, Delaware. MLCommons expanded the scope beyond benchmarking to include open datasets, safety evaluations, and research initiatives. The founding board included representatives from Alibaba, Meta (then Facebook AI), Google, Intel, NVIDIA, and Harvard University.[5]
As of 2026, MLCommons has grown to include over 125 member organizations ranging from startups to Fortune 500 companies, universities, and nonprofits across the globe. The MLPerf benchmark suite has produced more than 100,000 individual results across all its categories.[5][7]
MLCommons organizes its work through several working groups:
| Working Group Area | Key Initiatives |
|---|---|
| Benchmarks | MLPerf Training, MLPerf Inference, MLPerf HPC, MLPerf Tiny, MLPerf Storage, MLPerf Client, MLPerf Automotive, AlgoPerf, AILuminate |
| Data | Croissant metadata format, open datasets, medical AI data, MLCube |
| Research | Algorithms, Chakra execution traces, data-centric ML, science |
| AI Risk and Reliability | AILuminate safety evaluation, responsible AI development |
The MLPerf benchmarks follow a biannual release cadence, with new results published roughly every six months (e.g., v5.0 in June 2025 and v5.1 in November 2025 for Training; v5.1 in September 2025 and v6.0 in April 2026 for Inference).[7][8][9] Each release can include updated workloads, new models, and retired benchmarks to keep pace with the fast-moving field.
MLPerf is not a single benchmark but a family of benchmark suites, each targeting a different aspect of the ML lifecycle or a different class of hardware. The major suites are:
| Suite | Target | First Release | Latest Version (as of May 2026) |
|---|---|---|---|
| MLPerf Training | Datacenter GPU/accelerator clusters | 2018 | v5.1 (November 2025)[9] |
| MLPerf Inference | Datacenter and edge inference systems | 2019 | v6.0 (April 2026)[7] |
| MLPerf HPC | Supercomputers for scientific ML | 2019 | v3.0 (November 2023)[10] |
| MLPerf Tiny | Microcontrollers and edge devices | 2021 | v1.3 (September 2025)[12] |
| MLPerf Storage | Storage systems for ML training | 2023 | v2.0 (August 2025)[11] |
| MLPerf Client | Laptops, desktops, workstations | 2024 | v1.5 (November 2025)[13] |
| MLPerf Automotive | Automotive inference systems | 2025 | v0.5 (August 2025)[14] |
| AILuminate | LLM safety evaluation | 2024 | v1.1 (2025)[21][22] |
All MLPerf benchmark suites use a common framework of submission divisions and availability categories to ensure fair comparisons while still allowing innovation.
Closed Division: Submitters must use the reference model architecture and adhere to strict rules about preprocessing, training procedures, and hyperparameters. This division enables direct, apples-to-apples comparisons of hardware and system software. Most industry submissions fall into this category.[1][2]
Open Division: Submitters may modify the model architecture, use different optimizers, or apply novel training techniques. This division encourages algorithmic innovation and research exploration. Results in the Open Division cannot be directly compared to Closed Division results.[1][2]
| Category | Description |
|---|---|
| Available | All hardware and software components can be purchased or rented by the public at the time of submission |
| Preview | Components must become Available by the next submission round |
| RDI (Research, Development, Internal) | Experimental or pre-production systems not yet commercially available |
The MLPerf Training benchmark measures the wall-clock time required to train a model on a specified dataset to a defined quality target. Because training time varies from run to run, each benchmark is run a specified number of times; the highest and lowest results are discarded and the remainder are averaged. Typical run-to-run variance in results is roughly ±2.5% for imaging benchmarks and ±5% for other workloads.[1]
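The aggregation rule described above (drop the highest and lowest results, then average the rest) can be sketched in a few lines of Python. This is only an illustration: the function name `olympic_mean` and the sample times are hypothetical, and the required number of runs differs between benchmarks.

```python
from statistics import mean


def olympic_mean(times_minutes):
    """Average run times after discarding the single highest and lowest values."""
    if len(times_minutes) < 3:
        raise ValueError("need at least three runs to drop the extremes")
    trimmed = sorted(times_minutes)[1:-1]
    return mean(trimmed)


# Hypothetical time-to-train values (minutes) from five runs; not real MLPerf data.
runs = [27.9, 27.1, 28.4, 27.3, 27.6]
score = olympic_mean(runs)
print(f"reported time-to-train: {score:.2f} min")
```

Dropping the extremes keeps a single unusually lucky or unlucky run from dominating the reported score.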
The v5.1 suite includes the following workloads:[9]
| Benchmark | Model | Dataset | Domain | Notes |
|---|---|---|---|---|
| LLM Pretraining (large) | Llama 3.1 405B | C4 (v3.0.1) | Natural language processing | 405 billion parameters, sequence length 8,192. Replaced GPT-3 in v5.0. |
| LLM Pretraining (small) | Llama 3.1 8B | C4 | Natural language processing | New in v5.1, replaces BERT. Enables single-node benchmarking. |
| LLM Fine-tuning | Llama 2 70B (LoRA) | SCROLLS GovReport | Natural language processing | Sequence length 8,192. Uses Low-Rank Adaptation. |
| Object Detection | RetinaNet | Open Images | Computer vision | Lightweight object detection |
| Image Classification | ResNet-50 | ImageNet | Computer vision | Classic vision benchmark |
| Medical Image Segmentation | 3D U-Net | KiTS19 | Medical imaging | Kidney tumor segmentation |
| Text-to-Image Generation | Flux.1 | CC12M (train), COCO 2014 (eval) | Generative AI | 11.9B parameter transformer model. New in v5.1, replaces Stable Diffusion v2. |
| Recommendation | DLRMv2 | Criteo 4TB | Commerce | Deep Learning Recommendation Model |
| Graph Neural Network | R-GAT | IGBH-Full | Graph learning | Relational Graph Attention Network |
| Speech Recognition | RNN-T | LibriSpeech | Audio | Recurrent Neural Network Transducer |
MLPerf Training v5.0 (June 2025) introduced the Llama 3.1 405B benchmark, the largest model ever included in the training suite. In a landmark result, CoreWeave, NVIDIA, and IBM completed the Llama 3.1 405B training benchmark in just 27.3 minutes using a cluster of 2,496 NVIDIA GB200 NVL72 GPUs. This was the largest cluster ever benchmarked under MLPerf and achieved 91% scaling efficiency from 512 to 2,496 GPUs.[8][16]
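A scaling-efficiency figure like the 91% quoted above is commonly computed as achieved speedup divided by ideal (linear) speedup. The sketch below assumes that definition; the source does not state exactly how the submitters calculated it. The function name and the 512-GPU time are hypothetical, and only the 27.3-minute result on 2,496 GPUs comes from the text.

```python
def scaling_efficiency(gpus_small, time_small, gpus_large, time_large):
    """Strong-scaling efficiency: achieved speedup divided by ideal speedup."""
    achieved_speedup = time_small / time_large
    ideal_speedup = gpus_large / gpus_small
    return achieved_speedup / ideal_speedup


# 27.3 minutes on 2,496 GPUs is the figure quoted above.
# The 512-GPU time is HYPOTHETICAL, chosen only to illustrate the arithmetic.
efficiency = scaling_efficiency(
    gpus_small=512, time_small=121.0,
    gpus_large=2496, time_large=27.3,
)
print(f"scaling efficiency: {efficiency:.1%}")
```

An efficiency of 100% would mean that multiplying the GPU count by some factor divides the training time by the same factor.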
The NVIDIA Blackwell architecture delivered up to 2.6x higher performance per GPU compared to the previous-generation Hopper architecture, with at least 2x faster training at equivalent cluster sizes.[15]
MLPerf Training v5.1 (November 2025) included results from 20 organizations submitting 65 unique systems across 12 different hardware accelerators. Nearly half of all submissions were multi-node configurations, an 86% increase over v4.1. The generative AI workloads (Llama 2 70B LoRA and Llama 3.1 8B) saw 24% and 15% submission increases respectively.[9] First-time submitters included University of Florida, Verda, and Wiwynn.[9]
NVIDIA swept all seven MLPerf Training v5.1 tests and was the only platform to submit results on every test.[17] Highlights included a 10-minute Llama 3.1 405B pretraining time on 5,000+ Blackwell GPUs, a 5.2-minute Llama 3.1 8B pretraining time on 512 Blackwell Ultra GPUs, and a 12.5-minute Flux.1 image-generation training on 1,152 Blackwell GPUs.[17] Compared to Hopper, Blackwell Ultra (GB300 NVL72) delivered roughly 4x faster Llama 3.1 405B pretraining and nearly 5x faster Llama 2 70B LoRA fine-tuning at the same GPU count.[17]
NVIDIA also demonstrated the first MLPerf training submissions using NVFP4 four-bit floating-point precision, achieving double the rate of FP8 on Blackwell and roughly 3x on Blackwell Ultra while remaining the only platform meeting accuracy requirements with FP4.[17]
Version 4.0 (2024) was notable for introducing system-wide power draw and energy consumption measurements during training, reflecting the growing importance of energy efficiency in large-scale AI systems.
The MLPerf Inference benchmark measures how quickly systems can execute trained models across various deployment scenarios. It covers both datacenter and edge deployments, with different scenarios reflecting real-world usage patterns.[2]
| Scenario | Description | Primary Use Case |
|---|---|---|
| Offline | Processes the entire dataset as a batch; measures raw throughput | Batch processing, analytics |
| Server | Queries arrive following a Poisson distribution; measures throughput under latency constraints | Cloud API endpoints, web services |
| SingleStream | One query at a time; measures latency per query | Mobile apps, single-user devices |
| MultiStream | Multiple concurrent query streams | Multi-camera systems, autonomous vehicles |
| Interactive | Tight latency constraints for time-to-first-token (TTFT) and time-per-output-token (TPOT) | Chatbots, agentic AI applications |
Datacenter submissions require the Offline and Server scenarios (plus Interactive for LLM benchmarks). Edge submissions require SingleStream, MultiStream, and Offline scenarios.[2][7]
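In the Server scenario, query arrival times follow a Poisson process, so the gaps between consecutive queries are exponentially distributed. The following sketch generates such a schedule using only the Python standard library. It is not the MLPerf load generator or its API; `poisson_arrival_times` and its parameters are hypothetical names used for illustration.

```python
import random


def poisson_arrival_times(target_qps, num_queries, seed=0):
    """Return cumulative arrival times (seconds) for a Poisson process.

    Inter-arrival gaps of a Poisson process are exponentially distributed
    with mean 1 / target_qps.
    """
    rng = random.Random(seed)
    now = 0.0
    arrivals = []
    for _ in range(num_queries):
        now += rng.expovariate(target_qps)
        arrivals.append(now)
    return arrivals


# Hypothetical load: 10,000 queries at a target of 100 queries per second.
arrivals = poisson_arrival_times(target_qps=100.0, num_queries=10_000)
observed_qps = len(arrivals) / arrivals[-1]
print(f"observed rate: {observed_qps:.1f} queries/s")
```

Because arrivals are random rather than evenly spaced, queries sometimes bunch together, which is what makes meeting a latency bound harder in Server mode than in Offline mode.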
MLPerf Inference v6.0, released April 1, 2026, refreshed roughly half of the datacenter suite. Five of the eleven datacenter tests were new or updated, and an additional new object-detection workload was added for edge systems.[7]
| Benchmark | Model | Dataset | Scenarios | Notes |
|---|---|---|---|---|
| Reasoning LLM | GPT-OSS 120B | AIME 2024, GPQA-Diamond, LiveCodeBench v6 (accuracy); PubMed (performance) | Datacenter (Server, Interactive) | New in v6.0. Mixture-of-experts (117B total, 5.1B active per token). First MLPerf benchmark to split performance and accuracy datasets.[7][19] |
| Reasoning LLM (Interactive) | DeepSeek-R1 | Math, Q&A, code | Datacenter (Interactive) | New interactive scenario in v6.0; first MLPerf standard for speculative decoding (EAGLE-style). Constraints: TTFT <= 1.5s, TPOT <= 15ms.[7][19] |
| Recommendation | DLRMv3 | Sequential commerce data | Datacenter | New in v6.0; third-generation recommender, first sequential recommendation benchmark, modernized with contributions from Meta.[7] |
| Text-to-Video | Wan 2.2 | Standard prompts | Datacenter | New in v6.0; first video generation workload in MLPerf Inference.[7] |
| Vision-Language Model | Open-weight VLM | E-commerce product catalog | Datacenter | New in v6.0; covers structured metadata generation from images.[7] |
| Object Detection (Edge) | YOLOv11 Large | Open Images / KITTI | Edge | New in v6.0 for edge systems.[7] |
| LLM | Llama 3.1 405B | Multiple | Datacenter | Retained from v5.x. |
| Small LLM | Llama 3.1 8B | CNN DailyMail | Datacenter, Edge | Retained from v5.1; 128K token context. |
| LLM | Llama 2 70B / Interactive | Multiple | Datacenter | Retained from v5.x. |
| Mixture of Experts | Mixtral 8x7B | Multiple | Datacenter | Retained. |
| Image Classification | ResNet-50 | ImageNet | Datacenter, Edge | Retained. |
| Medical Image Segmentation | 3D U-Net | KiTS19 | Datacenter, Edge | Retained. |
| Speech Recognition | Whisper Large V3 | Multiple | Datacenter, Edge | Retained from v5.1. |
| Graph Neural Network | R-GAT | IGBH | Datacenter | Retained. |
| Autonomous Driving | PointPainting | nuScenes, Cognata | Edge (automotive) | Retained. |
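The TTFT and TPOT limits listed in the table for the DeepSeek-R1 Interactive test (1.5 s and 15 ms) can be checked against per-token timestamps. The sketch below handles a single query and assumes TPOT is the mean gap between output tokens after the first; how the official harness aggregates these values across queries is not shown here. All names and timestamps are hypothetical.

```python
def interactive_latency(request_time, token_times):
    """Return (ttft, tpot) in seconds for one query.

    ttft: delay from the request to the first output token.
    tpot: mean gap between consecutive output tokens after the first.
    """
    if not token_times:
        raise ValueError("query produced no tokens")
    ttft = token_times[0] - request_time
    if len(token_times) == 1:
        return ttft, 0.0
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Limits quoted in the table for the DeepSeek-R1 Interactive test.
TTFT_LIMIT_S = 1.5
TPOT_LIMIT_S = 0.015

# Hypothetical timestamps: first token after 1.2 s, then one token every 12 ms.
token_times = [1.2 + 0.012 * i for i in range(200)]
ttft, tpot = interactive_latency(0.0, token_times)
meets_limits = ttft <= TTFT_LIMIT_S and tpot <= TPOT_LIMIT_S
print(f"TTFT={ttft:.3f}s TPOT={tpot * 1000:.1f}ms meets_limits={meets_limits}")
```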
The v6.0 release also introduced LoadGen++, a substantially upgraded version of the load-generator harness that allows LLM benchmarks to run against serving-style software stacks (resembling production deployments) rather than the simpler bench rigs used previously. MLCommons described LoadGen++ as a key infrastructure investment to keep the suite nimble as model serving stacks evolve.[7] Alongside LoadGen++, MLCommons launched an interactive results dashboard at mlcommons.org/visualizer with advanced filtering and customized performance graphs.[7][20]
Most benchmarks require the submitted system to achieve at least 99% of the accuracy of the FP32 reference model. Language processing and medical imaging benchmarks have a stricter threshold of 99.9% accuracy relative to the reference. For GPT-OSS 120B, accuracy targets are 82.92% on AIME (8 repeats), 74.95% on GPQA-Diamond (5 repeats), and 84.68% on LiveCodeBench v6 (3 repeats).[19]
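The relative accuracy thresholds above amount to a simple comparison against the reference score. A minimal sketch follows, with hypothetical accuracy numbers and a hypothetical helper name; the official accuracy scripts are benchmark-specific and are not reproduced here.

```python
def meets_accuracy_target(measured, reference, ratio=0.99):
    """True if measured accuracy is at least `ratio` times the reference accuracy."""
    return measured >= ratio * reference


# Hypothetical numbers: an FP32 reference at 76.46% and a quantized run at 75.80%.
reference = 76.46
quantized = 75.80

passes_99 = meets_accuracy_target(quantized, reference, ratio=0.99)
passes_999 = meets_accuracy_target(quantized, reference, ratio=0.999)
print(passes_99, passes_999)
```

With these illustrative numbers the run clears the 99% bar but not the stricter 99.9% bar, which shows how the tighter threshold limits aggressive quantization.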
MLPerf Inference v6.0 (April 2026) drew submissions from 24 organizations, including AMD, ASUSTeK, Cisco, CoreWeave, Dell, GATEOverflow, Giga Computing, Google, HPE, Intel, Inventec, KRAI, Lambda, Lenovo, MangoBoost, MiTAC, Nebius, Netweb Technologies India, NVIDIA, Oracle, Quanta Cloud Technology, Red Hat, Stevens Institute of Technology, and Supermicro.[7] The round saw a 30% increase in multi-node submissions versus v5.1, and 10% of submissions featured more than ten nodes.[7]
NVIDIA's largest submission used 288 Blackwell Ultra GPUs across four GB300 NVL72 racks connected by Quantum-X800 InfiniBand, the largest MLPerf Inference submission ever, quadruple the previous round's maximum.[7][17][20] On DeepSeek-R1, this system delivered approximately 2.49 million tokens/second offline and 1.55 million tokens/second in server mode.[20] Blackwell Ultra (GB300) also set per-GPU records on GPT-OSS 120B, Llama 3.1 405B, Llama 3.1 8B, Whisper, and other workloads.[17][20]
AMD's Instinct MI355X delivered competitive single-node results and the first model bring-up on GPT-OSS 120B and Wan 2.2 text-to-video by a non-NVIDIA accelerator, and AMD multinode submissions crossed one million tokens per second at cluster scale.[7] Intel was the only vendor with standalone CPU submissions; over half of the v6.0 submissions used Xeon processors, and Intel debuted its Arc Pro B70 GPU (delivering up to 1.8x the inference performance of the B60).[18]
MLPerf Inference v5.1 (September 2025) set a participation record with 27 organizations. Five new accelerators debuted: AMD Instinct MI355X, Intel Arc Pro B60 48GB Turbo, NVIDIA GB300, NVIDIA RTX 4000 Ada PCIe 20GB, and NVIDIA RTX Pro 6000 Blackwell Server Edition. Performance on Llama 2 70B improved by up to 50% over v5.0 results from just six months prior. The round also included the first heterogeneous system submission, which used software load-balancing across different accelerator types.
MLPerf Inference v5.0 (April 2025) included 17,457 results from 23 organizations. Five newly available processors were benchmarked: AMD Instinct MI325X, Intel Xeon 6980P, Google TPU Trillium (v6e), NVIDIA B200, and NVIDIA Jetson AGX Thor 128.
MLPerf HPC targets supercomputer-scale systems used for scientific machine learning. Unlike the standard Training benchmark, HPC benchmarks focus on scientific applications and include an optional throughput metric for large multi-user systems.[4]
| Benchmark | Model | Application | Description |
|---|---|---|---|
| DeepCAM | Convolutional neural network | Climate science | Identifies weather patterns in climate simulation data |
| CosmoFlow | 3D CNN | Cosmology | Predicts cosmological parameters from 3D dark matter distributions |
| OpenCatalyst | GNN | Chemistry | Predicts quantum mechanical properties of catalyst materials for energy storage |
| OpenFold | Generative AI | Structural biology | Predicts 3D protein structures from amino acid sequences. Added in v3.0. |
MLPerf HPC v3.0 results demonstrated performance gains of up to 2.8x compared to just five months prior and 49x over the first HPC results. The DeepCAM weather modeling benchmark ran 14x faster than when it debuted, illustrating how innovations in ML hardware and software benefit scientific computing.[10]
MLPerf Tiny benchmarks inference performance on extremely low-power embedded devices such as microcontrollers, DSPs, and tiny neural network accelerators. These devices typically operate at clock speeds between 10 MHz and 250 MHz and consume less than 50 mW of power. The neural networks tested are small, typically 100 KB and below, processing sensor data like audio and images.[3]
| Benchmark | Task | Description |
|---|---|---|
| Keyword Spotting (KWS) | Audio classification | Detects spoken wake words or commands. Used in earbuds and virtual assistants. |
| Visual Wake Words (VWW) | Image classification | Detects presence of a person in an image. Used in security cameras and smart home devices. |
| Image Classification (IC) | Image classification | Classifies small images into 10 categories using a compact CNN. |
| Anomaly Detection (AD) | Audio anomaly detection | Identifies abnormal sounds in machine operating environments for predictive maintenance. |
| Streaming KWS | Audio streaming | New in v1.3. Uses a 1D depthwise separable CNN on continuous audio streams to detect wake words in real time. |
The reference implementation uses TensorFlow Lite for Microcontrollers (TFLM). Three metrics are measured for each benchmark: accuracy, latency, and energy consumption. MLPerf Tiny v1.3 included 70 results across five benchmarks from four participants: Kai Jiang, Qualcomm, STMicroelectronics, and Syntiant.[12]
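Latency and energy are linked by average power draw: energy per inference is power multiplied by time. The arithmetic is sketched below with hypothetical values inside the sub-50 mW range mentioned above. The official benchmark measures energy with dedicated hardware, which this sketch does not model.

```python
def energy_per_inference_uj(avg_power_mw, latency_ms):
    """Energy in microjoules: milliwatts x milliseconds = microjoules."""
    return avg_power_mw * latency_ms


# Hypothetical device: 20 mW average draw, 50 ms per inference.
energy_uj = energy_per_inference_uj(avg_power_mw=20.0, latency_ms=50.0)
print(f"{energy_uj:.0f} uJ per inference")
```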
MLPerf Storage measures how quickly storage systems can feed training data to accelerators during model training. Storage throughput is a critical bottleneck in large-scale training, especially when datasets are too large to fit in memory.
| Version | Workloads | Key Feature |
|---|---|---|
| v1.0 (September 2024) | 3D U-Net, ResNet-50, CosmoFlow | Added distributed training support. 55 submissions from 12 companies. |
| v2.0 (August 2025) | 3D U-Net, ResNet-50, CosmoFlow, Llama 3 checkpointing | Added real-world checkpointing tests. Over 200 results from 26 organizations across 7 countries.[11] |
The v2.0 results showed that submitted storage systems could support roughly twice the number of simultaneous accelerators compared to v1.0, reflecting the scaling demands of modern training clusters.[11]
MLPerf Client evaluates AI performance on personal computers, including laptops, desktops, and workstations. Announced in January 2024, this benchmark suite targets the growing "AI PC" market segment where large language models run locally on consumer hardware.
| Version | Release Date | Models Tested | Key Features |
|---|---|---|---|
| v0.5 | December 2024 | Llama 2 7B (4-bit quantized) | Four tasks spanning content generation, creative writing, and text summarization. Metrics: TTFT, tokens per second. |
| v1.0 | July 2025 | Llama 2 7B Chat, Llama 3.1 8B Instruct, Phi 3.5 Mini Instruct | AMD NPU/GPU, Intel NPU/GPU (OpenVINO), NVIDIA GPU, Qualcomm NPU, Apple GPU (MLX) support |
| v1.5 | November 2025 | Same as v1.0 plus Phi 4 Reasoning 14B (experimental) | Windows ML integration, iPad support, experimental power measurement, Linux CLI[13] |
MLPerf Client v1.5 supports Windows x64, Windows on Arm, macOS, Linux, and iPad, making it the most cross-platform MLPerf benchmark. It is developed collaboratively by AMD, Intel, Microsoft, NVIDIA, Qualcomm, and major PC OEMs.[13]
MLPerf Automotive is jointly developed by MLCommons and the Autonomous Vehicle Computing Consortium (AVCC) to evaluate inference performance for automotive AI systems. The benchmark covers Advanced Driver Assistance Systems (ADAS), autonomous driving (AD), and in-vehicle infotainment (IVI).[14]
| Benchmark | Task | Model/Method |
|---|---|---|
| 2D Object Detection | Detect vehicles, pedestrians, etc. in camera images | Various detectors on Cognata 8MP imagery |
| 2D Semantic Segmentation | Pixel-level scene classification | Segmentation models on Cognata dataset |
| 3D Object Detection | Detect objects in 3D space from lidar and camera fusion | PointPainting on nuScenes dataset |
The 3D object detection benchmark uses PointPainting, a sensor-fusion technique that combines image-based semantic segmentation with lidar point cloud data. Future versions are expected to incorporate vision-language-action (VLA) models for end-to-end self-driving evaluation.[14]
AILuminate is the MLCommons safety and reliability benchmark for large language models and AI chat systems, run by the AI Risk and Reliability working group. First launched as v1.0 in December 2024 and updated to v1.1 in 2025, it is described by MLCommons as the first industry-standard benchmark for assessing AI-product risk.[21][22]
AILuminate evaluates a system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior across twelve hazard categories: violent crimes, nonviolent crimes, sex-related crimes, child sexual exploitation, indiscriminate weapons (CBRNE), suicide and self-harm, intellectual property, privacy, defamation, hate, sexual content, and specialized advice (medical, legal, financial).[21] Each test uses over 24,000 prompts per language (12,000 public practice prompts plus 12,000 private prompts used for the Official Test) and assigns systems a safety grade on a scale from "Poor" to "Excellent."[21]
The benchmark launched in English (v1.0) and added French in v1.1, with Chinese and Hindi in development.[21] AILuminate complements the performance-focused MLPerf suites by giving model developers and procurement teams a comparable measure of model safety alongside throughput and latency.
MLPerf results serve several purposes across the AI industry:
Hardware Vendors: Chip makers like NVIDIA, AMD, Intel, and Google use MLPerf to demonstrate the performance advantages of new processor generations. NVIDIA has been a particularly dominant and consistent top performer, using MLPerf results prominently in product launches for its Hopper, Blackwell, and Blackwell Ultra GPU architectures.[15][17]
Cloud Providers: Companies like CoreWeave, Oracle Cloud, Google Cloud, Microsoft Azure, Nebius, Lambda, and Vultr use MLPerf to validate their infrastructure offerings. Cloud-based submissions allow potential customers to see performance on rentable systems.[7][8]
System Integrators: Server manufacturers like Dell, HPE, Lenovo, Supermicro, Cisco, ASUSTeK, Quanta Cloud Technology, Inventec, and Wiwynn submit results to showcase the performance of their configured systems, helping enterprise buyers compare turnkey solutions.[7][9]
AI Startups and Chip Designers: Newer entrants like Cerebras, SambaNova, Groq, and MangoBoost have used, or plan to use, MLPerf submissions to establish credibility and benchmark their novel architectures against established players.[7]
Researchers: Academic institutions use MLPerf HPC and other suites to measure the ML capabilities of research clusters and supercomputers, informing procurement and system design decisions. Recent first-time academic submitters include the University of Florida, Stevens Institute of Technology, and Amitash Nanda.[7][9]
MLPerf employs several mechanisms to ensure fair and meaningful comparisons:[1][2]
The biannual cadence and evolving workload selection ensure that benchmarks remain relevant as the field progresses. Retired workloads (such as the Mini Go reinforcement learning benchmark, the original NMT machine translation task, GPT-3, GPT-J, and BERT) are replaced with newer, more representative models.
While MLPerf is widely respected, some criticisms have been raised:
| Suite | Version | Date | Key Changes |
|---|---|---|---|
| Training | v0.5 | December 2018 | Initial release with ResNet-50, Transformer, NMT, Mini Go, Mask R-CNN, SSD |
| Training | v0.7 | July 2020 | Added BERT, DLRM |
| Training | v1.0 | June 2021 | Updated models and datasets |
| Training | v3.1 | November 2023 | Stable Diffusion, GPT-3, LLM fine-tuning added |
| Training | v4.0 | June 2024 | Power measurements introduced |
| Training | v5.0 | June 2025 | Llama 3.1 405B replaces GPT-3; record submissions[8] |
| Training | v5.1 | November 2025 | Llama 3.1 8B replaces BERT; Flux.1 replaces Stable Diffusion; first NVFP4 submissions[9][17] |
| Inference | v0.5 | June 2019 | Initial release |
| Inference | v4.0 | March 2024 | Mixtral 8x7B, SDXL added |
| Inference | v5.0 | April 2025 | Llama 3.1 405B, R-GAT, Automotive PointPainting, Interactive scenario |
| Inference | v5.1 | September 2025 | DeepSeek-R1, Llama 3.1 8B, Whisper Large V3 added; 27 submitters |
| Inference | v6.0 | April 2026 | GPT-OSS 120B, DLRMv3, Wan 2.2 text-to-video, VLM, YOLOv11; LoadGen++; speculative decoding standard; 288-GPU largest submission[7][19][20] |
| HPC | v0.5 | November 2019 | DeepCAM, CosmoFlow |
| HPC | v3.0 | November 2023 | OpenFold protein folding added; 49x gains over first results |
| Tiny | v0.5 | June 2021 | KWS, VWW, IC, AD |
| Tiny | v1.3 | September 2025 | Streaming KWS benchmark added |
| Storage | v0.5 | 2023 | Initial preview |
| Storage | v1.0 | September 2024 | Distributed training support |
| Storage | v2.0 | August 2025 | Llama 3 checkpointing; 200+ results |
| Client | v0.5 | December 2024 | Initial release for AI PCs |
| Client | v1.5 | November 2025 | Windows ML, iPad, power measurement |
| Automotive | v0.5 | August 2025 | 2D/3D object detection, semantic segmentation |
| AILuminate | v1.0 | December 2024 | English LLM safety benchmark, 12 hazard categories[21] |
| AILuminate | v1.1 | 2025 | French added; Chinese, Hindi in development[22] |