KernelBench


Updated Jun 8, 2026


Overview

KernelBench is an AI benchmark and open-source evaluation environment that measures how well large language models can write fast and correct GPU kernels. The benchmark presents a model with a reference implementation written in PyTorch and asks it to produce a custom kernel, typically in CUDA, that computes the same result but runs faster on the GPU. Generated kernels are checked for numerical correctness against the PyTorch reference and then timed to measure any speedup ^[1]^[2].

KernelBench was introduced in the February 2025 paper "KernelBench: Can LLMs Write Efficient GPU Kernels?" by Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini, a team associated with the Stanford Scaling Intelligence Lab and Princeton University. The paper was posted to arXiv on February 14, 2025 (arXiv:2502.10517) and was accepted to the International Conference on Machine Learning (ICML) 2025 ^[1]^[2]^[3]. The benchmark comprises 250 selected neural-network workloads, and it introduced a metric called fast_p that has since become a common way to report progress on automated GPU kernel generation ^[1]^[2].

The starting premise of KernelBench is that writing efficient GPU kernels is a high-value but difficult engineering task, and that progress on the benchmark translates directly into faster real-world machine-learning code rather than into an abstract score ^[2]. Through 2025 it became one of the standard reference points for evaluating "AI writing CUDA" efforts at companies and labs including NVIDIA ^[4].

Motivation (GPU kernel generation)

A GPU kernel is a small program that runs on a graphics processor and performs a specific computation, such as a matrix multiplication, a convolution, or a normalization step. Modern deep-learning models spend almost all of their compute inside such kernels, so the speed of these low-level routines largely determines how fast a model trains and how cheaply it runs at inference time ^[2].

Writing a fast kernel is hard. It requires detailed knowledge of GPU architecture, including the memory hierarchy, thread and block scheduling, shared memory, and how to keep the hardware's arithmetic units busy. It also tends to be tedious and error-prone, and the people who can do it well are scarce. As a result, much of this work is done by a small number of specialists or is delegated to compilers and hand-tuned libraries ^[2].

The KernelBench authors frame this as a natural target for AI code generation. If a language model could reliably produce kernels that are both correct and faster than the default implementation, it would automate a costly bottleneck in machine-learning systems engineering. Unlike many coding benchmarks, the objective here is not only functional correctness but also measurable performance, and the reward signal (a real speedup on real hardware) is concrete and verifiable ^[1]^[2]. This makes KernelBench an example of an evaluation with an executable, performance-based ground truth rather than a fixed reference answer.

Structure (the levels)

KernelBench organizes its 250 core tasks into three difficulty levels, with an additional aspirational level drawn from real models. Each task provides a PyTorch reference inside a Model class with __init__ and forward methods, together with helper functions (get_inputs and get_init_inputs) that generate test tensors of fixed shapes. The model under test must return a replacement that produces the same outputs, usually by writing custom CUDA kernels in place of the PyTorch operations ^[1]^[2].

Level	Tasks	What it contains	Example workloads
Level 1	100	Single primitive operators (one kernel)	Matrix multiplication, convolution, layer normalization
Level 2	100	Fused sequences of operators	Convolution + bias + ReLU and similar fusion patterns
Level 3	50	Full model architectures	MobileNet, VGG, MiniGPT and other end-to-end networks
Level 4 (aspirational)	20	Optimizing whole models from a model hub	Selected models from Hugging Face

Difficulty increases sharply from Level 1 to Level 3. Level 1 isolates a single operation, where a model only has to optimize one kernel. Level 2 requires the model to recognize and exploit fusion opportunities, combining several operations into one kernel to cut memory traffic. Level 3 demands an optimized implementation of an entire architecture, which is closest to a realistic engineering task. The 20 Level 4 tasks, described by the authors as aspirational, target the optimization of complete models taken from a model hub and are harder still ^[1]^[2].
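The memory-traffic argument behind Level 2 fusion can be illustrated with a toy count. The sketch below is plain Python rather than GPU code, and it uses a deliberately simplified cost model (one unit per element read or written, ignoring caches and the bias operand); the function names are illustrative and do not come from the benchmark.

```python
def unfused_bias_relu(x, bias):
    """Two passes: the intermediate list stands in for a round trip to GPU memory."""
    traffic = 0
    tmp = []
    for v in x:                      # pass 1: read x, write tmp
        tmp.append(v + bias)
        traffic += 2
    out = []
    for v in tmp:                    # pass 2: read tmp, write out
        out.append(max(v, 0.0))
        traffic += 2
    return out, traffic


def fused_bias_relu(x, bias):
    """One pass: the intermediate value stays in a local variable (a register)."""
    traffic = 0
    out = []
    for v in x:                      # single pass: read x, write out
        out.append(max(v + bias, 0.0))
        traffic += 2
    return out, traffic


x = [-1.5, 0.25, 2.0, -0.75]
out_a, traffic_a = unfused_bias_relu(x, 0.5)
out_b, traffic_b = fused_bias_relu(x, 0.5)
print(out_a == out_b, traffic_a, traffic_b)  # True 16 8
```

Under this simplified model the fused version produces the same output while touching memory half as often, which is the kind of saving a Level 2 solution is expected to find.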

The benchmark deliberately does not impose a fixed train/test split, leaving researchers free to choose their own evaluation protocol, and it evaluates kernels only on fixed input shapes rather than requiring them to generalize across arbitrary shapes ^[1].
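The task format described in this section (a Model class plus get_inputs and get_init_inputs) can be sketched in a few lines. The version below is a stand-in that uses plain Python lists rather than torch tensors so that it runs without PyTorch or a GPU; real tasks define Model as a PyTorch module. The replacement class name ModelNew, the helper run_task, the size N, and the toy workload are assumptions made for illustration, not code taken from the benchmark.

```python
import random

N = 8  # fixed problem size baked into the task (illustrative value)


class Model:
    """Reference implementation: scale a vector, then apply ReLU in two steps."""

    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        scaled = [self.scale * v for v in x]
        return [max(v, 0.0) for v in scaled]


class ModelNew:
    """Candidate replacement (hypothetical name): same math in a single pass."""

    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return [max(self.scale * v, 0.0) for v in x]


def get_init_inputs():
    """Arguments passed to the constructor of both classes."""
    return [2.0]


def get_inputs():
    """Random inputs of the task's fixed shape."""
    return [[random.uniform(-1.0, 1.0) for _ in range(N)]]


def run_task(model_cls, inputs):
    """How a harness might drive either class (hypothetical helper)."""
    model = model_cls(*get_init_inputs())
    return model.forward(*inputs)


inputs = get_inputs()
reference_out = run_task(Model, inputs)
candidate_out = run_task(ModelNew, inputs)
print(reference_out == candidate_out)  # True
```

Because both classes are built from the same constructor arguments and fed the same inputs, a harness can swap one for the other without knowing anything about the workload itself.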

The fast_p metric

The headline metric of KernelBench is fast_p. It is defined as the fraction of tasks for which the generated kernel is both functionally correct and achieves a speedup greater than an adjustable threshold p over a baseline implementation. Speedup is computed as the ratio of the baseline wall-clock time to the generated kernel's wall-clock time, and only correct solutions can count toward the metric, since the authors note that fast but incorrect code is useless ^[1]^[2].

The threshold p makes the metric tunable, which lets a single benchmark express several questions at once ^[1]^[2]:

fast_0 counts kernels that are merely correct, with no speed requirement (a threshold of zero). This is equivalent to the plain correctness or pass rate.
fast_1 counts kernels that are correct and faster than the PyTorch baseline, that is, with a speedup greater than 1.
fast_2 counts kernels that are correct and more than twice as fast as the baseline.
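As a worked example of the definition, the following sketch computes fast_p over a handful of made-up results. The TaskResult record and the timing values are hypothetical; only the rule itself (correct, and baseline time divided by kernel time greater than p) comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    correct: bool
    baseline_ms: float   # time of the PyTorch reference
    kernel_ms: float     # time of the generated kernel


def fast_p(results, p):
    """Fraction of tasks that are correct with speedup strictly greater than p."""
    if not results:
        return 0.0
    hits = sum(
        1 for r in results
        if r.correct and (r.baseline_ms / r.kernel_ms) > p
    )
    return hits / len(results)


results = [
    TaskResult(correct=True, baseline_ms=10.0, kernel_ms=4.0),    # 2.5x speedup
    TaskResult(correct=True, baseline_ms=10.0, kernel_ms=8.0),    # 1.25x speedup
    TaskResult(correct=True, baseline_ms=10.0, kernel_ms=20.0),   # 0.5x (slower)
    TaskResult(correct=False, baseline_ms=10.0, kernel_ms=1.0),   # fast but wrong
]

for p in (0, 1, 2):
    print(p, fast_p(results, p))
# 0 0.75
# 1 0.5
# 2 0.25
```

The incorrect kernel never counts, however fast it is, and the score can only stay flat or fall as p increases.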

As p rises, the bar gets harder and scores fall, so reporting fast_p across several thresholds shows not just whether a model can produce working kernels but how aggressively it can optimize them ^[1]^[2].

Correctness itself is established empirically: the generated kernel is run on five sets of random inputs of the fixed shapes, and its outputs must match the reference within an absolute and relative tolerance of 1e-02. Baseline timings are taken against PyTorch, and the framework supports comparison against both standard ("eager") PyTorch and the torch.compile baseline. Because timing is sensitive to the specific GPU, drivers, and software versions, the authors recommend regenerating baseline times on the same hardware used for evaluation ^[1]^[2].
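A minimal sketch of such a tolerance-based check is shown below, again in plain Python so that it runs without a GPU. It assumes the comparison rule |actual - expected| <= atol + rtol * |expected|, which is the rule torch.allclose applies; whether the harness uses exactly this call is an assumption here, and the helper names and toy functions are illustrative.

```python
import random


def all_close(actual, expected, atol=1e-2, rtol=1e-2):
    """Elementwise |a - e| <= atol + rtol * |e| (the torch.allclose rule)."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


def is_correct(reference_fn, candidate_fn, make_inputs, num_trials=5, seed=0):
    """Compare both functions on several random inputs of the same fixed shape."""
    rng = random.Random(seed)
    for _ in range(num_trials):
        x = make_inputs(rng)
        if not all_close(candidate_fn(x), reference_fn(x)):
            return False
    return True


def make_inputs(rng, n=16):
    return [rng.uniform(-10.0, 10.0) for _ in range(n)]


def reference(x):
    return [v * v for v in x]


def slightly_off(x):       # small rounding-like error, inside the tolerance
    return [v * v + 1e-3 for v in x]


def wrong(x):              # systematic error, far outside the tolerance
    return [v * v + 5.0 for v in x]


print(is_correct(reference, slightly_off, make_inputs))  # True
print(is_correct(reference, wrong, make_inputs))         # False
```

A check of this kind accepts small floating-point differences, which matter because a custom kernel may reorder arithmetic, but it can only sample the input space rather than prove equivalence.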

Results

At the time of release, KernelBench was difficult for frontier models. The authors reported that the strongest reasoning models performed best out of the box but still fell short overall, matching or beating the PyTorch baseline (fast_1) in fewer than 20 percent of tasks. Correctness alone degraded steeply with difficulty, dropping from Level 1 to the fused and full-architecture tasks of Levels 2 and 3, and reasoning-focused models such as OpenAI's o1 produced correct kernels noticeably more often than a general model like GPT-4o on the harder levels ^[1]^[2].

The paper also showed that performance improves substantially when models are allowed to use execution and profiling feedback during iterative refinement, repeatedly running a candidate kernel, observing errors and timing data, and trying again. Even so, KernelBench remained a hard benchmark, and the authors stressed that its difficulty grows as the speedup threshold p is raised, since clearing fast_2 or higher requires genuine optimization rather than a correct but unremarkable kernel. They also highlighted a tension between the two objectives, because correctness and aggressive performance optimization are often in conflict ^[1]^[2].
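The refinement loop described above can be sketched as follows. This is a hedged outline, not the authors' implementation: the Feedback record, the refine function, and the toy generator and evaluator are all hypothetical stand-ins so that the loop runs without a language model or a GPU.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    compiled: bool
    correct: bool
    speedup: float
    message: str          # error text or profiling notes shown to the model


def refine(generate, evaluate, max_turns=5, target_speedup=1.0):
    """Generate, evaluate, and feed the result back until the target is met."""
    best_code, best_speedup = None, 0.0
    feedback = None
    for _ in range(max_turns):
        code = generate(feedback)      # the model sees the previous feedback
        feedback = evaluate(code)      # compile, check correctness, time
        if feedback.correct and feedback.speedup > best_speedup:
            best_code, best_speedup = code, feedback.speedup
        if best_speedup > target_speedup:
            break
    return best_code, best_speedup


# Toy stand-ins: each attempt fixes the problem reported for the previous one.
def toy_generate(feedback):
    if feedback is None:
        return "v0"
    return "v" + str(int(feedback.message) + 1)


def toy_evaluate(code):
    version = int(code[1:])
    table = {
        0: Feedback(False, False, 0.0, "0"),   # compile error
        1: Feedback(True, False, 0.0, "1"),    # wrong output
        2: Feedback(True, True, 0.8, "2"),     # correct but slower
        3: Feedback(True, True, 1.4, "3"),     # correct and faster
    }
    return table.get(version, Feedback(True, True, 1.4, str(version)))


print(refine(toy_generate, toy_evaluate))  # ('v3', 1.4)
```

Keeping the best correct candidate separately from the latest attempt means a later regression cannot overwrite an earlier working kernel.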

A widely cited result on the benchmark came from NVIDIA. In a developer blog dated February 12, 2025, NVIDIA engineers described a closed-loop "inference-time scaling" workflow, demonstrated on attention kernels, that paired the DeepSeek-R1 reasoning model with an automated verifier running on an H100 GPU. The system iterated for roughly 15 minutes per problem, generating, verifying, and refining kernels, and the engineers reported numerically correct kernels for 100 percent of Level 1 problems and 96 percent of Level 2 problems in KernelBench ^[4]. The result illustrates how additional test-time compute can close much of the gap that frontier models showed when answering in a single pass. These figures measure correctness produced by an agentic system with extended search rather than the single-shot fast_1 rates reported in the original paper, so the two sets of numbers are not directly comparable.

Significance

KernelBench helped define a distinct line of evaluation focused on AI systems that write performant low-level code rather than ordinary application software. Its combination of an executable correctness check, a real speedup measurement on actual GPUs, and the tunable fast_p metric gave the research community a shared yardstick for the fast-moving "AI writing CUDA" trend that gathered momentum through 2025 ^[1]^[2]^[4].

The benchmark became a common target for agentic kernel-generation efforts and test-time methods, in which a model is wrapped in a loop that compiles, runs, profiles, and revises its own kernels. NVIDIA's DeepSeek-R1 workflow is one prominent example, and a series of subsequent academic systems have used KernelBench to report progress on automated kernel optimization ^[4]. Because gains on the benchmark map onto faster real kernels, improvements have direct practical value for the cost and speed of training and serving large models.

KernelBench also has clear limitations that its authors acknowledge. Tasks are evaluated only on fixed input shapes, so a kernel need not generalize to other sizes, and correctness is judged by numerical tolerance on a handful of random inputs rather than by formal verification. The lack of a prescribed train/test split means results depend on the protocol each group chooses, which can complicate comparison. There is also an inherent tension between maximizing correctness and maximizing speed, which the fast_p framing makes visible but does not resolve. Despite these caveats, KernelBench established performance-aware kernel generation as a measurable, reproducible challenge and remains a reference benchmark for the field ^[1]^[2].

References

1. Ouyang, Anne; Guo, Simon; Arora, Simran; Zhang, Alex L.; Hu, William; Ré, Christopher; Mirhoseini, Azalia. "KernelBench: Can LLMs Write Efficient GPU Kernels?" arXiv:2502.10517, February 14, 2025. https://arxiv.org/abs/2502.10517
2. Scaling Intelligence Lab, Stanford University. "KernelBench: Can LLMs Write Efficient GPU Kernels?" (project page and blog). https://scalingintelligence.stanford.edu/pubs/kernelbench/ and https://scalingintelligence.stanford.edu/blogs/kernelbench/
3. International Conference on Machine Learning (ICML) 2025 poster. "KernelBench: Can LLMs Write Efficient GPU Kernels?" https://icml.cc/virtual/2025/poster/43517
4. NVIDIA Developer Blog. "Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling." February 12, 2025. https://developer.nvidia.com/blog/automating-gpu-kernel-generation-with-deepseek-r1-and-inference-time-scaling/
