Pass@k

Updated Jun 9, 2026

Pass@k is the standard metric for evaluating code generation models: it measures the probability that at least one of k generated candidate solutions passes all of a problem's unit tests. It became the default way to score large language models on functional correctness after OpenAI's Codex paper introduced it alongside the HumanEval benchmark in 2021. ^[1] The appeal is simple. Instead of asking whether a model's single best guess looks plausible, pass@k asks whether the model can produce working code at all, given a few tries, and then verifies that code by actually running it.

Definition and intuition

The core idea predates the name. Kulal et al. introduced it in their 2019 SPoC paper on translating pseudocode to C++: they let a system generate up to 100 candidate programs per task and counted the task as solved if any candidate compiled and passed the held-out test cases. ^[2] Chen et al. generalized this into the pass@k metric in "Evaluating Large Language Models Trained on Code," the paper that shipped Codex and HumanEval. ^[1]

The definition reads: for a given problem, you draw k samples from the model, run each one against the unit tests, and the problem counts as solved if one or more of those k samples passes everything. Average that indicator over all problems in the benchmark and you get pass@k for the whole dataset. When k is 1, this collapses to plain accuracy on a single sample. When k is large, it rewards a model that can occasionally stumble onto the right answer even if most of its attempts are wrong.
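As a minimal sketch of this definition (the direct count, not the estimator that benchmarks use in practice), the snippet below assumes a hypothetical `results` mapping from a problem id to the pass/fail outcomes of its k samples. The names and data are illustrative only.

```python
from statistics import mean

def naive_pass_at_k(results: dict[str, list[bool]]) -> float:
    """Fraction of problems where at least one of the k samples passed."""
    return mean(1.0 if any(samples) else 0.0 for samples in results.values())

# Hypothetical outcomes for three problems with k = 3 samples each.
results = {
    "problem_a": [False, True, False],   # solved: one sample passes
    "problem_b": [False, False, False],  # unsolved: no sample passes
    "problem_c": [True, True, True],     # solved
}
print(naive_pass_at_k(results))  # 2 of 3 problems count as solved
```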

What makes pass@k a good fit for code is that correctness is checkable. A function either returns the expected output for every test input or it does not. There is no partial credit, no fuzzy similarity score, no human in the loop deciding whether the answer is close enough. This is execution-based evaluation, and it sidesteps the well-known unreliability of surface metrics like BLEU or exact-match string comparison, which can reward code that looks right but does not run. ^[4]

The estimation problem and the unbiased estimator

Here is where pass@k gets subtle. The obvious way to compute it is to generate exactly k samples per problem and check whether any pass. The trouble is variance. If a model solves a problem maybe 5% of the time and you only draw k=10 samples, your per-problem estimate is a noisy coin flip: sometimes you get lucky and one passes, sometimes none do, and the resulting pass@10 bounces around from run to run. To get a stable number you would need to repeat the whole sampling procedure many times, which is expensive.
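The numbers in that example can be worked out directly. This sketch assumes the k samples are independent and each passes with the same probability p, which is an idealization rather than something the source states.

```python
p, k = 0.05, 10  # per-sample pass rate and sample budget from the example

# Probability that at least one of k independent samples passes.
hit_prob = 1 - (1 - p) ** k

# A single run yields a 0/1 outcome per problem, i.e. one Bernoulli draw,
# so its variance is hit_prob * (1 - hit_prob).
variance = hit_prob * (1 - hit_prob)

print(f"P(at least one pass) = {hit_prob:.3f}")       # about 0.401
print(f"variance of 0/1 outcome = {variance:.3f}")    # about 0.240
```

A variance near 0.24 is close to the maximum of 0.25 for a 0/1 outcome, which is why a single run of k samples gives such an unstable per-problem reading.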

Chen et al. solved this with an unbiased estimator. The trick is to oversample. Generate a large number n of candidates per problem (the paper and most follow-up work use n = 200), count how many of those n samples are correct (call it c), and then estimate pass@k as the probability that a random subset of size k drawn from the n samples contains at least one of the c correct ones. ^[1] ^[3] In plain notation:

pass@k = E_problems [ 1 - C(n - c, k) / C(n, k) ]

The term C(n - c, k) / C(n, k) is the probability that all k chosen samples come from the n - c incorrect ones, so subtracting it from 1 gives the probability that at least one of the k is correct. C(a, b) is the binomial coefficient "a choose b." You compute this per problem and average over the benchmark. ^[3]
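A direct transcription of the formula might look like the sketch below. It uses Python's exact integer `math.comb`, so overflow is not a concern here; the function names and the sample counts are illustrative assumptions.

```python
from math import comb
from statistics import mean

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over the benchmark."""
    return mean(pass_at_k(n, c, k) for c in correct_counts)

# Hypothetical benchmark: three problems, n = 200 samples each.
print(benchmark_pass_at_k([10, 0, 200], n=200, k=1))
```

With k = 1 the per-problem estimate reduces to c / n, so the three problems above contribute 0.05, 0.0, and 1.0.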

Why is this better than the naive shortcut pass@k = 1 - (1 - p)^k, where p is the estimated single-sample pass rate c / n? The shortcut is exact only when p is the true pass rate. Plugging in the estimate c / n amounts to drawing k items from the n generated samples with replacement, and because 1 - (1 - p)^k is concave in p, the plug-in value systematically underestimates the true pass@k whenever k > 1. ^[3] The combinatorial estimator instead counts subsets drawn without replacement from the n samples, which makes it exactly unbiased for n ≥ k and, as a U-statistic, the minimum-variance unbiased estimate. ^[3]
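The bias can be checked numerically. The sketch below assumes the number of correct samples c follows a Binomial(n, p) distribution with a known true pass rate p, and computes the exact expected value of each estimator; the parameter values are arbitrary choices for illustration.

```python
from math import comb

def unbiased(n: int, c: int, k: int) -> float:
    return 1.0 - comb(n - c, k) / comb(n, k)

def plug_in(n: int, c: int, k: int) -> float:
    return 1.0 - (1.0 - c / n) ** k

def expected_value(estimator, n: int, p: float, k: int) -> float:
    """Exact expectation of an estimator when c ~ Binomial(n, p)."""
    return sum(
        comb(n, c) * p**c * (1 - p) ** (n - c) * estimator(n, c, k)
        for c in range(n + 1)
    )

n, p, k = 20, 0.05, 10
truth = 1 - (1 - p) ** k  # pass@k under the true pass rate

print(f"true pass@{k}:        {truth:.4f}")
print(f"E[unbiased estimate]: {expected_value(unbiased, n, p, k):.4f}")
print(f"E[plug-in estimate]:  {expected_value(plug_in, n, p, k):.4f}")
```

Under these assumptions the combinatorial estimator matches the true value in expectation, while the plug-in formula comes out lower.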

There is a numerical wrinkle worth knowing. Computing the two binomial coefficients separately produces enormous intermediate values that can overflow fixed-width or floating-point arithmetic for large n, so implementations evaluate the ratio as a product of small fractions instead. The ratio equals the product of (n - c - i) / (n - i) for i from 0 to k - 1; the reference implementation uses the equivalent product of (1 - k / i) for i from n - c + 1 to n, which stays numerically stable. The implementation also returns 1.0 whenever n - c < k, since in that case any subset of size k is guaranteed to include a correct sample. ^[3] This estimator is the version that virtually every code benchmark reports, even when papers just write "pass@k."
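A stable version can be sketched with the standard library alone. It multiplies the factors (1 - k / i) for i from n - c + 1 to n, which is algebraically equal to C(n - c, k) / C(n, k); the reference code in the human-eval repository uses NumPy for the same product, so treat this pure-Python form as an adaptation rather than a copy.

```python
from math import comb, prod

def pass_at_k_stable(n: int, c: int, k: int) -> float:
    """Per-problem pass@k estimate without forming large binomial coefficients."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # Each factor lies in [0, 1], so the running product never overflows.
    return 1.0 - prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Cross-check against the exact integer formula for one setting.
n, c, k = 200, 10, 10
exact = 1.0 - comb(n - c, k) / comb(n, k)
print(abs(pass_at_k_stable(n, c, k) - exact) < 1e-12)  # True
```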

Pass@1 vs pass@k vs pass^k

These three look similar and measure very different things.

Metric	Question it answers	Typical use
pass@1	Does the single sampled solution pass?	Headline accuracy, deterministic or low-temperature decoding
pass@k	Does at least one of k samples pass?	Measuring solvability under repeated sampling, k often 5, 10, or 100
pass^k	Do all k independent attempts pass?	Reliability and consistency, especially for agents

Pass@1 is what most leaderboards quote today because it reflects how people actually use a model: you ask once and you get one answer. Pass@k with larger k is more of a research lens. It tells you about a model's coverage, the breadth of solutions it can reach if you let it try repeatedly and have a perfect verifier to pick the winner. The gap between pass@1 and pass@100 can be enormous. Codex-12B scored 28.8% pass@1 on HumanEval but 72.3% pass@100, meaning the right answer was often somewhere in the model's distribution even when its first guess missed. ^[1] ^[6]

Pass^k (sometimes written "pass power k") flips the logic. Instead of "at least one of k succeeds" it asks whether all k independent attempts succeed. ^[5] This is a reliability metric. A coding assistant that solves a task 90% of the time has a respectable pass@1, but if you run it as an autonomous agent through a ten-step workflow where every step must work, the compounding failure rate matters far more than the single-shot rate. Pass^k captures that brittleness and has gained traction as people deploy LLMs in agentic pipelines where consistency, not just peak capability, is the thing that breaks. ^[5]
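The contrast is easy to see with the 90% figure. This sketch assumes every attempt succeeds independently with the same probability p, which is a simplification; real attempts on the same task are often correlated.

```python
def pass_at_k_iid(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k

def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k

p, k = 0.9, 10
print(f"pass@{k} = {pass_at_k_iid(p, k):.4f}")  # essentially 1.0
print(f"pass^{k} = {pass_pow_k(p, k):.4f}")     # about 0.3487
```

Under these assumptions, a model that looks nearly perfect by pass@10 completes ten consecutive attempts without a failure only about a third of the time.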

Temperature and sampling tradeoffs

Pass@k cannot be discussed without temperature. Sampling temperature controls how much randomness goes into each token choice, and it trades off the two metrics directly. Low temperature makes the model greedy and repetitive, which is good when you only get one shot, because your single sample lands on the model's most confident answer. High temperature produces diverse samples, which is good when you get many shots, because diversity raises the odds that at least one of them is correct.

Chen et al. quantified this. They found temperature around 0.2 was best for pass@1 and around 0.8 was best for pass@100. ^[1] ^[6] The intuition is that at k=1 you want to bet everything on the mode of the distribution, while at large k a low temperature wastes draws by generating near-duplicates of the same wrong answer. This is why comparing pass@k numbers across papers requires care: a model evaluated at its optimal temperature for each k will look better than one frozen at a single temperature, and not every paper reports the setting it used.
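The mechanism behind this tradeoff can be illustrated with temperature-scaled softmax over next-token scores. The logits below are made up for illustration, and the function is a generic sketch rather than any particular model's decoding code.

```python
import math

def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for t in (0.2, 0.8):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(pr, 3) for pr in probs])
```

At temperature 0.2 nearly all of the probability mass sits on the top token (about 0.99 here), while at 0.8 it drops to roughly 0.69 and the alternatives get sampled often enough to produce distinct candidates.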

Benchmarks that report pass@k

Pass@k is now the lingua franca of code evaluation. The major benchmarks:

Benchmark	Year	Size	What it tests	Primary metric
HumanEval	2021	164 problems	Hand-written Python function synthesis from docstrings	pass@1, pass@10, pass@100
MBPP	2021	974 problems	Entry-level Python tasks, three asserts each	pass@1 (pass@k if multi-sampled)
APPS	2021	10,000 problems	Competition-style problems, three difficulty tiers	pass@k, test-case pass rate
SWE-bench	2023	2,294 tasks	Resolving real GitHub issues in large repos	resolved rate (pass@1)
LiveCodeBench	2024	1,000+ and growing	Contamination-free contest problems, time-filtered	pass@1

HumanEval is the original 164 hand-written problems, each a function signature plus docstring whose unit tests are kept out of the prompt, deliberately authored by the OpenAI team so they would not appear in training data. ^[1] MBPP, from Google Research, adds 974 "mostly basic" Python problems aimed at beginners, each with three assert statements. ^[7] APPS, from Hendrycks et al., scales up to 10,000 problems scraped from sites like Codeforces and Kattis, split into introductory, interview, and competition tiers, and it reports both pass@k and a finer-grained fraction of test cases passed. ^[8]

SWE-bench marks a shift toward realism. Instead of self-contained functions it gives a model an actual GitHub issue and the full repository, and asks for a patch that makes the failing tests pass without breaking the passing ones. ^[9] Its headline number is the "resolved rate," which is pass@1 in disguise: the fraction of issues fixed on the first submitted patch. LiveCodeBench attacks the contamination problem head-on by continuously scraping fresh problems from LeetCode, AtCoder, and Codeforces and filtering out anything released before a model's training cutoff, so its pass@1 reflects genuinely unseen tasks. ^[10]
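The SWE-bench resolution criterion described above can be sketched as a simple check over two groups of tests. The data structures and test names here are hypothetical and do not reproduce the benchmark's actual harness.

```python
def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """A patch resolves an issue only if the previously failing tests now
    pass and the previously passing tests still pass."""
    return all(fail_to_pass.values()) and all(pass_to_pass.values())

def resolved_rate(tasks: list[tuple[dict[str, bool], dict[str, bool]]]) -> float:
    """Fraction of tasks whose single submitted patch resolves the issue."""
    return sum(is_resolved(f, p) for f, p in tasks) / len(tasks)

# Hypothetical post-patch test outcomes for three tasks.
tasks = [
    ({"test_bug": True}, {"test_existing": True}),    # fixed, nothing broken
    ({"test_bug": True}, {"test_existing": False}),   # fixed, but a regression
    ({"test_bug": False}, {"test_existing": True}),   # not fixed
]
print(resolved_rate(tasks))  # 1 of 3 tasks resolved
```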

Criticisms and limitations

Pass@k is useful, but it is not the whole story, and the field has accumulated a clear-eyed list of its weaknesses.

The biggest is test coverage. Pass@k is only as good as the unit tests behind it. A solution that passes every test in the suite can still be wrong on inputs nobody thought to include, and HumanEval-style problems often ship with a handful of tests that miss edge cases. ^[11] You can only verify what you can test, so a high pass@k overstates true correctness whenever the tests are weak. Studies auditing benchmark quality have found buggy reference solutions and insufficient tests across popular datasets. ^[12]
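A small, invented example shows how a wrong solution can pass a thin test suite. The function and the tests below are hypothetical and not taken from any benchmark.

```python
def is_leap_year(year: int) -> bool:
    # Buggy candidate: ignores the century rule, so 1900 is misclassified.
    return year % 4 == 0

def passes(func, tests: list[tuple[int, bool]]) -> bool:
    """True if the function returns the expected value for every test input."""
    return all(func(value) == expected for value, expected in tests)

weak_tests = [(2004, True), (2001, False), (2000, True)]
strong_tests = weak_tests + [(1900, False)]  # adds the missing edge case

print(passes(is_leap_year, weak_tests))    # True: counted as correct
print(passes(is_leap_year, strong_tests))  # False: the bug is exposed
```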

Contamination is the second. HumanEval and MBPP have been public for years, their solutions are scattered across the internet and GitHub, and models trained on web-scale corpora have very likely seen them. ^[11] A model can score well by recalling a memorized solution rather than reasoning, which inflates numbers without reflecting capability. This is precisely the gap LiveCodeBench and other time-split benchmarks were built to close. ^[10]
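The time-split idea amounts to a date filter. In this sketch the problem records, field names, and cutoff date are all hypothetical, and a real pipeline would also need trustworthy release dates and cutoff information.

```python
from datetime import date

def unseen_problems(problems: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only problems released after the model's training cutoff."""
    return [p for p in problems if p["released"] > training_cutoff]

problems = [
    {"id": "contest-101", "released": date(2023, 5, 1)},
    {"id": "contest-202", "released": date(2024, 2, 15)},
]
cutoff = date(2023, 9, 1)  # assumed training cutoff for the model under test

print([p["id"] for p in unseen_problems(problems, cutoff)])  # ['contest-202']
```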

Third is k inflation. Large k values flatter models. Reporting pass@100 when no real deployment would ever generate 100 samples and cherry-pick with a perfect oracle paints an optimistic picture, because in production there is rarely a free verifier to tell you which of the 100 is the correct one. The gap between pass@100 and pass@1 is in part a gap between a research idealization and reality. Related to this, pass@k's binary pass/fail per problem is coarse: it gives no credit for code that is 95% right, and it does not measure efficiency, readability, security, or style. ^[4] ^[11]

Finally, the metric says nothing about whether a single model run is reliable, which is the gap pass^k was introduced to fill. ^[5]

Relation to other metrics

Pass@k sits in a family of functional-correctness measures. It is the execution-based answer to older similarity metrics like BLEU, CodeBLEU, and exact match, which compare generated code to a reference string and reward textual closeness rather than behavior. ^[4] Those metrics are cheap and need no test harness, but they punish correct solutions that differ stylistically from the reference and reward broken code that happens to look similar, which is why the field largely moved on from them for code.

Within execution-based evaluation, pass@1 is the special case people quote most, the resolved rate on SWE-bench is pass@1 under another name, and the test-case pass rate used by APPS is a more granular cousin that scores the fraction of tests passed rather than all-or-nothing per problem. ^[8] Pass@k also connects to inference-time scaling: "best-of-n" sampling, where you generate n candidates and pick one with a reward model or verifier, is an attempt to convert a high pass@k into a high effective pass@1 by replacing the perfect oracle with a learned ranker. ^[1] When verification is reliable, the headroom between pass@1 and pass@k is exactly the prize that reranking and agentic search try to capture.
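The difference between all-or-nothing scoring and a test-case pass rate shows up even on toy data. The outcomes below are invented, with one sampled solution per problem.

```python
from statistics import mean

# Hypothetical per-test outcomes for one sampled solution per problem.
test_outcomes = [
    [True, True, True],     # every test passes
    [True, False, True],    # two of three tests pass
    [False, False, False],  # nothing passes
]

# Strict scoring: a problem counts only if all of its tests pass (pass@1 style).
strict_accuracy = mean(1.0 if all(tests) else 0.0 for tests in test_outcomes)

# Test-case pass rate: average fraction of tests passed per problem.
test_case_rate = mean(sum(tests) / len(tests) for tests in test_outcomes)

print(f"strict accuracy:     {strict_accuracy:.3f}")  # 0.333
print(f"test-case pass rate: {test_case_rate:.3f}")   # 0.556
```

The partially correct second solution contributes nothing to the strict score but two thirds of a point to the test-case rate, which is the extra granularity the second metric offers.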

References

1. Chen, Mark; Tworek, Jerry; Jun, Heewoo; et al. "Evaluating Large Language Models Trained on Code." arXiv preprint, 2021. https://arxiv.org/abs/2107.03374
2. Kulal, Sumith; Pasupat, Panupong; Chandra, Kartik; Lee, Mina; Padon, Oded; Aiken, Alex; Liang, Percy. "SPoC: Search-based Pseudocode to Code." Advances in Neural Information Processing Systems (NeurIPS), 2019. https://arxiv.org/abs/1906.04908
3. Lee, Han Chung. "Statistics for AI/ML, Part 4: pass@k and Unbiased Estimator." Personal blog, 2025. https://leehanchung.github.io/blogs/2025/09/08/pass-at-k/
4. Brenndoerfer, Michael. "Code Evaluation: Functional Correctness and pass@k." mbrenndoerfer.com, 2025. https://mbrenndoerfer.com/writing/code-evaluation-functional-correctness-pass-at-k-benchmarks
5. Schmid, Philipp. "Pass@k vs Pass^k: Understanding Agent Reliability." philschmid.de, 2025. https://www.philschmid.de/agents-pass-at-k-pass-power-k
6. Rastogi, Ritvik. "Papers Explained 45: Codex." DAIR.AI, Medium, 2023. https://medium.com/dair-ai/papers-explained-45-codex-caca940feb31
7. Austin, Jacob; Odena, Augustus; Nye, Maxwell; et al. "Program Synthesis with Large Language Models." arXiv preprint, 2021. https://arxiv.org/abs/2108.07732
8. Hendrycks, Dan; Basart, Steven; Kadavath, Saurav; et al. "Measuring Coding Challenge Competence With APPS." NeurIPS Datasets and Benchmarks, 2021. https://arxiv.org/abs/2105.09938
9. Jimenez, Carlos E.; Yang, John; Wettig, Alexander; et al. "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" International Conference on Learning Representations (ICLR), 2024. https://arxiv.org/abs/2310.06770
10. Jain, Naman; Han, King; Gu, Alex; et al. "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code." arXiv preprint, 2024. https://arxiv.org/abs/2403.07974
11. Brenndoerfer, Michael. "HumanEval: Functional Code Generation Evaluation with Pass@k." mbrenndoerfer.com, 2025. https://mbrenndoerfer.com/writing/humaneval-code-generation-benchmark-pass-at-k
12. Siddiq, Mohammed Latif; Santos, Joanna C. S.; et al. "The Fault in our Stars: Quality Assessment of Code Generation Benchmarks." IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), 2024. https://s2e-lab.github.io/preprints/scam24-benchmarks-preprint.pdf
13. OpenAI. "human-eval: Code for the paper Evaluating Large Language Models Trained on Code." GitHub repository, 2021. https://github.com/openai/human-eval
