HumanEval is a benchmark for evaluating the code generation capabilities of large language models, consisting of 164 hand-written Python programming problems. It was introduced by Mark Chen, Jerry Tworek, and colleagues at OpenAI in July 2021 as part of the Codex paper, "Evaluating Large Language Models Trained on Code" (arXiv:2107.03374). The benchmark also introduced the pass@k metric, which became the standard way to measure functional correctness in code synthesis. HumanEval was one of the first benchmarks to demonstrate that language models could generate working code from natural language descriptions, and it played a central role in the development and marketing of AI coding assistants like GitHub Copilot. By 2025, frontier models had largely saturated the benchmark; as scores climbed, harder alternatives appeared, including HumanEval+ (Liu et al., 2023), the broader EvalPlus evaluation framework, and contamination-resistant successors like LiveCodeBench and SWE-bench.
HumanEval was created as part of OpenAI's research into Codex, a GPT language model fine-tuned on publicly available code from GitHub. The Codex paper (Chen et al., 2021) needed a way to evaluate whether the model could actually produce functionally correct code, not just code that looked plausible. Existing code benchmarks at the time largely focused on code understanding (predicting the output of given code) or text similarity to reference solutions, rather than execution-based evaluation of code generation from a specification.
The paper itself has 58 listed authors, including many researchers who later became leaders in commercial AI: Greg Brockman (cofounder, then president of OpenAI), Ilya Sutskever (cofounder and chief scientist), Wojciech Zaremba (cofounder), Dario Amodei (later cofounder of Anthropic), Jared Kaplan (also later at Anthropic), Mira Murati (later CTO of OpenAI), and Jan Leike (later head of OpenAI's superalignment team and then at Anthropic). The unusually broad author list reflects the size of the engineering effort behind Codex, which included data collection, infrastructure, fine-tuning, evaluation, and the deployment work that became GitHub Copilot.
The OpenAI team built the benchmark from scratch rather than reusing existing datasets, specifically because they wanted to avoid contamination. Code crawled from GitHub almost certainly overlapped with public competitive-programming archives, university homework sets, and tutorial repositories. By writing every problem by hand, the authors hoped to create a clean test set that the model had not memorized. They argued that any benchmark drawn from public corpora would be suspect, since the same data had likely been ingested during pre-training. That reasoning holds up well in retrospect: the benchmark survived for several years before contamination concerns finally caught up with it.
Each problem consists of:
- a function signature,
- a docstring describing the desired behavior, usually with a few worked input-output examples,
- a canonical reference solution written by the authors, and
- a hidden suite of unit tests (an average of 7.7 per problem) used to judge correctness.
The problems were designed to be roughly comparable to easy software engineering interview questions. They cover a range of difficulty levels, from straightforward string manipulation and list operations to problems requiring basic algorithmic thinking like sorting, searching, prime checks, and small parsers. All problems are in Python, and all are single-function: there is no multi-file context, no external library dependencies beyond the standard library, and no need to interpret existing code. The model receives the prompt (signature plus docstring) and must produce the function body. The completed function is then executed against the hidden test suite, and the run is counted as a pass only if every assertion succeeds within a short time budget.
To give a concrete sense of the benchmark's content, here are descriptions of representative HumanEval problems:
String manipulation (easy). A function is_palindrome(string: str) -> bool that checks whether a given string reads the same forward and backward. The docstring provides examples like is_palindrome('racecar') returning True. This is among the simplest problems in the benchmark and is solved by virtually every modern model on the first try.
List operations (medium). A function sort_array(arr: list) -> list that sorts an array of non-negative integers based on the number of ones in their binary representation, with ties broken by the decimal value. The docstring includes examples showing the expected behavior. This requires combining binary conversion, counting, and a custom sort key.
Mathematical reasoning (harder). Functions like is_prime(n: int) -> bool, computing the greatest common divisor, summing Fibonacci-like sequences, or finding the largest divisor smaller than n. These require basic number theory and care with edge cases such as 0, 1, and negative inputs.
String parsing (harder). A function that parses a string of nested parentheses and returns the maximum depth, or one that splits a sentence on certain delimiters and counts vowels. These look easy on paper but trip up smaller models because they require building a small state machine or handling tricky escape conditions.
Each problem follows the same format. The model receives the function signature and docstring, and must generate the function body. The generated code is then executed against the hidden unit tests to determine correctness. Because the average of 7.7 tests per problem is small, partial implementations sometimes pass even when the underlying algorithm is wrong, which is the central limitation that HumanEval+ later addressed.
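To make the format concrete, the sketch below shows what one task looks like and how a completion is judged. The field names follow the JSONL schema in OpenAI's human-eval release (task_id, prompt, entry_point, test), but the problem text and tests here are illustrative rather than copied verbatim from the dataset.

```python
# A minimal sketch of a HumanEval-style task and a naive correctness check.
task = {
    "task_id": "HumanEval/illustrative",
    "entry_point": "is_palindrome",
    "prompt": (
        "def is_palindrome(string: str) -> bool:\n"
        '    """Check whether the given string reads the same forwards and backwards.\n'
        "    >>> is_palindrome('racecar')\n"
        "    True\n"
        '    """\n'
    ),
    "test": (
        "def check(candidate):\n"
        "    assert candidate('racecar') == True\n"
        "    assert candidate('abc') == False\n"
        "    assert candidate('') == True\n"
    ),
}

# The model sees only `prompt` and must produce the function body.
completion = "    return string == string[::-1]\n"

# Evaluation assembles prompt + completion + tests into one program and runs it.
program = (
    task["prompt"] + completion + "\n" + task["test"]
    + f"\ncheck({task['entry_point']})\n"
)
try:
    exec(program, {})  # the real harness runs this in a sandboxed subprocess with a timeout
    passed = True
except Exception:
    passed = False
print(passed)  # True: every assertion succeeded
```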
One of HumanEval's most lasting contributions is the pass@k metric. Code generation is inherently stochastic: the same model with the same prompt can produce different solutions on different runs because of sampling temperature. Rather than measuring whether a single attempt is correct, pass@k measures the probability that at least one correct solution appears among k generated samples.
The metric is computed with an unbiased estimator rather than by naive sampling. Instead of generating exactly k samples and checking whether any pass, the evaluator generates n >= k samples (in the original paper, n = 200) and computes the expected pass rate using the formula:
pass@k = E_problems [1 - C(n - c, k) / C(n, k)]
where n is the total number of samples generated per problem, c is the number of correct samples among those n, and C denotes the binomial coefficient. This formula calculates the probability that at least one correct solution would appear in a random k-sized subset of the n generated samples, then averages across all 164 problems. The expectation is taken over problems, so a single problem with c = 0 contributes 0 and a problem with c = n contributes 1.
The unbiased estimator matters because naive estimation (generating exactly k samples and checking if any pass) has high variance and requires many evaluation runs to stabilize. With n = 200, a single evaluation run produces a low-variance estimate for any k up to 100. The Codex paper notes that there is no unbiased way to estimate pass@k when fewer than k samples are available, which is why generating at least k completions (and preferably many more) is necessary.
For example, if a model generates n = 200 samples for a problem and c = 40 of them are correct, then pass@1 = 40/200 = 20%, pass@10 = 1 - C(160, 10)/C(200, 10) ≈ 90%, and pass@100 ≈ 100%, because a random 100-sample subset is all but guaranteed to contain at least one of the 40 correct solutions.
This shows that a model that succeeds only 20% of the time on a single try can almost certainly solve the problem if given 100 attempts, which has practical implications for systems that can generate and test multiple candidates.
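A short sketch of the estimator, matching the formula above but written in the numerically stable product form commonly used in practice (the function name here is illustrative, not a specific library's API):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: total samples generated, c: samples that passed all tests, k: sample budget.
    Equivalent to 1 - C(n - c, k) / C(n, k), but avoids huge binomial coefficients.
    """
    if n - c < k:  # every size-k subset must contain at least one correct sample
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Worked example from the text: n = 200 samples, c = 40 correct.
print(round(pass_at_k(200, 40, 1), 3))    # 0.2
print(round(pass_at_k(200, 40, 10), 3))   # ~0.899
print(round(pass_at_k(200, 40, 100), 3))  # ~1.0
# The benchmark-level score is the mean of this quantity over all 164 problems.
```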
The Codex paper also studied how sampling temperature affects pass@k. Lower temperatures (closer to 0) produce more deterministic, high-likelihood completions, which is good for pass@1 because the most probable answer is usually the best the model has. Higher temperatures inject more diversity into the samples, which hurts pass@1 (because some samples become noisier) but helps pass@100 (because the larger set of candidates is more likely to contain at least one correct answer). In the paper, the best pass@1 results came from low temperatures (around 0.2), while the best pass@100 results came from higher temperatures (around 0.8), with the optimal temperature rising as k grows. This insight became important for downstream tools like GitHub Copilot, which had to choose temperatures that balanced correctness against suggestion variety.
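As a toy illustration of this trade-off (the numbers below are invented, and treating high-temperature samples as independent draws is an idealization), compare a low-temperature setting that produces near-duplicate samples with a high-temperature setting that produces more diverse but individually weaker ones:

```python
# Toy model, not from the paper: why temperature trades pass@1 against pass@100.
k = 100

# Low temperature: samples are near-duplicates, so extra samples add almost nothing.
p_low = 0.35
pass1_low, passk_low = p_low, p_low

# High temperature: samples are individually less accurate but roughly independent.
p_high = 0.20
pass1_high = p_high
passk_high = 1 - (1 - p_high) ** k

print(f"low T : pass@1={pass1_low:.2f}  pass@{k}={passk_low:.2f}")    # 0.35 / 0.35
print(f"high T: pass@1={pass1_high:.2f}  pass@{k}={passk_high:.2f}")  # 0.20 / ~1.00
```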
In the original Codex paper, the headline HumanEval results were:
| Model | pass@1 | pass@10 | pass@100 |
|---|---|---|---|
| Codex (12B) | 28.8% | 46.8% | 72.3% |
| Codex-S (12B, fine-tuned) | 37.7% | 59.5% | 77.5% |
| GPT-3 (175B) | 0.0% | 0.0% | 0.0% |
| GPT-J (6B) | 11.4% | not reported | 27.7% |
The paper also reported that with mean log-probability reranking (sampling 100 candidates and selecting the one the model scores as most likely), Codex-S reached 44.5% pass@1, and that plain repeated sampling, keeping any candidate that passed the tests, let the original 12B Codex solve 70.2% of problems with 100 samples each. These numbers established the benchmark's first reference points.
The fact that GPT-3, despite being a much larger model, scored 0% on code generation while the code-fine-tuned Codex scored nearly 29% demonstrated the importance of domain-specific training data. This finding helped justify the broader push toward code-specialized models like Code Llama, DeepSeek Coder, Qwen Coder, and StarCoder. It also exposed a quirk of GPT-3 specifically: the base model would happily emit Python that looked correct but rarely worked, because it was trained mostly on natural language rather than executable code.
After its release, HumanEval quickly became the go-to benchmark for AI code generation. Scores rose rapidly as models improved, both from larger base models and from code-specific fine-tuning techniques like instruction tuning, supervised fine-tuning on synthetic problem-solution pairs, and reinforcement learning from execution feedback.
| Year | Model | Organization | pass@1 |
|---|---|---|---|
| 2021 | Codex (12B) | OpenAI | 28.8% |
| 2022 | InCoder (6.7B) | Meta | 15.2% |
| 2022 | code-davinci-002 | OpenAI | 47.0% |
| 2022 | PaLM (540B) | Google | 26.2% |
| 2023 | GPT-3.5 (ChatGPT) | OpenAI | 72.6% |
| 2023 | GPT-4 | OpenAI | 67.0% (paper); 86.6% (later) |
| 2023 | Claude 2 | Anthropic | 71.2% |
| 2023 | Code Llama 34B | Meta | 48.8% |
| 2023 | WizardCoder 34B | Microsoft | 73.2% |
| 2023 | Phind-CodeLlama 34B v2 | Phind | 73.8% |
| 2024 | DeepSeek-Coder-V2 | DeepSeek | 90.2% |
| 2024 | Llama 3.1 405B | Meta | 89.0% |
| 2024 | GPT-4o | OpenAI | 90.2% |
| 2024 | Claude 3.5 Sonnet | Anthropic | 92.0% |
| 2024 | Qwen2.5-Coder 32B Instruct | Alibaba | 92.7% |
| 2024 | o1-preview | OpenAI | 96.3% |
| 2024 | o1-mini | OpenAI | 92.4% |
| 2025 | Claude Sonnet 4 | Anthropic | 95.1% |
| 2025 | Claude Opus 4 | Anthropic | 94.5% |
| 2025 | o3 (high) | OpenAI | 93.3% |
| 2025 | DeepSeek R1 | DeepSeek | 96.1% |
| 2025 | GPT-5 | OpenAI | 93.4% |
| 2025 | Kimi K2 0905 | Moonshot AI | 94.5% |
The benchmark went from a serious challenge (28.8% in 2021) to near-saturation (above 95% for several frontier models in 2025) in just four years. This rapid improvement reflected better base models, better code-specific training data, and improved post-training methods like RLHF and execution-based reinforcement learning. It also raised an obvious question: what can a benchmark with a hard ceiling of 100% still reveal once scores cluster around 95%? When differences between top models amount to a few problems out of 164, noise from sampling and prompt formatting starts to dominate any genuine capability gap.
As models improved, several limitations of HumanEval became apparent. By 2024, virtually every serious code-capable LLM scored above 85%, and discussions in the research community shifted toward whether the benchmark was still measuring anything real.
Insufficient test cases. The original benchmark averages only 7.7 unit tests per problem. This is too few to catch many subtle bugs. A generated solution might pass all tests while still being incorrect for edge cases the test suite never exercises. Liu et al. (2023) showed that this test insufficiency inflated pass rates, with measured pass@k dropping by up to roughly 19 to 29 percent when models were re-evaluated with more thorough tests.
Narrow problem scope. All 164 problems are standalone single-function Python problems. Real-world programming involves multi-file projects, complex dependencies, debugging existing code, understanding library APIs, and coordinating changes across modules. HumanEval captures none of this complexity. A model that scores 95% on HumanEval might still flounder on a 10,000-line codebase.
Ground-truth errors. The EvalPlus project found 18 defects (about 11% of problems) in HumanEval's original ground-truth solutions, including unhandled edge cases, incorrect logic, and performance issues. When the benchmark itself has bugs, the scores become harder to interpret. A correct solution that disagrees with a flawed reference solution can be marked wrong by the original test suite.
Python-only. The benchmark only tests Python code generation. Real software engineering involves dozens of programming languages, and a model's Python performance may not generalize to other languages, especially low-resource ones with thinner training corpora. MultiPL-E later showed that the gap between Python and languages like Racket or Lua can exceed 25 percentage points.
Contamination risk. Because HumanEval has been publicly available since July 2021 and is one of the most widely cited code benchmarks, there is a significant risk that its problems (or close paraphrases) have appeared in the training data of newer models. This would inflate scores without reflecting genuine improvement. Studies of common pre-training corpora found that 8 to 18% of HumanEval problems overlap with the training data of popular open datasets like RedPajama and StarCoder-Data, which is enough to materially shift reported numbers.
The contamination problem deserves special attention because HumanEval is among the most widely discussed benchmarks in AI. Since its release in July 2021, the 164 problems have been reproduced in thousands of blog posts, research papers, tutorials, and code repositories. Any model trained on a web-scale corpus almost certainly encounters HumanEval problems (or close variants) during training.
A 2023 study by Yang et al. ("Rethinking Benchmark and Contamination for Language Models with Rephrased Samples," arXiv:2311.04850) showed that intentionally rephrasing HumanEval problems and including them in fine-tuning data can boost a small CodeLlama 7B model's score from 32.9% to 67.7%, and a 13B model's score from 36.0% to 81.1%. These numbers approach or exceed GPT-4's reported HumanEval performance, despite the underlying models being far less capable. The study also showed that standard n-gram and string-matching decontamination methods fail to flag these rephrased samples, which is why they slip through typical training-data filters.
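A minimal sketch of why rephrasing defeats this kind of check: typical decontamination compares token n-grams between benchmark problems and training documents, so an exact copy is flagged while a paraphrase with no shared n-gram passes untouched. The docstring text and threshold below are illustrative.

```python
def ngrams(text: str, n: int = 10) -> set:
    """Whitespace-tokenized n-grams, as used by simple string-matching decontamination."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlaps(benchmark_doc: str, training_doc: str, n: int = 10) -> bool:
    """Flag a training document if it shares any n-gram with the benchmark problem."""
    return bool(ngrams(benchmark_doc, n) & ngrams(training_doc, n))

problem = "Check if in given list of numbers, are any two numbers closer to each other than a given threshold."
verbatim_copy = problem
rephrased = "Return True when some pair of values in the list lies within the given distance of one another."

print(overlaps(problem, verbatim_copy))  # True  -- exact copies are caught
print(overlaps(problem, rephrased))      # False -- a rephrased duplicate slips through
```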
A related study introducing EvoEval (Xia et al., 2024) evaluated 51 LLMs on benchmarks evolved from HumanEval, finding an average performance drop of 39.4% across models, with individual drops ranging from 19.6% to 47.7%. The rankings of models also changed substantially, suggesting that leaderboard position on the original HumanEval does not reliably predict which model is actually better at code.
This creates a paradox: the more popular a benchmark becomes, the less reliable it is as a measure of genuine capability. Several strategies have been proposed to mitigate this:
- drawing problems from material published after a model's training cutoff, as LiveCodeBench does with its date-filtered contest problems;
- programmatically mutating or evolving existing problems so that memorized solutions no longer apply, as in EvoEval; and
- strengthening decontamination of training corpora, while recognizing that n-gram and string-matching filters miss rephrased copies.
To address the test insufficiency problem, Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang introduced HumanEval+ in their 2023 paper "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation" (arXiv:2305.01210, NeurIPS 2023). HumanEval+ uses the same 164 problems but dramatically expands the test suite.
The augmentation process works in two stages:
1. An LLM (ChatGPT), guided by the original ground-truth solution and the existing examples, generates a set of seed inputs aimed at corner cases.
2. Type-aware input mutation then scales these seeds up by orders of magnitude, randomly perturbing inputs while preserving their types; the expected output for each new input is obtained by running the (corrected) ground-truth solution, so every generated input becomes a test.
The result is an average of 764 tests per problem (up from 7.7), totaling roughly 125,000 tests across the benchmark. This expanded coverage catches a significant number of previously undetected incorrect solutions that happened to satisfy the thin original test suite.
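The sketch below illustrates the type-aware mutation idea on the binary-ones sorting problem described earlier. It is a toy stand-in for EvalPlus's actual implementation: the seed list plays the role of the LLM-generated seeds, and the corrected reference solution acts as the output oracle.

```python
import random

random.seed(0)

def mutate(value):
    """Toy type-aware mutation: perturb a value while preserving its type."""
    if isinstance(value, bool):
        return not value
    if isinstance(value, int):
        return value + random.choice([-3, -1, 1, 7])
    if isinstance(value, list):
        out = [mutate(v) for v in value]
        if out and random.random() < 0.3:
            out.pop(random.randrange(len(out)))  # occasionally shrink the list
        return out
    return value

def ground_truth_sort_by_ones(arr):
    """The (corrected) reference solution serves as the oracle for expected outputs."""
    return sorted(arr, key=lambda x: (bin(x).count("1"), x))

# Stage 1 stand-in: a handful of seed inputs (produced by an LLM in the real pipeline).
seed_inputs = [[1, 5, 2, 3, 4], [0], [15, 7, 3, 1]]

# Stage 2: type-aware mutation scales the seeds up; each new input becomes a test
# whose expected output is computed by running the oracle.
expanded_tests = []
for seed in seed_inputs:
    for _ in range(50):
        candidate_input = [abs(x) for x in mutate(seed)]  # keep inputs non-negative ints
        expanded_tests.append((candidate_input, ground_truth_sort_by_ones(candidate_input)))

print(len(expanded_tests))  # 150 generated (input, expected output) pairs
print(expanded_tests[0])    # a mutated input list and its oracle-computed expected output
```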
When models were re-evaluated on HumanEval+, pass@1 scores dropped substantially. More importantly, the model rankings changed. Some models that appeared weaker on the original HumanEval actually outperformed their supposed superiors when evaluated with more thorough tests, which has the practical implication that the rank order on the original HumanEval cannot be trusted for fine distinctions between models.
| Model | HumanEval pass@1 | HumanEval+ pass@1 | Drop |
|---|---|---|---|
| ChatGPT (GPT-3.5) | 72.6% | 65.9% | 6.7 pp |
| WizardCoder-CL-34B | 73.2% | 64.2% | 9.0 pp |
| Phind-CodeLlama-34B | 73.8% | 68.9% | 4.9 pp |
| GPT-4 (2023) | 86.6% | 79.3% | 7.3 pp |
| Magicoder-S-DS-6.7B | 76.8% | 71.3% | 5.5 pp |
The ranking shifts matter: WizardCoder-CodeLlama edged out ChatGPT on the original HumanEval (73.2% vs 72.6%) but fell behind it on HumanEval+ (64.2% vs 65.9%), while Phind-CodeLlama held its lead under the stricter tests. The Liu et al. paper concluded that the original HumanEval's thin test coverage was masking real differences in code quality, and that any leaderboard built on it should be read with caution. As of 2025, EvalPlus continues to show meaningful gaps between models that look identical on HumanEval, with leading scores in the high 80s rather than the saturated mid-90s.
The broader EvalPlus framework extends the same augmentation methodology to other code benchmarks, including MBPP+ (an augmented version of the Mostly Basic Programming Problems dataset, which expands the MBPP-sanitized 399-problem subset by roughly 35x in tests). The EvalPlus leaderboard at evalplus.github.io provides up-to-date model rankings using the expanded test suites.
As part of the EvalPlus project, the team identified and corrected 18 defects in HumanEval's original ground-truth solutions. These defects fell into several categories:
- unhandled edge cases, where the reference solution crashes or returns the wrong answer for inputs such as empty collections or boundary values;
- incorrect logic, where the reference solution does not implement the behavior its docstring describes; and
- performance issues, where the reference solution is too slow to serve as a reliable oracle on larger generated inputs.
The corrected ground-truth solutions are included in the EvalPlus distribution and are used when computing HumanEval+ scores. The fact that 11% of the original reference solutions were buggy is a useful reminder that even hand-curated benchmarks decay over time and benefit from re-auditing.
The MultiPL-E framework (Cassano et al., 2022, arXiv:2208.08227, later published in IEEE Transactions on Software Engineering) addresses HumanEval's Python-only limitation by providing a system for translating the benchmark to other programming languages. MultiPL-E translates HumanEval and MBPP into 18 additional programming languages (with more added in later releases), creating the first massively multilingual code generation benchmark.
The supported languages span a range of paradigms and popularity levels:
| Language category | Languages |
|---|---|
| Systems languages | C, C++, Rust, Go |
| JVM languages | Java, Scala, Kotlin |
| Scripting languages | JavaScript, TypeScript, Ruby, PHP, Perl |
| Functional languages | Haskell, Racket, Elixir |
| Other | C#, Swift, Dart, R, Julia, Lua |
The translation process is semi-automated. The function signatures, docstrings, and test cases are translated to each target language, preserving the semantic intent while adapting to language-specific idioms (for example, mapping Python lists to Java arrays or Rust vectors). This allows researchers to evaluate whether a model's coding ability generalizes across languages or is narrowly concentrated in Python.
MultiPL-E revealed significant variation in model performance across languages. Models generally perform best on Python (their most-represented training language) and worst on less common languages like Racket, R, or Perl. The gap between Python and low-resource languages can be 20 to 30 percentage points for some models. The original MultiPL-E paper noted that Codex matched or even exceeded its Python performance on several other languages, which was unexpected and probably reflected the fact that Codex was trained on multiple languages from the start.
A related project, HumanEval-XL (Peng et al., 2024), goes further by establishing connections between 23 natural languages and 12 programming languages, comprising 22,080 prompts. This tests whether models can generate code from non-English natural language descriptions, an important capability for serving a global developer population. Similar lines of work include MCoNaLa, which tests code generation from natural language in Spanish, Japanese, and Russian, and CodeXGLUE, an earlier collection of code-related tasks across multiple languages.
HumanEval exists within a growing ecosystem of code generation benchmarks, each addressing different aspects of AI coding ability. Most modern coding evaluations either complement HumanEval (testing different skills) or replace it (testing the same skill more rigorously).
| Benchmark | Focus | Size | Language(s) | Key difference from HumanEval |
|---|---|---|---|---|
| HumanEval | Function synthesis from docstrings | 164 problems | Python | Original code generation benchmark |
| HumanEval+ | Same problems, more tests | 164 problems | Python | About 80x more tests per problem |
| MBPP | Basic programming problems | 974 problems | Python | Larger but easier problems, crowd-sourced |
| MBPP+ | Augmented MBPP | 399 problems (sanitized) | Python | About 35x more tests |
| MultiPL-E | Multilingual code generation | 164 problems | 18+ languages | Translates HumanEval to other languages |
| HumanEval-XL | Multilingual NL plus code | 22,080 prompts | 23 NL, 12 PL | Cross-lingual code generation |
| SWE-bench | Real GitHub issue resolution | 2,294 issues | Python | Tests end-to-end software engineering |
| SWE-bench Verified | Curated SWE-bench subset | 500 issues | Python | Quality-controlled subset |
| LiveCodeBench | Competition programming, refreshed | Ongoing (~700+) | Python | Continuously updated; contamination resistant |
| BigCodeBench | Complex function calls | 1,140 tasks | Python | Tests library usage and tool integration |
| APPS | Competition-style problems | 10,000 problems | Python | Wider range of difficulty (intro, interview, competition) |
| CodeContests | Competitive programming | About 13,500 problems | Python, C++ | Used for AlphaCode; very hard |
| CRUXEval | Code reasoning | 800 functions | Python | Predict input or output of code |
| EvoEval | Mutated HumanEval-style problems | Multiple variants | Python | Stress-tests overfitting |
Of these, SWE-bench (Jimenez et al., 2023, arXiv:2310.06770) represents the most significant departure from HumanEval's philosophy. Rather than testing whether a model can write a single function, SWE-bench tests whether an AI agent can resolve real software engineering issues in large codebases, a task that requires reading code, understanding context, and making coordinated multi-file changes. When SWE-bench launched in late 2023, the best model (Claude 2) solved only 1.96% of issues, illustrating the gap between toy function synthesis and real engineering work. By 2025, frontier agentic systems were solving more than 60% of SWE-bench Verified, but that progress traces a separate trajectory from HumanEval saturation.
LiveCodeBench (Jain et al., 2024, arXiv:2403.07974) directly attacks the contamination problem by collecting new programming-contest problems from LeetCode, AtCoder, and Codeforces continuously over time. Because problems can be filtered by release date, evaluators can isolate problems published after a model's training cutoff and get a clean read on whether the model has actually generalized. The Jain paper notes evidence of probable HumanEval overfitting: some models that score well on HumanEval do not score correspondingly well on LiveCodeBench, suggesting the HumanEval scores were inflated by exposure during training.
CodeContests was introduced by DeepMind in 2022 alongside AlphaCode, the first AI system to perform competitively in human programming contests. CodeContests is much harder than HumanEval (problems have hidden test cases that adversarially probe for shortcuts) and is targeted at competitive programming rather than everyday function writing. It is also frequently used as a fine-tuning corpus for code models.
APPS (Hendrycks et al., 2021, arXiv:2105.09938) was released around the same time as HumanEval and is the closest direct competitor in terms of "single-function from natural language" problems. APPS contains 10,000 problems split into Introductory, Interview, and Competition tiers, drawing from competitive programming sites and coding challenge platforms. APPS is harder than HumanEval but shares many of its limitations (Python-only, single-function, contamination risk).
BigCodeBench (Zhuo et al., 2024, arXiv:2406.15877) targets a different gap entirely: rather than testing isolated algorithmic problems, it asks models to compose calls to 139 real Python libraries across 7 domains. Even strong models top out around 60% on BigCodeBench, far from human performance of 97%, which suggests that library usage and tool composition remain genuinely hard.
Code generation is the broad task of producing source code from a higher-level specification, whether a natural language description (as in HumanEval) or another piece of code. HumanEval helped define the modern interpretation of this task by formalizing the input format (function signature plus docstring) and the evaluation method (execution against tests).
GSM8K is a math word-problem benchmark that played a similar role for arithmetic reasoning, also released in 2021. GSM8K and HumanEval together became standard reference points in early LLM technical reports, with MMLU covering general knowledge.
GitHub Copilot is the production code completion tool built on top of Codex (and later GPT-4 and successor models). Copilot launched in technical preview in June 2021, shortly before the HumanEval paper, and HumanEval became the public benchmark used to characterize Copilot's underlying capability.
Fine-tuning techniques like instruction tuning and reinforcement learning from human feedback (RLHF) helped push HumanEval scores from the 30s to the 90s. Code-specific fine-tuning data, often generated synthetically by stronger models (the Magicoder paper called this approach OSS-Instruct), accelerated the trend.
Beyond HumanEval itself, the pass@k metric has become a lasting contribution to AI evaluation methodology. The insight that code generation should be evaluated probabilistically, accounting for the stochastic nature of sampling, has been adopted across the field. Several aspects of this legacy are worth noting:
Best-of-n sampling. The gap between pass@1 and pass@100 highlighted the potential of best-of-n strategies, where a system generates multiple candidates and selects the best one. This approach became standard in AI coding assistants, which often generate several solutions internally and present the most promising one. It also influenced the design of newer reasoning models (o1, o3, DeepSeek R1) that internally sample many chains of thought before committing to an answer.
Self-verification and verifier rerankers. The pass@100 number was often much higher than pass@1, which implied that models were generating correct solutions but not selecting them. This motivated work on learned rerankers and self-verification, where a model (or a separate verifier) judges its own samples and picks the best one. Codex itself benefited from a mean-log-prob reranker that boosted pass@1 from about 38% to 44% on the 12B model.
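A sketch of what such a reranker can look like in its simplest form, selecting the sample whose average per-token log-probability is highest. The candidate data below is invented, and real systems typically combine this kind of scoring with execution-based filtering.

```python
def mean_logprob(token_logprobs):
    """Average per-token log-probability of one sampled completion."""
    return sum(token_logprobs) / len(token_logprobs)

def rerank(candidates):
    """Pick the completion the model itself considers most likely on average.

    `candidates` is a list of (completion_text, token_logprobs) pairs, e.g. from
    an API that exposes per-token log-probabilities for sampled outputs.
    """
    return max(candidates, key=lambda c: mean_logprob(c[1]))[0]

# Illustrative candidates: the higher-confidence completion wins under this score.
candidates = [
    ("    return string == string[::-1]\n", [-0.1, -0.2, -0.05, -0.3]),
    ("    rev = ''.join(reversed(string))\n    return rev == string\n",
     [-0.4, -0.9, -0.6, -1.1, -0.5, -0.7, -0.3]),
]
print(rerank(candidates))  # picks the first completion (mean logprob -0.16 vs about -0.64)
```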
Scaling laws for code. The relationship between pass@1 and pass@k at different temperatures provided insights into the structure of model output distributions. Researchers observed that increasing sampling temperature could improve pass@k (by increasing diversity) while hurting pass@1 (by reducing precision), leading to practical guidance for how to configure code generation systems. This trade-off shows up again in modern reasoning models, where compute budget at inference time substitutes for additional samples.
Execution-based evaluation. HumanEval helped establish the principle that generated code should be evaluated by running it, not by comparing text. Earlier approaches often used text similarity metrics (like BLEU scores), which can give high scores to code that looks similar to the reference but does not actually work. The pass@k approach, which only counts solutions that pass all tests, provides a much more meaningful measure of functional correctness. This idea spread well beyond code: SQL benchmarks, math benchmarks like GSM8K, and even tool-use evaluations now lean on execution or programmatic verification rather than reference comparison.
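The principle is easy to state in code: assemble the generated program with its tests, run it in a fresh process with a time budget, and count only a clean exit as a pass. The sketch below is a minimal, unsandboxed version of that idea (real harnesses add process isolation and resource limits); note that the buggy completion would score well on text similarity to a correct solution yet fails immediately under execution.

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(program_text: str, timeout_s: float = 3.0) -> bool:
    """Run an assembled program (prompt + completion + tests) in a fresh interpreter.

    A clean exit within the time budget counts as a pass; any assertion error,
    crash, or timeout counts as a failure.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program_text)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

# This completion looks almost identical to a correct one but has an off-by-one bug,
# so a text-similarity metric would score it highly while execution rejects it.
program = (
    "def is_palindrome(s):\n"
    "    return s == ''.join(reversed(s[:-1]))\n"
    "assert is_palindrome('racecar')\n"
)
print(run_candidate(program))  # False
```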
HumanEval has become a near-mandatory line item in LLM technical reports. OpenAI, Anthropic, Google, Meta, DeepSeek, Alibaba, Mistral, and many others all report HumanEval pass@1 scores in their model cards and announcement blog posts. The benchmark is fast and cheap to run (164 problems times 200 samples is about 33,000 completions, doable in a few hours on commodity hardware) and the result is a single number that fits neatly in a comparison table.
The benchmark is also commonly used as a gating signal during model development. Internal teams often track HumanEval throughout training as a quick proxy for code quality, alongside losses on validation sets. Because HumanEval is small, it can be evaluated at intermediate checkpoints without significant compute cost. This convenience is part of why the benchmark has remained popular even as it lost discriminative power.
For reproducible evaluation, OpenAI maintains the original HumanEval repository at https://github.com/openai/human-eval, which contains the 164 problems, a reference evaluation harness, and example utilities for sampling. The EvalPlus repository at https://github.com/evalplus/evalplus extends this with HumanEval+ and MBPP+ tests, contamination utilities, and a leaderboard. Most modern evaluation harnesses (lm-evaluation-harness, OpenCompass, Hugging Face's BigCode Evaluation Harness) support HumanEval out of the box.
The critique that HumanEval is too easy, too narrow, and too contaminated has not removed it from leaderboards. Vendors keep reporting it because customers expect it, journalists keep citing it because it has a long history, and researchers keep comparing against it because it provides a continuous time series back to 2021. The benchmark survives in the same way that BLEU and ROUGE survived after they were known to be unreliable: it is a flawed but familiar reference point.
By 2025, HumanEval had become something of a checkbox benchmark: frontier models all score above 90%, making it largely uninformative for distinguishing between them. The benchmark remains widely reported in model release announcements (partly out of tradition and partly because high HumanEval scores look good in marketing materials), but it no longer tells researchers much about relative model quality at the top of the leaderboard.
HumanEval+ offers more headroom (top scores in 2025 sit around 87 to 92% rather than 95+), and the EvalPlus leaderboard continues to be actively maintained and consulted. For serious evaluation of AI coding capabilities, however, the field has moved toward more challenging benchmarks like SWE-bench Verified, SWE-bench Pro, LiveCodeBench, BigCodeBench, and domain-specific evaluations like RepoBench and CommitPackFT for repository-level work.
Despite its limitations, HumanEval's historical importance is clear. It was the first benchmark to rigorously measure whether AI could write code, its pass@k metric became the standard for the field, and it provided a consistent measuring stick during the explosive growth of AI coding from 2021 to 2025. The pattern it established (function signature plus docstring in, working code out) defined an entire generation of AI coding tools, from GitHub Copilot through Cursor, Continue, Cody, and the Claude Code agent. The newer benchmarks that have replaced it owe much of their structure (execution-based, pass@k metric, hand-curated problems) to choices first made in the Codex paper.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. de O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., & Zaremba, W. (2021). "Evaluating Large Language Models Trained on Code." arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374
Liu, J., Xia, C. S., Wang, Y., & Zhang, L. (2023). "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation." Advances in Neural Information Processing Systems 36 (NeurIPS 2023). arXiv:2305.01210. https://arxiv.org/abs/2305.01210
Cassano, F., Gouwar, J., Nguyen, D., Nguyen, S., Phipps-Costin, L., Pinckney, D., Yee, M., Zi, Y., Anderson, C. J., Feldman, M. Q., Guha, A., Greenberg, M., & Jangda, A. (2022). "MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation." IEEE Transactions on Software Engineering. arXiv:2208.08227. https://arxiv.org/abs/2208.08227
Peng, Q., Chai, Y., & Li, G. (2024). "HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization." LREC-COLING 2024. arXiv:2402.16694. https://arxiv.org/abs/2402.16694
Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., & Sutton, C. (2021). "Program Synthesis with Large Language Models." arXiv:2108.07732. https://arxiv.org/abs/2108.07732
Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., & Steinhardt, J. (2021). "Measuring Coding Challenge Competence With APPS." NeurIPS 2021 Datasets and Benchmarks. arXiv:2105.09938. https://arxiv.org/abs/2105.09938
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024. arXiv:2310.06770. https://arxiv.org/abs/2310.06770
Jain, N., Han, K., Gu, A., Li, W.-D., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., & Stoica, I. (2024). "LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code." arXiv:2403.07974. https://arxiv.org/abs/2403.07974
Zhuo, T. Y., et al. (2024). "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions." ICLR 2025. arXiv:2406.15877. https://arxiv.org/abs/2406.15877
Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., Hubert, T., Choy, P., de Masson d'Autume, C., Babuschkin, I., Chen, X., Huang, P.-S., Welbl, J., Gowal, S., Cherepanov, A., Molloy, J., Mankowitz, D. J., Sutherland Robson, E., Kohli, P., de Freitas, N., Kavukcuoglu, K., & Vinyals, O. (2022). "Competition-Level Code Generation with AlphaCode." Science 378(6624), 1092-1097. arXiv:2203.07814. https://arxiv.org/abs/2203.07814
Yang, S., Chiang, W.-L., Zheng, L., Gonzalez, J. E., & Stoica, I. (2023). "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples." arXiv:2311.04850. https://arxiv.org/abs/2311.04850
Xia, C. S., Deng, Y., & Zhang, L. (2024). "Top Leaderboard Ranking = Top Coding Proficiency, Always? EvoEval: Evolving Coding Benchmarks via LLM." arXiv:2403.19114. https://arxiv.org/abs/2403.19114
Wei, Y., Wang, Z., Liu, J., Ding, Y., & Zhang, L. (2023). "Magicoder: Empowering Code Generation with OSS-Instruct." arXiv:2312.02120. https://arxiv.org/abs/2312.02120
OpenAI. HumanEval GitHub Repository. https://github.com/openai/human-eval
EvalPlus GitHub Repository. https://github.com/evalplus/evalplus
EvalPlus Leaderboard. https://evalplus.github.io/leaderboard.html
MultiPL-E GitHub Repository. https://github.com/nuprl/MultiPL-E
LiveCodeBench Leaderboard. https://livecodebench.github.io/leaderboard.html
Anthropic. "Introducing Claude 3.5 Sonnet." June 2024. https://www.anthropic.com/news/claude-3-5-sonnet
OpenAI. "GPT-4o System Card." August 2024. https://openai.com/index/gpt-4o-system-card/
DeepSeek-AI. "DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence." 2024. arXiv:2406.11931. https://arxiv.org/abs/2406.11931
Hui, B., Yang, J., et al. (Qwen Team). "Qwen2.5-Coder Technical Report." 2024. arXiv:2409.12186. https://arxiv.org/abs/2409.12186