| HumanEval | |
|---|---|
| Overview | |
| Full name | HumanEval: Evaluating Large Language Models Trained on Code |
| Abbreviation | HumanEval |
| Description | A benchmark for evaluating code generation capabilities of language models through 164 hand-crafted Python programming challenges |
| Release date | 2021-07-07 |
| Latest version | 1.0 |
| Benchmark updated | 2021-07 |
| Authors | Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, and 53 others |
| Organization | OpenAI |
| Technical Details | |
| Type | Code Generation, Program Synthesis |
| Modality | Text, Code |
| Task format | Function implementation from docstring |
| Number of tasks | 164 |
| Total examples | 164 programming problems |
| Evaluation metric | Pass@k (k=1, 10, 100) |
| Domains | Algorithms, Mathematics, String Manipulation, Data Structures |
| Languages | English (natural language), Python (programming) |
| Performance | |
| Human performance | ~100% (expert programmers) |
| Baseline | 0% (GPT-3, 2021) |
| SOTA score | 93.7% |
| SOTA model | Claude 3.5 Sonnet |
| SOTA date | 2024 |
| Saturated | Nearly |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |
| Successor | HumanEval+, BigCodeBench |
HumanEval is a benchmark dataset designed to evaluate the code generation capabilities of large language models (LLMs) by measuring the functional correctness of synthesized programs. Released on July 7, 2021, by OpenAI[1], HumanEval consists of 164 hand-crafted Python programming challenges that test language comprehension, algorithmic thinking, and simple mathematics. The benchmark introduced the influential pass@k metric for evaluating code generation and has become the standard evaluation tool for measuring programming capabilities in AI systems, witnessing dramatic improvements from 0% (GPT-3) to over 90% (current models) in just three years.
HumanEval addresses a critical gap in evaluating artificial intelligence systems by focusing on functional correctness rather than text similarity when assessing generated code. Each problem in the benchmark consists of a function signature and a docstring describing the desired behavior, requiring models to synthesize a complete implementation that passes multiple unit tests. This approach ensures that models must truly understand the programming task rather than merely pattern-matching similar code from training data[1].
The benchmark's problems are comparable to simple software interview questions and cover fundamental programming concepts including string manipulation, basic algorithms, simple mathematics, and data structure operations. With an average of 7.7 unit tests per problem, HumanEval provides robust verification of functional correctness while remaining computationally efficient to evaluate.
HumanEval has fundamentally shaped the field of AI code generation, chiefly by making functional correctness, rather than text similarity, the standard measure of generated code.
Each of HumanEval's 164 problems contains five essential components:
| Component | Description | Example |
|---|---|---|
| **Task ID** | Unique identifier | "HumanEval/0" |
| **Prompt** | Function signature with docstring | `def has_close_elements(numbers, threshold):` |
| **Canonical Solution** | Reference implementation | Working Python code |
| **Test Cases** | Unit tests for verification | `assert function(input) == expected` |
| **Entry Point** | Function name to call | "has_close_elements" |
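Concretely, these components map onto a problem such as "HumanEval/0" roughly as follows. This is an illustrative sketch assembled from the fields above, not the verbatim benchmark entry; the docstring and tests are abbreviated:

```python
from typing import List

# Prompt: function signature plus a docstring describing the desired behavior.
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each other
    than the given threshold."""
    # Canonical-solution-style reference implementation (illustrative).
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False

# Test cases: the benchmark wraps assertions in a check(candidate) function.
def check(candidate):
    assert candidate([1.0, 2.0, 3.0], 0.5) == False
    assert candidate([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) == True

check(has_close_elements)  # entry point: "has_close_elements"
```

A model is shown only the prompt and must produce the function body; the hidden `check` function then verifies functional correctness.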
The benchmark covers diverse programming challenges[2]:
| Category | Approximate Count | Example Tasks |
|---|---|---|
| **String Manipulation** | ~40 | Palindrome checking, string parsing, pattern matching |
| **Mathematical Operations** | ~35 | Prime numbers, factorials, numerical computations |
| **List/Array Operations** | ~45 | Sorting, filtering, element manipulation |
| **Algorithmic Challenges** | ~30 | Dynamic programming, recursion, optimization |
| **Data Structure Tasks** | ~14 | Tree operations, dictionary manipulation |
HumanEval uses JSON Lines format with the following structure:
```json
{
  "task_id": "HumanEval/13",
  "prompt": "def greatest_common_divisor(a: int, b: int) -> int:\n    \"\"\"Return a greatest common divisor of two integers a and b\n    >>> greatest_common_divisor(3, 5)\n    1\n    >>> greatest_common_divisor(25, 15)\n    5\n    \"\"\"\n",
  "canonical_solution": "    while b:\n        a, b = b, a % b\n    return a\n",
  "test": "def check(candidate):\n    assert candidate(3, 7) == 1\n    assert candidate(10, 15) == 5\n    assert candidate(49, 14) == 7\n    assert candidate(144, 60) == 12\n",
  "entry_point": "greatest_common_divisor"
}
```
HumanEval introduced the pass@k metric, which has become the standard for evaluating code generation[1]:
| Metric | Definition | Interpretation |
|---|---|---|
| **pass@1** | Probability that a single generated solution passes all tests | Direct success rate |
| **pass@10** | Probability that at least one of 10 attempts succeeds | Success with multiple tries |
| **pass@100** | Probability that at least one of 100 attempts succeeds | Upper bound performance |
The metric is calculated using the formula:

```
pass@k := E[1 - C(n-c, k) / C(n, k)]
```

where n is the total number of samples generated per problem, c is the number of correct samples, and C is the binomial coefficient.
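This estimator can be computed exactly with binomial coefficients. The sketch below implements the formula for a single problem; the benchmark score averages this quantity over all 164 problems:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for one problem:
    1 - C(n-c, k) / C(n, k), with n samples, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every
        # size-k draw must contain a correct sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)
```

`math.comb` (Python 3.8+) is exact for the sample counts used here; the reference implementation instead uses a numerically stable product form, which gives the same result.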
The evaluation pipeline consists of:
1. **Code Generation**: Model generates Python code from the prompt
2. **Extraction**: Solution code is extracted from the model output
3. **Execution**: Code is run in a sandboxed environment
4. **Testing**: Unit tests verify functional correctness
5. **Scoring**: Pass rates are calculated across all problems
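The core of the execution and testing steps can be sketched as follows. Note that this minimal version performs no sandboxing and executes untrusted code directly, so it is for illustration only:

```python
def evaluate_solution(prompt: str, completion: str,
                      test: str, entry_point: str) -> bool:
    """Exec a candidate completion and run the benchmark's
    check() function against it. WARNING: executes untrusted
    code with no isolation; illustration only."""
    namespace = {}
    try:
        exec(prompt + completion, namespace)  # define the candidate function
        exec(test, namespace)                 # define check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False
```

Applied to a record like the `greatest_common_divisor` example above, a correct completion returns `True` and a buggy one `False`.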
HumanEval evaluation requires executing untrusted, model-generated code, so evaluation harnesses must run candidate solutions in isolated, resource-limited sandbox environments with enforced timeouts[2].
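A minimal process-level sandbox can be approximated with a subprocess and a wall-clock timeout. This is a sketch only; production harnesses additionally restrict memory, filesystem, and network access:

```python
import subprocess
import sys

def run_isolated(code: str, timeout: float = 5.0) -> bool:
    """Run untrusted Python code in a separate interpreter
    process, killing it if it exceeds the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # a hung candidate counts as a failure
```

Running each candidate in its own process means an infinite loop or crash in generated code cannot take down the evaluation harness itself.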
| Year | Model | pass@1 | pass@10 | pass@100 | Key Innovation |
|---|---|---|---|---|---|
| 2021 | GPT-3 | 0.0% | 0.0% | 0.0% | Baseline large language model |
| 2021 | GPT-J 6B | 11.4% | 27.7% | - | Open-source alternative |
| 2021 | Codex 12B | 28.8% | 46.8% | 72.3% | Code-specific training |
| 2021 | Codex 300M | 13.2% | 20.4% | 36.3% | Smaller code model |
| 2022 | AlphaCode | 33.5% | ~50% | - | Competition-level training |
| 2022 | CodeGen 16B | 29.3% | 49.9% | 75.0% | Multi-turn synthesis |
| 2023 | GPT-4 | 67.0% | 87.0% | - | General capability improvement |
| 2023 | Claude 2 | 71.2% | - | - | Constitutional AI approach |
| 2024 | Claude 3 Opus | 84.9% | - | - | Multimodal capabilities |
| 2024 | GPT-4o | 90.2% | - | - | Optimized architecture |
| 2024 | DeepSeek-Coder-V2 | 90.2% | - | - | Specialized code model |
| 2024 | Claude 3.5 Sonnet | 93.7% | - | - | Current SOTA |
Analysis of performance trends reveals several important patterns[3]:
| Observation | Implication |
|---|---|
| Exponential improvement 2021-2023 | Rapid advancement in code understanding |
| Plateauing above 90% | Approaching benchmark saturation |
| Large model advantage | Scale correlates strongly with performance |
| Code-specific training helps | Specialized models outperform general ones initially |
| General models catching up | Recent general models match specialized ones |
Released in 2023, HumanEval+ addresses test insufficiency[4]:
| Aspect | Original HumanEval | HumanEval+ |
|---|---|---|
| **Test Coverage** | 7.7 tests/problem | ~600+ tests/problem (80x increase) |
| **Test Generation** | Manual | Automated + Manual |
| **Error Detection** | Basic | Comprehensive edge cases |
| **Score Impact** | Baseline | 15-20% score reduction |
The success of HumanEval inspired numerous multilingual variants:
| Extension | Languages | Problems | Method |
|---|---|---|---|
| **HumanEval-X** | 5 (Python, C++, Java, JavaScript, Go) | 820 | Hand-translated |
| **MultiPL-E** | 18 | 164 per language | Automated translation |
| **HumanEval-XL** | 23 natural × 12 programming | 22,080 | Cross-lingual generation |
| **MBXP** | 10+ | 974+ | Extended multilingual |
| Variant | Focus | Key Features |
|---|---|---|
| **HumanEval-V** | Visual reasoning | Code generation from diagrams |
| **DS-1000** | Data science | Pandas, NumPy, scikit-learn tasks |
| **BigCodeBench** | Real-world complexity | 1,140 challenging problems |
| **SWE-bench** | Software engineering | Real GitHub issues |
HumanEval has significantly influenced AI research, establishing pass@k as the standard evaluation framework and serving as a common baseline reported for nearly every subsequent code-capable model.
The benchmark has enabled practical applications:
| Application | Description | Examples |
|---|---|---|
| **AI Coding Assistants** | IDE integrations for code completion | GitHub Copilot, Cursor, Replit |
| **Code Review Tools** | Automated code analysis and suggestions | CodeRabbit, DeepCode |
| **Educational Platforms** | Programming tutors and homework help | Khan Academy AI, Codecademy AI |
| **Developer Tools** | API generation and documentation | Mintlify, Stenography |
Despite its influence, HumanEval has several acknowledged limitations[5]:
| Limitation | Description | Impact |
|---|---|---|
| **Limited Complexity** | Simple interview-level problems | Doesn't test real-world programming |
| **Python Only** | Single language focus | Misses cross-language challenges |
| **Small Dataset** | Only 164 problems | Statistical significance concerns |
| **Test Coverage** | Average 7.7 tests per problem | May miss edge cases |
| **No Context** | Isolated functions | Doesn't test integration skills |
| **Saturation** | Top models exceed 90% | Limited differentiation ability |
Several developments are shaping the future of code generation evaluation:
1. **Complexity Scaling**: BigCodeBench and similar benchmarks with harder problems
2. **Repository-Level Tasks**: SWE-bench for real software engineering
3. **Interactive Evaluation**: Multi-turn code generation and debugging
4. **Execution-Based Metrics**: Beyond pass/fail to efficiency and style
5. **Contamination Detection**: Methods to identify training data overlap
These research directions aim to preserve HumanEval's execution-based rigor while testing the real-world software engineering skills that its 164 isolated problems cannot.
HumanEval has fundamentally shaped the landscape of AI code generation evaluation. By introducing functional correctness as the primary metric and establishing the pass@k evaluation framework, it created a standardized, reproducible method for measuring programming capabilities in language models. The benchmark's simplicity and clarity have made it the de facto standard, enabling direct comparison across models and tracking the remarkable progress from 0% to over 90% accuracy in just three years.
While the benchmark approaches saturation with current models achieving near-human performance, HumanEval's influence extends beyond its specific problems. It established principles and methodologies that continue to guide the development of more challenging benchmarks and real-world evaluation frameworks. As AI systems increasingly assist in software development, HumanEval remains a crucial milestone in the journey toward artificial general intelligence in programming.