BigCodeBench

Updated Jun 24, 2026

BigCodeBench is a Python code generation benchmark of 1,140 function-level programming tasks that require composing 723 distinct function calls from 139 libraries across seven domains, designed to test whether large language models can write realistic software rather than short algorithmic puzzles. Introduced in June 2024 by Terry Yue Zhuo and collaborators as part of the BigCode Project (an open scientific collaboration jointly run by Hugging Face and ServiceNow Research), it was built as a harder, saturation-resistant successor to HumanEval and MBPP. Its headline finding is that the best model evaluated (GPT-4o) reaches a calibrated pass@1 of just 61.1 percent on BigCodeBench-Complete and 51.1 percent on BigCodeBench-Instruct, far below the 97 percent scored by human experts. The paper, "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions" (arXiv:2406.15877), was accepted at the International Conference on Learning Representations (ICLR) 2025.^[1]^[2]^[3]

The paper frames the core gap directly: "our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%."^[1] Unlike earlier function-level benchmarks that focus on short, algorithmic puzzles solvable with a single standard-library call, BigCodeBench presents problems that mirror typical software-engineering work: data analysis pipelines, networking tasks, cryptography routines, web scrapers, visualization code, and similar work that requires the model to chain multiple third-party libraries. Each problem ships with a unittest-based test harness that uses unittest.mock to simulate external resources, with an average of 5.6 test cases per task and an average branch coverage of 99 percent. The benchmark is released under the Apache 2.0 license and is distributed through the BigCode organization on Hugging Face, with a public leaderboard hosted at huggingface.co/spaces/bigcode/bigcodebench-leaderboard and a mirror at bigcode-bench.github.io.^[1]^[4]^[5]

What problem was BigCodeBench built to solve?

By 2024, the dominant code generation benchmark in the literature was OpenAI's HumanEval, introduced in the 2021 Codex paper, which contains 164 short Python tasks defined by a docstring and a small set of assertion-based tests. A closely related benchmark, Google's MBPP (Mostly Basic Python Problems), contains 974 short crowdsourced exercises. While both were useful for early Codex-era models, by 2024 frontier models such as GPT-4, Claude 3 Opus, and DeepSeek-Coder-V2 had reported HumanEval pass@1 scores above 90 percent, producing what researchers described as benchmark saturation. The EvalPlus project addressed part of this concern by enlarging the test sets of HumanEval and MBPP, releasing HumanEval+ and MBPP+, but the underlying task distribution remained algorithmic and standard-library-bound. Additional limitations included potential data contamination, since HumanEval and MBPP have been part of training corpora for many models, and a lack of representativeness with respect to real-world programming, which routinely depends on third-party libraries such as pandas, numpy, requests, matplotlib, scikit-learn, and flask. As the launch blog put it, "the main concern is that tasks in HumanEval are too simple and may not be representative of real-world programming tasks."^[1]^[2]^[6]

Several alternative benchmarks emerged in parallel to address these gaps. APPS contained 10,000 competitive-programming problems but remained algorithmic in character. DS-1000 focused on 1,000 data-science problems crawled from Stack Overflow but did not require integration across multiple libraries. APIBench tested isolated API calls. Repository-scale benchmarks such as SWE-bench from Princeton and SWE-bench Verified addressed real GitHub issues but at a much higher level of complexity, where most frontier models in 2024 scored below 30 percent. BigCodeBench was designed to occupy the under-served middle ground between micro-benchmarks like HumanEval and repository-level benchmarks like SWE-bench and SWE-bench Pro, providing function-scale tasks that are nonetheless realistic enough to exercise library composition and complex instructions.^[1]^[2]^[6]

When was BigCodeBench released?

The first preprint of BigCodeBench appeared on arXiv on 22 June 2024 (arXiv:2406.15877), and a coordinated launch blog post titled BigCodeBench: The Next Generation of HumanEval was published on the Hugging Face Blog the same week. The initial public package release was version 0.1.5 on 18 June 2024. The BigCodeBench-Hard subset, containing 148 difficult tasks selected by post-hoc analysis, was announced on 24 July 2024 in version 0.1.8 and accompanied by a follow-up blog post on the Hugging Face platform. Subsequent versions (0.1.9 in August 2024, 0.2.0 in October 2024, 0.2.2 in January 2025, and 0.2.4 in March 2025) refined the evaluation harness, expanded the model coverage of the leaderboard from roughly 60 to over 160 models, and standardized the evaluator's sandboxing options for local, Docker, E2B, and Gradio backends. The accepted ICLR 2025 version (v4 on arXiv, 1 April 2025) became the canonical reference.^[1]^[2]^[4]^[7]

Version	Date	Notable change
0.1.5 (initial package)	18 June 2024	First public release of the harness
arXiv:2406.15877 v1	22 June 2024	First preprint; launch blog
0.1.8	24 July 2024	BigCodeBench-Hard (148 tasks) announced
0.1.9	August 2024	Harness refinements
0.2.0	October 2024	Sandboxing backends standardized
0.2.2	January 2025	Leaderboard reaches 163 models
0.2.4	March 2025	Evaluator updates
arXiv:2406.15877 v4	1 April 2025	Canonical ICLR 2025 version

How were the tasks constructed?

The construction methodology for BigCodeBench is a three-stage human-LLM collaboration documented in section 3 of the paper. The first stage, Data Synthesis, begins with a seed of programming intents from ODEX, a dataset of one-liner Python solutions tied to natural-language queries from Stack Overflow. The seed intents were expanded by GPT-4 (via the OpenAI Code Interpreter sandbox) into function-level tasks that include a docstring, a candidate implementation, and a test harness. The second stage, Semi-automatic Program Refactoring, involved 13 human annotators working iteratively with GPT-4 in the sandbox to refine the tasks: adding error handling, expanding the test cases, ensuring the code actually executed, and validating that the docstring matched the implementation. Most annotators had five or more years of Python experience. The third stage, Human Curation, applied additional pre-evaluation with GPT-3.5-Turbo to flag flaky tasks, then cross-checked the remaining tasks with 7 additional human annotators. Finally, 11 expert human annotators solved a random sample of tasks to establish a human-performance baseline of 97 percent pass rate.^[1]

The 723 function calls in the benchmark span 77 standard-library modules and 62 external libraries, organized into seven domains: Computation (the largest, with about 63 percent of tasks invoking at least one computation library such as numpy, pandas, or scipy), Data Analysis, General-purpose, Networking, Visualization, Cryptography, and Web Development. The seven-domain taxonomy mirrors the structure of common Python application areas and helps surface failures concentrated in less common domains. Each task ends with an executable test harness that imports unittest and unittest.mock so that external resources such as HTTP endpoints, files, and sockets are deterministically patched during evaluation.^[1]
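The patching pattern described above can be illustrated with a minimal sketch. The task, the function name `task_func`, the URL, and the test are all hypothetical and are not taken from the dataset; the standard-library `urllib.request` stands in for a third-party HTTP client so that the example runs without extra packages.

```python
import json
import unittest
import urllib.request
from unittest import mock


def task_func(url):
    """Hypothetical task: fetch a JSON object from `url` and return its sorted keys."""
    with urllib.request.urlopen(url) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return sorted(payload)


class TestCases(unittest.TestCase):
    @mock.patch("urllib.request.urlopen")
    def test_sorted_keys(self, mock_urlopen):
        # Replace the network call with a canned response so the test is deterministic.
        fake_response = mock.MagicMock()
        fake_response.read.return_value = b'{"b": 1, "a": 2}'
        mock_urlopen.return_value.__enter__.return_value = fake_response

        self.assertEqual(task_func("https://example.invalid/data"), ["a", "b"])
        # The function must have requested exactly the URL it was given.
        mock_urlopen.assert_called_once_with("https://example.invalid/data")
```

Saved to a file, a harness of this shape can be run with `python -m unittest`; no real HTTP request is made because `urlopen` is patched for the duration of the test.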

Solutions in the benchmark have a mean length of 10.0 lines of code, compared with 6.8 for HumanEval and 5.1 for DS-1000, and a mean cyclomatic complexity of 3.1, in the same range as HumanEval (3.6) but applied to substantially longer code. The mean character count for canonical solutions is 1,112.5, more than double HumanEval's 450.6.^[1]

What is in each task? (Dataset structure)

BigCodeBench contains 1,140 tasks. Each task in the dataset includes the following fields: a unique task_id; a complete_prompt containing the full docstring used in the Complete variant (ranging in length from about 460 to 3,470 characters); an instruct_prompt containing the natural-language instruction used in the Instruct variant (about 270 to 1,830 characters); a canonical_solution (433 to 1,700 characters); a code_prompt providing the function signature; a test field with the unittest harness (940 to 7,740 characters); an entry_point naming the function to implement; a doc_struct field encoding the structured documentation; a libs field listing the required libraries; and additional metadata used by the leaderboard. The full dataset is approximately 2.76 MB in Parquet format, distributed across multiple versioned splits (v0.1.0 through v0.1.4) on Hugging Face.^[4]^[5]
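As a rough illustration of the record layout, the sketch below builds a toy record with the field names listed above and checks it for completeness. The record contents are invented and far shorter than real entries, the value types are guesses, and the assumption that `canonical_solution` holds only the function body (so that it can be appended to `code_prompt`) should be checked against the dataset card.

```python
REQUIRED_FIELDS = (
    "task_id", "complete_prompt", "instruct_prompt", "canonical_solution",
    "code_prompt", "test", "entry_point", "doc_struct", "libs",
)

# Hypothetical, heavily shortened record; not an actual benchmark task.
example_record = {
    "task_id": "Example/0",
    "complete_prompt": 'import random\n\ndef task_func(n):\n    """Return n random floats."""\n',
    "instruct_prompt": "Write a function that returns n random floats.",
    "canonical_solution": "    return [random.random() for _ in range(n)]\n",
    "code_prompt": "import random\n\ndef task_func(n):\n",
    "test": "import unittest\n",
    "entry_point": "task_func",
    "doc_struct": "{}",
    "libs": "['random']",
}


def missing_fields(record):
    """Return the required field names that are absent from `record`."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def build_reference_program(record):
    """Join signature and body into one source string (assumes a body-only solution)."""
    return record["code_prompt"] + record["canonical_solution"]


print(missing_fields(example_record))
print(build_reference_program(example_record))
```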

The average task uses 2 to 3 libraries, with a substantial tail of tasks that use four or more. Common libraries appearing across tasks include pandas, numpy, scikit-learn (with submodules such as KMeans, PCA, RandomForestClassifier), matplotlib, seaborn, requests, BeautifulSoup, flask, wtforms, statsmodels, scipy, re, subprocess, os, glob, ftplib, socket, and wordcloud. This library distribution is deliberately closer to a typical applied-Python workload than the algorithm-and-string-manipulation distribution of HumanEval.^[4]^[5]

What are the Complete and Instruct variants?

BigCodeBench is published in two variants of every task, distinguished by how the problem is communicated to the model.

The BigCodeBench-Complete variant presents the model with a long, structured docstring including parameter descriptions, return-value specifications, error semantics, and interactive examples. The model's job is to complete the function body. This variant is intended to test code-completion ability and provides the highest signal for base models that respond well to docstring-style prompts. Because the prompt is verbose, instruction-tuned models occasionally exhibit "model laziness," in which they omit required import statements or constant definitions on the assumption that the surrounding code already provides them. BigCodeBench addresses this with an optional --calibrate post-processing step that re-inserts missing imports and globals before evaluation.^[1]^[2]
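The idea behind calibration can be sketched as follows. This is not the harness's actual implementation: `reinsert_header` is a hypothetical helper that works line by line and assumes all imports and constants appear before the first `def` in the prompt.

```python
def reinsert_header(code_prompt, completion):
    """Prepend import and constant lines from the prompt that the completion dropped."""
    header = []
    for line in code_prompt.splitlines():
        if line.startswith("def "):
            break  # everything before the signature is treated as header material
        if line.strip():
            header.append(line)
    present = {line.strip() for line in completion.splitlines()}
    missing = [line for line in header if line.strip() not in present]
    if not missing:
        return completion
    return "\n".join(missing) + "\n\n" + completion


# A "lazy" completion that relies on an import and a constant it never defines.
prompt = "import math\nRADIUS = 2\n\ndef task_func():\n"
lazy = "def task_func():\n    return math.pi * RADIUS ** 2\n"
print(reinsert_header(prompt, lazy))
```

Without the repair, executing the lazy completion would fail with a NameError even though the function body is otherwise correct, which is the kind of cosmetic failure calibration is meant to factor out.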

The BigCodeBench-Instruct variant is built from the same 1,140 tasks but rewrites the docstring as a short, conversational natural-language instruction with only essential information. The paper describes it as "a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information." It is intended for instruction-tuned and chat models, and tests whether they can translate a user-style request into working code. Performance on Instruct is generally substantially lower than performance on Complete, because the model must infer many specifications that are explicit in the docstring. The Instruct variant is sometimes labeled "Vibe Check" on the leaderboard to emphasize that it stresses both code generation and instruction following.^[1]^[2]

What is BigCodeBench-Hard?

BigCodeBench-Hard is a subset of 148 tasks introduced on 24 July 2024 to give a tighter, more user-centric evaluation aligned with real-world Stack Overflow queries. The subset was constructed in two steps. First, the authors retrieved BigCodeBench tasks similar to questions in a 10.4-million-entry anonymized Stack Overflow archive, using the all-mpnet-base-v2 sentence embedding model and a normalized-embedding similarity threshold above 0.7. After deduplication this yielded 6,895 query-task pairs covering 626 BigCodeBench tasks. Three difficulty filters were then applied: the task must require more than two libraries (compared with a minimum of two for the full benchmark); the canonical solution must exceed 426 tokens (the average solution length); and the task must have a solve rate below 50 percent across the models already evaluated. The intersection of these criteria produced the final 148-task subset.^[7]
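The selection logic can be approximated in a few lines. In this sketch the `Task` fields, the toy two-dimensional embeddings, and the function names are hypothetical; the real pipeline used all-mpnet-base-v2 sentence embeddings, whereas plain tuples stand in for them here. The thresholds (0.7 similarity, 426 tokens, 50 percent solve rate, more than two libraries) are the ones stated above.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    libs: tuple
    solution_tokens: int
    solve_rate: float   # fraction of evaluated models that solved the task
    embedding: tuple    # stand-in for a sentence embedding of the task text


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_hard(tasks, query_embeddings, sim_threshold=0.7,
                min_tokens=426, max_solve_rate=0.5):
    """Keep tasks that resemble at least one query and pass all three difficulty filters."""
    hard = []
    for task in tasks:
        matches_query = any(
            cosine(task.embedding, q) > sim_threshold for q in query_embeddings
        )
        if (matches_query
                and len(task.libs) > 2
                and task.solution_tokens > min_tokens
                and task.solve_rate < max_solve_rate):
            hard.append(task.task_id)
    return hard


tasks = [
    Task("t1", ("pandas", "numpy", "matplotlib"), 500, 0.2, (1.0, 0.0)),
    Task("t2", ("os", "re"), 600, 0.1, (1.0, 0.0)),                      # too few libraries
    Task("t3", ("pandas", "numpy", "scipy"), 500, 0.2, (0.0, 1.0)),      # unlike any query
    Task("t4", ("pandas", "numpy", "scipy"), 500, 0.9, (1.0, 0.0)),      # solved too often
]
print(select_hard(tasks, query_embeddings=[(0.9, 0.1)]))
```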

Performance on BigCodeBench-Hard is substantially lower than on the full set and is also more discriminating, since easy and saturated tasks have been removed. The authors note that GPT-4o and GPT-4, in particular, perform comparatively worse on the Hard subset than on the full set, consistent with the hypothesis that the full benchmark contains easier tasks that some closed models may have memorized or that the Hard subset is more representative of user-facing programming. The Hard rankings were independently validated against Scale AI's SEAL-Coding leaderboard for Python, where the top four closed models were GPT-4-Turbo Preview, Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro, in that order. Both the Complete and Instruct variants exist for the Hard subset, giving four distinct sub-benchmarks: Complete-Full, Complete-Hard, Instruct-Full, and Instruct-Hard.^[7]

How is BigCodeBench scored?

BigCodeBench is scored with pass@1 using greedy decoding, following the convention established by HumanEval and MBPP. A task counts as solved if and only if all unittest test cases pass when the model's generated function is executed inside the harness. Pass@5 sampling is also supported for stochastic generation but is reported less often on the leaderboard.^[1]^[2]
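A minimal sketch of both scoring modes follows. `greedy_pass_at_1` applies the all-tests-must-pass rule to one greedy sample per task; `pass_at_k` is the unbiased estimator introduced with HumanEval in the Codex paper, and the assumption that the BigCodeBench harness computes sampled pass@k exactly this way is not confirmed by the text above. Function names are illustrative.

```python
from math import comb


def greedy_pass_at_1(task_results):
    """task_results holds one list of per-test booleans per task (one greedy sample each)."""
    solved = sum(1 for tests in task_results if tests and all(tests))
    return solved / len(task_results)


def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples of which c are correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Three tasks; the second fails one of its tests and therefore does not count.
print(greedy_pass_at_1([[True, True], [True, False], [True]]))
print(pass_at_k(n=10, c=3, k=5))
```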

The official ranking metric is calibrated pass@1, in which a post-processing step ("sanitization") repairs the model's output before evaluation by extracting the function body from chat-style responses and re-inserting any import statements or global constants that were specified in the prompt but elided by the model. Calibration is intended to factor out cosmetic failures unrelated to algorithmic ability. The leaderboard reports both calibrated and raw pass@1 numbers. A complementary Elo-style ranking, inspired by the LMSys Chatbot Arena, is computed by treating each task as a one-on-one match between two models and fitting Bradley-Terry parameters via maximum likelihood with 500 bootstrap iterations starting from an initial Elo rating of 1000.^[1]^[2]
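The Elo-style ranking can be sketched as a Bradley-Terry fit over per-task outcomes. This is an illustration under several assumptions rather than the leaderboard's code: a model "wins" a task when it passes and its opponent fails, ties are ignored, the fit uses a standard minorization-maximization update, ratings are centred on 1000 with the usual 400-point logarithmic scale, and the bootstrap resampling mentioned above is omitted.

```python
import math
from itertools import permutations


def count_wins(results):
    """results maps model name -> list of per-task pass booleans (same task order)."""
    wins = {}
    for a, b in permutations(results, 2):
        # a beats b on a task when a passes and b fails; ties contribute nothing.
        wins[(a, b)] = sum(x and not y for x, y in zip(results[a], results[b]))
    return wins


def bradley_terry_elo(results, iters=500, base=1000.0, scale=400.0):
    models = list(results)
    wins = count_wins(results)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total = sum(wins[(i, j)] for j in models if j != i)
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (strength[i] + strength[j])
                for j in models if j != i
            )
            # Floor the strength so a winless model does not produce log(0).
            updated[i] = max(total / denom, 1e-9) if denom else strength[i]
        # Fix the geometric mean at 1 so the ratings stay centred on `base`.
        log_mean = sum(math.log(s) for s in updated.values()) / len(updated)
        strength = {m: s / math.exp(log_mean) for m, s in updated.items()}
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


toy = {
    "a": [True, True, True, False],
    "b": [True, False, False, True],
    "c": [False, True, False, False],
}
print(bradley_terry_elo(toy))
```

A bootstrap version would resample the tasks with replacement, refit the ratings each time, and report the spread across refits.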

Evaluation can be run locally, but the recommended path is a Docker image (bigcodebench/bigcodebench-evaluate) that sandboxes the test harness. The benchmark also supports remote execution via the E2B sandbox service and a hosted Gradio API on Hugging Face. The generation step is decoupled from the evaluation step and supports vLLM, Hugging Face Transformers, OpenAI, Anthropic, Google, and Mistral backends.^[2]^[4]

The benchmark is intentionally challenging: as of the ICLR 2025 release, 149 tasks remained completely unsolved by any of the 60 models evaluated in the paper for the Complete variant and 278 tasks remained unsolved for the Instruct variant, while only 6 and 14 tasks respectively had been solved by every model. The headline number from the paper is that the best model at the time of writing reached calibrated pass@1 of approximately 60 percent, compared with a human baseline of 97 percent.^[1]
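Counts of this kind fall out of a per-model, per-task results matrix. The sketch below uses a hypothetical two-model, four-task matrix; the function name and data are illustrative only.

```python
def solved_by_none_and_all(results):
    """results maps model name -> list of per-task pass booleans (same task order)."""
    per_task = list(zip(*results.values()))  # one tuple of model outcomes per task
    unsolved = sum(1 for outcomes in per_task if not any(outcomes))
    universal = sum(1 for outcomes in per_task if all(outcomes))
    return unsolved, universal


toy = {
    "m1": [True, False, True, False],
    "m2": [True, False, False, True],
}
print(solved_by_none_and_all(toy))
```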

How well do models score on the leaderboard?

The public leaderboard at huggingface.co/spaces/bigcode/bigcodebench-leaderboard is updated as new models become available and as community members submit results. As of version 0.2.2 in January 2025 it covered 163 models across the four sub-benchmarks (Complete-Full, Complete-Hard, Instruct-Full, Instruct-Hard).^[2]^[4]

The headline scores reported in the original paper and the launch blog were dominated by GPT-4o, which reached 61.1 percent calibrated pass@1 on BigCodeBench-Complete and 51.1 percent on BigCodeBench-Instruct. DeepSeek-Coder-V2-Instruct followed as the strongest open model in the original paper's evaluation. The authors noted a substantial gap between closed and open models at the time of release. They also documented that DeepSeek-V2-Chat improved by approximately 109 percent on the Complete split (from 15.5 to 32.4) after a June 2024 model update, and that Phi-3-Mini rose from 13.8 to 24.3 after a similar update, illustrating the benchmark's sensitivity to model iteration.^[1]^[2]^[7]
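The 109 percent figure is a relative improvement, not a difference in percentage points. A short check, using only the DeepSeek-V2-Chat scores quoted above:

```python
def relative_improvement(before, after):
    """Percentage change from `before` to `after`, relative to `before`."""
    return (after - before) / before * 100.0


# DeepSeek-V2-Chat on the Complete split: 15.5 before the update, 32.4 after.
print(round(relative_improvement(15.5, 32.4), 1))
```

The absolute gain is 16.9 points, which is about 109 percent of the earlier score of 15.5.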

On the BigCodeBench-Hard Instruct subset, OpenAI's o1-preview reached a calibrated pass@1 of 26.84 percent, narrowly above GPT-4 at 26.35 percent, GPT-4o and o1-mini both at 25 percent, and Claude 3.5 Sonnet at 24.32 percent. These numbers are notably lower than the full-set scores and emphasize how aggressively the Hard subset filters out tasks that frontier closed models had already mastered. The BigCodeBench team also corroborated its mid-2024 ordering of closed models against Scale AI's SEAL-Coding ranking for Python, where the top four were GPT-4-Turbo Preview, Claude 3.5 Sonnet, GPT-4o, and Gemini 1.5 Pro, in that order.^[7]^[8]

According to the Qwen team's self-reported results, Qwen2.5-Coder-32B-Instruct reached approximately 27.0 percent pass@1 on BigCodeBench-Hard, placing it competitively with the closed frontier on this subset. Subsequent evaluations from the BigCode team and from research groups using the harness have added later models such as Claude Sonnet 4, DeepSeek-V3, Qwen3-Coder, Llama 3.3 70B, and Mistral 3.2 24B to the BigCodeBench-Hard leaderboard.^[4]^[9]

How does BigCodeBench differ from HumanEval, MBPP, and SWE-bench?

BigCodeBench is most directly compared with HumanEval, MBPP, and DS-1000 as a function-level Python benchmark. Quantitatively, BigCodeBench is roughly seven times the size of HumanEval (1,140 versus 164 tasks) and about 17 percent larger than MBPP (1,140 versus 974). Its canonical solutions are about 2.5 times as long as HumanEval's in characters (1,112.5 versus 450.6) and 47 percent longer in lines of code (10.0 versus 6.8). Test cases are also denser: 5.6 cases per task on average with 99 percent branch coverage, compared with fewer cases and lower coverage in HumanEval. And by design, BigCodeBench requires multi-library composition where HumanEval and MBPP overwhelmingly rely on standard-library primitives. The launch blog summarizes the contrast: "compared to the algorithm-oriented tasks in HumanEval, real-world software development often involves diverse libraries and function calls."^[1]^[2]

Benchmark	Tasks	Mean solution (LOC)	Multi-library	Frontier ceiling (2024-25)
HumanEval	164	6.8	No	>90% (saturated)
MBPP	974	short	No	>85% (saturated)
DS-1000	1,000	5.1	Limited	mid-range
BigCodeBench-Complete	1,140	10.0	Yes	~61% (GPT-4o)
BigCodeBench-Hard	148	longer	Yes (>2 libs)	~27%

Compared with LiveCodeBench, which uses recently published competitive-programming problems to control for training-set contamination, BigCodeBench addresses contamination by using novel synthesized prompts with library-call structure that is unlikely to appear verbatim in pre-training data, but it does not refresh its task pool monthly the way LiveCodeBench does. The two benchmarks therefore measure different things: LiveCodeBench stresses competitive-programming reasoning under freshness constraints, while BigCodeBench stresses realistic library use and instruction following in a fixed but large task pool.^[1]

Compared with SWE-bench and SWE-bench Pro, BigCodeBench operates at a much smaller granularity. SWE-bench requires the model to localize and patch issues across a full repository, with hundreds of files and thousands of lines of context, while BigCodeBench asks the model to implement a single function with a docstring. Pass@1 scores on the two benchmarks are not directly comparable but they are complementary: a strong SWE-bench score implies repository-scale reasoning, while a strong BigCodeBench score implies precise library use and instruction following. Many model evaluation suites in 2024 and 2025 report all three benchmarks (LiveCodeBench, SWE-bench, and BigCodeBench).^[1]

BigCodeBench is also frequently compared with EvalPlus's HumanEval+ and MBPP+, which extend the test suites of the originals without expanding the task distribution. Both HumanEval+ and MBPP+ reach saturation at the frontier, with strong models scoring above 85 percent on HumanEval+, while the best BigCodeBench scores remain around 61 percent on the Complete variant and below 30 percent on the Hard variants, indicating that the benchmark retains headroom.^[1]^[2]^[6]

Frontier laboratories also report BigCodeBench alongside Aider Polyglot, which tests multi-file editing across many programming languages, and alongside general-knowledge benchmarks such as MMLU, which is not a coding benchmark. Together these benchmarks reflect a broader 2024-2025 trend of evaluating code models across multiple axes, including function-level realism, repository-scale reasoning, multi-language editing, and competitive-programming difficulty.^[1]^[6]

How is BigCodeBench related to the BigCode project, The Stack, and StarCoder?

BigCodeBench is one output of the broader BigCode Project, the same open scientific collaboration between Hugging Face and ServiceNow Research that produced The Stack source-code dataset and the StarCoder family of open code large language models. Where The Stack supplies permissively licensed training data and StarCoder supplies open models, BigCodeBench supplies the evaluation layer, and StarCoder2 is among the open models the BigCode team scores on the leaderboard. The shared lineage means BigCodeBench was developed by a team that builds, trains, and evaluates open code models end to end rather than by a single closed laboratory.^[3]^[4]

Adoption by frontier laboratories

Since its release, BigCodeBench has appeared in the official evaluation suites and technical reports of many frontier laboratories. The BigCode team's official repository lists adoption by Zhipu AI, Alibaba Qwen, DeepSeek, Amazon AWS AI, Snowflake AI Research, ServiceNow Research, Meta AI, Cohere AI, Sakana AI, and the Allen Institute for Artificial Intelligence (AI2). The benchmark has been used to validate iterative improvements to open code models (DeepSeek-Coder, DeepSeek-V2 and V3, Qwen2.5-Coder, Qwen3-Coder, Llama 3.x including 3.3, Phi-3 and successors, Mistral and Codestral, and StarCoder2) and as a sanity check on closed frontier releases. Anthropic, OpenAI, and Google have reported BigCodeBench scores in selected releases, and the benchmark is one of the default evaluations integrated into community frameworks such as UK AISI's Inspect-Evals harness.^[2]^[4]^[9]

Independent third-party evaluations have also accumulated. A November 2025 study on benchmark agreement (arXiv:2511.04355) evaluated six leading models (Claude Sonnet 4, DeepSeek-V3, Qwen3-Coder, GPT-4o, Llama 3.3 70B, and Mistral 3.2 24B) on BigCodeBench-Hard and used the results to analyze remaining failure modes in code generation, particularly around error handling and multi-library composition.^[9]

Limitations

BigCodeBench's authors and external evaluators have identified several limitations. The benchmark is Python-only, partly because library-rich function calls are language-specific; extending it to JavaScript, Go, or Rust would require re-curating the entire task pool. Because the benchmark relies on unittest and unittest.mock for sandboxed execution, certain side-effecting code paths (long-running computations, deep filesystem effects, GPU operations, network operations beyond mocked sockets) cannot be tested precisely, which biases the task pool toward operations that are easily mocked. The benchmark is also subject to library obsolescence: a small number of tasks depend on third-party packages that may change behavior between minor versions, and the maintainers have released versioned snapshots (v0.1.0 through v0.1.4) to control for this drift. Finally, while the authors took care to construct tasks that do not appear verbatim in common pre-training corpora, the underlying library APIs are documented publicly and therefore appear in pre-training data; BigCodeBench therefore does not provide the contamination guarantees that purely time-fenced benchmarks like LiveCodeBench aim for.^[1]^[2]

Other documented limitations include the reliance on a single canonical solution per task for the difficulty filter in BigCodeBench-Hard, which may not capture all valid solution paths, and the model-laziness phenomenon, in which long Complete prompts cause some models to drop boilerplate code such as imports and constants. The calibrated pass@1 metric mitigates but does not eliminate this issue. The authors' future-work section lists the following priorities for subsequent versions: multilingualism, finer-grained test augmentation following EvalPlus, generalization to emerging libraries such as transformers and langchain, evolution to handle library version drift, and an interactive agent variant in which the model can self-debug.^[1]^[2]

References

1. Zhuo, Terry Yue, et al. "BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions." arXiv preprint arXiv:2406.15877 (June 2024, revised April 2025). https://arxiv.org/abs/2406.15877. Accessed 2026-06-21.
2. "BigCodeBench: The Next Generation of HumanEval." Hugging Face Blog. https://huggingface.co/blog/leaderboard-bigcodebench. Accessed 2026-06-21.
3. BigCode Project organization page. https://github.com/bigcode-project. Accessed 2026-06-21.
4. BigCodeBench GitHub repository. "[ICLR'25] BigCodeBench: Benchmarking Code Generation Towards AGI." https://github.com/bigcode-project/bigcodebench. Accessed 2026-06-21.
5. BigCodeBench dataset on Hugging Face. https://huggingface.co/datasets/bigcode/bigcodebench. Accessed 2026-06-21.
6. Liu, Jiawei, et al. "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation" (EvalPlus). NeurIPS 2023. arXiv:2305.01210. Accessed 2026-06-21.
7. Zhuo, Terry Yue. "Announcing BigCodeBench-Hard, and More." Hugging Face Blog, 24 July 2024. https://huggingface.co/blog/terryyz/bigcodebench-hard. Accessed 2026-06-21.
8. BigCodeBench-Hard leaderboard mirror at LLM-Stats. https://llm-stats.com/benchmarks/bigcodebench-hard. Accessed 2026-06-21.
9. "Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks." arXiv:2511.04355 (November 2025). https://arxiv.org/html/2511.04355. Accessed 2026-06-21.
