MBPP (Mostly Basic Python Problems) is a code generation benchmark consisting of 974 crowd-sourced Python programming tasks designed to be solvable by entry-level programmers. Introduced by Jacob Austin, Augustus Odena, and colleagues at Google Research in August 2021, MBPP measures the ability of large language models to synthesize short Python programs from natural language descriptions. Each problem includes a task description, a reference solution, and three automated test cases. MBPP has become one of the most widely used benchmarks for evaluating code generation capabilities alongside HumanEval, and it is a standard component of LLM evaluation suites across both industry and academia.
MBPP was introduced in the paper "Program Synthesis with Large Language Models," submitted to arXiv on August 16, 2021 (arXiv:2108.07732). The full author list includes Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. All authors were affiliated with Google Research at the time of publication.
The paper presented two benchmarks for evaluating program synthesis: MBPP and a separate dataset called MathQA-Python (containing 23,914 problems). The central research question was whether large language models could generate correct code from natural language specifications, and how performance scaled with model size. The authors evaluated a collection of decoder-only transformer-based language models ranging from 244 million to 137 billion parameters on both benchmarks in few-shot and fine-tuning regimes. They found that synthesis performance scales log-linearly with model size, a finding that influenced subsequent research on scaling laws for code generation.
The motivation for creating MBPP was the absence of standardized benchmarks for evaluating natural-language-to-code synthesis at the time. While prior work had explored program synthesis in constrained domains, there was no large-scale, open dataset of simple programming tasks that could serve as a practical yardstick for general-purpose language models. The authors aimed to fill that gap with a dataset broad enough to cover fundamental programming concepts yet simple enough to be solvable by entry-level programmers.
The dataset is released under the CC-BY-4.0 license and hosted in the official Google Research GitHub repository as well as on Hugging Face Datasets.
The MBPP dataset contains 974 programming tasks, each consisting of three components:
| Component | Description |
|---|---|
| Task description | A short natural language prompt describing the programming problem in English |
| Reference solution | A self-contained Python function that correctly solves the problem |
| Test cases | Three assert-based test cases that verify functional correctness |
Each problem is stored as a JSON object with the following fields:
| Field | Type | Description |
|---|---|---|
| task_id | Integer | A unique numeric identifier for the problem |
| text | String | The natural language problem description |
| code | String | The canonical reference implementation |
| test_list | List of strings | Three assertion-based test cases |
| test_setup_code | String | Optional import statements needed for testing |
| challenge_test_list | List of strings | Additional hidden test cases (when available) |
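To make the schema concrete, the sketch below round-trips one record through the mbpp.jsonl line format. The task_id, text, and code here are hypothetical placeholders, not actual dataset entries.

```python
import json

# One dataset entry in the schema above (values are hypothetical,
# task_id 975 is deliberately outside the real 1-974 range).
line = json.dumps({
    "task_id": 975,
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
    "test_list": [
        "assert add(1, 2) == 3",
        "assert add(0, 0) == 0",
        "assert add(-1, 1) == 0",
    ],
    "test_setup_code": "",
    "challenge_test_list": [],
})

# mbpp.jsonl stores one such JSON object per line; parsing a line
# recovers the fields listed in the table.
record = json.loads(line)
assert record["task_id"] == 975
assert len(record["test_list"]) == 3
```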
The problems in MBPP are intentionally designed to cover fundamental programming concepts rather than advanced algorithmic challenges. The distribution of problem types breaks down approximately as follows:
| Category | Approximate Share | Examples |
|---|---|---|
| Mathematical operations | ~58% | Arithmetic, number theory, conversions |
| List operations | ~43% | Filtering, mapping, aggregation, sorting |
| String manipulation | ~19% | Parsing, formatting, pattern matching |
| Basic control flow | Varies | Loops, conditionals, recursion |
Note that categories overlap because a single problem may involve both list operations and mathematical computations. Problem descriptions average approximately 15.7 words, reflecting the concise and straightforward nature of the tasks. The average number of test cases per problem is 3.1 (some problems include additional challenge test cases beyond the standard three).
The following examples from the dataset illustrate the range and style of MBPP tasks.
Task ID 602: First Repeated Character
Task description:
Write a function to find the first repeated character in a given string.
Reference solution:
```python
def first_repeated_char(str1):
    for index, c in enumerate(str1):
        if str1[:index+1].count(c) > 1:
            return c
    return "None"
```
Test cases:
```python
assert first_repeated_char("abcabc") == "a"
assert first_repeated_char("abc") == "None"
assert first_repeated_char("123123") == "1"
```
Task ID 604: Reverse Words
Task description:
Write a function to reverse words in a given string.
Reference solution:
```python
def reverse_words(s):
    return ' '.join(reversed(s.split()))
```
Test cases:
```python
assert reverse_words("python program") == "program python"
assert reverse_words("java language") == "language java"
assert reverse_words("indian man") == "man indian"
```
Task ID 625: Swap First and Last Elements
Task description:
Write a function to swap first and last elements of a list.
Reference solution:
```python
def swap_List(newList):
    size = len(newList)
    temp = newList[0]
    newList[0] = newList[size - 1]
    newList[size - 1] = temp
    return newList
```
Test cases:
```python
assert swap_List([1,2,3]) == [3,2,1]
assert swap_List([1,2,3,4,4]) == [4,2,3,4,1]
assert swap_List([4,5,6]) == [6,5,4]
```
The model must generate a complete Python function that passes all three test cases for each problem.
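The pass/fail decision reduces to executing the candidate and its assert statements. A minimal sketch, omitting the process isolation and timeouts a real harness needs:

```python
# Run a candidate solution against a problem's assert-based tests.
# NOTE: exec on untrusted model output must be sandboxed in practice.
def passes_tests(candidate_code, test_list, setup_code=""):
    namespace = {}
    try:
        exec(setup_code, namespace)       # optional imports
        exec(candidate_code, namespace)   # define the function
        for test in test_list:
            exec(test, namespace)         # raises AssertionError on failure
    except Exception:
        return False
    return True

# The reference solution for Task ID 604 passes its three tests.
solution = "def reverse_words(s):\n    return ' '.join(reversed(s.split()))"
tests = [
    'assert reverse_words("python program") == "program python"',
    'assert reverse_words("java language") == "language java"',
    'assert reverse_words("indian man") == "man indian"',
]
print(passes_tests(solution, tests))  # True
```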
The dataset was created through an internal crowdsourcing effort at Google. Crowd workers with basic Python knowledge were recruited from an internal pool and asked to complete three tasks for each problem: write a natural language problem description, write a self-contained Python function solving it, and write three test cases verifying its correctness.
The crowd workers were instructed to create problems that would be solvable by entry-level programmers, covering programming fundamentals and standard library functionality. After the initial collection, ambiguous problem statements were revised to improve clarity and consistency.
A subset of the dataset later underwent additional review by the original paper's authors, resulting in the MBPP-sanitized split (described below).
The MBPP paper specifies explicit data splits for standardized evaluation:
| Split | Task IDs | Count | Purpose |
|---|---|---|---|
| Few-shot prompts | 1 to 10 | 10 | Provide in-context examples for prompting |
| Test set | 11 to 510 | 500 | Primary evaluation set |
| Validation set | 511 to 600 | 90 | Hyperparameter tuning and development |
| Training set | 601 to 974 | 374 | Fine-tuning data |
The original paper used a 3-shot prompting setup with task IDs 2, 3, and 4 as in-context examples. The standard prompt format from the paper is:
You are an expert Python programmer, and here is your task: {prompt}
Your code should pass these tests:
{tests}
[BEGIN]
{code}
[DONE]
The [BEGIN] and [DONE] tokens serve as delimiters marking the start and end of the model's generated solution.
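A sketch of how this template can be assembled in code; the in-context example below is a hypothetical stand-in for the solved examples (task IDs 2, 3, and 4) used in the paper.

```python
# Render one problem in the paper's prompt format. In-context examples
# include their solution between [BEGIN] and [DONE]; the target problem
# stops at [BEGIN] so the model continues from there.
def render(text, tests, code=None):
    s = (f"You are an expert Python programmer, and here is your task: {text}\n"
         "Your code should pass these tests:\n"
         + "\n".join(tests) + "\n[BEGIN]\n")
    if code is not None:
        s += code + "\n[DONE]\n"
    return s

shot = render("Write a function to add two numbers.",  # hypothetical example
              ["assert add(1, 2) == 3"],
              code="def add(a, b):\n    return a + b")
target = render("Write a function to reverse words in a given string.",
                ['assert reverse_words("python program") == "program python"'])
prompt = shot + target
print(prompt.endswith("[BEGIN]\n"))  # True: generation continues until [DONE]
```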
A curated subset of the full dataset, known as MBPP-sanitized, contains 427 problems that have been hand-verified by the original authors. This subset was created to address quality issues present in the full dataset, where some problems had noisy or ambiguous task descriptions, broken test cases, or other inconsistencies that are common artifacts of crowd-sourced data collection.
The sanitized version underwent a second round of annotation through an internal crowdsourcing effort at Google where reviewers improved task descriptions, verified the correctness of reference solutions, and ensured that the test cases accurately assessed functional correctness. MBPP-sanitized is distributed as a separate file (sanitized-mbpp.json) alongside the full dataset (mbpp.jsonl). It uses a slightly different schema, with each entry containing source_file, task_id, prompt (the refined task description), code, test_imports, and test_list.
Many subsequent evaluations and leaderboards use the sanitized subset rather than the full dataset because it reduces the impact of noisy or malformed problems that could distort model performance measurements. The EvalPlus framework, for instance, further refines this subset by removing additional low-quality tasks, resulting in 399 tasks (later reduced to 378 in MBPP+ v0.2.0).
MBPP uses the pass@k metric, which has become the standard evaluation metric for code generation benchmarks. This metric was formalized by Chen et al. (2021) in the Codex paper ("Evaluating Large Language Models Trained on Code").
The pass@k metric measures the probability that at least one of k generated code samples for a given problem passes all associated test cases. For pass@1, the model gets a single attempt per problem. For pass@10 or pass@100, the model generates multiple candidate solutions and succeeds if any one of them passes.
Rather than simply checking whether at least one of k samples is correct (which would introduce bias from the selection of k samples), Chen et al. proposed an unbiased estimator: generate n ≥ k samples per problem, count the number c of samples that pass all test cases, and compute

pass@k = 1 - C(n - c, k) / C(n, k)
Here, C(a, b) denotes the binomial coefficient "a choose b." This formula calculates the complement of the probability that all k selected samples from the n generated candidates are incorrect, using sampling without replacement. If n - c < k (meaning there are fewer failing samples than the budget), pass@k equals 1.0 because at least one correct sample is guaranteed in any draw of k.
The key advantage of this estimator over a naive approach (such as 1 - (1 - pass@1)^k) is that it correctly accounts for finite-pool sampling without replacement, avoiding the independence assumptions that cause bias in simpler estimators.
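The estimator is a direct translation of the formula. For instance, with n = 10 samples of which c = 2 are correct, a single draw succeeds 20% of the time:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples generated per problem, c of them correct."""
    if n - c < k:  # fewer failing samples than the draw size
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 2, 1), 3))  # 0.2  (matches the naive c/n when k=1)
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
print(pass_at_k(10, 8, 5))            # 1.0  (a correct sample is guaranteed)
```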
The most commonly reported metric is pass@1 using greedy decoding (temperature = 0), which represents the model's ability to produce a correct solution on its first attempt. Some papers also report pass@10 and pass@80 or pass@100, which measure the diversity and coverage of the model's generated solutions at higher sampling temperatures.
To evaluate a model on MBPP, the standard procedure is:

1. Format each problem into a prompt, typically with few-shot examples drawn from the designated prompt split.
2. Generate one or more candidate completions from the model.
3. Execute each candidate against the assert statements from the test list.
4. Score the results with pass@k.

The original Austin et al. paper used temperature sampling at 0.5 and generated 80 samples per problem, reporting pass@80 as the headline metric. Later work typically reports pass@1 with greedy decoding (temperature 0) or pass@1 averaged across multiple samples.
Austin et al. (2021) evaluated a family of decoder-only transformer language models of varying sizes on the MBPP benchmark. The models were general-purpose language models trained on a mixture of text data, not specifically trained on code. Key findings include:

- Synthesis performance scaled log-linearly with model size across the 244 million to 137 billion parameter range.
- The largest model solved 59.6% of the problems in the few-shot setting when allowed 80 samples per problem (pass@80).
- Fine-tuning on the MBPP training split improved performance by roughly 10 percentage points, to about 70%.

These results demonstrated for the first time that general-purpose language models, without any code-specific training, could solve a meaningful fraction of basic programming problems through few-shot prompting alone.
Performance on MBPP has improved dramatically since the benchmark's introduction in 2021. The following table summarizes reported pass@1 scores from notable models across different generations. Scores are drawn from original papers, the EvalPlus leaderboard, model technical reports, and third-party evaluations.
| Model | Organization | Year | MBPP pass@1 (%) | Notes |
|---|---|---|---|---|
| LaMDA 137B (few-shot) | Google | 2021 | 59.6 | Original MBPP paper, pass@80, largest model tested |
| LaMDA 137B (fine-tuned) | Google | 2021 | ~70.0 | Fine-tuned on MBPP training split |
| CodeGen-16B-Multi | Salesforce | 2022 | 20.9 | Multi-language open-source code model |
| Codex (code-davinci-002) | OpenAI | 2022 | 58.1 | Baseline zero-shot evaluation |
| PaLM-Coder 540B | Google | 2022 | 75.0 | PaLM fine-tuned for code |
| StarCoder-15B | BigCode | 2023 | 43.6 | Open model trained on The Stack |
| WizardCoder-15B-V1.0 | Microsoft | 2023 | 51.8 | Evol-Instruct fine-tuned StarCoder |
| Code Llama-Python 34B | Meta | 2023 | 67.2 | Fine-tuned Llama 2 for code |
| Code Llama-Python 70B | Meta | 2023 | 72.4 | Largest Code Llama variant |
| WizardCoder-33B-V1.1 | Microsoft | 2023 | 78.9 | Instruction-tuned on DeepSeek base |
| DeepSeek-Coder-Base-33B | DeepSeek | 2023 | 70.6 | Trained on 2T tokens of code |
| GPT-3.5-Turbo | OpenAI | 2023 | 81.7 | Chat-optimized model |
| GPT-4o | OpenAI | 2024 | 84.8 | Evaluated with planning-driven LPW workflow |
| Phi-3.5-MoE-instruct | Microsoft | 2024 | 80.8 | Mixture-of-experts architecture |
| Qwen2.5-Coder 32B Instruct | Alibaba | 2024 | 90.2 | Code-specialized Qwen variant |
| Qwen2.5 72B Instruct | Alibaba | 2024 | 88.2 | General-purpose large model |
| Llama-3.3 Nemotron Super 49B | NVIDIA | 2025 | 91.3 | NVIDIA-tuned Llama variant |
The progression from roughly 60% in 2021 to over 90% by 2025 illustrates both the rapid advancement in code generation capabilities and the growing saturation of the MBPP benchmark.
The EvalPlus framework provides stricter evaluation with 35 times more test cases than the original. MBPP+ scores are consistently lower than standard MBPP scores because the expanded test suite catches subtle bugs that the original three tests miss. The EvalPlus leaderboard uses a subset of the MBPP-sanitized tasks (378 in the current version) and ranks models by pass@1 with greedy decoding.
| Model | Organization | MBPP+ pass@1 (%) |
|---|---|---|
| o1-preview | OpenAI | 80.2 |
| o1-mini | OpenAI | 78.8 |
| Mistral Small 3.2 24B Instruct | Mistral | 78.3 |
| Qwen2.5-Coder-32B-Instruct | Alibaba | 77.0 |
| DeepSeek-Coder-V2-Instruct | DeepSeek | 75.1 |
| Gemini 1.5 Pro 002 | Google | 74.6 |
| Claude 3.5 Sonnet | Anthropic | 74.3 |
| GPT-4-Turbo (Nov 2023) | OpenAI | 73.3 |
| Claude 3 Opus | Anthropic | 73.3 |
| DeepSeek-V3 | DeepSeek | 73.0 |
| GPT-4o | OpenAI | 72.2 |
| Llama 3 70B Instruct | Meta | 69.0 |
| Grok Beta | xAI | 65.6 |
| CodeLlama-34B | Meta | 56.3 |
MBPP and HumanEval are the two most widely used benchmarks for evaluating code generation by large language models. While they share the same fundamental goal, they differ in several important ways.
| Feature | HumanEval | MBPP |
|---|---|---|
| Origin | OpenAI (Chen et al., 2021) | Google Research (Austin et al., 2021) |
| Number of problems | 164 | 974 (500 in test split) |
| Target difficulty | Moderate (interview-style) | Entry-level (introductory programming) |
| Problem source | Hand-written by OpenAI researchers | Crowd-sourced from Google internal workers |
| Prompt format | Function signature + docstring | Natural language description + 3 assert examples |
| Average test cases per problem | 7.7 | 3.1 |
| Test visibility | Tests hidden from the model | Tests shown in the prompt |
| Problem description length | Longer docstrings with examples | Short descriptions (~15.7 words average) |
| Language | Python only | Python only |
| Evaluation metric | pass@k | pass@k |
| Primary coverage | Algorithms, reasoning, comprehension | Fundamentals, standard library, math |
| Function signature provided | Yes | No (model must infer function name) |
| Sanitized subset | No (but HumanEval+ exists via EvalPlus) | Yes (427 problems) |
| License | MIT | CC-BY-4.0 |
A key structural difference is that MBPP shows the test cases to the model as part of the prompt, while HumanEval hides its test cases and instead provides a function signature with a docstring. This means MBPP models can use the assert statements as additional specification of the expected behavior, while HumanEval models must rely solely on the docstring description.
Another difference is scale: MBPP's 974 problems (500 in the test split) provide broader coverage and more statistically reliable results compared to HumanEval's 164 problems. However, HumanEval's problems tend to be more algorithmically challenging, making it a better discriminator for advanced models.
MBPP also tests a broader skill set in one respect: the model must interpret the natural language description, choose appropriate function names and signatures, and implement the logic from scratch. HumanEval isolates the implementation step by providing the function signature and docstring, so the model only needs to complete the function body.
In practice, most evaluation suites report scores on both benchmarks. Models that perform well on one typically perform well on the other, though the relative ranking can shift depending on the model's strengths. The problem distribution also differs: approximately 89.5% of HumanEval problems are algorithmic and basic programming tasks, compared to 77% mathematical or list operation tasks in MBPP.
EvalPlus is an evaluation framework introduced by Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang from the University of Illinois at Urbana-Champaign. The foundational work was published at NeurIPS 2023 in the paper "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation," with a follow-up paper (EvalPerf) published at COLM 2024. The core insight behind EvalPlus is that the small number of test cases in benchmarks like MBPP (three per problem) and HumanEval (average 7.7 per problem) is insufficient to catch many incorrect solutions that happen to pass the limited tests. This phenomenon, called test insufficiency, can lead to inflated pass rates and incorrect model rankings.
MBPP+ extends the original MBPP test suite to approximately 35 times the number of test cases. The test generation process combines two strategies: LLM-based generation of seed inputs that exercise the problem specification, and type-aware mutation that perturbs those seeds to produce a large pool of additional valid inputs.
The enhanced test suite is then validated against the reference solutions to ensure correctness. This process catches solutions that pass the original three tests but fail on edge cases, boundary conditions, or uncommon inputs.
For HumanEval, the same approach created HumanEval+ with 80 times more test cases.
MBPP+ has undergone several refinements as the EvalPlus team identified and resolved quality issues:
| Version | Date | Number of Tasks | Changes |
|---|---|---|---|
| v0.1.0 | 2023 | 399 | Initial release based on MBPP-sanitized (427 tasks) with quality filtering |
| v0.2.0 | 2023 | 378 | Removed broken tasks (399 to 378); ~4 percentage point pass@1 improvement expected |
| v0.3.0 | June 2024 | 378 | Improved ground-truth solutions for Task IDs 459, 102, and 559 |
The additional test cases in MBPP+ consistently lower reported pass@1 scores compared to base MBPP, sometimes by a substantial margin. More importantly, they can change the relative ranking of models. A model that appears to outperform another on the original MBPP may fall behind when evaluated on MBPP+ because its solutions happened to exploit gaps in the original test suite. For example, the EvalPlus authors showed that test insufficiency in the original benchmarks could cause mis-rankings, with some models that appeared weaker on the base benchmark actually performing better under rigorous evaluation.
MBPP+ has been adopted by major AI organizations for benchmarking their code generation models, including Meta (for Llama 3.1 and 3.3 evaluations), Alibaba (for Qwen), DeepSeek, and Snowflake.
Several benchmarks have been developed that extend or build upon MBPP to address specific limitations:
In December 2024, Zhaojian Yu, Yilun Zhao, Arman Cohan, and Xiao-Ping Zhang introduced MBPP Pro in the paper "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation," published in the Findings of ACL 2025. MBPP Pro evaluates a model's ability to solve a base problem and then use that solution to address a more complex, related problem (called "self-invoking code generation"). This tests progressive reasoning and compositional problem-solving skills that go beyond standard function-level synthesis.
Results showed that most LLMs excel at standard MBPP tasks but struggle significantly with the self-invoking extension. For instance, o1-mini achieved 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro, with similar drops observed on the MBPP variants. Instruction-tuned models showed only marginal improvements over base models on the self-invoking tasks.
MHPP is a separate benchmark inspired by MBPP that increases the difficulty level substantially. MHPP features problem descriptions that average 150.2 words (roughly ten times longer than MBPP's 15.7-word average) and includes larger test suites. It was designed to discriminate among advanced models that have effectively saturated the original MBPP benchmark.
The MultiPL-E project extends MBPP (and HumanEval) to over 18 programming languages by automatically translating the Python problems and test cases into languages such as JavaScript, TypeScript, C++, Java, Rust, Go, Perl, Ruby, and others. This allows researchers to evaluate how well code generation models perform across different programming languages using the same underlying problem set. MultiPL-E was published as an IEEE Transactions on Software Engineering paper by Cassano et al. (2023).
BigCodeBench represents a next-generation benchmark that addresses MBPP's limitation of testing only self-contained function synthesis. It includes practical programming tasks that require interaction with external libraries, APIs, and more realistic software engineering scenarios.
Despite its widespread adoption, MBPP has several recognized limitations:
With only three test cases per problem on average, the original MBPP test suite is insufficient to verify full functional correctness. The EvalPlus project demonstrated that expanding the test suite by 35 times catches many previously undetected incorrect solutions. Some models experience pass rate drops of 15 to 20 percentage points when evaluated with the expanded tests.
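Test insufficiency is easy to reproduce with Task 602 from the examples above. The deliberately wrong variant below checks whether a character repeats anywhere rather than finding the earliest repeat, yet it passes all three original tests:

```python
# Subtly wrong: returns the first character that occurs more than once
# anywhere in the string, not the character whose *repeat* comes first.
def buggy_first_repeated_char(str1):
    for c in str1:
        if str1.count(c) > 1:
            return c
    return "None"

# All three original MBPP tests pass...
assert buggy_first_repeated_char("abcabc") == "a"
assert buggy_first_repeated_char("abc") == "None"
assert buggy_first_repeated_char("123123") == "1"

# ...but an MBPP+-style extra input exposes the bug: in "abba" the
# first repeated character is "b" (its second occurrence comes first).
print(buggy_first_repeated_char("abba"))  # returns "a", not "b"
```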
Research has shown that approximately 65.4% of MBPP test instances can be traced to open-access websites. This finding, reported by Shi et al. (2024) in "On Leakage of Code Generation Evaluation Datasets" (Findings of EMNLP 2024), raises concerns that high-performing models may have encountered MBPP problems (or very similar ones) during pretraining, inflating their scores through memorization rather than genuine program synthesis ability. As models are trained on increasingly large web crawls, the risk of benchmark contamination grows, making it difficult to determine whether high MBPP scores reflect true generalization.
As of 2025, state-of-the-art models achieve pass@1 scores above 90% on MBPP, with some exceeding 91%. At these levels, the benchmark loses its ability to meaningfully differentiate between top-performing models. When most models cluster near the ceiling, small differences in scores may reflect noise or evaluation variance rather than meaningful capability gaps. This saturation has led researchers to develop more challenging alternatives.
The dataset is heavily skewed toward mathematical operations and list processing (77% of problems), with limited coverage of more complex programming patterns such as object-oriented design, file I/O, concurrency, error handling, database queries, or interaction with external libraries. This narrow scope means that strong MBPP performance does not necessarily translate to strong performance on real-world programming tasks.
MBPP evaluates only Python code generation. While Python is the most popular language for AI and data science, real-world software development involves many languages. The MultiPL-E project partially addresses this by translating MBPP problems to other languages, but the original benchmark remains Python-exclusive.
MBPP problem descriptions average only 15.7 words, which is far shorter than real-world programming specifications. This brevity may not adequately test a model's ability to understand complex, multi-paragraph requirements. The MHPP benchmark addresses this with descriptions averaging 150.2 words.
MBPP has played a foundational role in the development of the code generation field. Together with HumanEval, it established the standard evaluation framework (pass@k on function-level Python synthesis tasks) that nearly all subsequent code generation research has adopted. The benchmark's strengths lie in its size (974 problems provide significantly more statistical power than HumanEval's 164), its simplicity (making it accessible for rapid evaluation during model development), and its crowd-sourced nature (providing diverse problem formulations that reflect how non-experts describe programming tasks).
The benchmark has been cited thousands of times and is used as a standard evaluation in virtually every major code model release, from GPT-4 and Claude to open-source models like Code Llama, DeepSeek-Coder, StarCoder, and Qwen-Coder. It remains a required benchmark in competitive code model evaluations, even as the community has recognized its limitations and developed more rigorous successors.
The trajectory of MBPP scores over time provides a compelling illustration of progress in AI code generation: from roughly 60% with the largest models in 2021 to above 90% by 2025, a level of improvement driven by advances in model scale, training data curation, instruction tuning, and reinforcement learning from human feedback.
MBPP is freely available through several channels:
The official GitHub repository distributes two files: mbpp.jsonl (the full dataset, 974 entries, one JSON object per line) and sanitized-mbpp.json (the hand-verified 427-problem subset); both are mirrored on Hugging Face Datasets. The dataset is released under the CC-BY-4.0 license.
Several evaluation frameworks support MBPP out of the box:
| Framework | Maintainer | MBPP+ Support | Notes |
|---|---|---|---|
| EvalPlus | University of Illinois | Yes | Supports MBPP and MBPP+ with augmented test cases; pip install evalplus |
| BigCode Evaluation Harness | BigCode / Hugging Face | No | Standard MBPP evaluation for open-source code models |
| lm-evaluation-harness | EleutherAI | No | General-purpose LLM evaluation with MBPP task support |
When running evaluations, researchers should execute generated Python code in a sandboxed environment, as model-generated code could potentially be harmful.
| Benchmark | Year | Description | Relationship to MBPP |
|---|---|---|---|
| HumanEval | 2021 | 164 Python problems with function signatures and docstrings | Complementary benchmark; often reported alongside MBPP |
| APPS | 2021 | 10,000 coding problems from competitive programming sites | More difficult; tests broader range of skills |
| MBPP+ (EvalPlus) | 2023 | MBPP with 35x more test cases | Direct extension with stricter evaluation |
| BigCodeBench | 2024 | Practical programming tasks with library calls | Tests real-world coding beyond basic functions |
| MBPP Pro | 2024 | Self-invoking code generation tasks based on MBPP | Harder variant requiring compositional reasoning |
| MHPP | 2024 | Mostly Hard Python Problems with longer descriptions | Addresses MBPP's simplicity limitation |
| LiveCodeBench | 2024 | Continuously updated problems from coding contests | Addresses data contamination through temporal freshness |
| SWE-bench | 2024 | Real GitHub issues from open-source repositories | Tests end-to-end software engineering |
| MultiPL-E | 2023 | MBPP and HumanEval translated to 18+ languages | Extends MBPP to multilingual evaluation |