MBPP (Mostly Basic Python Problems) is a code generation benchmark consisting of 974 crowd-sourced Python programming tasks designed to be solvable by entry-level programmers. Introduced by Jacob Austin, Augustus Odena, and colleagues at Google Research in August 2021, MBPP measures the ability of large language models to synthesize short Python programs from natural language descriptions. Each problem includes a task description, a reference solution, and three automated test cases. MBPP has become one of the most widely used benchmarks for evaluating code generation capabilities alongside HumanEval, and it is a standard component of LLM evaluation suites across both industry and academia.
MBPP was introduced in the paper "Program Synthesis with Large Language Models," submitted to arXiv on August 16, 2021 (arXiv:2108.07732). The full author list includes Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. All authors were affiliated with Google Research at the time of publication.
The paper presented two benchmarks for evaluating program synthesis: MBPP and a separate dataset called MathQA-Python (containing 23,914 problems). The central research question was whether large language models could generate correct code from natural language specifications, and how performance scaled with model size. The authors evaluated a collection of decoder-only transformer-based language models ranging from 244 million to 137 billion parameters on both benchmarks in few-shot and fine-tuning regimes. They found that synthesis performance scales log-linearly with model size, a finding that influenced subsequent research on scaling laws for code generation.
The motivation for creating MBPP was the absence of standardized benchmarks for evaluating natural-language-to-code synthesis at the time. While prior work had explored program synthesis in constrained domains, there was no large-scale, open dataset of simple programming tasks that could serve as a practical yardstick for general-purpose language models. The authors aimed to fill that gap with a dataset broad enough to cover fundamental programming concepts yet simple enough to be solvable by entry-level programmers.
The dataset is released under the CC-BY-4.0 license and hosted in the official Google Research GitHub repository as well as on Hugging Face Datasets.
The MBPP dataset contains 974 programming tasks, each consisting of three components:
| Component | Description |
|---|---|
| Task description | A short natural language prompt describing the programming problem in English |
| Reference solution | A self-contained Python function that correctly solves the problem |
| Test cases | Three assert-based test cases that verify functional correctness |
Each problem is stored as a JSON object with the following fields:
| Field | Type | Description |
|---|---|---|
| task_id | Integer | A unique numeric identifier for the problem |
| text | String | The natural language problem description |
| code | String | The canonical reference implementation |
| test_list | List of strings | Three assertion-based test cases |
| test_setup_code | String | Optional import statements needed for testing |
| challenge_test_list | List of strings | Additional hidden test cases (when available) |
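To make the schema concrete, the sketch below round-trips one record through the mbpp.jsonl line format. The task_id, text, and code here are hypothetical placeholders, not actual dataset entries.

```python
import json

# One dataset entry in the schema above (values are hypothetical,
# task_id 975 is deliberately outside the real 1-974 range).
line = json.dumps({
    "task_id": 975,
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
    "test_list": [
        "assert add(1, 2) == 3",
        "assert add(0, 0) == 0",
        "assert add(-1, 1) == 0",
    ],
    "test_setup_code": "",
    "challenge_test_list": [],
})

# mbpp.jsonl stores one such JSON object per line; parsing a line
# recovers the fields listed in the table.
record = json.loads(line)
assert record["task_id"] == 975
assert len(record["test_list"]) == 3
```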
The problems in MBPP are intentionally designed to cover fundamental programming concepts rather than advanced algorithmic challenges. The distribution of problem types breaks down approximately as follows:
| Category | Approximate Share | Examples |
|---|---|---|
| Mathematical operations | ~58% | Arithmetic, number theory, conversions |
| List operations | ~43% | Filtering, mapping, aggregation, sorting |
| String manipulation | ~19% | Parsing, formatting, pattern matching |
| Basic control flow | Varies | Loops, conditionals, recursion |
Note that categories overlap because a single problem may involve both list operations and mathematical computations. Problem descriptions average approximately 15.7 words, reflecting the concise and straightforward nature of the tasks. The average number of test cases per problem is 3.1 (some problems include additional challenge test cases beyond the standard three).
The following examples from the dataset illustrate the range and style of MBPP tasks.
Task ID 602: First Repeated Character
Task description:
Write a function to find the first repeated character in a given string.
Reference solution:
```python
def first_repeated_char(str1):
    for index, c in enumerate(str1):
        if str1[:index+1].count(c) > 1:
            return c
    return "None"
```
Test cases:
```python
assert first_repeated_char("abcabc") == "a"
assert first_repeated_char("abc") == "None"
assert first_repeated_char("123123") == "1"
```
Task ID 604: Reverse Words
Task description:
Write a function to reverse words in a given string.
Reference solution:
```python
def reverse_words(s):
    return ' '.join(reversed(s.split()))
```
Test cases:
```python
assert reverse_words("python program") == "program python"
assert reverse_words("java language") == "language java"
assert reverse_words("indian man") == "man indian"
```
Task ID 625: Swap First and Last Elements
Task description:
Write a function to swap first and last elements of a list.
Reference solution:
```python
def swap_List(newList):
    size = len(newList)
    temp = newList[0]
    newList[0] = newList[size - 1]
    newList[size - 1] = temp
    return newList
```
Test cases:
```python
assert swap_List([1,2,3]) == [3,2,1]
assert swap_List([1,2,3,4,4]) == [4,2,3,4,1]
assert swap_List([4,5,6]) == [6,5,4]
```
The model must generate a complete Python function that passes all three test cases for each problem.
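The pass/fail decision reduces to executing the candidate and its assert statements. A minimal sketch, omitting the process isolation and timeouts a real harness needs:

```python
# Run a candidate solution against a problem's assert-based tests.
# NOTE: exec on untrusted model output must be sandboxed in practice.
def passes_tests(candidate_code, test_list, setup_code=""):
    namespace = {}
    try:
        exec(setup_code, namespace)       # optional imports
        exec(candidate_code, namespace)   # define the function
        for test in test_list:
            exec(test, namespace)         # raises AssertionError on failure
    except Exception:
        return False
    return True

# The reference solution for Task ID 604 passes its three tests.
solution = "def reverse_words(s):\n    return ' '.join(reversed(s.split()))"
tests = [
    'assert reverse_words("python program") == "program python"',
    'assert reverse_words("java language") == "language java"',
    'assert reverse_words("indian man") == "man indian"',
]
print(passes_tests(solution, tests))  # True
```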
The dataset was created through an internal crowdsourcing effort at Google. Crowd workers with basic Python knowledge were recruited from an internal pool and asked to complete three tasks for each problem: write a natural language problem description, write a self-contained Python function solving it, and write three test cases verifying its correctness.
The crowd workers were instructed to create problems that would be solvable by entry-level programmers, covering programming fundamentals and standard library functionality. After the initial collection, ambiguous problem statements were revised to improve clarity and consistency.
A subset of the dataset later underwent additional review by the original paper's authors, resulting in the MBPP-sanitized split (described below).
The MBPP paper specifies explicit data splits for standardized evaluation:
| Split | Task IDs | Count | Purpose |
|---|---|---|---|
| Few-shot prompts | 1 to 10 | 10 | Provide in-context examples for prompting |
| Test set | 11 to 510 | 500 | Primary evaluation set |
| Validation set | 511 to 600 | 90 | Hyperparameter tuning and development |
| Training set | 601 to 974 | 374 | Fine-tuning data |
The original paper used a 3-shot prompting setup with task IDs 2, 3, and 4 as in-context examples. The standard prompt format from the paper is:
You are an expert Python programmer, and here is your task: {prompt}
Your code should pass these tests:
{tests}
[BEGIN]
{code}
[DONE]
The [BEGIN] and [DONE] tokens serve as delimiters marking the start and end of the model's generated solution.
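A sketch of how this template can be assembled in code; the in-context example below is a hypothetical stand-in for the solved examples (task IDs 2, 3, and 4) used in the paper.

```python
# Render one problem in the paper's prompt format. In-context examples
# include their solution between [BEGIN] and [DONE]; the target problem
# stops at [BEGIN] so the model continues from there.
def render(text, tests, code=None):
    s = (f"You are an expert Python programmer, and here is your task: {text}\n"
         "Your code should pass these tests:\n"
         + "\n".join(tests) + "\n[BEGIN]\n")
    if code is not None:
        s += code + "\n[DONE]\n"
    return s

shot = render("Write a function to add two numbers.",  # hypothetical example
              ["assert add(1, 2) == 3"],
              code="def add(a, b):\n    return a + b")
target = render("Write a function to reverse words in a given string.",
                ['assert reverse_words("python program") == "program python"'])
prompt = shot + target
print(prompt.endswith("[BEGIN]\n"))  # True: generation continues until [DONE]
```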
A curated subset of the full dataset, known as MBPP-sanitized, contains 427 problems that have been hand-verified by the original authors. This subset was created to address quality issues present in the full dataset, where some problems had noisy or ambiguous task descriptions, broken test cases, or other inconsistencies that are common artifacts of crowd-sourced data collection.
The sanitized version underwent a second round of annotation through an internal crowdsourcing effort at Google where reviewers improved task descriptions, verified the correctness of reference solutions, and ensured that the test cases accurately assessed functional correctness. MBPP-sanitized is distributed as a separate file (sanitized-mbpp.json) alongside the full dataset (mbpp.jsonl). It uses a slightly different schema, with each entry containing source_file, task_id, prompt (the refined task description), code, test_imports, and test_list.
Many subsequent evaluations and leaderboards use the sanitized subset rather than the full dataset because it reduces the impact of noisy or malformed problems that could distort model performance measurements. The EvalPlus framework, for instance, further refines this subset by removing additional low-quality tasks, resulting in 399 tasks (later reduced to 378 in MBPP+ v0.2.0).
MBPP uses the pass@k metric, which has become the standard evaluation metric for code generation benchmarks. This metric was formalized by Chen et al. (2021) in the Codex paper ("Evaluating Large Language Models Trained on Code").
The pass@k metric measures the probability that at least one of k generated code samples for a given problem passes all associated test cases. For pass@1, the model gets a single attempt per problem. For pass@10 or pass@100, the model generates multiple candidate solutions and succeeds if any one of them passes.
Rather than simply checking whether at least one of k samples is correct (which would introduce bias from the selection of k samples), Chen et al. proposed an unbiased estimator: generate n ≥ k samples per problem, count the number c of samples that pass all test cases, and compute

pass@k = 1 - C(n - c, k) / C(n, k)
Here, C(a, b) denotes the binomial coefficient "a choose b." This formula calculates the complement of the probability that all k selected samples from the n generated candidates are incorrect, using sampling without replacement. If n - c < k (meaning there are fewer failing samples than the budget), pass@k equals 1.0 because at least one correct sample is guaranteed in any draw of k.
The key advantage of this estimator over a naive approach (such as 1 - (1 - pass@1)^k) is that it correctly accounts for finite-pool sampling without replacement, avoiding the independence assumptions that cause bias in simpler estimators.
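The estimator is a direct translation of the formula. For instance, with n = 10 samples of which c = 2 are correct, a single draw succeeds 20% of the time:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: n samples generated per problem, c of them correct."""
    if n - c < k:  # fewer failing samples than the draw size
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(round(pass_at_k(10, 2, 1), 3))  # 0.2  (matches the naive c/n when k=1)
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
print(pass_at_k(10, 8, 5))            # 1.0  (a correct sample is guaranteed)
```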
The most commonly reported metric is pass@1 using greedy decoding (temperature = 0), which represents the model's ability to produce a correct solution on its first attempt. Some papers also report pass@10 and pass@80 or pass@100, which measure the diversity and coverage of the model's generated solutions at higher sampling temperatures.
To evaluate a model on MBPP, the standard procedure is:

1. Format each problem into a prompt, typically with few-shot examples drawn from the designated prompt split.
2. Generate one or more candidate completions from the model.
3. Execute each candidate against the assert statements from the test list.
4. Score the results with pass@k.

The original Austin et al. paper used temperature sampling at 0.5 and generated 80 samples per problem, reporting pass@80 as the headline metric. Later work typically reports pass@1 with greedy decoding (temperature 0) or pass@1 averaged across multiple samples.
Austin et al. (2021) evaluated a family of decoder-only transformer language models of varying sizes on the MBPP benchmark. The models were general-purpose language models trained on a mixture of text data, not specifically trained on code. Key findings include:

- Synthesis performance scaled log-linearly with model size across the 244 million to 137 billion parameter range.
- The largest model solved 59.6% of the problems in the few-shot setting when allowed 80 samples per problem (pass@80).
- Fine-tuning on the MBPP training split improved performance by roughly 10 percentage points, to about 70%.

These results demonstrated for the first time that general-purpose language models, without any code-specific training, could solve a meaningful fraction of basic programming problems through few-shot prompting alone.
Performance on MBPP has improved dramatically since the benchmark's introduction in 2021. The following table summarizes reported pass@1 scores from notable models across different generations. Scores are drawn from original papers, the EvalPlus leaderboard, model technical reports, and third-party evaluations.
| Model | Organization | Year | MBPP pass@1 (%) | Notes |
|---|---|---|---|---|
| LaMDA 137B (few-shot) | Google | 2021 | 59.6 | Original MBPP paper, pass@80, largest model tested |
| LaMDA 137B (fine-tuned) | Google | 2021 | ~70.0 | Fine-tuned on MBPP training split |
| CodeGen-16B-Multi | Salesforce | 2022 | 20.9 | Multi-language open-source code model |
| Codex (code-davinci-002) | OpenAI | 2022 | 58.1 | Baseline zero-shot evaluation |
| PaLM-Coder 540B | Google | 2022 | 75.0 | PaLM fine-tuned for code |
| StarCoder-15B | BigCode | 2023 | 43.6 | Open model trained on The Stack |
| WizardCoder-15B-V1.0 | Microsoft | 2023 | 51.8 | Evol-Instruct fine-tuned StarCoder |
| Code Llama-Python 34B | Meta | 2023 | 67.2 | Fine-tuned Llama 2 for code |
| Code Llama-Python 70B | Meta | 2023 | 72.4 | Largest Code Llama variant |
| WizardCoder-33B-V1.1 | Microsoft | 2023 | 78.9 | Instruction-tuned on DeepSeek base |
| DeepSeek-Coder-Base-33B | DeepSeek | 2023 | 70.6 | Trained on 2T tokens of code |
| GPT-3.5-Turbo | OpenAI | 2023 | 81.7 | Chat-optimized model |
| GPT-4o | OpenAI | 2024 | 84.8 | Evaluated with planning-driven LPW workflow |
| Phi-3.5-MoE-instruct | Microsoft | 2024 | 80.8 | Mixture-of-experts architecture |
| Qwen2.5-Coder 32B Instruct | Alibaba | 2024 | 90.2 | Code-specialized Qwen variant |
| Qwen2.5 72B Instruct | Alibaba | 2024 | 88.2 | General-purpose large model |
| Llama-3.3 Nemotron Super 49B | NVIDIA | 2025 | 91.3 | NVIDIA-tuned Llama variant |
The progression from roughly 60% in 2021 to over 90% by 2025 illustrates both the rapid advancement in code generation capabilities and the growing saturation of the MBPP benchmark.
The EvalPlus framework provides stricter evaluation with 35 times more test cases than the original. MBPP+ scores are consistently lower than standard MBPP scores because the expanded test suite catches subtle bugs that the original three tests miss. The EvalPlus leaderboard uses a subset of the MBPP-sanitized tasks (378 in the current version) and ranks models by pass@1 with greedy decoding.
| Model | Organization | MBPP+ pass@1 (%) |
|---|---|---|
| o1-preview | OpenAI | 80.2 |
| o1-mini | OpenAI | 78.8 |
| Mistral Small 3.2 24B Instruct | Mistral | 78.3 |
| Qwen2.5-Coder-32B-Instruct | Alibaba | 77.0 |
| DeepSeek-Coder-V2-Instruct | DeepSeek | 75.1 |
| Gemini 1.5 Pro 002 | Google | 74.6 |
| Claude 3.5 Sonnet | Anthropic | 74.3 |
| GPT-4-Turbo (Nov 2023) | OpenAI | 73.3 |
| Claude 3 Opus | Anthropic | 73.3 |
| DeepSeek-V3 | DeepSeek | 73.0 |
| GPT-4o | OpenAI | 72.2 |
| Llama 3 70B Instruct | Meta | 69.0 |
| Grok Beta | xAI | 65.6 |
| CodeLlama-34B | Meta | 56.3 |
MBPP and HumanEval are the two most widely used benchmarks for evaluating code generation by large language models. While they share the same fundamental goal, they differ in several important ways.
| Feature | HumanEval | MBPP |
|---|---|---|
| Origin | OpenAI (Chen et al., 2021) | Google Research (Austin et al., 2021) |
| Number of problems | 164 | 974 (500 in test split) |
| Target difficulty | Moderate (interview-style) | Entry-level (introductory programming) |
| Problem source | Hand-written by OpenAI researchers | Crowd-sourced from Google internal workers |
| Prompt format | Function signature + docstring | Natural language description + 3 assert examples |
| Average test cases per problem | 7.7 | 3.1 |
| Test visibility | Tests hidden from the model | Tests shown in the prompt |
| Problem description length | Longer docstrings with examples | Short descriptions (~15.7 words average) |
| Language | Python only | Python only |
| Evaluation metric | pass@k | pass@k |
| Primary coverage | Algorithms, reasoning, comprehension | Fundamentals, standard library, math |
| Function signature provided | Yes | No (model must infer function name) |
| Sanitized subset | No (but HumanEval+ exists via EvalPlus) | Yes (427 problems) |
| License | MIT | CC-BY-4.0 |
A key structural difference is that MBPP shows the test cases to the model as part of the prompt, while HumanEval hides its test cases and instead provides a function signature with a docstring. This means MBPP models can use the assert statements as additional specification of the expected behavior, while HumanEval models must rely solely on the docstring description.
Another difference is scale: MBPP's 974 problems (500 in the test split) provide broader coverage and more statistically reliable results compared to HumanEval's 164 problems. However, HumanEval's problems tend to be more algorithmically challenging, making it a better discriminator for advanced models.
MBPP also tests a broader skill set in one respect: the model must interpret the natural language description, choose appropriate function names and signatures, and implement the logic from scratch. HumanEval isolates the implementation step by providing the function signature and docstring, so the model only needs to complete the function body.
In practice, most evaluation suites report scores on both benchmarks. Models that perform well on one typically perform well on the other, though the relative ranking can shift depending on the model's strengths. The problem distribution also differs: approximately 89.5% of HumanEval problems are algorithmic and basic programming tasks, compared to 77% mathematical or list operation tasks in MBPP.
EvalPlus is an evaluation framework introduced by Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang from the University of Illinois at Urbana-Champaign. The foundational work was published at NeurIPS 2023 in the paper "Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation," with a follow-up paper (EvalPerf) published at COLM 2024. The core insight behind EvalPlus is that the small number of test cases in benchmarks like MBPP (three per problem) and HumanEval (average 7.7 per problem) is insufficient to catch many incorrect solutions that happen to pass the limited tests. This phenomenon, called test insufficiency, can lead to inflated pass rates and incorrect model rankings.
MBPP+ extends the original MBPP test suite to approximately 35 times the number of test cases. The test generation process combines two strategies: LLM-based generation of seed inputs that exercise the problem specification, and type-aware mutation that perturbs those seeds to produce a large pool of additional valid inputs.
The enhanced test suite is then validated against the reference solutions to ensure correctness. This process catches solutions that pass the original three tests but fail on edge cases, boundary conditions, or uncommon inputs.
For HumanEval, the same approach created HumanEval+ with 80 times more test cases.
MBPP+ has undergone several refinements as the EvalPlus team identified and resolved quality issues:
| Version | Date | Number of Tasks | Changes |
|---|---|---|---|
| v0.1.0 | 2023 | 399 | Initial release based on MBPP-sanitized (427 tasks) with quality filtering |
| v0.2.0 | 2023 | 378 | Removed broken tasks (399 to 378); ~4 percentage point pass@1 improvement expected |
| v0.3.0 | June 2024 | 378 | Improved ground-truth solutions for Task IDs 459, 102, and 559 |
The additional test cases in MBPP+ consistently lower reported pass@1 scores compared to base MBPP, sometimes by a substantial margin. More importantly, they can change the relative ranking of models. A model that appears to outperform another on the original MBPP may fall behind when evaluated on MBPP+ because its solutions happened to exploit gaps in the original test suite. For example, the EvalPlus authors showed that test insufficiency in the original benchmarks could cause mis-rankings, with some models that appeared weaker on the base benchmark actually performing better under rigorous evaluation.
MBPP+ has been adopted by major AI organizations for benchmarking their code generation models, including Meta (for Llama 3.1 and 3.3 evaluations), Alibaba (for Qwen), DeepSeek, and Snowflake.
Several benchmarks have been developed that extend or build upon MBPP to address specific limitations:
In December 2024, Zhaojian Yu, Yilun Zhao, Arman Cohan, and Xiao-Ping Zhang introduced MBPP Pro in the paper "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation," published in the Findings of ACL 2025. MBPP Pro evaluates a model's ability to solve a base problem and then use that solution to address a more complex, related problem (called "self-invoking code generation"). This tests progressive reasoning and compositional problem-solving skills that go beyond standard function-level synthesis.
Results showed that most LLMs excel at standard MBPP tasks but struggle significantly with the self-invoking extension. For instance, o1-mini achieved 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro, with similar drops observed on the MBPP variants. Instruction-tuned models showed only marginal improvements over base models on the self-invoking tasks.
MHPP is a separate benchmark inspired by MBPP that increases the difficulty level substantially. MHPP features problem descriptions that average 150.2 words (roughly ten times longer than MBPP's 15.7-word average) and includes larger test suites. It was designed to discriminate among advanced models that have effectively saturated the original MBPP benchmark.
The MultiPL-E project extends MBPP (and HumanEval) to over 18 programming languages by automatically translating the Python problems and test cases into languages such as JavaScript, TypeScript, C++, Java, Rust, Go, Perl, Ruby, and others. This allows researchers to evaluate how well code generation models perform across different programming languages using the same underlying problem set. MultiPL-E was published as an IEEE Transactions on Software Engineering paper by Cassano et al. (2023).
BigCodeBench represents a next-generation benchmark that addresses MBPP's limitation of testing only self-contained function synthesis. It includes practical programming tasks that require interaction with external libraries, APIs, and more realistic software engineering scenarios.
Despite its widespread adoption, MBPP has several recognized limitations:
With only three test cases per problem on average, the original MBPP test suite is insufficient to verify full functional correctness. The EvalPlus project demonstrated that expanding the test suite by 35 times catches many previously undetected incorrect solutions. Some models experience pass rate drops of 15 to 20 percentage points when evaluated with the expanded tests.
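Test insufficiency is easy to reproduce with Task 602 from the examples above. The deliberately wrong variant below checks whether a character repeats anywhere rather than finding the earliest repeat, yet it passes all three original tests:

```python
# Subtly wrong: returns the first character that occurs more than once
# anywhere in the string, not the character whose *repeat* comes first.
def buggy_first_repeated_char(str1):
    for c in str1:
        if str1.count(c) > 1:
            return c
    return "None"

# All three original MBPP tests pass...
assert buggy_first_repeated_char("abcabc") == "a"
assert buggy_first_repeated_char("abc") == "None"
assert buggy_first_repeated_char("123123") == "1"

# ...but an MBPP+-style extra input exposes the bug: in "abba" the
# first repeated character is "b" (its second occurrence comes first).
print(buggy_first_repeated_char("abba"))  # returns "a", not "b"
```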
Research has shown that approximately 65.4% of MBPP test instances can be traced to open-access websites. This finding, reported by Shi et al. (2024) in "On Leakage of Code Generation Evaluation Datasets" (Findings of EMNLP 2024), raises concerns that high-performing models may have encountered MBPP problems (or very similar ones) during pretraining, inflating their scores through memorization rather than genuine program synthesis ability. As models are trained on increasingly large web crawls, the risk of benchmark contamination grows, making it difficult to determine whether high MBPP scores reflect true generalization.
As of 2025, state-of-the-art models achieve pass@1 scores above 90% on MBPP, with some exceeding 91%. At these levels, the benchmark loses its ability to meaningfully differentiate between top-performing models. When most models cluster near the ceiling, small differences in scores may reflect noise or evaluation variance rather than meaningful capability gaps. This saturation has led researchers to develop more challenging alternatives.
The dataset is heavily skewed toward mathematical operations and list processing (77% of problems), with limited coverage of more complex programming patterns such as object-oriented design, file I/O, concurrency, error handling, database queries, or interaction with external libraries. This narrow scope means that strong MBPP performance does not necessarily translate to strong performance on real-world programming tasks.
MBPP evaluates only Python code generation. While Python is the most popular language for AI and data science, real-world software development involves many languages. The MultiPL-E project partially addresses this by translating MBPP problems to other languages, but the original benchmark remains Python-exclusive.
MBPP problem descriptions average only 15.7 words, which is far shorter than real-world programming specifications. This brevity may not adequately test a model's ability to understand complex, multi-paragraph requirements. The MHPP benchmark addresses this with descriptions averaging 150.2 words.
MBPP has played a foundational role in the development of the code generation field. Together with HumanEval, it established the standard evaluation framework (pass@k on function-level Python synthesis tasks) that nearly all subsequent code generation research has adopted. The benchmark's strengths lie in its size (974 problems provide significantly more statistical power than HumanEval's 164), its simplicity (making it accessible for rapid evaluation during model development), and its crowd-sourced nature (providing diverse problem formulations that reflect how non-experts describe programming tasks).
The benchmark has been cited thousands of times and is used as a standard evaluation in virtually every major code model release, from GPT-4 and Claude to open-source models like Code Llama, DeepSeek-Coder, StarCoder, and Qwen-Coder. It remains a required benchmark in competitive code model evaluations, even as the community has recognized its limitations and developed more rigorous successors.
The trajectory of MBPP scores over time provides a compelling illustration of progress in AI code generation: from roughly 60% with the largest models in 2021 to above 90% by 2025, a level of improvement driven by advances in model scale, training data curation, instruction tuning, and reinforcement learning from human feedback.
MBPP is freely available through several channels:
The official GitHub repository distributes two files: mbpp.jsonl (the full dataset, 974 entries, one JSON object per line) and sanitized-mbpp.json (the hand-verified 427-problem subset); both are mirrored on Hugging Face Datasets. The dataset is released under the CC-BY-4.0 license.
Several evaluation frameworks support MBPP out of the box:
| Framework | Maintainer | MBPP+ Support | Notes |
|---|---|---|---|
| EvalPlus | University of Illinois | Yes | Supports MBPP and MBPP+ with augmented test cases; pip install evalplus |
| BigCode Evaluation Harness | BigCode / Hugging Face | No | Standard MBPP evaluation for open-source code models |
| lm-evaluation-harness | EleutherAI | No | General-purpose LLM evaluation with MBPP task support |
When running evaluations, researchers should execute generated Python code in a sandboxed environment, as model-generated code could potentially be harmful.
| Benchmark | Year | Description | Relationship to MBPP |
|---|---|---|---|
| HumanEval | 2021 | 164 Python problems with function signatures and docstrings | Complementary benchmark; often reported alongside MBPP |
| APPS | 2021 | 10,000 coding problems from competitive programming sites | More difficult; tests broader range of skills |
| MBPP+ (EvalPlus) | 2023 | MBPP with 35x more test cases | Direct extension with stricter evaluation |
| BigCodeBench | 2024 | Practical programming tasks with library calls | Tests real-world coding beyond basic functions |
| MBPP Pro | 2024 | Self-invoking code generation tasks based on MBPP | Harder variant requiring compositional reasoning |
| MHPP | 2024 | Mostly Hard Python Problems with longer descriptions | Addresses MBPP's simplicity limitation |
| LiveCodeBench | 2024 | Continuously updated problems from coding contests | Addresses data contamination through temporal freshness |
| SWE-bench | 2024 | Real GitHub issues from open-source repositories | Tests end-to-end software engineering |
| MultiPL-E | 2023 | MBPP and HumanEval translated to 18+ languages | Extends MBPP to multilingual evaluation |