MLE-bench
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,576 words
MLE-bench is a benchmark for evaluating the machine learning engineering capabilities of autonomous AI agents, introduced by OpenAI's Preparedness team on October 9, 2024. It consists of 75 curated Kaggle competitions spanning tabular, image, natural language, and audio modalities, and is graded against the original Kaggle Private leaderboards using the same bronze/silver/gold medal thresholds used to rank human competitors.[1][2] By framing each task as an end-to-end ML engineering problem (read the competition description, prepare data, train a model, debug, iterate, and submit a CSV), MLE-bench was designed to be one of the first benchmarks that can plausibly forecast progress in autonomous ML research and development.[1]
In the original release, the strongest configuration tested, o1-preview combined with the AIDE scaffolding from Weco AI, achieved at least a bronze medal in 16.9% of competitions averaged over 16 seeds, with GPT-4o (AIDE) at 8.7%, Claude 3.5 Sonnet (AIDE) at 7.6%, and Llama 3.1 405B (AIDE) at 3.0%.[1][3] The paper additionally reports that pass@6 roughly doubles pass@1 performance, that giving GPT-4o 100 hours per competition raises its medal rate to 11.8%, and that obfuscating competition descriptions does not measurably change scores, evidence that pre-training contamination contributes only modestly to results.[1]
MLE-bench was accepted as an oral presentation at ICLR 2025 and is explicitly tied to OpenAI's Preparedness Framework, Anthropic's Responsible Scaling Policy, and Google DeepMind's Frontier Safety Framework as a candidate measurement of "ML R&D" capability, specifically the danger that a sufficiently capable model could automate parts of frontier AI research and thereby accelerate its own training.[1][4] Since the original release, an active community leaderboard at mlebench.com has tracked steady progress: by early 2026, agents built around Gemini 3 Pro Preview and Claude Opus 4.6 with bespoke scaffolding have pushed the headline medal rate above 60%.[5]
| Field | Value |
|---|---|
| Released | October 9, 2024 (arXiv 2410.07095); v2 February 26, 2025[1] |
| Authors | Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Madry (OpenAI Preparedness)[1] |
| Tasks | 75 Kaggle competitions (22 Low / 38 Medium / 15 High complexity), plus a 7-competition development split[1] |
| Subject | End-to-end machine learning engineering by autonomous agents |
| Headline metric | Percentage of competitions in which the agent earns at least a bronze medal on the original Kaggle Private leaderboard[1] |
| Repository | github.com/openai/mle-bench[2] |
| Venue | ICLR 2025 (Oral)[6] |
By mid-2024, language-model-based agents had begun saturating older code-generation benchmarks such as HumanEval and MBPP, with AgentCoder reaching 96.3% on HumanEval, and were making rapid progress on SWE-bench.[1] At the same time, machine learning engineering, the discipline of designing, training, debugging, and tuning models on real datasets, had no widely adopted analogue. Earlier efforts such as MLAgentBench (13 Kaggle and bespoke tasks scored against a baseline solution) and ML-Bench (interaction with existing ML repositories) covered narrow slices of the workflow but did not require agents to attempt full competition pipelines from scratch.[1]
The OpenAI Preparedness team framed MLE-bench around a specific risk concern: an AI system that can autonomously perform open-ended ML engineering work might also be able to improve its own training, alignment, or inference code, and therefore accelerate the development of more capable models. The paper notes that a model "capable of solving a large fraction of MLE-bench likely possesses the capability to execute many open-ended ML tasks" and that MLE-bench was designed in part as evidence for OpenAI's Preparedness Framework, for Anthropic's Responsible Scaling Policy, and for DeepMind's Frontier Safety Framework.[1] Co-author Lilian Weng had previously led safety systems at OpenAI, and Aleksander Madry was head of OpenAI's Preparedness team at the time of release.[1][7]
A second motivation was the desire for a benchmark with a natural human baseline. Because every MLE-bench task is a real Kaggle competition that thousands of practitioners have already attempted, every model submission can be scored on the original Private leaderboard and assigned the same bronze / silver / gold medal that a human would have earned. Only nine humans in Kaggle's history have ever earned medals across as many as 75 distinct competitions, giving the benchmark a meaningful ceiling.[1]
Starting from the 5,673 completed Kaggle competitions in the Meta Kaggle dataset, the authors removed Community Competitions (whose quality is not rigorously vetted), leaving 586 competitions for manual review. Each surviving competition was screened by at least two ML engineers from leading AI labs against nine criteria, including that the task require modern ML engineering (not simple tabular prediction or solved problems like MNIST), that the description be self-contained, that the evaluation metric be computable locally, that submissions be CSV files, and that licensing permit redistribution.[1] Each competition was then manually annotated with a problem category (image classification, text classification, tabular, image segmentation, audio, sequence-to-sequence, image regression, and others) and a complexity tier.
Complexity is assigned from the perspective of an experienced contemporary ML engineer and excludes the time spent training the final model:[1]
The repository also exposes a "Lite" subset consisting only of the 22 Low-complexity tasks (≈158 GB of data) for groups that cannot afford a full run.[2]
The 75 competitions span 15 problem categories, and notable inclusions are scientifically substantive challenges such as OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction and the Vesuvius Challenge for deciphering carbonized ancient scrolls.[1] The total prize money awarded across the 75 original Kaggle competitions sums to $1,948,016 (an average of $25,974 per competition), giving the benchmark a tangible link to commercial and scientific outcomes that humans were paid millions of dollars to solve.[1]
Because Kaggle generally does not release the hidden test labels after a competition closes, the MLE-bench team reconstructs new train/test splits from the public training data, typically holding out 10% as a new test set. They verify that the sample submission scores similarly on the new and original test sets, so that medal thresholds remain comparable to the original human leaderboard.[1] Snapshots of each competition's Private leaderboard were taken between May and August 2024.[1]
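The split reconstruction described above can be sketched as follows. This is a minimal illustration, not the repository's preparation code: the helper name `make_holdout_split`, the fixed seed, and the flat list of rows are all assumptions, and real competitions may need grouped or stratified splits.

```python
import random


def make_holdout_split(rows, test_fraction=0.1, seed=0):
    """Split rows into (new_train, new_test) with a seeded shuffle.

    Hypothetical helper: holds out a fraction of the public training
    data to serve as a new, locally gradable test set.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # deterministic for a given seed
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Illustrative data only: 1,000 labeled rows from a public training set.
rows = [{"id": i, "label": i % 2} for i in range(1000)]
new_train, new_test = make_holdout_split(rows)
```

Fixing the seed keeps the new test set identical across runs, which is what makes scores from different agents comparable against the same medal thresholds.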
Each agent runs inside an Ubuntu 20.04 Docker container on a secure cluster with 36 vCPUs, 440 GB of RAM, 4,095 GiB of SSD, and one 24 GB Nvidia A10 GPU. Agents have a wall-clock budget of 24 hours per competition and produce a final submission.csv that is graded by a local copy of the original competition's grading code.[1] All headline experiments are repeated with three random seeds (more for AIDE configurations), with results reported as mean ± standard error.
The benchmark also exposes a local validation server that an agent can call to check whether a candidate submission.csv is structurally valid, mirroring the way real Kaggle competitions allow up to five Public leaderboard submissions per day. Unlike Kaggle, the MLE-bench validation server only returns validity (not a score), and there is no daily cap on usage.[1]
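A structural check of this kind might look like the sketch below. It is not the benchmark's validation server: the function name `validate_submission` and the specific rules (matching header, matching row IDs, no empty cells) are assumptions chosen for illustration. Like the server described above, it reports only validity and never a score.

```python
import csv
import io


def validate_submission(submission_text, sample_text):
    """Return (is_valid, message) by comparing a submission to the sample.

    Hypothetical rules: the header must equal the sample's header, every
    row must be complete, and the first-column IDs must match the sample.
    Assumes the sample submission itself is non-empty.
    """
    sub = list(csv.reader(io.StringIO(submission_text)))
    sample = list(csv.reader(io.StringIO(sample_text)))
    if not sub:
        return False, "submission is empty"
    if sub[0] != sample[0]:
        return False, f"header mismatch: expected {sample[0]}, got {sub[0]}"
    width = len(sample[0])
    for row in sub[1:]:
        # Blank lines parse as empty rows and are rejected here too.
        if len(row) != width or any(cell.strip() == "" for cell in row):
            return False, "malformed or incomplete row"
    sub_ids = sorted(row[0] for row in sub[1:])
    sample_ids = sorted(row[0] for row in sample[1:])
    if sub_ids != sample_ids:
        return False, "row IDs do not match the sample submission"
    return True, "submission is valid"


# Illustrative inputs only.
sample = "id,target\n1,0\n2,0\n"
ok, message = validate_submission("id,target\n2,0.7\n1,0.1\n", sample)
```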
MLE-bench inherits Kaggle's medal thresholds, which depend on the number of teams that originally competed:[1]
| Teams | Bronze | Silver | Gold |
|---|---|---|---|
| 0-99 | Top 40% | Top 20% | Top 10% |
| 100-249 | Top 40% | Top 20% | Top 10 |
| 250-999 | Top 100 | Top 50 | Top 10 + 0.2%* |
| 1000+ | Top 10% | Top 5% | Top 10 + 0.2%* |
* "Top 10 + 0.2%" means that one additional gold medal is awarded for every 500 teams in the competition; for example, a 500-team competition awards gold to the top 11 teams.
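The table can be turned into rank cutoffs as in the sketch below. Two things are assumptions rather than facts from the paper: percentage cutoffs are rounded up, and "Top 10 + 0.2%" is read literally as ten plus 0.2% of the team count, rounded down. The function names are illustrative, and the benchmark's own grading code works from leaderboard scores and may round differently.

```python
def ceil_percent(num_teams, percent):
    """Smallest integer >= num_teams * percent / 100, using integer math."""
    return (num_teams * percent + 99) // 100


def medal_rank_cutoffs(num_teams):
    """Return the worst leaderboard rank that still earns each medal."""
    extra_gold = num_teams // 500  # 0.2% of teams, rounded down (assumption)
    if num_teams < 100:
        return {"gold": ceil_percent(num_teams, 10),
                "silver": ceil_percent(num_teams, 20),
                "bronze": ceil_percent(num_teams, 40)}
    if num_teams < 250:
        return {"gold": 10,
                "silver": ceil_percent(num_teams, 20),
                "bronze": ceil_percent(num_teams, 40)}
    if num_teams < 1000:
        return {"gold": 10 + extra_gold, "silver": 50, "bronze": 100}
    return {"gold": 10 + extra_gold,
            "silver": ceil_percent(num_teams, 5),
            "bronze": ceil_percent(num_teams, 10)}


def medal_for_rank(rank, num_teams):
    """Map a 1-indexed rank to 'gold', 'silver', 'bronze', or None."""
    cutoffs = medal_rank_cutoffs(num_teams)
    for medal in ("gold", "silver", "bronze"):
        if rank <= cutoffs[medal]:
            return medal
    return None
```

Under these assumptions, a 5,000-team competition awards gold to the top 20 ranks, silver to the top 250, and bronze to the top 500.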
The headline metric is Any Medal (%), the percentage of attempts that earn at least a bronze. Unlike Kaggle, MLE-bench applies medal thresholding to all 75 competitions, even those that did not originally award medals, so that a single number describes overall performance.[1] The paper also reports raw scores per competition, valid submission rates, and the percentage of attempts scoring above the median of the human leaderboard.
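As a rough sketch of how such a headline number can be aggregated, the snippet below computes a per-seed medal rate and then a mean and standard error across seeds. The function name `any_medal_summary`, the input layout, and the choice to take the standard error over per-seed rates are assumptions; the toy data is not from the paper.

```python
import math
import statistics


def any_medal_summary(medals_by_seed):
    """Return (mean, standard_error) of the per-seed Any Medal rate in percent.

    medals_by_seed: one list of booleans per seed, where True means the
    attempt on that competition earned at least a bronze medal.
    """
    rates = [100.0 * sum(run) / len(run) for run in medals_by_seed]
    mean = statistics.mean(rates)
    if len(rates) < 2:
        return mean, 0.0  # a standard error needs at least two seeds
    sem = statistics.stdev(rates) / math.sqrt(len(rates))
    return mean, sem


# Toy example: three seeds over four competitions each.
runs = [
    [True, False, False, False],   # 25%
    [True, True, False, False],    # 50%
    [False, False, False, False],  # 0%
]
mean, sem = any_medal_summary(runs)
print(f"Any medal: {mean:.1f} ± {sem:.1f} %")
```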
MLE-bench is agnostic to the agent architecture; only a CSV submission is required. The authors evaluated three open-source scaffolds:[1]
The authors made modest modifications to each scaffold to improve its performance and note that small implementation details matter substantially: MLAB and OpenHands sometimes terminated within minutes despite the 24-hour budget, while AIDE persistently re-prompted the model to improve its current best solution.[1]
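The difference between stopping early and using the whole budget can be illustrated with a simplified greedy loop. This is not AIDE's actual algorithm, which is more elaborate; `improve_until_budget`, `propose`, and `evaluate` are hypothetical names, and in a real scaffold `propose` would be a model call and `evaluate` a validation score.

```python
import time


def improve_until_budget(initial, propose, evaluate, budget_seconds,
                         max_steps=None, clock=time.monotonic):
    """Keep proposing revisions of the best candidate until the budget ends.

    Returns (best_candidate, best_score, steps_taken). Higher scores are
    treated as better.
    """
    best, best_score = initial, evaluate(initial)
    deadline = clock() + budget_seconds
    steps = 0
    while clock() < deadline and (max_steps is None or steps < max_steps):
        candidate = propose(best)        # revise the current best solution
        score = evaluate(candidate)
        if score > best_score:           # keep only strict improvements
            best, best_score = candidate, score
        steps += 1
    return best, best_score, steps


# Toy stand-ins: the "solution" is an integer and the optimum is 5.
best, best_score, steps = improve_until_budget(
    initial=0,
    propose=lambda x: x + 1,
    evaluate=lambda x: -(x - 5) ** 2,
    budget_seconds=60,
    max_steps=20,
)
```

The loop never exits just because one proposal failed to improve the score; it stops only when the time or step budget is exhausted.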
Agents are forbidden from writing predictions directly to the submission file from their own knowledge (the prediction must come from a trained model) and from viewing Kaggle solutions online.[1] To enforce these rules, the benchmark ships two automated checks:
Across all medal-winning runs, the plagiarism detector and human reviewers found no confirmed violations; the small number of automated flags were judged false positives after manual inspection.[1]
The single strongest configuration in the launch paper was o1-preview with AIDE, which earned at least a bronze in 16.9% ± 1.1% of the 75 competitions averaged over 16 seeds, with an average of 7 gold medals per run.[1] Kaggle's progression rules require five gold medals for a Grandmaster title, so a single MLE-bench run with this agent would, on its raw output, satisfy that threshold. The paper cautions, however, that the modified train/test splits, the application of medal thresholds to non-medal competitions, and the use of more recent technology than the original competitors had all complicate the comparison.[1]
The October 2024 results across model–scaffold pairings were:[1]
| Scaffold | Model | Valid submission (%) | Any medal (%) |
|---|---|---|---|
| AIDE | o1-preview | 82.8 ± 1.1 | 16.9 ± 1.1 |
| AIDE | GPT-4o (2024-08-06) | 54.9 ± 1.0 | 8.7 ± 0.5 |
| AIDE | Claude 3.5 Sonnet (20240620) | 51.1 ± 3.3 | 7.6 ± 1.8 |
| AIDE | Llama 3.1 405B Instruct | 27.3 ± 2.6 | 3.0 ± 1.0 |
| MLAB | GPT-4o | 44.3 ± 2.6 | 0.8 ± 0.5 |
| OpenHands | GPT-4o | 52.0 ± 3.3 | 4.4 ± 1.4 |
Gemini 1.5 Pro was attempted but excluded from the table because its API repeatedly blocked completions due to "recitation" flags on competition-related outputs.[1]
Three resource-scaling sweeps were reported:[1]
Qualitative analysis identified consistent agent failure modes: agents frequently produced invalid submission files despite having access to a validation server, often failed to budget compute and time appropriately (overloading disk or RAM), and tended to "give up" when faced with novel error states rather than persist through debugging. AIDE's tree-search loop kept models engaged for the full budget, but MLAB and OpenHands often terminated early.[1]
A central methodological concern is that frontier models may have memorized either Kaggle competition data or the winning discussions and notebooks shared by top finishers. The authors ran two complementary tests on GPT-4o:[1]
Combined with the plagiarism detector findings, these results suggest that memorization is not driving GPT-4o's scores on the initial release. The paper is explicit, however, that this does not rule out subtler contamination effects, particularly for newer models trained on more Kaggle material, and that "future work may seek to regularly update MLE-bench with new Kaggle competitions to stay ahead of contamination issues."[1]
Results on the benchmark are tracked on a community-run leaderboard at mlebench.com that aggregates submissions from labs and independent researchers; the site explicitly notes it "is not associated with OpenAI in any way."[5] By May 2026, the leaderboard tracked roughly two dozen submissions, ranging from the original AIDE/o1-preview baseline to more advanced scaffolds layered on top of newer models.
Reported headline numbers from late 2025 and early 2026 include:[5][8][9]
The leaderboard temporarily paused new public submissions on April 24, 2026, while the maintainers worked on stronger fairness and reproducibility checks, citing concerns about comparability between submissions using very different scaffolding, internet access policies, and compute budgets.[5][2]
A parallel line of research produced MLE-Dojo, which converts MLE-bench's static evaluation into an interactive, Gym-style reinforcement-learning environment supporting 200+ Kaggle competitions (incorporating 68 from MLE-bench, 74 from DSBench, and 75 newly scraped tasks) and adds a HumanRank score normalized to the original Kaggle leaderboard.[10]
The MLE-bench paper explicitly positions the benchmark as an evidence source for ML R&D risk evaluations across the major frontier labs:[1]
OpenAI's safety evaluations hub and the Preparedness v2 document continue to reference MLE-bench as one of the standardized evaluations the company runs ahead of deploying frontier systems, alongside SWE-bench Verified, AgentBench-style multi-step tool use, and tasks drawn from METR's autonomy suite.[4][11] The framing is that "if a model can succeed on a large fraction of MLE-bench, it is plausible that the same model could execute the core ML engineering steps needed to improve frontier training pipelines," which would warrant heightened safety and security mitigations.[1][4]
| Benchmark | Domain | Task length | Comparison with MLE-bench |
|---|---|---|---|
| SWE-bench | Real-world GitHub bug fixes | Minutes-hours | Tests software engineering on existing codebases; MLE-bench tests open-ended ML engineering from scratch.[1] |
| MLAgentBench | 13 mixed ML tasks | Bounded | Provides baseline solutions and measures relative improvement; MLE-bench requires from-scratch attempts on 75 tasks.[1] |
| RE-Bench (METR, 2024) | 7 frontier-style ML research engineering environments | 2-32 hours | Targets frontier AI R&D capabilities (e.g., custom CUDA kernels, restricted-architecture training). Where MLE-bench measures classical ML engineering on Kaggle competitions, RE-Bench measures more novel research tasks where solutions are not freely available online.[12] |
| DSBench | Kaggle-style data science | Variable | Concurrent with MLE-bench but filters competitions to fit an automated template; MLE-bench includes more diverse and non-standard tasks.[1] |
| METR autonomy suite | Long-horizon agentic tasks | Up to days | METR explicitly frames MLE-bench, RE-Bench, and similar benchmarks as complementary tools for tracking when AI agents will match human researchers on multi-week R&D projects.[12] |
| ARC-AGI / GAIA | General reasoning / assistant tasks | Short | Test general cognition rather than long-horizon ML engineering.[1] |
METR has argued that MLE-bench is best read alongside RE-Bench rather than as a substitute: top MLE-bench solutions exist online and require less novel exploration, whereas RE-Bench's seven environments are intentionally designed to admit no public solutions and to require genuine experimentation.[12]
Coverage in VentureBeat, DeepLearning.AI's The Batch, MarkTechPost, and a number of industry outlets highlighted MLE-bench as one of the first benchmarks to evaluate autonomous ML engineering at scale, and it was selected as an oral presentation at ICLR 2025.[6][3]
Several recurring criticisms have been raised in the literature and in commentary:[1][12][9]
Despite these caveats, MLE-bench has been broadly adopted: by mid-2026 it had been incorporated into OpenAI's Safety Evaluations Hub, referenced in OpenAI's Preparedness Framework v2, extended by community projects such as MLE-Dojo and MLE-STAR, and used as a workhorse evaluation for new frontier models from OpenAI, Anthropic, and Google DeepMind.[4][11][9][10]