MLE-bench

Updated Jun 27, 2026

MLE-bench is a benchmark for measuring how well autonomous AI agents perform real-world machine learning engineering, released by OpenAI's Preparedness team on October 9, 2024 (arXiv:2410.07095). It consists of 75 curated Kaggle competitions and grades each agent's submission against the original Kaggle Private leaderboards using the same bronze, silver, and gold medal thresholds used to rank human competitors. In OpenAI's own words, "we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments," and "the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions."^[1]

By framing each task as an end-to-end ML engineering problem (read the competition description, prepare data, train a model, debug, iterate, and submit a CSV), MLE-bench was designed to be one of the first benchmarks that can plausibly forecast progress in autonomous ML research and development.^[1] The benchmark code is open-source at github.com/openai/mle-bench, and the paper was accepted as an oral presentation at ICLR 2025.^[1]^[2]^[6]

In the original release, the strongest configuration tested, o1-preview combined with the AIDE scaffolding from Weco AI, achieved at least a bronze medal in 16.9% of competitions averaged over 16 seeds, with GPT-4o (AIDE) at 8.7%, Claude 3.5 Sonnet (AIDE) at 7.6%, and Llama 3.1 405B (AIDE) at 3.0%.^[1]^[3] The paper additionally reports three findings: allowing multiple attempts helps substantially (pass@6 roughly doubles pass@1, and o1-preview reaches 34.1% at pass@8); giving GPT-4o 100 hours per competition raises its medal rate to 11.8%; and obfuscating competition descriptions does not measurably change scores, which the authors take as evidence that pre-training contamination contributes only modestly to results.^[1]

MLE-bench is explicitly tied to OpenAI's Preparedness Framework, Anthropic's Responsible Scaling Policy, and Google DeepMind's Frontier Safety Framework as a candidate measurement of "ML R&D" capability, specifically the danger that a sufficiently capable model could automate parts of frontier AI research and thereby accelerate its own training.^[1]^[4] Since the original release, an active community leaderboard at mlebench.com has tracked steady progress: by early 2026, agents built around Gemini 3 Pro Preview and Claude Opus 4.6 with bespoke scaffolding had pushed the headline medal rate above 60%.^[5]

Key facts

Field	Value
Released	October 9, 2024 (arXiv 2410.07095); v2 February 26, 2025^[1]
Authors	Chan Jun Shern, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Madry (OpenAI Preparedness)^[1]
Tasks	75 Kaggle competitions (22 Low / 38 Medium / 15 High complexity), plus a 7-competition development split^[1]
Subject	End-to-end machine learning engineering by autonomous agents
Headline metric	Percentage of competitions in which the agent earns at least a bronze medal on the original Kaggle Private leaderboard^[1]
Best result at release	o1-preview + AIDE, 16.9% any-medal rate^[1]
Repository	github.com/openai/mle-bench (open-source)^[2]
Venue	ICLR 2025 (Oral)^[6]

What is MLE-bench?

MLE-bench is OpenAI's benchmark for evaluating the machine learning engineering capabilities of autonomous AI agents. Each of its 75 tasks is a real Kaggle competition spanning tabular, image, natural language, and audio modalities, and an agent must complete the full pipeline (read the competition description, prepare data, train a model, debug, iterate, and submit a predictions CSV) without human help.^[1] Submissions are graded against the original Kaggle Private leaderboards using the same bronze, silver, and gold medal thresholds applied to human competitors, which gives the benchmark a natural human baseline that most agent evaluations lack.^[1] MLE-bench is therefore a form of agent evaluation aimed specifically at the kind of open-ended ML R&D work that frontier labs treat as a safety-relevant capability.^[1]^[4]

Background and motivation

By mid-2024, language-model-based agents had begun saturating older code-generation benchmarks such as HumanEval and MBPP, with AgentCoder reaching 96.3% on HumanEval, and were making rapid progress on SWE-bench.^[1] At the same time, machine learning engineering, the discipline of designing, training, debugging, and tuning models on real datasets, had no widely adopted analogue. Earlier efforts such as MLAgentBench (13 Kaggle and bespoke tasks scored against a baseline solution) and ML-Bench (interaction with existing ML repositories) covered narrow slices of the workflow but did not require agents to attempt full competition pipelines from scratch.^[1]

The OpenAI Preparedness team framed MLE-bench around a specific risk concern: an AI system that can autonomously perform open-ended ML engineering work might also be able to improve its own training, alignment, or inference code, and therefore accelerate the development of more capable models. The paper notes that a model "capable of solving a large fraction of MLE-bench likely possesses the capability to execute many open-ended ML tasks" and that MLE-bench was designed in part as evidence for OpenAI's Preparedness Framework, for Anthropic's Responsible Scaling Policy, and for DeepMind's Frontier Safety Framework.^[1] Co-author Lilian Weng had previously led safety systems at OpenAI, and Aleksander Madry was head of OpenAI's Preparedness team at the time of release.^[1]^[7]

A second motivation was the desire for a benchmark with a natural human baseline. Because every MLE-bench task is a real Kaggle competition that thousands of practitioners have already attempted, every model submission can be scored on the original Private leaderboard and assigned the same bronze / silver / gold medal that a human would have earned. Only nine humans in Kaggle's history have ever earned medals across as many as 75 distinct competitions, giving the benchmark a meaningful ceiling.^[1]

What is in the dataset?

Selection procedure

Starting from the 5,673 completed Kaggle competitions in the Meta Kaggle dataset, the authors removed Community Competitions (whose quality is not rigorously vetted), leaving 586 competitions for manual review. Each surviving competition was screened by at least two ML engineers from leading AI labs against nine criteria, including that the task require modern ML engineering (not simple tabular prediction or solved problems like MNIST), that the description be self-contained, that the evaluation metric be computable locally, that submissions be CSV files, and that licensing permit redistribution.^[1] Each competition was then manually annotated with a problem category (image classification, text classification, tabular, image segmentation, audio, sequence-to-sequence, image regression, and others) and a complexity tier.

Complexity tiers

Complexity is assigned from the perspective of an experienced contemporary ML engineer and excludes the time spent training the final model:^[1]

Low complexity (22 competitions, ~29%): solvable in under two hours of engineering work
Medium complexity (38 competitions, ~51%): two to ten hours of work
High complexity (15 competitions, ~20%): more than ten hours of work

The repository also exposes a "Lite" subset consisting only of the 22 Low-complexity tasks (about 158 GB of data) for groups that cannot afford a full run.^[2]

Topical coverage and economic stakes

The 75 competitions span 15 problem categories, and notable inclusions are scientifically substantive challenges such as OpenVaccine: COVID-19 mRNA Vaccine Degradation Prediction and the Vesuvius Challenge for deciphering carbonized ancient scrolls.^[1] The total prize money awarded across the 75 original Kaggle competitions sums to $1,948,016 (an average of $25,974 per competition), giving the benchmark a tangible link to commercial and scientific outcomes that humans were paid millions of dollars to solve.^[1]

Train/test split reconstruction

Because Kaggle generally does not release the hidden test labels after a competition closes, the MLE-bench team reconstructs new train/test splits from the public training data, typically holding out 10% as a new test set. They verify that the sample submission scores similarly on the new and original test sets, so that medal thresholds remain comparable to the original human leaderboard.^[1] Snapshots of each competition's Private leaderboard were taken between May and August 2024.^[1]
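The re-splitting step can be illustrated with a minimal sketch. It assumes a simple seeded random holdout; the function name `resplit`, the row format, and the seed are hypothetical, and the actual benchmark prepares each competition with its own logic.

```python
import random

def resplit(rows, test_fraction=0.10, seed=0):
    """Shuffle labelled rows and hold out a fraction as a new test set.

    Hypothetical helper: a plain random holdout, not the benchmark's own code.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-in for a competition's public training file.
rows = [{"id": i, "label": i % 2} for i in range(1000)]
train, test = resplit(rows)

# The agent would see only the test features; the labels stay with the grader.
test_features = [{"id": r["id"]} for r in test]
answers = {r["id"]: r["label"] for r in test}
```

Keeping `answers` separate from `test_features` mirrors the way the hidden labels are used only for grading.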

How does MLE-bench work?

Execution environment

Each agent runs inside an Ubuntu 20.04 Docker container on a secure cluster with 36 vCPUs, 440 GB of RAM, 4,095 GiB of SSD, and one 24 GB Nvidia A10 GPU. Agents have a wall-clock budget of 24 hours per competition and produce a final submission.csv that is graded by a local copy of the original competition's grading code.^[1] All headline experiments are repeated with three random seeds (more for AIDE configurations), with results reported as mean +/- standard error.

The benchmark also exposes a local validation server that an agent can call to check whether a candidate submission.csv is structurally valid, mirroring the way real Kaggle competitions allow up to five Public leaderboard submissions per day. Unlike Kaggle, the MLE-bench validation server only returns validity (not a score), and there is no daily cap on usage.^[1]
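A validity-only check of this kind might look like the sketch below. The specific checks (header, row count, ID set) and the function name `check_submission` are assumptions made for illustration; the text does not specify what the real validation server inspects.

```python
import csv
import io

def check_submission(submission_csv, sample_csv):
    """Return (is_valid, message) without revealing any score.

    Hypothetical checks: compares structure against the sample submission.
    """
    sub = [r for r in csv.reader(io.StringIO(submission_csv)) if r]
    sample = [r for r in csv.reader(io.StringIO(sample_csv)) if r]
    if not sub:
        return False, "submission is empty"
    if sub[0] != sample[0]:
        return False, f"expected header {sample[0]}, got {sub[0]}"
    if len(sub) != len(sample):
        return False, f"expected {len(sample) - 1} rows, got {len(sub) - 1}"
    if {r[0] for r in sub[1:]} != {r[0] for r in sample[1:]}:
        return False, "row ids do not match the sample submission"
    return True, "submission is valid"

sample = "id,label\n1,0\n2,0\n"
ok, message = check_submission("id,label\n1,1\n2,0\n", sample)
```

Because the function returns only a boolean and a message, an agent cannot use it to hill-climb on the test set, which is the property the paragraph above describes.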

Scoring: medals and the Any Medal metric

MLE-bench inherits Kaggle's medal thresholds, which depend on the number of teams that originally competed:^[1]

Teams	Bronze	Silver	Gold
0-99	Top 40%	Top 20%	Top 10%
100-249	Top 40%	Top 20%	Top 10
250-999	Top 100	Top 50	Top 10 + 0.2%*
1000+	Top 10%	Top 5%	Top 10 + 0.2%*

* Top 10 + 0.2% means one additional gold medal is awarded for every 500 teams in the competition; for example, a competition with 500 teams awards gold to the top 11.
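The threshold table can be expressed as a small function. This is a sketch under stated assumptions: percentage cutoffs are rounded up to a whole rank, and the 0.2% term is treated as one extra gold rank per 500 teams in the competition, rounded down. The function names are hypothetical, and the sketch works with ranks, whereas a grader would compare scores against the leaderboard.

```python
def _ceil_pct(num_teams, pct):
    """Smallest whole rank covering pct percent of the field (rounds up)."""
    return (num_teams * pct + 99) // 100

def medal_cutoffs(num_teams):
    """Return the worst leaderboard rank (1 = best) that still earns each medal."""
    gold_scaled = 10 + num_teams // 500  # "Top 10 + 0.2%"
    if num_teams < 100:
        gold, silver, bronze = (_ceil_pct(num_teams, p) for p in (10, 20, 40))
    elif num_teams < 250:
        gold, silver, bronze = 10, _ceil_pct(num_teams, 20), _ceil_pct(num_teams, 40)
    elif num_teams < 1000:
        gold, silver, bronze = gold_scaled, 50, 100
    else:
        gold, silver, bronze = gold_scaled, _ceil_pct(num_teams, 5), _ceil_pct(num_teams, 10)
    return {"gold": gold, "silver": silver, "bronze": bronze}

def medal_for_rank(rank, num_teams):
    """Return 'gold', 'silver', 'bronze', or None for a given rank."""
    for medal, cutoff in medal_cutoffs(num_teams).items():
        if rank <= cutoff:
            return medal
    return None

print(medal_cutoffs(500))    # {'gold': 11, 'silver': 50, 'bronze': 100}
print(medal_cutoffs(2000))   # {'gold': 14, 'silver': 100, 'bronze': 200}
print(medal_for_rank(12, 500))  # silver
```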

The headline metric is Any Medal (%), the percentage of attempts that earn at least a bronze. Unlike Kaggle, MLE-bench applies medal thresholding to all 75 competitions, even those that did not originally award medals, so that a single number describes overall performance.^[1] The paper also reports raw scores per competition, valid submission rates, and the percentage of attempts scoring above the median of the human leaderboard.
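As a rough illustration of how a headline number of the form "mean +/- standard error" could be produced, the sketch below computes an any-medal percentage per seed and then aggregates across seeds. The per-seed aggregation order and the toy data are assumptions, not taken from the paper.

```python
import statistics

def any_medal_rate(medals):
    """Percentage of competitions with any medal; None marks a run with no medal."""
    return 100.0 * sum(m is not None for m in medals) / len(medals)

def mean_and_stderr(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    mean = statistics.mean(values)
    stderr = statistics.stdev(values) / len(values) ** 0.5
    return mean, stderr

# Hypothetical results: one list per seed, one entry per competition.
seeds = [
    ["gold", None, "bronze", None],
    [None, None, "silver", None],
    ["bronze", None, "bronze", "gold"],
]
per_seed = [any_medal_rate(s) for s in seeds]   # [50.0, 25.0, 75.0]
mean, stderr = mean_and_stderr(per_seed)
print(f"{mean:.1f} +/- {stderr:.1f}")
```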

Scaffolding

MLE-bench is agnostic to the agent architecture; only a CSV submission is required. The authors evaluated three open-source scaffolds:^[1]

AIDE (Schmidt et al., 2024, Weco AI): a tree search over candidate solutions, purpose-built for Kaggle-style competitions. AIDE proposes a draft solution, executes it, and iteratively improves or debugs nodes until the time budget or a 500-node cap is reached.
MLAB / ResearchAgent (from MLAgentBench): a general-purpose tool-use scaffold.
OpenHands / CodeActAgent (the OpenHands platform): another general-purpose tool-use scaffold.

The paper modestly modifies each scaffold to enhance its performance and notes that small implementation details matter substantially: MLAB and OpenHands sometimes terminated within minutes despite the 24-hour budget, while AIDE persistently re-prompted the model to improve its current best solution.^[1]
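The draft-then-refine loop described for AIDE can be sketched as a greedy search over a tree of candidate solutions. This is an illustrative toy, not AIDE's implementation: `propose` and `evaluate` are hypothetical stand-ins for the LLM call and the script execution, the wall-clock budget is omitted, and debugging of failed nodes is simplified to drafting or refining again.

```python
import random
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    solution: Any                     # in a real agent: the text of a script
    score: Optional[float]            # validation score, or None if the run failed
    parent: Optional["Node"] = None

def tree_search(propose, evaluate, max_nodes=500, num_drafts=3):
    """Draft a few root solutions, then keep refining the best-scoring node."""
    nodes = []
    while len(nodes) < max_nodes:
        scored = [n for n in nodes if n.score is not None]
        if len(nodes) < num_drafts or not scored:
            parent = None                                   # start a fresh draft
        else:
            parent = max(scored, key=lambda n: n.score)     # refine current best
        solution = propose(parent)
        nodes.append(Node(solution, evaluate(solution), parent))
    scored = [n for n in nodes if n.score is not None]
    best = max(scored, key=lambda n: n.score) if scored else None
    return best, nodes

rng = random.Random(0)

def toy_propose(parent):
    # Stand-in for an LLM call: a "solution" here is just a number.
    if parent is None:
        return rng.uniform(-10, 10)
    return parent.solution + rng.gauss(0, 1)

def toy_evaluate(solution):
    # Stand-in for running the script; pretend one run in five crashes.
    if rng.random() < 0.2:
        return None
    return -(solution - 3.0) ** 2

best, nodes = tree_search(toy_propose, toy_evaluate, max_nodes=50)
```

The loop only stops when the node cap is hit, which reflects the persistence the paragraph above attributes to AIDE; a scaffold that lets the model decide when to stop can end far earlier.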

Rules and integrity safeguards

Agents are forbidden from writing predictions directly to the submission file from their own knowledge (the prediction must come from a trained model) and from viewing Kaggle solutions online.^[1] To enforce these rules, the benchmark ships two automated checks:

A rule-breaking detector powered by GPT-4o mini that inspects agent logs for evidence of manual labeling, external LLM API calls, or unauthorized data access.
A plagiarism detector using the open-source Dolos tool that compares submitted code against the top 50 public notebooks for each competition; submissions with similarity above 60% are flagged for human review.^[1]

Across all medal-winning runs, the plagiarism detector and human reviewers found no confirmed violations; the small number of automated flags were judged false positives after manual inspection.^[1]
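The thresholding logic of the plagiarism check can be sketched as follows. Note that this swaps in Python's `difflib.SequenceMatcher` as a stand-in similarity measure; Dolos uses its own fingerprint-based comparison, so the scores here are not comparable to its output. The function names are hypothetical.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.60  # the 60% cutoff quoted above

def max_similarity(submission, notebooks):
    """Highest pairwise similarity between the submission and any notebook."""
    return max(
        (SequenceMatcher(None, submission, nb).ratio() for nb in notebooks),
        default=0.0,
    )

def needs_review(submission, notebooks, threshold=SIMILARITY_THRESHOLD):
    """True if the submission should be sent to a human reviewer."""
    return max_similarity(submission, notebooks) > threshold

public_notebooks = [
    "import pandas as pd\ndf = pd.read_csv('train.csv')\n",
    "import numpy as np\nx = np.zeros(3)\n",
]
copied = "import pandas as pd\ndf = pd.read_csv('train.csv')\n"
original = "from sklearn.svm import SVC\nclf = SVC()\n"
flags = [needs_review(copied, public_notebooks), needs_review(original, public_notebooks)]
```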

How do AI agents score on MLE-bench?

Best configuration

The single strongest configuration in the launch paper was o1-preview with AIDE, which earned at least a bronze in 16.9% +/- 1.1% of the 75 competitions averaged over 16 seeds, achieving an average of 7 gold medals per run.^[1] By Kaggle's own progression rules, a Grandmaster title requires five gold medals, so a single MLE-bench run with this agent would, on its raw output, satisfy that threshold. The paper is careful to note, however, that three factors complicate the comparison: the modified train/test splits, the application of medal thresholds to competitions that did not award medals, and the agents' access to more recent technology than the original competitors had.^[1]

Full main table

The October 2024 results across model-scaffold pairings were:^[1]

Scaffold	Model	Valid submission	Any medal (%)
AIDE	o1-preview	82.8 +/- 1.1	16.9 +/- 1.1
AIDE	GPT-4o (2024-08-06)	54.9 +/- 1.0	8.7 +/- 0.5
AIDE	Claude 3.5 Sonnet (20240620)	51.1 +/- 3.3	7.6 +/- 1.8
AIDE	Llama 3.1 405B Instruct	27.3 +/- 2.6	3.0 +/- 1.0
MLAB	GPT-4o	44.3 +/- 2.6	0.8 +/- 0.5
OpenHands	GPT-4o	52.0 +/- 3.3	4.4 +/- 1.4

Gemini 1.5 Pro was attempted but excluded from the table because its API repeatedly blocked completions due to "recitation" flags on competition-related outputs.^[1]

Resource scaling experiments

Three resource-scaling sweeps were reported:^[1]

Pass@k: For both GPT-4o (AIDE) and o1-preview (AIDE), pass@6 roughly doubled pass@1. GPT-4o (AIDE) at pass@6 reached 17.0%, comparable to o1-preview (AIDE) at pass@1. Pass@8 with o1-preview (AIDE) climbed to 34.1%.
Hardware: Adding a second A10 GPU did not measurably help GPT-4o (AIDE), which earned 10.2% with two GPUs versus 8.7% with one and 9.1% with no GPU. Manual log review found that GPT-4o (AIDE) made no attempt to use the extra GPU.
Time: Extending GPT-4o (AIDE) to 100 hours per competition (with the node cap raised from 500 to 5,000) increased the medal rate to 11.8%, with most additional medals earned within the first few hours and only marginal gains thereafter.
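For readers unfamiliar with pass@k, the sketch below uses the standard unbiased estimator from the code-generation literature: given n attempts on a competition of which c earned a medal, it gives the probability that at least one of k attempts drawn without replacement is a medal. Whether the paper computes its figures in exactly this way is an assumption, and the medal counts are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """P(at least one success among k of the n attempts, c of which succeeded)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(medal_counts, n, k):
    """Average pass@k over competitions, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for c in medal_counts) / len(medal_counts)

# Hypothetical: 4 competitions, 4 seeds each, number of medal-winning seeds per competition.
medal_counts = [0, 4, 1, 0]
p1 = benchmark_pass_at_k(medal_counts, n=4, k=1)   # 31.25
p2 = benchmark_pass_at_k(medal_counts, n=4, k=2)   # 37.5
```

In this toy case the gain from k=1 to k=2 comes entirely from the competition that medals only occasionally, which is the same mechanism that lets pass@6 outpace pass@1 in the sweep above.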

Failure modes

Qualitative analysis identified consistent agent failure modes: agents frequently produced invalid submission files despite having access to a validation server, often failed to budget compute and time appropriately (overloading disk or RAM), and tended to "give up" when faced with novel error states rather than persist through debugging. AIDE's tree-search loop kept models engaged for the full budget, but MLAB and OpenHands often terminated early.^[1]

Does pre-training contamination affect the results?

A central methodological concern is that frontier models may have memorized either Kaggle competition data or the winning discussions and notebooks shared by top finishers. The authors ran two complementary tests on GPT-4o:^[1]

Familiarity correlation. Per-token familiarity (mean conditional probability) was measured against each competition's main page and its five most popular discussion posts. The familiarity score was then correlated with the model's performance; no positive correlation was observed.
Obfuscation. All 75 competition descriptions were manually rewritten to remove references to Kaggle and the original competition name while preserving the technical content. GPT-4o (AIDE) scored 8.4% +/- 1.0% on the obfuscated descriptions versus 8.5% +/- 0.6% on the originals, statistically indistinguishable.
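The claim that the two obfuscation scores are statistically indistinguishable can be checked with a back-of-the-envelope calculation. The sketch assumes the two estimates are independent and approximately normal, which the text does not state.

```python
from math import sqrt

def z_score(mean_a, se_a, mean_b, se_b):
    """z statistic for the difference of two independent estimates."""
    return (mean_a - mean_b) / sqrt(se_a ** 2 + se_b ** 2)

# Original vs. obfuscated descriptions, GPT-4o (AIDE), figures from the text.
z = z_score(8.5, 0.6, 8.4, 1.0)
print(round(z, 2))  # about 0.09, far below the ~1.96 needed at the 5% level
```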

Combined with the plagiarism detector findings, these results suggest that memorization is not driving GPT-4o's scores on the initial release. The paper is explicit, however, that this does not rule out subtler contamination effects, particularly for newer models trained on more Kaggle material, and that "future work may seek to regularly update MLE-bench with new Kaggle competitions to stay ahead of contamination issues."^[1]

How has performance improved since release?

Results are tracked on a community-run leaderboard at mlebench.com that aggregates submissions from labs and independent researchers; the site explicitly notes it "is not associated with OpenAI in any way."^[5] By May 2026, the leaderboard tracked roughly two dozen submissions, ranging from the original AIDE/o1-preview baseline to more advanced scaffolds layered on top of newer models.

Reported headline numbers from late 2025 and early 2026 include:^[5]^[8]^[9]

AIDE + o1-preview (original baseline, 24 h): 16.9% to 17.1% across re-runs.
AIDE + o3 and related o-series successors: substantial improvements over o1-preview, with the AIRA-dojo configuration reportedly reaching ~31.6%.
AIRA-dojo + o1-preview (Meta's improved operator set with scoped memory and explicit "Think Tokens"): reported to raise the medal rate of the AIDE-greedy / o1-preview baseline from ~35% to 45.9%, roughly a 30% relative gain.^[8]
MLE-STAR (Google research, Gemini 2.5 / Gemini-2 backbones): introduced web retrieval and ablation-guided refinement, reaching 43.9% in early experiments and pushing further when paired with Gemini 2.5 Pro.^[9]
AIRA + MCTS/evolutionary search: reported to reach 47.7% on the headline metric.^[9]
Community leaderboard top entries (early 2026): Famou-Agent 2.0 on Gemini 3 Pro Preview at 64.44%, AIBuildAI on Claude Opus 4.6 at 63.11%, and CAIR MARS+ on Gemini 3 Pro Preview at 62.67%.^[5]

The leaderboard temporarily paused new public submissions on April 24, 2026, while the maintainers worked on stronger fairness and reproducibility checks, citing concerns about comparability between submissions using very different scaffolding, internet access policies, and compute budgets.^[5]^[2]

A parallel research thread launched MLE-Dojo, which converts MLE-bench's static evaluation into an interactive, Gym-style reinforcement-learning environment supporting 200+ Kaggle competitions (incorporating 68 from MLE-bench, 74 from DSBench, and 75 newly scraped tasks) and adds a HumanRank score normalized to the original Kaggle leaderboard.^[10]

How does MLE-bench fit into AI safety frameworks?

The MLE-bench paper explicitly positions the benchmark as an evidence source for ML R&D risk evaluations across the major frontier labs:^[1]

For OpenAI's Preparedness Framework (originally released December 2023; v2 published April 15, 2025), MLE-bench is cited as a measure for model autonomy and ML R&D acceleration risk.^[4]
For Anthropic's Responsible Scaling Policy, it is positioned as a capability evaluation for autonomous research.
For Google DeepMind's Frontier Safety Framework, it is cited under the ML R&D Critical Capability Level.

OpenAI's safety evaluations hub and the Preparedness v2 document continue to reference MLE-bench as one of the standardized evaluations the company runs ahead of deploying frontier systems, alongside SWE-bench Verified, AgentBench-style multi-step tool use, and tasks drawn from METR's autonomy suite.^[4]^[11] The framing is that "if a model can succeed on a large fraction of MLE-bench, it is plausible that the same model could execute the core ML engineering steps needed to improve frontier training pipelines," which would warrant heightened safety and security mitigations.^[1]^[4]

How does MLE-bench compare to other benchmarks?

Benchmark	Domain	Length	Comparison
SWE-bench	Real-world GitHub bug fixes	Minutes-hours	Tests software engineering on existing codebases; MLE-bench tests open-ended ML engineering from scratch.^[1]
MLAgentBench	13 mixed ML tasks	Bounded	Provides baseline solutions and measures relative improvement; MLE-bench requires from-scratch attempts on 75 tasks.^[1]
RE-Bench (METR, 2024)	7 frontier-style ML research engineering environments	2-32 hours	Targets frontier AI R&D capabilities (e.g., custom CUDA kernels, restricted-architecture training). Where MLE-bench measures classical ML engineering on Kaggle competitions, RE-Bench measures more novel research tasks where solutions are not freely available online.^[12]
DSBench	Kaggle-style data science	Variable	Concurrent with MLE-bench but filters competitions to fit an automated template; MLE-bench includes more diverse and non-standard tasks.^[1]
METR autonomy suite	Long-horizon agentic tasks	Up to days	METR explicitly frames MLE-bench, RE-Bench, and similar benchmarks as complementary tools for tracking when AI agents will match human researchers on multi-week R&D projects.^[12]
ARC-AGI / GAIA	General reasoning / assistant tasks	Short	Test general cognition rather than long-horizon ML engineering.^[1]

METR has argued that MLE-bench is best read alongside RE-Bench rather than as a substitute: top MLE-bench solutions exist online and require less novel exploration, whereas RE-Bench's seven environments are intentionally designed to admit no public solutions and to require genuine experimentation.^[12]

Reception and criticism

Coverage in VentureBeat, DeepLearning.AI's The Batch, MarkTechPost, and a number of industry outlets highlighted MLE-bench as one of the first benchmarks to evaluate autonomous ML engineering at scale, and it was selected as an oral presentation at ICLR 2025.^[6]^[3]

Several recurring criticisms have been raised in the literature and in commentary:^[1]^[12]^[9]

Contamination risk. Despite the obfuscation and familiarity experiments, every MLE-bench competition is public, and top-scoring solutions are typically discussed and published on Kaggle and GitHub. The paper acknowledges that GPT-4's base model could reproduce parts of the Titanic dataset given a prompt, and that subtler forms of contamination cannot be fully excluded for newer, larger models. The authors recommend periodic refreshes with newer competitions.
Compute asymmetry vs. human Kagglers. A human competitor typically has weeks or months to iterate on a Kaggle problem and can collaborate via the platform's discussion forums; MLE-bench agents have 24 hours and no peer interaction. Both METR and the MLE-bench authors caution that the resulting medal rates are not directly comparable to human medal rates.^[12]^[1]
Scaffolding sensitivity. Differences between AIDE, MLAB, and OpenHands change the headline score by up to about 8 percentage points on the same underlying model (GPT-4o: 8.7% with AIDE, 4.4% with OpenHands, 0.8% with MLAB). This makes reported numbers highly dependent on the engineering effort invested in scaffolding, and means that improvements in scaffolds (e.g., AIRA, MLE-STAR) can be confused with improvements in the underlying language model.^[1]^[8]^[9]
Generalization gap. Subsequent analyses have noted a persistent 9-13% gap between agents' validation-set scores and their actual test-set performance, indicating that current agents struggle with the kind of robust model-selection skills that experienced Kagglers prioritize.^[9]
Resource intensity. A single full run consumes roughly 1,800 GPU-hours on the recommended hardware; one seed of o1-preview AIDE alone used 127.5M input tokens and 15.0M output tokens. This limits the number of groups that can independently reproduce headline numbers.^[1]
Coverage of AI R&D. The authors are explicit that MLE-bench does not cover the full range of frontier AI R&D. For instance, the framing of the problem, the choice of dataset, and the choice of metric are all given by Kaggle, whereas real research often starts upstream of all of these.^[1] METR's RE-Bench was developed in part to fill this gap.^[12]

Despite these caveats, MLE-bench has been broadly adopted: by mid-2026 it had been incorporated into OpenAI's Safety Evaluations Hub, referenced in OpenAI's Preparedness Framework v2, extended by community projects such as MLE-Dojo and MLE-STAR, and used as a workhorse evaluation for new frontier models from OpenAI, Anthropic, and Google DeepMind.^[4]^[11]^[9]^[10]

ELI5: MLE-bench explained simply

Kaggle is a website where people compete to build the best machine learning model for a problem, and the top finishers win bronze, silver, or gold medals. MLE-bench takes 75 of those old Kaggle contests and hands them to an AI agent instead of a person: the AI has to read the rules, look at the data, write code, train a model, fix its own bugs, and turn in an answer file, all on its own within 24 hours. The AI's answer is then scored on the same leaderboard real people used, so you can see whether the AI would have won a medal. When OpenAI first tried this in 2024, its best AI (o1-preview using a helper program called AIDE) earned a medal on about 17 out of every 100 contests; by 2026 newer AIs were winning medals on more than 60 out of 100. People care because an AI that is good at building AI models might one day help build even more powerful AI, which is something safety teams want to watch closely.^[1]^[5]

References

1. Chan, J. S., Chowdhury, N., Jaffe, O., Aung, J., Sherburn, D., Mays, E., Starace, G., Liu, K., Maksin, L., Patwardhan, T., Weng, L., and Madry, A. "MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering." arXiv:2410.07095, October 9, 2024 (v2 February 26, 2025). https://arxiv.org/abs/2410.07095
2. OpenAI. "openai/mle-bench" GitHub repository. https://github.com/openai/mle-bench
3. DeepLearning.AI. "OpenAI's MLE-bench Tests AI Coding Agents." *The Batch*. https://www.deeplearning.ai/the-batch/openais-mle-bench-tests-ai-coding-agents
4. OpenAI. "Preparedness Framework Version 2." April 15, 2025. https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf
5. MLE-bench community leaderboard. https://www.mlebench.com/
6. ICLR 2025. "MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering" (Oral). https://iclr.cc/virtual/2025/oral/31914
7. OpenAI. "MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering." Research blog announcement, October 2024. https://openai.com/index/mle-bench/
8. "AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench." arXiv:2507.02554. https://arxiv.org/html/2507.02554v1
9. Emergent Mind. "MLE-bench: Autonomous ML Engineering Benchmark." https://www.emergentmind.com/topics/mle-bench
10. "MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering." arXiv:2505.07782. https://arxiv.org/html/2505.07782v1
11. OpenAI Safety Evaluations Hub. https://openai.com/safety/evaluations-hub/
12. METR. "Evaluating frontier AI R&D capabilities of language model agents against human experts." November 22, 2024. https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
