WeirdML

Updated May 10, 2026

WeirdML
Overview
Full name	WeirdML (Weird Machine Learning)
Description	Benchmark testing whether large language models can do real ML engineering on small, unusual datasets by writing PyTorch code, getting feedback, and iterating
First released	January 16, 2025 (v1)
Latest version	WeirdML v2 (June 2025)
Author	Håvard Tveit Ihle
Affiliation	Norwegian Defence Research Establishment (FFI)
Hosting and support	Epoch AI Benchmarking Hub; METR sponsored API costs
Technical Details
Type	ML engineering, code generation, agentic iteration
Tasks (v1)	6
Tasks (v2)	19 (6 public, 13 hidden)
Iterations per run	5 submissions, 4 rounds of feedback
Hardware	NVIDIA TITAN V GPU, 12 GB memory
Per submission timeout	600 seconds
Runs per model and task	At least 15 (5 for the most expensive models)
Scoring	Mean across runs of the best test accuracy across the 5 iterations
Language	Python (PyTorch)
Performance
Top score (v1, Jan 2025)	About 51% (Claude 3.5 Sonnet)
Top score (v2 public leaderboard)	72.2% (GPT-5.2)
Saturated	No
Resources
Website	htihle.github.io/weirdml.html
v1 archive	htihle.github.io/weirdml_v1.html
Introductory post	LessWrong, Jan 16, 2025
Time horizons code	github.com/htihle/weirdml-time-horizons
Epoch AI listing	epoch.ai/benchmarks/weirdml
License	MIT (time horizons repository)

WeirdML is a benchmark for evaluating how well large language models can do hands-on machine learning engineering. It was created by Norwegian researcher Håvard Tveit Ihle of the Norwegian Defence Research Establishment (FFI) and introduced on LessWrong on January 16, 2025. Each task is a small, deliberately quirky ML problem with limited training data or an unusual input representation. The model has to read the prompt, design an approach, write a complete PyTorch script that loads the data, trains, and evaluates, then iterate on its solution after seeing the terminal output and the held-out test accuracy.^[1]^[2]

WeirdML v1 covered six tasks. WeirdML v2, announced in June 2025, kept the six public tasks and added 13 hidden tasks for 19 total, plus tracking of API cost, output tokens, and lines of code. WeirdML v2 has been integrated into Epoch AI's Benchmarking Hub, with METR helping fund the API spend.^[3]^[4]^[5]

Origins and motivation

Ihle is a former astrophysicist who worked on cosmological data pipelines for the COMAP and Cosmoglobe experiments before shifting toward AI evaluation, generalization, and robustness. In the LessWrong post he framed WeirdML as a response to a gap: standard ML benchmarks fix a dataset and reward leaderboard climbing, while code benchmarks cover short functions or competitive puzzles. Neither tells you whether a model can sit with a small unfamiliar dataset and figure out what to do.^[1]^[6]

WeirdML targets four capabilities at once: understanding the data and structure of the problem, picking a sensible architecture and training setup, producing PyTorch code that runs, and using feedback to fix bugs. Ihle built the automated pipeline as a part-time project over about two months. The original v1 evaluation cost roughly $200 in API calls, dominated by o1 preview at about two dollars per run.^[1] The word "weird" is meant literally: some tasks format their data in ways that defeat the most obvious approach (images as unordered patches, shapes as point clouds), which pushes models past pattern matching against tutorials seen in pretraining.

How a WeirdML task works

Each task is presented as a self-contained prompt with the problem description, training and test data paths, and a small example of how to load the data with NumPy or PyTorch. The model returns a Python script that handles everything from data loading through final evaluation. The code runs inside an isolated Docker container on a single NVIDIA TITAN V GPU with 12 GB of memory and a 600-second timeout. Network access is disabled, so the model cannot pull pretrained weights at run time.^[1]^[3]
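The execution step can be approximated locally with a child process and a wall-clock limit. The sketch below is only an illustration: `run_submission` and its return format are hypothetical names, and it does not reproduce the Docker isolation, GPU, or disabled network of the real harness.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_submission(code: str, timeout_s: float = 600.0) -> dict:
    """Run one submitted script in a child process and capture its terminal output.

    Hypothetical helper: the real harness runs inside an isolated Docker
    container with a GPU and no network, which this sketch does not attempt.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "submission.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,  # collect stdout and stderr as feedback
                text=True,
                timeout=timeout_s,    # the child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return {"timed_out": True, "returncode": None, "output": ""}
    return {
        "timed_out": False,
        "returncode": proc.returncode,
        "output": proc.stdout + proc.stderr,
    }
```

The captured `output` string plays the role of the terminal feedback that is shown to the model before its next attempt.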

A single run gives the model five submissions: an initial attempt followed by four rounds of feedback. After each attempt, the harness returns the terminal output (errors, plus the test accuracy if the script completed) and asks for a revision. The accuracy reported for the run is the best across the five submissions. Each model gets at least 15 runs per task, and the headline score is the mean of the per-run best accuracies. The most expensive reasoning models (o1 preview, o3 pro, Claude thinking variants) get only five runs. A full evaluation can cost thousands of dollars, which is why METR's funding mattered for v2.^[1]^[4]
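The scoring rule described above can be written down in a few lines. This is a minimal sketch, not the benchmark's own code: the function names are made up for illustration, and counting a crashed or timed-out submission as 0.0 accuracy is an assumption of the sketch.

```python
from statistics import mean


def run_score(accuracies):
    """Best test accuracy across the submissions of one run.

    Assumption: a submission that crashed or timed out is recorded as None
    and counts as 0.0.
    """
    return max(0.0 if a is None else a for a in accuracies)


def task_score(runs):
    """Mean over runs of the per-run best accuracy for one task."""
    return mean(run_score(run) for run in runs)


def headline_score(runs_by_task):
    """Average of the per-task scores across all tasks."""
    return mean(task_score(runs) for runs in runs_by_task.values())


# Hypothetical accuracies: two runs on one task, one run on another.
example = {
    "shapes_easy": [[None, 0.4, 0.9, 0.85, 0.9], [0.2, 0.2, None, 0.3, 0.5]],
    "chess": [[0.5, 0.6, 0.55, 0.6, 0.7]],
}
print(headline_score(example))
```

Because only the best submission in a run counts, a late regression does not lower the run's score.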

Tasks in WeirdML

WeirdML v2 has 19 tasks; only six are public. The other 13 are held out so the benchmark stays informative as public solutions accumulate. The six public tasks shipped with v1 in January 2025.^[2]^[3]

Task	Setup	What makes it tricky
Shapes (Easy)	Classify five shapes (circle, square, triangle, pentagon, star) from 512 noisy 2D coordinates. Centered, fixed orientation and size. 1000 training examples.	Inputs are unordered point clouds, not images; the model must handle permutation invariance and noise.
Shapes (Hard)	Same as Easy but with random translation, rotation, and scaling per sample.	Requires invariant features or aggressive data augmentation.
Image Patch Shuffling (Easy)	Reconstruct 27x27 grayscale Fashion MNIST images from nine shuffled 9x9 patches.	A jigsaw problem rather than classification.
Image Patch Shuffling (Hard)	Reconstruct from RGB patches randomly sampled from larger Imagenette images with varying backgrounds.	Performance hovers near chance for almost every model.
Chess Game Outcome	Predict win, loss, or draw from algebraic notation move sequences. 1000 amateur games.	Sequence input with no pretrained chess knowledge available at run time.
Unsupervised Digit Recognition	Classify digits with only 26 labeled and around 16,000 unlabeled samples.	A semi-supervised pipeline must be built end to end.
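To make "permutation invariance" and "invariant features" concrete for the two Shapes tasks, here is one possible hand-crafted feature: a histogram of distances from the centroid, normalized by the mean distance. It is only an illustration of the idea, not a reference solution or a description of what any evaluated model wrote; the function names, the bin count, and the 2x cap on the radius ratio are arbitrary choices for this sketch.

```python
import math
import random


def radial_histogram(points, bins=16, max_ratio=2.0):
    """Summarize a 2D point cloud in a way that ignores point order,
    translation, rotation, and uniform scaling."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Distances from the centroid are unchanged by translation and rotation.
    radii = [math.hypot(x - cx, y - cy) for x, y in points]
    # Dividing by the mean radius removes the overall scale.
    mean_r = (sum(radii) / n) or 1.0
    hist = [0] * bins
    for r in radii:
        idx = min(int(r / mean_r / max_ratio * bins), bins - 1)
        hist[idx] += 1
    # Counting points per bin discards their order.
    return [count / n for count in hist]


def random_similarity_transform(points, rng):
    """Rotate, scale, translate, and shuffle a cloud (test helper)."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    scale = rng.uniform(0.5, 3.0)
    dx, dy = rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)
    moved = [
        (scale * (x * math.cos(theta) - y * math.sin(theta)) + dx,
         scale * (x * math.sin(theta) + y * math.cos(theta)) + dy)
        for x, y in points
    ]
    rng.shuffle(moved)
    return moved
```

A feature vector like this could feed a small classifier. The alternative named in the table, data augmentation, would instead apply something like `random_similarity_transform` to the training samples and let the network learn the invariance.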

The 13 hidden v2 tasks broaden the suite to cover more imaging problems, more sequential and tabular data, additional unsupervised setups, and tasks designed to span a wider difficulty range so that the leaderboard does not collapse around a small number of saturated entries.^[3]^[4]

Methodology details

Choice	Why it matters
Best of five within a run	Rewards getting a working solution at any iteration
Mean over many runs (15+)	Reduces the high run-to-run variance typical in code generation benchmarks
Strict resource limits	Forces models to engineer a solution that fits, not brute force a giant net
Test set isolation	Prevents peeking even if the model's file handling is sloppy
No internet during execution	Blocks downloads of pretrained weights mid run
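The variance argument in the table can be quantified with the standard error of the mean. The sketch below assumes independent runs, and the per-run standard deviation of 0.2 is a made-up value for illustration, not a measured one.

```python
import math
from statistics import stdev


def standard_error(run_scores):
    """Standard error of the mean of per-run best accuracies."""
    return stdev(run_scores) / math.sqrt(len(run_scores))


def expected_se(per_run_std, n_runs):
    """Standard error expected from n independent runs with a given spread."""
    return per_run_std / math.sqrt(n_runs)


# Hypothetical spread of 0.2 between runs: compare 5 runs with 15 runs.
ratio = expected_se(0.2, 5) / expected_se(0.2, 15)
print(ratio)  # sqrt(3), about 1.73
```

Under these assumptions, the models limited to five runs carry roughly 1.7 times the uncertainty of those evaluated with 15.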

Ihle reported that on v1, the gap between five independent tries and five iterations with feedback was smaller than expected for non-reasoning models. Most of the value of iteration came from more shots on goal, with extra benefit from feedback concentrated in reasoning models such as o1 mini, o1 preview, and Gemini 2.0 Flash Thinking. Newer reasoning models have widened that gap.^[1]^[2]
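The "shots on goal" effect is simple probability. As a simplified illustration, treat each attempt as a binary success or failure (for example, the script runs and clears some accuracy bar). All probabilities below are invented for the example and are not measurements from the benchmark.

```python
import math


def p_at_least_one(success_probs):
    """Probability that at least one attempt succeeds, assuming independence."""
    return 1.0 - math.prod(1.0 - p for p in success_probs)


# Five independent tries, each with a hypothetical 30% success chance.
independent = p_at_least_one([0.3] * 5)

# Feedback modeled as a rising per-attempt success chance (hypothetical values).
with_feedback = p_at_least_one([0.3, 0.35, 0.4, 0.45, 0.5])

print(round(independent, 3), round(with_feedback, 3))
```

Even with no learning between attempts, five tries lift a 30% single-shot chance to about 83%, so feedback only adds value to the extent that it raises the later per-attempt probabilities.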

WeirdML v1 results

When Ihle published the v1 leaderboard in January 2025, Claude 3.5 Sonnet led at about 51% mean accuracy across the six tasks, with OpenAI's o1 preview close behind.^[1]^[2]

Model	Average across 6 tasks
Claude 3.5 Sonnet	50.94%
o1 preview	48.82%
o1 mini	45.58%
Claude 3.5 Haiku	43.75%
Gemini 2.0 Flash Thinking	42.82%

Shapes (Easy) was effectively solved (o1 preview reached about 98%). Shapes (Hard) topped out near 60% on Claude 3.5 Sonnet. Chess outcome prediction stalled around 74% for the same model. Image Patch Shuffling (Hard) was unsolved, with most models near chance. Unsupervised Digit Recognition had a high first-attempt failure rate, but Claude 3.5 Sonnet averaged around 80% when its pipeline worked.^[1]^[2]

WeirdML v2 results

Ihle announced v2 in June 2025 alongside the Epoch AI integration. The v2 leaderboard uses 17 of the 19 tasks for the public score and reports much higher numbers than v1, partly because the model lineup improved and partly because v2 averages over a wider task set, some of which are easier.^[3]^[4]^[7]

Rank	Model	WeirdML v2 score
1	GPT-5.2	72.20%
2	Gemini 3 Pro	69.93%
3	Claude Opus 4.5	63.70%
4	OpenAI o3	58.21%
5	Gemini 2.5 Pro (Jun 2025)	54.03%
6	o4 mini (high)	52.56%
7	GPT-OSS 120B	48.17%
8	OpenAI o1	47.56%
9	Grok 4	45.73%
10	Kimi K2 (thinking, official)	42.79%

Ihle has posted snapshots tied to specific releases. When GPT-5 launched, he reported it leading at 56.3% (beating o3 pro at 53.9%), with GPT-5 mini matching o3 at a fraction of the cost. GPT-5 wrote much more code per attempt (median 324 lines vs about 133 for o3).^[7] Later releases such as GPT-5.4 (around 57.4%), GPT-5.5 (around 67.1%), and Claude Opus 4.7 (around 76.4%) continued the climb.^[8] Smaller open source baselines stay below 10%, with Mixtral 8x7B around 3.17%, and the pool exceeds 30 evaluated models.^[7]

Per-task patterns shifted. Shapes (Hard), which topped out near 60% in v1, is now far above that ceiling, with the strongest reasoning models reaching about 90%. Image Patch Shuffling (Hard) is still the toughest public task, though leaders have crept above chance. Most of the headroom now lives in the hidden v2 tasks.^[3]^[4]

WeirdML time horizons

In February 2026 Ihle published a follow-up analysis called "WeirdML Time Horizons" on LessWrong, with code in the weirdml-time-horizons GitHub repository (MIT license). The idea borrows from METR's task duration framing: estimate how long a median professional ML researcher would need to solve each task without AI help, then ask at what task length each model crosses 50% success.^[9]^[10]

Ihle uses a panel of four LLMs to estimate per-task human completion times at five accuracy thresholds (25%, 50%, 70%, 90%, 95%). The estimates are converted to hours (1 day = 8h, 1 week = 40h) and feed a logistic fit, with block bootstrap resampling for uncertainty. The headline result: WeirdML time horizons roughly double every five months, from about 24 minutes for GPT-4 in June 2023 to roughly 38 hours for Claude Opus 4.6 in February 2026. That doubling rate is close to METR's reported seven-month doubling, despite different tasks and methodology.^[9]^[10]
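The two steps named above, converting estimates to working hours and reading a 50% horizon off a logistic curve, can be sketched as follows. This is not the code from the weirdml-time-horizons repository: the gradient-descent fit, the function names, and the example data are all assumptions made for illustration, and the block bootstrap is omitted.

```python
import math

# Conversion stated in the article: 1 day = 8 h, 1 week = 40 h.
HOURS_PER_UNIT = {"minute": 1 / 60, "hour": 1.0, "day": 8.0, "week": 40.0}


def to_hours(value, unit):
    return value * HOURS_PER_UNIT[unit]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_horizon(hours, solved, steps=5000, lr=0.1):
    """Fit P(solved) = sigmoid(c + b * (log2(hours) - mean)) by gradient
    descent and return the task length, in hours, where P crosses 50%.

    Assumes the data contain both successes and failures so that b != 0.
    """
    xs = [math.log2(h) for h in hours]
    x_mean = sum(xs) / len(xs)
    us = [x - x_mean for x in xs]  # centering keeps the fit well conditioned
    c, b = 0.0, 0.0
    n = len(us)
    for _ in range(steps):
        errs = [sigmoid(c + b * u) - y for u, y in zip(us, solved)]
        c -= lr * sum(errs) / n
        b -= lr * sum(e * u for e, u in zip(errs, us)) / n
    # P = 0.5 where c + b * u = 0, that is u = -c / b.
    return 2.0 ** (x_mean - c / b)


# Hypothetical data: human time per task and whether one model solved it.
task_hours = [0.5, 1, 2, 4, 8, 16, 32, 64]
model_solved = [1, 1, 1, 0, 1, 0, 0, 0]
print(fit_horizon(task_hours, model_solved))
```

Working in log2 of hours means a fixed step along the x-axis corresponds to a doubling of task length, which is what makes a doubling time a natural summary of the trend.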

Model	Release	Time horizon (50% success)
Claude Opus 4.6 (adaptive)	Feb 2026	About 37.7 hours
GPT-5.2 (xhigh)	Dec 2025	About 30.6 hours
Gemini 3 Pro (high)	Nov 2025	About 22.3 hours
GPT-5 (high)	Aug 2025	About 14.5 hours
o3 pro (high)	Jun 2025	About 11.8 hours
o1 preview	Sep 2024	About 6.2 hours
Claude 3.5 Sonnet	Jun 2024	About 1.9 hours
GPT-4	Jun 2023	About 24 minutes
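As a rough sanity check on the doubling time, the eight rows of the table can be fitted directly. This is a back-of-the-envelope sketch, not the author's analysis: it uses only the rounded values shown here, ignores the bootstrap uncertainty, and the function names are illustrative.

```python
import math

# (release year, release month, 50% time horizon in hours), from the table above.
ROWS = [
    (2023, 6, 24 / 60),  # GPT-4, about 24 minutes
    (2024, 6, 1.9),      # Claude 3.5 Sonnet
    (2024, 9, 6.2),      # o1 preview
    (2025, 6, 11.8),     # o3 pro (high)
    (2025, 8, 14.5),     # GPT-5 (high)
    (2025, 11, 22.3),    # Gemini 3 Pro (high)
    (2025, 12, 30.6),    # GPT-5.2 (xhigh)
    (2026, 2, 37.7),     # Claude Opus 4.6 (adaptive)
]


def month_index(year, month):
    return year * 12 + month


def endpoint_doubling_months(rows):
    """Doubling time implied by the first and last rows only."""
    (y0, m0, h0), (y1, m1, h1) = rows[0], rows[-1]
    elapsed = month_index(y1, m1) - month_index(y0, m0)
    return elapsed / math.log2(h1 / h0)


def least_squares_doubling_months(rows):
    """Doubling time from a straight-line fit of log2(hours) against months."""
    xs = [month_index(y, m) for y, m, _ in rows]
    ys = [math.log2(h) for _, _, h in rows]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return 1.0 / slope  # months per doubling


print(endpoint_doubling_months(ROWS), least_squares_doubling_months(ROWS))
```

Both estimates land close to five months, consistent with the headline figure.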

The LLM panel likely overestimates absolute completion times, especially at high accuracy thresholds, so the hours should be read with skepticism. The doubling rate is more robust. A calibrated variant in the repository gives smaller absolute values but a similar doubling time of about six months.^[9]

Comparison with other benchmarks

Benchmark	How it differs from WeirdML
HumanEval and MBPP	No data, training, or iteration; pure short function generation
SWE-bench	Software engineering on existing repos, not ML modeling from scratch
MLE-bench	Kaggle-style ML competitions with larger datasets and longer budgets
MLAgentBench	ML research-style tasks; broader scope, less focus on small weird datasets
GPQA	Graduate science multiple choice; no code execution
SciCode	Scientific computing problems decomposed into subproblems
RE-Bench	Open ended research engineering tasks judged by experts

WeirdML is distinguished by three features: tasks are deliberately small and quirky, the harness automates a five-iteration loop with execution feedback, and strict GPU and time limits force practical ML thinking.^[1]^[3]^[5]

Reception and use

Epoch AI's Benchmarking Hub added WeirdML v2 alongside Aider Polyglot, Balrog, and the Factorio Learning Environment when it expanded to feature trusted external leaderboards. WeirdML scores feed into the Epoch Capabilities Index, an aggregate measure across many benchmarks.^[5]^[11] Independent aggregators such as NeoSignal cite WeirdML alongside SWE-bench Verified and GPQA. Third-party comparisons of frontier releases since mid-2025 have included WeirdML scores, especially for GPT-5, Claude Opus 4 variants, Gemini 3 Pro, and Grok 4.^[7]^[8] Ihle has framed WeirdML's role as keeping a meaningful signal alive while standard benchmarks saturate.^[1]^[12]

Limitations

Limitation	Description
Fixed framework	All solutions are PyTorch; JAX, TensorFlow, and Julia are not measured
Small task count	19 tasks in v2 (17 in the public score); per-task noise is non-trivial
Hardware specific	TITAN V GPU and 12 GB memory are unusual versus modern production hardware
Short attempt window	A 600-second budget rules out longer training runs
Hidden task drift	Hidden v2 tasks cut overfitting risk but make per-task interpretation harder
Cost of evaluation	Full evaluation of reasoning models can cost thousands of dollars

References

1. Ihle, Håvard Tveit. "Introducing the WeirdML Benchmark." LessWrong, January 16, 2025. https://www.lesswrong.com/posts/LfQCzph7rc2vxpweS/introducing-the-weirdml-benchmark
2. Ihle, Håvard Tveit. "WeirdML v1." Personal website. https://htihle.github.io/weirdml_v1.html
3. Ihle, Håvard Tveit. "WeirdML." Personal website (current v2 page). https://htihle.github.io/weirdml.html
4. Ihle, Håvard Tveit. "WeirdML v2 is now out." Twitter post, June 27, 2025. https://x.com/htihle/status/1938603525702930849
5. Epoch AI. "WeirdML (v2) benchmark page." https://epoch.ai/benchmarks/weirdml
6. Ihle, Håvard Tveit. Personal homepage. https://htihle.github.io/
7. NeoSignal. "WeirdML benchmark scores." https://neosignal.io/benchmarks/weirdml
8. Mowshowitz, Zvi. "GPT-5.5: Capabilities and Reactions." Don't Worry About the Vase, 2026. https://www.lesswrong.com/posts/5ytcFayxqZsXN8rNw/gpt-5-5-capabilities-and-reactions
9. Ihle, Håvard Tveit. "WeirdML Time Horizons." LessWrong, February 16, 2026. https://www.lesswrong.com/posts/hoQd3rE7WEaduBmMT/weirdml-time-horizons
10. Ihle, Håvard Tveit. "weirdml-time-horizons." GitHub repository. https://github.com/htihle/weirdml-time-horizons
11. Epoch AI. "We've added four new benchmarks to the Epoch AI Benchmarking Hub." Twitter post, May 2025. https://x.com/EpochAIResearch/status/1919831883875062184
12. Ihle, Håvard Tveit. LessWrong shortform comments. https://www.greaterwrong.com/posts/Mrcsc7bEKjSbjq2op/havard-tveit-ihle-s-shortform
