WeirdML
Last reviewed
May 10, 2026
Sources
12 citations
Review status
Source-backed
Revision
v2 · 2,497 words
| WeirdML | |
|---|---|
| Overview | |
| Full name | WeirdML (Weird Machine Learning) |
| Description | Benchmark testing whether large language models can do real ML engineering on small, unusual datasets by writing PyTorch code, getting feedback, and iterating |
| First released | January 16, 2025 (v1) |
| Latest version | WeirdML v2 (June 2025) |
| Author | Håvard Tveit Ihle |
| Affiliation | Norwegian Defence Research Establishment (FFI) |
| Hosting and support | Epoch AI Benchmarking Hub; METR sponsored API costs |
| Technical Details | |
| Type | ML engineering, code generation, agentic iteration |
| Tasks (v1) | 6 |
| Tasks (v2) | 19 (6 public, 13 hidden) |
| Iterations per run | 5 submissions, 4 rounds of feedback |
| Hardware | NVIDIA TITAN V GPU, 12 GB memory |
| Per submission timeout | 600 seconds |
| Runs per model and task | At least 15 (5 for the most expensive models) |
| Scoring | Mean across runs of the best test accuracy across the 5 iterations |
| Language | Python (PyTorch) |
| Performance | |
| Top score (v1, Jan 2025) | About 51% (Claude 3.5 Sonnet) |
| Top score (v2 public leaderboard) | 72.2% (GPT-5.2) |
| Saturated | No |
| Resources | |
| Website | htihle.github.io/weirdml.html |
| v1 archive | htihle.github.io/weirdml_v1.html |
| Introductory post | LessWrong, Jan 16, 2025 |
| Time horizons code | github.com/htihle/weirdml-time-horizons |
| Epoch AI listing | epoch.ai/benchmarks/weirdml |
| License | MIT (time horizons repository) |
WeirdML is a benchmark for evaluating how well large language models can do hands-on machine learning engineering. It was created by Norwegian researcher Håvard Tveit Ihle of the Norwegian Defence Research Establishment (FFI) and introduced on LessWrong on January 16, 2025. Each task is a small, deliberately quirky ML problem with limited training data or an unusual input representation. The model has to read the prompt, design an approach, write a complete PyTorch script that loads the data, trains, and evaluates, then iterate on its solution after seeing terminal output and a held-out test accuracy.[1][2]
WeirdML v1 covered six tasks. WeirdML v2, announced in June 2025, kept the six public tasks and added 13 hidden tasks for 19 total, plus tracking of API cost, output tokens, and lines of code. WeirdML v2 has been integrated into Epoch AI's Benchmarking Hub, with METR helping fund the API spend.[3][4][5]
Ihle is a former astrophysicist who worked on cosmological data pipelines for the COMAP and Cosmoglobe experiments before shifting toward AI evaluation, generalization, and robustness. In the LessWrong post he framed WeirdML as a response to a gap: standard ML benchmarks fix a dataset and reward leaderboard climbing, while code benchmarks cover short functions or competitive puzzles. Neither tells you whether a model can sit with a small unfamiliar dataset and figure out what to do.[1][6]
WeirdML targets four capabilities at once: understanding the data and structure of the problem, picking a sensible architecture and training setup, producing PyTorch code that runs, and using feedback to fix bugs. Ihle built the automated pipeline as a part-time project over about two months. The original v1 evaluation cost roughly $200 in API calls, dominated by o1-preview at about two dollars per run.[1] The word "weird" is meant literally: some tasks format their data in ways that defeat the most obvious approach (images as unordered patches, shapes as point clouds) to push models past pattern matching against tutorials seen in pretraining.
Each task is presented as a self-contained prompt with the problem description, training and test data paths, and a small example of how to load the data with NumPy or PyTorch. The model returns a Python script that handles everything from data loading through final evaluation. The code runs inside an isolated Docker container on a single NVIDIA TITAN V GPU with 12 GB of memory and a 600-second timeout. Network access is disabled, so the model cannot pull pretrained weights at run time.[1][3]
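As a rough illustration of the time limit alone, the sketch below runs a submitted script in a child process and captures its terminal output. It is not the WeirdML harness: the real pipeline runs code in an isolated Docker container with a GPU and no network access, none of which is reproduced here. The function name `run_submission` and the layout of the returned dictionary are assumptions made for this example.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_submission(code: str, timeout_s: float = 600.0) -> dict:
    """Run a submitted script and return its status and terminal output.

    Hypothetical helper: it only mimics the per-submission timeout,
    not the container, GPU, or network isolation described above.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "submission.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,  # collect stdout and stderr
                text=True,
                timeout=timeout_s,    # child is killed when this expires
            )
        except subprocess.TimeoutExpired:
            return {"status": "timeout", "output": ""}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "output": proc.stdout + proc.stderr}
```

A script that crashes yields `status == "error"` with the traceback in `output`, which is the kind of terminal text a model would need in order to revise its code.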
A single run gives the model five submissions. After each of the first four attempts, the harness returns terminal output (errors and test accuracy if the run completed) and asks for a revision. The accuracy reported for the run is the best across the five submissions. Each model gets at least 15 runs per task; the headline score is the mean of the best-per-run accuracies. The most expensive reasoning models (o1-preview, o3-pro, Claude thinking variants) get only five runs. A full evaluation can cost thousands of dollars, which is why METR's funding mattered for v2.[1][4]
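The scoring rule can be sketched in a few lines. The example assumes that a submission which crashes or times out is recorded as `None` and counts as zero accuracy; the text above does not say how failures are scored, so that convention and the function names are illustrative only.

```python
from statistics import mean


def run_score(accuracies):
    """Best test accuracy within one run of up to five submissions.

    Assumption: failed submissions are stored as None and count as 0.0.
    """
    return max((a if a is not None else 0.0) for a in accuracies)


def headline_score(runs):
    """Mean over runs of each run's best accuracy."""
    return mean(run_score(run) for run in runs)


# Illustrative numbers only, not real benchmark results.
example_runs = [
    [None, 0.20, 0.50, 0.40, 0.45],  # best is 0.50
    [0.10, None, None, 0.30, 0.30],  # best is 0.30
    [None, None, None, None, None],  # every submission failed
]
print(round(headline_score(example_runs), 4))
```

Taking the maximum within a run means a late regression does not lower the score, while averaging over runs penalizes models that only occasionally find a working solution.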
WeirdML v2 has 19 tasks; only six are public. The other 13 are held out so the benchmark stays informative as public solutions accumulate. The six public tasks shipped with v1 in January 2025.[2][3]
| Task | Setup | What makes it tricky |
|---|---|---|
| Shapes (Easy) | Classify five shapes (circle, square, triangle, pentagon, star) from 512 noisy 2D coordinates. Centered, fixed orientation and size. 1000 training examples. | Inputs are unordered point clouds, not images; the model must handle permutation invariance and noise. |
| Shapes (Hard) | Same as Easy but with random translation, rotation, and scaling per sample. | Requires invariant features or aggressive data augmentation to cope with the random transforms. |
| Image Patch Shuffling (Easy) | Reconstruct 27x27 grayscale Fashion-MNIST images from nine shuffled 9x9 patches. | A jigsaw problem rather than classification. |
| Image Patch Shuffling (Hard) | Reconstruct from RGB patches randomly sampled from larger Imagenette images with varying backgrounds. | Performance hovers near chance for almost every model. |
| Chess Game Outcome | Predict win, loss, or draw from algebraic notation move sequences. 1000 amateur games. | Sequence input with no pretrained chess knowledge available at run time. |
| Unsupervised Digit Recognition | Classify digits with only 26 labeled and around 16,000 unlabeled samples. | A semi-supervised pipeline must be built end to end. |
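To make the "invariant features" idea for the Shapes tasks concrete, here is one possible hand-built descriptor for a 2D point cloud: the sorted distances of the points from their centroid, divided by the mean distance. This is only an illustration and is not taken from any model's submission or from the benchmark prompts. It ignores the noise in the data and throws away angular information, so on its own it may not separate all five shapes; both function names are hypothetical.

```python
import math


def invariant_features(points):
    """Sorted centroid distances, divided by their mean.

    Unchanged by point order, translation, rotation, and uniform scaling.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    radii = [math.hypot(x - cx, y - cy) for x, y in points]
    mean_radius = sum(radii) / n
    if mean_radius == 0.0:
        return [0.0] * n  # degenerate cloud: all points coincide
    return sorted(r / mean_radius for r in radii)


def similarity_transform(points, angle, scale, dx, dy):
    """Rotate by `angle` radians, scale uniformly, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (scale * (c * x - s * y) + dx, scale * (s * x + c * y) + dy)
        for x, y in points
    ]
```

Applying `similarity_transform` to a reordered copy of a cloud and recomputing the features gives the same vector up to floating-point error. The alternative named in the table, data augmentation, would instead apply random transforms like this one to the training samples and let the network learn the invariance.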
The 13 hidden v2 tasks broaden the suite to cover more imaging problems, more sequential and tabular data, additional unsupervised setups, and tasks designed to span a wider difficulty range so that the leaderboard does not collapse around a small number of saturated entries.[3][4]
| Choice | Why it matters |
|---|---|
| Best of five within a run | Rewards getting a working solution at any iteration |
| Mean over many runs (15+) | Reduces the high run-to-run variance typical in code generation benchmarks |
| Strict resource limits | Forces models to engineer a solution that fits, not brute force a giant net |
| Test set isolation | Prevents peeking even if the model's file handling is sloppy |
| No internet during execution | Blocks downloads of pretrained weights mid-run |
Ihle reported that on v1, the gap between five independent tries and five iterations with feedback was smaller than expected for non-reasoning models. Most of the value of iteration came from more shots on goal, with extra benefit from feedback concentrated in reasoning models such as o1-mini, o1-preview, and gemini-2.0-flash-thinking. Newer reasoning models have widened that gap.[1][2]
When Ihle published the v1 leaderboard in January 2025, Claude 3.5 Sonnet led at about 51% mean accuracy across the six tasks, with OpenAI's o1-preview close behind.[1][2]
| Model | Average across 6 tasks |
|---|---|
| Claude 3.5 Sonnet | 50.94% |
| o1-preview | 48.82% |
| o1-mini | 45.58% |
| Claude 3.5 Haiku | 43.75% |
| Gemini 2.0 Flash Thinking | 42.82% |
Shapes (Easy) was effectively solved (o1-preview reached about 98%). Shapes (Hard) topped out near 60% on Claude 3.5 Sonnet. Chess outcome prediction stalled around 74% for the same model. Image Patch Shuffling (Hard) was unsolved, with most models near chance. Unsupervised Digit Recognition had a high first-attempt failure rate, but Claude 3.5 Sonnet averaged around 80% when its pipeline worked.[1][2]
Ihle announced v2 in June 2025 alongside the Epoch AI integration. The v2 leaderboard uses 17 of the 19 tasks for the public score and reports much higher numbers than v1, partly because the model lineup improved and partly because v2 averages across a wider task set that includes some easier tasks.[3][4][7]
| Rank | Model | WeirdML v2 score |
|---|---|---|
| 1 | GPT-5.2 | 72.20% |
| 2 | Gemini 3 Pro | 69.93% |
| 3 | Claude Opus 4.5 | 63.70% |
| 4 | OpenAI o3 | 58.21% |
| 5 | Gemini 2.5 Pro (Jun 2025) | 54.03% |
| 6 | o4-mini (high) | 52.56% |
| 7 | GPT-OSS 120B | 48.17% |
| 8 | OpenAI o1 | 47.56% |
| 9 | Grok 4 | 45.73% |
| 10 | Kimi K2 (thinking, official) | 42.79% |
Ihle has posted snapshots tied to specific releases. When GPT-5 launched, he reported it leading at 56.3% (beating o3-pro at 53.9%), with GPT-5 mini matching o3 at a fraction of the cost. GPT-5 wrote much more code per attempt (median 324 lines vs about 133 for o3).[7] Later releases such as GPT-5.4 (around 57.4%), GPT-5.5 (around 67.1%), and Claude Opus 4.7 (around 76.4%) continued the climb.[8] Smaller open-source baselines stay below 10%, with Mixtral 8x7B around 3.17%, and the pool exceeds 30 evaluated models.[7]
Per-task patterns shifted. Shapes (Hard) is no longer near chance for top models, with the strongest reasoning models reaching about 90%. Image Patch Shuffling (Hard) is still the toughest public task, though leaders have crept above chance. Most of the headroom now lives in the hidden v2 tasks.[3][4]
In February 2026 Ihle published a follow-up analysis called "WeirdML Time Horizons" on LessWrong, with code in the weirdml-time-horizons GitHub repository (MIT license). The idea borrows from METR's task-duration framing: estimate how long a median professional ML researcher would need to solve each task without AI help, then ask at what task length each model crosses 50% success.[9][10]
Ihle uses a panel of four LLMs to estimate per-task human completion times at five accuracy thresholds (25%, 50%, 70%, 90%, 95%). The estimates are converted to hours (1 day = 8h, 1 week = 40h) and feed a logistic fit, with block bootstrap resampling for uncertainty. The headline result: WeirdML time horizons roughly double every five months, from about 24 minutes for GPT-4 in June 2023 to roughly 38 hours for Claude Opus 4.6 in February 2026. That doubling rate is close to METR's reported seven-month doubling, despite different tasks and methodology.[9][10]
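The five-month figure can be sanity-checked from the two endpoint values quoted above. The snippet below is a back-of-the-envelope calculation that assumes smooth exponential growth between GPT-4 and Claude Opus 4.6. It is not the logistic fit or the bootstrap procedure from the analysis, and the function names are made up for this example.

```python
import math


def months_between(start, end):
    """Whole months between two (year, month) pairs."""
    return (end[0] - start[0]) * 12 + (end[1] - start[1])


def doubling_time_months(horizon_start_h, horizon_end_h, months_elapsed):
    """Doubling time implied by exponential growth between two points."""
    doublings = math.log2(horizon_end_h / horizon_start_h)
    return months_elapsed / doublings


elapsed = months_between((2023, 6), (2026, 2))            # 32 months
estimate = doubling_time_months(24 / 60, 37.7, elapsed)   # 24 min -> 37.7 h
print(f"{estimate:.1f} months per doubling")              # about 4.9
```

Going from 0.4 hours to 37.7 hours is roughly 6.6 doublings, and spreading those over 32 months gives a little under five months each, in line with the reported trend.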
| Model | Release | Time horizon (50% success) |
|---|---|---|
| Claude Opus 4.6 (adaptive) | Feb 2026 | About 37.7 hours |
| GPT-5.2 (xhigh) | Dec 2025 | About 30.6 hours |
| Gemini 3 Pro (high) | Nov 2025 | About 22.3 hours |
| GPT-5 (high) | Aug 2025 | About 14.5 hours |
| o3-pro (high) | Jun 2025 | About 11.8 hours |
| o1-preview | Sep 2024 | About 6.2 hours |
| Claude 3.5 Sonnet | Jun 2024 | About 1.9 hours |
| GPT-4 | Jun 2023 | About 24 minutes |
The LLM panel likely overestimates absolute completion times, especially at high accuracy thresholds, so the hours should be read with skepticism. The doubling rate is more robust. A calibrated variant in the repository gives smaller absolute values but a similar doubling time of about six months.[9]
| Benchmark | How it differs from WeirdML |
|---|---|
| HumanEval and MBPP | No data, training, or iteration; pure short function generation |
| SWE-bench | Software engineering on existing repos, not ML modeling from scratch |
| MLE-bench | Kaggle-style ML competitions with larger datasets and longer budgets |
| MLAgentBench | Research-style ML tasks; broader scope, less focus on small weird datasets |
| GPQA | Graduate science multiple choice; no code execution |
| SciCode | Scientific computing problems decomposed into subproblems |
| RE-Bench | Open-ended research engineering tasks judged by experts |
WeirdML is distinguished by three features: tasks are deliberately small and quirky, the harness automates a five-iteration loop with execution feedback, and strict GPU and time limits force practical ML thinking.[1][3][5]
Epoch AI's benchmarking hub added WeirdML v2 alongside Aider Polyglot, Balrog, and the Factorio Learning Environment when it expanded to feature trusted external leaderboards. WeirdML scores feed into the Epoch Capabilities Index, an aggregate measure across many benchmarks.[5][11] Independent aggregators such as NeoSignal cite WeirdML alongside SWE-bench Verified and GPQA. Frontier releases since mid-2025 have included WeirdML scores in third-party comparisons, especially for GPT-5, Claude Opus 4 variants, Gemini 3 Pro, and Grok 4.[7][8] Ihle has framed WeirdML's role as keeping a meaningful signal alive while standard benchmarks saturate.[1][12]
| Limitation | Description |
|---|---|
| Fixed framework | All solutions are PyTorch; JAX, TensorFlow, and Julia are not measured |
| Small task count | 19 tasks in v2 (17 in the public score); per-task noise is non-trivial |
| Hardware specific | TITAN V GPU and 12 GB memory are unusual versus modern production hardware |
| Short attempt window | A 600-second budget rules out longer training runs |
| Hidden task drift | Hidden v2 tasks cut overfitting risk but make per-task interpretation harder |
| Cost of evaluation | Full evaluation of reasoning models can cost thousands of dollars |