Task-completion time horizon (METR)
Last reviewed
May 31, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 2,016 words
The task-completion time horizon is a metric for AI capability proposed by METR that expresses a model's ability in units of human time: it is the length of task, measured by how long a human expert takes to do it, that the model can finish with a given success rate. The headline version is the 50%-task-completion time horizon, the human task length at which a model succeeds half the time. Introduced in the March 2025 paper "Measuring AI Ability to Complete Long Software Tasks" by Thomas Kwa, Ben West, and colleagues at METR, the metric is best known for an empirical finding: on a suite of software and research tasks, the time horizon of frontier models has roughly doubled every seven months from 2019 through 2025. [1][2]
The appeal of the metric is that it puts a single, interpretable number on otherwise hard-to-compare systems. Saying a model scores 71% on some benchmark tells you little about what it can actually do. Saying it can complete tasks that take a human about an hour, with even odds, is something a non-specialist can reason about, and it ties model progress to a quantity (human labor time) that matters for forecasting economic and safety impacts.
Most AI evaluations report an average score on a fixed set of problems. That number drifts as benchmarks saturate, and it does not translate across tasks of different difficulty. METR's idea is to flip the framing. Instead of asking what fraction of tasks a model solves, ask: how hard a task, in human terms, can the model handle before its reliability drops to a chosen threshold? Difficulty is operationalized as the time a skilled human needs to complete the task. [1]
The result is a measure on a meaningful axis. A model with a 5-minute horizon can knock out quick scripted jobs but falls apart on anything longer. A model with a 1-hour horizon can carry out tasks with many steps that a person would budget real time for. Because the unit is human time rather than a percentage, horizons from different model generations and different task mixes are at least roughly comparable, and the trend over time becomes legible. [2]
METR built the metric mainly around autonomous software engineering and machine-learning research work, the kind of activity AI agents attempt when given a coding environment, a shell, and a goal. The motivation is partly practical and partly about AI safety: long-horizon autonomy is one of the capabilities that would make a system genuinely able to act in the world without close supervision, so tracking it is useful for risk assessment as well as for capability forecasting. [1][2]
The original study used a combined collection of 170 tasks that METR calls METR-HRS, drawn from three sources. [3]
HCAST (Human-Calibrated Autonomy Software Tasks) supplies 97 tasks spanning software engineering, cybersecurity, machine learning, and general reasoning, with human completion times ranging from a few minutes up to several hours. RE-Bench contributes 7 open-ended machine-learning research engineering tasks, the longest in the set, where a baseline can run many hours. SWAA (Software Atomic Actions) is a set of 66 very short tasks that METR built specifically for this work, with human times from roughly 1 second to 30 seconds. [3]
The short SWAA tasks matter for one reason: without them, the easiest tasks took humans about a minute, and early models could not register on that scale at all. Adding second-scale tasks let METR place models as old as GPT-2 and GPT-3 on the same curve as 2025 systems, which is what makes a multi-year trend possible. Across all three suites, the tasks span from about 1 second to roughly 16 hours of human time. [2][3]
The human side of the comparison is not estimated casually. METR reports over 800 human baseline runs totaling 2,529 hours of work, performed by contractors and staff with relevant expertise, to establish how long each task actually takes a competent person. Those baselines define the time axis against which every model is scored. [3]
METR runs each model on every task, with multiple attempts, and records success or failure. It then fits a logistic regression that predicts the probability of success as a function of the logarithm of the human task length. The fitted curve slopes downward: models almost always succeed at very short tasks and almost always fail at very long ones. [2][3]
The 50% time horizon is read off this curve as the human task length at which the predicted success probability crosses one half. The same machinery yields other thresholds. The 80% time horizon, the length at which a model succeeds 80% of the time, is the figure to watch if you care about reliability rather than coin-flip odds, and it is considerably shorter. For Claude 3.7 Sonnet the paper reports a 50% horizon of 59 minutes against an 80% horizon of about 15 minutes, so for this model demanding higher reliability shrinks the usable task length roughly fourfold. [2][3] This methodology is related to item response theory, which models the chance that a test-taker answers an item of given difficulty correctly. [3]
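The fit-and-read-off procedure can be sketched in a few lines. This is a minimal illustration, not METR's code: the task lengths and outcomes are made up, the names `fit_logistic` and `horizon` are hypothetical, and a plain two-parameter Newton's-method fit stands in for whatever regression tooling METR actually uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iterations=25):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method (two parameters)."""
    a = b = 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p            # gradient of the log-likelihood
            g_b += (y - p) * x
            h_aa += w               # entries of the 2x2 information matrix
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system in closed form.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def horizon(a, b, p):
    """Human task length (minutes) at which predicted success equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)

# Hypothetical data: human task length in minutes and one model's outcome (1 = success).
minutes  = [1, 1, 2, 2, 4, 4, 8, 8, 16, 16, 32, 32, 64, 64, 128, 128, 256, 256]
outcomes = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# x is log2 of the human task length, so b is the change in log-odds per doubling.
a, b = fit_logistic([math.log2(m) for m in minutes], outcomes)
h50 = horizon(a, b, 0.5)
h80 = horizon(a, b, 0.8)
print(f"slope per doubling of task length: {b:.2f}")
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Under this model the ratio between the two horizons depends only on the slope: it equals `2 ** (math.log(4) / abs(b))`, so a shallower success curve implies a larger gap between the 50% and 80% figures.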
The pattern shows up cleanly in raw success rates too. METR found that current models had close to a 100% success rate on tasks taking humans under about 4 minutes, but under a 10% success rate on tasks taking more than about 4 hours. The horizon sits in the transition zone between those regimes. [2]
The central empirical claim is that the 50% time horizon of frontier models has grown exponentially, doubling roughly every seven months over 2019 to 2025. Fitting an exponential to the frontier models in the dataset gives a high goodness of fit, with a reported coefficient of determination near 0.98. [1][2]
The table below lists the 50% and 80% horizons from METR's published data for the models in the original study. Values are in minutes of human task time. [3][4]
| Model | Release | 50% horizon | 80% horizon |
|---|---|---|---|
| GPT-2 | Feb 2019 | ~0.05 min (a few seconds) | ~0.01 min |
| GPT-3.5 Turbo Instruct | 2022 | ~0.6 min | ~0.3 min |
| GPT-4 (0314) | Mar 2023 | ~4 min | ~0.9 min |
| Claude 3 Opus | Mar 2024 | ~4 min | ~0.6 min |
| GPT-4o | May 2024 | ~7 min | ~1.3 min |
| Claude 3.5 Sonnet (June) | Jun 2024 | ~11 min | ~1.7 min |
| o1-preview | Sep 2024 | ~20 min | ~4.4 min |
| Claude 3.5 Sonnet (Oct) | Oct 2024 | ~21 min | ~2.6 min |
| o1 | Dec 2024 | ~39 min | ~7 min |
| Claude 3.7 Sonnet | Feb 2025 | ~59 min | ~15 min |
Progress is not perfectly steady: GPT-4 and Claude 3 Opus, released a year apart, both land near 4 minutes, while reasoning-trained models such as the o1 series jump well ahead. But the long-run climb is steep: from a few seconds in 2019 to about an hour for the best model in the original study, an increase of roughly three orders of magnitude. [3][4]
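A rough version of the doubling-time estimate can be reproduced from the 50% column of the table above. The sketch below fits an ordinary least-squares line to log2(horizon) against release month; it is an illustration under stated assumptions, not METR's procedure. The inputs are the rounded table values at month granularity, the GPT-3.5 Turbo Instruct row is left out because the table gives only a year for it, and the function name `doubling_time_months` is hypothetical. The result should land in the neighborhood of the published figure, but it will not match it exactly.

```python
import math

# (year, month, 50% horizon in minutes), copied from the table above.
# GPT-3.5 Turbo Instruct is omitted because the table gives only a year for it.
points = [
    (2019, 2, 0.05),   # GPT-2
    (2023, 3, 4),      # GPT-4 (0314)
    (2024, 3, 4),      # Claude 3 Opus
    (2024, 5, 7),      # GPT-4o
    (2024, 6, 11),     # Claude 3.5 Sonnet (June)
    (2024, 9, 20),     # o1-preview
    (2024, 10, 21),    # Claude 3.5 Sonnet (Oct)
    (2024, 12, 39),    # o1
    (2025, 2, 59),     # Claude 3.7 Sonnet
]

def doubling_time_months(points):
    """Least-squares line through log2(horizon) vs. months; returns (doubling time, R^2)."""
    ts = [(y - 2019) * 12 + (m - 1) for y, m, _ in points]   # months since Jan 2019
    ys = [math.log2(h) for _, _, h in points]                # horizon in doublings
    n = len(points)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    syy = sum((y - y_mean) ** 2 for y in ys)
    slope = sxy / sxx                     # doublings per month
    r_squared = sxy * sxy / (sxx * syy)
    return 1.0 / slope, r_squared

months, r2 = doubling_time_months(points)
print(f"doubling time: {months:.1f} months, R^2: {r2:.2f}")
```

Working in log2 units makes the slope directly interpretable: its reciprocal is the number of months per doubling.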
METR has continued to add models to a live tracker since publication. By the August 2025 evaluation, GPT-5 was measured at a 50% horizon of about 2 hours 17 minutes (95% confidence interval roughly 65 minutes to 4 hours 25 minutes), compared with about 1 hour 30 minutes for OpenAI's o3. [5] METR's later updates report that frontier horizons kept rising and that the recent doubling time looks faster than the original seven months, on the order of a few months by 2026. [6][7]
If the seven-month doubling held, the horizon would reach a full working month (about 167 hours of human time) at 50% reliability around late 2029, with a confidence interval of roughly two years on either side. METR framed this more loosely in the blog post: extrapolating the trend suggests that within under a decade, AI agents could independently complete a large fraction of software tasks that currently take humans days or weeks. [1][2]
The forecast is sensitive to which data you fit. METR noted the trend may have accelerated in 2024, and fitting only the 2024-2025 points brings the estimated arrival of month-long-task capability forward by about 2.5 years, putting meaningful probability mass in 2027 and early 2028. The wide spread between these scenarios is itself one of the paper's points: short extrapolations are far more robust than long ones. [2]
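The arithmetic behind such extrapolations is simple enough to sketch. The example below anchors the projection at a single point (Claude 3.7 Sonnet's 59-minute horizon in February 2025) rather than at METR's fitted trend line, so it is a simplification and will not reproduce the paper's forecast dates exactly. The helper names are hypothetical, and the 4-month doubling time is an illustrative stand-in for a faster recent trend, not a figure taken from the paper.

```python
import math
from datetime import date

def months_to_reach(target_minutes, start_minutes, doubling_months):
    """Months needed for the horizon to grow from start to target at a fixed doubling time."""
    doublings = math.log2(target_minutes / start_minutes)
    return doublings * doubling_months

def add_months(start, months):
    """Shift a date forward by a rounded number of months (day set to 1)."""
    total = start.year * 12 + (start.month - 1) + round(months)
    return date(total // 12, total % 12 + 1, 1)

START = date(2025, 2, 1)        # Claude 3.7 Sonnet release month
START_MINUTES = 59              # its 50% horizon, in minutes
TARGET_MINUTES = 167 * 60       # one working month of human time

for doubling in (7, 4):         # 7 = long-run trend; 4 = hypothetical faster trend
    m = months_to_reach(TARGET_MINUTES, START_MINUTES, doubling)
    print(f"doubling every {doubling} months: {m:.0f} months, "
          f"around {add_months(START, m):%b %Y}")
```

Because the required number of doublings is fixed by the ratio of target to starting horizon, the projected wait scales linearly with the doubling time, which is why modest changes in the fitted slope shift the arrival date by years.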
METR is direct about how much weight the metric can bear, and several caveats deserve emphasis.
The long-range extrapolation rests on a handful of frontier data points. A high coefficient of determination on roughly eleven models is a thin basis for projecting an exponential half a decade into the future, and small changes in the fitted slope can shift the multi-year forecasts by years. [2][8]
The tasks are not representative of real work. They are well-specified, automatically scorable, and self-contained, which makes them tractable to measure but unlike the open-ended, underspecified, context-heavy jobs people actually do. METR studied this directly by scoring task "messiness" and found that a one-point increase in messiness on their scale reduced mean success rates by about 8.1%, which suggests horizons measured on clean benchmarks overstate real-world capability. [3]
Reliability is a separate axis from the headline number. A 50% horizon means even odds, and for anything you would actually delegate, the 80% horizon (several times shorter) is the more honest figure. The gap also tends to widen as tasks get longer, so a model that looks capable at 50% may be far less dependable in practice. [3]
The scope is narrow. The original measurement is dominated by software and machine-learning engineering; later METR work shows horizons vary substantially across domains, so a single number does not generalize to all kinds of tasks. [9] There are also methodological worries about whether the score can be gamed by tuning the task distribution, and commentary (including from METR itself) has stressed that the chart is widely over-interpreted as a clean law of progress when it is a noisy empirical fit on a specific suite. [7][8][10]
The time horizon has become one of the more cited inputs to AI capability forecasting because it converts model progress into a quantity that bears on autonomy. The longer the tasks a system can complete unsupervised, the more plausible it is that the system could carry out extended, multi-step plans on its own, which is part of why METR, an organization focused on evaluating frontier models for dangerous capabilities, designed the metric in the first place. The trend has been folded into timeline forecasts for transformative AI and is frequently referenced in AI safety and governance discussions, though responsible use of it means carrying along the caveats rather than treating the doubling line as a law. [1][8][10]