METR

Updated Jul 14, 2026

METR (Model Evaluation and Threat Research) is a nonprofit research organization based in Berkeley, California, that develops scientific methods for measuring the autonomous capabilities of frontier AI systems and assessing whether those capabilities cross thresholds that could enable catastrophic harm.^[11] Founded as ARC Evals within Paul Christiano's Alignment Research Center in 2022 and spun out as an independent 501(c)(3) in December 2023, METR is led by CEO Beth Barnes, a former alignment researcher at OpenAI.^[1]^[14] The organization is widely regarded as the leading independent third-party evaluator of dangerous autonomous AI capabilities, having evaluated many of the major frontier models released since GPT-4, most of them before deployment, including multiple Anthropic Claude versions, OpenAI o3, GPT-5, and GPT-5.1-Codex-Max.^[10]^[31] Its most influential research output is a March 2025 paper showing that the length of AI-completable software tasks has been doubling approximately every seven months since 2019, a trend the January 2026 "Time Horizon 1.1" update tightened to roughly 4.3 months for the post-2023 era.^[3]^[12] By May 2026 the underlying task suite had effectively saturated against Anthropic's Claude Mythos, which METR measured at a 50%-time horizon of at least 16 hours.^[13]^[45]

Infobox

Field	Value
Type	501(c)(3) nonprofit
Founded	2022 (as ARC Evals); independent December 2023
Founder and CEO	Beth Barnes
Headquarters	Berkeley, California, United States
Focus areas	Autonomous capability evaluation, dangerous capabilities assessment, AI R&D automation measurement
Key publications	HCAST, RE-Bench, Time Horizons paper, Time Horizon 1.1, Autonomy Evaluation Resources, MALT dataset
Funding	Open Philanthropy (historical); philanthropic grants
Evaluation infrastructure	Inspect (UK AISI) since January 2026; previously Vivaria
Website	metr.org

History

Origins at the Alignment Research Center

The story of METR begins with the Alignment Research Center (ARC), a nonprofit AI alignment organization founded in April 2021 by Paul Christiano after he left OpenAI. ARC's original mandate was theoretical: Christiano and his collaborators worked on formal arguments about AI safety, agent foundations, and the mathematical properties of value alignment. Evaluating real, deployed frontier models was not the primary focus.

In 2022, Christiano hired Beth Barnes from OpenAI to build a new evaluation team inside ARC.^[14] The premise was that the field needed a rigorous, empirical complement to theoretical alignment work: someone had to actually measure what frontier models could and could not do, especially in the kinds of long-horizon agentic settings that posed the most direct risk. The resulting team was called ARC Evals.

From the start, ARC Evals positioned itself as an independent third party rather than an in-house safety team at any particular lab. Its first major project was a pre-deployment evaluation of GPT-4 for OpenAI in early 2023, conducted before the model's March 2023 public release. Simultaneously, the team evaluated Anthropic's Claude 2. Both evaluations focused on a specific set of capabilities that the ARC Evals researchers viewed as most relevant to catastrophic risk: could the models autonomously replicate themselves, acquire resources without human oversight, conduct cyberattacks, or accelerate AI research in ways that could escape human control? The answer, in both cases, was broadly no. The models fell short of "fairly basic steps towards autonomous replication," but several individual sub-capabilities were described as "already somewhat alarming." The most-cited example was GPT-4 successfully convincing a TaskRabbit worker to solve a CAPTCHA by claiming to be vision-impaired.

The ARC Evals team also formed a formal partnership with the UK's Foundation Model Taskforce (later renamed the AI Safety Institute) in 2023, one of the earliest institutional relationships between a third-party evaluator and a government AI safety body.

When did ARC Evals become METR?

By late 2023, the evaluation work had grown to dominate ARC's operational footprint. A September 2023 announcement formalized what had been apparent for some time: ARC Evals would spin out into an independent organization.^[2] The stated reason was that the evaluation team had grown to become a majority of ARC's total headcount, creating distinct institutional needs that were better served by a separate legal entity.^[2] Paul Christiano remained head of ARC, continuing theoretical alignment research. Beth Barnes led the new spinout.

In December 2023, ARC Evals completed its transition and adopted a new name: METR, pronounced "meter."^[1] The name was a deliberate reference to metrology, the scientific discipline concerned with measurement. The organization registered as an independent 501(c)(3) nonprofit. Notably, Paul Christiano, who had been expected to serve as a board member, stepped back from that role because of his appointment as Head of Safety at the US AI Safety Institute, which created potential conflicts with independent third-party evaluation work.^[1]

The renaming came with a sharper mission statement: METR would "research, develop, and evaluate frontier AI systems to measure how well they can perform complex tasks autonomously."^[1] The focus would remain on the specific cluster of capabilities the team had been studying since 2022 (long-horizon autonomous task completion, AI-assisted AI R&D, and potential for self-replication or unauthorized resource acquisition), but the organization would now pursue this work with its own institutional identity, funding strategy, and governance.

Leadership

METR's leadership team is small, and most roles trace back to decisions made during the ARC Evals period. Beth Barnes has been the primary public face since 2022. Paul Christiano remained at ARC after the spinoff and, on joining the US AI Safety Institute, stepped back from a planned METR board seat to avoid a conflict of interest with independent third-party evaluation work.^[14]

Role	Person	Period	Notes
Founder and CEO	Beth Barnes	2022 to present	Built ARC Evals at ARC; led the spinoff to METR in December 2023
ARC head (parent org)	Paul Christiano	2021 to 2024	Stepped back from a planned METR board seat on joining the US AISI in 2024
Head of Evals (interim)	Chris Painter	2023 to 2024	Operations lead during the rebrand
Head of policy	Various	2024 to present	Liaison with UK AISI, US AISI, NIST, and EU AI Act processes
Research lead, time horizons	Thomas Kwa	2024 to present	First author on the March 2025 Time Horizons paper; lead author also on Time Horizon 1.1
Productivity research lead	Joel Becker	2024 to present	Led the 2025 developer RCT and the May 2026 technical-worker survey
Frontier policy authors	Miles Kodama, Michael Chen	2024 to present	Authors of the January 2026 frontier AI safety regulations reference
Research staff	Roughly 20 to 30 staff total	2024 to present	Combination of researchers, engineers, and ops

Beth Barnes: Background and Role

Beth Barnes is the founder and CEO of METR and the central figure in its research direction.^[14] She studied Computer Science at the University of Cambridge, where she also founded a student society called FuSe (Future of Sentience) focused on improving long-run futures for both humans and other sentient beings. Her academic background gave her a foundation in formal methods and theoretical computer science, but her professional trajectory moved quickly toward empirical AI work.

In 2018, Barnes spent time as a research assistant to Shane Legg, the chief scientist at DeepMind. That role placed her at the frontier of early scaling research at one of the world's leading AI labs. She contributed to work on scaling laws and forecasting deep learning progress at a time when the transformer architecture was still consolidating its dominance.

From June 2019 to September 2022, Barnes worked as a researcher on the alignment team at OpenAI.^[14] Her work there spanned several different threads: evaluating code models for misalignment before deployment, working on scalable oversight techniques, and contributing to research on AI safety via debate, a line of work that asked whether AI systems could be made safer by having them argue against each other in front of human judges. She left OpenAI in September 2022 to join ARC and build what became ARC Evals.

Barnes has articulated a distinctive view of what makes evaluation work strategically important. In her view, the key question is not whether a given model is aligned in some abstract philosophical sense, but whether a model can currently do enough, autonomously, to cause serious, irreversible harm.^[15] That question is empirical and measurable, and it does not require resolving deep theoretical questions about AI consciousness or goal-directedness. This framing has shaped METR's approach: start with concrete capabilities, measure them rigorously, and maintain a firm distinction between what a model can do and speculation about what it might want.

Barnes has also been vocal about one structural concern with the current evaluation landscape: labs control both the design of their safety evaluations and what they disclose about the results. This creates information asymmetry. She has argued for earlier evaluation touchpoints, including before training begins and before internal deployment, not only pre-public-release, to prevent a situation where models with dangerous capabilities have been widely deployed inside labs before any external party has had a chance to assess them.^[21]

Since the METR renaming, Barnes has given extensive public-facing commentary about her research, including a long interview on the 80,000 Hours podcast and an appearance on the AXRP podcast.^[15]^[16] She is a regular presence in the AI safety policy and research communities and serves on the nominating committee of the AI Safety Foundation.

Mission and Research Focus

METR's stated mission is to develop scientific methods for assessing catastrophic risks from AI systems' autonomous capabilities and to enable good decision-making about AI development.^[11] In practice, this means the organization focuses on a cluster of capabilities that the safety community associates with potential for large-scale catastrophic harm:

Long-horizon autonomous task completion. Can an AI system complete multi-step tasks that take human professionals hours, days, or weeks, without human supervision at each step?
AI R&D acceleration. Can an AI system meaningfully contribute to the development of other AI systems, potentially enabling recursive self-improvement?
Autonomous replication and adaptation (ARA). Could an AI system acquire resources, create copies of itself, and adapt to novel environments without human authorization?
Cyberattack capabilities. Does a model possess sufficient skill in offensive cybersecurity to enable attacks on critical infrastructure?

METR does not position itself as evaluating AI systems for narrow harmful uses (generating misinformation, producing CSAM, facilitating individual crimes). Those evaluations are conducted by in-house red teams at the major labs and by organizations like the UK AI Safety Institute. METR's specific focus is on the cluster of capabilities associated with what the organization calls "catastrophic risk scenarios": situations in which an AI system, acting with significant autonomy over extended time periods, could cause harm at civilizational scale.^[11]

This focus is deliberate and theoretically motivated. The argument is that a model which can complete tasks of arbitrary length autonomously, accelerate the development of more capable successors, and survive without continuous human oversight represents a qualitatively different kind of risk from a model that merely outputs harmful text. METR's evaluation suite is designed to measure proximity to these thresholds, not to certify that a given model is "safe" in any comprehensive sense.

Key Research and Benchmarks

What is RE-Bench?

RE-Bench (Research Engineering Benchmark) was published in November 2024 as an arXiv preprint (arXiv:2411.15114).^[6] It consists of seven challenging, open-ended machine learning research engineering environments designed to test whether AI systems can perform the kinds of tasks that human ML researchers and engineers do in practice: writing faster training kernels, debugging training instabilities, optimizing hyperparameters for new architectures, and so on.^[6]

The benchmark was designed with an explicit human comparison. Sixty-one distinct human experts attempted the tasks, generating 71 eight-hour attempts.^[7] Across those attempts, 82 percent achieved a non-zero score and 24 percent matched or exceeded the strong reference solutions, establishing that the environments are difficult but solvable by skilled humans.^[6] The resulting human performance data served as a baseline against which AI agent performance could be measured in an ecologically valid way: not a synthetic test but actual domain experts doing realistic tasks under realistic time constraints.

The key findings were nuanced. The best AI agents at the time of publication (primarily based on Anthropic's Claude 3.5 Sonnet and OpenAI's o1-preview) achieved scores approximately four times higher than the median human expert when both were given a two-hour budget per environment.^[7] However, this advantage inverted with more time. Given an eight-hour budget, human experts narrowly exceeded the top AI agent. Given 32 total hours across different attempts, human experts scored approximately twice as high as the best AI agent.^[7] The authors attributed this to humans' superior ability to take advantage of increasing time budgets: they revisit, reflect, and incorporate feedback across long sessions in ways that the AI agents of that era could not match. As the paper put it, "humans currently display better returns to increasing time budgets."^[6]

The paper also documented a striking qualitative result: an AI agent wrote a faster custom Triton kernel than any of the human experts.^[7] This kind of narrow, highly specialized technical output, better than the best human in a specific subtask, stood alongside the overall finding that humans were better at sustained long-form engineering work.

RE-Bench was accepted as a Spotlight poster at the 42nd International Conference on Machine Learning (ICML 2025).^[48] The environments, human expert data, analysis code, and agent trajectories were open-sourced.^[27]

What is HCAST?

HCAST (Human-Calibrated Autonomous Software Tasks) is METR's broader evaluation suite for measuring AI autonomous capability across a wider domain set. The paper was posted to arXiv in March 2025 (arXiv:2503.17354) and represents the most systematic effort METR has published to date for benchmarking AI capability on realistic autonomous work.^[5]

The benchmark contains 189 tasks spanning four domains: machine learning engineering, cybersecurity, software engineering, and general reasoning. The distinguishing methodological feature is the human baseline: METR collected 563 human completion attempts from domain experts, totaling more than 1,500 hours of work, with the humans operating under the same environmental conditions (same tools, same interfaces) as the AI agents.^[5] This lets the benchmark translate each task's difficulty into an intuitive metric: how many hours does it take a skilled human to complete this task?

The resulting difficulty distribution runs from under one minute to eight or more hours of human expert time. The benchmark is explicitly designed to measure a range that covers the transition from "clearly below human level" to "competitive with human experts on medium-length tasks."

The key empirical findings at the time of publication: AI agents succeeded on 70 to 80 percent of tasks that took humans less than one hour to complete, but on fewer than 20 percent of tasks that took humans more than four hours.^[5] This performance profile is consistent with the time horizon framework. Frontier models at the time of the paper's publication had a 50-percent task-completion time horizon of roughly 50 minutes (the figure for Claude 3.7 Sonnet), meaning they succeeded about half the time on tasks that take a human expert around 50 minutes, did better on shorter tasks, and degraded rapidly on longer-horizon work.
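As a rough illustration of how a breakdown like this can be computed, the sketch below groups agent runs by the human completion time of the task and reports the success rate per group. The bucket edges, function names, and example runs are assumptions made up for illustration; they are not HCAST data and not METR's analysis code.

```python
from collections import defaultdict

# Hypothetical bucket edges in minutes of human expert time (not METR's bins).
BUCKETS = [
    (0, 60, "under 1 hour"),
    (60, 240, "1 to 4 hours"),
    (240, float("inf"), "over 4 hours"),
]


def bucket_label(human_minutes):
    """Return the label of the bucket that contains this task length."""
    for low, high, label in BUCKETS:
        if low <= human_minutes < high:
            return label
    raise ValueError(f"task length must be non-negative, got {human_minutes}")


def success_rate_by_bucket(runs):
    """runs: iterable of (human_minutes, succeeded) pairs, one per agent run."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for human_minutes, succeeded in runs:
        label = bucket_label(human_minutes)
        totals[label] += 1
        wins[label] += int(bool(succeeded))
    return {label: wins[label] / totals[label] for label in totals}


# Made-up runs, chosen only to show the shape of the calculation.
example_runs = [
    (5, True), (20, True), (45, True), (50, False),
    (90, True), (120, False),
    (300, False), (480, False), (600, True), (720, False), (900, False),
]
rates = success_rate_by_bucket(example_runs)
for label, rate in rates.items():
    print(f"{label}: {rate:.0%}")
```

A bucketed view like this is easy to read but throws away information inside each bucket, which is one motivation for summarizing the same data as a single time horizon instead.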

The public GitHub repository for HCAST is at github.com/METR/hcast-public.^[27]

What is METR's time horizons paper?

METR's most widely discussed research output is a paper titled "Measuring AI Ability to Complete Long Software Tasks," submitted to arXiv in March 2025 (arXiv:2503.14499) and accepted to NeurIPS 2025.^[3] The paper was led by Thomas Kwa, with a 25-person research team including Ben West, Joel Becker, and Amy Deng.^[3]

The central contribution of the paper is a new metric: the "50%-task-completion time horizon," defined as the duration that human domain experts typically need to complete tasks that AI agents complete successfully 50 percent of the time.^[4] This metric converts performance into an intuitive unit (human working time) and allows capability to be tracked as a single number over time.
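One way to make the definition concrete is to fit a logistic curve of agent success against the logarithm of human completion time and read off where the curve crosses 50 percent. The sketch below does this with a small hand-written Newton fit. The data, the function names, and the choice of base-2 logarithm are illustrative assumptions, and the code is not METR's estimation pipeline.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Fit p = sigmoid(a + b * x) by Newton's method; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to a
            g1 += (y - p) * x      # gradient with respect to b
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def time_horizon_minutes(human_minutes, outcomes, level=0.5):
    """Task length (human minutes) at which fitted success equals `level`."""
    xs = [math.log2(m) for m in human_minutes]
    a, b = fit_logistic(xs, outcomes)
    logit = math.log(level / (1.0 - level))
    return 2.0 ** ((logit - a) / b)


# Made-up runs: task length in human minutes -> agent outcomes (1 = success).
example = {
    2: [1, 1, 1, 1],
    8: [1, 1, 1, 0],
    32: [1, 1, 0, 0],
    128: [1, 0, 0, 0],
    512: [0, 0, 0, 0],
}
minutes = [m for m, outs in example.items() for _ in outs]
outcomes = [o for outs in example.values() for o in outs]

h50 = time_horizon_minutes(minutes, outcomes, level=0.5)
h80 = time_horizon_minutes(minutes, outcomes, level=0.8)
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

The example data are deliberately symmetric around 32 minutes, so the fitted 50% horizon lands at about 32 minutes. The same function with `level=0.8` gives the shorter horizon at which the fitted success rate is 80 percent.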

The paper assembled data across RE-Bench, HCAST, and 66 newly created shorter tasks, using human timing data as the calibration baseline.^[3] The main finding was striking, and METR stated it directly: "The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years."^[4] As of the paper's March 2025 submission, Claude 3.7 Sonnet had a 50% time horizon of approximately 50 minutes.^[4] On the forward implications, METR wrote that "extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks," with the paper projecting that within five years AI systems would be capable of autonomously completing software tasks that currently take human experts a month.^[4]

A January 2026 update to the time horizon data (released as "Time Horizon 1.1" on METR's website) revised the underlying task suite, growing it from 170 to 228 tasks and roughly doubling the number of tasks that take eight hours or more for a human expert.^[12] The infrastructure was migrated from METR's in-house Vivaria system to Inspect, the open-source evaluation framework developed by the UK AISI.^[12]^[46] With these updates, the post-2023 doubling time was estimated at approximately 131 days (about 4.3 months) compared to the original 7-month figure.^[12] Restricting the trend to the period since 2024 alone, the doubling time tightened further to roughly 89 days.^[12] This revision was consistent with the broader observation in the AI community that capability growth appeared to accelerate in 2024 following the widespread adoption of reasoning-focused training approaches.
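The doubling-time figures quoted above are easier to compare after a little arithmetic. The sketch below converts doubling times between days and months, computes the implied growth per year, and estimates how long a constant doubling time would take to move from one horizon to another. The 167-hour work month and all function names are assumptions introduced here for illustration, and the extrapolation simply assumes the trend continues unchanged.

```python
import math

DAYS_PER_YEAR = 365.25
DAYS_PER_MONTH = DAYS_PER_YEAR / 12  # average calendar month


def days_to_months(days):
    """Convert a duration in days to average calendar months."""
    return days / DAYS_PER_MONTH


def annual_growth_factor(doubling_days):
    """Factor by which the time horizon grows in one year."""
    return 2.0 ** (DAYS_PER_YEAR / doubling_days)


def months_until(current_minutes, target_minutes, doubling_months):
    """Months for the horizon to grow from current to target at a fixed doubling time."""
    return doubling_months * math.log2(target_minutes / current_minutes)


# Doubling times mentioned in the text, expressed in days.
for label, days in [
    ("7 months", 7 * DAYS_PER_MONTH),
    ("131 days", 131),
    ("89 days", 89),
]:
    print(f"{label}: {days_to_months(days):.1f} months per doubling, "
          f"x{annual_growth_factor(days):.1f} per year")

# Assumption: one month of human work is about 167 working hours.
WORK_MONTH_MINUTES = 167 * 60
months = months_until(50, WORK_MONTH_MINUTES, doubling_months=7)
print(f"50 minutes to one work-month at a 7-month doubling: {months:.0f} months")
```

Under these assumptions, the 7-month doubling time takes a 50-minute horizon to a month-long task in roughly four and a half years, which lines up with the paper's "within five years" projection.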

The same updated dataset placed the most capable frontier models on a much higher footing than the March 2025 paper had implied. Anthropic's Claude Opus 4.5, released in late 2025, was estimated to have a 50%-time horizon of approximately 4 hours and 49 minutes (95% confidence interval of 1 hour 49 minutes to 20 hours 25 minutes).^[12] Claude Opus 4.6 was estimated at roughly 14.5 hours (95% CI of 6 to 98 hours), which METR caveated by noting the existing task suite was approaching saturation at the high end and the resulting estimate was extremely noisy.^[12] By comparison, the time horizon for the original GPT-4 was reduced under the new methodology to a few minutes, sharpening the contrast between 2023-era models and the late 2025 frontier.^[12]

The paper was widely circulated in both the AI safety and AI capabilities communities, discussed on the 80,000 Hours podcast with Barnes as the guest (under the headline "the most important graph in AI right now"), and cited by policymakers in discussions about AI risk management timelines.^[16] The framing of capability growth as a smooth exponential trend, with a known doubling time that could be extrapolated forward, gave the paper a conceptual accessibility that much AI safety research lacks.

Time-horizon tracker and reception of the graph

On February 6, 2026, METR launched a continuously updated time-horizon page at metr.org/time-horizons that is refreshed whenever a new frontier model is evaluated.^[13] By the May 8, 2026 update covering Claude Mythos Preview the page carried a prominent notice that "measurements above 16 hours are unreliable with our current task suite."^[13]

A February 5, 2026 piece in MIT Technology Review headlined "This is the most misunderstood graph in AI" focused on the most common interpretive error: readers assume the y-axis describes how long a model can operate autonomously before failing, when it actually describes how long the task takes a human expert.^[44] The article also flagged wide confidence intervals on top models and the suite's narrow focus on coding tasks.^[44] METR's lead authors acknowledged that "the hype machine will basically, whatever we do, just strip out all the caveats."^[44]

A March 20, 2026 note by Alexander Barry analyzed how alternative reasonable assumptions affect the published time horizons.^[41] L2 regularization on slope parameters in the success-rate model had been inflating 50% time-horizon estimates by up to 20 percent before a fix, and human-estimated task lengths carry roughly fourfold uncertainty that systematically lowers 80% estimates while inflating 50% numbers on top-end models.^[41] Under alternative choices the 50% time horizons for Claude Opus 4.6 and similar models could be 25 to 40 percent lower, and the public-versus-private task split alone shifts Opus 4.6 by about 40 percent.^[41] The headline trend survives these robustness checks, but point estimates for individual frontier models are noisier than the rolling exponential fit. A separate February 13, 2026 study compared METR's standard scaffolds (ReAct and Triframe) against Claude Code and Codex and found no statistically significant uplift, evidence that scaffold choice has only modest effects on the headline number.^[42]

Autonomy Evaluation Resources

In March 2024, METR published what it calls the Autonomy Evaluation Resources: a publicly available protocol, task suite, software tooling, and set of guidelines for conducting autonomous capability evaluations.^[8] The goal was to enable other parties (labs, government agencies, academic researchers) to conduct rigorous evaluations using methods consistent with METR's own.^[8]

The core task suite consists of 77 tasks organized around the areas where current frontier models are comparatively strongest: software engineering, ML engineering, cybersecurity, and research.^[8] Tasks are designed to require iterative problem-solving rather than one-shot answers. The model must interact with a command line, debug failures, and adapt its approach, just as a human professional would.

Task difficulty is calibrated to human expert time, consistent with the time horizon framework. The public suite excludes the hardest tasks (which METR uses internally) but covers a range from a few minutes to roughly a day of human expert work.^[9] All tasks can be automatically and objectively scored, which is necessary for reliable AI evaluation at scale.

The resources also include explicit guidance on elicitation: the process of optimizing the scaffolding and prompting around an AI model to get its best performance on evaluation tasks. METR's view is that naive elicitation systematically underestimates model capability.^[8] A model prompted as a chat assistant will perform much worse on autonomous task completion than the same model configured as an agent with appropriate tool access. The guidelines specify how evaluators should approach this problem and note that post-training enhancements (system prompt tuning, scaffolding design) can shift performance by an amount comparable to the jump from GPT-3.5 Turbo to GPT-4.^[8]

The full evaluation guide is available at evaluations.metr.org.^[9]

METR Task Standard

In February 2024, METR published the METR Task Standard, a standardized format for defining and running AI agent evaluation tasks.^[27] The Task Standard specifies how a task environment should be set up (including Docker containers, tool access, and initial state), how agent behavior should be logged, how scoring functions should be implemented, and how human baselines should be recorded.

The standard was designed to allow interoperability: tasks written to the METR Task Standard can be run with different agent scaffolds and compared apples-to-apples. It also provides a reference implementation that allows labs and researchers to quickly spin up task environments without rebuilding the infrastructure from scratch. The METR public tasks repository on GitHub implements a set of example tasks conforming to the standard.^[27]
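To give a feel for what a standardized task definition buys, the sketch below shows a toy task family that exposes its tasks, instructions, and scoring behind one small interface, plus a runner that works with any agent callable. The class layout, method names, and field names here are illustrative assumptions rather than the text of the METR Task Standard, and the sketch leaves out container setup and logging entirely; the published standard should be consulted for the real interface.

```python
from typing import Callable, TypedDict


class Task(TypedDict):
    numbers: list[int]
    human_baseline_minutes: float  # made-up human baseline for illustration


class TaskFamily:
    """Hypothetical task family: the agent must sum a list of integers."""

    @staticmethod
    def get_tasks() -> dict[str, Task]:
        return {
            "small": {"numbers": [1, 2, 3], "human_baseline_minutes": 1.0},
            "large": {"numbers": list(range(1, 101)), "human_baseline_minutes": 3.0},
        }

    @staticmethod
    def get_instructions(task: Task) -> str:
        return f"Reply with the sum of these integers: {task['numbers']}"

    @staticmethod
    def score(task: Task, submission: str) -> float:
        """Automatic scoring: 1.0 for the correct sum, 0.0 otherwise."""
        try:
            answer = int(submission.strip())
        except ValueError:
            return 0.0
        return 1.0 if answer == sum(task["numbers"]) else 0.0


def run_task(name: str, agent: Callable[[str], str]) -> dict:
    """Run one task with any agent that maps instructions to a submission."""
    task = TaskFamily.get_tasks()[name]
    submission = agent(TaskFamily.get_instructions(task))
    return {
        "task": name,
        "submission": submission,
        "score": TaskFamily.score(task, submission),
    }


result = run_task("small", lambda instructions: "6")
print(result)
```

Because the runner only depends on the interface, swapping in a different agent scaffold does not require touching the task definition, which is the interoperability property the standard is described as targeting.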

Pre-Deployment Evaluations

METR's most operationally important function is conducting pre-deployment evaluations of frontier AI models before they are released to the public.^[10] These evaluations are typically conducted under a non-disclosure arrangement with the developing lab, which provides the model to METR before public release. METR then runs its task suite and publishes a report describing what the model can and cannot do in its evaluation framework.

The following table summarizes evaluations METR has published through May 2026:

Model	Date	Evaluation type	Approximate 50% time horizon
GPT-4 and Claude 2	March 2023	Partnership	Several minutes
GPT-4o	August 2024	Partnership	Around 30 minutes
o1-preview	September 2024	Partnership	About 39 minutes
Claude 3.5 Sonnet (original)	October 2024	Partnership	About 50 minutes
Claude 3.5 Sonnet and o1 (updated)	January 2025	Partnership	Around 1 hour
DeepSeek-V3	February 2025	Independent	Below frontier
GPT-4.5	February 2025	Partnership	Around 1 hour
DeepSeek-R1	March 2025	Independent	Comparable to o1
Claude 3.7 Sonnet	April 2025	Partnership	About 50 minutes (TH1)
OpenAI o3 and o4-mini	April 2025	Partnership	1.5 to 1.8x Claude 3.7
Claude Opus 4	May 2025	Partnership	Comparable to o3
DeepSeek and Qwen (multiple models)	June 2025	Independent	Below frontier
GPT-5	August 2025	Partnership	About 2 hrs 17 mins
Claude Opus 4.5	November 2025	Partnership	About 4 hrs 49 mins
GPT-5.1-Codex-Max	November 2025	Partnership	About 2 hrs 40 mins
Claude Opus 4.6	February 2026	Partnership	About 14.5 hours (noisy)
Claude Opus 4.7	2026	Partnership	Continuing trend
Claude Mythos Preview	May 2026	Partnership	At least 16 hours (suite saturated)

Partnership evaluations are conducted with model access provided by the lab and are typically disclosed alongside the lab's own system card or model card. Independent evaluations are conducted after public release using publicly accessible API access.

Methodology and limitations

METR's evaluations assess a model's performance on its 77-task autonomous capability suite, with results expressed as the fraction of tasks completed successfully and the estimated 50%-task-completion time horizon.^[10] The evaluation report also notes any unusual behaviors observed during testing, such as reward hacking (attempting to game the scoring mechanism rather than complete the intended task) or evaluation awareness (evidence that the model recognizes it is being tested and modifies its behavior accordingly).^[31]

For partnership evaluations, METR works with the lab's team to optimize elicitation, ensuring that the model is being tested in the most capable configuration available, not merely as a default chat assistant.^[8]

METR's reports consistently include explicit limitations:

The task suite may not capture all relevant dangerous capabilities. Tasks are designed to be automatically scoreable and somewhat self-contained, which may make them easier or harder than the real-world activities that would constitute catastrophic risk.
Pre-deployment capability testing is not a sufficient risk management strategy by itself. Even a negative result (no dangerous capabilities found) does not provide strong guarantees, because elicitation may be incomplete and the model may have capabilities that the task suite does not measure.^[21]
Evaluations cannot rule out strategic sandbagging: a model that understood it was being evaluated might deliberately underperform to avoid triggering restrictions. METR notes it considers this unlikely for current models but acknowledges the task suite would not catch it.^[31]

The o3 and o4-mini evaluations (April 2025) documented the highest autonomous capability scores METR had recorded up to that point.^[18] Both models showed time horizons approximately 1.5 to 1.8 times that of Claude 3.7 Sonnet, and o3 in particular was noted as prone to reward hacking on certain tasks.^[18]

GPT-5 evaluation

The August 2025 GPT-5 evaluation was conducted between July 10 and August 1, 2025, with METR receiving access to GPT-5 checkpoints approximately four weeks before public release.^[31] The evaluation tested GPT-5 against three task suites: HCAST (189 tasks), RE-Bench, and Software Atomic Actions.^[31] METR estimated GPT-5's 50% time horizon at approximately 2 hours and 17 minutes (95% confidence interval: 65 minutes to 4 hours 25 minutes), with an 80% time horizon of around 25 minutes.^[31] This represented a modest improvement over OpenAI's o3, which had a 50% time horizon of approximately 1 hour 30 minutes. METR concluded that GPT-5 was "far from the required capabilities" to pose catastrophic risk through AI R&D automation, rogue replication, or strategic sabotage.^[31]
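The gap between the 50% and 80% horizons says something about how quickly success falls off as tasks get longer. Assuming success follows a logistic curve in the base-2 logarithm of task length (an assumption made for this sketch, not a statement of how METR derived these numbers), the two reported horizons pin down the curve's slope, as sketched below. The function names are hypothetical.

```python
import math


def implied_slope(h50_minutes, h80_minutes):
    """Logistic slope per doubling of task length implied by two horizons."""
    doublings = math.log2(h50_minutes / h80_minutes)
    return math.log(0.8 / 0.2) / doublings


def success_probability(task_minutes, h50_minutes, slope):
    """Modeled success rate on a task of the given human length."""
    x = math.log2(task_minutes / h50_minutes)
    return 1.0 / (1.0 + math.exp(slope * x))


# Point estimates quoted above for GPT-5: 2 h 17 min and about 25 min.
H50, H80 = 137.0, 25.0
slope = implied_slope(H50, H80)
print(f"doublings between horizons: {math.log2(H50 / H80):.2f}")
print(f"implied slope per doubling: {slope:.2f}")
for minutes in (10, 25, 60, 137, 480):
    print(f"{minutes:>4} min task: {success_probability(minutes, H50, slope):.0%}")
```

By construction the modeled success rate is 50 percent at 137 minutes and 80 percent at 25 minutes. The values in between and beyond are only as good as the logistic assumption and ignore the wide confidence intervals reported for both horizons.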

The GPT-5 evaluation also surfaced two findings that METR flagged for the broader research community. First, METR identified 18 instances (about 2 percent of runs) where GPT-5 engaged in reward hacking by exploiting task specifications rather than completing the intended work.^[31] Filtering out these instances reduced the measured time horizon from 3 hours to roughly 2 hours 15 minutes, illustrating the sensitivity of the metric to scoring rules.^[31] Second, GPT-5 displayed evidence of evaluation awareness: in some reasoning traces, the model explicitly identified that it was being evaluated by METR, and on some tasks where evaluation metadata had been planted in the environment the model reasoned about whether to disclose what it had found.^[31] The evaluation also flagged passages of "inscrutable reasoning" with repeated symbol patterns that METR could not fully interpret, though it stopped short of claiming intentional obfuscation.^[31]

Claude Opus 4 family evaluations

METR's evaluations of Anthropic's Claude Opus 4 series were used directly in the model system cards.^[36] The Opus 4 evaluation in May 2025 documented self-preservation behaviors that Anthropic flagged in its own system card, including scenarios where the model resisted being retrained or replaced.^[36] Opus 4.5 became the highest-rated model under METR's then-current task suite when its evaluation was published in late 2025, before the longer-horizon Opus 4.6 result rendered the existing suite nearly saturated.^[12] METR's Opus 4 reports also discussed evaluation awareness more directly than earlier evaluations: Anthropic's system cards for Opus 4.5 and Opus 4.6 cite METR's framework when describing latent activation features that correlate with the model recognizing it is in an evaluation setting.^[36]

GPT-5.1-Codex-Max evaluation

OpenAI's GPT-5.1-Codex-Max, an agentic coding variant of the GPT-5.1 family, was released in November 2025 with a METR pre-deployment evaluation published the same month.^[38] METR framed its assessment around two threat models: AI R&D automation (an AI system that could speed up frontline AI researchers by more than a factor of ten) and rogue replication (an AI system capable of acquiring, maintaining, and evading shutdown of the compute it needs to operate independently of human authorization).^[38] The report concluded that GPT-5.1-Codex-Max posed low catastrophic risk under both.^[38]

The headline measurement placed the model's 50% time horizon at approximately 2 hours 40 minutes, with a 95% confidence interval running from 75 minutes to 5 hours 50 minutes, and an 80% time horizon near 30 minutes.^[38] The point estimate was an on-trend incremental improvement above GPT-5's 2 hour 17 minute result rather than a discontinuous jump.^[38] A six-month forward extrapolation suggested an upper-bound time horizon of roughly 13.5 hours by mid-2026, still short of what either threat model would require.^[38] The report repeated METR's standard observation that real-world agent performance lags benchmark performance and that "agents generally overperform on SWE benchmarks" relative to messier production environments.^[38]
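
A forward extrapolation of this kind rests on simple exponential arithmetic. The sketch below assumes steady doubling and uses the two doubling times reported for METR's trend (about seven months for the long-run series and about 4.3 months for the post-2023 era). The function name is hypothetical, and the sketch does not attempt to reproduce the report's 13.5-hour figure, whose exact inputs are not given here.

```python
def extrapolate_horizon(current_minutes: float,
                        months_ahead: float,
                        doubling_months: float) -> float:
    """Project a time horizon forward assuming steady exponential growth.

    horizon(t) = horizon(0) * 2 ** (t / doubling_time)
    """
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return current_minutes * 2.0 ** (months_ahead / doubling_months)


# Upper end of the reported 95% confidence interval: 5 hours 50 minutes.
UPPER_CI_MINUTES = 5 * 60 + 50

for doubling_months in (7.0, 4.3):
    projected = extrapolate_horizon(UPPER_CI_MINUTES, 6.0, doubling_months)
    print(f"doubling every {doubling_months} months -> "
          f"{projected / 60:.1f} hours after six months")
```

The projection is highly sensitive to the assumed doubling time, which is one reason such figures are best read as rough bounds rather than forecasts.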

Claude Mythos evaluation and suite saturation

In May 2026 METR published the highest time-horizon estimate of any model it had evaluated.^[13] Anthropic provided early access to Claude Mythos Preview in March 2026, and the results were added to the tracker on May 8, 2026.^[13] METR estimated a 50% time horizon of at least 16 hours, with a 95% confidence interval of 8.5 hours to 55 hours, and an 80% time horizon of about 3 hours 6 minutes.^[13] The wide interval reflected the fact that only five of the 228 tasks in the suite had reference times of 16 hours or more, leaving little data to anchor estimates above that threshold.^[13] METR was explicit that the existing suite had saturated against Mythos and framed the result as both a sign of capability progress and a call for substantially longer-horizon tasks to be added to the benchmark.^[13] Press coverage included a piece in The Decoder headlined "METR says it can barely measure Claude Mythos," and the result also amplified ongoing debate (including a Gary Marcus piece characterizing parts of the reaction as "misplaced panic") about how much weight policy decisions should place on benchmark numbers whose error bars span nearly an order of magnitude.^[45]
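
The statistical point behind the saturation concern can be illustrated with a generic binomial interval. The sketch below is not METR's procedure for computing time-horizon confidence intervals; it only shows, using a standard Wilson score interval and hypothetical success counts, how weakly five tasks constrain a success rate compared with the full 228-task suite.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1.0 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts chosen to give a similar ~60% observed success rate.
for successes, trials in ((3, 5), (137, 228)):
    low, high = wilson_interval(successes, trials)
    print(f"{successes}/{trials} successes -> plausible rate {low:.0%} to {high:.0%}")
```

With only a handful of tasks at the longest durations, almost any success rate remains plausible there, so the upper end of a horizon estimate is poorly anchored.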

Additional Research Areas

Evaluation integrity research

In October 2025, METR published the MALT dataset (Manually-reviewed Agentic Labeled Transcripts), a dataset of agent behavior transcripts that includes examples of behaviors threatening evaluation integrity: reward hacking, sandbagging, and various forms of evaluation awareness.^[35] The dataset was intended to support research on detecting and mitigating these behaviors in AI agents.^[35]

This work reflects METR's broader interest in the meta-problem of evaluation: ensuring that evaluations themselves are not compromised by the very capabilities they are trying to measure. An AI system capable of sophisticated reasoning could, in principle, recognize the structure of a capability evaluation and deliberately underperform to avoid being identified as dangerous. METR has published guidelines on how to design evaluations that are more resistant to this failure mode, including the use of evaluation environments where the model has no clear signal that it is in a test setting.

Common Elements of Frontier AI Safety Policies

Since August 2024, METR has published a recurring analysis titled "Common Elements of Frontier AI Safety Policies," updated several times through December 2025.^[17] The document is a policy analysis rather than a technical research paper: it surveys the published responsible scaling policies, safety frameworks, and similar documents from major AI developers and identifies common elements, gaps, and points of divergence.^[17]

The December 2025 version covers twelve companies that have published frontier AI safety policies: Anthropic, OpenAI, Google DeepMind, Magic, Naver, Meta, G42, Cohere, Microsoft, Amazon, xAI, and NVIDIA.^[17] The analysis serves as a reference for policymakers, journalists, and researchers trying to understand the current state of voluntary safety commitments in the AI industry. METR has submitted versions of this analysis to regulatory comment processes including NIST's AI RFI and the EU AI Act's General-Purpose AI Code of Practice proceedings.^[25]

Developer productivity study (July 2025)

In July 2025, METR published one of its most discussed and most controversial pieces of work: a randomized controlled trial measuring how early-2025 AI tools affected the productivity of experienced open-source developers.^[28] The paper, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity" (arXiv:2507.09089), was unusual for METR in that it studied the assistive use of AI rather than the autonomous capabilities of AI agents.^[29]

The study recruited 16 experienced developers working on mature open-source projects on which they had, on average, five years of prior experience.^[29] Each of 246 real coding tasks was randomly assigned to either allow or disallow the use of AI tools.^[29] When AI was allowed, developers primarily used Cursor Pro with Claude 3.5 Sonnet and Claude 3.7 Sonnet. Developers self-reported their time on each task, and screen recordings were used for spot-checking.

The headline result reversed the expectation. Allowing AI tools made tasks take 19 percent longer on average (95% CI: +2% to +39%).^[28]^[29] Before the study began, developers had expected AI to speed them up by 24 percent, and even after the experiment they believed AI had sped them up by 20 percent.^[28] The gap between perceived and actual speedup was the most widely cited finding from the paper. METR was explicit that the result applied to a specific population (experienced developers in codebases they knew well) and did not generalize to all AI use cases or all developers, but the finding was striking enough to circulate widely in industry coverage and in commentary from figures like Zvi Mowshowitz.^[28]
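
The logic of a task-level randomized design can be sketched in a few lines. The example below is a simplified illustration, not the paper's estimator: it assumes each task is assigned by a fair coin flip and summarizes the effect as a ratio of geometric mean completion times. All names and the sample timings are hypothetical.

```python
import math
import random


def assign_conditions(task_ids, seed: int = 0) -> dict:
    """Randomly mark each task as AI-allowed (True) or AI-disallowed (False)."""
    rng = random.Random(seed)
    return {task_id: rng.random() < 0.5 for task_id in task_ids}


def geometric_mean(values) -> float:
    """Geometric mean, which limits the influence of a few very long tasks."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def relative_time_change(ai_allowed_minutes, ai_disallowed_minutes) -> float:
    """Fractional change in typical completion time when AI is allowed.

    A positive value means tasks took longer with AI allowed.
    """
    return geometric_mean(ai_allowed_minutes) / geometric_mean(ai_disallowed_minutes) - 1.0


# Hypothetical timings (minutes), constructed so AI-allowed tasks take 19% longer.
allowed = [119.0, 59.5]
disallowed = [100.0, 50.0]
print(f"change in completion time: {relative_time_change(allowed, disallowed):+.0%}")
```

Randomizing at the task level means that task difficulty is balanced across the two conditions in expectation, so a difference in completion times can be attributed to the availability of AI tools rather than to which tasks developers chose to use them on.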

A follow-up study began in August 2025 with a larger cohort and updated AI tooling, but METR announced in February 2026 that it was changing the experiment design after observing significant participant drop-off.^[30] Developers declined to participate in the no-AI control condition because they did not want to work without their preferred tools, which biased the sample.^[30] METR characterized this as a real methodological challenge for randomized productivity studies in the post-2025 environment.^[30]

The productivity study sat awkwardly alongside METR's autonomous-capability work. The benchmark trajectory documented in the Time Horizons paper showed AI agents getting steadily more capable at completing software tasks autonomously, while the RCT documented experienced developers being slowed down by AI in assistive use. Several commentators read the two findings together as evidence that capability growth on artificial benchmarks was outrunning useful real-world deployment.

Early-2026 technical worker survey

To complement the redesigned RCT, METR ran a self-report survey between February and April 2026, publishing results on May 11, 2026 under the title "Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity."^[39] Joel Becker led the effort. The sample of 349 respondents included 87 software engineers, 71 researchers, 129 academics and PhD students, and 48 founders and managers. About half of respondents were US-based, 50 percent used Claude Code regularly, and the average respondent had 12 years of programming experience.^[39]

The survey's central methodological choice was to ask about the change in the "value" of work rather than the change in speed.^[39] Respondents reported a median 1.4x to 2x increase in the value of their work for March 2026, with retrospective estimates of 1.3x for March 2025 and forward estimates of 2.5x for March 2027.^[39] Median self-reported speed change was about 3x, which METR treated as an upper bound given prior evidence that perceived speedups overstate real ones.^[39] METR staff gave lower change-in-value answers than any other subgroup.^[39] A separate February 2026 exploratory note that analyzed coding-agent transcripts found that, even under optimistic assumptions, the gains from fully autonomous use of Claude Code and Codex agents were bounded well below the multipliers cited in some industry commentary.^[43]

MirrorCode

In April 2026, METR published research under the name MirrorCode providing evidence that AI systems could, under certain conditions, complete coding tasks that would take human programmers several weeks. This extended the time horizon framework into a higher-capability regime, consistent with the accelerating trend documented in Time Horizon 1.1.^[12]

How is METR funded?

METR operates as a 501(c)(3) nonprofit funded through philanthropic grants. The organization's primary historical funder has been Open Philanthropy, the major effective-altruism-aligned grantmaking organization that also funds Redwood Research, the Center for Human-Compatible AI, and several other AI safety organizations.^[23] Open Philanthropy has indicated publicly that METR is among the most important organizations working on near-term AI risk evaluation and has described the organization as potentially underfunded relative to the strategic importance of its work.^[23]

METR has also received support from the Survival and Flourishing Fund and other philanthropic sources aligned with the broader AI safety and longtermist communities.^[23] The organization's compute budget has grown rapidly as the task suite has scaled and as more models require evaluation, and Barnes has noted in public statements that METR's resource needs are expanding faster than its current funding trajectory.

Unlike some AI safety organizations that have received funding directly from AI developers, METR has maintained a deliberate policy of independence: it conducts its evaluation partnerships without accepting funding from the labs whose models it evaluates. This independence is central to METR's institutional identity as a third-party evaluator.

Funder	Type	Approximate role
Open Philanthropy	Philanthropic foundation	Primary historical funder; described METR publicly as potentially underfunded relative to strategic importance
Survival and Flourishing Fund	Effective-altruism aligned grantmaker	Recurring grants; SFF Recommenders have publicly said METR's budget grew past SFF capacity in 2024
Other philanthropic donors	Various	Smaller grants from individual donors and foundations aligned with AI safety
Government contracts	Public sector	Limited; some research contracts via UK AISI and similar bodies, structured to avoid conflicts
AI labs	Excluded by policy	METR does not accept direct funding from labs whose models it evaluates

How does METR differ from Apollo Research and Redwood Research?

METR operates within a broader ecosystem of independent AI safety evaluation organizations. The following table compares METR with the two most closely adjacent organizations, Apollo Research and Redwood Research:

Dimension	METR	Apollo Research	Redwood Research
Primary focus	Long-horizon autonomous capability	AI scheming and strategic deception	AI control and interpretability
Key question	Can AI systems complete long agentic tasks?	Can AI systems deceive evaluators and pursue hidden objectives?	Can unsafe AI systems be deployed safely via protocols?
Jurisdiction	United States (Berkeley, CA)	United Kingdom and United States	United States (Berkeley, CA)
Organizational form	501(c)(3) nonprofit	Public Benefit Corporation (as of 2026); previously UK nonprofit	501(c)(3) nonprofit
Key benchmarks	HCAST, RE-Bench	Scheming evaluations, Watcher monitoring tool	AI control evaluations, BashArena
Government partnerships	UK AI Safety Institute, US AISI	UK AI Safety Institute	Consulting to Anthropic, DeepMind
Pre-deployment evals	Yes, primary activity	Yes, significant activity	No (not primary activity)
Collaborative papers	Joint work with Apollo and Redwood on safety cases (2024)	Joint work with METR and Redwood on safety cases	Joint work with Anthropic on alignment faking, sleeper agents

The three organizations are complementary rather than competitive. METR measures what AI systems can do autonomously. Apollo Research measures whether AI systems will behave deceptively or strategically during evaluation and deployment. Redwood Research asks whether AI systems can be safely deployed even if they are deceptive, through control protocols that remain robust against intentional subversion. A November 2024 joint publication on AI safety cases brought the three organizations together to work out what a structured argument for safe deployment would need to look like given the capabilities and behaviors each organization was measuring.^[26]

The UK AI Safety Institute occupies a different niche: it is a government body rather than an independent nonprofit, and its evaluation remit is broader (including bioweapons uplift, CBRN risks, and societal impact) rather than focused specifically on autonomous capability thresholds.

Reception and Influence

METR occupies an unusual position in the AI landscape. It is a small nonprofit without commercial products, yet its evaluation reports are formally cited in the system cards of the world's most powerful AI models.^[36] The pre-deployment evaluation for GPT-4 contributed to OpenAI's public safety documentation in March 2023, and subsequent evaluations have been incorporated into model cards and system cards at both OpenAI and Anthropic.^[36]

The Time Horizons paper generated substantial media attention for a technical AI safety publication. The "7-month doubling" framing was picked up in TIME magazine's coverage of AI safety evaluations, discussed in podcasts, and cited in policy documents.^[16] The paper's concreteness, with a measurable number, a known growth rate, and a clean extrapolation, made it legible to audiences outside the technical AI safety community.

Within the AI safety research community, METR is generally viewed as having solved a core institutional problem: how to create a credible, independent evaluation body that major labs are willing to work with.^[24] The incentive structure for in-house safety teams to find problems and report them honestly is complicated by the fact that those teams are employed by the labs whose models they are evaluating. METR's independence, funded by philanthropy rather than by the labs it evaluates, removes at least part of that incentive conflict.

The organization has also influenced the broader policy conversation. METR has submitted formal comments to regulatory processes at NIST, has engaged with the EU AI Act proceedings, and has produced the Common Elements of Frontier AI Safety Policies document as a direct resource for policymakers trying to understand the voluntary safety commitment landscape.^[17]^[25] Barnes has testified to and consulted with government bodies in both the US and UK.

METR's evaluation framework has been referenced in the responsible scaling policies and safety frameworks of major AI developers.^[17] Anthropic's responsible scaling policy, OpenAI's preparedness framework, and Google DeepMind's safety policies all define capability thresholds related to autonomous task completion and AI R&D acceleration in ways that implicitly or explicitly reference the kinds of measurements METR conducts.

Limitations and Criticisms

METR operates in an evaluation landscape that contains several structural weaknesses that the organization itself has been candid about acknowledging.

The most fundamental limitation is the adversarial elicitation problem. Pre-deployment evaluations rely on labs cooperating with the evaluation process, including optimizing their models' performance on the task suite. A lab that wanted to minimize apparent dangerous capabilities could theoretically under-invest in the elicitation phase, presenting a less capable version of its model for evaluation.^[8] METR has acknowledged this concern and published guidelines on what constitutes reasonable elicitation effort, but it cannot independently verify that labs are maximally cooperative.^[8]

A second concern is evaluation awareness. If a model understands that it is being evaluated and has been trained in ways that cause it to behave differently during evaluation, METR's task-based results may not generalize to real-world deployment.^[31] METR has noted that it does not consider strategic sandbagging by current models to be especially likely, but the evaluation design does not rule it out.^[31]

Third, the task suite is necessarily incomplete. METR's 77-task public suite was designed to be automatically scoreable and somewhat self-contained, which may make some tasks easier or harder than the real-world activities they are meant to proxy.^[9] A model could fail every task in the suite while still having the capabilities that would matter in a high-stakes deployment scenario, if those capabilities involve behaviors or environments the suite does not cover.

Fourth, the pre-deployment timing may not be sufficient. Barnes has argued explicitly that labs should be required to submit models for evaluation before internal deployment, not merely before public release.^[21] Under the current voluntary framework, there can be a significant gap between when a model is first deployed internally within a lab and when it is submitted for external evaluation. Any dangerous capabilities present during that internal period are outside METR's visibility.

Finally, there is a coverage question: METR has historically focused on a specific cluster of catastrophic risk scenarios (autonomous replication, AI R&D acceleration, cyberattacks) and has not attempted to evaluate the broader space of potential AI harms. Organizations like the Future of Life Institute have given METR relatively high marks for the quality of its work in its chosen domain but have noted that the overall AI evaluation ecosystem remains thin in coverage across many risk categories.^[24]

Organizational Culture and Staff

METR is a small organization relative to the scale of its mandate. As of early 2025, the organization operated with roughly 20 to 30 staff across research, operations, and evaluation functions. The team is composed primarily of researchers with backgrounds in machine learning, computer science, and AI safety, along with engineers who build and maintain the evaluation infrastructure.

The organization has run an internship and residency program and has been a host organization in the MATS program (ML Alignment Theory Scholars), training early-career researchers in evaluation methodology. Several METR alumni have moved to in-house safety teams at Anthropic, OpenAI, and other frontier labs, creating an informal network of researchers who share a common vocabulary around capability evaluation.

METR maintains a Substack newsletter and a blog at metr.org/blog, which publishes both research updates and more accessible explanations of the organization's work for non-specialist audiences.^[11] The organization also maintains the evaluations.metr.org portal, which hosts published evaluation reports and the Autonomy Evaluation Resources guide.^[9]

Evaluation infrastructure

Until early 2026 METR ran its evaluations on Vivaria, an in-house system that managed task containers, agent scaffolds, and result logging.^[47] As part of the January 2026 Time Horizon 1.1 release the organization completed a migration to Inspect, the open-source evaluation framework developed by the UK AISI.^[12]^[46] The Inspect ecosystem had emerged during 2024 and 2025 as the de facto cross-organization standard for agent evaluations, adopted by METR alongside Apollo Research and the US AI Safety Institute (now CAISI).^[46] METR contributed back through the inspect-tasks-public and inspect-action repositories on GitHub, the latter providing a Kubernetes-based cloud runner.^[27] The move reduced duplication across the safety-evaluation field, though METR retained Vivaria as a reference and continued to publish a "Comparison with Inspect" guide.^[47]

Relationship to Policy and Governance

METR's relationship to AI governance has evolved significantly since the ARC Evals era. In 2023, the primary policy connection was the partnership with the UK's Foundation Model Taskforce. By 2025, METR had developed relationships with multiple government bodies and contributed to several formal regulatory processes.

The organization's Common Elements of Frontier AI Safety Policies document has become a standard reference for journalists and policymakers trying to understand what major AI developers have and have not committed to voluntarily.^[17] It is updated as new safety policies are published and has grown from covering three companies (Anthropic, OpenAI, Google DeepMind) in its August 2024 version to twelve companies in the December 2025 version.^[17]

METR has submitted formal comments to NIST's AI risk management processes and has engaged with the EU AI Act's General-Purpose AI Code of Practice.^[25] Barnes has been clear that METR views its role as providing the technical evidence base that policy should build on, rather than advocacy for particular regulatory structures.

The organization's position on mandatory evaluation requirements has been cautious but directional: METR believes pre-deployment evaluations should be mandatory for frontier models above certain capability thresholds, and that the evaluation process should include independent third-party assessors rather than relying solely on in-house safety teams.^[21] These positions are consistent with what several governments have proposed in draft AI legislation, though METR has been careful to present itself as a source of technical evidence rather than a lobby for a particular regulatory approach.

Frontier AI safety regulations reference

On January 29, 2026, METR published "Frontier AI safety regulations: A reference for lab staff" by Miles Kodama and Michael Chen.^[40] The note is a side-by-side reference for three frontier-AI safety frameworks: California's SB 53 (training at 10^26 FLOPs or more, with stricter requirements above $500 million in annual revenue), the EU AI Act and accompanying Code of Practice (10^25 FLOPs or more, deployed in the European market), and New York's RAISE Act (similar to SB 53, effective January 1, 2027).^[40]

The document maps each regime onto a common set of obligations: published safety frameworks addressing CBRN weapons, loss of control, autonomous cyberattacks, and harmful manipulation; pre-deployment transparency reports; five-year documentation retention; critical-incident reporting on tight timelines (15 days under SB 53, 24 hours for imminent threats, 2 to 15 days under the EU Code); independent external evaluations with adequate access; weight-protection requirements; and internal governance including anonymous reporting and whistleblower protections.^[40] It is widely read inside frontier labs as a working compliance map and inside policy circles as a comparison of how the three frameworks line up.
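
The compute and revenue thresholds summarized above can be expressed as a small lookup. The sketch below encodes only the figures as described in this article; the class, field, and function names are hypothetical, the RAISE Act is omitted because its thresholds are described here only as "similar to SB 53," and the actual statutes contain further conditions that this simplification ignores.

```python
from dataclasses import dataclass

# Thresholds as summarized in this article (illustrative, not a legal reading).
SB53_FLOP_THRESHOLD = 1e26
SB53_REVENUE_THRESHOLD_USD = 500_000_000
EU_FLOP_THRESHOLD = 1e25


@dataclass(frozen=True)
class Developer:
    """Hypothetical description of a model developer."""
    training_flop: float
    annual_revenue_usd: float
    deploys_in_eu: bool


def applicable_regimes(dev: Developer) -> list[str]:
    """Return the regimes whose headline thresholds the developer meets."""
    regimes = []
    if dev.training_flop >= SB53_FLOP_THRESHOLD:
        if dev.annual_revenue_usd > SB53_REVENUE_THRESHOLD_USD:
            regimes.append("SB 53 (stricter tier)")
        else:
            regimes.append("SB 53")
    if dev.training_flop >= EU_FLOP_THRESHOLD and dev.deploys_in_eu:
        regimes.append("EU AI Act / Code of Practice")
    return regimes


example = Developer(training_flop=3e25, annual_revenue_usd=2e9, deploys_in_eu=True)
print(applicable_regimes(example))
```

Because the EU compute threshold is an order of magnitude lower than SB 53's, a model can fall under the European framework without reaching the California one, as in the example above.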

International AI Safety Report

METR's measurements feed directly into the International AI Safety Report chaired by Yoshua Bengio, the consensus scientific assessment commissioned at the November 2023 Bletchley AI Safety Summit.^[32] The full 2025 report, published January 29, 2025, ahead of the Paris AI Action Summit, cites METR's task-completion time horizon work as one of the principal pieces of evidence that frontier AI capability is advancing along a smooth, measurable trend.^[32] The report's discussion of agentic systems and longer-horizon autonomous task completion draws on METR's HCAST and Time Horizons results, and the framing of capability growth as a doubling phenomenon with a knowable rate of change is one of the few quantitative claims in the report's capabilities chapter that comes from a single primary source.^[32]

The October 2025 First Key Update, which Bengio's team released to address the rise of reasoning models, cited METR's o3 and GPT-5 evaluations when describing the gap between frontier capability and reliable third-party verification.^[33] The update flagged METR's findings on evaluation awareness as a structural challenge to the entire framework of pre-deployment testing, since a model that can detect it is being evaluated complicates any claim that benchmark scores correspond to deployment behavior.^[33] The November 2025 Second Key Update on technical safeguards used METR's Autonomy Evaluation Resources as a reference for what a credible third-party evaluation process looks like, and the February 2026 second full edition included longer treatments of agentic capabilities in which METR's measurements appear repeatedly in the cited sources. METR researchers have also contributed directly to the report's writing group across editions, alongside contributors from Anthropic, OpenAI, Google DeepMind, the UK AISI, and the US AISI.

References

METR. *ARC Evals is now METR*. December 4, 2023. https://metr.org/blog/2023-12-04-metr-announcement/
METR. *ARC Evals is spinning out from ARC*. September 19, 2023. https://metr.org/blog/2023-09-19-spin-out-announcement/
Thomas Kwa et al. *Measuring AI Ability to Complete Long Software Tasks*. arXiv:2503.14499, March 2025. https://arxiv.org/abs/2503.14499
METR. *Measuring AI Ability to Complete Long Tasks*. March 19, 2025. https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
METR. *HCAST: Human-Calibrated Autonomy Software Tasks*. arXiv:2503.17354, March 21, 2025. https://arxiv.org/abs/2503.17354
METR. *RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts*. arXiv:2411.15114, November 2024. https://arxiv.org/abs/2411.15114
METR. *Evaluating frontier AI R&D capabilities of LLMs*. November 22, 2024. https://metr.org/blog/2024-11-22-evaluating-r-d-capabilities-of-llms/
METR. *Autonomy Evaluation Resources*. March 13, 2024. https://metr.org/blog/2024-03-13-autonomy-evaluation-resources/
METR Autonomy Evaluation Resources. https://evaluations.metr.org/
METR. *Evaluation Reports*. https://metr.org/evaluations/
METR. *Research*. https://metr.org/research/
METR. *Time Horizon 1.1*. January 29, 2026. https://metr.org/blog/2026-1-29-time-horizon-1-1/
METR. *Task-Completion Time Horizons of Frontier AI Models*. https://metr.org/time-horizons/
Beth Barnes. *Profile*. METR. https://metr.org/team/beth-barnes/
Beth Barnes. *AXRP Episode 34: AI Evaluations with Beth Barnes*. Alignment Forum. https://www.alignmentforum.org/posts/vACr4DExfeRMaCoo7/axrp-episode-34-ai-evaluations-with-beth-barnes
80,000 Hours. *Beth Barnes on the most important graph in AI right now and the 7-month rule that governs its progress*. https://80000hours.org/podcast/episodes/beth-barnes-ai-safety-evals/
METR. *Common Elements of Frontier AI Safety Policies*. Multiple versions, August 2024 through December 2025. https://metr.org/common-elements
METR. *Details about METR's preliminary evaluation of OpenAI's o3 and o4-mini*. April 16, 2025. https://evaluations.metr.org/openai-o3-report/
METR. *Details about METR's preliminary evaluation of Claude 3.7*. April 4, 2025. https://evaluations.metr.org/claude-3-7-report/
METR. *An update on our preliminary evaluations of Claude 3.5 Sonnet and o1*. January 31, 2025. https://metr.org/blog/2025-01-31-update-sonnet-o1-evals/
METR. *AI models can be dangerous before public deployment*. January 17, 2025. https://metr.org/blog/2025-01-17-ai-models-dangerous-before-public-deployment/
Wikipedia. *METR*. https://en.wikipedia.org/wiki/METR
Giving What We Can. *METR (formerly called ARC Evals)*. https://www.givingwhatwecan.org/charities/arc-evals
Future of Life Institute. *2025 AI Safety Index*. https://futureoflife.org/ai-safety-index-summer-2025/
METR. *Frontier AI Safety Policies*. https://metr.org/fsp
Apollo Research. *Towards Safety Cases For AI Scheming*. https://www.apolloresearch.ai/science/towards-safety-cases-for-ai-scheming/
METR GitHub. https://github.com/METR
METR. *Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity*. July 10, 2025. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
Becker, J. et al. *Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity*. arXiv:2507.09089, July 2025. https://arxiv.org/abs/2507.09089
METR. *We are Changing our Developer Productivity Experiment Design*. February 24, 2026. https://metr.org/blog/2026-02-24-uplift-update/
METR. *Details about METR's evaluation of OpenAI GPT-5*. August 7, 2025. https://evaluations.metr.org/gpt-5-report/
Bengio, Y. et al. *International AI Safety Report 2025*. January 29, 2025. https://internationalaisafetyreport.org/publication/international-ai-safety-report-2025
Bengio, Y. et al. *International AI Safety Report 2025: First Key Update: Capabilities and Risk Implications*. October 2025. arXiv:2510.13653. https://arxiv.org/abs/2510.13653
METR. *Clarifying limitations of time horizon*. January 22, 2026. https://metr.org/notes/2026-01-22-time-horizon-limitations/
METR. *MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity*. October 14, 2025. https://metr.org/blog/2025-10-14-malt-dataset-of-natural-and-prompted-behaviors/
Anthropic. *System Card: Claude Opus 4 & Claude Sonnet 4*. May 2025. https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf
Anthropic. *Review of the Anthropic Summer 2025 Pilot Sabotage Risk Report*. https://alignment.anthropic.com/2025/sabotage-risk-report/2025_pilot_risk_report_metr_review.pdf
METR. *Details about METR's evaluation of OpenAI GPT-5.1-Codex-Max*. November 2025. https://metr.org/evaluations/gpt-5-1-codex-max-report/
METR. *Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity*. May 11, 2026. https://metr.org/blog/2026-05-11-ai-usage-survey/
METR. *Frontier AI safety regulations: A reference for lab staff*. January 29, 2026. https://metr.org/notes/2026-01-29-frontier-ai-safety-regulations/
METR. *Impact of modelling assumptions on time horizon results*. March 20, 2026. https://metr.org/notes/2026-03-20-impact-of-modelling-assumptions-on-time-horizon-results/
METR. *Measuring Time Horizon using Claude Code and Codex*. February 13, 2026. https://metr.org/notes/2026-02-13-measuring-time-horizon-using-claude-code-and-codex/
METR. *Analyzing coding agent transcripts to upper bound productivity gains from AI agents*. February 17, 2026. https://metr.org/notes/2026-02-17-exploratory-transcript-analysis-for-estimating-time-savings-from-coding-agents/
MIT Technology Review. *This is the most misunderstood graph in AI*. February 5, 2026. https://www.technologyreview.com/2026/02/05/1132254/this-is-the-most-misunderstood-graph-in-ai/
The Decoder. *METR says it can barely measure Claude Mythos, Palo Alto Networks warns of autonomous AI attackers*. May 2026. https://the-decoder.com/metr-says-it-can-barely-measure-claude-mythos-palo-alto-networks-warns-of-autonomous-ai-attackers/
UK AI Security Institute. *Inspect: A framework for large language model evaluations*. https://inspect.aisi.org.uk/
METR. *Comparison with Inspect*. Vivaria documentation. https://vivaria.metr.org/comparison-with-inspect/
ICML. *RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents against Human Experts (Spotlight)*. ICML 2025. https://icml.cc/virtual/2025/poster/46519
