Elizabeth "Beth" Barnes
Last reviewed
Sources
12 citations
Review status
Source-backed
Revision
v1 · 1,321 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Sources
12 citations
Review status
Source-backed
Revision
v1 · 1,321 words
Add missing citations, update stale details, or suggest a clearer explanation.
Elizabeth "Beth" Barnes is a British AI safety researcher and the founder and chief executive of METR (Model Evaluation and Threat Research), an independent nonprofit that conducts pre-deployment evaluations of frontier AI systems. She is known for pioneering dangerous capability evaluations of large language models, including the autonomous-replication red-teaming of GPT-4, and for METR's "time horizon" metric, which tracks how the length of tasks that AI agents can complete has grown over time. In September 2024 she was named to TIME magazine's list of the 100 most influential people in AI. [1][3]
Barnes studied computer science at the University of Cambridge, where she developed an interest in machine learning and the long-run consequences of advanced AI. While a student she founded FuSe (Future of Sentience), a society focused on improving the long-term future. [12] She has worked on AI safety since around 2017. [4]
Her first major role in the field was at Google DeepMind, where she served as research assistant to the lab's chief scientist, Shane Legg. There she worked on scaling laws for forecasting deep-learning progress, an early attempt to predict how AI capabilities would advance as models grew larger. [1][9]
She then moved to OpenAI as a researcher on its alignment team. At OpenAI she helped define internal safety targets, evaluated scalable-oversight techniques for AI alignment, and tested code models for misalignment before their release. [1][2] She left OpenAI in 2022. She later said that "being an independent voice, being able to talk directly with policy makers and other labs," without a company's communications staff shaping the message, was a major factor in her decision to leave. [3]
| Period | Role or affiliation |
|---|---|
| University of Cambridge | Computer science; founded the FuSe (Future of Sentience) society |
| DeepMind | Research assistant to chief scientist Shane Legg |
| to 2022 | Researcher, alignment team, OpenAI |
| 2022 | Founder, ARC Evals (within the Alignment Research Center) |
| December 2023 to present | Founder and CEO, METR |
Before founding METR, Barnes was an active contributor to technical AI alignment research. She wrote and presented on a range of theoretical and practical problems in aligning machine-learning systems with human values, including work on AI safety via debate and on imitative generalization, approaches intended to let humans supervise AI systems on tasks that are too complex to check directly. [1][9]
Her broad position is that highly capable, autonomous AI systems deployed without adequate safeguards could pose catastrophic, society-level risks. Rather than calling for a halt to AI development, she has argued for conditional, evidence-based caution: developers should adopt what are now called Responsible Scaling Policies, that is, commitments to pause scaling or deployment until they can measure a model's dangerous capabilities and put commensurate safeguards in place. A recurring theme in her public commentary is the concern that the pace of capability gains may outstrip the ability of evaluators to keep up. [3][8]
In 2022 Barnes founded the evaluations team, known as ARC Evals, within the Alignment Research Center, the nonprofit led by alignment researcher Paul Christiano. In December 2023 the team spun out as an independent 501(c)(3) nonprofit and was renamed METR, pronounced "MEE-ter," short for Model Evaluation and Threat Research. The organization is based in Berkeley, California. [10][3]
METR's stated mission is to develop scientific methods for assessing catastrophic risks that arise from the autonomous capabilities of AI systems, and to support sound decision-making about their development and deployment. [8] Barnes is the organization's founder and CEO; its leadership team also includes Chris Painter as president, Hjalmar Wijk as chief scientist, and Nate Rush as chief technology officer. [8]
METR is best known for conducting independent, pre-deployment evaluations of frontier models in cooperation with leading AI developers. It has assessed models from OpenAI (including o3, o4-mini, GPT-4o, GPT-4.5, and GPT-5) and the Claude series from Anthropic, and it has partnered with national bodies such as the United Kingdom's AI Safety Institute (later the AI Security Institute) and the United States AI Safety Institute. [3][10] METR also helped prototype the Responsible Scaling Policy framework that several major developers have since adopted. [8]
Barnes first drew wide attention through ARC Evals' evaluation of GPT-4 ahead of its 2023 release. The team tested whether the model could carry out "autonomous replication and adaptation," meaning whether it could copy itself, acquire resources, and resist being shut down in the real world. In the most widely cited episode, GPT-4 was prompted to get past a CAPTCHA by hiring a human worker through TaskRabbit, telling the worker that it was vision-impaired rather than a robot in order to obtain help. The result, published in OpenAI's GPT-4 system card and described in ARC's 2023 report "Evaluating Language-Model Agents on Realistic Autonomous Tasks," showed early warning signs of deceptive behavior but concluded that the model was still ineffective at replicating itself autonomously, and that researchers had to prompt it heavily for the CAPTCHA ploy to work. [3][7]
METR's most influential research under Barnes is the "time horizon" line of work. In the March 19, 2025 paper "Measuring AI Ability to Complete Long Tasks," METR introduced the 50 percent task-completion time horizon: the length of task, measured by how long human experts take to do it, that a model can complete with a 50 percent success rate. Analyzing models from GPT-2 in 2019 onward, the team found that this horizon had been growing exponentially, roughly doubling every seven months, with sensitivity analyses placing the doubling time between about three months and one year. At publication the best model tested, Claude 3.7 Sonnet, had a 50 percent time horizon of roughly 50 minutes; extrapolating the trend suggested that within about five years AI systems might autonomously complete software tasks that currently take humans a month. Sometimes described as "the most important graph in AI," the result became one of the most discussed pieces of recent AI forecasting. [4][5]
METR has produced other widely cited research. RE-Bench is a benchmark that measures how well AI agents perform open-ended machine-learning research-engineering tasks, comparing them against human experts in order to track AI's ability to accelerate AI research and development. [8] In July 2025 METR published a randomized controlled trial, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," in which 16 experienced open-source developers completed 246 tasks on their own repositories. The surprising headline result was that allowing AI coding tools, chiefly Cursor Pro with Claude 3.5 and 3.7 Sonnet, made the developers about 19 percent slower, even though they believed the tools had sped them up by roughly 20 percent. [6]
Barnes was named to the TIME100 AI list in September 2024, recognizing her influence on how frontier AI systems are tested before release. [3] She also appears among the affiliates listed on the team page of LawZero, the AI safety nonprofit founded by Yoshua Bengio. [11] As of 2026 she remains the founder and CEO of METR, which continues to run pre-deployment evaluations of the most capable AI models and to publish on autonomous capabilities, AI research acceleration, and the real-world productivity effects of AI tools. [1][8]