Remote Labor Index

Updated Jun 8, 2026

Overview

The Remote Labor Index (RLI) is an AI benchmark that measures how well AI agents can complete real, paid remote knowledge work end to end. It was introduced by researchers at the Center for AI Safety (CAIS) and Scale AI in the paper "Remote Labor Index: Measuring AI Automation of Remote Work," posted to arXiv on October 30, 2025 and publicized in early November 2025 ^[1]^[2]^[3]. Rather than testing abstract or synthetic tasks, RLI is built from 240 actual freelance projects originally completed by human professionals, representing more than 6,000 hours of work and a combined real-world value of $143,991 ^[2]^[3].

The benchmark's central finding is that, despite high scores on many existing benchmarks, current frontier AI agents perform near the floor on real economically valued work. At the time of publication the best agent automated only about 2.5 percent of projects to a standard that a reasonable client would accept ^[1]^[2]. RLI is widely cited as a sober, economically grounded yardstick in the debate over AI and jobs, complementing other measures such as OpenAI's GDPval and METR's task time-horizon research ^[4].

Motivation: grounding AI-and-jobs claims

A recurring problem in discussions of technological unemployment and AI automation is that claims about AI "doing real jobs" are often extrapolated from benchmarks that measure narrow, decontextualized skills. Strong performance on exam-style questions, coding puzzles, or short isolated tasks does not necessarily indicate that an AI system can deliver a complete, client-ready piece of paid work ^[1]^[3].

The authors designed RLI to close that gap by anchoring measurement in work that real clients paid real money for. Each project carries an actual brief, a real price, defined deliverables, and a known human completion time, so the benchmark reflects the messy, end-to-end nature of paid remote labor rather than tidy laboratory tasks ^[2]^[3]. The stated goal is to provide data-driven evidence about AI's capacity to automate remote knowledge work, giving policymakers, employers, and researchers an empirical reference point instead of speculation ^[3]. Critically, RLI asks whether an AI can complete an entire project from start to finish, not merely whether it can be a useful assistant for parts of a task ^[3].

What the Remote Labor Index measures

Data sources and construction

RLI's projects were drawn primarily from the freelance marketplace Upwork. The team recruited 358 experienced freelancers, who had average platform earnings of about $23,364, and collected real work samples across 23 of Upwork's subcategories spanning nine major category groups ^[1]. Of the 240 projects, 207 came from posted freelance jobs with freelancer-provided work samples; the remaining 33 came from a long tail of sources, including 7 custom-commissioned projects and others obtained by contacting online creators for permission to use their work ^[1]. The dataset is split into a 230-project private test set and a 10-project public set to limit contamination ^[1].

The projects deliberately span creative, technical, and analytical domains, reflecting the true distribution of remote freelance work rather than emphasizing the programming and writing tasks that dominate many prior benchmarks. Categories include video and animation, 3D and CAD modeling, graphic design, game development, architecture, web development, data analysis, audio production, marketing, legal, and translation ^[1]^[2]. Across the set, deliverables span roughly 72 distinct file formats ^[4].

Attribute	Value
Total projects	240 (230 private test, 10 public)
Combined real-world value	$143,991
Combined human work hours	more than 6,000
Distinct categories	23 (across 9 major groups)
Median project value	$200
Mean project value	about $632.60
Median human completion time	11.5 hours
Mean human completion time	28.9 hours
Value range	modest projects up to more than $10,000
Time range	short projects up to more than 100 hours

Sources: ^[1]^[2]^[3]

Evaluation methodology

To run the benchmark, an AI agent is given a project's brief and required inputs and must produce the full set of deliverables. Outputs are then judged by trained human raters. Each project is assessed by three independent evaluators who decide, by majority vote, whether the AI's deliverables would be accepted by a reasonable client as the commissioned work ^[1]. Raters use a 3-point scale, and ratings of 2 or 3 count as a success, meaning the work is acceptable; a project's automation is credited only when the deliverable meets that client-acceptance bar ^[1].
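The acceptance rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' grading code: it assumes each rater's 3-point score is first mapped to accept (2 or 3) or reject (1) and that the majority is then taken over those decisions. The function names and the example ratings are hypothetical.

```python
def rater_accepts(score: int) -> bool:
    """Map one rater's 3-point score to accept/reject (2 or 3 = acceptable)."""
    if score not in (1, 2, 3):
        raise ValueError(f"score must be 1, 2, or 3, got {score}")
    return score >= 2


def project_automated(scores: list[int]) -> bool:
    """A project counts as automated when a strict majority of raters accept."""
    accepts = sum(rater_accepts(s) for s in scores)
    return accepts > len(scores) / 2


def automation_rate(ratings_by_project: dict[str, list[int]]) -> float:
    """Share of projects whose deliverables pass the majority-accept bar."""
    if not ratings_by_project:
        raise ValueError("no projects to score")
    automated = sum(project_automated(s) for s in ratings_by_project.values())
    return automated / len(ratings_by_project)


# Hypothetical ratings from three raters per project (not real RLI data).
example_ratings = {
    "project_a": [3, 2, 1],  # two of three accept -> automated
    "project_b": [1, 1, 2],  # one of three accepts -> not automated
    "project_c": [1, 1, 1],  # none accept -> not automated
    "project_d": [2, 3, 3],  # all accept -> automated
}

rate = automation_rate(example_ratings)
print(f"automation rate: {rate:.1%}")
```

With three raters a strict majority means at least two acceptances, so a single dissenting rater cannot change the outcome on its own.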

The grading process is rigorous: evaluators spent an average of about 11.4 minutes per assessment, and inter-annotator agreement on the automation outcome was 94.4 percent ^[1]. The headline metric, the automation rate, is the share of projects an agent completes to acceptable quality. The leaderboard also reports an Elo-style rating that places agent quality on a scale where human freelancers are anchored near 1000 ^[2].
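To give a rough sense of what a gap on an Elo-style scale means, the sketch below applies the conventional Elo logistic formula with a 400-point scale. The source describes the leaderboard rating only as "Elo-style," so the formula, the scale constant, and the function name are assumptions, and the leaderboard's actual computation may differ. The example uses the best agent's original score of about 510 against the human reference of 1000.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Conventional Elo expected score of A against B (assumed formula)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


AGENT_ELO = 510    # approximate rating of the best agent in the original paper
HUMAN_ELO = 1000   # human freelancer reference point

# Under the assumed formula, this is the chance the agent's deliverable
# would be preferred over the human one in a head-to-head comparison.
p_agent_preferred = elo_expected_score(AGENT_ELO, HUMAN_ELO)
print(f"expected score for the agent: {p_agent_preferred:.3f}")
```

Under these assumptions a 490-point gap corresponds to the agent's work being preferred in roughly 6 percent of pairwise comparisons, which is a finer-grained signal than the pass/fail automation rate.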

Results

The original RLI paper reported that all evaluated frontier agents automated less than 3 percent of the projects. The best performer was the agent system Manus, which reached an automation rate of about 2.5 percent, earned roughly $1,720 of the $143,991 in available project value (about 1.2 percent), and had an Elo of about 510 against the human reference of 1000 ^[1]^[2]. In other words, about 97.5 percent of the projects in the benchmark remained beyond the reach of the strongest agent at the time ^[3].

The table below shows representative automation rates from the paper's evaluation ^[1]:

Agent / model	Automation rate
Manus	2.5%
Grok 4	2.1%
Claude Sonnet 4.5	2.1%
GPT-5 (CLI)	1.7%
ChatGPT Agent	1.3%
Gemini 2.5 Pro	0.8%
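The percentages above are easier to interpret as project counts. The short sketch below converts each reported rate into an implied number of accepted projects, assuming the rates were computed over all 240 projects (an assumption; the source does not state the denominator here), and compares the dollar value credited to the best agent with the total available. Variable and function names are illustrative.

```python
TOTAL_PROJECTS = 240          # benchmark size reported in the article
TOTAL_VALUE_USD = 143_991     # combined value of all projects
MANUS_VALUE_USD = 1_720       # approximate value credited to the best agent

# Automation rates as reported in the table.
reported_rates = {
    "Manus": 0.025,
    "Grok 4": 0.021,
    "Claude Sonnet 4.5": 0.021,
    "GPT-5 (CLI)": 0.017,
    "ChatGPT Agent": 0.013,
    "Gemini 2.5 Pro": 0.008,
}


def implied_project_count(rate: float, total: int = TOTAL_PROJECTS) -> int:
    """Nearest whole number of projects consistent with a rounded rate."""
    return round(rate * total)


implied_counts = {name: implied_project_count(r) for name, r in reported_rates.items()}
value_share = MANUS_VALUE_USD / TOTAL_VALUE_USD

for name, count in implied_counts.items():
    print(f"{name}: about {count} of {TOTAL_PROJECTS} projects")
print(f"Best agent's share of total project value: {value_share:.1%}")
```

Under this assumption every agent in the table succeeded on only a handful of projects, and the best agent's share of dollar value is lower than its share of projects, which suggests the accepted projects were cheaper than average.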

The authors analyzed why attempts failed. Rejections were dominated by quality and completeness problems rather than by inability to start: about 45.6 percent of failures involved poor professional quality, about 35.7 percent were incomplete or truncated, about 17.6 percent produced corrupted or unusable files, and about 14.8 percent contained internal inconsistencies such as mismatched visual elements ^[1]^[4]. These categories overlap, since a single failed deliverable can show more than one problem, so the shares sum to more than 100 percent. Agents did relatively better on certain narrow, digital-native tasks, including audio editing, image generation for advertising, report writing, and coding interactive data visualizations and dashboards ^[1]^[4].

Because RLI is maintained as a public, live leaderboard, scores evolve as newer agents are tested. By mid-2026 the leaderboard reflected higher figures than the original paper, with later agent systems built on newer models surpassing the initial 2.5 percent mark, while still automating only a small single-digit percentage of projects overall ^[2]. The 2.5 percent figure should therefore be read as the result for the strongest agent at the benchmark's October 2025 release, not a permanent ceiling.

Significance

RLI's importance lies in reframing AI capability in directly economic terms. By measuring complete, client-acceptable delivery of real paid work, it provides a yardstick that resists the inflation common to benchmarks where models can score well on fragments of a job without being able to ship the whole thing ^[1]^[3]. The very low automation rates offered an empirically grounded counterweight to claims of imminent, broad automation of remote knowledge work, suggesting that as of late 2025 AI agents could not yet independently perform the large majority of real paid remote projects ^[3]^[4].

At the same time, the benchmark's designers frame it as a progress tracker rather than a verdict. RLI is intended to chart how AI's economic usefulness changes over time, and the steady, measurable improvements between agent generations are part of what it is built to capture ^[2]^[3]. This dual role, documenting both current limits and the trajectory, is why RLI is frequently invoked in policy and labor-market analyses of AI's near-term impact on jobs ^[4].

Relationship to other economic AI benchmarks

RLI is part of a wave of benchmarks that try to tie AI performance to economic value rather than narrow skill tests. Each takes a different slice of the problem:

GDPval (OpenAI) evaluates AI on occupation-specific tasks drawn from a wide range of professions and has shown that AI can match or exceed humans on some narrowly scoped, economically valuable tasks. RLI differs by measuring whole, end-to-end projects sourced from real freelance jobs rather than individual task types shared across professions, and it spans a broader range of work and file formats. The RLI authors note that their projects take roughly twice as long as comparable GDPval tasks and cover far more deliverable formats, about 72 versus 24 ^[1]^[4].
TheAgentCompany (a simulated-company agent benchmark) places agents inside a synthetic software firm to complete multi-step office and engineering work. It captures sequential, tool-using workflows, whereas RLI uses real commissioned deliverables graded against actual client acceptance ^[4].
METR time-horizon research measures the length of tasks, expressed as the human time they would take, that AI can complete reliably, and reported rapid, roughly exponential growth in that horizon over 2024 and 2025. METR's horizon framing and RLI's automation-rate framing are complementary: one tracks how long a task an agent can handle, the other tracks whether agents can deliver acceptable real paid work end to end ^[4].
HCAST and similar agentic task suites focus heavily on software and writing tasks; RLI's authors position their benchmark as capturing greater complexity and a more representative mix of design, multimedia, and operations work that dominates real freelance platforms ^[1]^[4].

Together these efforts form a small but growing toolkit for the economics of AI. RLI's particular contribution is its insistence on real money, real deliverables, and real client acceptance as the unit of measurement ^[1]^[4].

References

1. Mazeika, Mantas; Gatti, Alice; et al.; Hendrycks, Dan. "Remote Labor Index: Measuring AI Automation of Remote Work." arXiv:2510.26787, October 30, 2025. https://arxiv.org/abs/2510.26787
2. Scale AI / Scale Labs. "The Remote Labor Index: Measuring the Automation of Work" and RLI leaderboard. https://scale.com/blog/rli and https://labs.scale.com/leaderboard/rli
3. Remote Labor Index official site. Center for AI Safety and Scale AI. https://www.remotelabor.ai/
4. Epoch AI. "What do economic value benchmarks tell us?" (discussion of RLI, GDPval, and METR time-horizon work). https://epoch.ai/blog/what-do-economic-value-benchmarks-tell-us
