Remote Labor Index
Last reviewed: Jun 8, 2026
Sources: 4 citations
Review status: Source-backed
Revision: v1 · 1,669 words
The Remote Labor Index (RLI) is an AI benchmark that measures how well AI agents can complete real, paid remote knowledge work end to end. It was introduced by researchers at the Center for AI Safety (CAIS) and Scale AI in the paper "Remote Labor Index: Measuring AI Automation of Remote Work," which was posted to arXiv on October 30, 2025 and publicized in early November 2025 [1][2][3]. Rather than testing abstract or synthetic tasks, RLI is built from 240 actual freelance projects originally completed by human professionals, representing more than 6,000 hours of work and a combined real-world value of $143,991 [2][3].
The benchmark's central finding is that, despite high scores on many existing benchmarks, current frontier AI agents perform near the floor on real economically valued work. At the time of publication the best agent automated only about 2.5 percent of projects to a standard that a reasonable client would accept [1][2]. RLI is widely cited as a sober, economically grounded yardstick in the debate over AI and jobs, complementing other measures such as OpenAI's GDPval and METR's task time-horizon research [4].
A recurring problem in discussions of technological unemployment and AI automation is that claims about AI "doing real jobs" are often extrapolated from benchmarks that measure narrow, decontextualized skills. Strong performance on exam-style questions, coding puzzles, or short isolated tasks does not necessarily indicate that an AI system can deliver a complete, client-ready piece of paid work [1][3].
The authors designed RLI to close that gap by anchoring measurement in work that real clients paid real money for. Each project carries an actual brief, a real price, defined deliverables, and a known human completion time, so the benchmark reflects the messy, end-to-end nature of paid remote labor rather than tidy laboratory tasks [2][3]. The stated goal is to provide data-driven evidence about AI's capacity to automate remote knowledge work, giving policymakers, employers, and researchers an empirical reference point instead of speculation [3]. Critically, RLI asks whether an AI can complete an entire project from start to finish, not merely whether it can be a useful assistant for parts of a task [3].
RLI's projects were drawn primarily from the freelance marketplace Upwork. The team recruited 358 experienced freelancers, who had average platform earnings of about $23,364, and collected real work samples across 23 of Upwork's subcategories spanning nine major category groups [1]. Of the 240 projects, 207 came from posted freelance jobs with freelancer-provided work samples; the remaining 34 came from a long tail of sources, including 7 custom-commissioned projects and 33 obtained by contacting online creators for permission to use their work [1]. The dataset is split into a 230-project private test set and a 10-project public set to limit contamination [1].
The projects deliberately span creative, technical, and analytical domains, reflecting the true distribution of remote freelance work rather than emphasizing the programming and writing tasks that dominate many prior benchmarks. Categories include video and animation, 3D and CAD modeling, graphic design, game development, architecture, web development, data analysis, audio production, marketing, legal, and translation [1][2]. Across the set, deliverables span roughly 72 distinct file formats [4].
| Attribute | Value |
|---|---|
| Total projects | 240 (230 private test, 10 public) |
| Combined real-world value | $143,991 |
| Combined human work hours | more than 6,000 |
| Distinct categories | 23 (across 9 major groups) |
| Median project value | $200 |
| Mean project value | about $632.60 |
| Median human completion time | 11.5 hours |
| Mean human completion time | 28.9 hours |
| Value range | modest projects up to more than $10,000 |
| Time range | short projects up to more than 100 hours |
To run the benchmark, an AI agent is given a project's brief and required inputs and must produce the full set of deliverables. Outputs are then judged by trained human raters. Each project is assessed by three independent evaluators who decide, by majority vote, whether a reasonable client would accept the AI's deliverables as the commissioned work [1]. Raters use a 3-point scale on which a rating of 2 or 3 means the work is acceptable; a project is credited as automated only when the deliverable meets that client-acceptance bar [1].
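The grading rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and it assumes each rater's 3-point score is first converted to an accept/reject decision and that the three decisions are then combined by simple majority.

```python
# Illustrative sketch only; function names and aggregation order are assumptions.

def is_acceptable(rating: int) -> bool:
    """A rating of 2 or 3 on the 3-point scale counts as acceptable."""
    if rating not in (1, 2, 3):
        raise ValueError(f"rating must be 1, 2, or 3, got {rating}")
    return rating >= 2


def project_automated(ratings: list[int]) -> bool:
    """Majority vote over the raters' accept/reject decisions."""
    accepts = sum(is_acceptable(r) for r in ratings)
    return accepts > len(ratings) / 2


def automation_rate(ratings_by_project: dict[str, list[int]]) -> float:
    """Share of projects whose deliverables pass the majority vote."""
    if not ratings_by_project:
        raise ValueError("no projects to score")
    passed = sum(project_automated(r) for r in ratings_by_project.values())
    return passed / len(ratings_by_project)


if __name__ == "__main__":
    # Made-up ratings for four hypothetical projects, three raters each.
    example = {
        "project_a": [3, 2, 1],  # two of three accept -> automated
        "project_b": [1, 1, 1],  # none accept
        "project_c": [1, 2, 1],  # one of three accepts
        "project_d": [2, 2, 2],  # all accept -> automated
    }
    print(automation_rate(example))
```

With three raters a tie is impossible, so the strict majority test is enough; a different panel size would need an explicit tie-breaking rule.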
The grading process is rigorous: evaluators spent an average of about 11.4 minutes per assessment, and inter-annotator agreement on the automation outcome was 94.4 percent [1]. The headline metric, the automation rate, is the share of projects an agent completes to acceptable quality. The leaderboard also reports an Elo-style rating that places agent quality on a scale where human freelancers are anchored near 1000 [2].
The original RLI paper reported that all evaluated frontier agents automated less than 3 percent of the projects. The best performer was the agent system Manus, which reached an automation rate of about 2.5 percent, corresponding to roughly $1,720 of the $143,991 in available project value, and an Elo of about 510 against the human reference of 1000 [1][2]. In other words, about 97.5 percent of the real paid work in the benchmark remained beyond the reach of the strongest agent at the time [3].
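To give a rough sense of what a 490-point Elo gap means, the sketch below applies the conventional chess-style Elo formula. The source does not state the exact rating model behind the leaderboard, so the logistic form and the 400-point scale are assumptions, as is reading the result as the chance that an agent's deliverable is preferred over the human one in a head-to-head comparison.

```python
# Conventional Elo expected-score formula; the RLI leaderboard may use a
# different rating model. The 400-point scale is an assumption.

def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    agent_elo = 510.0   # best agent in the original paper, per the text
    human_elo = 1000.0  # human reference anchor, per the text
    # Roughly 0.056 under these assumptions.
    print(round(elo_expected_score(agent_elo, human_elo), 3))
```

Under these assumptions the gap corresponds to an expected score of roughly 0.056 for the agent, which is in the same low range as the single-digit automation rates.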
The table below shows representative automation rates from the paper's evaluation [1]:
| Agent / model | Automation rate |
|---|---|
| Manus | 2.5% |
| Grok 4 | 2.1% |
| Claude Sonnet 4.5 | 2.1% |
| GPT-5 (CLI) | 1.7% |
| ChatGPT Agent | 1.3% |
| Gemini 2.5 Pro | 0.8% |
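The percentages above are easier to interpret as project counts. The sketch below converts each reported rate into an implied number of accepted projects, assuming the rates were computed over all 240 projects. That denominator is an assumption, and the counts are inferred from rounding, not figures reported in the source. It also computes the share of total project value represented by the roughly $1,720 attributed to the best agent.

```python
# Back-of-the-envelope arithmetic. The 240-project denominator is an
# assumption, and the resulting counts are inferred, not reported.

TOTAL_PROJECTS = 240
TOTAL_VALUE_USD = 143_991

REPORTED_RATES_PERCENT = {
    "Manus": 2.5,
    "Grok 4": 2.1,
    "Claude Sonnet 4.5": 2.1,
    "GPT-5 (CLI)": 1.7,
    "ChatGPT Agent": 1.3,
    "Gemini 2.5 Pro": 0.8,
}


def implied_project_count(rate_percent: float, total: int = TOTAL_PROJECTS) -> int:
    """Nearest whole number of projects consistent with a reported rate."""
    return round(rate_percent / 100 * total)


def value_share_percent(earned_usd: float, total_usd: float = TOTAL_VALUE_USD) -> float:
    """Share of total project value captured, in percent."""
    return 100 * earned_usd / total_usd


if __name__ == "__main__":
    for agent, rate in REPORTED_RATES_PERCENT.items():
        print(f"{agent}: {rate}% -> about {implied_project_count(rate)} of {TOTAL_PROJECTS} projects")
    print(f"Value share for $1,720: {value_share_percent(1720):.2f}%")
```

On these assumptions the best agent's 2.5 percent corresponds to about 6 projects, while its share of total dollar value is closer to 1.2 percent. The gap between the two figures suggests that the accepted projects were cheaper than average.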
The authors analyzed why attempts failed. Rejections were dominated by quality and completeness problems rather than by inability to start: about 45.6 percent of failures involved poor professional quality, about 35.7 percent were incomplete or truncated, about 17.6 percent produced corrupted or unusable files, and about 14.8 percent contained internal inconsistencies such as mismatched visual elements [1][4]. These shares sum to more than 100 percent, so the categories are not mutually exclusive: a single failed attempt can fall into more than one of them. Agents did relatively better on certain narrow, digital-native tasks, including audio editing, image generation for advertising, report writing, and coding interactive data visualizations and dashboards [1][4].
Because RLI is maintained as a public, live leaderboard, scores evolve as newer agents are tested. By mid-2026 the leaderboard reflected higher figures than the original paper, with later agent systems built on newer models surpassing the initial 2.5 percent mark, while still automating only a small single-digit percentage of projects overall [2]. The 2.5 percent figure should therefore be read as the result for the strongest agent at the benchmark's October 2025 release, not a permanent ceiling.
RLI's importance lies in reframing AI capability in directly economic terms. By measuring complete, client-acceptable delivery of real paid work, it provides a yardstick that resists the inflation common to benchmarks where models can score well on fragments of a job without being able to ship the whole thing [1][3]. The very low automation rates offered an empirically grounded counterweight to claims of imminent, broad automation of remote knowledge work, suggesting that as of late 2025 AI agents could not yet independently perform the large majority of real paid remote projects [3][4].
At the same time, the benchmark's designers frame it as a progress tracker rather than a verdict. RLI is intended to chart how AI's economic usefulness changes over time, and the steady, measurable improvements between agent generations are part of what it is built to capture [2][3]. This dual role, documenting both current limits and the trajectory, is why RLI is frequently invoked in policy and labor-market analyses of AI's near-term impact on jobs [4].
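RLI is part of a wave of benchmarks that try to tie AI performance to economic value rather than narrow skill tests. Each takes a different slice of the problem; OpenAI's GDPval and METR's task time-horizon research are two such measures often discussed alongside it [4].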
Together these efforts form a small but growing toolkit for the economics of AI. RLI's particular contribution is its insistence on real money, real deliverables, and real client acceptance as the unit of measurement [1][4].