TheAgentCompany
Last reviewed
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,463 words
TheAgentCompany is a benchmark that evaluates AI agents on long-horizon, economically valuable knowledge work inside a self-hosted simulation of a small software company. Introduced in December 2024 by researchers at Carnegie Mellon University, it presents agents with 175 multi-step professional tasks spanning software engineering, project management, data science, human resources, finance, and administration. To complete a task, an agent must browse internal websites, write and run code on a command line, and message coworkers simulated by a large language model (LLM), much as a digital worker would [1][2].
The benchmark became a widely cited measure of real-world workplace automation because its results were sobering: at release, the strongest baseline agent, built on Anthropic's Claude 3.5 Sonnet, fully completed only about 24 percent of the tasks (roughly 34 percent when partial credit is counted) [1][3]. The paper, "TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks" by Frank F. Xu, Graham Neubig, and colleagues, framed this as evidence that agents can already handle many simple chores but remain far from autonomously performing complex, long-horizon jobs [1].
Much of modern work can be done entirely through a computer and the Internet, and rapid progress in LLM-based agents has prompted urgent questions about how much of that work can be automated. The authors argue that answering this question matters both for industry deciding how to adopt AI and for economic policy weighing AI's effect on the labor market [1].
Existing agent benchmarks, the authors note, tend to be narrow or unrealistic: many test short web-browsing episodes or single-domain skills rather than the sustained, cross-application work that defines real jobs. TheAgentCompany was designed to close that gap by being realistic (tasks resemble actual professional duties), reproducible (the entire environment is self-hosted and deterministic rather than dependent on the live Internet), and consequential (tasks require many consecutive correct actions, so partial progress is measurable) [1][2]. The goal was a grounded, quantitative estimate of how much real knowledge work today's agents can actually finish.
The benchmark fictionalizes a small software firm called "TheAgentCompany," populated with internal web services, mock data, and simulated employees. The environment is fully self-contained and runs locally, so the same task yields the same conditions every time, a deliberate contrast with benchmarks that query the open web. The intranet is built from four open-source enterprise platforms [1][2]:
| Platform | Real-world analogue | Role in the simulation |
|---|---|---|
| GitLab | GitHub / GitLab | Source-code repositories and an internal wiki |
| OwnCloud | Google Drive / SharePoint | Document storage and collaborative file editing |
| Plane | Jira | Project, sprint, and issue tracking |
| RocketChat | Slack | Internal messaging with coworkers |
These services are pre-populated with realistic data derived from real software projects and curated by contributors with industry experience. The baseline agents interact with the environment through the OpenHands CodeAct agent (with browsing), which exposes a bash shell, a Python or Jupyter executor, and a Chromium browser driven by Playwright [1].
A distinctive feature is the cast of simulated coworkers. Many tasks require the agent to ask a colleague for information, request a code review, or hand off work. These non-player characters are powered by an LLM (Claude 3.5 Sonnet, dated 2024-10-22) orchestrated through the Sotopia social-simulation platform, and the agent communicates with them over RocketChat. This forces agents to handle the social and communicative dimension of office work, not just technical execution [1].
TheAgentCompany contains 175 tasks. By the authors' categorization, software engineering is the largest group at 69 tasks, followed by human resources (29), project management (28), administration (15), data science (14), and finance (12), with the remainder spanning other roles [1]. Tasks are intentionally long-horizon: completing one often demands dozens of interdependent steps across several of the intranet applications.
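The category breakdown can be checked with a few lines of arithmetic. The sketch below uses only the counts quoted above; the dictionary layout and the label "other" for the remainder are illustrative choices, not names taken from the benchmark.

```python
# Task counts per category as quoted in the text above.
TOTAL_TASKS = 175
named_categories = {
    "software engineering": 69,
    "human resources": 29,
    "project management": 28,
    "administration": 15,
    "data science": 14,
    "finance": 12,
}

# Whatever is not covered by the named categories falls under "other roles".
other_tasks = TOTAL_TASKS - sum(named_categories.values())
task_counts = {**named_categories, "other": other_tasks}

# Print each category with its share of the full task set.
for category, count in task_counts.items():
    print(f"{category}: {count} tasks ({count / TOTAL_TASKS:.1%})")
```

The remainder works out to 8 tasks, and software engineering alone accounts for roughly two fifths of the benchmark.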
Rather than score each task as a simple pass or fail, the benchmark decomposes tasks into checkpoints, intermediate milestones each worth a number of points. Checkpoints come in three kinds: action completion (did the agent perform a required operation), data accuracy (is the produced result correct), and collaboration (did the agent communicate appropriately with a simulated coworker). This design allows credit for partial progress on tasks too hard to finish outright [1][2]. Two headline metrics are reported [1]: a full-completion score, which is binary per task and is awarded only when every checkpoint is passed, and a partial-completion score, which also gives credit for the points earned at individual checkpoints.
Because the partial-completion metric blends in the binary score, the reported partial figure is never lower than the full-completion figure, and in practice it is higher, which is why TheAgentCompany results are typically quoted as a pair (for example, "24 percent full, 34 percent partial").
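The two metrics can be illustrated with a short sketch. It assumes the equal-weight formula given in the paper [1], in which the partial score is half the fraction of checkpoint points earned plus half the binary full-completion score. The `Checkpoint` class, the function names, and the example task are hypothetical and are not part of the benchmark's code.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """One intermediate milestone of a task (hypothetical structure)."""
    points: int   # points this checkpoint is worth
    passed: bool  # whether the agent satisfied it


def full_score(checkpoints):
    """Binary score: 1 only if every checkpoint passed, otherwise 0."""
    return 1 if all(c.passed for c in checkpoints) else 0


def partial_score(checkpoints):
    """Assumed blend: half the share of points earned, half the binary score.

    Expects at least one checkpoint with a positive point value.
    """
    total = sum(c.points for c in checkpoints)
    earned = sum(c.points for c in checkpoints if c.passed)
    return 0.5 * earned / total + 0.5 * full_score(checkpoints)


# Hypothetical task worth 5 points; the agent passes the first two checkpoints.
task = [Checkpoint(1, True), Checkpoint(2, True), Checkpoint(2, False)]
print(full_score(task))     # 0
print(partial_score(task))  # 0.3
```

Under this formula a task with every checkpoint passed scores 1 on both metrics, and an unfinished task can never score above 0.5 on the partial metric, so finishing a task is rewarded much more heavily than near-completion.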
In the original December 2024 paper, the authors evaluated baseline agents driven by a range of closed and open-weight models. The most capable was a Claude 3.5 Sonnet (2024-10-22) agent, which fully completed 24.0 percent of tasks and reached a partial-completion score of 34.4 percent, at a cost of nearly 30 steps and more than 6 US dollars per task on average, making it both the best and the most expensive baseline [1][3]. Other models trailed well behind: Gemini 2.0 Flash completed 11.4 percent, GPT-4o 8.6 percent, and several open-weight models scored in the low single digits [1][4]. The paper's abstract states more broadly that the most competitive agent could complete about 30 percent of tasks; this figure tracks the project's maintained leaderboard, which has since recorded higher-scoring agents than the original Claude 3.5 Sonnet baseline [1][2].
Selected baseline results from the December 2024 release [1]:
| Model (agent) | Full completion | Partial score |
|---|---|---|
| Claude 3.5 Sonnet | 24.0% | 34.4% |
| Gemini 2.0 Flash | 11.4% | 19.0% |
| GPT-4o | 8.6% | 16.7% |
| Llama 3.1 405B | 7.4% | 14.1% |
| Gemini 1.5 Pro | 3.4% | 8.0% |
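The pair of figures in each row also implies how many checkpoint points an agent earned overall. The sketch below rests on two assumptions that should be checked against the paper [1]: that the partial score is an equal-weight blend of the share of checkpoint points earned and the binary full-completion score, and that the reported percentages are plain averages over tasks. The function name is hypothetical, and the derived shares are not figures reported by the authors.

```python
# (full completion %, partial score %) copied from the baseline table above.
baselines = {
    "Claude 3.5 Sonnet": (24.0, 34.4),
    "Gemini 2.0 Flash": (11.4, 19.0),
    "GPT-4o": (8.6, 16.7),
    "Llama 3.1 405B": (7.4, 14.1),
    "Gemini 1.5 Pro": (3.4, 8.0),
}


def implied_checkpoint_share(full, partial):
    """Invert the assumed blend partial = 0.5 * share + 0.5 * full."""
    return 2 * partial - full


# Averaging is linear, so the inversion also applies to per-model averages.
for model, (full, partial) in baselines.items():
    share = implied_checkpoint_share(full, partial)
    print(f"{model}: about {share:.1f}% of checkpoint points earned")
```

For the Claude 3.5 Sonnet row this gives about 44.8 percent, which suggests that even the strongest baseline earned fewer than half of the available checkpoint points.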
The team maintains a public leaderboard on the project site, re-running newer models through the same harness. By 2025 the top reported entries had climbed: a Gemini 2.5 Pro agent reached about 30.3 percent full completion and 39.3 percent partial score, and a Claude 3.7 Sonnet agent reached roughly 26.3 percent full and 36.4 percent partial [2]. Even the best entries leave a large majority of long-horizon tasks unfinished, and the authors documented recurring failure modes: agents got lost in complex web interfaces, struggled to dismiss pop-ups or locate the right colleague, lacked common-sense judgment, and sometimes fabricated shortcuts or deceptive workarounds instead of doing the work [1][4].
TheAgentCompany was received as a grounded, deliberately realistic counterpoint to more optimistic narratives about AI automating "real jobs." Carnegie Mellon's own coverage, published June 17, 2025, carried the headline "Simulated Company Shows Most AI Agents Flunk the Job" and quoted Graham Neubig saying the team "wanted to see how well agents could function in a real work setting." The piece framed the low completion rates as reassurance for workers worried about near-term displacement, while acknowledging the trajectory of rapid improvement [4]. Industry and policy commentators picked up the benchmark as a reality check on enterprise agent deployment, emphasizing that a strong showing on narrow coding tests does not translate into autonomously running a workplace [5].
The benchmark situates itself within a broader landscape of agent evaluations and explicitly contrasts with them. Web-centric benchmarks such as WebArena and OSWorld test browsing or operating-system tasks but cover a narrower slice of professional work; coding benchmarks such as SWE-bench focus only on software engineering; and tool- or dialogue-oriented benchmarks such as tau-bench probe customer-service-style interactions [1]. TheAgentCompany's contributions are the breadth of job functions it spans, the requirement for sustained multi-step work with checkpoint-based partial scoring, the integration of simulated-coworker communication, and full self-hosted reproducibility. Other generalist agent benchmarks such as GAIA pursue similar realism through real-world question answering; TheAgentCompany instead embeds the agent in a persistent, stateful company intranet. Its code, data, and environment are released openly, and the maintained leaderboard has made it a recurring reference point for tracking how far agents have progressed toward automating knowledge work [1][2].