SWE-Lancer
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,135 words
SWE-Lancer is a benchmark released by OpenAI in February 2025 that evaluates how well frontier large language models perform real-world freelance software-engineering work. It consists of 1,488 paid tasks drawn from the freelance marketplace Upwork, with a combined real-world payout of US$1 million. All tasks were sourced from the open-source mobile and web codebase of the expense-management company Expensify. Code-writing tasks are graded with end-to-end tests rather than unit tests, and each model's score is reported as the cumulative dollar value of the tasks it completes successfully.[1][2]
Unlike earlier coding benchmarks such as HumanEval or SWE-bench, which evaluate isolated programming problems or repository-level patches against test suites, SWE-Lancer is explicitly designed to mirror the economic structure of paid software work. Tasks range from $50 bug fixes to $32,000 feature implementations, and they are split into two categories: Individual Contributor (IC) tasks, in which a model must write code that passes end-to-end browser tests, and Software Engineering Manager (SWE Manager) tasks, in which the model must select the best implementation proposal from several competing options that real engineering managers reviewed.[1][3]
The benchmark was introduced on 17 February 2025 with an OpenAI research blog post and an arXiv preprint titled SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?. The initial study found that no frontier model evaluated could earn more than approximately 40 percent of the available payout pool, with Anthropic's Claude 3.5 Sonnet earning the most ($403,000 out of $1 million) and OpenAI's GPT-4o earning the least among the three headline models tested.[1][2][4]
| Field | Value |
|---|---|
| Released | 17 February 2025 (arXiv 2502.12115) |
| Authors | Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke (OpenAI); contributions from the Expensify engineering team |
| Total tasks | 1,488 (764 IC SWE; 724 SWE Manager) |
| Total payout pool | $1,000,000 USD ($414,775 IC; $585,225 Manager) |
| Source codebase | Expensify (mobile and web app, primarily TypeScript / React Native) |
| Evaluation | End-to-end browser tests (Playwright) for IC; ground-truth manager decisions for Manager tasks |
| Public split | SWE-Lancer Diamond (open-sourced subset) |
| Repository | github.com/openai/SWELancer-Benchmark (archived 18 July 2025; now part of github.com/openai/preparedness) |
| Venue | ICML 2025 (poster) |
By early 2025, coding benchmarks for LLMs had proliferated, but most measured tightly scoped tasks: completing single functions (HumanEval), resolving GitHub issues against unit tests (SWE-bench and SWE-bench Verified), or solving competitive-programming puzzles. Critics from both industry and academia argued that these benchmarks did not capture the messier reality of paid engineering work, which involves ambiguous requirements, full-stack codebases, mobile and browser front-ends, integration with existing services, and the need to demonstrate that a change actually fixes a user-visible problem rather than merely passing a synthetic unit test.[3][5]
The framing of SWE-Lancer as a benchmark in which models attempt to "earn" money is a deliberate rhetorical and methodological choice. By tying each task's score to its real Upwork payout, the authors construct a difficulty gradient that emerges from the labor market itself rather than from researcher-assigned weights. A model that fixes ten $50 bug reports is credited with $500, while a model that delivers a $32,000 feature implementation earns the full payout for that single task. This payout-weighted scoring also produces a headline figure with intuitive economic interpretation: the dollar amount of paid freelance work a model could (in principle) replace.[1][2]
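The contrast between counting tasks and counting dollars can be made concrete with the example from the paragraph above. The following is a minimal sketch, not the benchmark's own code: the `TaskResult` type, the helper names, and the 11-task pool are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskResult:
    payout_usd: int  # real-world payout attached to the task
    solved: bool     # True if the model's submission was accepted


def unweighted_pass_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks solved, ignoring payouts."""
    return sum(r.solved for r in results) / len(results)


def earnings_usd(results: List[TaskResult]) -> int:
    """Payout-weighted score: total dollars attached to solved tasks."""
    return sum(r.payout_usd for r in results if r.solved)


def make_pool(solve_small: bool, solve_large: bool) -> List[TaskResult]:
    """Hypothetical pool: ten $50 bug fixes plus one $32,000 feature."""
    small = [TaskResult(50, solve_small) for _ in range(10)]
    large = [TaskResult(32_000, solve_large)]
    return small + large


# Model A solves only the ten small tasks; model B solves only the large one.
model_a = make_pool(solve_small=True, solve_large=False)
model_b = make_pool(solve_small=False, solve_large=True)

if __name__ == "__main__":
    for name, results in (("A", model_a), ("B", model_b)):
        print(name, round(unweighted_pass_rate(results), 3), earnings_usd(results))
```

In this toy pool, model A resolves 10 of 11 tasks but is credited with $500, while model B resolves 1 of 11 and is credited with $32,000, which is the kind of reordering that payout weighting is meant to produce.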
The authors are careful to caveat that the dollar figures are illustrative rather than predictive of any actual labor-market impact. The benchmark assumes a perfect oracle for grading (the original Upwork client's acceptance criteria, encoded as end-to-end tests), no negotiation or clarification, and no ability for the model to refuse low-quality tasks. Even so, payout-weighted scoring has become one of SWE-Lancer's most-cited innovations and has influenced follow-on benchmarks that incorporate cost, latency, or economic-value weighting.[1][6]
All 1,488 tasks were drawn from the public Expensify mobile and web application repository, available at github.com/Expensify/App. Expensify routinely contracts freelance engineers through Upwork to fix bugs and implement features in this codebase, and the company partnered with OpenAI to release a snapshot of historical tasks together with the original payouts, problem statements, and accepted implementations. The Expensify application is a cross-platform expense-management product built primarily in TypeScript using React Native for mobile and React for web, which means most tasks involve full-stack JavaScript/TypeScript work with substantial UI components.[1][4][7]
Because every task originates from the same codebase, SWE-Lancer is sometimes described as a "single-repository" benchmark, in contrast with SWE-bench, whose tasks span twelve Python repositories. The single-codebase design enables a unified Docker image and consistent test harness, but it also introduces a generalization limit that the authors explicitly acknowledge.[1][7]
Individual Contributor tasks make up 764 of the 1,488 problems and account for $414,775 of the $1,000,000 pool. Each IC task is structured around an original Upwork ticket: the model is provided with a natural-language problem statement, a fixed snapshot of the Expensify repository at the time the task was posted, and reproduction steps. The model must produce a code patch. Success is determined by running a set of end-to-end browser tests, typically implemented with Playwright, that simulate a user clicking through the Expensify app to confirm the intended behaviour.[1][3][4]
The authors emphasize that end-to-end tests are stricter than unit tests for this domain because Expensify is heavily UI-driven; a patch that compiles and passes function-level assertions can still produce a broken user experience. To reduce false positives and false negatives in scoring, every IC task's end-to-end test was triple-verified by experienced engineers before inclusion. IC tasks span server-side logic, UI/UX, bug fixes, and feature work; in the original paper's analysis, Claude 3.5 Sonnet's strongest IC sub-category was server-side logic (41.2 percent pass rate), followed by UI/UX (31.7 percent) and bug fixes (28.4 percent).[1][4][8]
Software Engineering Manager tasks make up the remaining 724 problems and account for $585,225 of the payout pool, a majority of the benchmark's dollar value. Rather than writing code, the model is presented with the original problem statement and several competing implementation proposals (typically pull requests submitted by different freelancers). The model must select the proposal that the human engineering manager who hired for the task originally chose as best. A validation study found that experienced engineers, given the same materials, reached 99 percent agreement with the original manager's choice, suggesting the ground truth is well-defined.[1][4]
This task type captures a distinct skill from code writing: the ability to read, compare, and judge other people's code. The authors note that frontier models in early 2025 performed substantially better on Manager tasks than on IC tasks, which they interpret as evidence that current LLMs are stronger as code reviewers and technical advisors than as autonomous implementers.[1][4][8]
OpenAI released the benchmark with a unified Docker image that bundles the Expensify codebase at the correct historical snapshot, all required dependencies, the Playwright browser test harness, and a thin runner that scores model patches. The Docker design is intended to ensure reproducibility: rather than requiring researchers to provision a complex development environment, any team can pull a pre-built image and evaluate a model in a hermetic container. Each per-task image occupies approximately 14 GB and takes 10–20 minutes to build, which the authors flag as one of the benchmark's main practical costs.[7][9]
For IC tasks, scoring is binary at the task level: a patch either passes all of its end-to-end tests (earning the task's full payout) or fails (earning $0). Partial credit is not awarded. For Manager tasks, scoring is similarly binary: the model either selects the same proposal the original manager selected (full payout) or selects a different proposal ($0). Aggregate scores are reported as both percentages (fraction of tasks resolved) and dollar amounts (sum of payouts earned).[1][4]
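The all-or-nothing rules described above can be sketched in a few lines. This is an illustrative reading of the prose, not the official harness: the `ICOutcome` and `ManagerOutcome` types, their field names, and the choice to treat a task with no recorded tests as unresolved are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ICOutcome:
    payout_usd: int
    test_results: List[bool]  # one entry per end-to-end test


@dataclass(frozen=True)
class ManagerOutcome:
    payout_usd: int
    chosen_proposal: int      # proposal id picked by the model
    reference_proposal: int   # proposal id picked by the original manager


def ic_resolved(outcome: ICOutcome) -> bool:
    # Every end-to-end test must pass; an empty test list counts as
    # unresolved here (an assumption, since all([]) would be True).
    return bool(outcome.test_results) and all(outcome.test_results)


def manager_resolved(outcome: ManagerOutcome) -> bool:
    return outcome.chosen_proposal == outcome.reference_proposal


def summarize(ic: List[ICOutcome], mgr: List[ManagerOutcome]) -> Dict[str, float]:
    """Report both headline figures: percent of tasks resolved and dollars earned."""
    graded = [(o.payout_usd, ic_resolved(o)) for o in ic]
    graded += [(o.payout_usd, manager_resolved(o)) for o in mgr]
    resolved = sum(1 for _, ok in graded if ok)
    return {
        "tasks_resolved_pct": 100.0 * resolved / len(graded),
        "earned_usd": sum(payout for payout, ok in graded if ok),  # no partial credit
        "available_usd": sum(payout for payout, _ in graded),
    }
```

For example, an IC task whose patch passes two of three tests contributes $0, exactly as a Manager task with the wrong proposal selected does.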
The original paper reports primarily pass@1 numbers using a default prompting setup, but it also includes ablation studies that explore the effect of multiple attempts and tool use. In one widely cited result, OpenAI's o1 model's performance on IC tasks nearly tripled when allowed six additional attempts, suggesting substantial gains from sampling. Providing tool-use capabilities such as code execution and file inspection further improved o1's performance on Manager tasks.[4]
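For readers unfamiliar with multi-attempt metrics, the sketch below shows one common way to estimate pass@k from repeated samples: the unbiased estimator popularized by the HumanEval paper. The text above does not say which estimator the SWE-Lancer authors used, so this should be read as background on the metric rather than a description of their method.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k attempts succeeds.

    n: total attempts sampled for a task
    c: number of those attempts that passed
    k: attempt budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing attempt;
    # math.comb returns 0 when k > n - c, which yields an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n=10` sampled attempts and `c=3` passes, the estimate rises from 0.3 at `k=1` as the budget `k` grows, which illustrates why extra attempts can raise a model's reported score substantially.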
Because the underlying Expensify task data was public on Upwork and GitHub before SWE-Lancer was released, the authors worried about training-data contamination. They mitigate this in two ways. First, they release only a subset of the benchmark, SWE-Lancer Diamond, publicly, holding the remainder as a private evaluation set. Second, they analyze the timing of each task relative to model knowledge cutoffs and report that contamination effects appear limited for tasks predating the cutoff. Nonetheless, contamination remains a known limitation, particularly for models capable of web browsing during evaluation.[1][7]
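A timing analysis of this kind can be approximated by comparing pass rates on tasks posted before and after a model's knowledge cutoff. The sketch below is a hypothetical illustration: the input format, the function name, and the use of a single cutoff date are assumptions, and the authors' actual analysis may differ.

```python
from datetime import date
from typing import Dict, Iterable, Optional, Tuple


def pass_rate_by_cutoff(
    tasks: Iterable[Tuple[date, bool]], cutoff: date
) -> Dict[str, Optional[float]]:
    """Split (posted_date, solved) pairs at the cutoff and return each pass rate.

    A much higher rate before the cutoff than after it would be one
    possible sign of contamination; a similar rate suggests a limited effect.
    """
    buckets = {"before_cutoff": [], "after_cutoff": []}
    for posted, solved in tasks:
        key = "before_cutoff" if posted <= cutoff else "after_cutoff"
        buckets[key].append(solved)
    # None marks an empty bucket so it is not mistaken for a 0% pass rate.
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}
```

Such a comparison is only suggestive, because tasks posted at different times may also differ in difficulty or payout.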
The original paper reports headline numbers for three frontier models: OpenAI's GPT-4o (May 2024 release) and o1 (the reasoning model released in December 2024), and Anthropic's Claude 3.5 Sonnet (October 2024 update). The results are summarized below.[1][4][8]
| Model | IC SWE pass rate | SWE Manager pass rate | Total earnings (out of $1,000,000) |
|---|---|---|---|
| Claude 3.5 Sonnet | 26.2% | 44.9% | ~$403,000 |
| OpenAI o1 | lower than Claude 3.5 Sonnet | 36–37% | ~$380,000 |
| GPT-4o | 8.0% | lower than o1 | ~$304,000 |
On the public Diamond subset specifically, Claude 3.5 Sonnet earned $208,050. These numbers attracted attention because Anthropic's Claude 3.5 Sonnet, a non-reasoning model, outperformed OpenAI's reasoning model o1 on this benchmark released by OpenAI itself, a reversal of the typical ordering on contemporaneous benchmarks such as SWE-bench Verified. Several commentators interpreted the result as a sign that real-world freelance work rewards a different mix of skills (broad codebase navigation, UI-aware patching) than the structured competition-style tasks on which reasoning models typically excel.[4][8][10]
The authors emphasize that even the best model leaves the majority of the payout pool unearned: roughly 60 percent of the available money goes uncollected by Claude 3.5 Sonnet, and the model fails outright on most IC tasks. The paper's stated conclusion is that "frontier models are still unable to solve the majority of tasks" in this benchmark.[1]
Since the February 2025 release, multiple frontier-model launches have been accompanied by SWE-Lancer numbers. Public leaderboards aggregating results, most prominently the LLM Stats SWE-Lancer pages, track a growing list of models on both the full benchmark and the IC-Diamond subset.[11][12]
On the full SWE-Lancer leaderboard, OpenAI models with published scores include GPT-4o (32.6 percent), GPT-4.5 (37.3 percent), o3-mini (18.0 percent), and, as of late 2025, GPT-5.1 Codex (approximately 66.3 percent), the highest publicly reported score on the full benchmark.[11] On the IC-Diamond subset, the leaderboard tracks GPT-4o (12.4 percent), GPT-4.5 (17.4 percent), o3-mini (7.4 percent), GPT-5.2 (74.6 percent, posted December 2025), GPT-5.3 Codex (81.4 percent), and GPT-5, with a reported score of 100 percent.[12]
Anthropic, Google DeepMind, and other vendors have not consistently published SWE-Lancer numbers for their newer models on the public leaderboards, but third-party evaluations and individual blog posts have reported scores for Claude 3.7 Sonnet (released February 2025), Claude Opus 4, Claude Sonnet 4.5, and Gemini 2.5 Pro in the 40–70 percent range on various subsets. Because methodologies vary (full benchmark vs. Diamond, pass@1 vs. pass@k, with or without tool use), cross-model comparisons outside of head-to-head papers should be interpreted with caution.[4][11][12]
The trajectory of reported scores has driven a recurring debate in the community. SWE-Lancer Diamond in particular appeared to be approaching saturation by late 2025, with at least one model scoring 100 percent. Critics argue that the small size of the Diamond split relative to the full benchmark makes such ceiling effects unsurprising, and that the more representative full-benchmark numbers, which top out near two-thirds, remain a meaningful target.[12][13]
SWE-Lancer Diamond is the publicly released evaluation split of the benchmark. It consists of a curated subset of the 1,488 tasks for which the authors have made the full Docker images, problem statements, and reference tests openly available. The remaining tasks form a private holdout used for OpenAI's internal evaluations and for guarding against contamination. The name mirrors the "Diamond" label used for high-confidence, fully verified subsets of other benchmarks, such as GPQA Diamond. The IC-Diamond leaderboard tracked on third-party services such as LLM Stats refers specifically to the IC SWE tasks within Diamond.[1][12]
A 2025 research follow-up named SWE-Lancer-Loc restructures a portion of the benchmark for "issue localization," the task of identifying which files in the Expensify codebase must be edited to resolve a given problem, without requiring the model to produce a working patch. Localization is treated as a lower-bound proxy for engineering competence: a model that cannot find the right file is unlikely to fix the bug correctly.[14]
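File-level localization is typically scored by comparing the set of files a model names against the set actually edited in the accepted fix. The sketch below uses precision, recall, and F1 for that comparison. The choice of metrics and the function name are assumptions made for illustration; the text above does not state which metrics SWE-Lancer-Loc reports.

```python
from typing import Dict, Iterable


def localization_scores(
    predicted_files: Iterable[str], gold_files: Iterable[str]
) -> Dict[str, float]:
    """Compare predicted file paths with the files edited in the reference fix."""
    predicted, gold = set(predicted_files), set(gold_files)
    hits = len(predicted & gold)
    # Empty sets score 0.0 rather than raising a division error.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

A model that names none of the edited files has a recall of zero under this scheme, which matches the intuition that it is unlikely to go on to fix the bug.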
In July 2025 the original openai/SWELancer-Benchmark repository was archived and its contents merged into a broader openai/preparedness repository, where active maintenance of evaluation tooling continued. As of mid-2025, the maintainers reported that 198 of the tasks had been adjusted to run successfully offline, a non-trivial subset given the network-dependent nature of full-stack browser tests.[7][9]
SWE-Lancer occupies a distinctive niche in the landscape of LLM coding benchmarks. The following table sketches the main contrasts.[1][3][15]
| Benchmark | Source | Test type | Difficulty signal | Languages |
|---|---|---|---|---|
| HumanEval (2021) | Hand-written prompts | Function-level unit tests | Manual curation | Python |
| SWE-bench (2023) | 12 Python GitHub repos | Repo-level unit tests | GitHub issue labels | Python |
| SWE-bench Verified (2024) | SWE-bench, human-cleaned | Repo-level unit tests | Human filtering | Python |
| LiveCodeBench | Competitive programming | Hidden test cases | Contest difficulty | Python (mainly) |
| Aider Polyglot | Curated multi-language tasks | Unit tests | Manual curation | 6+ languages |
| Terminal-Bench | Hand-crafted CLI tasks | Behavioral checks | Manual curation | Shell-centric |
| SWE-Lancer | Expensify Upwork tickets | End-to-end browser tests | Real Upwork payout | TypeScript (full-stack) |
The most direct competitor in scope is SWE-bench Verified, which similarly evaluates patches against existing tests in a real-world repository. SWE-Lancer is differentiated by its single-codebase TypeScript scope, its end-to-end (rather than unit) testing harness, its inclusion of a manager-style multiple-choice task type, and its payout-weighted scoring. Critics of SWE-Lancer often point to SWE-bench Verified as a more diverse benchmark; defenders point to SWE-Lancer's heavier emphasis on UI-aware, multi-file work that is closer to typical product-engineering practice than backend Python library patches.[3][4][15]
Reception of SWE-Lancer in academic and industry circles has been broadly positive, with the arXiv paper subsequently being accepted as a poster at the International Conference on Machine Learning (ICML) 2025. Commentators have highlighted three perceived strengths: end-to-end testing, payout-weighted scoring, and the inclusion of manager-style tasks alongside code-writing tasks.[1][16]
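Several specific criticisms have nonetheless emerged. The most-cited concerns, each noted earlier in this article, include the single-codebase scope and the limit it places on generalization, possible training-data contamination from task data that was already public, the practical cost of building and storing the Docker images, and the apparent saturation of the small Diamond split.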
On developer community discussion sites such as Hacker News, threads about SWE-Lancer have raised additional concerns about whether the benchmark's IC tasks systematically underweight backend infrastructure work (common in full-time engineering but rarer on Upwork), and whether the benchmark's heavy UI emphasis biases scores toward models trained with substantial web-development data.[13][17] Despite these criticisms, the benchmark has been incorporated into routine evaluation suites at OpenAI, several model vendors, and third-party evaluators, and its underlying methodology, payout-weighted scoring with end-to-end testing on real product code, has been cited as a template for next-generation software-engineering benchmarks.[4][11][16]