# SWE-Lancer

> Source: https://aiwiki.ai/wiki/swe_lancer
> Updated: 2026-06-09
> Categories: AI Benchmarks, AI Code Generation, OpenAI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**SWE-Lancer** is a benchmark released by [OpenAI](/wiki/openai) in February 2025 that evaluates the ability of frontier [large language models](/wiki/large_language_model) to perform real-world freelance software-engineering work. The benchmark consists of 1,488 paid tasks drawn from the freelance marketplace Upwork, with a combined real-world payout of one million United States dollars. All tasks were sourced from the open-source mobile and web codebase of the expense-management company Expensify and are graded with end-to-end tests rather than unit tests, with each model's score reported as the cumulative dollar value of the tasks it successfully completes.[^1][^2]

Unlike earlier coding benchmarks such as [HumanEval](/wiki/humaneval) or [SWE-bench](/wiki/swe_bench), which evaluate isolated programming problems or repository-level patches against test suites, SWE-Lancer is explicitly designed to mirror the economic structure of paid software work. Tasks range from $50 bug fixes to $32,000 feature implementations, and they are split into two categories: Individual Contributor (IC) tasks, in which a model must write code that passes end-to-end browser tests, and Software Engineering Manager (SWE Manager) tasks, in which the model must select the best implementation proposal from several competing options that real engineering managers reviewed.[^1][^3]

The benchmark was introduced on 17 February 2025 with an OpenAI research blog post and an arXiv preprint titled *SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?*. The initial study found that no frontier model evaluated could earn more than approximately 40 percent of the available payout pool, with [Anthropic's](/wiki/anthropic) [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) earning the most ($403,000 out of $1 million) and [OpenAI's](/wiki/openai) [GPT-4o](/wiki/gpt_4o) earning the least among the three headline models tested.[^1][^2][^4]

## Key facts

| Field | Value |
|---|---|
| Released | 17 February 2025 (arXiv 2502.12115) |
| Authors | Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke (OpenAI); contributions from the Expensify engineering team |
| Total tasks | 1,488 (764 IC SWE; 724 SWE Manager) |
| Total payout pool | $1,000,000 USD ($414,775 IC; $585,225 Manager) |
| Source codebase | Expensify (mobile and web app, primarily [TypeScript](/wiki/typescript) / React Native) |
| Evaluation | End-to-end browser tests (Playwright) for IC; ground-truth manager decisions for Manager tasks |
| Public split | SWE-Lancer Diamond (open-sourced subset) |
| Repository | `github.com/openai/SWELancer-Benchmark` (archived 18 July 2025; now part of `github.com/openai/preparedness`) |
| Venue | [ICML](/wiki/icml) 2025 (poster) |

## Background

By early 2025, coding benchmarks for [LLMs](/wiki/llm) had proliferated, but most measured tightly scoped tasks: completing single functions ([HumanEval](/wiki/humaneval)), resolving GitHub issues against unit tests ([SWE-bench](/wiki/swe_bench) and [SWE-bench Verified](/wiki/swe_bench_verified)), or solving competitive-programming puzzles. Critics from both industry and academia argued that these benchmarks did not capture the messier reality of paid engineering work, which involves ambiguous requirements, full-stack codebases, mobile and browser front-ends, integration with existing services, and the need to demonstrate that a change actually fixes a user-visible problem rather than merely passing a synthetic unit test.[^3][^5]

The framing of SWE-Lancer as a benchmark in which models attempt to "earn" money is a deliberate rhetorical and methodological choice. By tying each task's score to its real Upwork payout, the authors construct a difficulty gradient that emerges from the labor market itself rather than from researcher-assigned weights. A model that fixes ten $50 bug reports is credited with $500, while a model that delivers a $32,000 feature implementation earns the full payout for that single task. This payout-weighted scoring also produces a headline figure with intuitive economic interpretation: the dollar amount of paid freelance work a model could (in principle) replace.[^1][^2]

The authors are careful to caveat that the dollar figures are illustrative rather than predictive of any actual labor-market impact. The benchmark assumes a perfect oracle for grading (the original Upwork client's acceptance criteria, encoded as end-to-end tests), no negotiation or clarification, and no ability for the model to refuse low-quality tasks. Even so, payout-weighted scoring has become one of SWE-Lancer's most-cited innovations and has influenced follow-on benchmarks that incorporate cost, latency, or economic-value weighting.[^1][^6]

## Dataset composition

### Source codebase

All 1,488 tasks were drawn from the public Expensify mobile and web application repository, available at `github.com/Expensify/App`. Expensify routinely contracts freelance engineers through Upwork to fix bugs and implement features in this codebase, and the company partnered with OpenAI to release a snapshot of historical tasks together with the original payouts, problem statements, and accepted implementations. The Expensify application is a cross-platform expense-management product built primarily in [TypeScript](/wiki/typescript) using React Native for mobile and React for web, which means most tasks involve full-stack JavaScript/TypeScript work with substantial UI components.[^1][^4][^7]

Because every task originates from the same codebase, SWE-Lancer is sometimes described as a "single-repository" benchmark, in contrast with [SWE-bench](/wiki/swe_bench), whose tasks span twelve Python repositories. The single-codebase design enables a unified Docker image and consistent test harness, but it also introduces a generalization limit that the authors explicitly acknowledge.[^1][^7]

### IC SWE tasks

Individual Contributor tasks make up 764 of the 1,488 problems and account for $414,775 of the $1,000,000 pool. Each IC task is structured around an original Upwork ticket: the model is provided with a natural-language problem statement, a fixed snapshot of the Expensify repository at the time the task was posted, and reproduction steps. The model must produce a code patch. Success is determined by running a set of end-to-end browser tests, typically implemented with Playwright, that simulate a user clicking through the Expensify app to confirm the intended behaviour.[^1][^3][^4]

The authors emphasize that end-to-end tests are stricter than unit tests for this domain because Expensify is heavily UI-driven; a patch that compiles and passes function-level assertions can still produce a broken user experience. To reduce false positives and false negatives in scoring, every IC task's end-to-end test was triple-verified by experienced engineers before inclusion. IC tasks span server-side logic, UI/UX, bug fixes, and feature work; in the original paper's analysis, Claude 3.5 Sonnet's strongest IC sub-category was server-side logic (41.2 percent pass rate), followed by UI/UX (31.7 percent) and bug fixes (28.4 percent).[^1][^4][^8]

### SWE Manager tasks

Software Engineering Manager tasks make up the remaining 724 problems and account for $585,225 of the payout pool, a majority of the benchmark's dollar value. Rather than writing code, the model is presented with the original problem statement and several competing implementation proposals (typically pull requests submitted by different freelancers). The model must select the proposal that the human engineering manager who hired for the task originally chose as best. A validation study found that experienced engineers, given the same materials, reached 99 percent agreement with the original manager's choice, suggesting the ground truth is well-defined.[^1][^4]

This task type captures a distinct skill from code writing: the ability to read, compare, and judge other people's code. The authors note that frontier models in early 2025 performed substantially better on Manager tasks than on IC tasks, which they interpret as evidence that current LLMs are stronger as code reviewers and technical advisors than as autonomous implementers.[^1][^4][^8]

## Methodology

### Evaluation environment

OpenAI released the benchmark with a unified Docker image that bundles the Expensify codebase at the correct historical snapshot, all required dependencies, the Playwright browser test harness, and a thin runner that scores model patches. The Docker design is intended to ensure reproducibility: rather than requiring researchers to provision a complex development environment, any team can pull a pre-built image and evaluate a model in a hermetic container. Each per-task image occupies approximately 14 GB and takes 10-20 minutes to build, which the authors flag as one of the benchmark's main practical costs.[^7][^9]

For IC tasks, scoring is binary at the task level: a patch either passes all of its end-to-end tests (earning the task's full payout) or fails (earning $0). Partial credit is not awarded. For Manager tasks, scoring is similarly binary: the model either selects the same proposal the original manager selected (full payout) or selects a different proposal ($0). Aggregate scores are reported as both percentages (fraction of tasks resolved) and dollar amounts (sum of payouts earned).[^1][^4]
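
The all-or-nothing rule and the two reported aggregates can be illustrated with a minimal Python sketch. The `TaskResult` record and `score` function are hypothetical names chosen for this illustration, not part of the benchmark's released tooling. The sample data reuses the ten $50 bug fixes and the single $32,000 feature from the Background section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one attempt. Field names are illustrative, not the benchmark's schema."""
    task_id: str
    payout_usd: float  # the task's real Upwork price
    passed: bool       # IC: all end-to-end tests passed; Manager: matched the original choice


def score(results: list[TaskResult]) -> tuple[float, float]:
    """Return (pass_rate, dollars_earned) under all-or-nothing task scoring."""
    if not results:
        return 0.0, 0.0
    solved = [r for r in results if r.passed]
    pass_rate = len(solved) / len(results)
    earned = sum(r.payout_usd for r in solved)  # failed tasks contribute $0, no partial credit
    return pass_rate, earned


# Ten $50 bug fixes solved, one $32,000 feature missed.
results = [TaskResult(f"bug-{i}", 50.0, True) for i in range(10)]
results.append(TaskResult("feature-1", 32_000.0, False))

pass_rate, earned = score(results)
print(f"pass rate {pass_rate:.1%}, earned ${earned:,.0f}")
# prints: pass rate 90.9%, earned $500
```

The example shows why the two aggregates can diverge: a model can resolve most tasks by count while collecting only a small fraction of the available payout.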

### Pass@k and tool use

The original paper reports primarily pass@1 numbers using a default prompting setup, but it also includes ablation studies that explore the effect of multiple attempts and tool use. In one widely cited result, OpenAI's [o1](/wiki/o1) model's performance on IC tasks nearly tripled when allowed six additional attempts, suggesting substantial gains from sampling. Providing tool-use capabilities such as code execution and file inspection further improved [o1](/wiki/o1)'s performance on Manager tasks.[^4]
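
For readers unfamiliar with pass@k, the sketch below implements the unbiased estimator popularized by the HumanEval paper (Chen et al., 2021). The source does not state how the SWE-Lancer authors compute their multi-attempt numbers, so treating this estimator as their method is an assumption. The function name and sample counts are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: total samples drawn for a task
    c: how many of those samples passed
    k: attempt budget being evaluated (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 minus the probability that a random size-k subset contains only failures
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 7 samples drawn, 1 passed.
print(pass_at_k(n=7, c=1, k=1))
print(pass_at_k(n=7, c=1, k=7))
```

Averaging this quantity over tasks gives a benchmark-level pass@k. A payout-weighted variant would weight each task's estimate by its dollar value.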

### Contamination control

Because the underlying Expensify task data was public on Upwork and GitHub before SWE-Lancer was released, the authors worried about training-data contamination. They mitigate this in two ways. First, they release only a subset of the benchmark, SWE-Lancer Diamond, publicly, holding the remainder as a private evaluation set. Second, they analyze the timing of each task relative to model knowledge cutoffs and report that contamination effects appear limited for tasks predating the cutoff. Nonetheless, contamination remains a known limitation, particularly for models capable of web browsing during evaluation.[^1][^7]
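
A cutoff-based timing analysis of this kind can be sketched as follows. This is a simplified illustration, not the authors' actual procedure: the function name, the `(posted_on, passed)` tuple layout, and all dates and outcomes below are made up for the example.

```python
from datetime import date


def pass_rate_by_cutoff(tasks, cutoff):
    """Split (posted_on, passed) pairs at a knowledge cutoff and return the pass rate on each side."""
    before = [passed for posted_on, passed in tasks if posted_on <= cutoff]
    after = [passed for posted_on, passed in tasks if posted_on > cutoff]

    def rate(flags):
        # None signals an empty bucket rather than a 0 percent pass rate
        return sum(flags) / len(flags) if flags else None

    return {"before_cutoff": rate(before), "after_cutoff": rate(after)}


# Hypothetical outcomes for five tasks and a hypothetical model cutoff.
tasks = [
    (date(2023, 5, 1), True),
    (date(2023, 8, 1), False),
    (date(2024, 2, 1), True),
    (date(2024, 6, 1), False),
    (date(2024, 7, 1), False),
]
print(pass_rate_by_cutoff(tasks, cutoff=date(2023, 10, 1)))
```

A markedly higher pass rate on tasks posted before the cutoff than after it would be one warning sign of memorization, though task difficulty can also drift over time, so the comparison is suggestive rather than conclusive.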

## Initial results (February 2025)

The original paper reports headline numbers for three frontier models: [OpenAI's](/wiki/openai) [GPT-4o](/wiki/gpt_4o) (May 2024 release) and [o1](/wiki/o1) (the reasoning model released in December 2024), and [Anthropic's](/wiki/anthropic) [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) (October 2024 update). The results are summarized below.[^1][^4][^8]

| Model | IC SWE pass rate | SWE Manager pass rate | Total earnings (out of $1,000,000) |
|---|---|---|---|
| Claude 3.5 Sonnet | 26.2% | 44.9% | ~$403,000 |
| OpenAI o1 | lower than Claude 3.5 Sonnet | 36-37% | ~$380,000 |
| GPT-4o | 8.0% | lower than o1 | ~$304,000 |

On the public Diamond subset specifically, Claude 3.5 Sonnet earned $208,050. These numbers attracted attention because Anthropic's Claude 3.5 Sonnet, a non-reasoning model, outperformed OpenAI's [reasoning model](/wiki/reasoning_models) [o1](/wiki/o1) on this benchmark released by [OpenAI itself](/wiki/openai), a reversal of the typical ordering on contemporaneous benchmarks such as [SWE-bench Verified](/wiki/swe_bench_verified). Several commentators interpreted the result as a sign that real-world freelance work rewards a different mix of skills (broad codebase navigation, UI-aware patching) than the structured competition-style tasks on which reasoning models typically excel.[^4][^8][^10]

The authors emphasize that even the best model leaves the majority of the payout pool unearned: roughly 60 percent of the available money goes uncollected by Claude 3.5 Sonnet, and the model fails outright on most IC tasks. The paper's stated conclusion is that "frontier models are still unable to solve the majority of tasks" in this benchmark.[^1]
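
The "roughly 60 percent" figure follows directly from the approximate totals in the results table. The short Python check below uses those rounded totals. The variable and function names are illustrative.

```python
POOL_USD = 1_000_000

# Approximate totals as reported in the results table above.
earnings_usd = {
    "Claude 3.5 Sonnet": 403_000,
    "OpenAI o1": 380_000,
    "GPT-4o": 304_000,
}


def uncollected_share(earned_usd: float, pool_usd: float = POOL_USD) -> float:
    """Fraction of the payout pool a model did not earn."""
    return 1 - earned_usd / pool_usd


for model, earned in earnings_usd.items():
    print(f"{model}: earned {earned / POOL_USD:.1%}, uncollected {uncollected_share(earned):.1%}")
```

For Claude 3.5 Sonnet this gives 40.3 percent earned and 59.7 percent uncollected, which matches the rounded figures quoted in the text.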

## Subsequent results and leaderboard updates

Since the February 2025 release, multiple frontier-model launches have been accompanied by SWE-Lancer numbers. Public leaderboards aggregating results, most prominently the LLM Stats SWE-Lancer pages, track a growing list of models on both the full benchmark and the IC-Diamond subset.[^11][^12]

On the full SWE-Lancer leaderboard, OpenAI models with published scores include [GPT-4o](/wiki/gpt_4o) (32.6 percent), GPT-4.5 (37.3 percent), and o3-mini (18.0 percent). As of late 2025, GPT-5.1 Codex holds the highest publicly reported score on the full benchmark, at approximately 66.3 percent.[^11] On the IC-Diamond subset, the leaderboard tracks [GPT-4o](/wiki/gpt_4o) (12.4 percent), GPT-4.5 (17.4 percent), o3-mini (7.4 percent), GPT-5.2 (74.6 percent, posted December 2025), GPT-5.3 Codex (81.4 percent), and GPT-5 (a reported 100 percent).[^12]

Anthropic, Google DeepMind, and other vendors have not consistently published SWE-Lancer numbers for their post-2025 models on the public leaderboards, but third-party evaluations and individual blog posts have reported scores for [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) (released February 2025), [Claude Opus 4](/wiki/claude_opus_4), [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5), and [Gemini 2.5 Pro](/wiki/gemini_2_5_pro) in the 40-70 percent range on various subsets. Because methodologies vary (full benchmark vs. Diamond, pass@1 vs. pass@k, with or without tool use), cross-model comparison outside of head-to-head papers should be interpreted with caution.[^4][^11][^12]

The trajectory of reported scores has driven a recurring debate in the community: SWE-Lancer Diamond, in particular, appears to be approaching saturation by late 2025, with at least one model scoring 100 percent. Critics argue that the small size of the Diamond split (relative to the full benchmark) makes such ceiling effects unsurprising and that the more representative full-benchmark numbers, which top out near two-thirds, remain a meaningful target.[^12][^13]

## SWE-Lancer Diamond and variants

**SWE-Lancer Diamond** is the publicly released evaluation split of the benchmark. It consists of a curated subset of the 1,488 tasks for which the authors have made the full Docker images, problem statements, and reference tests openly available. The remaining tasks form a private holdout used for OpenAI's internal evaluations and for guarding against contamination. The split's name follows a convention of using "Diamond" to denote a high-confidence, fully verified subset, as also seen in benchmarks such as GPQA Diamond. The IC-Diamond leaderboard tracked on third-party services such as LLM Stats refers specifically to the IC SWE tasks within Diamond.[^1][^12]

A 2025 research follow-up named **SWE-Lancer-Loc** restructures a portion of the benchmark for "issue localization," the task of identifying which files in the Expensify codebase must be edited to resolve a given problem, without requiring the model to produce a working patch. Localization is treated as a lower-bound proxy for engineering competence: a model that cannot find the right file is unlikely to fix the bug correctly.[^14]
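
One common way to score file-level localization is recall over the top-k predicted files, sketched below in Python. The source does not say which metric SWE-Lancer-Loc reports, so recall@k is an assumption here, and the function name and file paths are hypothetical.

```python
def file_recall_at_k(predicted: list[str], gold: set[str], k: int) -> float:
    """Fraction of the files that truly needed editing found among the top-k predictions."""
    if not gold:
        raise ValueError("gold must contain at least one file")
    if k < 1:
        raise ValueError("k must be at least 1")
    top_k = set(predicted[:k])  # predictions are assumed to be ranked best-first
    return len(top_k & gold) / len(gold)


# Hypothetical paths, not taken from the Expensify repository.
predicted = ["src/pages/ReportScreen.tsx", "src/libs/ReportUtils.ts", "src/App.tsx"]
gold = {"src/libs/ReportUtils.ts", "src/components/Composer.tsx"}

print(file_recall_at_k(predicted, gold, k=1))
print(file_recall_at_k(predicted, gold, k=3))
```

Under this metric a model is credited for surfacing the right files even when it cannot yet produce a passing patch, which is what makes localization usable as a lower-bound proxy.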

In July 2025 the original `openai/SWELancer-Benchmark` repository was archived and its contents merged into a broader `openai/preparedness` repository, where active maintenance of evaluation tooling continued. As of mid-2025, the maintainers reported that 198 of the tasks had been adjusted to run successfully offline, a non-trivial subset given the network-dependent nature of full-stack browser tests.[^7][^9]

## Comparison to other coding benchmarks

SWE-Lancer occupies a distinctive niche in the landscape of [LLM](/wiki/llm) coding benchmarks. The following table sketches the main contrasts.[^1][^3][^15]

| Benchmark | Source | Test type | Difficulty signal | Languages |
|---|---|---|---|---|
| [HumanEval](/wiki/humaneval) (2021) | Hand-written prompts | Function-level unit tests | Manual curation | Python |
| [SWE-bench](/wiki/swe_bench) (2023) | 12 Python GitHub repos | Repo-level unit tests | GitHub issue labels | Python |
| [SWE-bench Verified](/wiki/swe_bench_verified) (2024) | SWE-bench, human-cleaned | Repo-level unit tests | Human filtering | Python |
| [LiveCodeBench](/wiki/livecodebench) | Competitive programming | Hidden test cases | Contest difficulty | Python (mainly) |
| [Aider Polyglot](/wiki/aider_polyglot) | Curated multi-language tasks | Unit tests | Manual curation | 6+ languages |
| [Terminal-Bench](/wiki/terminal_bench) | Hand-crafted CLI tasks | Behavioral checks | Manual curation | Shell-centric |
| **SWE-Lancer** | Expensify Upwork tickets | End-to-end browser tests | Real Upwork payout | TypeScript (full-stack) |

The most direct competitor in scope is [SWE-bench Verified](/wiki/swe_bench_verified), which similarly evaluates patches against existing tests in a real-world repository. SWE-Lancer is differentiated by its single-codebase TypeScript scope, its end-to-end (rather than unit) testing harness, its inclusion of a manager-style multiple-choice task type, and its payout-weighted scoring. Critics of SWE-Lancer often point to SWE-bench Verified as a more diverse benchmark; defenders point to SWE-Lancer's heavier emphasis on UI-aware, multi-file work that is closer to typical product-engineering practice than backend Python library patches.[^3][^4][^15]

## Reception and criticism

Reception of SWE-Lancer in academic and industry circles has been broadly positive, and the arXiv paper was subsequently accepted as a poster at the International Conference on Machine Learning ([ICML](/wiki/icml)) 2025. Commentators have highlighted three perceived strengths: end-to-end testing, payout-weighted scoring, and the inclusion of manager-style tasks alongside code-writing tasks.[^1][^16]

Several specific criticisms have nonetheless emerged. The most-cited concerns include:

* **Single-codebase generalization.** Because all 1,488 tasks come from the Expensify mobile and web codebase, results may reflect idiosyncrasies of that project (React Native, particular state-management patterns, Expensify-specific abstractions) rather than general engineering ability. The authors acknowledge this and recommend caution when generalizing beyond freelance contexts.[^1][^4]
* **Test-based scoring.** End-to-end tests improve realism over unit tests but still cannot judge code quality, maintainability, or architectural soundness. A model that passes the tests with a hacky patch receives the same score as one that delivers a clean implementation.[^4][^17]
* **No clarification or negotiation.** Real freelancers can ask the client for clarification, propose alternative approaches, or refuse out-of-scope work. SWE-Lancer models receive a fixed problem statement and must work from it, which the paper concedes may understate model capability.[^1]
* **Text-only modality.** The original benchmark provides only textual problem descriptions and code; many real Upwork tickets include screenshots, screen-recordings, or video demonstrations of bugs, which the benchmark does not surface to the model.[^1]
* **"Pay" framing as economic measurement.** Tying scores to dollar payouts produces vivid headlines ("AI earns $403,000 of $1 million") but is at best a stylized economic signal. The dollar amounts reflect Upwork market-clearing prices for these specific tasks at specific times, not the marginal social or economic value of automation.[^4][^17]
* **Contamination risk.** Although the authors hold out a private subset and analyze knowledge cutoffs, the public Diamond split remains vulnerable to direct training-data inclusion, particularly for models that browse the web during evaluation.[^7]

On developer community discussion sites such as Hacker News, threads about SWE-Lancer have raised two additional concerns: that the benchmark's IC tasks systematically underweight backend infrastructure work (common in full-time engineering but rarer on Upwork), and that its heavy UI emphasis biases scores toward models trained on substantial web-development data.[^13][^17] Despite these criticisms, the benchmark has been incorporated into routine evaluation suites at OpenAI, several model vendors, and third-party evaluators. Its underlying methodology, payout-weighted scoring with end-to-end testing on real product code, has been cited as a template for next-generation [software-engineering benchmarks](/wiki/benchmark).[^4][^11][^16]

## References

[^1]: Miserendino, Samuel; Wang, Michele; Patwardhan, Tejal; Heidecke, Johannes. *SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?* arXiv:2502.12115. 17 February 2025. https://arxiv.org/abs/2502.12115

[^2]: OpenAI. "Introducing the SWE-Lancer benchmark." OpenAI Research blog, 17 February 2025. https://openai.com/index/swe-lancer/

[^3]: MarkTechPost. "OpenAI Introduces SWE-Lancer: A Benchmark for Evaluating Model Performance on Real-World Freelance Software Engineering Work." 17 February 2025. https://www.marktechpost.com/2025/02/17/openai-introduces-swe-lancer-a-benchmark-for-evaluating-model-performance-on-real-world-freelance-software-engineering-work/

[^4]: Dickson, Ben. "Claude 3.5 Sonnet outperforms GPT-4o and o1 in software engineering, OpenAI study shows." TechTalks, 24 February 2025. https://bdtechtalks.com/2025/02/24/claude-3-5-sonnet-outperforms-gpt-4o-and-o1-in-software-engineering-openai-study-shows/

[^5]: GIGAZINE. "OpenAI releases AI benchmark 'SWE-Lancer' to measure whether a machine can perform tasks that would cost a freelance engineer $1 million." 19 February 2025. https://gigazine.net/gsc_news/en/20250219-openai-swe-lancer/

[^6]: Analytics Vidhya. "OpenAI's SWE-Lancer Benchmark: Testing AI on $1 Million Worth of Freelance Coding Tasks." February 2025. https://www.analyticsvidhya.com/blog/2025/02/openais-swe-lancer-benchmark/

[^7]: OpenAI. SWELancer-Benchmark GitHub repository (archived 18 July 2025). https://github.com/openai/SWELancer-Benchmark

[^8]: OpenAI Developer Community. "OpenAI releases new coding benchmark SWE-Lancer showing 3.5 Sonnet beating o1." February 2025. https://community.openai.com/t/openai-releases-new-coding-benchmark-swe-lancer-showing-3-5-sonnet-beating-o1/1123976

[^9]: DeepWiki. "openai/SWELancer-Benchmark." 2025. https://deepwiki.com/openai/SWELancer-Benchmark

[^10]: Expensify Engineering. "Expensify Powers OpenAI's SWE-Lancer: Real-World AI Benchmarks." Expensify blog, 2025. https://use.expensify.com/blog/expensify-powers-openai-swe-lancer-project

[^11]: LLM Stats. "SWE-Lancer Leaderboard." Accessed 2026. https://llm-stats.com/benchmarks/swe-lancer

[^12]: LLM Stats. "SWE-Lancer (IC-Diamond subset) Benchmark Leaderboard." Accessed 2026. https://llm-stats.com/benchmarks/swe-lancer-(ic-diamond-subset)

[^13]: Hacker News. "SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork." Discussion thread, February 2025. https://news.ycombinator.com/item?id=43086347

[^14]: EmergentMind. "SWE-Lancer-Loc: Real-World Issue Localization." 2025. https://www.emergentmind.com/topics/swe-lancer-loc

[^15]: SWE-bench. "SWE-bench Leaderboards." Accessed 2026. https://www.swebench.com/

[^16]: ICML 2025. "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?" Poster. https://icml.cc/virtual/2025/poster/43573

[^17]: DevOps.com. "AI Coding: New Research Shows Even the Best Models Struggle With Real-World Software Engineering." 2025. https://devops.com/ai-coding-new-research-shows-even-the-best-models-struggle-with-real-world-software-engineering-2/

