# SWE-Lancer

> Source: https://aiwiki.ai/wiki/swe_lancer
> Updated: 2026-07-07
> Categories: AI Benchmarks, AI Code Generation, OpenAI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**SWE-Lancer** is a benchmark released by [OpenAI](/wiki/openai) in February 2025 that evaluates the ability of frontier [large language models](/wiki/large_language_model) to perform real-world freelance software-engineering work. The benchmark consists of 1,488 paid tasks drawn from the freelance marketplace Upwork, with a combined real-world payout of one million United States dollars. All tasks were sourced from the open-source mobile and web codebase of the expense-management company Expensify and are graded with end-to-end tests rather than unit tests, with each model's score reported as the cumulative dollar value of the tasks it successfully completes. In OpenAI's launch evaluation, the best-performing model, [Anthropic's](/wiki/anthropic) [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet), earned $403,325 (about 40 percent of the available pool), leading the authors to conclude that frontier models are "still unable to solve the majority of tasks."[^1][^2]

Unlike earlier coding benchmarks such as [HumanEval](/wiki/humaneval) or [SWE-bench](/wiki/swe_bench), which evaluate isolated programming problems or repository-level patches against test suites, SWE-Lancer is explicitly designed to mirror the economic structure of paid software work. Tasks range from $50 bug fixes to $32,000 feature implementations, and they are split into two categories: Individual Contributor (IC) tasks, in which a model must write code that passes end-to-end browser tests, and Software Engineering Manager (SWE Manager) tasks, in which the model must select the best implementation proposal from several competing options that real engineering managers reviewed.[^1][^3]

The benchmark was introduced on 17 February 2025 with an OpenAI research blog post and an arXiv preprint titled *SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?*, later published in the Proceedings of the 42nd International Conference on Machine Learning (PMLR 267). The initial study found that no frontier model evaluated could earn more than about 40 percent of the available payout pool, with [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) earning the most ($403,325 out of $1,000,000, a 40.3 percent earn rate) and [OpenAI's](/wiki/openai) [GPT-4o](/wiki/gpt_4o) earning the least among the three headline models tested ($303,525).[^1][^2][^4]

## Key facts

| Field | Value |
|---|---|
| Released | 17 February 2025 (arXiv 2502.12115) |
| Authors | Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke (OpenAI); contributions from the Expensify engineering team |
| Total tasks | 1,488 (764 IC SWE; 724 SWE Manager) |
| Total payout pool | $1,000,000 USD ($414,775 IC; $585,225 Manager) |
| Source codebase | Expensify (mobile and web app, primarily [TypeScript](/wiki/typescript) / React Native) |
| Evaluation | End-to-end browser tests (Playwright) for IC; ground-truth manager decisions for Manager tasks |
| Public split | SWE-Lancer Diamond: 502 tasks worth $500,800 (237 IC / $236,300; 265 Manager / $264,500) |
| Top model at launch | Claude 3.5 Sonnet, $403,325 (40.3% of pool) |
| Repository | `github.com/openai/SWELancer-Benchmark` (archived 18 July 2025; now part of `github.com/openai/preparedness`) |
| Venue | [ICML](/wiki/icml) 2025 (poster; PMLR 267) |

## Why did OpenAI build SWE-Lancer?

By early 2025, coding benchmarks for [LLMs](/wiki/llm) had proliferated, but most measured tightly scoped tasks: completing single functions ([HumanEval](/wiki/humaneval)), resolving GitHub issues against unit tests ([SWE-bench](/wiki/swe_bench) and [SWE-bench Verified](/wiki/swe_bench_verified)), or solving competitive-programming puzzles. Critics from both industry and academia argued that these benchmarks did not capture the messier reality of paid engineering work, which involves ambiguous requirements, full-stack codebases, mobile and browser front-ends, integration with existing services, and the need to demonstrate that a change actually fixes a user-visible problem rather than merely passing a synthetic unit test.[^3][^5]

The framing of SWE-Lancer as a benchmark in which models attempt to "earn" money is a deliberate rhetorical and methodological choice. By tying each task's score to its real Upwork payout, the authors construct a difficulty gradient that emerges from the labor market itself rather than from researcher-assigned weights. A model that fixes ten $50 bug reports is credited with $500, while a model that delivers a $32,000 feature implementation earns the full payout for that single task. This payout-weighted scoring also produces a headline figure with intuitive economic interpretation: the dollar amount of paid freelance work a model could (in principle) replace.[^1][^2]

The authors are careful to caveat that the dollar figures are illustrative rather than predictive of any actual labor-market impact. The benchmark assumes a perfect oracle for grading (the original Upwork client's acceptance criteria, encoded as end-to-end tests), no negotiation or clarification, and no ability for the model to refuse low-quality tasks. Even so, payout-weighted scoring has become one of SWE-Lancer's most-cited innovations and has influenced follow-on benchmarks that incorporate cost, latency, or economic-value weighting. The paper frames the benchmark's ultimate purpose in economic terms: "By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development."[^1][^6]

## What is in the SWE-Lancer dataset?

The 1,488 tasks are divided into 764 Individual Contributor (IC) SWE tasks worth $414,775 and 724 SWE Manager tasks worth $585,225. Individual task prices, set by the original Upwork market rather than by researchers, span from $50 to $32,000, and in the open-sourced Diamond subset 35 percent of tasks are worth more than $1,000 while 34 percent are worth between $500 and $1,000. The average Diamond task took 26 days to resolve on GitHub and accumulated 47 comments; a typical IC SWE task in that subset requires modifying about 2 files and 69 lines of code, while a SWE Manager task requires choosing among 4 to 5 competing proposals.[^1]

### Source codebase

All 1,488 tasks were drawn from the public Expensify mobile and web application repository, available at `github.com/Expensify/App`. Expensify routinely contracts freelance engineers through Upwork to fix bugs and implement features in this codebase, and the company partnered with OpenAI to release a snapshot of historical tasks together with the original payouts, problem statements, and accepted implementations. OpenAI describes Expensify as a $300 million public company (NASDAQ: EXFY) with 12 million users, calling it "a reliable testbed for sourcing commercially valuable software engineering tasks." The Expensify application is a cross-platform expense-management product built primarily in [TypeScript](/wiki/typescript) using React Native for mobile and React for web, spanning roughly two million lines of code, which means most tasks involve full-stack JavaScript/TypeScript work with substantial UI components.[^1][^4][^7][^18]

Because every task originates from the same codebase, SWE-Lancer is sometimes described as a "single-repository" benchmark, in contrast with [SWE-bench](/wiki/swe_bench), whose tasks span twelve Python repositories. The single-codebase design enables a unified Docker image and consistent test harness, but it also introduces a generalization limit that the authors explicitly acknowledge. In the open-sourced Diamond subset, 74 percent of IC SWE tasks and 76 percent of SWE Manager tasks involve application logic, 17 percent of IC tasks and 18 percent of manager tasks involve UI/UX work, and 88 percent of IC tasks and 94 percent of manager tasks are classified as bug fixes.[^1][^7]

### IC SWE tasks

Individual Contributor tasks make up 764 of the 1,488 problems and account for $414,775 of the $1,000,000 pool. Each IC task is structured around an original Upwork ticket: the model is provided with a natural-language problem statement, a fixed snapshot of the Expensify repository at the time the task was posted, and reproduction steps. The model must produce a code patch. Success is determined by running a set of end-to-end browser tests, typically implemented with Playwright, that simulate a user clicking through the Expensify app to confirm the intended behaviour.[^1][^3][^4]

The authors emphasize that end-to-end tests are stricter than unit tests for this domain because Expensify is heavily UI-driven; a patch that compiles and passes function-level assertions can still produce a broken user experience. OpenAI paid a team of 100 professional software engineers to write and verify the end-to-end tests, and every IC task's test was triple-verified by experienced engineers before inclusion. IC tasks span client-side application logic, server-side logic, UI/UX, and system-wide quality work. In the paper's Diamond-set analysis, Claude 3.5 Sonnet's strongest IC sub-category was server-side logic (41.2 percent pass rate), followed by UI/UX (31.7 percent) and client-side application logic (23.9 percent), while the three system-wide quality and reliability tasks went unsolved by every model (0 percent). Pass rates by task type for the three headline models are shown below.[^1][^4][^8]

| IC SWE task type (Diamond) | GPT-4o | o1 (high) | Claude 3.5 Sonnet | Tasks |
|---|---|---|---|---|
| Application logic (client-side) | 8.0% | 15.9% | 23.9% | 176 |
| UI/UX | 2.4% | 17.1% | 31.7% | 41 |
| Server-side logic | 23.5% | 23.5% | 41.2% | 17 |
| System-wide quality and reliability | 0.0% | 0.0% | 0.0% | 3 |
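
The per-category rates above can be cross-checked against the aggregate Diamond IC SWE pass@1 figures, because an aggregate pass rate is the task-count-weighted mean of the category rates. The following Python sketch works through that arithmetic; the dictionary and function names are illustrative, and since the inputs are rounded percentages the result is only approximate.

```python
# Task counts and pass@1 rates (percent) copied from the table above.
TASK_COUNTS = {"app_logic": 176, "ui_ux": 41, "server_side": 17, "system_wide": 3}

PASS_RATES = {
    "GPT-4o": {"app_logic": 8.0, "ui_ux": 2.4, "server_side": 23.5, "system_wide": 0.0},
    "o1 (high)": {"app_logic": 15.9, "ui_ux": 17.1, "server_side": 23.5, "system_wide": 0.0},
    "Claude 3.5 Sonnet": {"app_logic": 23.9, "ui_ux": 31.7, "server_side": 41.2, "system_wide": 0.0},
}


def aggregate_pass_rate(rates, counts):
    """Task-count-weighted mean of per-category pass rates, in percent."""
    total_tasks = sum(counts.values())
    # Convert each category's rate back into an (approximate) number of solved tasks.
    solved = sum(rates[category] / 100 * counts[category] for category in counts)
    return 100 * solved / total_tasks


for model, rates in PASS_RATES.items():
    print(f"{model}: {aggregate_pass_rate(rates, TASK_COUNTS):.1f}%")
```

The four categories sum to 237 tasks, the size of the Diamond IC SWE split, and the weighted means come out at 8.0, 16.5, and 26.2 percent, matching the headline Diamond IC SWE pass@1 figures for the three models.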

### SWE Manager tasks

Software Engineering Manager tasks make up the remaining 724 problems and account for $585,225 of the payout pool, a majority of the benchmark's dollar value. Rather than writing code, the model is presented with the original problem statement and several competing implementation proposals (typically pull requests submitted by different freelancers). The model must select the proposal that the human engineering manager who hired for the task originally chose as best. A validation study found that experienced engineers, given the same materials, reached 99 percent agreement with the original manager's choice, suggesting the ground truth is well-defined.[^1][^4]

This task type captures a distinct skill from code writing: the ability to read, compare, and judge other people's code. The authors note that frontier models in early 2025 performed substantially better on Manager tasks than on IC tasks. On the Diamond subset, Manager pass@1 was more than double the corresponding IC SWE rate for GPT-4o (37.0 versus 8.0 percent) and o1 (41.5 versus 16.5 percent), and about 1.7 times higher for Claude 3.5 Sonnet (44.9 versus 26.2 percent). They interpret this as evidence that current LLMs are stronger as code reviewers and technical advisors than as autonomous implementers.[^1][^4][^8]

## How does SWE-Lancer evaluate models?

### Evaluation environment

OpenAI released the benchmark with a unified Docker image that bundles the Expensify codebase at the correct historical snapshot, all required dependencies, the Playwright browser test harness, and a thin runner that scores model patches. For the launch study, agents ran in a Docker container with the repository preconfigured and no internet access, preventing them from retrieving external information. The Docker design is intended to ensure reproducibility: rather than requiring researchers to provision a complex development environment, any team can pull a pre-built image and evaluate a model in a hermetic container. Each per-task image occupies approximately 14 GB and takes 10-20 minutes to build, which the authors flag as one of the benchmark's main practical costs.[^1][^7][^9]
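
As a rough illustration of what "no internet access" means in container terms, the sketch below assembles a `docker run` command with networking disabled. The `--rm` and `--network none` options are standard Docker flags, but the image tag and the `TASK_ID` environment variable are hypothetical placeholders rather than the repository's actual interface, and the command is only printed, not executed.

```python
import shlex


def build_offline_run_command(image: str, task_id: str) -> list[str]:
    """Build a `docker run` argv that starts a container with networking disabled."""
    return [
        "docker", "run",
        "--rm",                          # remove the container once it exits
        "--network", "none",             # only a loopback interface, no external access
        "--env", f"TASK_ID={task_id}",   # hypothetical variable name, for illustration only
        image,
    ]


# Hypothetical image tag and task identifier.
cmd = build_offline_run_command("swelancer-task:example", "example-task")
print(shlex.join(cmd))
```

Building the argument list separately from running it keeps the isolation settings easy to inspect and test before any container is started.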

For IC tasks, scoring is binary at the task level: a patch either passes all of its end-to-end tests (earning the task's full payout) or fails (earning $0). Partial credit is not awarded. For Manager tasks, scoring is similarly binary: the model either selects the same proposal the original manager selected (full payout) or selects a different proposal ($0). Aggregate scores are reported as both percentages (fraction of tasks resolved) and dollar amounts (sum of payouts earned), with an "earn rate" defined as payout received divided by total possible payout.[^1][^4]
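
The grading rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual runner: the class names, field names, and sample results are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ICResult:
    payout: float              # the task's Upwork price in USD
    tests_passed: list[bool]   # outcome of each end-to-end test


@dataclass
class ManagerResult:
    payout: float
    chosen_proposal: int       # proposal index the model selected
    accepted_proposal: int     # proposal index the original manager selected


def is_resolved(result) -> bool:
    """All-or-nothing grading: no partial credit for either task type."""
    if isinstance(result, ICResult):
        # all([]) is True in Python, so also require at least one test.
        return bool(result.tests_passed) and all(result.tests_passed)
    return result.chosen_proposal == result.accepted_proposal


def summarize(results) -> dict:
    """Return the fraction of tasks resolved, dollars earned, and earn rate."""
    possible = sum(r.payout for r in results)
    earned = sum(r.payout for r in results if is_resolved(r))
    resolved = sum(1 for r in results if is_resolved(r))
    return {
        "resolved_rate": resolved / len(results),
        "earned": earned,
        "earn_rate": earned / possible,
    }


# Hypothetical results, not taken from the benchmark.
results = [
    ICResult(payout=50, tests_passed=[True, True]),
    ICResult(payout=1000, tests_passed=[True, False]),  # one failing test earns $0
    ManagerResult(payout=500, chosen_proposal=2, accepted_proposal=2),
    ManagerResult(payout=250, chosen_proposal=0, accepted_proposal=3),
]
summary = summarize(results)
print(summary)
```

In this toy run half of the tasks are resolved, yet the earn rate is only about 31 percent ($550 of $1,800), because the single failed IC task carries more than half of the available money. The two aggregate metrics can therefore diverge noticeably.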

### Pass@k and tool use

The original paper reports primarily pass@1 numbers using a default prompting setup, but it also includes ablation studies that explore the effect of multiple attempts, reasoning effort, and tool use. In one widely cited result, o1's pass rate on Diamond IC SWE tasks rose from 16.5 percent at pass@1 to 48.5 percent at pass@7, nearly tripling once the model was allowed seven attempts. Increasing o1's reasoning effort from low to high lifted its Diamond IC SWE pass@1 from 9.3 percent to 16.5 percent, and giving the model access to the "user tool" (a Playwright-driven browser it can operate to reproduce and inspect bugs) raised the same figure from 13.1 percent to 16.5 percent. The authors note that the user tool takes 90 to 120 seconds per invocation, and that weaker models such as [GPT-4o](/wiki/gpt_4o) are "prone to abandoning the tool altogether."[^1][^4]
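
For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled attempts solves a task. The sketch below uses the unbiased estimator popularized by the HumanEval paper, which computes pass@k from n attempts of which c succeeded. Whether the SWE-Lancer authors computed their pass@k figures in exactly this way is an assumption, and the sample counts are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n attempts, c of which succeeded."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    # 1 minus the probability that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-task estimates over a set of tasks."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical number of successful attempts (out of 7) on four tasks.
correct_counts = [0, 1, 7, 3]
print(mean_pass_at_k(correct_counts, n=7, k=1))
print(mean_pass_at_k(correct_counts, n=7, k=7))
```

With k equal to 1 the estimate reduces to the plain success fraction c/n. With k equal to n, a task counts as solved if any attempt succeeded, which is why pass@k can rise sharply for tasks a model solves only occasionally.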

### Contamination control

Because the underlying Expensify task data was public on Upwork and GitHub before SWE-Lancer was released, the authors worried about training-data contamination. They mitigate this in two ways. First, they release only a subset of the benchmark, SWE-Lancer Diamond, publicly, holding the remainder as a private evaluation set. Second, they analyze the timing of each task relative to model knowledge cutoffs (the public Diamond tasks originate from GitHub issues posted between 2023 and 2024) and report that contamination effects appear limited for tasks predating the cutoff. Nonetheless, contamination remains a known limitation, particularly for models capable of web browsing during evaluation.[^1][^7]
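
One simple way to run a cutoff-based check of this kind is to split tasks by whether their issue date falls before or after a model's knowledge cutoff and compare pass rates on each side. The sketch below illustrates the idea only: the cutoff date, the task dates, and the outcomes are all hypothetical, and it is not the analysis procedure from the paper.

```python
from datetime import date


def pass_rate_by_cutoff(results, cutoff):
    """Split (issue_date, passed) pairs at a knowledge cutoff and return each side's pass rate."""
    groups = {"before_cutoff": [], "after_cutoff": []}
    for issue_date, passed in results:
        key = "before_cutoff" if issue_date <= cutoff else "after_cutoff"
        groups[key].append(passed)
    # A side with no tasks has no defined pass rate, so report None for it.
    return {
        key: (sum(outcomes) / len(outcomes) if outcomes else None)
        for key, outcomes in groups.items()
    }


# Hypothetical cutoff and task outcomes, for illustration only.
cutoff = date(2023, 10, 1)
results = [
    (date(2023, 3, 1), True),
    (date(2023, 8, 15), False),
    (date(2024, 1, 10), True),
    (date(2024, 5, 2), False),
    (date(2024, 6, 20), False),
]
rates = pass_rate_by_cutoff(results, cutoff)
print(rates)
```

A markedly higher pass rate on pre-cutoff tasks would be consistent with memorization, although differences in task difficulty between the two periods could produce the same pattern, so such a split is suggestive rather than conclusive.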

## How did the first models score on SWE-Lancer?

The original paper reports headline numbers for three frontier models: two from [OpenAI](/wiki/openai), [GPT-4o](/wiki/gpt_4o) (the gpt-4o-2024-08-06 snapshot) and [o1](/wiki/o1) (the reasoning model released in December 2024, run at high reasoning effort), and one from [Anthropic](/wiki/anthropic), [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet) (claude-3-5-sonnet-20241022, the October 2024 update). The pass@1 results on the Diamond subset and total earnings on the full set are summarized below.[^1][^4][^8]

| Model | IC SWE (Diamond) pass@1 | SWE Manager (Diamond) pass@1 | Full-set earnings (of $1,000,000) |
|---|---|---|---|
| Claude 3.5 Sonnet | 26.2% | 44.9% | $403,325 |
| OpenAI o1 (high) | 16.5% | 41.5% | $380,350 |
| GPT-4o | 8.0% | 37.0% | $303,525 |

Across the full 1,488-task set, Claude 3.5 Sonnet resolved 33.7 percent of tasks (21.1 percent of IC tasks and 47.0 percent of manager tasks), o1 resolved 32.9 percent, and GPT-4o resolved 23.3 percent. On the public Diamond subset specifically, Claude 3.5 Sonnet earned $208,050 out of the $500,800 available, a 41.5 percent earn rate. These numbers attracted attention because Anthropic's Claude 3.5 Sonnet, a non-reasoning model, outperformed OpenAI's [reasoning model](/wiki/reasoning_models) [o1](/wiki/o1) on this benchmark released by [OpenAI itself](/wiki/openai), a reversal of the typical ordering on contemporaneous benchmarks such as [SWE-bench Verified](/wiki/swe_bench_verified). Several commentators interpreted the result as a sign that real-world freelance work rewards a different mix of skills (broad codebase navigation, UI-aware patching) than the structured competition-style tasks on which reasoning models typically excel.[^1][^4][^8][^10]

Qualitatively, the paper reports that agents "excel at localizing, but fail to root cause": models pinpoint the relevant file and functions quickly through keyword search across the whole repository, often faster than a human would, but frequently miss how an issue spans multiple components and therefore deliver partial or incorrect fixes.[^1]

The authors emphasize that even the best model leaves the majority of the payout pool unearned: roughly 60 percent of the available money goes uncollected by Claude 3.5 Sonnet, and the model fails outright on most IC tasks. The paper's stated conclusion is that "frontier models are still unable to solve the majority of tasks" in this benchmark.[^1]

## What are the latest SWE-Lancer leaderboard scores?

Since the February 2025 release, multiple frontier-model launches have been accompanied by SWE-Lancer numbers. Public leaderboards aggregating results, most prominently the LLM Stats SWE-Lancer pages, track a growing list of models on both the full benchmark and the IC-Diamond subset.[^11][^12]

On the full SWE-Lancer leaderboard, OpenAI models for which scores have been published include [GPT-4o](/wiki/gpt_4o) (32.6 percent), GPT-4.5 (37.3 percent), o3-mini (18.0 percent), and as of late 2025 GPT-5.1 Codex with a score of approximately 66.3 percent, the highest publicly reported number on the full benchmark.[^11] On the IC-Diamond subset, the leaderboard tracks [GPT-4o](/wiki/gpt_4o) (12.4 percent), GPT-4.5 (17.4 percent), o3-mini (7.4 percent), GPT-5.2 (74.6 percent, posted December 2025), GPT-5.3 Codex (81.4 percent), and GPT-5 with a reported score of 100 percent on the IC-Diamond split.[^12]

Anthropic, Google DeepMind, and other vendors have not consistently published SWE-Lancer numbers on the public leaderboards for models released after the benchmark's launch, but third-party evaluations and individual blog posts have reported scores for [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) (released February 2025), [Claude Opus 4](/wiki/claude_opus_4), [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5), and [Gemini 2.5 Pro](/wiki/gemini_2_5_pro) in the 40-70 percent range on various subsets. Because methodologies vary (full benchmark vs. Diamond, pass@1 vs. pass@k, with or without tool use), cross-model comparisons outside of head-to-head papers should be interpreted with caution.[^4][^11][^12]

The trajectory of reported scores has driven a recurring debate in the community: SWE-Lancer Diamond, in particular, appeared to be approaching saturation by late 2025, with at least one model scoring 100 percent. Critics argue that the small size of the Diamond split (relative to the full benchmark) makes such ceiling effects unsurprising and that the more representative full-benchmark numbers, which top out near two-thirds, remain a meaningful target.[^12][^13]

## SWE-Lancer Diamond and variants

**SWE-Lancer Diamond** is the publicly released evaluation split of the benchmark. It consists of 502 tasks worth $500,800: 237 IC SWE tasks worth $236,300 and 265 SWE Manager tasks worth $264,500. For these tasks the authors have made the full Docker images, problem statements, and reference tests openly available. The remaining tasks form a private holdout used for OpenAI's internal evaluations and for guarding against contamination. The "Diamond" label echoes its use for high-confidence, fully verified subsets in other benchmarks, such as GPQA Diamond. The IC-Diamond leaderboard tracked on third-party services such as LLM Stats refers specifically to the IC SWE tasks within Diamond.[^1][^12]

A 2025 research follow-up named **SWE-Lancer-Loc** restructures a portion of the benchmark for "issue localization," the task of identifying which files in the Expensify codebase must be edited to resolve a given problem, without requiring the model to produce a working patch. It derives 216 localization issues from SWE-Lancer Diamond and evaluates models with file-level and function-level Hit@k and Recall@k metrics over the roughly two-million-line Expensify codebase. Localization is treated as a lower-bound proxy for engineering competence: a model that cannot find the right file is unlikely to fix the bug correctly.[^14][^18]
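
The two metric families can be sketched as follows, using their conventional information-retrieval definitions: Hit@k asks whether any of the top k predicted locations is correct, and Recall@k asks what fraction of the correct locations appear in the top k. Whether SWE-Lancer-Loc defines them in exactly this way is an assumption, and the file paths are hypothetical.

```python
def hit_at_k(ranked, gold, k):
    """Return 1.0 if any of the top-k ranked predictions is a gold location, else 0.0."""
    return 1.0 if any(path in gold for path in ranked[:k]) else 0.0


def recall_at_k(ranked, gold, k):
    """Return the fraction of gold locations that appear among the top-k predictions."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(set(ranked[:k]) & set(gold)) / len(gold)


# Hypothetical file-level example: the model ranks three files, two files were actually edited.
ranked = ["src/a.ts", "src/b.ts", "src/c.ts"]
gold = {"src/b.ts", "src/d.ts"}

print(hit_at_k(ranked, gold, 1), hit_at_k(ranked, gold, 2))
print(recall_at_k(ranked, gold, 2), recall_at_k(ranked, gold, 3))
```

The same functions apply at function level if each entry identifies a function rather than a file. Hit@k rewards finding any one relevant location, while Recall@k penalizes a model that misses some of the places a multi-file fix needs to touch.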

In July 2025 the original `openai/SWELancer-Benchmark` repository was archived and its contents merged into a broader `openai/preparedness` repository, where active maintenance of evaluation tooling continued. As of mid-2025, the maintainers reported that 198 of the tasks had been adjusted to run successfully offline, a non-trivial subset given the network-dependent nature of full-stack browser tests.[^7][^9]

## How does SWE-Lancer compare to SWE-bench and HumanEval?

SWE-Lancer occupies a distinctive niche in the landscape of [LLM](/wiki/llm) coding benchmarks. The following table sketches the main contrasts.[^1][^3][^15]

| Benchmark | Source | Test type | Difficulty signal | Languages |
|---|---|---|---|---|
| [HumanEval](/wiki/humaneval) (2021) | Hand-written prompts | Function-level unit tests | Manual curation | Python |
| [SWE-bench](/wiki/swe_bench) (2023) | 12 Python GitHub repos | Repo-level unit tests | GitHub issue labels | Python |
| [SWE-bench Verified](/wiki/swe_bench_verified) (2024) | SWE-bench, human-cleaned | Repo-level unit tests | Human filtering | Python |
| [LiveCodeBench](/wiki/livecodebench) | Competitive programming | Hidden test cases | Contest difficulty | Python (mainly) |
| [Aider Polyglot](/wiki/aider_polyglot) | Curated multi-language tasks | Unit tests | Manual curation | 6+ languages |
| [Terminal-Bench](/wiki/terminal_bench) | Hand-crafted CLI tasks | Behavioral checks | Manual curation | Shell-centric |
| **SWE-Lancer** | Expensify Upwork tickets | End-to-end browser tests | Real Upwork payout | TypeScript (full-stack) |

The most direct competitor in scope is [SWE-bench Verified](/wiki/swe_bench_verified), which similarly evaluates patches against existing tests in a real-world repository. SWE-Lancer is differentiated by its single-codebase TypeScript scope, its end-to-end (rather than unit) testing harness, its inclusion of a manager-style multiple-choice task type, and its payout-weighted scoring. Critics of SWE-Lancer often point to SWE-bench Verified as a more diverse benchmark; defenders point to SWE-Lancer's heavier emphasis on UI-aware, multi-file work that is closer to typical product-engineering practice than backend Python library patches.[^3][^4][^15]

## What are the strengths and criticisms of SWE-Lancer?

Reception of SWE-Lancer in academic and industry circles has been broadly positive, with the arXiv paper subsequently being accepted as a poster at the International Conference on Machine Learning ([ICML](/wiki/icml)) 2025. Commentators have highlighted three perceived strengths: end-to-end testing, payout-weighted scoring, and the inclusion of manager-style tasks alongside code-writing tasks.[^1][^16]

Several specific criticisms have nonetheless emerged. The most-cited concerns include:

* **Single-codebase generalization.** Because all 1,488 tasks come from the Expensify mobile and web codebase, results may reflect idiosyncrasies of that project (React Native, particular state-management patterns, Expensify-specific abstractions) rather than general engineering ability. The authors acknowledge this and recommend caution when generalizing beyond freelance contexts. They also concede that infrastructure engineering tasks, such as debugging Kubernetes clusters, pod failures, or networking problems, are underrepresented because they are rare in Expensify's posted task set.[^1][^4]
* **Test-based scoring.** End-to-end tests improve realism over unit tests but still cannot judge code quality, maintainability, or architectural soundness. A model that passes the tests with a hacky patch receives the same score as one that delivers a clean implementation.[^4][^17]
* **No clarification or negotiation.** Real freelancers can ask the client for clarification, propose alternative approaches, or refuse out-of-scope work. SWE-Lancer models receive a fixed problem statement and must work from it, which the paper concedes may understate model capability.[^1]
* **Text-only modality.** The original benchmark provides only textual problem descriptions and code; many real Upwork tickets include screenshots, screen-recordings, or video demonstrations of bugs, which the benchmark does not surface to the model.[^1]
* **"Pay" framing as economic measurement.** Tying scores to dollar payouts produces vivid headlines ("AI earns $403,000 of $1 million") but is at best a stylized economic signal. The dollar amounts reflect Upwork market clearing prices for these specific tasks at specific times, not the marginal social or economic value of automation.[^4][^17]
* **Contamination risk.** Although the authors hold out a private subset and analyze knowledge cutoffs, the public Diamond split remains vulnerable to direct training-data inclusion, particularly for models that browse the web during evaluation.[^7]

On developer community discussion sites such as Hacker News, threads about SWE-Lancer have raised additional concerns about whether the benchmark's IC tasks systematically underweight backend infrastructure work (common in full-time engineering but rarer on Upwork), and whether the benchmark's heavy UI emphasis biases scores toward models trained with substantial web-development data.[^13][^17] Despite these criticisms, the benchmark has been incorporated into routine evaluation suites at OpenAI, several model vendors, and third-party evaluators. Its underlying methodology, payout-weighted scoring with end-to-end testing on real product code, has been cited as a template for next-generation [software-engineering benchmarks](/wiki/benchmark).[^4][^11][^16]

## References

[^1]: Miserendino, Samuel; Wang, Michele; Patwardhan, Tejal; Heidecke, Johannes. *SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?* arXiv:2502.12115. 17 February 2025. https://arxiv.org/abs/2502.12115

[^2]: OpenAI. "Introducing the SWE-Lancer benchmark." OpenAI Research blog, 17 February 2025. https://openai.com/index/swe-lancer/

[^3]: MarkTechPost. "OpenAI Introduces SWE-Lancer: A Benchmark for Evaluating Model Performance on Real-World Freelance Software Engineering Work." 17 February 2025. https://www.marktechpost.com/2025/02/17/openai-introduces-swe-lancer-a-benchmark-for-evaluating-model-performance-on-real-world-freelance-software-engineering-work/

[^4]: Dickson, Ben. "Claude 3.5 Sonnet outperforms GPT-4o and o1 in software engineering, OpenAI study shows." TechTalks, 24 February 2025. https://bdtechtalks.com/2025/02/24/claude-3-5-sonnet-outperforms-gpt-4o-and-o1-in-software-engineering-openai-study-shows/

[^5]: GIGAZINE. "OpenAI releases AI benchmark 'SWE-Lancer' to measure whether a machine can perform tasks that would cost a freelance engineer $1 million." 19 February 2025. https://gigazine.net/gsc_news/en/20250219-openai-swe-lancer/

[^6]: Analytics Vidhya. "OpenAI's SWE-Lancer Benchmark: Testing AI on $1 Million Worth of Freelance Coding Tasks." February 2025. https://www.analyticsvidhya.com/blog/2025/02/openais-swe-lancer-benchmark/

[^7]: OpenAI. SWELancer-Benchmark GitHub repository (archived 18 July 2025). https://github.com/openai/SWELancer-Benchmark

[^8]: OpenAI Developer Community. "OpenAI releases new coding benchmark SWE-Lancer showing 3.5 Sonnet beating o1." February 2025. https://community.openai.com/t/openai-releases-new-coding-benchmark-swe-lancer-showing-3-5-sonnet-beating-o1/1123976

[^9]: DeepWiki. "openai/SWELancer-Benchmark." 2025. https://deepwiki.com/openai/SWELancer-Benchmark

[^10]: Expensify Engineering. "Expensify Powers OpenAI's SWE-Lancer: Real-World AI Benchmarks." Expensify blog, 2025. https://use.expensify.com/blog/expensify-powers-openai-swe-lancer-project

[^11]: LLM Stats. "SWE-Lancer Leaderboard." Accessed 2026. https://llm-stats.com/benchmarks/swe-lancer

[^12]: LLM Stats. "SWE-Lancer (IC-Diamond subset) Benchmark Leaderboard." Accessed 2026. https://llm-stats.com/benchmarks/swe-lancer-(ic-diamond-subset)

[^13]: Hacker News. "SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork." Discussion thread, February 2025. https://news.ycombinator.com/item?id=43086347

[^14]: EmergentMind. "SWE-Lancer-Loc: Real-World Issue Localization." 2025. https://www.emergentmind.com/topics/swe-lancer-loc

[^15]: SWE-bench. "SWE-bench Leaderboards." Accessed 2026. https://www.swebench.com/

[^16]: ICML 2025. "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?" Poster. https://icml.cc/virtual/2025/poster/43573

[^17]: DevOps.com. "AI Coding: New Research Shows Even the Best Models Struggle With Real-World Software Engineering." 2025. https://devops.com/ai-coding-new-research-shows-even-the-best-models-struggle-with-real-world-software-engineering-2/

[^18]: "Extracting Conceptual Knowledge to Locate Software Issues" (introducing the SWE-Lancer-Loc benchmark). arXiv:2509.21427. 2025. https://arxiv.org/abs/2509.21427