SWE-Lancer

Updated Jul 7, 2026

SWE-Lancer is a benchmark released by OpenAI in February 2025 that evaluates how well frontier large language models perform real-world freelance software-engineering work. It consists of 1,488 paid tasks posted on the freelance marketplace Upwork, with a combined real-world payout of one million United States dollars. All tasks come from the open-source mobile and web codebase of the expense-management company Expensify. Code-writing tasks are graded with end-to-end tests rather than unit tests, and each model's score is reported as the cumulative dollar value of the tasks it completes successfully. In OpenAI's launch evaluation, the best-performing model, Anthropic's Claude 3.5 Sonnet, earned $403,325 (about 40 percent of the available pool), leading the authors to conclude that frontier models are "still unable to solve the majority of tasks."^[1]^[2]

Unlike earlier coding benchmarks such as HumanEval or SWE-bench, which evaluate isolated programming problems or repository-level patches against test suites, SWE-Lancer is explicitly designed to mirror the economic structure of paid software work. Tasks range from $50 bug fixes to $32,000 feature implementations, and they are split into two categories: Individual Contributor (IC) tasks, in which a model must write code that passes end-to-end browser tests, and Software Engineering Manager (SWE Manager) tasks, in which the model must select the best implementation proposal from several competing options that real engineering managers reviewed.^[1]^[3]

The benchmark was introduced on 17 February 2025 with an OpenAI research blog post and an arXiv preprint titled SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?, later published in the Proceedings of the 42nd International Conference on Machine Learning (PMLR 267). The initial study found that no frontier model evaluated could earn more than about 40 percent of the available payout pool, with Claude 3.5 Sonnet earning the most ($403,325 out of $1,000,000, a 40.3 percent earn rate) and OpenAI's GPT-4o earning the least among the three headline models tested ($303,525).^[1]^[2]^[4]

Key facts

Field	Value
Released	17 February 2025 (arXiv 2502.12115)
Authors	Samuel Miserendino, Michele Wang, Tejal Patwardhan, Johannes Heidecke (OpenAI); contributions from the Expensify engineering team
Total tasks	1,488 (764 IC SWE; 724 SWE Manager)
Total payout pool	$1,000,000 USD ($414,775 IC; $585,225 Manager)
Source codebase	Expensify (mobile and web app, primarily TypeScript / React Native)
Evaluation	End-to-end browser tests (Playwright) for IC; ground-truth manager decisions for Manager tasks
Public split	SWE-Lancer Diamond: 502 tasks worth $500,800 (237 IC / $236,300; 265 Manager / $264,500)
Top model at launch	Claude 3.5 Sonnet, $403,325 (40.3% of pool)
Repository	`github.com/openai/SWELancer-Benchmark` (archived 18 July 2025; now part of `github.com/openai/preparedness`)
Venue	ICML 2025 (poster; PMLR 267)

Why did OpenAI build SWE-Lancer?

By early 2025, coding benchmarks for LLMs had proliferated, but most measured tightly scoped tasks: completing single functions (HumanEval), resolving GitHub issues against unit tests (SWE-bench and SWE-bench Verified), or solving competitive-programming puzzles. Critics from both industry and academia argued that these benchmarks did not capture the messier reality of paid engineering work, which involves ambiguous requirements, full-stack codebases, mobile and browser front-ends, integration with existing services, and the need to demonstrate that a change actually fixes a user-visible problem rather than merely passing a synthetic unit test.^[3]^[5]

The framing of SWE-Lancer as a benchmark in which models attempt to "earn" money is a deliberate rhetorical and methodological choice. By tying each task's score to its real Upwork payout, the authors construct a difficulty gradient that emerges from the labor market itself rather than from researcher-assigned weights. A model that fixes ten $50 bug reports is credited with $500, while a model that delivers a $32,000 feature implementation earns the full payout for that single task. This payout-weighted scoring also produces a headline figure with intuitive economic interpretation: the dollar amount of paid freelance work a model could (in principle) replace.^[1]^[2]

The authors are careful to caveat that the dollar figures are illustrative rather than predictive of any actual labor-market impact. The benchmark assumes a perfect oracle for grading (the original Upwork client's acceptance criteria, encoded as end-to-end tests), no negotiation or clarification, and no ability for the model to refuse low-quality tasks. Even so, payout-weighted scoring has become one of SWE-Lancer's most-cited innovations and has influenced follow-on benchmarks that incorporate cost, latency, or economic-value weighting. The paper frames the benchmark's ultimate purpose in economic terms: "By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development."^[1]^[6]

What is in the SWE-Lancer dataset?

The 1,488 tasks are divided into 764 Individual Contributor (IC) SWE tasks worth $414,775 and 724 SWE Manager tasks worth $585,225. Individual task prices, set by the original Upwork market rather than by researchers, span from $50 to $32,000, and in the open-sourced Diamond subset 35 percent of tasks are worth more than $1,000 while 34 percent are worth between $500 and $1,000. The average Diamond task took 26 days to resolve on GitHub and accumulated 47 comments; a typical IC SWE task in that subset requires modifying about 2 files and 69 lines of code, while a SWE Manager task requires choosing among 4 to 5 competing proposals.^[1]

Source codebase

All 1,488 tasks were drawn from the public Expensify mobile and web application repository, available at github.com/Expensify/App. Expensify routinely contracts freelance engineers through Upwork to fix bugs and implement features in this codebase, and the company partnered with OpenAI to release a snapshot of historical tasks together with the original payouts, problem statements, and accepted implementations. OpenAI describes Expensify as a $300 million public company (NASDAQ: EXFY) with 12 million users, calling it "a reliable testbed for sourcing commercially valuable software engineering tasks." The Expensify application is a cross-platform expense-management product built primarily in TypeScript using React Native for mobile and React for web, spanning roughly two million lines of code, which means most tasks involve full-stack JavaScript/TypeScript work with substantial UI components.^[1]^[4]^[7]^[18]

Because every task originates from the same codebase, SWE-Lancer is sometimes described as a "single-repository" benchmark, in contrast with SWE-bench, whose tasks span twelve Python repositories. The single-codebase design enables a unified Docker image and consistent test harness, but it also introduces a generalization limit that the authors explicitly acknowledge. In the open-sourced Diamond subset, 74 percent of IC SWE tasks and 76 percent of SWE Manager tasks involve application logic, 17 percent of IC tasks and 18 percent of manager tasks involve UI/UX work, and 88 percent of IC tasks and 94 percent of manager tasks are classified as bug fixes.^[1]^[7]

IC SWE tasks

Individual Contributor tasks make up 764 of the 1,488 problems and account for $414,775 of the $1,000,000 pool. Each IC task is structured around an original Upwork ticket: the model is provided with a natural-language problem statement, a fixed snapshot of the Expensify repository at the time the task was posted, and reproduction steps. The model must produce a code patch. Success is determined by running a set of end-to-end browser tests, typically implemented with Playwright, that simulate a user clicking through the Expensify app to confirm the intended behaviour.^[1]^[3]^[4]

The authors emphasize that end-to-end tests are stricter than unit tests for this domain because Expensify is heavily UI-driven; a patch that compiles and passes function-level assertions can still produce a broken user experience. OpenAI paid a team of 100 professional software engineers to write and verify the end-to-end tests, and every IC task's test was triple-verified by experienced engineers before inclusion. IC tasks span client-side application logic, server-side logic, UI/UX, and system-wide quality work. In the paper's Diamond-set analysis, Claude 3.5 Sonnet's strongest IC sub-category was server-side logic (41.2 percent pass rate), followed by UI/UX (31.7 percent) and client-side application logic (23.9 percent), while the three system-wide quality and reliability tasks went unsolved by every model (0 percent). Pass rates by task type for the three headline models are shown below.^[1]^[4]^[8]

IC SWE task type (Diamond)	GPT-4o	o1 (high)	Claude 3.5 Sonnet	Tasks
Application logic (client-side)	8.0%	15.9%	23.9%	176
UI/UX	2.4%	17.1%	31.7%	41
Server-side logic	23.5%	23.5%	41.2%	17
System-wide quality and reliability	0.0%	0.0%	0.0%	3
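As a consistency check, the per-category pass rates in the table can be combined into an overall Diamond IC SWE pass rate by weighting each category by its task count. The sketch below is an illustrative calculation rather than code from the paper; the numbers are copied from the table, and the function name `overall_pass_rate` is an assumption introduced here.

```python
# (task type, number of Diamond IC tasks, pass rate in percent per model),
# copied from the table above.
ROWS = [
    ("Application logic (client-side)", 176,
     {"GPT-4o": 8.0, "o1 (high)": 15.9, "Claude 3.5 Sonnet": 23.9}),
    ("UI/UX", 41,
     {"GPT-4o": 2.4, "o1 (high)": 17.1, "Claude 3.5 Sonnet": 31.7}),
    ("Server-side logic", 17,
     {"GPT-4o": 23.5, "o1 (high)": 23.5, "Claude 3.5 Sonnet": 41.2}),
    ("System-wide quality and reliability", 3,
     {"GPT-4o": 0.0, "o1 (high)": 0.0, "Claude 3.5 Sonnet": 0.0}),
]


def overall_pass_rate(model: str) -> float:
    """Task-count-weighted average of the per-category pass rates (percent)."""
    total_tasks = sum(count for _, count, _ in ROWS)
    weighted = sum(count * rates[model] for _, count, rates in ROWS)
    return weighted / total_tasks


for model in ("GPT-4o", "o1 (high)", "Claude 3.5 Sonnet"):
    print(f"{model}: {overall_pass_rate(model):.1f}%")
```

The category counts sum to 237, the number of IC SWE tasks in Diamond, and the weighted averages round to the headline Diamond IC SWE pass@1 figures reported for the three models (8.0, 16.5, and 26.2 percent).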

SWE Manager tasks

Software Engineering Manager tasks make up the remaining 724 problems and account for $585,225 of the payout pool, a majority of the benchmark's dollar value. Rather than writing code, the model is presented with the original problem statement and several competing implementation proposals (typically pull requests submitted by different freelancers). The model must select the proposal that the human engineering manager who hired for the task originally chose as best. A validation study found that experienced engineers, given the same materials, reached 99 percent agreement with the original manager's choice, suggesting the ground truth is well-defined.^[1]^[4]

This task type captures a distinct skill from code writing: the ability to read, compare, and judge other people's code. The authors note that frontier models in early 2025 performed substantially better on Manager tasks than on IC tasks, with pass@1 on Manager tasks often more than double the corresponding IC SWE pass rate, which they interpret as evidence that current LLMs are stronger as code reviewers and technical advisors than as autonomous implementers.^[1]^[4]^[8]

How does SWE-Lancer evaluate models?

Evaluation environment

OpenAI released the benchmark with a unified Docker image that bundles the Expensify codebase at the correct historical snapshot, all required dependencies, the Playwright browser test harness, and a thin runner that scores model patches. For the launch study, agents ran in a Docker container with the repository preconfigured and no internet access, preventing them from retrieving external information. The Docker design is intended to ensure reproducibility: rather than requiring researchers to provision a complex development environment, any team can pull a pre-built image and evaluate a model in a hermetic container. Each per-task image occupies approximately 14 GB and takes 10-20 minutes to build, which the authors flag as one of the benchmark's main practical costs.^[1]^[7]^[9]

For IC tasks, scoring is binary at the task level: a patch either passes all of its end-to-end tests (earning the task's full payout) or fails (earning $0). Partial credit is not awarded. For Manager tasks, scoring is similarly binary: the model either selects the same proposal the original manager selected (full payout) or selects a different proposal ($0). Aggregate scores are reported as both percentages (fraction of tasks resolved) and dollar amounts (sum of payouts earned), with an "earn rate" defined as payout received divided by total possible payout.^[1]^[4]
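The scoring rules above reduce to a small amount of arithmetic. The following is a minimal sketch, not the benchmark's actual runner: the `TaskResult` type, the `summarize` function, and the three example tasks are hypothetical, chosen only to show how all-or-nothing payouts yield a pass rate and an earn rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str       # hypothetical identifier, not a real SWE-Lancer task ID
    payout_usd: float  # price of the task on Upwork
    passed: bool       # IC: all end-to-end tests pass; Manager: correct proposal chosen


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Binary, payout-weighted scoring: a task earns its full payout or $0."""
    if not results:
        raise ValueError("at least one task result is required")
    total = sum(r.payout_usd for r in results)
    earned = sum(r.payout_usd for r in results if r.passed)
    solved = sum(1 for r in results if r.passed)
    return {
        "pass_rate": solved / len(results),  # fraction of tasks resolved
        "earned_usd": earned,                # sum of payouts earned
        "earn_rate": earned / total,         # payout received / total possible payout
    }


# Illustrative data only.
results = [
    TaskResult("ic-bugfix", 50.0, True),
    TaskResult("ic-feature", 32_000.0, False),
    TaskResult("manager-choice", 1_000.0, True),
]
summary = summarize(results)
print(summary)
```

In this toy example the model resolves two of three tasks yet earns only about 3 percent of the available money, which illustrates why the pass rate and the earn rate can diverge sharply when expensive tasks are missed.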

Pass@k and tool use

The original paper reports primarily pass@1 numbers using a default prompting setup, but it also includes ablation studies that explore the effect of multiple attempts, reasoning effort, and tool use. In one widely cited result, o1's pass rate on Diamond IC SWE tasks rose from 16.5 percent at pass@1 to 48.5 percent at pass@7, nearly tripling when allowed additional attempts. Increasing o1's reasoning effort from low to high lifted its Diamond IC SWE pass@1 from 9.3 percent to 16.5 percent, and giving the model access to the "user tool" (a Playwright-driven browser it can operate to reproduce and inspect bugs) raised the same figure from 13.1 percent to 16.5 percent. The authors note that the user tool takes 90 to 120 seconds per invocation, and that weaker models such as GPT-4o are "prone to abandoning the tool altogether."^[1]^[4]
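For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator popularized by the HumanEval paper, which estimates the probability that at least one of k sampled attempts succeeds. Whether SWE-Lancer computes pass@k in exactly this way is an assumption; this article does not describe the paper's estimator, and the function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: attempts sampled for the task
    c: attempts among them that passed
    k: attempt budget being estimated (1 <= k <= n)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    # 1 minus the probability that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative only: 7 attempts sampled for a task, 1 of which passed.
print(pass_at_k(n=7, c=1, k=1))
print(pass_at_k(n=7, c=1, k=7))
```

A benchmark-level pass@k is then the mean of this quantity over tasks, which is why the figure can rise steeply with k when many tasks are solved only occasionally.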

Contamination control

Because the underlying Expensify task data was public on Upwork and GitHub before SWE-Lancer was released, the authors worried about training-data contamination. They mitigate this in two ways. First, they release only a subset of the benchmark, SWE-Lancer Diamond, publicly, holding the remainder as a private evaluation set. Second, they analyze the timing of each task relative to model knowledge cutoffs (the public Diamond tasks originate from GitHub issues posted between 2023 and 2024) and report that contamination effects appear limited for tasks predating the cutoff. Nonetheless, contamination remains a known limitation, particularly for models capable of web browsing during evaluation.^[1]^[7]
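The cutoff-based analysis described above amounts to comparing pass rates for tasks posted before and after a model's knowledge cutoff. The sketch below illustrates that comparison under stated assumptions: the function name, the cutoff date, and every task record are invented for illustration and are not data from the paper.

```python
from datetime import date
from typing import Optional


def pass_rate_by_cutoff(
    results: list[tuple[date, bool]], cutoff: date
) -> dict[str, Optional[float]]:
    """Split (posted_date, passed) records at the cutoff and compute pass rates."""
    buckets: dict[str, list[bool]] = {"before_cutoff": [], "after_cutoff": []}
    for posted, passed in results:
        key = "before_cutoff" if posted <= cutoff else "after_cutoff"
        buckets[key].append(passed)
    # None marks an empty bucket, for which no pass rate can be computed.
    return {
        key: (sum(flags) / len(flags) if flags else None)
        for key, flags in buckets.items()
    }


# Hypothetical records: (date the GitHub issue was posted, did the model pass?)
records = [
    (date(2023, 6, 1), True),
    (date(2023, 9, 15), False),
    (date(2024, 2, 10), True),
    (date(2024, 5, 20), False),
]
rates = pass_rate_by_cutoff(records, cutoff=date(2023, 10, 1))
print(rates)
```

A markedly higher pass rate on pre-cutoff tasks than on post-cutoff tasks would be one signal of memorization, although differences in task difficulty between the two periods can confound the comparison.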

How did the first models score on SWE-Lancer?

The original paper reports headline numbers for three frontier models: OpenAI's GPT-4o (the gpt-4o-2024-08-06 snapshot) and o1 (the reasoning model released in December 2024, run at high reasoning effort), and Anthropic's Claude 3.5 Sonnet (claude-3-5-sonnet-20241022, the October 2024 update). The pass@1 results on the Diamond subset and total earnings on the full set are summarized below.^[1]^[4]^[8]

Model	IC SWE (Diamond) pass@1	SWE Manager (Diamond) pass@1	Full-set earnings (of $1,000,000)
Claude 3.5 Sonnet	26.2%	44.9%	$403,325
OpenAI o1 (high)	16.5%	41.5%	$380,350
GPT-4o	8.0%	37.0%	$303,525

Across the full 1,488-task set, Claude 3.5 Sonnet resolved 33.7 percent of tasks (21.1 percent of IC tasks and 47.0 percent of manager tasks), o1 resolved 32.9 percent, and GPT-4o resolved 23.3 percent. On the public Diamond subset specifically, Claude 3.5 Sonnet earned $208,050 out of the $500,800 available, a 41.5 percent earn rate. These numbers attracted attention because Anthropic's Claude 3.5 Sonnet, a non-reasoning model, outperformed OpenAI's reasoning model o1 on this benchmark released by OpenAI itself, a reversal of the typical ordering on contemporaneous benchmarks such as SWE-bench Verified. Several commentators interpreted the result as a sign that real-world freelance work rewards a different mix of skills (broad codebase navigation, UI-aware patching) than the structured competition-style tasks on which reasoning models typically excel.^[1]^[4]^[8]^[10]
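The percentages quoted above follow directly from the task counts and dollar figures given in this article. The short calculation below reproduces them as a sanity check; the helper name `percent` is an assumption, and all inputs are taken from the text and tables.

```python
def percent(numerator: float, denominator: float) -> float:
    """Ratio expressed as a percentage, rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Full-set earn rates: earnings divided by the $1,000,000 pool.
FULL_POOL_USD = 1_000_000
earnings_usd = {
    "Claude 3.5 Sonnet": 403_325,
    "OpenAI o1 (high)": 380_350,
    "GPT-4o": 303_525,
}
earn_rates = {m: percent(v, FULL_POOL_USD) for m, v in earnings_usd.items()}

# Claude 3.5 Sonnet's Diamond earn rate: $208,050 of $500,800.
diamond_earn_rate = percent(208_050, 500_800)

# Claude 3.5 Sonnet's overall full-set pass rate, combining the IC and
# Manager pass rates weighted by the number of tasks in each category.
IC_TASKS, MANAGER_TASKS = 764, 724
claude_overall = percent(
    0.211 * IC_TASKS + 0.470 * MANAGER_TASKS, IC_TASKS + MANAGER_TASKS
)

print(earn_rates, diamond_earn_rate, claude_overall)
```

Note that the earn rate (dollars earned) and the pass rate (tasks resolved) differ for the same model because tasks carry different payouts.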

Qualitatively, the paper reports that agents "excel at localizing, but fail to root cause": models pinpoint the relevant file and functions quickly through keyword search across the whole repository, often faster than a human would, but frequently miss how an issue spans multiple components and therefore deliver partial or incorrect fixes.^[1]

The authors emphasize that even the best model leaves the majority of the payout pool unearned: roughly 60 percent of the available money goes uncollected by Claude 3.5 Sonnet, and the model fails outright on most IC tasks. The paper's stated conclusion is that "frontier models are still unable to solve the majority of tasks" in this benchmark.^[1]

What are the latest SWE-Lancer leaderboard scores?

Since the February 2025 release, multiple frontier-model launches have been accompanied by SWE-Lancer numbers. Public leaderboards aggregating results, most prominently the LLM Stats SWE-Lancer pages, track a growing list of models on both the full benchmark and the IC-Diamond subset.^[11]^[12]

On the full SWE-Lancer leaderboard, published scores for OpenAI models include GPT-4o (32.6 percent), GPT-4.5 (37.3 percent), o3-mini (18.0 percent), and, as of late 2025, GPT-5.1 Codex (approximately 66.3 percent), the highest publicly reported number on the full benchmark.^[11] On the IC-Diamond subset, the leaderboard lists GPT-4o (12.4 percent), GPT-4.5 (17.4 percent), o3-mini (7.4 percent), GPT-5.2 (74.6 percent, posted December 2025), GPT-5.3 Codex (81.4 percent), and GPT-5, with a reported score of 100 percent.^[12]

Anthropic, Google DeepMind, and other vendors have not consistently published SWE-Lancer numbers for their post-2025 models on the public leaderboards, but third-party evaluations and individual blog posts have reported scores for Claude 3.7 Sonnet (released February 2025), Claude Opus 4, Claude Sonnet 4.5, and Gemini 2.5 Pro in the 40-70 percent range on various subsets. Because methodologies vary (full benchmark vs. Diamond, pass@1 vs. pass@k, with or without tool use), cross-model comparison outside of head-to-head papers should be interpreted with caution.^[4]^[11]^[12]

The trajectory of reported scores has driven a recurring debate in the community: SWE-Lancer Diamond, in particular, appeared to be approaching saturation by late 2025, with at least one model reported at 100 percent. Critics argue that the small size of the Diamond split relative to the full benchmark makes such ceiling effects unsurprising, and that the more representative full-benchmark numbers, which top out near two-thirds, remain a meaningful target.^[12]^[13]

SWE-Lancer Diamond and variants

SWE-Lancer Diamond is the publicly released evaluation split of the benchmark. It consists of 502 tasks worth $500,800: 237 IC SWE tasks worth $236,300 and 265 SWE Manager tasks worth $264,500. For these tasks the authors have made the full Docker images, problem statements, and reference tests openly available. The remaining tasks form a private holdout used for OpenAI's internal evaluations and for guarding against contamination. The name echoes the "Diamond" label used elsewhere for a high-confidence, fully verified public subset, as in GPQA Diamond. The IC-Diamond leaderboard tracked on third-party services such as LLM Stats refers specifically to the IC SWE tasks within Diamond.^[1]^[12]

A 2025 research follow-up named SWE-Lancer-Loc restructures a portion of the benchmark for "issue localization," the task of identifying which files in the Expensify codebase must be edited to resolve a given problem, without requiring the model to produce a working patch. It derives 216 localization issues from SWE-Lancer Diamond and evaluates models with file-level and function-level Hit@k and Recall@k metrics over the roughly two-million-line Expensify codebase. Localization is treated as a lower-bound proxy for engineering competence: a model that cannot find the right file is unlikely to fix the bug correctly.^[14]^[18]
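File-level Hit@k and Recall@k can be illustrated with a short sketch. The definitions used here are the common information-retrieval ones (Hit@k is 1 if any ground-truth file appears in the top k predictions; Recall@k is the fraction of ground-truth files found in the top k); whether SWE-Lancer-Loc uses exactly these definitions is an assumption, and the file paths below are hypothetical.

```python
def hit_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """1.0 if at least one ground-truth file is among the top-k predictions."""
    return 1.0 if set(ranked[:k]) & gold else 0.0


def recall_at_k(ranked: list[str], gold: set[str], k: int) -> float:
    """Fraction of ground-truth files that appear among the top-k predictions."""
    if not gold:
        raise ValueError("gold set must not be empty")
    return len(set(ranked[:k]) & gold) / len(gold)


# Hypothetical model ranking and ground-truth edited files for one issue.
ranked_files = [
    "src/pages/ReportScreen.tsx",
    "src/libs/actions/Report.ts",
    "src/components/Composer.tsx",
]
gold_files = {"src/libs/actions/Report.ts", "src/libs/ReportUtils.ts"}

for k in (1, 2, 3):
    print(k, hit_at_k(ranked_files, gold_files, k), recall_at_k(ranked_files, gold_files, k))
```

Benchmark-level scores are then averages of these per-issue values, and the same functions apply at function level if the ranked items are function identifiers instead of file paths.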

In July 2025 the original openai/SWELancer-Benchmark repository was archived and its contents merged into a broader openai/preparedness repository, where active maintenance of evaluation tooling continued. As of mid-2025, the maintainers reported that 198 of the tasks had been adjusted to run successfully offline, a non-trivial subset given the network-dependent nature of full-stack browser tests.^[7]^[9]

How does SWE-Lancer compare to SWE-bench and HumanEval?

SWE-Lancer occupies a distinctive niche in the landscape of LLM coding benchmarks. The following table sketches the main contrasts.^[1]^[3]^[15]

Benchmark	Source	Test type	Difficulty signal	Languages
HumanEval (2021)	Hand-written prompts	Function-level unit tests	Manual curation	Python
SWE-bench (2023)	12 Python GitHub repos	Repo-level unit tests	GitHub issue labels	Python
SWE-bench Verified (2024)	SWE-bench, human-cleaned	Repo-level unit tests	Human filtering	Python
LiveCodeBench	Competitive programming	Hidden test cases	Contest difficulty	Python (mainly)
Aider Polyglot	Curated multi-language tasks	Unit tests	Manual curation	6+ languages
Terminal-Bench	Hand-crafted CLI tasks	Behavioral checks	Manual curation	Shell-centric
SWE-Lancer	Expensify Upwork tickets	End-to-end browser tests	Real Upwork payout	TypeScript (full-stack)

The most direct competitor in scope is SWE-bench Verified, which similarly evaluates patches against existing tests in a real-world repository. SWE-Lancer is differentiated by its single-codebase TypeScript scope, its end-to-end (rather than unit) testing harness, its inclusion of a manager-style multiple-choice task type, and its payout-weighted scoring. Critics of SWE-Lancer often point to SWE-bench Verified as a more diverse benchmark; defenders point to SWE-Lancer's heavier emphasis on UI-aware, multi-file work that is closer to typical product-engineering practice than backend Python library patches.^[3]^[4]^[15]

What are the strengths and criticisms of SWE-Lancer?

Reception of SWE-Lancer in academic and industry circles has been broadly positive, with the arXiv paper subsequently being accepted as a poster at the International Conference on Machine Learning (ICML) 2025. Commentators have highlighted three perceived strengths: end-to-end testing, payout-weighted scoring, and the inclusion of manager-style tasks alongside code-writing tasks.^[1]^[16]

Several specific criticisms have nonetheless emerged. The most-cited concerns include:

Single-codebase generalization. Because all 1,488 tasks come from the Expensify mobile and web codebase, results may reflect idiosyncrasies of that project (React Native, particular state-management patterns, Expensify-specific abstractions) rather than general engineering ability. The authors acknowledge this and recommend caution when generalizing beyond freelance contexts. They also concede that infrastructure engineering tasks, such as debugging Kubernetes clusters, pod failures, or networking problems, are underrepresented because they are rare in Expensify's posted task set.^[1]^[4]
Test-based scoring. End-to-end tests improve realism over unit tests but still cannot judge code quality, maintainability, or architectural soundness. A model that passes the tests with a hacky patch receives the same score as one that delivers a clean implementation.^[4]^[17]
No clarification or negotiation. Real freelancers can ask the client for clarification, propose alternative approaches, or refuse out-of-scope work. SWE-Lancer models receive a fixed problem statement and must work from it, which the paper concedes may understate model capability.^[1]
Text-only modality. The original benchmark provides only textual problem descriptions and code; many real Upwork tickets include screenshots, screen-recordings, or video demonstrations of bugs, which the benchmark does not surface to the model.^[1]
"Pay" framing as economic measurement. Tying scores to dollar payouts produces vivid headlines ("AI earns $403,000 of $1 million") but is at best a stylized economic signal. The dollar amounts reflect Upwork market clearing prices for these specific tasks at specific times, not the marginal social or economic value of automation.^[4]^[17]
Contamination risk. Although the authors hold out a private subset and analyze knowledge cutoffs, the public Diamond split remains vulnerable to direct training-data inclusion, particularly for models that browse the web during evaluation.^[7]

On developer community discussion sites such as Hacker News, threads about SWE-Lancer have raised additional concerns about whether the benchmark's IC tasks systematically underweight backend infrastructure work (common in full-time engineering but rarer on Upwork), and whether the benchmark's heavy UI emphasis biases scores toward models trained with substantial web-development data.^[13]^[17] Despite these criticisms, the benchmark has been incorporated into routine evaluation suites at OpenAI, several model vendors, and third-party evaluators, and its underlying methodology, payout-weighted scoring with end-to-end testing on real product code, has been cited as a template for next-generation software-engineering benchmarks.^[4]^[11]^[16]

References

1. Miserendino, Samuel; Wang, Michele; Patwardhan, Tejal; Heidecke, Johannes. *SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?* arXiv:2502.12115. 17 February 2025. https://arxiv.org/abs/2502.12115
2. OpenAI. "Introducing the SWE-Lancer benchmark." OpenAI Research blog, 17 February 2025. https://openai.com/index/swe-lancer/
3. MarkTechPost. "OpenAI Introduces SWE-Lancer: A Benchmark for Evaluating Model Performance on Real-World Freelance Software Engineering Work." 17 February 2025. https://www.marktechpost.com/2025/02/17/openai-introduces-swe-lancer-a-benchmark-for-evaluating-model-performance-on-real-world-freelance-software-engineering-work/
4. Dickson, Ben. "Claude 3.5 Sonnet outperforms GPT-4o and o1 in software engineering, OpenAI study shows." TechTalks, 24 February 2025. https://bdtechtalks.com/2025/02/24/claude-3-5-sonnet-outperforms-gpt-4o-and-o1-in-software-engineering-openai-study-shows/
5. GIGAZINE. "OpenAI releases AI benchmark 'SWE-Lancer' to measure whether a machine can perform tasks that would cost a freelance engineer $1 million." 19 February 2025. https://gigazine.net/gsc_news/en/20250219-openai-swe-lancer/
6. Analytics Vidhya. "OpenAI's SWE-Lancer Benchmark: Testing AI on $1 Million Worth of Freelance Coding Tasks." February 2025. https://www.analyticsvidhya.com/blog/2025/02/openais-swe-lancer-benchmark/
7. OpenAI. SWELancer-Benchmark GitHub repository (archived 18 July 2025). https://github.com/openai/SWELancer-Benchmark
8. OpenAI Developer Community. "OpenAI releases new coding benchmark SWE-Lancer showing 3.5 Sonnet beating o1." February 2025. https://community.openai.com/t/openai-releases-new-coding-benchmark-swe-lancer-showing-3-5-sonnet-beating-o1/1123976
9. DeepWiki. "openai/SWELancer-Benchmark." 2025. https://deepwiki.com/openai/SWELancer-Benchmark
10. Expensify Engineering. "Expensify Powers OpenAI's SWE-Lancer: Real-World AI Benchmarks." Expensify blog, 2025. https://use.expensify.com/blog/expensify-powers-openai-swe-lancer-project
11. LLM Stats. "SWE-Lancer Leaderboard." Accessed 2026. https://llm-stats.com/benchmarks/swe-lancer
12. LLM Stats. "SWE-Lancer (IC-Diamond subset) Benchmark Leaderboard." Accessed 2026. https://llm-stats.com/benchmarks/swe-lancer-(ic-diamond-subset)
13. Hacker News. "SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork." Discussion thread, February 2025. https://news.ycombinator.com/item?id=43086347
14. EmergentMind. "SWE-Lancer-Loc: Real-World Issue Localization." 2025. https://www.emergentmind.com/topics/swe-lancer-loc
15. SWE-bench. "SWE-bench Leaderboards." Accessed 2026. https://www.swebench.com/
16. ICML 2025. "SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?" Poster. https://icml.cc/virtual/2025/poster/43573
17. DevOps.com. "AI Coding: New Research Shows Even the Best Models Struggle With Real-World Software Engineering." 2025. https://devops.com/ai-coding-new-research-shows-even-the-best-models-struggle-with-real-world-software-engineering-2/
18. "Extracting Conceptual Knowledge to Locate Software Issues" (introducing the SWE-Lancer-Loc benchmark). arXiv:2509.21427. 2025. https://arxiv.org/abs/2509.21427
