SWE-rebench
Last reviewed
Jun 8, 2026
Sources
5 citations
Review status
Source-backed
Revision
v1 · 1,753 words
SWE-rebench is a continuously refreshed, contamination-resistant AI benchmark and public leaderboard for evaluating AI agents on real-world software engineering tasks. It is built and maintained by Nebius, and it extends the methodology of SWE-bench, the de facto standard for measuring whether language model agents can resolve GitHub issues by editing real codebases [1][2]. The project consists of two linked artifacts: a large public dataset of more than 21,000 interactive Python tasks suitable for training and evaluation, and a private, continuously updated evaluation benchmark whose results are published on a public leaderboard at swe-rebench.com [1][3].
The defining idea of SWE-rebench is freshness. An automated pipeline constantly mines newly merged pull requests and their linked issues from thousands of GitHub repositories, packages each as an executable SWE-bench-style task, and evaluates models only on tasks whose creation dates fall after the models' training cutoffs. Because the evaluation tasks did not exist when a model was trained, scores cannot be inflated by the model having memorized the solution. This directly targets the two failure modes that erode older static benchmarks: training-data contamination and score saturation [1][2].
The accompanying paper, "SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents," was first posted to arXiv on May 26, 2025 (arXiv:2505.20411), with a revised version on November 4, 2025, and was accepted to the NeurIPS 2025 Datasets and Benchmarks track. All nine authors (Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, and Boris Yangel) are affiliated with Nebius [1][2].
SWE-bench, released in 2023, has been the primary yardstick for autonomous coding agents. Its tasks are real GitHub issues paired with the human pull request that fixed them, and an agent is judged solely on whether its code change makes the repository's hidden test suite pass. The benchmark has been enormously influential, but its very success creates problems [1].
The first problem is contamination. Because SWE-bench and its tasks have been public since 2023, any model trained afterward may have been exposed to the issues, the repositories, and even the reference fixes during pretraining or post-training. When a model has effectively seen the answer, a high score reflects memorization rather than generalization, which confounds any measurement of capability [1]. The same concern motivated live, rolling benchmarks for code generation such as LiveCodeBench, which scores a model only on problems published after that model's release date [4].
The second problem is saturation. As frontier models improve, scores on a fixed test set climb toward a ceiling, compressing the differences between strong models and reducing the benchmark's power to discriminate. Curated subsets such as SWE-bench Verified, a 500-task slice that OpenAI and the SWE-bench team filtered for solvable, well-specified tasks, improved reliability but remain static and therefore still subject to gradual contamination and saturation over time [1].
The SWE-rebench authors add two further methodological complaints. Scaffolding variability, meaning the wide range of prompts, multi-agent frameworks, and retry strategies wrapped around a model, makes published SWE-bench numbers hard to compare across labs and risks implicit overfitting to the benchmark. And the common practice of reporting only the best of several runs, or of self-reporting without independent verification, can overstate a model's real resolved rate [1].
SWE-rebench addresses these issues with a fully automated, four-stage collection pipeline followed by a standardized evaluation protocol [1].
Stage one is preliminary task collection. The pipeline ingests the GitHub Archive event stream (roughly 21 TB of public events) and clones repositories with full commit history to avoid GitHub API rate limits. From an initial pull of about 450,000 pull requests linked to issues created before May 1, 2025, drawn from over 30,000 permissively licensed, predominantly Python repositories, it applies filters: the issue must be resolved, the PR must be merged into the main branch and linked to a single issue, the PR must add or modify tests, and the change must touch between 1 and 15 files. After filtering, roughly 153,400 candidate tasks remain [1].
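The stage-one filters listed above can be read as a single predicate over a candidate pull request. The following is a minimal sketch, not the pipeline's actual code: the `CandidatePR` record and its field names are hypothetical, and only the thresholds (exactly one linked issue, 1 to 15 files changed) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePR:
    """Hypothetical record for one pull request; field names are illustrative."""
    issue_resolved: bool      # the linked issue is resolved
    merged_into_main: bool    # the PR was merged into the main branch
    linked_issue_count: int   # number of issues the PR is linked to
    touches_tests: bool       # the PR adds or modifies tests
    files_changed: int        # number of files the change touches


def passes_stage_one(pr: CandidatePR) -> bool:
    """Return True when the PR satisfies every stage-one filter."""
    return (
        pr.issue_resolved
        and pr.merged_into_main
        and pr.linked_issue_count == 1
        and pr.touches_tests
        and 1 <= pr.files_changed <= 15
    )


if __name__ == "__main__":
    example = CandidatePR(True, True, 1, True, 3)
    print(passes_stage_one(example))
```

Expressing the filters as one conjunction makes it clear that a candidate is dropped as soon as any single condition fails.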
Stage two is automated installation-instruction configuration. Unlike SWE-bench, which relied on manual environment setup, SWE-rebench uses an LLM (Qwen2.5-72B-Instruct) to read each repository's README, Dockerfile, and setup files and hypothesize a structured installation recipe, refining it from error logs when builds or tests fail. This automation is what makes the pipeline scale to thousands of repositories [1].
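The propose-then-refine loop described above might look roughly like the sketch below. Everything here is an assumption made for illustration: `propose` stands in for the LLM call, `try_build` stands in for the container build and test run, and the attempt limit is invented, since the description above does not state one.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures, not part of any real SWE-rebench API:
#   propose(error_log) -> recipe text (error_log is None on the first attempt)
#   try_build(recipe)  -> (succeeded, log text)
Propose = Callable[[Optional[str]], str]
TryBuild = Callable[[str], Tuple[bool, str]]


def configure_install(
    propose: Propose,
    try_build: TryBuild,
    max_attempts: int = 3,  # assumed limit, chosen only for this sketch
) -> Optional[str]:
    """Ask for a recipe, try it, and feed the error log back on failure."""
    error_log: Optional[str] = None
    for _ in range(max_attempts):
        recipe = propose(error_log)
        succeeded, log = try_build(recipe)
        if succeeded:
            return recipe
        error_log = log  # the next proposal can react to this failure
    return None  # no working recipe found within the attempt budget
```

Passing the two steps in as callables keeps the loop independent of any particular model or build tool.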
Stage three is execution-based verification. For each task the environment is built in a container (using Buildah, with TractoAI for distributed execution), the test patch is applied, and the test suite is run before and after the solution patch. A task is kept only if at least one test fails before the fix and passes after it, exactly reproducing the behavior recorded in the original pull request. Exact dependency versions are recorded for reproducibility [1].
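The keep-or-discard rule in this stage reduces to a comparison of two test reports. The sketch below assumes each report is a mapping from test name to a status string of `"passed"` or `"failed"`; that representation and the function names are hypothetical, chosen only to illustrate the rule.

```python
def fail_to_pass(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Tests that failed before the solution patch and pass after it."""
    return sorted(
        name
        for name, status in before.items()
        if status == "failed" and after.get(name) == "passed"
    )


def keep_task(before: dict[str, str], after: dict[str, str]) -> bool:
    """A task is kept only if at least one test goes from failing to passing."""
    return len(fail_to_pass(before, after)) > 0


if __name__ == "__main__":
    before = {"test_parse": "failed", "test_io": "passed"}
    after = {"test_parse": "passed", "test_io": "passed"}
    print(fail_to_pass(before, after), keep_task(before, after))
```

A task whose tests all pass before the fix, or whose failing tests still fail afterward, yields an empty list and is discarded.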
Stage four is automated quality assessment. A fine-tuned instruction-following model, trained on human annotations from SWE-bench Verified, labels each surviving task for issue clarity, task complexity, and test-patch correctness, letting users filter for clear, solvable, non-trivial tasks [1].
The full pipeline yields the SWE-rebench dataset of 21,336 verifiable task instances drawn from 3,468 distinct repositories, released under CC BY 4.0 on Hugging Face on June 10, 2025 [1][3].
Decontamination is enforced at evaluation time. Because the pipeline records the creation date of every issue and pull request, the leaderboard can tell which tasks were created after a given model's release date and treat only those as clean for that model. Evaluations that include tasks predating a model's release are explicitly flagged as potentially contaminated, which keeps the comparison transparent [1].
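The date comparison behind this flagging can be sketched in a few lines. The function name and the task-to-date mapping are hypothetical, and treating a task created on the release date itself as potentially contaminated is an assumption made here for safety rather than a documented rule.

```python
from datetime import date


def split_by_release(
    task_dates: dict[str, date], model_release: date
) -> tuple[list[str], list[str]]:
    """Split task ids into (fresh, flagged) relative to a model's release date."""
    fresh = sorted(t for t, d in task_dates.items() if d > model_release)
    # Assumption: tasks created on or before the release date are flagged.
    flagged = sorted(t for t, d in task_dates.items() if d <= model_release)
    return fresh, flagged


if __name__ == "__main__":
    tasks = {"task-a": date(2025, 1, 10), "task-b": date(2025, 4, 2)}
    print(split_by_release(tasks, model_release=date(2025, 3, 1)))
```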
SWE-rebench maintains a private benchmark and publishes results on a public leaderboard. In the NeurIPS paper, the benchmark comprised 294 executable tasks selected from 169 diverse repositories, refreshed over time so that newer evaluation slices stay ahead of model training cutoffs [1].
To keep comparisons fair, Nebius runs every model itself rather than accepting self-reported scores. Each model is evaluated under a single fixed, minimal ReAct-style agentic scaffold with identical prompts, a standardized 128K-token context window, and the default generation settings recommended by the model's developer. Function calling is deliberately not used, so all models interact with the environment through the same text-command interface. To account for the stochasticity of agent trajectories, each model is run five times across the full benchmark, and the leaderboard reports a Resolved Rate (pass@1 averaged over the five runs) alongside the standard error of the mean (SEM) and pass@5 [1].
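The three reported quantities can be computed from a runs-by-tasks grid of pass or fail outcomes. The sketch below is one plausible reading, not the leaderboard's confirmed formula: it assumes the Resolved Rate is the mean of the per-run resolved fractions, the SEM is the sample standard deviation of those fractions divided by the square root of the number of runs, and pass@5 is the share of tasks solved in at least one of the five runs.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(runs: list[list[bool]]) -> dict[str, float]:
    """Summarize repeated runs; runs[i][j] is True if run i resolved task j."""
    n_runs = len(runs)
    n_tasks = len(runs[0])
    # Fraction of tasks resolved in each individual run.
    per_run = [sum(run) / n_tasks for run in runs]
    resolved_rate = mean(per_run)
    # Standard error of the mean over runs (needs at least two runs).
    sem = stdev(per_run) / sqrt(n_runs) if n_runs > 1 else 0.0
    # A task counts toward pass@k if any of the k runs resolved it.
    solved_any = sum(any(run[j] for run in runs) for j in range(n_tasks))
    return {
        "resolved_rate": resolved_rate,
        "sem": sem,
        "pass_at_k": solved_any / n_tasks,
    }


if __name__ == "__main__":
    runs = [
        [True, False, False, False],
        [True, True, False, False],
        [True, False, False, False],
        [True, False, True, False],
        [True, False, False, False],
    ]
    print(summarize(runs))
```

A large gap between `pass_at_k` and `resolved_rate` indicates a model that can solve many tasks but does so inconsistently from run to run.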
As of June 8, 2026, the live leaderboard and third-party trackers show a tightly clustered set of frontier systems at the top, including recent Anthropic Claude, Z.ai GLM, DeepSeek, OpenAI GPT-Codex, and Alibaba Qwen models, with leading Resolved Rates in the low-to-mid 60 percent range [5]. Because the benchmark refreshes continuously and Nebius re-evaluates models on each new slice, exact rankings change month to month; the swe-rebench.com leaderboard is the authoritative current source.
The paper uses SWE-rebench's decontaminated, time-stamped tasks to probe for contamination and overfitting effects, comparing model performance across two temporal slices (tasks created in January 2025 versus March to April 2025) and against SWE-bench Verified [1].
The headline finding is that some scores appear inflated by contamination. Among the models tested, GPT-4.1 was the only one whose performance noticeably declined on the fresher March to April slice relative to January (a Resolved Rate of 31.1 percent in January versus 26.7 percent in March to April), a pattern consistent with sensitivity to whether tasks predate the training cutoff. More broadly, several open models that scored well on SWE-bench Verified dropped substantially on the fresh SWE-rebench tasks, which the authors read as a sign that absolute Verified numbers may be inflated by data leakage [1].
A few representative comparisons from the paper illustrate the gap between the older static benchmark and the fresh tasks.
| Model | SWE-bench Verified (Resolved %) | SWE-rebench Mar to Apr 2025 (Resolved %) |
|---|---|---|
| DeepSeek-V3-0324 | 39.7 | 21.3 |
| DeepSeek-V3-1226 | 35.2 | 21.9 |
| Llama-4-Maverick-Instruct | 16.0 | 12.2 |
| Qwen2.5-72B-Instruct | 11.3 | 9.3 |
| Llama-4-Scout-Instruct | 8.8 | 5.3 |
| Qwen2.5-Coder-32B-Instruct | 4.9 | 3.2 |
Source: SWE-rebench paper, Table 2 [1].
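The size of each gap is easier to compare when expressed both in percentage points and relative to the SWE-bench Verified score. The short sketch below does that arithmetic on the table's figures; the `drop` helper is illustrative, and the calculation by itself cannot separate contamination from differences in task difficulty between the two benchmarks.

```python
# (SWE-bench Verified %, SWE-rebench Mar to Apr 2025 %) copied from the table.
TABLE_2 = {
    "DeepSeek-V3-0324": (39.7, 21.3),
    "DeepSeek-V3-1226": (35.2, 21.9),
    "Llama-4-Maverick-Instruct": (16.0, 12.2),
    "Qwen2.5-72B-Instruct": (11.3, 9.3),
    "Llama-4-Scout-Instruct": (8.8, 5.3),
    "Qwen2.5-Coder-32B-Instruct": (4.9, 3.2),
}


def drop(verified: float, rebench: float) -> tuple[float, float]:
    """Return (percentage-point drop, drop relative to the Verified score in %)."""
    absolute = verified - rebench
    relative = 100.0 * absolute / verified
    return round(absolute, 1), round(relative, 1)


if __name__ == "__main__":
    for model, (verified, rebench) in TABLE_2.items():
        points, relative = drop(verified, rebench)
        print(f"{model}: {points} points lower ({relative}% relative)")
```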
Other observations from the result analysis: DeepSeek-V3 was the strongest open model across both SWE-rebench slices and SWE-bench Verified and was the most robust to changes in task distribution; Llama-4-Maverick showed a high pass@5 relative to its modest Resolved Rate, indicating high potential but inconsistent execution; Qwen2.5-Coder-32B-Instruct underperformed expectations because of instruction-following failures, hallucinated environment responses, and formatting errors; and Qwen3 variants performed similarly with and without explicit "think" mode enabled [1].
SWE-rebench is best understood as the live, decontaminated successor to SWE-bench for AI code generation agents, occupying the same conceptual niche that LiveCodeBench occupies for standalone competitive-programming problems [1][4]. It inherits SWE-bench's core task format (resolve a real GitHub issue so that hidden tests pass) and its execution-based grading, while replacing the fixed, aging task set with a rolling stream of fresh tasks and replacing per-lab self-reporting with centralized, standardized re-evaluation [1].
Relative to SWE-bench Verified, SWE-rebench keeps the goal of higher-quality, solvable tasks but pursues it through automated, LLM-assisted quality labeling at scale rather than one-time human curation, and it continuously refreshes the pool so that decontamination is maintained rather than eroding over time. The project sits within a broader 2025 to 2026 wave of contamination-aware coding benchmarks, alongside efforts such as SWE-bench Pro and SWE-PolyBench, that respond to the saturation and leakage of the original SWE-bench [1][3]. Beyond evaluation, the 21,336-task SWE-rebench dataset is explicitly intended to support reinforcement learning of software engineering agents, since each task ships with a reproducible execution environment and an automatic pass or fail verifier [1][3].