GSO
Last reviewed
May 10, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 3,677 words
| GSO | |
|---|---|
| Overview | |
| Full name | GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents |
| Abbreviation | GSO |
| Description | A benchmark evaluating language models and SWE agents on real-world software performance optimization tasks drawn from open-source commit histories |
| Initial release | May 29, 2025 (arXiv v1) |
| Latest paper version | v3, October 24, 2025 |
| Venue | NeurIPS 2025 (poster, San Diego, December 3, 2025) |
| Authors | Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica |
| Organization | UC Berkeley Sky Computing Lab |
| Technical Details | |
| Type | Software optimization, repository-level code editing, multi-language programming |
| Modality | Code, text |
| Task format | Optimization patches against a precise performance-test specification |
| Number of tasks | 102 |
| Codebases | 10 |
| Domains | 8 (scientific computing, data analysis, image processing, web and network, ML inference, ML datasets, data validation, LLM tokenization) |
| Languages | 5 (Python, C, C++, Cython, Rust); 58.8% of tasks require non-Python edits |
| Primary metric | Opt@K (Opt₀.₉₅@K, with a 95% human-speedup threshold) |
| Secondary metrics | Opt_p@1 (varying speedup threshold p), harmonic-mean speedup ratio, edit size |
| Performance | |
| Human baseline | Expert open-source maintainer commits (108 to 110 line median, 3.9 files average) |
| Best Opt@1 in original paper | Claude-4.0 at 4.9% |
| Worst Opt@1 in original paper | GPT-4o at 0.0% |
| Opt@10 ceiling reported | About 15% with 8 to 10 rollouts |
| Saturated | No |
| Resources | |
| Website | gso-bench.github.io |
| Paper | arXiv:2505.23671 |
| GitHub | gso-bench/gso |
| Dataset | Hugging Face |
| License | MIT |
GSO (short for Global Software Optimization in the project page subtitle, and Challenging Software Optimization Tasks for Evaluating SWE-Agents in the paper title) is an artificial intelligence benchmark for software performance optimization. It was released on arXiv on May 29, 2025, by researchers at the UC Berkeley Sky Computing Lab and was accepted as a poster at NeurIPS 2025[1][2]. Instead of asking models to fix bugs or generate small programs from scratch, GSO gives an LLM SWE agent a real codebase, a build script, and a performance test, then asks it to produce a patch that runs as fast as the optimization an expert human committed to that repository. The headline result from the paper is that leading SWE agents achieve less than 5% success at the 95% human-speedup threshold, with Claude-4.0 topping the original evaluation at 4.9% and GPT-4o scoring 0.0%[1][3].
Most public coding benchmarks treat code as a correctness problem. HumanEval, MBPP, and LiveCodeBench ask whether a function returns the right value. SWE-bench and its variants ask whether an agent can resolve a GitHub issue. None reward making code faster, and none require the agent to reason about what is actually slow inside a production library.
Shetty and colleagues argue that real high-performance work, the kind that goes into systems like vLLM, HPC kernels, or PyTorch operators, looks nothing like a LeetCode problem. It involves multi-file edits, hardware-aware tricks, SIMD, cache-friendly memory layouts, and often a hop from Python down into C, C++, Cython, or Rust[1]. Their core question, quoted from the introduction, is whether "LLM agents [can] aid in the development of high-performance software."
The benchmark also side-steps a recurring weakness in agent evaluation: ambiguous natural-language specifications. A GitHub issue leaves the desired behavior up to the reader. A performance test is executable: it either runs faster, or it does not. That gives GSO what the authors call a precise specification for each task[1].
Each GSO instance contains a codebase at a base commit, a build script, and a performance test.
The agent has to write a unified patch that, when applied on top of the base commit, makes the performance test run faster while still passing every correctness assertion. Success is judged against the human commit, not against the original slow code, which is why the metric is robust to the specific machine you run on.
The authors describe a two-stage automated pipeline followed by manual curation[1]:
llama-cpp task). The harness then runs the test on the pre-commit and post-commit code, verifies that outputs match, and keeps the candidate only if the speedup is real and consistent across multiple test cases.
A single GSO test exercises the codebase with multiple workloads (12.45 per task on average, up to 20 in the hardest cases) so a model cannot win by special-casing one input.
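The filtering step described above can be sketched roughly as follows. This is an illustration only, not the authors' harness: the function name `keep_candidate`, the 1.2x minimum speedup, and the list-of-runtimes layout are all assumptions made for the example.

```python
def keep_candidate(pre_times, post_times, outputs_match, min_speedup=1.2):
    """Decide whether a candidate commit and its generated test are kept.

    pre_times / post_times: per-workload runtimes in seconds, measured
    before and after the human commit.
    min_speedup: hypothetical threshold, not a value taken from the paper.
    """
    # Correctness first: pre- and post-commit outputs must agree.
    if not outputs_match:
        return False
    if not pre_times or len(pre_times) != len(post_times):
        return False
    # "Consistent" is read here as: every workload clears the threshold.
    speedups = [pre / post for pre, post in zip(pre_times, post_times)]
    return all(s >= min_speedup for s in speedups)


print(keep_candidate([2.0, 4.0], [1.0, 1.0], outputs_match=True))
print(keep_candidate([2.0, 1.0], [1.0, 1.0], outputs_match=True))
```

Requiring every workload to clear the bar, rather than only the average, is one way to read "consistent across multiple test cases"; the real pipeline may use a different rule.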
| Codebase | Tasks | Languages | Domain |
|---|---|---|---|
| NumPy | 36 | Python, C, C++ | Scientific computing |
| Pandas | 34 | Python, Cython | Data analysis |
| Pillow-SIMD | 7 | Python, C | Image processing |
| Pillow | 4 | Python, C | Image processing |
| Pydantic | 4 | Python | Data validation |
| Tornado | 4 | Python | Web and network |
| Tokenizers | 4 | Python, Rust | LLM tokenizers |
| Transformers | 4 | Python | ML inference |
| Datasets | 3 | Python | ML datasets |
| llama.cpp | 2 | Python, C, C++ | ML inference |
This split is uneven on purpose: NumPy and Pandas dominate the count because they are the codebases with the richest history of well-instrumented performance commits. Even so, only about 41% of GSO tasks are pure Python, and 58.8% require touching at least one non-Python file[1]. That number matters because, as the failure-mode analysis later shows, almost every model collapses on tasks that need C, C++, or SIMD edits.
Ground-truth commits in GSO span a wide library of low-level techniques. Figure 3 of the paper lists the most common categories as SIMD/vectorize, caching/memoize, lazy evaluation, memory layouts, parallelism, string-search algorithms (Boyer-Moore-Horspool, Aho-Corasick automata), scatter/gather, CPU feature dispatch, branch elimination, table-driven lookup, select/sort kernels, and bitmap direct-address lookups[1]. These are the kinds of techniques you find in a systems class, not in a typical agent benchmark.
GSO's evaluation metric is one of the more interesting parts of the design. Software optimization is hard to score for two reasons: different tasks have different baselines, and tests within a task can have wildly different speedup magnitudes that distort aggregate numbers.
Let T(C, i) be the runtime of test case i on codebase state C, and let s_i = T(C_1, i) / T(C_2, i) be the speedup on test case i when moving from C_1 to C_2. Earlier work like PIE and ECCO used the geometric mean across tests, but the authors point out that the geometric mean is too easy to game: a model that gets a 1000-times speedup on one test and a 10-times slowdown (a 0.1-times speedup) on another scores a geometric mean of 10. Drawing on the systems literature (Jacob and Mudge, 1995), GSO instead uses the harmonic mean of per-test speedups, n / sum_i (1 / s_i)[1]:
S(C_1, C_2) = n / sum_i (T(C_2, i) / T(C_1, i))
This formulation punishes a single regression more than a geometric mean does, which makes the metric closer to what a maintainer actually cares about.
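The difference between the two averages is easy to check numerically. The sketch below reuses the 1000-times and 0.1-times figures from the text; the helper names are chosen for illustration and are not taken from the GSO codebase.

```python
import math


def geometric_mean(speedups):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


def harmonic_mean(speedups):
    """Harmonic mean: n divided by the sum of reciprocal speedups."""
    return len(speedups) / sum(1.0 / s for s in speedups)


speedups = [1000.0, 0.1]  # one huge win, one 10x regression
print(geometric_mean(speedups))  # about 10
print(harmonic_mean(speedups))   # about 0.2
```

Under the harmonic mean the same pair of results counts as a net slowdown, because the regression dominates the sum of reciprocals.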
For each task GSO measures S(C_h, C_a), the speedup of the agent's patched codebase relative to the human-optimized codebase. A task is counted as Opt_p = true when:
- S(C_h, C_a) >= p, meaning the agent matched at least p fraction of the human's speedup.

Opt_p@K is then the fraction of tasks where at least one of K independent rollouts succeeds. The primary number reported in the paper, written Opt@K, sets p = 0.95 so the agent has to come within 5% of the human's improvement[1].
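A minimal sketch of how these definitions compose is shown below. It assumes a task succeeds when a rollout passes the correctness tests and meets the speedup threshold, and it scores Opt_p@K as a plain any-of-K fraction; the official harness may apply further conditions or a different estimator. All function names and the toy runtimes are hypothetical.

```python
def harmonic_speedup(times_ref, times_new):
    """S(C_ref, C_new) = n / sum_i (T(C_new, i) / T(C_ref, i))."""
    n = len(times_ref)
    return n / sum(t_new / t_ref for t_ref, t_new in zip(times_ref, times_new))


def opt_p(times_human, times_agent, passes_tests, p=0.95):
    """True when a correct agent patch reaches fraction p of the human speedup."""
    return passes_tests and harmonic_speedup(times_human, times_agent) >= p


def opt_p_at_k(tasks, p=0.95):
    """Fraction of tasks where at least one rollout satisfies opt_p.

    tasks: list of (times_human, rollouts), where each rollout is a
    (times_agent, passes_tests) pair.
    """
    solved = sum(
        any(opt_p(times_human, times_agent, ok, p) for times_agent, ok in rollouts)
        for times_human, rollouts in tasks
    )
    return solved / len(tasks)


# Toy data: per-test runtimes in seconds (made up for illustration).
tasks = [
    ([1.0, 2.0], [([1.02, 2.0], True), ([3.0, 5.0], True)]),  # first rollout is close enough
    ([1.0], [([1.5], True)]),                                  # 1.5x slower than the human
]
print(opt_p_at_k(tasks))  # 0.5
```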
Because everything is relative to the same human commit, the metric is largely machine-independent: speedups vary across hardware, but the ratio between the agent's runtime and the human's runtime stays roughly constant. The authors run all evaluations on a single Google Cloud n2-standard-64 VM (64 vCPU, 256 GB RAM) and report stable results across machines.
The paper evaluates six models inside the OpenHands CodeActAgent-v0.35.0 agent scaffold[1]. Each task gets a 3-hour wall-clock budget and a 20-minute timeout per step. For Opt@1 the team samples three rollouts at temperature 0.1.
| Model | Opt@1 | Notes |
|---|---|---|
| GPT-4o | 0.0% | Failed every task at the 95% threshold |
| o3-mini | 1.3% | Lowest reasoning-model score |
| o4-mini | 3.6% | Best OpenAI result in original eval |
| Claude-3.5-v2 (Claude-3.6) | 4.6% | Strong baseline among Anthropic models |
| Claude-3.7 | 3.8% | Slightly below Claude-3.5-v2 on Opt@1 |
| Claude-4.0 | 4.9% | Top model in the original paper |
The pattern is brutal. On a benchmark that competent human maintainers solve by definition (the human commit is the target), the best-performing frontier model in May 2025 cleared just under one in twenty problems. The gap between GPT-4o and Claude-4.0 is also striking; jumping from a non-reasoning model to a reasoning model picks up a few points, but no model crosses the 5% line. By comparison, o4-mini scores about 73% on LiveCodeBench and 56.8% on SWE-bench Verified, versus 3.6% on GSO[1].
The gap shrinks if you lower the speedup bar. Setting p = 0 (just require correctness) gives Claude-4.0 about 70% Opt@1 and o4-mini about 45%. So agents are finding something correct on roughly 45% to 70% of the tasks; they just are not finding anything that approaches the human's runtime improvement. As p climbs from 0 toward 1, the curves drop steeply, and at p = 0.95 they collapse onto the 5% floor[1].
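The threshold sweep can be pictured with a short sketch. The per-task ratios below are invented for illustration and do not come from the paper; `opt_curve` is a hypothetical helper that takes, for each task, the agent's S(C_h, C_a) value, or `None` when no correct patch was produced.

```python
def opt_curve(ratios, thresholds):
    """Map each threshold p to the fraction of tasks with a correct patch at ratio >= p."""
    n = len(ratios)
    return {
        p: sum(r is not None and r >= p for r in ratios) / n
        for p in thresholds
    }


# Hypothetical per-task ratios; None marks a task with no correct patch.
ratios = [0.97, 0.60, 0.30, None, 0.10]
curve = opt_curve(ratios, [0.0, 0.5, 0.95])
for p, score in curve.items():
    print(p, score)
```

With these made-up values the score falls from 0.8 at p = 0 to 0.2 at p = 0.95, the same qualitative shape the paper reports: many correct patches, few that approach the human's speedup.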
The authors run two scaling sweeps on o4-mini and Claude-3.5-v2:
| Compute axis | Effect on Opt@K |
|---|---|
| Parallel rollouts (more samples, same step budget) | Strong gains, e.g. o4-mini reaches 8.82% Opt@K with 50 steps and 8 rollouts versus 1.96% with 400 steps and 1 rollout |
| Serial steps (longer trajectories, same sample count) | Weak gains, performance is roughly flat once you exceed 100 to 200 steps |
| Combined parallel + serial | Best results, but with diminishing returns past 8 rollouts |
With 8 rollouts at 400 steps, Claude-3.5-v2 reaches about 15.7% Opt@10, and o4-mini reaches about 12.7%[1]. This is the headline "~15% with 8 to 10 rollouts" number that gets cited in summaries of the paper. It still sits well below human performance, and the authors note that 75% of agent trajectories terminate before the 100th step even when the budget allows 200 or 400, so more compute alone is not enough.
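A quick arithmetic check makes the scaling comparison concrete. The sketch assumes each reported Opt@K figure is a plain fraction of the 102 tasks, which is an assumption of this example rather than something the text states; under it, the percentages map back to whole task counts.

```python
NUM_TASKS = 102  # benchmark size stated in the article


def tasks_solved(percent):
    """Convert a reported percentage into the nearest whole number of tasks."""
    return round(percent / 100 * NUM_TASKS)


# o4-mini configurations from the scaling table.
parallel = {"steps": 50, "rollouts": 8, "opt_at_k": 8.82}
serial = {"steps": 400, "rollouts": 1, "opt_at_k": 1.96}

for cfg in (parallel, serial):
    total_steps = cfg["steps"] * cfg["rollouts"]
    print(total_steps, tasks_solved(cfg["opt_at_k"]))
# prints "400 9" then "400 2"

print(tasks_solved(15.7), tasks_solved(12.7))  # prints "16 13"
```

Both o4-mini configurations spend the same 400 total steps, so under this reading the parallel allocation solves about nine tasks where the serial one solves two.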
In one ablation, o4-mini is given the human commit's diff and a back-translated description of the optimization strategy and asked to reimplement it. With those plans, Opt@1 climbs from about 3.5% to 5.7%, and Opt@5 goes from about 9.9% to 16.4%[1]. So strategy is part of the bottleneck, but not all of it. Even when the agent is told what to do, it still struggles to write the low-level code that actually does it.
Section 5 of the paper is the qualitative half of the work. Using an LLM-aided pipeline, the authors classify failed trajectories into the categories below, with model-specific percentages reported in Figure 7 of the paper[1].
| Failure category | Example pattern | Claude-3.5-v2 share | o4-mini share |
|---|---|---|---|
| Wrong abstraction level | Refusing to edit C/C++ when the human commit did | 30.1% of trajectories | 25.1% of trajectories |
| Lazy optimization | Adding spurious -O3 flags, input-specific fast paths, monkey-patches in __init__.py | 16.6% of trajectories | 29.0% of trajectories |
| Exploit-heavy or explore-heavy | Committing too quickly versus wandering forever | 27.2% (exploit-heavy) | 25.7% (explore-heavy) |
| Misdiagnosed bottlenecks | Parallelizing a function whose hot path is elsewhere | 13.2% | 6.6% |
| Less impactful changes | Tweaks that pass tests but barely move the needle | 10.0% | 6.8% |
A particularly sharp slice is the language split. On the 42 Python-only tasks, o4-mini reaches 21.4% Opt@10. On the 60 tasks that require C, C++, Cython, or Rust edits, that drops to 4.0%[1]. The paper also tracks whether models touch C or C++ files at all: o4-mini avoids changing C/C++ in roughly 40% of trajectories where the human did, while Claude-3.5-v2 sometimes makes the opposite mistake (modifying C in 9.2% of patches where the human optimization was Python-only).
A few examples from the appendix help explain those numbers:
- np.subtract.at: o4-mini scrolled through the underlying C ufunc files but refused to edit them, then tried to override the function in pure Python.
- char.count: Claude-3.5-v2 tried to parallelize a function that was already CPU-bound on Python's GIL, got worse performance, and concluded that NumPy's string operations were already optimal.
- _periodic_strftime: handled only for monthly format strings by editing __init__.py, an input-specific fast path the test suite caught as a generalization failure.

GSO has continued to evolve since the original arXiv submission. The project changelog, leaderboard, and GitHub repository note three notable changes:
- max_iterations = 200, double the previous budget, so frontier reasoning models can be measured under conditions closer to their best operating point[2].
- reasoning_effort for models like Claude Opus 4.6 to keep the comparison fair against models with explicit thinking modes[2].

The core dataset and metric have not changed. All updates so far layer on top of the 102-task evaluation harness rather than replacing it.
The authors compare GSO against several adjacent benchmarks in Figure 2 of the paper[1]. The summary table is reproduced and extended below.
| Benchmark | Repo level | Evaluates runtime | Multilingual | Precise spec | Distinguishing focus |
|---|---|---|---|---|---|
| HumanEval | No | No | No | Yes | Single-function code generation |
| EvalPerf | No | Yes | No | Yes | Runtime efficiency on small programs |
| ECCO | No | Yes | No | Yes | Code optimization on isolated snippets |
| LiveCodeBench | No | No | No | Yes | Competitive programming, contamination control |
| KernelBench | No | Yes | No | Yes | GPU kernel generation |
| SWE-bench Verified | Yes | No | No | No | Bug fixing in Python repos |
| SWE-Multi / SWE-Multi-Mini | Yes | No | Yes | No | Multilingual bug fixing |
| GSO | Yes | Yes | Yes | Yes | Repository-level performance optimization |
GSO is the only benchmark in this set that satisfies all four properties at once. It also requires substantially larger edits: the paper measures gold-patch line counts and finds GSO solutions involve 4 to 15 times more lines than SWE-bench-style tasks, with a median of 110 lines and a maximum of 2,278 lines per ground-truth commit[1].
The relevance gap shows up when you compare scores. The same o4-mini that scored about 73% on LiveCodeBench and 56.8% on SWE-bench Verified scored 3.6% on GSO. Algorithmic puzzle skill and bug-fix skill clearly do not transfer to optimization skill.
GSO uses Docker images per task to pin dependencies and toolchains. The repository's prepare_images.py script builds these images and can push them to a registry. All paper experiments ran on a single n2-standard-64 Google Cloud VM, but the metric is designed so that scores remain comparable across reasonable hardware.
The official scaffold is OpenHands CodeActAgent-v0.35.0. The agent gets a file-editor tool and a bash terminal. The default per-task budget is 3 hours of wall-clock time and a 20-minute timeout per step. The default prompt instructs the agent to optimize the runtime of a specified performance test and provides the build and test commands.
The public harness installs with uv, checks out the repository, prepares the Docker images, and runs opt_at_k.py over a JSON file of model predictions. Typical commands look like:
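```bash
# Install uv and load it into the current shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Fetch the harness and set up its environment
git clone https://github.com/gso-bench/gso.git
cd gso
uv venv && source .venv/bin/activate
uv sync

# Build the per-task Docker images, then score a predictions file
python scripts/prepare_images.py
python opt_at_k.py --predictions <preds.json> --run_id <name> --model <model_name>
```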
The predictions file is a list of model-generated patches keyed by instance_id. The harness applies each patch in the appropriate Docker container, runs correctness and performance tests, and reports Opt@K plus per-task speedup ratios.
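As a rough sketch, a predictions file of this shape could be assembled as shown below. The key names `model_patch` and `model_name_or_path` follow a convention common in SWE-bench-style harnesses and are assumptions here, as is the example `instance_id`; the repository's own documentation is the authority on the actual schema.

```python
import json


def build_predictions(patches, model_name):
    """Turn {instance_id: unified diff text} into a list of prediction records.

    The key names other than instance_id are assumed, not confirmed.
    """
    return [
        {
            "instance_id": instance_id,
            "model_patch": diff_text,
            "model_name_or_path": model_name,
        }
        for instance_id, diff_text in sorted(patches.items())
    ]


# Hypothetical instance id and a placeholder patch body.
patches = {"owner__repo-abc1234": "diff --git a/pkg/mod.py b/pkg/mod.py\n"}
predictions = build_predictions(patches, "my-model")
print(json.dumps(predictions, indent=2))
```

Sorting by `instance_id` keeps the file stable across runs, which makes diffs between prediction sets easier to read.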
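The dataset itself can be loaded from Hugging Face with the datasets library:
```python
from datasets import load_dataset

# Download the benchmark's test split from the Hugging Face Hub
gso = load_dataset("gso-bench/gso", split="test")
print(len(gso))  # 102
print(gso[0]["instance_id"], gso[0]["repo"])
```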
Each entry contains the fields shown below. The Hugging Face dataset card lists every field with its type and meaning[4].
| Field | Description |
|---|---|
| instance_id | Unique identifier formatted as <owner>__<repo>-<commit> |
| repo | GitHub owner/name of the source repository |
| base_commit | Commit hash before the optimization |
| opt_commit | Expert human commit that delivers the optimization |
| api | Endpoint or API most affected by the change |
| prob_script | Generated performance-test specification |
| tests | JSON list of performance and correctness tests |
| hints_text | Original commit title and message |
| setup_commands | Commands to install base dependencies |
| install_commands | Commands to build or rebuild after the patch |
| created_at | Date of the ground-truth commit |
| gt_commit_message | Commit message from the human optimization |
| gt_diff | Full unified diff of the human optimization |
| arch | Docker image architecture, usually x86_64 |
| instance_image_tag | Docker image tag, usually latest |
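These fields are enough for simple analyses, such as counting tasks per repository or estimating how many ground-truth diffs touch non-Python files. The sketch below works on plain dictionaries with the `repo` and `gt_diff` fields from the table. It assumes diffs carry git-style `diff --git` headers and treats any path not ending in `.py` or `.pyi` as non-Python, which may not match the paper's own classification (docs or config files would be counted too). The toy records are invented.

```python
import re
from collections import Counter

PYTHON_SUFFIXES = (".py", ".pyi")


def files_in_diff(gt_diff):
    """Return the paths named in the 'diff --git a/... b/...' header lines."""
    return re.findall(r"^diff --git a/(\S+) b/\S+", gt_diff, flags=re.MULTILINE)


def touches_non_python(gt_diff):
    """True when at least one touched path is not a .py/.pyi file."""
    return any(not path.endswith(PYTHON_SUFFIXES) for path in files_in_diff(gt_diff))


def summarize(records):
    """Return (tasks per repo, share of tasks whose diff touches non-Python files)."""
    per_repo = Counter(record["repo"] for record in records)
    non_python = sum(touches_non_python(record["gt_diff"]) for record in records)
    return per_repo, non_python / len(records)


# Invented records that mimic the dataset's field names.
records = [
    {"repo": "numpy/numpy", "gt_diff": "diff --git a/src/loops.c b/src/loops.c\n"},
    {"repo": "numpy/numpy", "gt_diff": "diff --git a/lib/util.py b/lib/util.py\n"},
]
per_repo, share = summarize(records)
print(per_repo["numpy/numpy"], share)  # 2 0.5
```

The same functions should accept rows of the loaded dataset, since each row can be indexed by field name like a dictionary.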
A few things stand out about what GSO measures and what its results imply.
Optimization is a different skill from correctness. A model can pass algorithm exams and fix bugs in real Python and still be useless at making a NumPy ufunc faster. The 3.6% Opt@1 for o4-mini against its 73% LiveCodeBench score is the cleanest demonstration of that gap.
Language depth matters more than headline scores admit. The drop from 21.4% (Python only) to 4.0% (anything that touches C, C++, Cython, or Rust) suggests that frontier models are still Python-flavored when they have to ship a working low-level patch.
More compute helps, but not evenly. Going from 1 rollout to 8 roughly triples Opt@K, while doubling steps per rollout barely moves the number. Test-time scaling on GSO looks more like sample diversity than deeper reasoning.
Agents have a planning problem too. Even when handed the human's diff and a back-translated strategy, o4-mini's Opt@5 reaches just 16.4%. Strategy is part of the bottleneck; execution is the rest.
The paper's Limitations section flags four issues directly[1]: