GSO

Overview
Full name	GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents
Abbreviation	GSO
Description	A benchmark evaluating language models and SWE agents on real-world software performance optimization tasks drawn from open-source commit histories
Initial release	May 29, 2025 (arXiv v1)
Latest paper version	v3, October 24, 2025
Venue	NeurIPS 2025 (poster, San Diego, December 3, 2025)
Authors	Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica
Organization	UC Berkeley Sky Computing Lab
Technical Details
Type	Software optimization, repository-level code editing, multi-language programming
Modality	Code, text
Task format	Optimization patches against a precise performance-test specification
Number of tasks	102
Codebases	10
Domains	8 (scientific computing, data analysis, image processing, web and network, ML inference, ML datasets, data validation, LLM tokenization)
Languages	5 (Python, C, C++, Cython, Rust); 58.8% of tasks require non-Python edits
Primary metric	Opt@K (Opt₀.₉₅@K, with a 95% human-speedup threshold)
Secondary metrics	Opt_p@1 (varying speedup threshold p), harmonic-mean speedup ratio, edit size
Performance
Human baseline	Expert open-source maintainer commits (108 to 110 line median, 3.9 files average)
Best Opt@1 in original paper	Claude-4.0 at 4.9%
Worst Opt@1 in original paper	GPT-4o at 0.0%
Opt@10 ceiling reported	About 15% with 8 to 10 rollouts
Saturated	No
Resources
Website	gso-bench.github.io
Paper	arXiv:2505.23671
GitHub	gso-bench/gso
Dataset	Hugging Face
License	MIT

GSO (short for Global Software Optimization in the project page subtitle, and Challenging Software Optimization Tasks for Evaluating SWE-Agents in the paper title) is an artificial intelligence benchmark for software performance optimization. It was released on arXiv on May 29, 2025, by researchers at the UC Berkeley Sky Computing Lab and was accepted as a poster at NeurIPS 2025[1][2]. Instead of asking models to fix bugs or generate small programs from scratch, GSO gives an LLM SWE agent a real codebase, a build script, and a performance test, then asks it to produce a patch that runs as fast as the optimization an expert human committed to that repository. The headline result from the paper is that leading SWE agents achieve less than 5% success at the 95% human-speedup threshold, with Claude-4.0 topping the original evaluation at 4.9% and GPT-4o scoring 0.0%[1][3].

why this benchmark exists

Most public coding benchmarks treat code as a correctness problem. HumanEval, MBPP, and LiveCodeBench ask whether a function returns the right value. SWE-bench and its variants ask whether an agent can resolve a GitHub issue. None reward making code faster, and none require the agent to reason about what is actually slow inside a production library.

Shetty and colleagues argue that real high-performance work, the kind that goes into systems like vLLM, HPC kernels, or PyTorch operators, looks nothing like a LeetCode problem. It involves multi-file edits, hardware-aware tricks, SIMD, cache-friendly memory layouts, and often a hop from Python down into C, C++, Cython, or Rust[1]. Their core question, quoted from the introduction, is whether "LLM agents [can] aid in the development of high-performance software."

The benchmark also side-steps a recurring weakness in agent evaluation: ambiguous natural-language specifications. A GitHub issue leaves the desired behavior up to the reader. A performance test is executable: it either runs faster, or it does not. That gives GSO what the authors call a precise specification for each task[1].

task design

what an agent receives

Each GSO instance contains:

A snapshot of an open-source repository at a specific base commit.
A build script and dependency setup so the codebase actually compiles inside a Docker container.
A performance test generated automatically from a real expert optimization commit, plus correctness tests.
Hints in the form of the original commit title and message.

The agent has to write a unified patch that, when applied on top of the base commit, makes the performance test run faster while still passing every correctness assertion. Success is judged against the human commit, not against the original slow code, which is why the metric is robust to the specific machine you run on.

how the data was built

The authors describe a two-stage automated pipeline followed by manual curation[1]:

Stage I, identifying performance-improving commits. An LLM-based judge crawls a repository's commit history with code-change heuristics, looking for commits that read like they are about speed. For each candidate it pulls the diff, the commit message, the linked issues and pull requests, and the API endpoints that the diff touches.
Stage II, generating and executing performance tests. Another LLM, prompted with that context, writes a performance test using execution-based rejection sampling. The test exercises the affected APIs with realistic workloads (for example, generating completions from a Qwen2-7B model on the ShareGPT dataset for a llama-cpp task). The harness then runs the test on the pre-commit and post-commit code, verifies that outputs match, and keeps the candidate only if the speedup is real and consistent across multiple test cases.
Final curation. The team manually reviewed the surviving candidates, threw out anything with weak tests or reproducibility issues, and balanced the final 102 tasks across optimization techniques, difficulty levels, and application domains.

A single GSO test exercises the codebase with multiple workloads (12.45 per task on average, up to 20 in the hardest cases) so a model cannot win by special-casing one input.
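
The Stage II filter can be pictured with a short sketch. Everything in it is illustrative: the function name `keep_candidate`, the `min_speedup` threshold, and the timing dictionaries are assumptions made for this example, not the authors' pipeline code.

```python
from typing import Mapping


def keep_candidate(
    pre_times: Mapping[str, float],
    post_times: Mapping[str, float],
    outputs_match: Mapping[str, bool],
    min_speedup: float = 1.2,  # hypothetical threshold, not from the paper
) -> bool:
    """Keep a commit only if every workload matches outputs and speeds up."""
    if pre_times.keys() != post_times.keys():
        return False  # both commits must be timed on the same workloads
    for name, before in pre_times.items():
        if not outputs_match.get(name, False):
            return False  # pre- and post-commit outputs differ
        if before / post_times[name] < min_speedup:
            return False  # speedup missing or too small on this workload
    return bool(pre_times)  # reject candidates with no workloads at all


# Made-up timings (seconds) for two workloads of one candidate commit.
pre = {"small_array": 2.0, "large_array": 3.0}
post = {"small_array": 1.0, "large_array": 1.5}
same = {"small_array": True, "large_array": True}
print(keep_candidate(pre, post, same))
```

Requiring the speedup on every workload, rather than on average, mirrors the idea that a candidate should be consistently faster and not just faster on one input.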

the ten codebases

Codebase	Tasks	Languages	Domain
NumPy	36	Python, C, C++	Scientific computing
Pandas	34	Python, Cython	Data analysis
Pillow-SIMD	7	Python, C	Image processing
Pillow	4	Python, C	Image processing
Pydantic	4	Python	Data validation
Tornado	4	Python	Web and network
Tokenizers	4	Python, Rust	LLM tokenizers
Transformers	4	Python	ML inference
Datasets	3	Python	ML datasets
llama.cpp	2	Python, C, C++	ML inference

This split is uneven on purpose: NumPy and Pandas dominate the count because they are the codebases with the richest history of well-instrumented performance commits. Even so, only about 41% of GSO tasks are pure Python, and 58.8% require touching at least one non-Python file[1]. That number matters because, as the failure-mode analysis later shows, almost every model collapses on tasks that need C, C++, or SIMD edits.

the kinds of optimizations involved

Ground-truth commits in GSO span a wide library of low-level techniques. Figure 3 of the paper lists the most common categories as SIMD/vectorize, caching/memoize, lazy evaluation, memory layouts, parallelism, string-search algorithms (Boyer-Moore-Horspool, Aho-Corasick automata), scatter/gather, CPU feature dispatch, branch elimination, table-driven lookup, select/sort kernels, and bitmap direct-address lookups[1]. These are the kinds of techniques you find in a systems class, not in a typical agent benchmark.

the Opt@K metric

GSO's evaluation metric is one of the more interesting parts of the design. Software optimization is hard to score for two reasons: different tasks have different baselines, and tests within a task can have wildly different speedup magnitudes that distort aggregate numbers.

speedup with harmonic mean

Let s_i = T(C_1, i) / T(C_2, i) be the speedup on test case i when moving from codebase state C_1 to C_2, where T(C, i) is the runtime of test i on state C. Earlier work like PIE and ECCO used the geometric mean across tests, but the authors point out that the geometric mean is too easy to game: a model that gets a 1000-times speedup on one test and a 10-times slowdown (a speedup of 0.1) on another still scores a geometric mean of 10. Drawing on the systems literature (Jacob and Mudge, 1995), GSO instead uses the harmonic mean of per-test speedups[1]:

S(C_1, C_2) = n / sum_i (1 / s_i) = n / sum_i (T(C_2, i) / T(C_1, i))

This formulation punishes a single regression more than a geometric mean does, which makes the metric closer to what a maintainer actually cares about.
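
The difference between the two means is easy to check numerically. The sketch below uses the 1000-times and 0.1-times speedups from the example above; the helper names are chosen for this illustration only.

```python
from math import prod


def geometric_mean(speedups):
    """n-th root of the product of per-test speedups."""
    return prod(speedups) ** (1.0 / len(speedups))


def harmonic_mean(speedups):
    """n divided by the sum of reciprocal speedups."""
    return len(speedups) / sum(1.0 / s for s in speedups)


speedups = [1000.0, 0.1]  # one huge win, one 10x regression
gm = geometric_mean(speedups)
hm = harmonic_mean(speedups)
print(f"geometric mean: {gm:.2f}")
print(f"harmonic mean:  {hm:.2f}")
```

The geometric mean comes out at 10, suggesting a large win, while the harmonic mean is about 0.2, which says the patched code is slower overall. The regression dominates the harmonic mean because slow tests dominate total runtime.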

Opt_p and Opt_p@K

For each task GSO measures S(C_h, C_a), the speedup of the agent's patched codebase relative to the human-optimized codebase. A task is counted as Opt_p = true when:

The patch passes all correctness tests.
S(C_h, C_a) >= p, meaning the agent matched at least p fraction of the human's speedup.

Opt_p@K is then the fraction of tasks where at least one of K independent rollouts succeeds. The primary number reported in the paper, written Opt@K, sets p = 0.95 so the agent has to come within 5% of the human's improvement[1].
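
The definition translates almost directly into code. This is a minimal sketch with made-up rollout results; the names `Rollout`, `task_solved`, and `opt_p_at_k` are assumptions for illustration and are not taken from the GSO harness.

```python
from typing import Sequence, Tuple

# Each rollout is (passes_correctness_tests, S(C_h, C_a)).
Rollout = Tuple[bool, float]


def task_solved(rollouts: Sequence[Rollout], p: float, k: int) -> bool:
    """True if any of the first k rollouts is correct and reaches threshold p."""
    return any(ok and s >= p for ok, s in rollouts[:k])


def opt_p_at_k(tasks: Sequence[Sequence[Rollout]], p: float = 0.95, k: int = 1) -> float:
    """Fraction of tasks solved by at least one of k rollouts."""
    return sum(task_solved(r, p, k) for r in tasks) / len(tasks)


# Made-up results for four tasks with two rollouts each.
tasks = [
    [(True, 0.97), (True, 0.60)],   # solved by the first rollout
    [(True, 0.50), (True, 0.99)],   # solved only by the second rollout
    [(False, 1.20), (True, 0.40)],  # fast but incorrect, then correct but slow
    [(True, 0.10), (True, 0.20)],   # never close to the human commit
]
print(opt_p_at_k(tasks, k=1))  # 0.25
print(opt_p_at_k(tasks, k=2))  # 0.5
```

Note the third task: a patch that beats the human commit but fails a correctness test does not count, however fast it is.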

Because everything is relative to the same human commit, the metric is largely machine-independent: speedups vary across hardware, but the ratio between the agent's runtime and the human's runtime stays roughly constant. The authors run all evaluations on a single Google Cloud n2-standard-64 VM (64 vCPU, 256 GB RAM) and report stable results across machines.
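
A short derivation shows why a hardware change can cancel out. It assumes, as an idealization, that moving to a different machine scales every runtime by the same factor c > 0; real machines only approximate this, which is why the ratio stays roughly rather than exactly constant.

```latex
S'(C_h, C_a)
  = \frac{n}{\sum_i \dfrac{c \, T(C_a, i)}{c \, T(C_h, i)}}
  = \frac{n}{\sum_i \dfrac{T(C_a, i)}{T(C_h, i)}}
  = S(C_h, C_a)
```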

what the original paper found

The paper evaluates six models inside the OpenHands CodeActAgent-v0.35.0 agent scaffold[1]. Each task gets a 3-hour wall-clock budget and a 20-minute timeout per step. For Opt@1 the team samples three rollouts at temperature 0.1.

Opt@1 leaderboard from the paper

Model	Opt@1	Notes
GPT-4o	0.0%	Failed every task at the 95% threshold
o3-mini	1.3%	Lowest reasoning-model score
o4-mini	3.6%	Best OpenAI result in original eval
Claude-3.5-v2 (Claude-3.6)	4.6%	Strong baseline among Anthropic models
Claude-3.7	3.8%	Slightly below Claude-3.5-v2 on Opt@1
Claude-4.0	4.9%	Top model in the original paper

The pattern is brutal. On a benchmark that competent human maintainers solve by definition (the human commit is the target), the best-performing frontier model in May 2025 cleared fewer than one in twenty problems. The gap between GPT-4o and Claude-4.0 is also striking; jumping from a non-reasoning model to a reasoning model picks up a few points, but no model crosses the 5% line. By comparison, o4-mini scores about 73% on LiveCodeBench and 56.8% on SWE-bench Verified, versus 3.6% on GSO[1].

Opt_p@1 across thresholds

The gap shrinks if you lower the speedup bar. Setting p = 0 (just require correctness) gives Claude-4.0 an Opt_p@1 of about 70% and o4-mini about 45%. So agents are finding something correct on roughly half to two-thirds of the tasks; they just are not finding anything that approaches the human's runtime improvement. As p climbs from 0 toward 1, the curves drop steeply, and at p = 0.95 every model falls below 5%[1].
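
A threshold sweep of this kind can be sketched in a few lines. The results list is hypothetical and is not data from the paper; it only illustrates how the same rollouts yield a falling curve as p rises.

```python
def opt_p_at_1(results, p):
    """results: one (correct, speedup_vs_human) pair per task."""
    return sum(ok and s >= p for ok, s in results) / len(results)


# Hypothetical single-rollout results for six tasks, not data from the paper.
results = [
    (True, 1.02),
    (True, 0.80),
    (True, 0.45),
    (True, 0.30),
    (False, 0.90),  # incorrect patches fail at every threshold
    (True, 0.10),
]

for p in (0.0, 0.25, 0.5, 0.75, 0.95):
    print(f"p={p:.2f}  Opt_p@1={opt_p_at_1(results, p):.2f}")
```

At p = 0 only correctness matters, so five of the six toy tasks count; at p = 0.95 only the one patch that matches the human commit survives.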

scaling test-time compute

The authors run two scaling sweeps on o4-mini and Claude-3.5-v2:

Compute axis	Effect on Opt@K
Parallel rollouts (more samples, same step budget)	Strong gains, e.g. o4-mini reaches 8.82% Opt@K with 50 steps and 8 rollouts versus 1.96% with 400 steps and 1 rollout
Serial steps (longer trajectories, same sample count)	Weak gains, performance is roughly flat once you exceed 100 to 200 steps
Combined parallel + serial	Best results, but with diminishing returns past 8 rollouts

With 8 rollouts at 400 steps, Claude-3.5-v2 reaches about 15.7% Opt@10, and o4-mini reaches about 12.7%[1]. This is the headline "~15% with 8 to 10 rollouts" number that gets cited in summaries of the paper. It still sits well below human performance, and the authors note that 75% of agent trajectories terminate before the 100th step even when the budget allows 200 or 400, so more compute alone is not enough.
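
When more rollouts are sampled than the K being reported, one common way to estimate a metric of this shape is the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021). The sketch below applies it to a single task; whether the GSO harness computes Opt@K exactly this way is an assumption, and the counts are made up.

```python
from math import comb


def opt_at_k_estimate(n: int, c: int, k: int) -> float:
    """Probability that at least one of k rollouts drawn from n succeeds,
    given that c of the n rollouts succeeded (pass@k-style estimator)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: 1 success among 8 rollouts.
for k in (1, 2, 4, 8):
    print(k, round(opt_at_k_estimate(n=8, c=1, k=k), 3))
```

Averaging this per-task estimate over all tasks gives a benchmark-level number. The toy task also shows why parallel sampling helps: a strategy that succeeds once in eight tries contributes 0.125 at K = 1 but 1.0 at K = 8.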

performance with ground-truth plans

In one ablation, o4-mini is given the human commit's diff and a back-translated description of the optimization strategy and asked to reimplement it. With those plans, Opt@1 climbs from about 3.5% to 5.7%, and Opt@5 goes from about 9.9% to 16.4%[1]. So strategy is part of the bottleneck, but not all of it. Even when the agent is told what to do, it still struggles to write the low-level code that actually does it.

failure modes

Section 5 of the paper is the qualitative half of the work. Using an LLM-aided pipeline, the authors classify failed trajectories into the failure categories below, with model-specific percentages reported in Figure 7 of the paper[1].

Failure category	Example pattern	Claude-3.5-v2 share	o4-mini share
Wrong abstraction level	Refusing to edit C/C++ when the human commit did	30.1% of trajectories	25.1% of trajectories
Lazy optimization	Adding spurious `-O3` flags, input-specific fast paths, monkey-patches in `__init__.py`	16.6% of trajectories	29.0% of trajectories
Exploit-heavy or explore-heavy	Committing too quickly versus wandering forever	27.2% (exploit-heavy)	25.7% (explore-heavy)
Misdiagnosed bottlenecks	Parallelizing a function whose hot path is elsewhere	13.2%	6.6%
Less impactful changes	Tweaks that pass tests but barely move the needle	10.0%	6.8%

A particularly sharp slice is the language split. On the 42 Python-only tasks, o4-mini reaches 21.4% Opt@10. On the 60 tasks that require C, C++, Cython, or Rust edits, that drops to 4.0%[1]. The paper also tracks whether models touch C or C++ files at all: o4-mini avoids changing C/C++ in roughly 40% of trajectories where the human did, while Claude-3.5-v2 sometimes makes the opposite mistake (modifying C in 9.2% of patches where the human optimization was Python-only).

A few examples from the appendix help explain those numbers:

For NumPy's np.subtract.at, o4-mini scrolled through the underlying C ufunc files but refused to edit them, then tried to override the function in pure Python.
On Pillow, Claude-3.5-v2 broke SIMD pointer arithmetic in a way that caused segmentation faults.
On NumPy's char.count, Claude-3.5-v2 tried to parallelize a function that was already CPU-bound on Python's GIL, got worse performance, and concluded that NumPy's string operations were already optimal.
On Pandas, o4-mini tried to patch _periodic_strftime only for monthly format strings by editing __init__.py, an input-specific fast path the test suite caught as a generalization failure.
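
The language split discussed above depends on knowing which files a patch touches. A rough way to check that from a unified diff is sketched here; the suffix list and the sample paths are assumptions for illustration, not the classification used in the paper.

```python
import re
from pathlib import PurePosixPath

# Suffix list is an illustrative assumption, not taken from the GSO harness.
NON_PYTHON_SUFFIXES = {".c", ".h", ".cc", ".cpp", ".hpp", ".pyx", ".pxd", ".rs"}


def touched_files(diff_text: str) -> list[str]:
    """Return paths from the '+++ b/<path>' headers of a unified diff."""
    return re.findall(r"^\+\+\+ b/(\S+)", diff_text, flags=re.MULTILINE)


def touches_non_python(diff_text: str) -> bool:
    """True if any touched file carries a C, C++, Cython, or Rust suffix."""
    return any(
        set(PurePosixPath(path).suffixes) & NON_PYTHON_SUFFIXES
        for path in touched_files(diff_text)
    )


# Hypothetical diffs; the paths are made up for the example.
diff_c = "--- a/src/umath/loops.c.src\n+++ b/src/umath/loops.c.src\n@@ -1 +1 @@\n-x\n+y\n"
diff_py = "--- a/pkg/frame.py\n+++ b/pkg/frame.py\n@@ -1 +1 @@\n-x\n+y\n"
print(touches_non_python(diff_c), touches_non_python(diff_py))
```

Comparing this flag for the agent's patch and for the human commit on the same task is one way to spot the "wrong abstraction level" pattern, where the agent stays in Python although the human edited lower-level code.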

post-paper updates

GSO has continued to evolve since the original arXiv submission. The project changelog, leaderboard, and GitHub repository record three notable changes:

November 2025: Hack Detector. Reward hacking is a known problem in agent benchmarks (Gu et al., 2025; Lange et al., 2025). The maintainers added a hack detector that flags trajectories using deceptive optimizations such as test-harness manipulation or memoization that only works for the test inputs[2].
April 2026: increased inference budget. New runs default to max_iterations = 200, double the previous budget, so frontier reasoning models can be measured under conditions closer to their best operating point[2].
Reasoning-effort knob for thinking models. Newer entries on the leaderboard set reasoning_effort for models like Claude Opus 4.6 to keep the comparison fair against models with explicit thinking modes[2].

The core dataset and metric have not changed. All updates so far layer on top of the 102-task evaluation harness rather than replacing it.

comparison with related benchmarks

The authors compare GSO against several adjacent benchmarks in Figure 2 of the paper[1]. The summary table is reproduced and extended below.

Benchmark	Repo level	Evaluates runtime	Multilingual	Precise spec	Distinguishing focus
HumanEval	No	No	No	Yes	Single-function code generation
EvalPerf	No	Yes	No	Yes	Runtime efficiency on small programs
ECCO	No	Yes	No	Yes	Code optimization on isolated snippets
LiveCodeBench	No	No	No	Yes	Competitive programming, contamination control
KernelBench	No	Yes	No	Yes	GPU kernel generation
SWE-bench Verified	Yes	No	No	No	Bug fixing in Python repos
SWE-Multi / SWE-Multi-Mini	Yes	No	Yes	No	Multilingual bug fixing
GSO	Yes	Yes	Yes	Yes	Repository-level performance optimization

GSO is the only benchmark in this set that satisfies all four properties at once. It also requires substantially larger edits: the paper measures gold-patch line counts and finds GSO solutions involve 4 to 15 times more lines than SWE-bench-style tasks, with a median of 110 lines and a maximum of 2,278 lines per ground-truth commit[1].
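
Edit size can be approximated straight from a unified diff. The sketch below is a simple heuristic written for this article; how the paper itself counts lines is not specified here, so the exact rule (added plus removed lines, headers excluded) is an assumption.

```python
def edit_size(diff_text: str) -> dict:
    """Count files and changed lines in a unified diff (headers excluded)."""
    files = added = removed = 0
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            files += 1  # new-file header marks one touched file
        elif line.startswith("--- "):
            continue  # old-file header, not a removed line
        elif line.startswith("+"):
            added += 1
        elif line.startswith("-"):
            removed += 1
    return {"files": files, "added": added, "removed": removed, "changed": added + removed}


# Made-up diff for illustration.
sample = """--- a/pkg/fast.py
+++ b/pkg/fast.py
@@ -1,3 +1,4 @@
 def f(xs):
-    return [x * 2 for x in xs]
+    out = []
+    out.extend(x * 2 for x in xs)
+    return out
"""
print(edit_size(sample))
```

The header check is deliberately crude: a changed line whose own content begins with `-- ` or `++ ` would be mistaken for a file header, so a real tool should use a proper diff parser.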

The relevance gap shows up when you compare scores. The same o4-mini that scored about 73% on LiveCodeBench and 56.8% on SWE-bench Verified scored 3.6% on GSO. Algorithmic puzzle skill and bug-fix skill clearly do not transfer to optimization skill.

evaluation setup

environment

GSO uses Docker images per task to pin dependencies and toolchains. The repository's prepare_images.py script builds these images and can push them to a registry. All paper experiments ran on a single n2-standard-64 Google Cloud VM, but the metric is designed so that scores remain comparable across reasonable hardware.

agent scaffold

The official scaffold is OpenHands CodeActAgent-v0.35.0. The agent gets a file-editor tool and a bash terminal. The default per-task budget is 3 hours of wall-clock time and a 20-minute timeout per step. The default prompt instructs the agent to optimize the runtime of a specified performance test and provides the build and test commands.

running an evaluation

The public harness installs with uv, checks out the repository, prepares the Docker images, and runs opt_at_k.py over a JSON file of model predictions. Typical commands look like:

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
git clone https://github.com/gso-bench/gso.git
cd gso
uv venv && source .venv/bin/activate
uv sync
python scripts/prepare_images.py
python opt_at_k.py --predictions <preds.json> --run_id <name> --model <model_name>

The predictions file is a list of model-generated patches keyed by instance_id. The harness applies each patch in the appropriate Docker container, runs correctness and performance tests, and reports Opt@K plus per-task speedup ratios.
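
A predictions file of this shape could be written as follows. The field names `model_patch` and `model_name_or_path` follow the SWE-bench prediction convention; treating them as what the GSO harness expects is an assumption that should be checked against the repository's README. The instance ID and patch are placeholders.

```python
import json
from pathlib import Path

# Field names follow the SWE-bench convention; whether the GSO harness
# expects exactly these keys is an assumption to verify before use.
predictions = [
    {
        "instance_id": "<owner>__<repo>-<commit>",  # placeholder, see dataset
        "model_name_or_path": "my-model",           # hypothetical label
        "model_patch": (
            "--- a/pkg/mod.py\n"
            "+++ b/pkg/mod.py\n"
            "@@ -1 +1 @@\n"
            "-slow()\n"
            "+fast()\n"
        ),
    }
]

out_path = Path("preds.json")
out_path.write_text(json.dumps(predictions, indent=2))
print(f"wrote {len(predictions)} prediction(s) to {out_path}")
```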

loading the dataset

from datasets import load_dataset

gso = load_dataset("gso-bench/gso", split="test")
print(len(gso))  # 102
print(gso[0]["instance_id"], gso[0]["repo"])

Each entry contains the fields shown below. The Hugging Face dataset card lists every field with its type and meaning[4].

Field	Description
`instance_id`	Unique identifier formatted as `<owner>__<repo>-<commit>`
`repo`	GitHub `owner/name` of the source repository
`base_commit`	Commit hash before the optimization
`opt_commit`	Expert human commit that delivers the optimization
`api`	Endpoint or API most affected by the change
`prob_script`	Generated performance-test specification
`tests`	JSON list of performance and correctness tests
`hints_text`	Original commit title and message
`setup_commands`	Commands to install base dependencies
`install_commands`	Commands to build or rebuild after the patch
`created_at`	Date of the ground-truth commit
`gt_commit_message`	Commit message from the human optimization
`gt_diff`	Full unified diff of the human optimization
`arch`	Docker image architecture, usually `x86_64`
`instance_image_tag`	Docker image tag, usually `latest`
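
As a small usage sketch, the `repo` field is enough to tally tasks per codebase. The records below are stand-ins with made-up values so the example runs offline; the helper name `tasks_per_repo` is chosen for this illustration.

```python
from collections import Counter


def tasks_per_repo(records):
    """Count instances per source repository using the `repo` field."""
    return Counter(record["repo"] for record in records)


# Stand-in records with made-up values; only the field names come from the table.
records = [
    {"instance_id": "org__alpha-abc123", "repo": "org/alpha"},
    {"instance_id": "org__alpha-def456", "repo": "org/alpha"},
    {"instance_id": "org__beta-0a1b2c", "repo": "org/beta"},
]

for repo, count in tasks_per_repo(records).most_common():
    print(repo, count)
```

Passing the dataset loaded above instead of `records` should work the same way, since iterating over a Hugging Face dataset yields one dictionary per row.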

takeaways

A few things stand out about what GSO measures and what its results imply.

Optimization is a different skill from correctness. A model can pass algorithm exams and fix bugs in real Python and still be useless at making a NumPy ufunc faster. The 3.6% Opt@1 for o4-mini against its 73% LiveCodeBench score is the cleanest demonstration of that gap.

Language depth matters more than headline scores admit. The drop from 21.4% (Python only) to 4.0% (anything that touches C, C++, Cython, or Rust) suggests that frontier models are still Python-flavored when they have to ship a working low-level patch.

More compute helps, but not evenly. Going from 1 rollout to 8 roughly triples Opt@K, while doubling steps per rollout barely moves the number. Test-time scaling on GSO looks more like sample diversity than deeper reasoning.

Planning is only part of the problem. Handing o4-mini the human's diff and a back-translated strategy lifts its Opt@5 from about 9.9% to 16.4%, so better plans do help, but the score stays low. Strategy is part of the bottleneck; execution is the rest.

limitations

The paper's Limitations section flags four issues directly[1]:

Benchmark size. 102 tasks introduce variance. The team plans to expand based on community feedback.
Hacky optimizations. Reward hacking is a live problem; the November 2025 Hack Detector tries to keep ahead of it.
Evaluation beyond speedup. Real engineering involves trade-offs against memory use, maintainability, and idiomatic style, none of which GSO currently measures.
Contamination. Tasks come from public GitHub repositories, so training-data overlap is plausible. The authors argue that the persistently low scores and the continuous speedup metric make contamination unlikely to fully explain the results.

references

[1] Shetty, M., Jain, N., Liu, J., Kethanaboyina, V., Sen, K., Stoica, I. (2025). "GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents." arXiv preprint, arXiv:2505.23671v3, October 24, 2025. https://arxiv.org/abs/2505.23671
[2] GSO project page and leaderboard, UC Berkeley Sky Computing Lab. https://gso-bench.github.io/
[3] UC Berkeley Sky Computing Lab project listing for GSO. https://sky.cs.berkeley.edu/project/gso/
[4] GSO dataset card, Hugging Face. https://huggingface.co/datasets/gso-bench/gso
[5] GSO benchmark repository, GitHub. https://github.com/gso-bench/gso
[6] NeurIPS 2025 poster listing for GSO, San Diego, December 3, 2025. https://neurips.cc/virtual/2025/loc/san-diego/poster/121735
[7] Epoch AI benchmark page for GSO. https://epoch.ai/benchmarks/gso
