# GSO

> Source: https://aiwiki.ai/wiki/gso
> Updated: 2026-05-10
> Categories: AI Benchmarks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

| GSO | |
| --- | --- |
| Overview | |
| Full name | GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents |
| Abbreviation | GSO |
| Description | A benchmark evaluating language models and SWE agents on real-world software performance optimization tasks drawn from open-source commit histories |
| Initial release | May 29, 2025 (arXiv v1) |
| Latest paper version | v3, October 24, 2025 |
| Venue | NeurIPS 2025 (poster, San Diego, December 3, 2025) |
| Authors | Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica |
| Organization | UC Berkeley Sky Computing Lab |
| Technical Details | |
| Type | Software optimization, repository-level code editing, multi-language programming |
| Modality | Code, text |
| Task format | Optimization patches against a precise performance-test specification |
| Number of tasks | 102 |
| Codebases | 10 |
| Domains | 8 (scientific computing, data analysis, image processing, web and network, ML inference, ML datasets, data validation, LLM tokenization) |
| Languages | 5 (Python, C, C++, Cython, Rust); 58.8% of tasks require non-Python edits |
| Primary metric | Opt@K (Opt₀.₉₅@K, with a 95% human-speedup threshold) |
| Secondary metrics | Opt_p@1 (varying speedup threshold p), harmonic-mean speedup ratio, edit size |
| Performance | |
| Human baseline | Expert open-source maintainer commits (108 to 110 line median, 3.9 files average) |
| Best Opt@1 in original paper | Claude-4.0 at 4.9% |
| Worst Opt@1 in original paper | GPT-4o at 0.0% |
| Opt@10 ceiling reported | About 15% with 8 to 10 rollouts |
| Saturated | No |
| Resources | |
| Website | [gso-bench.github.io](https://gso-bench.github.io/) |
| Paper | [arXiv:2505.23671](https://arxiv.org/abs/2505.23671) |
| GitHub | [gso-bench/gso](https://github.com/gso-bench/gso) |
| Dataset | [Hugging Face](https://huggingface.co/datasets/gso-bench/gso) |
| License | MIT |

**GSO** (expanded as *Global Software Optimization* in the project page subtitle; the paper itself is titled *GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents*) is an [artificial intelligence](/wiki/artificial_intelligence) benchmark for software performance optimization. It was released on arXiv on May 29, 2025, by researchers at the [UC Berkeley](/wiki/uc_berkeley) [Sky Computing Lab](/wiki/sky_computing_lab) and was accepted as a poster at [NeurIPS](/wiki/neurips) 2025[1][2]. Instead of asking models to fix bugs or generate small programs from scratch, GSO gives an [LLM](/wiki/large_language_model) [SWE agent](/wiki/swe_agent) a real codebase, a build script, and a performance test, then asks it to produce a patch that runs as fast as the optimization an expert human committed to that repository. The headline result from the paper is that leading SWE agents succeed on fewer than 5% of tasks at the 95% human-speedup threshold, with Claude-4.0 topping the original evaluation at 4.9% and GPT-4o scoring 0.0%[1][3].

## why this benchmark exists

Most public coding benchmarks treat code as a correctness problem. [HumanEval](/wiki/humaneval), [MBPP](/wiki/mbpp), and [LiveCodeBench](/wiki/livecodebench) ask whether a function returns the right value. [SWE-bench](/wiki/swe_bench) and its variants ask whether an agent can resolve a [GitHub](/wiki/github) issue. None reward making code faster, and none require the agent to reason about what is actually slow inside a production library.

Shetty and colleagues argue that real high-performance work, the kind that goes into systems like [vLLM](/wiki/vllm), HPC kernels, or [PyTorch](/wiki/pytorch) operators, looks nothing like a LeetCode problem. It involves multi-file edits, hardware-aware tricks, [SIMD](/wiki/simd), cache-friendly memory layouts, and often a hop from Python down into C, C++, [Cython](/wiki/cython), or [Rust](/wiki/rust)[1]. Their core question, quoted from the introduction, is whether "LLM agents [can] aid in the development of high-performance software."

The benchmark also sidesteps a recurring weakness in agent evaluation: ambiguous natural-language specifications. A [GitHub issue](/wiki/github_issue) leaves the desired behavior up to the reader. A performance test is executable: it either runs faster, or it does not. That gives GSO what the authors call a *precise specification* for each task[1].

## task design

### what an agent receives

Each GSO instance contains:

- A snapshot of an [open-source](/wiki/open_source) repository at a specific base commit.
- A build script and dependency setup so the codebase actually compiles inside a [Docker](/wiki/docker) container.
- A *performance test* generated automatically from a real expert optimization commit, plus correctness tests.
- Hints in the form of the original commit title and message.

The agent has to write a unified patch that, when applied on top of the base commit, makes the performance test run faster while still passing every correctness assertion. Success is judged against the human commit, not against the original slow code, which is why the metric is robust to the specific machine you run on.

### how the data was built

The authors describe a two-stage automated pipeline followed by manual curation[1]:

1. **Stage I, identifying performance-improving commits.** An LLM-based judge crawls a repository's commit history with code-change heuristics, looking for commits that read like they are about speed. For each candidate it pulls the diff, the commit message, the linked issues and pull requests, and the API endpoints that the diff touches.
2. **Stage II, generating and executing performance tests.** Another LLM, prompted with that context, writes a performance test using execution-based rejection sampling. The test exercises the affected APIs with realistic workloads (for example, generating completions from a Qwen2-7B model on the ShareGPT dataset for a `llama-cpp` task). The harness then runs the test on the pre-commit and post-commit code, verifies that outputs match, and keeps the candidate only if the speedup is real and consistent across multiple test cases.
3. **Final curation.** The team manually reviewed the surviving candidates, threw out anything with weak tests or reproducibility issues, and balanced the final 102 tasks across optimization techniques, difficulty levels, and application domains.
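
The keep-or-discard decision at the end of Stage II can be sketched as follows. This is a minimal illustration rather than the authors' harness: the `WorkloadResult` type, the `keep_candidate` function, and the `min_speedup` threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkloadResult:
    """One workload run on both sides of a candidate commit (hypothetical type)."""
    pre_seconds: float    # runtime on the pre-commit code
    post_seconds: float   # runtime on the post-commit code
    outputs_match: bool   # did both versions produce equivalent output?


def keep_candidate(results: list[WorkloadResult], min_speedup: float = 1.1) -> bool:
    """Keep a candidate only if every workload agrees in output and speeds up.

    `min_speedup` is an assumed threshold, not a value from the paper.
    """
    if not results:
        return False
    for r in results:
        if not r.outputs_match:
            return False  # behaviour changed, so the comparison is not fair
        if r.pre_seconds / r.post_seconds < min_speedup:
            return False  # speedup is missing or too small on this workload
    return True


consistent = [WorkloadResult(2.0, 1.0, True), WorkloadResult(3.0, 1.2, True)]
flaky = [WorkloadResult(2.0, 1.0, True), WorkloadResult(1.0, 1.0, True)]
print(keep_candidate(consistent), keep_candidate(flaky))  # True False
```

Requiring the speedup on every workload, rather than on average, is one simple way to express "real and consistent"; the paper's exact criterion may differ.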

A single GSO test exercises the codebase with multiple workloads (12.45 per task on average, up to 20 in the hardest cases) so a model cannot win by special-casing one input.

### the ten codebases

| Codebase | Tasks | Languages | Domain |
| --- | --- | --- | --- |
| [NumPy](/wiki/numpy) | 36 | Python, C, C++ | Scientific computing |
| [Pandas](/wiki/pandas) | 34 | Python, [Cython](/wiki/cython) | Data analysis |
| [Pillow-SIMD](/wiki/pillow_simd) | 7 | Python, C | Image processing |
| [Pillow](/wiki/pillow) | 4 | Python, C | Image processing |
| [Pydantic](/wiki/pydantic) | 4 | Python | Data validation |
| [Tornado](/wiki/tornado) | 4 | Python | Web and network |
| [Tokenizers](/wiki/tokenizers) | 4 | Python, [Rust](/wiki/rust) | LLM tokenization |
| [Transformers](/wiki/transformers) | 4 | Python | ML inference |
| [Datasets](/wiki/huggingface_datasets) | 3 | Python | ML datasets |
| [llama.cpp](/wiki/llama_cpp) | 2 | Python, C, C++ | ML inference |

This split is uneven on purpose: NumPy and Pandas dominate the count because they are the codebases with the richest history of well-instrumented performance commits. Even so, only about 41% of GSO tasks are pure Python, and 58.8% require touching at least one non-Python file[1]. That number matters because, as the failure-mode analysis later shows, almost every model collapses on tasks that need C, C++, or SIMD edits.
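
The counts and percentages quoted here can be cross-checked with a few lines of arithmetic. The task counts are copied from the table; everything else is derived from them and from the 58.8% figure.

```python
# Task counts per codebase, copied from the table above.
tasks = {
    "NumPy": 36, "Pandas": 34, "Pillow-SIMD": 7, "Pillow": 4, "Pydantic": 4,
    "Tornado": 4, "Tokenizers": 4, "Transformers": 4, "Datasets": 3, "llama.cpp": 2,
}
total = sum(tasks.values())

# 58.8% of tasks require non-Python edits; the rest are pure Python.
non_python = round(0.588 * total)
pure_python = total - non_python

print(total)                         # 102
print(non_python, pure_python)       # 60 42
print(f"{pure_python / total:.1%}")  # 41.2%
print(f"{(tasks['NumPy'] + tasks['Pandas']) / total:.1%}")  # 68.6%
```

NumPy and Pandas together account for roughly two thirds of the benchmark, which is worth keeping in mind when reading aggregate scores.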

### the kinds of optimizations involved

Ground-truth commits in GSO span a wide library of low-level techniques. Figure 3 of the paper lists the most common categories as SIMD/vectorize, caching/memoize, lazy evaluation, memory layouts, parallelism, string-search algorithms (Boyer-Moore-Horspool, Aho-Corasick automata), scatter/gather, CPU feature dispatch, branch elimination, table-driven lookup, select/sort kernels, and bitmap direct-address lookups[1]. These are the kinds of techniques you find in a systems class, not in a typical agent benchmark.

## the Opt@K metric

GSO's evaluation metric is one of the more interesting parts of the design. Software optimization is hard to score for two reasons: different tasks have different baselines, and tests within a task can have wildly different speedup magnitudes that distort aggregate numbers.

### speedup with harmonic mean

Let `s_i = T(C_1, i) / T(C_2, i)` be the speedup on test case `i` between two codebase states `C_1` and `C_2`, where `T(C, i)` is the runtime of test `i` on codebase `C`. Earlier work like [PIE](/wiki/pie_benchmark) and [ECCO](/wiki/ecco_benchmark) used the geometric mean across tests, but the authors point out that geometric mean is too easy to game: a model that gets a 1000-times speedup on one test and a 0.1-times *slowdown* on another scores a geometric mean of 10. Drawing on the systems literature (Jacob and Mudge, 1995), GSO instead uses the harmonic mean of per-test speedups[1]:

```
S(C_1, C_2) = n / sum_i (T(C_2, i) / T(C_1, i))
```

This formulation punishes a single regression more than a geometric mean does, which makes the metric closer to what a maintainer actually cares about.
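
The 1000-times and 0.1-times example above makes the difference concrete. The sketch below uses Python's standard `statistics` module; the helper `harmonic_speedup` is an illustrative name, not a function from the GSO harness.

```python
from statistics import geometric_mean, harmonic_mean

# Per-test speedups from the example above: one huge win, one 10x slowdown.
speedups = [1000.0, 0.1]

print(geometric_mean(speedups))  # about 10: looks like a large overall win
print(harmonic_mean(speedups))   # about 0.2: dominated by the regression


def harmonic_speedup(times_before: list[float], times_after: list[float]) -> float:
    """S(C_1, C_2) = n / sum_i (T(C_2, i) / T(C_1, i))."""
    ratios = [after / before for before, after in zip(times_before, times_after)]
    return len(ratios) / sum(ratios)


# Same example expressed as runtimes in seconds: 1000x faster, then 10x slower.
print(harmonic_speedup([1000.0, 1.0], [1.0, 10.0]))  # about 0.2
```

Under the harmonic mean the patch scores below 1, meaning it counts as a net slowdown despite the spectacular win on the first test.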

### Opt_p and Opt_p@K

For each task GSO measures `S(C_h, C_a)`, the speedup of the agent's patched codebase relative to the human-optimized codebase. A task is counted as `Opt_p = true` when:

1. The patch passes all correctness tests.
2. `S(C_h, C_a) >= p`, meaning the agent matched at least `p` fraction of the human's speedup.

`Opt_p@K` is then the fraction of tasks where at least one of `K` independent rollouts succeeds. The primary number reported in the paper, written `Opt@K`, sets `p = 0.95` so the agent has to come within 5% of the human's improvement[1].
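
A minimal sketch of this definition, using made-up rollout results. The function names and the data layout are illustrative assumptions, and taking the first `k` rollouts is a simplification of sampling `K` independent ones.

```python
def opt_p(speedup_vs_human: float, passes_tests: bool, p: float = 0.95) -> bool:
    """A rollout succeeds if it is correct and S(C_h, C_a) >= p."""
    return passes_tests and speedup_vs_human >= p


def opt_p_at_k(rollouts_per_task: dict[str, list[tuple[float, bool]]],
               k: int, p: float = 0.95) -> float:
    """Fraction of tasks where at least one of the first k rollouts succeeds."""
    solved = sum(
        any(opt_p(s, ok, p) for s, ok in rollouts[:k])
        for rollouts in rollouts_per_task.values()
    )
    return solved / len(rollouts_per_task)


# Made-up results: (S(C_h, C_a), passes correctness tests) per rollout.
results = {
    "task-a": [(0.40, True), (0.97, True)],   # second rollout matches the human
    "task-b": [(1.20, False), (0.60, True)],  # fast but incorrect, then correct but slow
    "task-c": [(0.96, True), (0.10, True)],   # first rollout already succeeds
    "task-d": [(0.50, True), (0.70, True)],   # never reaches the threshold
}
print(opt_p_at_k(results, k=1))  # 0.25
print(opt_p_at_k(results, k=2))  # 0.5
```

Note that `task-b`'s first rollout is faster than the human commit but still fails, because correctness is checked before speed.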

Because everything is relative to the same human commit, the metric is largely machine-independent: speedups vary across hardware, but the ratio between the agent's runtime and the human's runtime stays roughly constant. The authors run all evaluations on a single Google Cloud `n2-standard-64` VM (64 vCPU, 256 GB RAM) and report stable results across machines.
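
The machine-independence argument can be illustrated under an idealized assumption: a second machine that slows every test by the same constant factor. Real hardware differences are not this uniform, so the sketch only shows why the ratio is a sensible thing to compare; the runtimes are hypothetical.

```python
import math


def harmonic_speedup(times_ref: list[float], times_new: list[float]) -> float:
    """Harmonic mean of per-test speedups of `times_new` relative to `times_ref`."""
    return len(times_ref) / sum(new / ref for ref, new in zip(times_ref, times_new))


# Hypothetical per-test runtimes (seconds) on one machine.
human = [1.0, 2.0, 4.0]
agent = [1.25, 2.0, 5.0]

# Idealized second machine: everything runs 1.7x slower.
slow_human = [1.7 * t for t in human]
slow_agent = [1.7 * t for t in agent]

s_fast = harmonic_speedup(human, agent)
s_slow = harmonic_speedup(slow_human, slow_agent)
print(round(s_fast, 3), math.isclose(s_fast, s_slow))  # 0.857 True
```

Absolute runtimes change by 70%, but `S(C_h, C_a)` does not, so the same patch lands on the same side of the `p = 0.95` threshold.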

## what the original paper found

The paper evaluates six models inside the [OpenHands](/wiki/openhands) `CodeActAgent-v0.35.0` agent scaffold[1]. Each task gets a 3-hour wall-clock budget and a 20-minute timeout per step. For Opt@1 the team samples three rollouts at temperature 0.1.
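
One common way to turn several rollouts per task into an @K score is the unbiased pass@k estimator introduced with HumanEval. Whether GSO computes Opt@K exactly this way is an assumption here; the sketch only shows how three rollouts per task could yield an Opt@1 estimate, and the success counts are invented.

```python
from math import comb


def at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased estimate that at least one of k rollouts succeeds,
    given c successes among n sampled rollouts (pass@k-style)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-task success counts out of n = 3 rollouts.
successes = [0, 0, 1, 0, 3, 0, 0, 0, 0, 0]
n = 3
opt_at_1 = sum(at_k_estimate(n, c, 1) for c in successes) / len(successes)
print(round(opt_at_1, 4))  # 0.1333
```

For `k = 1` the estimator reduces to the plain success rate `c / n` per task, averaged over tasks.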

### Opt@1 leaderboard from the paper

| Model | Opt@1 | Notes |
| --- | --- | --- |
| [GPT-4o](/wiki/gpt_4o) | 0.0% | Failed every task at the 95% threshold |
| [o3-mini](/wiki/o3_mini) | 1.3% | Lowest reasoning-model score |
| [o4-mini](/wiki/o4_mini) | 3.6% | Best [OpenAI](/wiki/openai) result in original eval |
| Claude-3.5-v2 (Claude-3.6) | 4.6% | Strong baseline among Anthropic models |
| [Claude-3.7](/wiki/claude_3_7_sonnet) | 3.8% | Slightly below 3.6 on Opt@1 |
| [Claude-4.0](/wiki/claude_4) | 4.9% | Top model in the original paper |

The pattern is brutal. On a benchmark that competent human maintainers solve by definition (the human commit *is* the target), the best-performing frontier model in May 2025 cleared fewer than one in twenty problems. The gap between GPT-4o and Claude-4.0 is also striking; moving from GPT-4o to the reasoning models o3-mini and o4-mini picks up a few points, but no model crosses the 5% line. By comparison, o4-mini scores about 73% on [LiveCodeBench](/wiki/livecodebench) and 56.8% on [SWE-bench Verified](/wiki/swe_bench_verified), versus 3.6% on GSO[1].

### Opt_p@1 across thresholds

The gap shrinks if you lower the speedup bar. Setting `p = 0` (just require correctness) gives Claude-4.0 about 70% `Opt_0@1` and o4-mini about 45%. So agents are *finding* something correct on roughly half to two-thirds of the tasks; they just are not finding anything that approaches the human's runtime improvement. As `p` climbs from 0 toward 1, the curves drop steeply, and at `p = 0.95` they collapse onto the 5% floor[1].
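
The shape of such a threshold curve can be reproduced on toy data. The rollout results below are invented for illustration and are not taken from the paper.

```python
# Made-up single-rollout results: (S(C_h, C_a), passes correctness tests).
rollouts = [(0.98, True), (0.85, True), (0.60, True), (0.30, True),
            (0.20, True), (1.10, False), (0.00, False), (0.55, True)]


def opt_p_at_1(results, p):
    """Share of tasks whose single rollout is correct and reaches S >= p."""
    return sum(ok and s >= p for s, ok in results) / len(results)


for p in (0.0, 0.5, 0.8, 0.95):
    print(f"p={p:.2f}  Opt_p@1={opt_p_at_1(rollouts, p):.3f}")
# p=0.00  Opt_p@1=0.750
# p=0.50  Opt_p@1=0.500
# p=0.80  Opt_p@1=0.250
# p=0.95  Opt_p@1=0.125
```

At `p = 0` only correctness matters, so the score is simply the share of passing patches; every increase in `p` can only remove tasks from the solved set.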

### scaling test-time compute

The authors run two scaling sweeps on o4-mini and Claude-3.5-v2:

| Compute axis | Effect on Opt@K |
| --- | --- |
| Parallel rollouts (more samples, same step budget) | Strong gains, e.g. o4-mini reaches 8.82% Opt@K with 50 steps and 8 rollouts versus 1.96% with 400 steps and 1 rollout |
| Serial steps (longer trajectories, same sample count) | Weak gains, performance is roughly flat once you exceed 100 to 200 steps |
| Combined parallel + serial | Best results, but with diminishing returns past 8 rollouts |

With 8 rollouts at 400 steps, Claude-3.5-v2 reaches about 15.7% Opt@10, and o4-mini reaches about 12.7%[1]. This is the headline "~15% with 8 to 10 rollouts" number that gets cited in summaries of the paper. It still sits well below human performance, and the authors note that 75% of agent trajectories terminate before the 100th step even when the budget allows 200 or 400, so more compute alone is not enough.

### performance with ground-truth plans

In one ablation, o4-mini is given the human commit's diff and a back-translated description of the optimization strategy and asked to reimplement it. With those plans, Opt@1 climbs from about 3.5% to 5.7%, and Opt@5 goes from about 9.9% to 16.4%[1]. So strategy is part of the bottleneck, but not all of it. Even when the agent is told *what* to do, it still struggles to write the low-level code that actually does it.

## failure modes

Section 5 of the paper is the qualitative half of the work. Using an LLM-aided pipeline, the authors classify failed trajectories into the categories below, with model-specific percentages reported in Figure 7 of the paper[1].

| Failure category | Example pattern | Claude-3.5-v2 share | o4-mini share |
| --- | --- | --- | --- |
| Wrong abstraction level | Refusing to edit C/C++ when the human commit did | 30.1% of trajectories | 25.1% of trajectories |
| Lazy optimization | Adding spurious `-O3` flags, input-specific fast paths, monkey-patches in `__init__.py` | 16.6% of trajectories | 29.0% of trajectories |
| Exploit-heavy or explore-heavy | Committing too quickly versus wandering forever | 27.2% (exploit-heavy) | 25.7% (explore-heavy) |
| Misdiagnosed bottlenecks | Parallelizing a function whose hot path is elsewhere | 13.2% | 6.6% |
| Less impactful changes | Tweaks that pass tests but barely move the needle | 10.0% | 6.8% |

A particularly sharp slice is the language split. On the 42 Python-only tasks, o4-mini reaches 21.4% Opt@10. On the 60 tasks that require C, C++, Cython, or Rust edits, that drops to 4.0%[1]. The paper also tracks whether models touch C or C++ files at all: o4-mini avoids changing C/C++ in roughly 40% of trajectories where the human did, while Claude-3.5-v2 sometimes makes the opposite mistake (modifying C in 9.2% of patches where the human optimization was Python-only).

A few examples from the appendix help explain those numbers:

- For [NumPy](/wiki/numpy)'s `np.subtract.at`, o4-mini scrolled through the underlying C ufunc files but refused to edit them, then tried to override the function in pure Python.
- On [Pillow](/wiki/pillow), Claude-3.5-v2 broke SIMD pointer arithmetic in a way that caused segmentation faults.
- On NumPy's `char.count`, Claude-3.5-v2 tried to parallelize a function that was already CPU-bound on Python's [GIL](/wiki/global_interpreter_lock), got worse performance, and concluded that NumPy's string operations were already optimal.
- On [Pandas](/wiki/pandas), o4-mini tried to patch `_periodic_strftime` only for monthly format strings by editing `__init__.py`, an input-specific fast path the test suite caught as a generalization failure.

## post-paper updates

GSO has continued to evolve since the original arXiv submission. The project changelog, leaderboard, and GitHub repository record three notable changes:

- **November 2025: Hack Detector.** Reward hacking is a known problem in agent benchmarks (Gu et al., 2025; Lange et al., 2025). The maintainers added a hack detector that flags trajectories using deceptive optimizations such as test-harness manipulation or memoization that only works for the test inputs[2].
- **April 2026: increased inference budget.** New runs default to `max_iterations = 200`, double the previous budget, so frontier reasoning models can be measured under conditions closer to their best operating point[2].
- **Reasoning-effort knob for thinking models.** Newer entries on the leaderboard set `reasoning_effort` for models like Claude Opus 4.6 to keep the comparison fair against models with explicit thinking modes[2].

The core dataset and metric have not changed. All updates so far layer on top of the 102-task evaluation harness rather than replacing it.

## comparison with related benchmarks

The authors compare GSO against several adjacent benchmarks in Figure 2 of the paper[1]. The summary table is reproduced and extended below.

| Benchmark | Repo level | Evaluates runtime | Multilingual | Precise spec | Distinguishing focus |
| --- | --- | --- | --- | --- | --- |
| [HumanEval](/wiki/humaneval) | No | No | No | Yes | Single-function code generation |
| EvalPerf | No | Yes | No | Yes | Runtime efficiency on small programs |
| [ECCO](/wiki/ecco_benchmark) | No | Yes | No | Yes | Code optimization on isolated snippets |
| [LiveCodeBench](/wiki/livecodebench) | No | No | No | Yes | Competitive programming, contamination control |
| [KernelBench](/wiki/kernelbench) | No | Yes | No | Yes | GPU kernel generation |
| [SWE-bench Verified](/wiki/swe_bench_verified) | Yes | No | No | No | Bug fixing in Python repos |
| SWE-Multi / SWE-Multi-Mini | Yes | No | Yes | No | Multilingual bug fixing |
| **GSO** | Yes | Yes | Yes | Yes | Repository-level performance optimization |

GSO is the only benchmark in this set that satisfies all four properties at once. It also requires substantially larger edits: the paper measures gold-patch line counts and finds GSO solutions involve 4 to 15 times more lines than [SWE-bench](/wiki/swe_bench_verified)-style tasks, with a median of 110 lines and a maximum of 2,278 lines per ground-truth commit[1].

The difficulty gap shows up when you compare scores across benchmarks. The same o4-mini that scored about 73% on LiveCodeBench and 56.8% on SWE-bench Verified scored 3.6% on GSO. Algorithmic puzzle skill and bug-fix skill clearly do not transfer to optimization skill.

## evaluation setup

### environment

GSO uses Docker images per task to pin dependencies and toolchains. The repository's `prepare_images.py` script builds these images and can push them to a registry. All paper experiments ran on a single n2-standard-64 Google Cloud VM, but the metric is designed so that scores remain comparable across reasonable hardware.

### agent scaffold

The official scaffold is OpenHands `CodeActAgent-v0.35.0`. The agent gets a file-editor tool and a bash terminal. The default per-task budget is 3 hours of wall-clock time and a 20-minute timeout per step. The default prompt instructs the agent to optimize the runtime of a specified performance test and provides the build and test commands.

### running an evaluation

The public harness installs with [uv](/wiki/uv_python), checks out the repository, prepares the Docker images, and runs `opt_at_k.py` over a JSON file of model predictions. Typical commands look like:

```bash
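# Install uv and put it on PATH for the current shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source "$HOME/.local/bin/env"

# Fetch the harness and install its dependencies into a virtual environment
git clone https://github.com/gso-bench/gso.git
cd gso
uv venv && source .venv/bin/activate
uv sync

# Build the per-task Docker images, then score a predictions file
python scripts/prepare_images.py
python opt_at_k.py --predictions <preds.json> --run_id <name> --model <model_name>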
```

The `predictions` file is a list of model-generated patches keyed by `instance_id`. The harness applies each patch in the appropriate Docker container, runs correctness and performance tests, and reports Opt@K plus per-task speedup ratios.
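
A predictions file of this kind could be assembled as sketched below. The exact schema the harness expects is not given here: only `instance_id` comes from the description above, while the `model_patch` and `model_name_or_path` keys are assumptions borrowed from the SWE-bench convention and should be checked against the repository's documentation. The instance id and diff are placeholders.

```python
import json
import tempfile
from pathlib import Path


def write_predictions(patches: dict[str, str], model_name: str, path: Path) -> None:
    """Write one record per instance; key names other than `instance_id` are assumed."""
    records = [
        {
            "instance_id": instance_id,
            "model_patch": patch,               # assumed key name
            "model_name_or_path": model_name,   # assumed key name
        }
        for instance_id, patch in patches.items()
    ]
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")


# Hypothetical instance id and a placeholder diff, for illustration only.
patches = {"example__repo-abc1234": "diff --git a/pkg/mod.py b/pkg/mod.py\n"}
out = Path(tempfile.mkdtemp()) / "preds.json"
write_predictions(patches, "my-model", out)
print(json.loads(out.read_text(encoding="utf-8"))[0]["instance_id"])
```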

### loading the dataset

```python
from datasets import load_dataset

gso = load_dataset("gso-bench/gso", split="test")
print(len(gso))  # 102
print(gso[0]["instance_id"], gso[0]["repo"])
```

Each entry contains the fields shown below. The Hugging Face dataset card lists every field with its type and meaning[4].

| Field | Description |
| --- | --- |
| `instance_id` | Unique identifier formatted as `<owner>__<repo>-<commit>` |
| `repo` | GitHub `owner/name` of the source repository |
| `base_commit` | Commit hash before the optimization |
| `opt_commit` | Expert human commit that delivers the optimization |
| `api` | Endpoint or API most affected by the change |
| `prob_script` | Generated performance-test specification |
| `tests` | JSON list of performance and correctness tests |
| `hints_text` | Original commit title and message |
| `setup_commands` | Commands to install base dependencies |
| `install_commands` | Commands to build or rebuild after the patch |
| `created_at` | Date of the ground-truth commit |
| `gt_commit_message` | Commit message from the human optimization |
| `gt_diff` | Full unified diff of the human optimization |
| `arch` | Docker image architecture, usually `x86_64` |
| `instance_image_tag` | Docker image tag, usually `latest` |
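
As a small usage sketch, the `repo` field can be used to count instances per codebase. The helper below works on any iterable of dict-like rows, so it can be called on the loaded dataset as well; the sample rows and their identifiers are invented stand-ins so the example runs without a download.

```python
from collections import Counter
from typing import Iterable, Mapping


def tasks_per_repo(rows: Iterable[Mapping[str, str]]) -> Counter:
    """Count benchmark instances per source repository using the `repo` field."""
    return Counter(row["repo"] for row in rows)


# Stand-in rows with the same `instance_id` and `repo` keys as the dataset;
# the identifiers are invented for illustration.
sample = [
    {"instance_id": "numpy__numpy-aaaaaaa", "repo": "numpy/numpy"},
    {"instance_id": "numpy__numpy-bbbbbbb", "repo": "numpy/numpy"},
    {"instance_id": "pandas-dev__pandas-ccccccc", "repo": "pandas-dev/pandas"},
]
for repo, count in tasks_per_repo(sample).most_common():
    print(repo, count)
```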

## takeaways

A few things stand out about what GSO measures and what its results imply.

Optimization is a different skill from correctness. A model can pass algorithm exams and fix bugs in real Python and still be useless at making a [NumPy](/wiki/numpy) ufunc faster. The 3.6% Opt@1 for o4-mini against its 73% LiveCodeBench score is the cleanest demonstration of that gap.

Language depth matters more than headline scores admit. The drop from 21.4% (Python only) to 4.0% (anything that touches C, C++, Cython, or Rust) suggests that frontier models are still Python-flavored when they have to ship a working low-level patch.

More compute helps, but not evenly. Going from 1 rollout to 8 roughly triples Opt@K, while doubling steps per rollout barely moves the number. Test-time scaling on GSO looks more like sample diversity than deeper reasoning.

Agents have a planning problem too. Even when handed the human's diff and a back-translated strategy, o4-mini's Opt@5 reaches just 16.4%. Strategy is part of the bottleneck; execution is the rest.

## limitations

The paper's Limitations section flags four issues directly[1]:

- **Benchmark size.** 102 tasks introduce variance. The team plans to expand based on community feedback.
- **Hacky optimizations.** Reward hacking is a live problem; the November 2025 Hack Detector tries to keep ahead of it.
- **Evaluation beyond speedup.** Real engineering involves trade-offs against memory use, maintainability, and idiomatic style, none of which GSO currently measures.
- **Contamination.** Tasks come from public GitHub repositories, so training-data overlap is plausible. The authors argue that the persistently low scores and the continuous speedup metric make contamination unlikely to fully explain the results.

## see also

- [SWE-bench](/wiki/swe_bench)
- [SWE-bench Verified](/wiki/swe_bench_verified)
- [LiveCodeBench](/wiki/livecodebench)
- [HumanEval](/wiki/humaneval)
- [KernelBench](/wiki/kernelbench)
- [OpenHands](/wiki/openhands)
- [Sky Computing Lab](/wiki/sky_computing_lab)
- [SWE Agent](/wiki/swe_agent)
- [Code Optimization](/wiki/code_optimization)
- [Performance Engineering](/wiki/performance_engineering)

## references

1. Shetty, M., Jain, N., Liu, J., Kethanaboyina, V., Sen, K., Stoica, I. (2025). "GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents." arXiv preprint, arXiv:2505.23671v3, October 24, 2025. [https://arxiv.org/abs/2505.23671](https://arxiv.org/abs/2505.23671)
2. GSO project page and leaderboard, UC Berkeley Sky Computing Lab. [https://gso-bench.github.io/](https://gso-bench.github.io/)
3. UC Berkeley Sky Computing Lab project listing for GSO. [https://sky.cs.berkeley.edu/project/gso/](https://sky.cs.berkeley.edu/project/gso/)
4. GSO dataset card, Hugging Face. [https://huggingface.co/datasets/gso-bench/gso](https://huggingface.co/datasets/gso-bench/gso)
5. GSO benchmark repository, GitHub. [https://github.com/gso-bench/gso](https://github.com/gso-bench/gso)
6. NeurIPS 2025 poster listing for GSO, San Diego, December 3, 2025. [https://neurips.cc/virtual/2025/loc/san-diego/poster/121735](https://neurips.cc/virtual/2025/loc/san-diego/poster/121735)
7. Epoch AI benchmark page for GSO. [https://epoch.ai/benchmarks/gso](https://epoch.ai/benchmarks/gso)