# GSO (Global Software Optimization Bench)

> Source: https://aiwiki.ai/wiki/gso_bench
> Updated: 2026-06-08
> Categories: AI Benchmarks, AI Code Generation
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

## Overview

GSO (Global Software Optimization), introduced in the paper "GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents," is an [AI benchmark](/wiki/ai_benchmark) that measures whether [AI agents](/wiki/ai_agents) and language models can make real software run faster while keeping it correct. Unlike most coding benchmarks, which test functional correctness or bug fixing, GSO targets [software optimization](/wiki/program_optimization): each task gives an agent a real codebase and a performance test, then asks it to produce a code change whose measured runtime speedup matches what a human expert achieved in an actual commit. The paper was published in 2025 by Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica of the University of California, Berkeley [1][2].

GSO contains 102 optimization tasks drawn from 10 widely used open-source codebases spanning five programming languages [1][2]. Its central finding is stark: leading [SWE-agents](/wiki/swe_bench) built on frontier models solve fewer than 5 percent of tasks under the headline metric, far below the human experts whose work defines each task, exposing a large gap in machine performance-engineering ability [1][3].

## Motivation: optimization versus correctness

Most prominent software-engineering benchmarks evaluate whether a model can produce code that works. [SWE-bench](/wiki/swe_bench), for example, asks agents to resolve real GitHub issues so that a hidden test suite passes, rewarding functional correctness and bug resolution. That skill is valuable, but it is only one axis of software engineering. A distinct and economically important capability is making existing, already-correct code faster: profiling a program, localizing bottlenecks, and rewriting hot paths to reduce runtime, memory traffic, or other resource costs without changing behavior [1].

The GSO authors argue that high-performance software development is a specialized skill requiring deep expertise, and that it is poorly captured by correctness-only evaluations [1]. Performance optimization is harder to measure than correctness: a patch either passes a test or it does not, but a speedup is a continuous quantity that depends on hardware, workload, and measurement noise. Verifying that an optimization is genuine, rather than a shortcut that cheats the timing harness, requires careful experimental design. By grounding every task in a real expert commit that delivered a measured improvement, GSO provides an objective, human-anchored target for this otherwise slippery axis of [AI code generation](/wiki/ai_code_generation) [1][2].

## What GSO contains

GSO was assembled through a largely automated two-stage pipeline applied to the version-control histories of major open-source projects, followed by manual curation [1].

In the first stage, a language model scanned commit histories to identify candidate performance-improving commits, using commit messages and heuristic analysis of the code changes. In the second stage, the system generated performance tests for each candidate via execution-based rejection sampling, prompting a model with the commit context to produce workloads that exercise the relevant code path. Each candidate test was run against the code before and after the commit; only commits that produced significant, reproducible speedups across multiple test cases while preserving correctness were kept. A final round of human curation removed weak tests and cases with reproducibility problems, yielding the 102-task set [1].
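The filtering step in the second stage can be pictured with a minimal sketch. This is an illustration only, not the authors' pipeline: the `CandidatePerfTest` record, the `keep_commit` function, and the `min_speedup` and `min_tests` thresholds are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class CandidatePerfTest:
    """One generated performance test, timed on both sides of a commit (hypothetical record)."""
    name: str
    times_before: List[float]  # repeated timings in seconds on the parent commit
    times_after: List[float]   # repeated timings in seconds on the candidate commit
    outputs_match: bool        # True if both versions produced equivalent results


def per_test_speedup(test: CandidatePerfTest) -> float:
    """Speedup of the commit on one test, using medians to dampen timing noise."""
    return median(test.times_before) / median(test.times_after)


def keep_commit(tests: List[CandidatePerfTest],
                min_speedup: float = 1.2,   # assumed threshold, not from the paper
                min_tests: int = 2) -> bool:  # assumed threshold, not from the paper
    """Accept a commit only if enough tests preserve correctness and show a real speedup."""
    accepted = [
        t for t in tests
        if t.outputs_match and per_test_speedup(t) >= min_speedup
    ]
    return len(accepted) >= min_tests
```

Generated tests that fail either check are simply discarded, which is the rejection-sampling idea: many candidate workloads are produced, and only those that demonstrate the speedup on real executions survive.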

The 10 source codebases cover numerical computing, data processing, web serving, image manipulation, and machine-learning infrastructure. They are listed below with their languages.

| Codebase | Domain | Languages |
|---|---|---|
| NumPy | Numerical computing | Python, C, C++ |
| Pandas | Data analysis | Python, Cython |
| Pillow | Image processing | Python, C |
| Pillow-SIMD | Image processing (SIMD) | Python, C |
| Pydantic | Data validation | Python |
| Tornado | Web framework | Python |
| Tokenizers | NLP tokenization | Python, Rust |
| Transformers | ML model library | Python |
| Datasets | ML data library | Python |
| llama.cpp | LLM inference | Python, C, C++ |

Although every task has a Python entry point, the benchmark is deliberately polyglot: by the authors' accounting, about 58.8 percent of tasks require changes to non-Python code, pushing agents into lower-level languages such as C, C++, Cython, and Rust where most real performance gains live [1]. The reference commits are substantial engineering efforts, averaging roughly 250 lines changed (median about 110, maximum over 2,000), and each task ships with an average of around 12 performance tests [1].

## Evaluation: measured speedup

GSO judges an agent's patch by three conditions, all of which must hold for success: the patch must apply cleanly to the codebase, it must pass the project's correctness tests, and it must reach a speedup at least as large as a set fraction of the human expert's improvement [1].

Speedup is computed by timing the test workloads on the original code versus the agent's patched code and aggregating across tests using a harmonic mean, a choice that prevents a single easy or extreme test case from dominating the score [1]. The primary metric is Opt@1, the fraction of tasks where a single attempt both passes correctness and reaches at least 95 percent of the human commit's speedup [1][2]. The framework generalizes this to Opt_p@k: p is the speedup threshold expressed as a fraction of the human result (p = 0.95 is the default, p = 0 reduces to a pure correctness check, and p = 1 demands full parity with the human expert), and k is the number of attempts allowed, supporting an analysis of inference-time scaling [1].
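The scoring rule described above can be sketched in a few lines. This is a reading of the prose, not the official harness: the function names, argument layout, and the exact form of the threshold comparison (`agent >= p * human`) are assumptions made for illustration.

```python
from statistics import harmonic_mean
from typing import Sequence


def aggregate_speedup(base_times: Sequence[float],
                      new_times: Sequence[float]) -> float:
    """Harmonic mean of per-test speedups (original time / patched time)."""
    per_test = [base / new for base, new in zip(base_times, new_times)]
    return harmonic_mean(per_test)


def is_success(applies: bool,
               tests_pass: bool,
               base_times: Sequence[float],
               agent_times: Sequence[float],
               human_times: Sequence[float],
               p: float = 0.95) -> bool:
    """All three conditions: patch applies, tests pass, speedup reaches p of the expert's."""
    if not (applies and tests_pass):
        return False
    agent = aggregate_speedup(base_times, agent_times)
    human = aggregate_speedup(base_times, human_times)
    # Assumed comparison: agent speedup must be at least fraction p of the human speedup.
    return agent >= p * human
```

As a worked case, per-test speedups of 2x and 10x give a harmonic mean of about 3.33x, whereas an arithmetic mean would report 6x. The harmonic mean stays close to the weaker result, so one spectacular test cannot mask a mediocre one.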

Because a clever agent could "optimize" by memoizing results or otherwise gaming the timing harness rather than genuinely improving the code, later versions of the benchmark added a hack detector that compares submitted solutions against oracle implementations and penalizes such deceptive shortcuts [2].

## Results

Across the models and agent scaffolds the authors evaluated, performance was uniformly low. Using the OpenHands CodeActAgent framework, the team tested a range of frontier models including GPT-4o, o3-mini, o4-mini, Claude 3.5 (v2), Claude 3.7, and the Claude 4 family [1]. Under the default Opt@1 metric, even the strongest configuration solved fewer than 5 percent of tasks, with the Claude 4 generation as the best performer and weaker models scoring at or near zero [1][3].

Allowing many more attempts helped only modestly. With inference-time scaling to 10 attempts per task (Opt@10), the best results rose to roughly 15 percent, and gains showed clear diminishing returns rather than closing the gap [1]. The paper characterizes the dominant failure modes as difficulty working in low-level languages, a tendency toward premature or superficial optimization, and an inability to correctly localize the true performance bottleneck [1].
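One way to estimate a k-attempt score from a fixed pool of sampled attempts is the combinatorial estimator commonly used for pass@k. The sketch below assumes that approach; the source does not state how GSO computes Opt@k, and the function names and the per-task success counts are hypothetical.

```python
from math import comb
from typing import Sequence


def task_opt_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts, drawn without replacement
    from n sampled attempts of which c succeeded, is a success."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_opt_at_k(success_counts: Sequence[int], n: int, k: int) -> float:
    """Average the per-task estimate over all tasks in the benchmark."""
    return sum(task_opt_at_k(n, c, k) for c in success_counts) / len(success_counts)
```

Under this estimator, a task with no successful attempts contributes zero at every k, which is consistent with the diminishing returns noted above: extra attempts only help on tasks the agent can already solve occasionally.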

The table below summarizes the headline figures.

| Quantity | Value |
|---|---|
| Tasks | 102 |
| Source codebases | 10 |
| Programming languages | 5 |
| Tasks requiring non-Python changes (approx.) | 58.8% |
| Best single-attempt success (Opt@1) | under 5% |
| Best success with 10 attempts (Opt@10) | about 15% |
| Success threshold | at least 95% of expert speedup, tests passing |

These numbers should be read as a snapshot tied to the models and scaffolds available when the paper was written in 2025; GSO maintains a public leaderboard that tracks newer systems over time [2].

## Significance

GSO matters because it isolates a hard, distinct dimension of coding ability that correctness-focused evaluations miss. The headline result, fewer than 5 percent success against human experts, is one of the lowest reported for a frontier-model coding benchmark, and it stands in sharp contrast to the rapid progress agents have made on correctness-oriented suites such as SWE-bench. That contrast suggests performance engineering is not simply a harder version of bug fixing but a separate competency, one that demands profiling, hardware awareness, and reasoning across language boundaries [1].

By anchoring every task to a real, measured expert optimization and by enforcing precise runtime measurement with safeguards against gaming, GSO offers a rigorous and falsifiable target for an economically valuable skill. The work was presented at the 2025 Conference on Neural Information Processing Systems and has helped seed a line of related benchmarks probing whether agents can optimize real-world repositories and inference workloads [2][3]. For builders of coding agents, GSO marks performance optimization as a clear frontier where current systems remain far behind human experts.

## References

1. Shetty, Manish; Jain, Naman; Liu, Jinjian; Kethanaboyina, Vijay; Sen, Koushik; Stoica, Ion. "GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents." arXiv:2505.23671, 2025. https://arxiv.org/abs/2505.23671
2. GSO Benchmark project website and leaderboard. https://gso-bench.github.io/
3. "GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents." OpenReview (NeurIPS 2025). https://openreview.net/forum?id=I5qDL315bQ