GSO (Global Software Optimization Bench)

Updated Jun 8, 2026

Overview

GSO (Global Software Optimization) is an AI benchmark that measures whether AI agents and language models can make real software run faster while keeping it correct. Unlike most coding benchmarks, which test functional correctness or bug fixing, GSO targets software optimization: each task gives an agent a real codebase and a performance test, then asks it to produce a code change whose measured runtime speedup matches what a human expert achieved in an actual commit. The benchmark was introduced in the 2025 paper "GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents" by Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, and Ion Stoica of the University of California, Berkeley ^[1]^[2].

GSO contains 102 optimization tasks drawn from 10 widely used open-source codebases spanning five programming languages ^[1]^[2]. Its central finding is stark: leading SWE-agents built on frontier models solve fewer than 5 percent of tasks under the headline metric, far below the human experts whose work defines each task, exposing a large gap in machine performance-engineering ability ^[1]^[3].

Motivation: optimization versus correctness

Most prominent software-engineering benchmarks evaluate whether a model can produce code that works. SWE-bench, for example, asks agents to resolve real GitHub issues so that a hidden test suite passes, rewarding functional correctness and bug resolution. That skill is valuable, but it is only one axis of software engineering. A distinct and economically important capability is making existing, already-correct code faster: profiling a program, localizing bottlenecks, and rewriting hot paths to reduce runtime, memory traffic, or other resource costs without changing behavior ^[1].

The GSO authors argue that high-performance software development is a specialized skill requiring deep expertise, and that it is poorly captured by correctness-only evaluations ^[1]. Performance optimization is harder to measure than correctness: a patch either passes a test or it does not, but a speedup is a continuous quantity that depends on hardware, workload, and measurement noise. Verifying that an optimization is genuine, rather than a shortcut that cheats the timing harness, requires careful experimental design. By grounding every task in a real expert commit that delivered a measured improvement, GSO provides an objective, human-anchored target for this otherwise slippery axis of AI code generation ^[1]^[2].

What GSO contains

GSO was assembled through a largely automated two-stage pipeline applied to the version-control histories of major open-source projects, followed by manual curation ^[1].

In the first stage, a language model scanned commit histories to identify candidate performance-improving commits, using commit messages and heuristic analysis of the code changes. In the second stage, the system generated performance tests for each candidate via execution-based rejection sampling, prompting a model with the commit context to produce workloads that exercise the relevant code path. Each candidate test was run against the code before and after the commit; only commits that produced significant, reproducible speedups across multiple test cases while preserving correctness were kept. A final round of human curation removed weak tests and cases with reproducibility problems, yielding the 102-task set ^[1].
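
The keep-or-reject step of the second stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `CandidateTest` type, the `min_speedup` threshold of 1.2, and the use of a median over repeated runs are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, List


@dataclass
class CandidateTest:
    """A generated performance test (hypothetical structure for illustration)."""
    name: str
    time_before: Callable[[], float]   # runtime in seconds on the pre-commit code
    time_after: Callable[[], float]    # runtime in seconds on the post-commit code
    outputs_match: Callable[[], bool]  # True if both versions produce the same result


def measured_speedup(test: CandidateTest, repeats: int = 5) -> float:
    """Median-of-repeats speedup, to damp measurement noise (an assumed policy)."""
    before = median(test.time_before() for _ in range(repeats))
    after = median(test.time_after() for _ in range(repeats))
    return before / after


def keep_tests(candidates: List[CandidateTest],
               min_speedup: float = 1.2,
               repeats: int = 5) -> List[CandidateTest]:
    """Reject any test that breaks correctness or fails to show a clear speedup."""
    kept = []
    for test in candidates:
        if not test.outputs_match():
            continue  # the commit changed behavior on this workload
        if measured_speedup(test, repeats) >= min_speedup:
            kept.append(test)
    return kept
```

In this sketch a commit would survive only if enough of its generated tests pass the filter; how many tests count as "enough" is left open here because the article does not state it.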

The 10 source codebases cover numerical computing, data processing, web serving, image manipulation, and machine-learning infrastructure. They are listed below with their languages.

Codebase	Domain	Languages
NumPy	Numerical computing	Python, C, C++
Pandas	Data analysis	Python, Cython
Pillow	Image processing	Python, C
Pillow-SIMD	Image processing (SIMD)	Python, C
Pydantic	Data validation	Python
Tornado	Web framework	Python
Tokenizers	NLP tokenization	Python, Rust
Transformers	ML model library	Python
Datasets	ML data library	Python
llama.cpp	LLM inference	Python, C, C++

Although every task has a Python entry point, the benchmark is deliberately polyglot: by the authors' accounting, 58.8 percent of tasks require changes to non-Python code, pushing agents into lower-level languages such as C, C++, Cython, and Rust, where most real performance gains live ^[1]. The reference commits are substantial engineering efforts, averaging roughly 250 lines changed (median about 110, maximum over 2,000), and each task ships with around 12 performance tests on average ^[1].

Evaluation: measured speedup

GSO judges an agent's patch by three conditions, all of which must hold for success: the patch must apply cleanly to the codebase, it must pass the project's correctness tests, and it must reach a speedup at least as large as a set fraction of the human expert's improvement ^[1].

Speedup is computed by timing the test workloads on the original code versus the agent's patched code and aggregating across tests using a harmonic mean, a choice that prevents a single easy or extreme test case from dominating the score ^[1]. The primary metric is Opt@1, the fraction of tasks where a single attempt both passes correctness and reaches at least 95 percent of the human commit's speedup ^[1]^[2]. The framework generalizes this to Opt_p@k: p is the speedup threshold expressed as a fraction of the human result (p = 0.95 is the default, p = 0 reduces to a pure correctness check, and p = 1 demands full parity with the human expert), and k is the number of attempts allowed, supporting an analysis of inference-time scaling ^[1].
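
The scoring rule described above can be sketched in a few lines of Python. This is a reading of the article's description rather than the benchmark's actual harness: the function names are hypothetical, the success test `agent_speedup >= p * human_speedup` is one interpretation of "at least 95 percent of the human commit's speedup," and counting a task as solved when any of its first k attempts succeeds is an assumed simplification of how multiple attempts are scored.

```python
from statistics import harmonic_mean
from typing import List, Sequence


def aggregate_speedup(base_times: Sequence[float],
                      patched_times: Sequence[float]) -> float:
    """Harmonic mean of per-test speedups (original runtime / patched runtime)."""
    per_test = [base / patched for base, patched in zip(base_times, patched_times)]
    return harmonic_mean(per_test)


def attempt_succeeds(applies: bool, tests_pass: bool,
                     agent_speedup: float, human_speedup: float,
                     p: float = 0.95) -> bool:
    """All three conditions must hold: applies, correct, and fast enough."""
    return applies and tests_pass and agent_speedup >= p * human_speedup


def opt_at_k(attempts_per_task: List[List[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts."""
    solved = sum(any(attempts[:k]) for attempts in attempts_per_task)
    return solved / len(attempts_per_task)


# One extreme test (10x) and one modest test (2x): the harmonic mean stays
# close to the modest result, where an arithmetic mean would report 6x.
speedup = aggregate_speedup([2.0, 10.0], [1.0, 1.0])
```

Here `speedup` works out to 2 / (1/2 + 1/10) = 10/3, or about 3.33, which shows how the harmonic mean keeps one outlier test from carrying the score.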

Because a clever agent could "optimize" by memoizing results or otherwise gaming the timing harness rather than genuinely improving the code, later versions of the benchmark added a hack detector that compares submitted solutions against oracle implementations and penalizes such deceptive shortcuts ^[2].
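
A toy example shows why memoization can fool a naive timing harness. It does not reproduce GSO's hack detector, whose internals are not described here; the `checksum` workload and the `work_done` counter are hypothetical, and real work is counted instead of wall-clock time so the result is deterministic.

```python
from functools import lru_cache

# Counts how many times the expensive computation actually runs.
calls = {"real_work": 0}


def checksum(data: tuple) -> int:
    """Stand-in for an expensive function under test."""
    calls["real_work"] += 1
    return sum(x * x for x in data) % 1_000_003


@lru_cache(maxsize=None)
def checksum_memoized(data: tuple) -> int:
    """A fake 'optimization': same algorithm, results cached by input."""
    return checksum(data)


def work_done(fn, inputs) -> int:
    """Run fn over the inputs and report how many real computations occurred."""
    getattr(fn, "cache_clear", lambda: None)()  # start each run with a cold cache
    calls["real_work"] = 0
    for item in inputs:
        fn(item)
    return calls["real_work"]


repeated = [tuple(range(1000))] * 50                     # same input 50 times
fresh = [tuple(range(i, i + 1000)) for i in range(50)]   # 50 distinct inputs

gamed = work_done(checksum_memoized, repeated)   # cache hides 49 of 50 calls
honest = work_done(checksum_memoized, fresh)     # no benefit on new inputs
```

A harness that times the same input repeatedly would credit the memoized version with a large speedup, while a harness that varies its inputs would see none, which is the kind of discrepancy a gaming check needs to catch.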

Results

Across the models and agent scaffolds the authors evaluated, performance was uniformly low. Using the OpenHands CodeActAgent framework, the team tested a range of frontier models including GPT-4o, o3-mini, o4-mini, Claude 3.5 (v2), Claude 3.7, and the Claude 4 family ^[1]. Under the default Opt@1 metric, even the strongest configuration solved fewer than 5 percent of tasks, with the Claude 4 generation as the best performer and weaker models scoring at or near zero ^[1]^[3].

Allowing many more attempts helped only modestly. With inference-time scaling to 10 attempts per task (Opt@10), the best results rose to roughly 15 percent, and gains showed clear diminishing returns rather than closing the gap ^[1]. The paper characterizes the dominant failure modes as difficulty working in low-level languages, a tendency toward premature or superficial optimization, and an inability to correctly localize the true performance bottleneck ^[1].

The table below summarizes the headline figures.

Quantity	Value
Tasks	102
Source codebases	10
Programming languages	5
Tasks requiring non-Python changes	58.8%
Best single-attempt success (Opt@1)	under 5%
Best success with 10 attempts (Opt@10)	about 15%
Success threshold	at least 95% of expert speedup, tests passing

These numbers should be read as a snapshot tied to the models and scaffolds available when the paper was written in 2025; GSO maintains a public leaderboard that tracks newer systems over time ^[2].

Significance

GSO matters because it isolates a hard, distinct dimension of coding ability that correctness-focused evaluations miss. The headline result, fewer than 5 percent success against human experts, is one of the lowest reported for a frontier-model coding benchmark, and it stands in sharp contrast to the rapid progress agents have made on correctness-oriented suites such as SWE-bench. That contrast suggests performance engineering is not simply a harder version of bug fixing but a separate competency, one that demands profiling, hardware awareness, and reasoning across language boundaries ^[1].

By anchoring every task to a real, measured expert optimization and by enforcing precise runtime measurement with safeguards against gaming, GSO offers a rigorous and falsifiable target for an economically valuable skill. The work was presented at the 2025 Conference on Neural Information Processing Systems and has helped seed a line of related benchmarks probing whether agents can optimize real-world repositories and inference workloads ^[2]^[3]. For builders of coding agents, GSO marks performance optimization as a clear frontier where current systems remain far behind human experts.

References

[1] Shetty, Manish; Jain, Naman; Liu, Jinjian; Kethanaboyina, Vijay; Sen, Koushik; Stoica, Ion. "GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents." arXiv:2505.23671, 2025. https://arxiv.org/abs/2505.23671
[2] GSO Benchmark project website and leaderboard. https://gso-bench.github.io/
[3] "GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents." OpenReview (NeurIPS 2025). https://openreview.net/forum?id=I5qDL315bQ
