GSO
| GSO | |
|---|---|
| Overview | |
| Full name | Global Software Optimization |
| Abbreviation | GSO |
| Description | A benchmark evaluating language models' capabilities in software performance optimization through real-world code optimization tasks |
| Release date | 2025-05 |
| Latest version | 1.0 |
| Benchmark updated | 2025-05-30 |
| Authors | Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica |
| Organization | UC Berkeley Sky Computing Lab |
| Technical Details | |
| Type | Software Optimization, Code Performance, Multi-language Programming |
| Modality | Code, Text |
| Task format | Performance optimization patches |
| Number of tasks | 102 |
| Total examples | 102 optimization tasks across 10 codebases |
| Evaluation metric | Opt@1, Opt@K, Speedup ratio |
| Domains | Scientific computing, Data processing, Image processing, Machine learning |
| Languages | Python, C, C++, Rust, Cython (with SIMD intrinsics) |
| Performance | |
| Human performance | 100% (baseline) |
| Baseline | Expert developer optimizations |
| SOTA score | 8.8% |
| SOTA model | O3 (high) |
| SOTA date | 2025-05 |
| Saturated | No |
| Resources | |
| Website | https://gso-bench.github.io/ |
| Paper | https://arxiv.org/abs/2505.23671 |
| GitHub | https://github.com/gso-bench/gso |
| Dataset | https://huggingface.co/datasets/gso-bench/gso |
| License | MIT |
GSO (Global Software Optimization) is an artificial intelligence benchmark designed to evaluate language models' ability to develop high-performance software through optimization tasks. Released in May 2025 by researchers from UC Berkeley's Sky Computing Lab[1], GSO challenges AI systems to improve the runtime efficiency of existing codebases by generating performance-improving code patches that match or exceed human expert optimizations. The benchmark shows that even state-of-the-art models struggle: the best model (O3) succeeds on only 8.8% of these real-world optimization tasks, highlighting a critical gap between current AI capabilities and the demands of production software engineering.
Overview
GSO represents a paradigm shift in evaluating AI systems for software engineering tasks by focusing specifically on performance optimization rather than bug fixing or code generation. The benchmark consists of 102 challenging optimization tasks across 10 popular codebases, spanning 5 programming languages and 8 different domains[2]. Unlike traditional benchmarks that focus on correctness, GSO evaluates whether AI systems can understand and improve the performance characteristics of complex software systems, a critical capability for real-world software development.
Motivation
The creation of GSO was motivated by several key observations:
- **Performance criticality**: Software performance optimization is essential for production systems
- **Evaluation gap**: Existing benchmarks focus on bug fixing rather than optimization
- **Real-world complexity**: Production optimizations require multi-file, multi-language changes
- **AI limitations**: Success on coding benchmarks doesn't translate to optimization capabilities
- **Industry needs**: Growing demand for AI assistance in performance engineering
Benchmark Design
Task Construction
GSO's tasks are derived from real optimization commits in popular open-source repositories:
| Repository | Domain | Primary Language | Example Optimizations |
|---|---|---|---|
| NumPy | Scientific computing | Python/C | Vectorization, memory layout optimization |
| Pandas | Data processing | Python/Cython | Algorithm improvements, caching |
| Pillow | Image processing | Python/C | SIMD operations, buffer management |
| Llama-CPP | Machine learning | C++ | GPU optimization, parallelization |
| scikit-learn | Machine learning | Python/Cython | Algorithm optimization, vectorization |
| matplotlib | Visualization | Python/C++ | Rendering optimization, caching |
| Additional repos | Various | Mixed | Domain-specific optimizations |
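The example optimizations above can be made concrete. The snippet below is a purely illustrative vectorization rewrite, not a task drawn from the benchmark: a Python-level double loop over NumPy arrays is replaced by an equivalent broadcasting and matrix-multiply formulation.

```python
import numpy as np

def pairwise_sq_dists_slow(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Baseline: quadratic Python-level loops, one small NumPy call per pair."""
    out = np.empty((a.shape[0], b.shape[0]))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            out[i, j] = np.sum((a[i] - b[j]) ** 2)
    return out

def pairwise_sq_dists_fast(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Optimized: identical result via broadcasting and a single matrix multiply."""
    return (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T

a, b = np.random.rand(200, 64), np.random.rand(300, 64)
assert np.allclose(pairwise_sq_dists_slow(a, b), pairwise_sq_dists_fast(a, b))
```

GSO tasks are generally harder than this call-site rewrite suggests: as noted below, many require changes inside the libraries' C, C++, Cython, or Rust internals rather than in user-facing Python.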
Task Characteristics
Each GSO task exhibits several distinguishing features[1]:
| Characteristic | GSO | Traditional Benchmarks | Significance |
|---|---|---|---|
| **Edit scope** | 4-15× more lines | Single function/file | Real-world complexity |
| **Language diversity** | ~60% require non-Python | Mostly single language | Systems programming skills |
| **File span** | Multiple files/modules | Usually single file | Architectural understanding |
| **Performance focus** | Primary objective | Correctness only | Different skill set |
| **Solution space** | Multiple valid approaches | Single correct answer | Creative problem-solving |
Evaluation Methodology
Performance Metrics
GSO employs rigorous performance evaluation metrics:
| Metric | Definition | Calculation | Success Threshold |
|---|---|---|---|
| **Opt@1** | Single-attempt success rate | Tasks achieving ≥95% speedup / Total tasks | ≥95% of human speedup |
| **Opt@K** | Best-of-K success rate | Tasks with any success in K attempts / Total tasks | ≥95% of human speedup |
| **Speedup Ratio** | Performance improvement | Optimized runtime / Original runtime | Lower is better |
| **Edit Distance** | Solution complexity | Lines changed in patch | Informational only |
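As a concrete reading of these definitions, the sketch below aggregates per-task attempt outcomes into Opt@1 and Opt@K. The data layout is an illustrative assumption, not the official GSO harness.

```python
# Illustrative only: attempt_success[task_id] records, per attempt, whether that
# attempt reached >= 95% of the human expert's speedup (the threshold above).

def opt_at_k(attempt_success: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts."""
    solved = sum(any(attempts[:k]) for attempts in attempt_success.values())
    return solved / len(attempt_success)

example = {
    "numpy__numpy-12345": [False, True, False, False, False, False, False, True],
    "pillow__pillow-678": [False] * 8,   # hypothetical instance id
}
print(opt_at_k(example, k=1))  # Opt@1 = 0.0
print(opt_at_k(example, k=8))  # Opt@8 = 0.5
```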
Evaluation Pipeline
The GSO evaluation process follows a structured pipeline:
1. **Task Initialization**: Load problem specification and performance tests
2. **Agent Execution**: Generate optimization patch within resource limits
3. **Correctness Validation**: Ensure patch doesn't break functionality
4. **Performance Measurement**: Compare runtime against baseline
5. **Success Determination**: Check if 95% threshold met
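Steps 4 and 5 reduce to comparing three measured runtimes: the base commit, the model's patch, and the human expert's patch. A minimal sketch of that comparison, assuming the timing harness is handled elsewhere:

```python
def fraction_of_human_speedup(base_runtime: float,
                              model_runtime: float,
                              human_runtime: float) -> float:
    """Speedup of the model's patch over the base commit, normalized by the expert's speedup."""
    model_speedup = base_runtime / model_runtime
    human_speedup = base_runtime / human_runtime
    return model_speedup / human_speedup

# Hypothetical task: baseline 10.0 s, expert patch 2.0 s (5x), model patch 2.5 s (4x).
frac = fraction_of_human_speedup(10.0, 2.5, 2.0)
print(round(frac, 2))   # 0.8
print(frac >= 0.95)     # False: this attempt does not count as solved
```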
Automated Test Generation
GSO introduces an innovative automated pipeline for generating performance tests[3]:
```python
# GSO's 4-step test generation framework
def generate_performance_test(repo, commit):
    # Step 1: Extract optimization commit using LLMs
    opt_commit = extract_optimization_commit(repo, commit)
    # Step 2: Identify affected APIs
    apis = identify_apis(opt_commit)
    # Step 3: Generate performance test
    test = generate_test_for_apis(apis)
    # Step 4: Validate test execution
    validate_test(test, repo, commit)
    return test
```
Current Performance
Model Leaderboard (May 2025)
| Rank | Model | Opt@1 | Opt@8 | Avg Speedup | Languages Handled |
|---|---|---|---|---|---|
| 1 | O3 (high) | 8.8% | ~20% | 0.91× | All |
| 2 | Claude-4-Opus/Claude-4-Sonnet | 4.9% | ~15% | 0.92× | All |
| 3 | GPT-4o | <5% | <10% | 0.95× | Python-heavy |
| 4 | O1-preview | <5% | <8% | 0.96× | Limited |
| 5 | DeepSeek-V3 | <5% | <5% | 0.98× | Python only |
| - | Human Expert | 100% | 100% | 1.0× | All |
Performance Analysis by Task Type
| Task Category | Human Success | Best AI | Gap |
|---|---|---|---|
| **Algorithm optimization** | 100% | 8% | 92% |
| **Memory management** | 100% | 3% | 97% |
| **Parallelization** | 100% | 2% | 98% |
| **SIMD/Vectorization** | 100% | 0% | 100% |
| **Caching strategies** | 100% | 5% | 95% |
Key Findings
Failure Mode Analysis
GSO reveals systematic failure patterns in current AI systems[1]:
| Failure Mode | Frequency | Description | Example |
|---|---|---|---|
| **Abstraction hierarchy** | 25-30% | Avoiding necessary low-level changes | Refusing to modify C code when needed |
| **Lazy optimization** | 30% | Preferring trivial changes | Adding simple caching instead of algorithmic improvements |
| **Premature termination** | 75% | Not using full compute budget | Stopping after 25% of allowed steps |
| **Cross-language barriers** | 60% | Inability to work across languages | Python-only solutions for C++ problems |
| **Performance blindness** | 40% | No understanding of performance implications | Random changes without profiling |
Scaling Laws
GSO provides insights into compute scaling for optimization tasks:
| Scaling Type | Performance Impact | Efficiency | Recommendation |
|---|---|---|---|
| **Parallel (multiple rollouts)** | Moderate improvement | Good | Preferred approach |
| **Serial (longer reasoning)** | Minimal improvement | Poor | Not recommended |
| **Hybrid approaches** | Best results | Moderate | Future research direction |
| **Increased model size** | Limited benefit | Poor | Not sufficient alone |
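Parallel scaling here means sampling several independent rollouts per task and reporting Opt@K. If Opt@K under parallel sampling is estimated the way pass@k commonly is (an assumption; the GSO harness may compute it differently), the standard unbiased estimator from n sampled attempts with c successes looks like this:

```python
from math import comb

def opt_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k-style estimate: probability that at least one of k
    attempts, drawn from n samples of which c succeeded, is a success."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 rollouts on one task, 2 of which hit the 95% threshold.
print(round(opt_at_k_estimate(n=16, c=2, k=1), 3))  # 0.125
print(round(opt_at_k_estimate(n=16, c=2, k=8), 3))  # 0.767
```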
Technical Implementation
Installation and Setup
GSO provides comprehensive tooling for evaluation:
```bash
# Install uv (environment manager used for GSO setup)
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Clone repository
git clone https://github.com/gso-bench/gso.git
cd gso

# Setup environment
uv venv && source .venv/bin/activate
uv sync

# Prepare Docker environments
python scripts/prepare_docker_images.py
```
Dataset Access
The GSO dataset is available through multiple channels[4]:
```python
from datasets import load_dataset

# Load GSO dataset
gso_dataset = load_dataset('gso-bench/gso', split='test')

# Access individual tasks
for task in gso_dataset:
    instance_id = task['instance_id']
    repo = task['repo']
    optimization_commit = task['opt_commit']
    performance_tests = task['tests']
```
Task Structure
Each GSO task contains:
| Field | Description | Example |
|---|---|---|
| `instance_id` | Unique identifier | "numpy__numpy-12345" |
| `repo` | Repository name | "numpy/numpy" |
| `base_commit` | Starting commit | "abc123..." |
| `opt_commit` | Target optimization | "def456..." |
| `api` | Affected functions | ["np.dot", "np.matmul"] |
| `prob_script` | Problem specification | Performance test code |
| `tests` | Validation tests | Unit and performance tests |
| `hints_text` | Optimization hints | "Consider vectorization" |
| `gt_diff` | Ground truth patch | Actual optimization changes |
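A quick way to sanity-check a record against this schema (a sketch that assumes the HuggingFace fields match the table above):

```python
from datasets import load_dataset

EXPECTED_FIELDS = ["instance_id", "repo", "base_commit", "opt_commit",
                   "api", "prob_script", "tests", "hints_text", "gt_diff"]

task = load_dataset("gso-bench/gso", split="test")[0]
missing = [f for f in EXPECTED_FIELDS if f not in task]
print("missing fields:", missing or "none")
print(task["instance_id"], "->", task["repo"])
```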
Significance and Impact
Research Implications
GSO's findings have significant implications for AI research:
1. **Capability gap**: Reveals fundamental limitations in current AI systems
2. **New research directions**: Highlights need for specialized optimization models
3. **Evaluation standards**: Establishes rigorous benchmarks for performance tasks
4. **Scaling insights**: Provides data on compute scaling effectiveness
Industry Applications
| Application Area | Current State | GSO Insights | Future Potential |
|---|---|---|---|
| **Code review** | AI assists with style | Performance blind spots | Performance-aware review |
| **CI/CD pipelines** | Basic automation | Cannot optimize | Automated optimization |
| **Legacy modernization** | Manual process | AI struggles with complexity | Guided optimization |
| **Performance debugging** | Limited AI help | Poor understanding | Improved tools needed |
Related Work
Comparison with Other Benchmarks
| Benchmark | Focus | Task Count | Languages | GSO Advantage |
|---|---|---|---|---|
| SWE-Bench | Bug fixing | 2,294 | Python | 4-15× more complex edits |
| HumanEval | Code generation | 164 | Python | Real-world optimization |
| MBPP | Programming problems | 974 | Python | Performance focus |
| LiveCodeBench | General coding | Variable | Multiple | Optimization specific |
| GSO | Performance optimization | 102 | 5 languages | Unique focus area |
Limitations and Future Work
Current Limitations
1. **Limited scale**: 102 tasks may not cover all optimization patterns
2. **Domain coverage**: Focus on specific repositories
3. **Language bias**: Heavy emphasis on the Python ecosystem
4. **Measurement challenges**: Performance can be hardware-dependent
5. **Single-shot evaluation**: No iterative refinement allowed
Future Directions
| Direction | Description | Timeline |
|---|---|---|
| **Expanded coverage** | More repositories and languages | 2025-2026 |
| **Interactive mode** | Allow iterative optimization | 2026 |
| **Hardware diversity** | GPU, TPU optimization tasks | 2026 |
| **Specialized models** | Optimization-focused architectures | Research ongoing |
| **Industry integration** | Real production systems | 2026-2027 |
Significance
GSO represents a critical advancement in evaluating AI systems for real-world software engineering tasks. By focusing on performance optimization, a skill essential for production software development, the benchmark reveals that current AI systems, despite their impressive capabilities in code generation and bug fixing, fundamentally lack the understanding necessary for effective software optimization. The stark performance gap (an 8.8% success rate for the best model) highlights both the challenge ahead and the potential impact of solving this problem.
The benchmark's rigorous evaluation methodology, automated test generation pipeline, and focus on real-world complexity make it an essential tool for advancing AI capabilities in software engineering. As software performance becomes increasingly critical in the era of cloud computing and mobile devices, GSO provides the foundation for developing AI systems that can truly assist with the full spectrum of software development tasks.
See Also
- Software Optimization
- Performance Engineering
- SWE-Bench
- Code Generation
- UC Berkeley
- Sky Computing Lab
- Multi-language Programming
- Benchmark Evaluation
References
- ↑ Shetty, M., Jain, N., Liu, J., Kethanaboyina, V., Sen, K., & Stoica, I. (2025). "GSO: A Global Software Optimization Benchmark". arXiv:2505.23671. Retrieved from https://arxiv.org/abs/2505.23671
- ↑ GSO Team. (2025). "GSO: Global Software Optimization Benchmark". Retrieved from https://gso-bench.github.io/
- ↑ UC Berkeley Sky Lab. (2025). "GSO Project". Retrieved from https://sky.cs.berkeley.edu/project/gso/
- ↑ GSO Team. (2025). "GSO Dataset". HuggingFace. Retrieved from https://huggingface.co/datasets/gso-bench/gso