| GSO | |
|---|---|
| Overview | |
| Full name | Global Software Optimization |
| Abbreviation | GSO |
| Description | A benchmark evaluating language models' capabilities in software performance optimization through real-world code optimization tasks |
| Release date | 2025-05 |
| Latest version | 1.0 |
| Benchmark updated | 2025-05-30 |
| Authors | Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica |
| Organization | UC Berkeley Sky Computing Lab |
| Technical Details | |
| Type | Software Optimization, Code Performance, Multi-language Programming |
| Modality | Code, Text |
| Task format | Performance optimization patches |
| Number of tasks | 102 |
| Total examples | 102 optimization tasks across 10 codebases |
| Evaluation metric | Opt@1, Opt@K, Speedup ratio |
| Domains | Scientific computing, Data processing, Image processing, Machine learning |
| Languages | Python, C, C++, Cython, Rust (plus SIMD intrinsics) |
| Performance | |
| Human performance | 100% (baseline) |
| Baseline | Expert developer optimizations |
| SOTA score | 8.8% |
| SOTA model | O3 (high) |
| SOTA date | 2025-05 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | Download |
| License | MIT |

GSO (Global Software Optimization) is a comprehensive artificial intelligence benchmark designed to evaluate language models' capabilities in developing high-performance software through optimization tasks. Released in May 2025 by researchers from UC Berkeley's Sky Computing Lab[1], GSO challenges AI systems to improve the runtime efficiency of existing codebases by generating performance-improving code patches that match or exceed human expert optimizations. The benchmark reveals that even state-of-the-art models struggle with optimization tasks, with the best model (O3) achieving only an 8.8% success rate on these real-world optimization tasks, highlighting a critical gap between current AI capabilities and the demands of production software engineering.
GSO represents a paradigm shift in evaluating AI systems for software engineering tasks by focusing specifically on performance optimization rather than bug fixing or code generation. The benchmark consists of 102 challenging optimization tasks across 10 popular codebases, spanning 5 programming languages and 8 different domains[2]. Unlike traditional benchmarks that focus on correctness, GSO evaluates whether AI systems can understand and improve the performance characteristics of complex software systems, a critical capability for real-world software development.
The creation of GSO was motivated by the observation that existing software engineering benchmarks evaluate correctness but not the runtime performance of the code models produce. To ground its tasks in real engineering practice, GSO derives them from actual optimization commits in popular open-source repositories:
| Repository | Domain | Primary Language | Example Optimizations |
|---|---|---|---|
| NumPy | Scientific computing | Python/C | Vectorization, memory layout optimization |
| Pandas | Data processing | Python/Cython | Algorithm improvements, caching |
| Pillow | Image processing | Python/C | SIMD operations, buffer management |
| Llama-CPP | Machine learning | C++ | GPU optimization, parallelization |
| scikit-learn | Machine learning | Python/Cython | Algorithm optimization, vectorization |
| matplotlib | Visualization | Python/C++ | Rendering optimization, caching |
| Additional repos | Various | Mixed | Domain-specific optimizations |
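As an illustrative sketch of the kind of change these optimization commits contain (a hypothetical example, not an actual GSO task), a common NumPy-style optimization replaces an explicit Python loop with a vectorized expression that executes in C:

```python
import numpy as np

def row_norms_slow(x):
    # Original: per-row Python loop with interpreter overhead on every element
    out = []
    for row in x:
        out.append(sum(v * v for v in row) ** 0.5)
    return np.array(out)

def row_norms_fast(x):
    # Optimized: one vectorized expression, executed entirely in compiled code
    return np.sqrt((x * x).sum(axis=1))

x = np.arange(12, dtype=np.float64).reshape(4, 3)
assert np.allclose(row_norms_slow(x), row_norms_fast(x))
```

Both functions compute the same row norms; the vectorized version is typically orders of magnitude faster on large arrays, which is exactly the kind of speedup the benchmark's human-expert commits achieve.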
Each GSO task exhibits several distinguishing features[1]:
| Characteristic | GSO | Traditional Benchmarks | Significance |
|---|---|---|---|
| **Edit scope** | 4-15× more lines | Single function/file | Real-world complexity |
| **Language diversity** | ~60% require non-Python | Mostly single language | Systems programming skills |
| **File span** | Multiple files/modules | Usually single file | Architectural understanding |
| **Performance focus** | Primary objective | Correctness only | Different skill set |
| **Solution space** | Multiple valid approaches | Single correct answer | Creative problem-solving |
GSO employs rigorous performance evaluation metrics:
| Metric | Definition | Calculation | Success Threshold |
|---|---|---|---|
| **Opt@1** | Single-attempt success rate | Tasks achieving ≥95% of human speedup / Total tasks | ≥95% of human speedup |
| **Opt@K** | Best-of-K success rate | Tasks with any success in K attempts / Total tasks | ≥95% of human speedup |
| **Speedup Ratio** | Performance improvement | Optimized runtime / Original runtime | Lower is better |
| **Edit Distance** | Solution complexity | Lines changed in patch | Informational only |
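Under these definitions, and assuming per-task boolean success records (one flag per attempt, a representation chosen here for illustration rather than taken from the GSO codebase), Opt@K reduces to a simple fraction:

```python
def opt_at_k(results, k):
    """Fraction of tasks with at least one success in the first k attempts.

    results: list of per-task lists of booleans, where True means that
    attempt reached >=95% of the human expert's speedup.
    """
    solved = sum(1 for attempts in results if any(attempts[:k]))
    return solved / len(results)

# Three tasks, made-up attempt records:
results = [
    [False, True, False],   # solved on attempt 2
    [False, False, False],  # never solved
    [True],                 # solved on attempt 1
]
print(opt_at_k(results, 1))  # Opt@1 = 1/3
print(opt_at_k(results, 8))  # Opt@8 = 2/3
```

Opt@1 counts only the first attempt per task, which is why the leaderboard's Opt@8 numbers are substantially higher than Opt@1.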
The GSO evaluation process follows a structured pipeline:
1. **Task Initialization**: Load problem specification and performance tests
2. **Agent Execution**: Generate optimization patch within resource limits
3. **Correctness Validation**: Ensure patch doesn't break functionality
4. **Performance Measurement**: Compare runtime against baseline
5. **Success Determination**: Check if 95% threshold met
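The evaluation loop above can be sketched as a minimal harness. The callables here are stand-ins for the real infrastructure (which runs patches inside Docker containers); only the success criterion, reaching at least 95% of the human expert's speedup, is taken from the benchmark's definition:

```python
def evaluate_patch(apply_patch, run_tests, measure_runtime,
                   baseline_runtime, human_runtime, threshold=0.95):
    """Sketch of the five-step GSO evaluation loop."""
    apply_patch()                        # step 2: apply the agent's patch
    if not run_tests():                  # step 3: correctness validation
        return {"status": "incorrect", "success": False}
    patched = measure_runtime()          # step 4: performance measurement
    # step 5: success requires >= threshold of the human expert's speedup
    achieved = (baseline_runtime / patched) / (baseline_runtime / human_runtime)
    return {"status": "ok", "success": achieved >= threshold,
            "fraction_of_human_speedup": achieved}

# Toy run: the patch halves runtime (10s -> 5s); the human got it to 4.8s.
result = evaluate_patch(lambda: None, lambda: True, lambda: 5.0,
                        baseline_runtime=10.0, human_runtime=4.8)
print(result["success"])  # True: a 2.0x speedup is ~96% of the human's ~2.08x
```

Note that a patch which breaks the functional tests fails outright regardless of how fast it runs, so correctness is a hard gate before performance is ever measured.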
GSO introduces an innovative automated pipeline for generating performance tests[3]:
```python
def generate_performance_test(repo, commit):
    # Step 1: Extract optimization commit using LLMs
    opt_commit = extract_optimization_commit(repo, commit)
    # Step 2: Identify affected APIs
    apis = identify_apis(opt_commit)
    # Step 3: Generate performance test
    test = generate_test_for_apis(apis)
    # Step 4: Validate test execution
    validate_test(test, repo, commit)
    return test
```
| Rank | Model | Opt@1 | Opt@8 | Avg Speedup | Languages Handled |
|---|---|---|---|---|---|
| 1 | O3 (high) | 8.8% | ~20% | 0.91× | All |
| 2 | Claude-4-Opus/Claude-4-Sonnet | 4.9% | ~15% | 0.92× | All |
| 3 | GPT-4o | <5% | <10% | 0.95× | Python-heavy |
| 4 | O1-preview | <5% | <8% | 0.96× | Limited |
| 5 | DeepSeek-V3 | <5% | <5% | 0.98× | Python only |
| - | Human Expert | 100% | 100% | 1.0× | All |
| Task Category | Human Success | Best AI | Gap |
|---|---|---|---|
| **Algorithm optimization** | 100% | 8% | 92% |
| **Memory management** | 100% | 3% | 97% |
| **Parallelization** | 100% | 2% | 98% |
| **SIMD/Vectorization** | 100% | 0% | 100% |
| **Caching strategies** | 100% | 5% | 95% |
GSO reveals systematic failure patterns in current AI systems[1]:
| Failure Mode | Frequency | Description | Example |
|---|---|---|---|
| **Abstraction hierarchy** | 25-30% | Avoiding necessary low-level changes | Refusing to modify C code when needed |
| **Lazy optimization** | 30% | Preferring trivial changes | Adding simple caching instead of algorithmic improvements |
| **Premature termination** | 75% | Not using full compute budget | Stopping after 25% of allowed steps |
| **Cross-language barriers** | 60% | Inability to work across languages | Python-only solutions for C++ problems |
| **Performance blindness** | 40% | No understanding of performance implications | Random changes without profiling |
GSO provides insights into compute scaling for optimization tasks:
| Scaling Type | Performance Impact | Efficiency | Recommendation |
|---|---|---|---|
| **Parallel (multiple rollouts)** | Moderate improvement | Good | Preferred approach |
| **Serial (longer reasoning)** | Minimal improvement | Poor | Not recommended |
| **Hybrid approaches** | Best results | Moderate | Future research direction |
| **Increased model size** | Limited benefit | Poor | Not sufficient alone |
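A best-of-K parallel-rollout loop, the "preferred approach" in the table above, can be sketched as follows. `generate_patch` and `evaluate` are hypothetical placeholders for the model call and the GSO harness, not actual GSO APIs:

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_k(task, generate_patch, evaluate, k=8):
    """Sample k independent rollouts in parallel, keep the best-scoring patch."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        # Each rollout gets a distinct seed so the samples are independent
        patches = list(pool.map(lambda i: generate_patch(task, seed=i), range(k)))
    scored = [(evaluate(task, p), p) for p in patches]
    return max(scored, key=lambda sp: sp[0])  # (best_score, best_patch)

# Toy demo: "patches" are integers and the score is the value itself.
score, patch = best_of_k("task", lambda t, seed: seed * 2, lambda t, p: p, k=4)
print(score, patch)  # 6 6
```

This mirrors the Opt@K metric: parallel sampling buys diversity across attempts, which the benchmark found more effective than spending the same compute on longer serial reasoning.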
GSO provides comprehensive tooling for evaluation:
```bash
# Install uv and load its environment
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Clone the benchmark repository
git clone https://github.com/gso-bench/gso.git
cd gso

# Create a virtual environment and install dependencies
uv venv && source .venv/bin/activate
uv sync

# Build the Docker images used for evaluation
python scripts/prepare_docker_images.py
```
The GSO dataset is available through multiple channels[4]:
```python
from datasets import load_dataset

gso_dataset = load_dataset('gso-bench/gso', split='test')

for task in gso_dataset:
    instance_id = task['instance_id']
    repo = task['repo']
    optimization_commit = task['opt_commit']
    performance_tests = task['tests']
```
Each GSO task contains:
| Field | Description | Example |
|---|---|---|
| `instance_id` | Unique identifier | "numpy__numpy-12345" |
| `repo` | Repository name | "numpy/numpy" |
| `base_commit` | Starting commit | "abc123..." |
| `opt_commit` | Target optimization | "def456..." |
| `api` | Affected functions | ["np.dot", "np.matmul"] |
| `prob_script` | Problem specification | Performance test code |
| `tests` | Validation tests | Unit and performance tests |
| `hints_text` | Optimization hints | "Consider vectorization" |
| `gt_diff` | Ground truth patch | Actual optimization changes |
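For working with these records in code, a lightweight container mirroring the fields above can be handy (a sketch only; the `datasets` loader actually yields plain dicts):

```python
from dataclasses import dataclass, field

@dataclass
class GSOTask:
    """One GSO instance, mirroring the dataset fields listed above."""
    instance_id: str                          # e.g. "numpy__numpy-12345"
    repo: str                                 # e.g. "numpy/numpy"
    base_commit: str                          # starting commit
    opt_commit: str                           # target optimization commit
    api: list = field(default_factory=list)   # affected functions
    prob_script: str = ""                     # performance test code
    tests: str = ""                           # validation tests
    hints_text: str = ""                      # optional optimization hints
    gt_diff: str = ""                         # ground-truth patch

task = GSOTask(instance_id="numpy__numpy-12345", repo="numpy/numpy",
               base_commit="abc123", opt_commit="def456",
               api=["np.dot", "np.matmul"])
print(task.repo)  # numpy/numpy
```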
GSO's findings have significant implications for AI research:
1. **Capability gap**: Reveals fundamental limitations in current AI systems
2. **New research directions**: Highlights need for specialized optimization models
3. **Evaluation standards**: Establishes rigorous benchmarks for performance tasks
4. **Scaling insights**: Provides data on compute scaling effectiveness
| Application Area | Current State | GSO Insights | Future Potential |
|---|---|---|---|
| **Code review** | AI assists with style | Performance blind spots | Performance-aware review |
| **CI/CD pipelines** | Basic automation | Cannot optimize | Automated optimization |
| **Legacy modernization** | Manual process | AI struggles with complexity | Guided optimization |
| **Performance debugging** | Limited AI help | Poor understanding | Improved tools needed |
| Benchmark | Focus | Task Count | Languages | GSO Advantage |
|---|---|---|---|---|
| SWE-Bench | Bug fixing | 2,294 | Python | 4-15× more complex edits |
| HumanEval | Code generation | 164 | Python | Real-world optimization |
| MBPP | Programming problems | 974 | Python | Performance focus |
| LiveCodeBench | General coding | Variable | Multiple | Optimization specific |
| GSO | Performance optimization | 102 | 5 languages | Unique focus area |
1. **Limited scale**: 102 tasks may not cover all optimization patterns
2. **Domain coverage**: Focus on specific repositories
3. **Language bias**: Heavy emphasis on Python ecosystem
4. **Measurement challenges**: Performance can be hardware-dependent
5. **Single-shot evaluation**: No iterative refinement allowed
| Direction | Description | Timeline |
|---|---|---|
| **Expanded coverage** | More repositories and languages | 2025-2026 |
| **Interactive mode** | Allow iterative optimization | 2026 |
| **Hardware diversity** | GPU, TPU optimization tasks | 2026 |
| **Specialized models** | Optimization-focused architectures | Research ongoing |
| **Industry integration** | Real production systems | 2026-2027 |
GSO represents a critical advancement in evaluating AI systems for real-world software engineering tasks. By focusing on performance optimization, a skill essential for production software development, the benchmark reveals that current AI systems, despite their impressive capabilities in code generation and bug fixing, fundamentally lack the understanding necessary for effective software optimization. The stark performance gap (an 8.8% Opt@1 success rate for the best model) highlights both the challenge ahead and the potential impact of solving this problem.
The benchmark's rigorous evaluation methodology, automated test generation pipeline, and focus on real-world complexity make it an essential tool for advancing AI capabilities in software engineering. As software performance becomes increasingly critical in the era of cloud computing and mobile devices, GSO provides the foundation for developing AI systems that can truly assist with the full spectrum of software development tasks.