Terminal-Bench

Terminal-Bench

| Field | Value |
|---|---|
| **Overview** | |
| Full name | Terminal Environment Benchmark |
| Abbreviation | Terminal-Bench |
| Description | A benchmark for evaluating AI agents' ability to complete real-world, end-to-end tasks in terminal environments |
| Release date | 2025-04 |
| Latest version | 0.1.1 (Core) |
| Benchmark updated | 2025 |
| Authors | Stanford University, Laude Institute |
| Organization | Stanford University, Laude Institute |
| **Technical Details** | |
| Type | Agent Evaluation, Terminal Tasks |
| Modality | Text, Command-line Interface |
| Task format | End-to-end task completion |
| Number of tasks | ~100 (80 in v0.1.1) |
| Total examples | ~100 |
| Evaluation metric | Task completion rate |
| Domains | System administration, Security, Data science, Model training, File operations |
| Languages | English |
| **Performance** | |
| Human performance | Not reported |
| Baseline | ~30% (GPT-4.1 with Codex) |
| SOTA score | 52% |
| SOTA model | Warp Terminal Agent |
| SOTA date | 2025 |
| Saturated | No |
| **Resources** | |
| Website | Official website |
| GitHub | https://github.com/laude-institute/terminal-bench |
| Dataset | Download |
| License | Open source |



Terminal-Bench is an AI benchmark designed to evaluate language agents' ability to complete real-world, end-to-end tasks in terminal environments. Released in April 2025 through a collaboration between Stanford University and the Laude Institute, Terminal-Bench represents the first comprehensive attempt to quantify AI agents' mastery of command-line interface operations, from compiling code to training models and setting up servers.

Overview

Terminal-Bench addresses a critical gap in AI agent evaluation by focusing on practical terminal-based tasks that require autonomous problem-solving capabilities. Unlike traditional code generation benchmarks that evaluate isolated programming skills, Terminal-Bench tests agents' ability to navigate complex terminal environments, execute multi-step procedures, and adapt to real-world system configurations.

Motivation

The development of Terminal-Bench was motivated by the need to evaluate AI agents in realistic computing environments where they must:

  • Execute complex system administration tasks
  • Handle unexpected errors and edge cases
  • Navigate file systems and manage processes
  • Configure software and services
  • Solve problems that require multiple tools and commands

The benchmark provides a standardized testing ground for terminal-based AI capabilities, enabling researchers and developers to quantify their agents' terminal mastery objectively.

Technical Architecture

Components

Terminal-Bench consists of two primary components:

| Component | Description | Function |
|---|---|---|
| Dataset of Tasks | Collection of ~100 terminal-based challenges | Provides diverse test scenarios |
| Execution Harness | Runtime environment and evaluation system | Connects LLMs to sandboxed terminals |

Task Structure

Each task in Terminal-Bench includes:

  • English Instruction: Clear description of what needs to be accomplished
  • Test Script: Automated verification to check task completion
  • Reference Solution: "Oracle" implementation showing one way to solve the task
  • Docker Environment: Containerized setup ensuring consistent testing conditions
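
To make this structure concrete, a task bundle can be pictured as a small directory holding these four pieces. The layout and file names below are an illustrative assumption, not the repository's documented schema:

```bash
# Hypothetical layout of a single task bundle (file names are illustrative,
# not the benchmark's prescribed schema):
#
#   tasks/configure-git-webserver/
#   ├── task.yaml        # English instruction and task metadata
#   ├── Dockerfile       # containerized environment for the task
#   ├── solution.sh      # reference "oracle" solution
#   └── tests/
#       └── run-tests.sh # automated verification script
#
# After cloning the repository, a bundle could be inspected with:
ls -R tasks/configure-git-webserver/
```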

Sandboxed Environment

Terminal-Bench employs Docker containers to create isolated, reproducible environments for each task. This approach ensures:

  • **Safety**: Tasks cannot affect the host system
  • **Reproducibility**: Identical conditions for all evaluations
  • **Flexibility**: Support for various operating systems and configurations
  • **Scalability**: Parallel execution of multiple tasks
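
For intuition, the isolation described above is the same kind provided by an ordinary throwaway Docker container. The image, resource limits, and container name below are illustrative assumptions rather than Terminal-Bench's actual configuration:

```bash
# Illustrative only: a throwaway, resource-limited container in the spirit of
# a task sandbox. Image name, limits, and container name are placeholders.
docker run --rm -it \
  --name tbench-sandbox \
  --memory 2g --cpus 2 \
  ubuntu:22.04 /bin/bash
# --rm discards the container (and any filesystem changes) on exit, so nothing
# a task does inside it can persist on or affect the host system.
```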

Task Categories

Terminal-Bench covers a diverse range of terminal use cases across multiple domains:

Core Task Categories

| Category | Example Tasks | Difficulty | Skills Tested |
|---|---|---|---|
| System Administration | Configure Git servers, set up SSL certificates | Medium-Hard | Service configuration, process management |
| Security | Crack 7z archives, analyze vulnerabilities | Hard | Cryptography, penetration testing |
| Data Science | Reshape data files, analyze datasets | Medium | Data manipulation, statistical analysis |
| Machine Learning | Train FastText models, configure ML pipelines | Medium-Hard | Model training, hyperparameter tuning |
| Software Development | Build Linux kernel from source, compile repositories | Hard | Build systems, dependency management |
| Network Configuration | Set up network services, configure APIs | Medium | Network protocols, service deployment |
| File Operations | Complex file manipulations, batch processing | Medium | File system navigation, scripting |

Task Examples

Specific tasks in the benchmark include:

  1. Building the Linux kernel from source
  2. Configuring a Git webserver
  3. Cracking password-protected archives
  4. Creating self-signed SSL certificates (see the example after this list)
  5. Reshaping and transforming data files
  6. Training FastText models
  7. Setting up database servers
  8. Debugging system configurations
  9. Playing terminal-based games
  10. Calling and configuring APIs
  11. Addressing cybersecurity vulnerabilities
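
To give a flavor of the skills a task like #4 exercises, one standard way to create a self-signed certificate with OpenSSL is shown below. The file names and subject are illustrative, and the benchmark's actual instruction and grading criteria may differ:

```bash
# Create a self-signed certificate and private key (illustrative of task 4
# above); file names and the subject are placeholders, not the benchmark's spec.
openssl req -x509 -newkey rsa:4096 \
  -keyout server.key -out server.crt \
  -days 365 -nodes \
  -subj "/CN=localhost"

# A verification script could then confirm the certificate's properties:
openssl x509 -in server.crt -noout -subject -dates
```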

Evaluation Methodology

Performance Metrics

Terminal-Bench uses a straightforward evaluation approach:

| Metric | Description | Calculation |
|---|---|---|
| Task Completion Rate | Percentage of tasks successfully completed | (Completed Tasks / Total Tasks) × 100% |
| Pass/Fail | Binary success measure per task | Test script verification result |
| Time to Completion | Duration taken to solve each task | Optional metric for efficiency |
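
As a quick worked example of the headline metric (the pass count here is invented purely for illustration): an agent that completes 42 of the 80 tasks in terminal-bench-core v0.1.1 scores 42 / 80 × 100% = 52.5%.

```bash
# Hypothetical run: 42 of 80 tasks passed -> task completion rate in percent
echo "scale=1; 42 * 100 / 80" | bc   # prints 52.5
```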

Evaluation Process

The evaluation follows these steps:

  1. **Task Initialization**: Docker container is created with task-specific environment
  2. **Agent Execution**: AI agent receives task instruction and terminal access
  3. **Command Execution**: Agent issues commands to complete the task
  4. **Verification**: Test script checks if task objectives are met
  5. **Result Recording**: Success/failure and metadata are logged
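
The same flow can be sketched as plain shell commands. The image tag, script names, and results file below are assumptions made for illustration, not the harness's real internals:

```bash
# Hedged sketch of the per-task flow; all names and paths are illustrative.
TASK=configure-git-webserver

# 1. Task initialization: build the task image and start a container
docker build -t tbench/"$TASK" tasks/"$TASK"/
CID=$(docker run -d -it tbench/"$TASK" /bin/bash)

# 2-3. Agent execution: the agent would issue its commands inside the container
docker exec "$CID" bash -c 'echo "agent commands would run here"'

# 4. Verification: run the task's test script inside the same container
if docker exec "$CID" bash tests/run-tests.sh; then RESULT=pass; else RESULT=fail; fi

# 5. Result recording, then teardown
echo "$TASK,$RESULT" >> results.csv
docker rm -f "$CID"
```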

Running Evaluations

Evaluations are executed using the Terminal-Bench CLI:

```bash
tb run \
   --agent <agent_type> \
   --model-name <model_identifier> \
   --dataset-name terminal-bench-core \
   --dataset-version 0.1.1
```

Performance Analysis

Current Leaderboard

As of 2025, the Terminal-Bench leaderboard shows the following performance:

| Rank | Agent/Model | Task Completion Rate | Organization |
|---|---|---|---|
| 1 | Warp Terminal Agent | 52% | Warp |
| 2 | Qwen3-32B Agent | ~35% | Independent |
| 3 | Terminus-Qwen3-235B-30A MoE | ~32% | Stanford |
| 4 | GPT-4.1 with Codex | ~30% | OpenAI |
| 5 | DeepSeek R1 | <30% | DeepSeek |
| 6 | Claude Code | Benchmarking in progress | Anthropic |

Key Findings

Performance Characteristics

  • **Low Overall Success Rates**: Even the best agent completes only about half of the tasks
  • **Significant Performance Gap**: Best performers (52%) vs baseline models (~30%)
  • **Task Difficulty Impact**: Success rates vary significantly by task category
  • **Agent Architecture Matters**: Specialized terminal agents outperform general-purpose models

Challenges for Current Systems

  • Complex multi-step procedures remain difficult
  • Error recovery and debugging capabilities are limited
  • System-specific knowledge gaps affect performance
  • Long-horizon tasks with many dependencies prove challenging

Implementation Details

Installation

Terminal-Bench can be installed via multiple package managers:

```bash
# Using uv
uv tool install terminal-bench

# Using pip
pip install terminal-bench

# From source
git clone https://github.com/laude-institute/terminal-bench
cd terminal-bench
pip install -e .
```
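
After installing, a quick sanity check is to confirm the tb entry point is on PATH; the only assumption here is that the CLI follows the usual --help convention.

```bash
# Confirm the CLI is installed and on PATH (assumes a standard --help flag)
tb --help
```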

Supported Agents

The benchmark supports evaluation of various agent types:

| Agent Type | Description | Integration Method |
|---|---|---|
| Terminus | Stanford's terminal agent | Native support |
| Claude Code | Anthropic's coding assistant | API integration |
| Codex CLI | OpenAI's terminal interface | API integration |
| Goose | Independent agent framework | Custom adapter |
| Custom Agents | User-defined agents | Plugin interface |

Technical Requirements

  • **Docker**: Required for sandboxed execution environments
  • **Python**: 3.8 or higher
  • **Memory**: Minimum 8GB RAM recommended
  • **Storage**: ~10GB for Docker images and task data
  • **Network**: Internet access for package installation
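
Before a first run, it can help to verify these prerequisites locally. The checks below use only standard tooling, and the 8 GB / 10 GB figures simply restate the recommendations above (free is Linux-specific):

```bash
# Quick local check of the requirements listed above
docker --version            # Docker installed and on PATH
docker info > /dev/null     # ...and the daemon is running
python3 --version           # Python 3.8 or higher
free -h | grep -i mem       # roughly 8 GB RAM recommended
df -h .                     # roughly 10 GB free for images and task data
```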

Dataset Versions

Release History

| Version | Release Date | Tasks | Major Changes |
|---|---|---|---|
| v0.1.0 | April 2025 | 80 | Initial beta release |
| v0.1.1 | April 2025 | 80 | Bug fixes, improved test scripts |
| Future | 2025-2026 | 200+ | Expanded task categories |

Task Growth

Terminal-Bench is actively expanding:

  • **Current**: ~100 tasks in beta
  • **Near-term**: Adding tasks weekly
  • **Goal**: Several hundred tasks covering all major terminal use cases

Community and Development

Open Source Contribution

Terminal-Bench has garnered significant community interest:

  • **GitHub Stars**: Nearly 300 within the first few months
  • **Contributors**: Over 40 developers
  • **Task Submissions**: Community can propose new challenges
  • **Agent Integrations**: Multiple third-party agents added

Future Directions

Planned improvements include:

  1. **Expanded Task Library**: Hundreds of additional tasks
  2. **Multi-language Support**: Tasks in languages beyond English
  3. **Difficulty Tiers**: Better categorization from beginner to expert
  4. **Partial Credit**: Nuanced scoring beyond pass/fail
  5. **Interactive Tasks**: Support for tasks requiring user interaction
  6. **Performance Profiling**: Detailed metrics on resource usage

Significance and Impact

Research Applications

Terminal-Bench enables several research directions:

  • **Agent Architecture**: Optimizing designs for terminal interaction
  • **Error Recovery**: Studying how agents handle failures
  • **Long-horizon Planning**: Understanding multi-step task execution
  • **Tool Use**: Evaluating command and utility selection strategies

Practical Applications

The benchmark has implications for:

  • **DevOps Automation**: Assessing AI readiness for operations tasks
  • **System Administration**: Evaluating AI assistants for IT support
  • **Security Testing**: Understanding AI capabilities in cybersecurity
  • **Educational Tools**: Developing AI tutors for command-line skills

Limitations and Considerations

Current Limitations

| Limitation | Description | Impact |
|---|---|---|
| Limited Task Count | ~100 tasks in current version | May not cover all use cases |
| Binary Scoring | Pass/fail without partial credit | Misses incremental progress |
| English Only | Tasks in English language | Limited global applicability |
| Docker Dependency | Requires containerization | Platform constraints |
| Terminal Focus | Text-only interface | No GUI interaction testing |

Evaluation Challenges

  • **Reproducibility**: Ensuring consistent environments across runs
  • **Task Ambiguity**: Some tasks may have multiple valid solutions
  • **Resource Constraints**: Memory and compute limits affect some tasks
  • **Network Dependencies**: Tasks requiring internet access may vary

Related Benchmarks

See Also

References
