Terminal-Bench
| Terminal-Bench | |
|---|---|
| Overview | |
| Full name | Terminal Environment Benchmark |
| Abbreviation | Terminal-Bench |
| Description | A benchmark for evaluating AI agents' ability to complete real-world, end-to-end tasks in terminal environments |
| Release date | 2025-04 |
| Latest version | 0.1.1 (Core) |
| Benchmark updated | 2025 |
| Authors | Stanford University, Laude Institute |
| Organization | Stanford University, Laude Institute |
| Technical Details | |
| Type | Agent Evaluation, Terminal Tasks |
| Modality | Text, Command-line Interface |
| Task format | End-to-end task completion |
| Number of tasks | ~100 (80 in v0.1.1) |
| Total examples | ~100 |
| Evaluation metric | Task completion rate |
| Domains | System administration, Security, Data science, Model training, File operations |
| Languages | English |
| Performance | |
| Human performance | Not reported |
| Baseline | ~30% (GPT-4.1 with Codex) |
| SOTA score | 52% |
| SOTA model | Warp Terminal Agent |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| GitHub | Repository |
| Dataset | Download |
| License | Open source |
Terminal-Bench is an AI benchmark designed to evaluate language agents' ability to complete real-world, end-to-end tasks in terminal environments. Released in April 2025 through a collaboration between Stanford University and the Laude Institute, Terminal-Bench represents the first comprehensive attempt to quantify AI agents' mastery of command-line interface operations, from compiling code to training models and setting up servers.
Overview
Terminal-Bench addresses a critical gap in AI agent evaluation by focusing on practical terminal-based tasks that require autonomous problem-solving capabilities. Unlike traditional code generation benchmarks that evaluate isolated programming skills, Terminal-Bench tests agents' ability to navigate complex terminal environments, execute multi-step procedures, and adapt to real-world system configurations.
Motivation
The development of Terminal-Bench was motivated by the need to evaluate AI agents in realistic computing environments where they must:
- Execute complex system administration tasks
- Handle unexpected errors and edge cases
- Navigate file systems and manage processes
- Configure software and services
- Solve problems that require multiple tools and commands
The benchmark provides a standardized testing ground for terminal-based AI capabilities, enabling researchers and developers to quantify their agents' terminal mastery objectively.
Technical Architecture
Components
Terminal-Bench consists of two primary components:
| Component | Description | Function |
|---|---|---|
| Dataset of Tasks | Collection of ~100 terminal-based challenges | Provides diverse test scenarios |
| Execution Harness | Runtime environment and evaluation system | Connects LLMs to sandboxed terminals |
Task Structure
Each task in Terminal-Bench includes:
- English Instruction: Clear description of what needs to be accomplished
- Test Script: Automated verification to check task completion
- Reference Solution: "Oracle" implementation showing one way to solve the task
- Docker Environment: Containerized setup ensuring consistent testing conditions
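Putting these pieces together, a new task could be scaffolded roughly as follows. This is a hypothetical sketch: the file and directory names are illustrative assumptions, not the benchmark's documented layout.
```bash
# Hypothetical scaffold for a task directory (file names are illustrative assumptions)
task="my-new-task"
mkdir -p "$task/tests"
touch "$task/task-description.md"   # the English instruction shown to the agent
touch "$task/Dockerfile"            # containerized environment for this task
touch "$task/solution.sh"           # reference "oracle" solution
touch "$task/tests/run-tests.sh"    # verification script that decides pass/fail
```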
Sandboxed Environment
Terminal-Bench employs Docker containers to create isolated, reproducible environments for each task. This approach ensures:
- **Safety**: Tasks cannot affect the host system
- **Reproducibility**: Identical conditions for all evaluations
- **Flexibility**: Support for various operating systems and configurations
- **Scalability**: Parallel execution of multiple tasks
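A minimal sketch of what this isolation can look like in practice is shown below; the image name, resource limits, and test-script path are assumptions for illustration, not the harness's actual commands.
```bash
# Illustrative isolation model (image name, resource limits, and paths are assumptions)
docker build -t tb-task-demo ./my-new-task            # same image on every run: reproducibility
docker run --rm --memory 2g --cpus 2 \
  tb-task-demo bash /tests/run-tests.sh               # commands run inside the container, not on the host
```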
Task Categories
Terminal-Bench covers a diverse range of terminal use cases across multiple domains:
Core Task Categories
| Category | Example Tasks | Difficulty | Skills Tested |
|---|---|---|---|
| System Administration | Configure Git servers, set up SSL certificates | Medium-Hard | Service configuration, process management |
| Security | Crack 7z archives, analyze vulnerabilities | Hard | Cryptography, penetration testing |
| Data Science | Reshape data files, analyze datasets | Medium | Data manipulation, statistical analysis |
| Machine Learning | Train FastText models, configure ML pipelines | Medium-Hard | Model training, hyperparameter tuning |
| Software Development | Build Linux kernel from source, compile repositories | Hard | Build systems, dependency management |
| Network Configuration | Set up network services, configure APIs | Medium | Network protocols, service deployment |
| File Operations | Complex file manipulations, batch processing | Medium | File system navigation, scripting |
Task Examples
Specific tasks in the benchmark include:
- Building the Linux kernel from source
- Configuring a Git webserver
- Cracking password-protected archives
- Creating self-signed SSL certificates
- Reshaping and transforming data files
- Training FastText models
- Setting up database servers
- Debugging system configurations
- Playing terminal-based games
- Calling and configuring APIs
- Addressing cybersecurity vulnerabilities
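As a concrete illustration, an agent working on the self-signed certificate task might issue commands along the following lines. This is one plausible approach, not the benchmark's reference solution.
```bash
# One plausible command sequence for the self-signed certificate task (illustrative only)
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout server.key -out server.crt -subj "/CN=localhost"
# Check that the certificate was actually produced with the expected subject and validity
openssl x509 -in server.crt -noout -subject -dates
```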
Evaluation Methodology
Performance Metrics
Terminal-Bench uses a straightforward evaluation approach:
| Metric | Description | Calculation |
|---|---|---|
| Task Completion Rate | Percentage of tasks successfully completed | (Completed Tasks / Total Tasks) × 100% |
| Pass/Fail | Binary success measure per task | Test script verification result |
| Time to Completion | Duration taken to solve each task | Optional metric for efficiency |
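The headline metric is a simple ratio, so it can be recomputed directly from raw pass/fail counts; the numbers below are made up for illustration.
```bash
# Recompute the task completion rate from raw counts (illustrative numbers)
completed=42
total=80
awk -v c="$completed" -v t="$total" \
  'BEGIN { printf "Completion rate: %.1f%%\n", 100 * c / t }'   # prints 52.5%
```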
Evaluation Process
The evaluation follows these steps:
1. **Task Initialization**: Docker container is created with the task-specific environment
2. **Agent Execution**: The AI agent receives the task instruction and terminal access
3. **Command Execution**: The agent issues commands to complete the task
4. **Verification**: The test script checks whether the task objectives are met
5. **Result Recording**: Success/failure and metadata are logged
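Conceptually, the harness repeats these five steps for every task. The shell sketch below is a simplified approximation, not the actual harness code; the directory layout, container names, and test-script path are assumptions.
```bash
# Simplified approximation of the evaluation loop (not the real harness implementation)
for task_dir in tasks/*/; do
  name=$(basename "$task_dir")
  docker build -t "tb-$name" "$task_dir"                       # 1. task-specific environment
  docker run -d --rm --name "tb-$name" "tb-$name" sleep inf    # keep the sandbox alive
  # 2-3. the agent reads the instruction and issues its commands here, e.g.:
  #      docker exec "tb-$name" <command chosen by the agent>
  if docker exec "tb-$name" bash /tests/run-tests.sh; then     # 4. verification script
    echo "$name: PASS" >> results.log                          # 5. record the outcome
  else
    echo "$name: FAIL" >> results.log
  fi
  docker stop "tb-$name"                                       # --rm removes the container on stop
done
```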
Running Evaluations
Evaluations are executed using the Terminal-Bench CLI:
```bash
tb run \
  --agent <agent_type> \
  --model-name <model_identifier> \
  --dataset-name terminal-bench-core \
  --dataset-version 0.1.1
```
Performance Analysis
Current Leaderboard
As of 2025, the Terminal-Bench leaderboard shows the following performance:
| Rank | Agent/Model | Task Completion Rate | Organization |
|---|---|---|---|
| 1 | Warp Terminal Agent | 52% | Warp |
| 2 | Qwen3-32B Agent | ~35% | Independent |
| 3 | Terminus-Qwen3-235B-30A MoE | ~32% | Stanford |
| 4 | GPT-4.1 with Codex | ~30% | OpenAI |
| 5 | DeepSeek R1 | <30% | DeepSeek |
| 6 | Claude Code | Benchmarking in progress | Anthropic |
Key Findings
Performance Characteristics
- **Low Overall Success Rates**: Even the top agent completes only about half of the tasks
- **Significant Performance Gap**: Best performers (52%) vs baseline models (~30%)
- **Task Difficulty Impact**: Success rates vary significantly by task category
- **Agent Architecture Matters**: Specialized terminal agents outperform general-purpose models
Challenges for Current Systems
- Complex multi-step procedures remain difficult
- Error recovery and debugging capabilities are limited
- System-specific knowledge gaps affect performance
- Long-horizon tasks with many dependencies prove challenging
Implementation Details
Installation
Terminal-Bench can be installed via multiple package managers:
```bash
# Using uv
uv tool install terminal-bench

# Using pip
pip install terminal-bench

# From source
git clone https://github.com/laude-institute/terminal-bench
cd terminal-bench
pip install -e .
```
Supported Agents
The benchmark supports evaluation of various agent types:
| Agent Type | Description | Integration Method |
|---|---|---|
| Terminus | Stanford's terminal agent | Native support |
| Claude Code | Anthropic's coding assistant | API integration |
| Codex CLI | OpenAI's terminal interface | API integration |
| Goose | Independent agent framework | Custom adapter |
| Custom Agents | User-defined agents | Plugin interface |
Technical Requirements
- **Docker**: Required for sandboxed execution environments
- **Python**: 3.8 or higher
- **Memory**: Minimum 8GB RAM recommended
- **Storage**: ~10GB for Docker images and task data
- **Network**: Internet access for package installation
Dataset Versions
Release History
| Version | Release Date | Tasks | Major Changes |
|---|---|---|---|
| v0.1.0 | April 2025 | 80 | Initial beta release |
| v0.1.1 | April 2025 | 80 | Bug fixes, improved test scripts |
| Future | 2025-2026 | 200+ | Expanded task categories |
Task Growth
Terminal-Bench is actively expanding:
- **Current**: ~100 tasks in beta
- **Near-term**: Adding tasks weekly
- **Goal**: Several hundred tasks covering all major terminal use cases
Community and Development
Open Source Contribution
Terminal-Bench has garnered significant community interest:
- **GitHub Stars**: Nearly 300 within the first months after release
- **Contributors**: Over 40 developers
- **Task Submissions**: Community can propose new challenges
- **Agent Integrations**: Multiple third-party agents added
Future Directions
Planned improvements include:
1. **Expanded Task Library**: Hundreds of additional tasks
2. **Multi-language Support**: Tasks in languages beyond English
3. **Difficulty Tiers**: Better categorization from beginner to expert
4. **Partial Credit**: Nuanced scoring beyond pass/fail
5. **Interactive Tasks**: Support for tasks requiring user interaction
6. **Performance Profiling**: Detailed metrics on resource usage
Significance and Impact
Research Applications
Terminal-Bench enables several research directions:
- **Agent Architecture**: Optimizing designs for terminal interaction
- **Error Recovery**: Studying how agents handle failures
- **Long-horizon Planning**: Understanding multi-step task execution
- **Tool Use**: Evaluating command and utility selection strategies
Practical Applications
The benchmark has implications for:
- **DevOps Automation**: Assessing AI readiness for operations tasks
- **System Administration**: Evaluating AI assistants for IT support
- **Security Testing**: Understanding AI capabilities in cybersecurity
- **Educational Tools**: Developing AI tutors for command-line skills
Limitations and Considerations
Current Limitations
| Limitation | Description | Impact |
|---|---|---|
| Limited Task Count | ~100 tasks in current version | May not cover all use cases |
| Binary Scoring | Pass/fail without partial credit | Misses incremental progress |
| English Only | Tasks in English language | Limited global applicability |
| Docker Dependency | Requires containerization | Platform constraints |
| Terminal Focus | Text-only interface | No GUI interaction testing |
Evaluation Challenges
- **Reproducibility**: Ensuring consistent environments across runs
- **Task Ambiguity**: Some tasks may have multiple valid solutions
- **Resource Constraints**: Memory and compute limits affect some tasks
- **Network Dependencies**: Tasks requiring internet access may vary
Related Benchmarks
- SWE-bench: Software engineering tasks
- HumanEval: Code generation benchmark
- WebArena: Web-based agent tasks
- AgentBench: General agent evaluation
- InterCode: Interactive coding tasks
- ML-Agent-Bench: ML research tasks