Terminal-Bench

Terminal-Bench

| Field | Value |
|---|---|
| **Overview** | |
| Full name | Terminal Environment Benchmark |
| Abbreviation | Terminal-Bench |
| Description | A benchmark for evaluating AI agents' ability to complete real-world, end-to-end tasks in terminal environments |
| Release date | 2025-04 |
| Latest version | 0.1.1 (Core) |
| Benchmark updated | 2025 |
| Authors | Stanford University, Laude Institute |
| Organization | Stanford University, Laude Institute |
| **Technical Details** | |
| Type | Agent Evaluation, Terminal Tasks |
| Modality | Text, Command-line Interface |
| Task format | End-to-end task completion |
| Number of tasks | ~100 (80 in v0.1.1) |
| Total examples | ~100 |
| Evaluation metric | Task completion rate |
| Domains | System administration, Security, Data science, Model training, File operations |
| Languages | English |
| **Performance** | |
| Human performance | Not reported |
| Baseline | ~30% (GPT-4.1 with Codex) |
| SOTA score | 52% |
| SOTA model | Warp Terminal Agent |
| SOTA date | 2025 |
| Saturated | No |
| **Resources** | |
| Website | Official website |
| GitHub | https://github.com/laude-institute/terminal-bench |
| Dataset | Download |
| License | Open source |



Terminal-Bench is an AI benchmark designed to evaluate language agents' ability to complete real-world, end-to-end tasks in terminal environments. Released in April 2025 through a collaboration between Stanford University and the Laude Institute, Terminal-Bench represents the first comprehensive attempt to quantify AI agents' mastery of command-line interface operations, from compiling code to training models and setting up servers.

Overview

Terminal-Bench addresses a critical gap in AI agent evaluation by focusing on practical terminal-based tasks that require autonomous problem-solving capabilities. Unlike traditional code generation benchmarks that evaluate isolated programming skills, Terminal-Bench tests agents' ability to navigate complex terminal environments, execute multi-step procedures, and adapt to real-world system configurations.

Motivation

The development of Terminal-Bench was motivated by the need to evaluate AI agents in realistic computing environments where they must:

  • Execute complex system administration tasks
  • Handle unexpected errors and edge cases
  • Navigate file systems and manage processes
  • Configure software and services
  • Solve problems that require multiple tools and commands

The benchmark provides a standardized testing ground for terminal-based AI capabilities, enabling researchers and developers to quantify their agents' terminal mastery objectively.

Technical Architecture

Components

Terminal-Bench consists of two primary components:

| Component | Description | Function |
|---|---|---|
| Dataset of Tasks | Collection of ~100 terminal-based challenges | Provides diverse test scenarios |
| Execution Harness | Runtime environment and evaluation system | Connects LLMs to sandboxed terminals |

Task Structure

Each task in Terminal-Bench includes:

  • English Instruction: Clear description of what needs to be accomplished
  • Test Script: Automated verification to check task completion
  • Reference Solution: "Oracle" implementation showing one way to solve the task
  • Docker Environment: Containerized setup ensuring consistent testing conditions
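
To make this structure concrete, a task bundle can be pictured as a small directory holding these four pieces. The layout and file names below are an illustrative assumption, not the repository's documented schema:

```bash
# Hypothetical layout of a single task bundle (file names are illustrative,
# not the benchmark's prescribed schema):
#
#   tasks/configure-git-webserver/
#   ├── task.yaml        # English instruction and task metadata
#   ├── Dockerfile       # containerized environment for the task
#   ├── solution.sh      # reference "oracle" solution
#   └── tests/
#       └── run-tests.sh # automated verification script
#
# After cloning the repository, a bundle could be inspected with:
ls -R tasks/configure-git-webserver/
```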

Sandboxed Environment

Terminal-Bench employs Docker containers to create isolated, reproducible environments for each task. This approach ensures:

  • **Safety**: Tasks cannot affect the host system
  • **Reproducibility**: Identical conditions for all evaluations
  • **Flexibility**: Support for various operating systems and configurations
  • **Scalability**: Parallel execution of multiple tasks
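
For intuition, the isolation described above is the same kind provided by an ordinary throwaway Docker container. The image, resource limits, and container name below are illustrative assumptions rather than Terminal-Bench's actual configuration:

```bash
# Illustrative only: a throwaway, resource-limited container in the spirit of
# a task sandbox. Image name, limits, and container name are placeholders.
docker run --rm -it \
  --name tbench-sandbox \
  --memory 2g --cpus 2 \
  ubuntu:22.04 /bin/bash
# --rm discards the container (and any filesystem changes) on exit, so nothing
# a task does inside it can persist on or affect the host system.
```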

Task Categories

Terminal-Bench covers a diverse range of terminal use cases across multiple domains:

Core Task Categories

| Category | Example Tasks | Difficulty | Skills Tested |
|---|---|---|---|
| System Administration | Configure Git servers, set up SSL certificates | Medium-Hard | Service configuration, process management |
| Security | Crack 7z archives, analyze vulnerabilities | Hard | Cryptography, penetration testing |
| Data Science | Reshape data files, analyze datasets | Medium | Data manipulation, statistical analysis |
| Machine Learning | Train FastText models, configure ML pipelines | Medium-Hard | Model training, hyperparameter tuning |
| Software Development | Build Linux kernel from source, compile repositories | Hard | Build systems, dependency management |
| Network Configuration | Set up network services, configure APIs | Medium | Network protocols, service deployment |
| File Operations | Complex file manipulations, batch processing | Medium | File system navigation, scripting |

Task Examples

Specific tasks in the benchmark include:

  1. Building the Linux kernel from source
  2. Configuring a Git webserver
  3. Cracking password-protected archives
  4. Creating self-signed SSL certificates (see the example after this list)
  5. Reshaping and transforming data files
  6. Training FastText models
  7. Setting up database servers
  8. Debugging system configurations
  9. Playing terminal-based games
  10. Calling and configuring APIs
  11. Addressing cybersecurity vulnerabilities
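
To give a flavor of the skills a task like #4 exercises, one standard way to create a self-signed certificate with OpenSSL is shown below. The file names and subject are illustrative, and the benchmark's actual instruction and grading criteria may differ:

```bash
# Create a self-signed certificate and private key (illustrative of task 4
# above); file names and the subject are placeholders, not the benchmark's spec.
openssl req -x509 -newkey rsa:4096 \
  -keyout server.key -out server.crt \
  -days 365 -nodes \
  -subj "/CN=localhost"

# A verification script could then confirm the certificate's properties:
openssl x509 -in server.crt -noout -subject -dates
```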

Evaluation Methodology

Performance Metrics

Terminal-Bench uses a straightforward evaluation approach:

| Metric | Description | Calculation |
|---|---|---|
| Task Completion Rate | Percentage of tasks successfully completed | (Completed Tasks / Total Tasks) × 100% |
| Pass/Fail | Binary success measure per task | Test script verification result |
| Time to Completion | Duration taken to solve each task | Optional metric for efficiency |
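
As a quick worked example of the headline metric (the pass count here is invented purely for illustration): an agent that completes 42 of the 80 tasks in terminal-bench-core v0.1.1 scores 42 / 80 × 100% = 52.5%.

```bash
# Hypothetical run: 42 of 80 tasks passed -> task completion rate in percent
echo "scale=1; 42 * 100 / 80" | bc   # prints 52.5
```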

Evaluation Process

The evaluation follows these steps:

  1. **Task Initialization**: Docker container is created with task-specific environment
  2. **Agent Execution**: AI agent receives task instruction and terminal access
  3. **Command Execution**: Agent issues commands to complete the task
  4. **Verification**: Test script checks if task objectives are met
  5. **Result Recording**: Success/failure and metadata are logged
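
The same flow can be sketched as plain shell commands. The image tag, script names, and results file below are assumptions made for illustration, not the harness's real internals:

```bash
# Hedged sketch of the per-task flow; all names and paths are illustrative.
TASK=configure-git-webserver

# 1. Task initialization: build the task image and start a container
docker build -t tbench/"$TASK" tasks/"$TASK"/
CID=$(docker run -d -it tbench/"$TASK" /bin/bash)

# 2-3. Agent execution: the agent would issue its commands inside the container
docker exec "$CID" bash -c 'echo "agent commands would run here"'

# 4. Verification: run the task's test script inside the same container
if docker exec "$CID" bash tests/run-tests.sh; then RESULT=pass; else RESULT=fail; fi

# 5. Result recording, then teardown
echo "$TASK,$RESULT" >> results.csv
docker rm -f "$CID"
```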

Running Evaluations

Evaluations are executed using the Terminal-Bench CLI:

```bash
tb run \
   --agent <agent_type> \
   --model-name <model_identifier> \
   --dataset-name terminal-bench-core \
   --dataset-version 0.1.1
```

Performance Analysis

Current Leaderboard

As of 2025, the Terminal-Bench leaderboard shows the following performance:

| Rank | Agent/Model | Task Completion Rate | Organization |
|---|---|---|---|
| 1 | Warp Terminal Agent | 52% | Warp |
| 2 | Qwen3-32B Agent | ~35% | Independent |
| 3 | Terminus-Qwen3-235B-30A MoE | ~32% | Stanford |
| 4 | GPT-4.1 with Codex | ~30% | OpenAI |
| 5 | DeepSeek R1 | <30% | DeepSeek |
| 6 | Claude Code | Benchmarking in progress | Anthropic |

Key Findings

Performance Characteristics

  • **Low Overall Success Rates**: Even the best agent completes only about half of the tasks
  • **Significant Performance Gap**: Best performers (52%) vs baseline models (~30%)
  • **Task Difficulty Impact**: Success rates vary significantly by task category
  • **Agent Architecture Matters**: Specialized terminal agents outperform general-purpose models

Challenges for Current Systems

  • Complex multi-step procedures remain difficult
  • Error recovery and debugging capabilities are limited
  • System-specific knowledge gaps affect performance
  • Long-horizon tasks with many dependencies prove challenging

Implementation Details

Installation

Terminal-Bench can be installed via multiple package managers:

```bash
# Using uv
uv tool install terminal-bench

# Using pip
pip install terminal-bench

# From source
git clone https://github.com/laude-institute/terminal-bench
cd terminal-bench
pip install -e .
```
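
After installing, a quick sanity check is to confirm the tb entry point is on PATH; the only assumption here is that the CLI follows the usual --help convention.

```bash
# Confirm the CLI is installed and on PATH (assumes a standard --help flag)
tb --help
```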

Supported Agents

The benchmark supports evaluation of various agent types:

| Agent Type | Description | Integration Method |
|---|---|---|
| Terminus | Stanford's terminal agent | Native support |
| Claude Code | Anthropic's coding assistant | API integration |
| Codex CLI | OpenAI's terminal interface | API integration |
| Goose | Independent agent framework | Custom adapter |
| Custom Agents | User-defined agents | Plugin interface |

Technical Requirements

  • **Docker**: Required for sandboxed execution environments
  • **Python**: 3.8 or higher
  • **Memory**: Minimum 8GB RAM recommended
  • **Storage**: ~10GB for Docker images and task data
  • **Network**: Internet access for package installation
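
Before a first run, it can help to verify these prerequisites locally. The checks below use only standard tooling, and the 8 GB / 10 GB figures simply restate the recommendations above (free is Linux-specific):

```bash
# Quick local check of the requirements listed above
docker --version            # Docker installed and on PATH
docker info > /dev/null     # ...and the daemon is running
python3 --version           # Python 3.8 or higher
free -h | grep -i mem       # roughly 8 GB RAM recommended
df -h .                     # roughly 10 GB free for images and task data
```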

Dataset Versions

Release History

| Version | Release Date | Tasks | Major Changes |
|---|---|---|---|
| v0.1.0 | April 2025 | 80 | Initial beta release |
| v0.1.1 | April 2025 | 80 | Bug fixes, improved test scripts |
| Future | 2025-2026 | 200+ | Expanded task categories |

Task Growth

Terminal-Bench is actively expanding:

  • **Current**: ~100 tasks in beta
  • **Near-term**: Adding tasks weekly
  • **Goal**: Several hundred tasks covering all major terminal use cases

Community and Development

Open Source Contribution

Terminal-Bench has garnered significant community interest:

  • **GitHub Stars**: Nearly 300 within the first few months
  • **Contributors**: Over 40 developers
  • **Task Submissions**: Community can propose new challenges
  • **Agent Integrations**: Multiple third-party agents added

Future Directions

Planned improvements include:

  1. **Expanded Task Library**: Hundreds of additional tasks
  2. **Multi-language Support**: Tasks in languages beyond English
  3. **Difficulty Tiers**: Better categorization from beginner to expert
  4. **Partial Credit**: Nuanced scoring beyond pass/fail
  5. **Interactive Tasks**: Support for tasks requiring user interaction
  6. **Performance Profiling**: Detailed metrics on resource usage

Significance and Impact

Research Applications

Terminal-Bench enables several research directions:

  • **Agent Architecture**: Optimizing designs for terminal interaction
  • **Error Recovery**: Studying how agents handle failures
  • **Long-horizon Planning**: Understanding multi-step task execution
  • **Tool Use**: Evaluating command and utility selection strategies

Practical Applications

The benchmark has implications for:

  • **DevOps Automation**: Assessing AI readiness for operations tasks
  • **System Administration**: Evaluating AI assistants for IT support
  • **Security Testing**: Understanding AI capabilities in cybersecurity
  • **Educational Tools**: Developing AI tutors for command-line skills

Limitations and Considerations

Current Limitations

| Limitation | Description | Impact |
|---|---|---|
| Limited Task Count | ~100 tasks in current version | May not cover all use cases |
| Binary Scoring | Pass/fail without partial credit | Misses incremental progress |
| English Only | Tasks in English language | Limited global applicability |
| Docker Dependency | Requires containerization | Platform constraints |
| Terminal Focus | Text-only interface | No GUI interaction testing |

Evaluation Challenges

  • **Reproducibility**: Ensuring consistent environments across runs
  • **Task Ambiguity**: Some tasks may have multiple valid solutions
  • **Resource Constraints**: Memory and compute limits affect some tasks
  • **Network Dependencies**: Tasks requiring internet access may vary

Related Benchmarks

See Also

References
