# Vimgolf

> Source: https://aiwiki.ai/wiki/vimgolf
> Updated: 2026-05-10
> Categories: AI Benchmarks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

| vimgolf-gym | |
| --- | --- |
| Overview |
| Full name | Vimgolf Gym Environment |
| Abbreviation | vimgolf-gym |
| Description | OpenAI Gym style customizable environment and benchmark for VimGolf challenges, used to evaluate AI agents and large language models on Vim text editing tasks |
| Latest version | 0.1.1 |
| Authors | James Brown (James4Ever0) |
| Organization | Cybergod AGI Research |
| First release | 2025 |
| Technical Details |
| Type | Vim editing challenge evaluation |
| Modality | Text editing, Vim commands, terminal |
| Task format | VimGolf challenges (input text to output text via keystrokes) |
| Evaluation metric | Keystroke count (lower is better), relative inverse score, accuracy |
| Dataset size | 612 public VimGolf challenges scraped from [vimgolf.com](https://www.vimgolf.com) |
| Domains | Text editing, Vim proficiency, terminal interaction |
| Languages | Python (primary), Rust |
| Performance |
| Saturated | False |
| Reported example result | ollama/gpt-oss:20b at 11.8% accuracy (72/612) on the [Inspect AI](/wiki/inspect_ai) single-turn variant |
| Resources |
| Website | [Official site](https://james4ever0.github.io/vimgolf-gym) |
| GitHub | [Repository](https://github.com/James4Ever0/vimgolf-gym) |
| Dataset | [HuggingFace](https://huggingface.co/datasets/James4Ever0/vimgolf_challenges_and_solutions) |
| Inspect AI eval | [vimgolf_challenges](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/vimgolf_challenges/) |
| License | The Unlicense (public domain dedication) |

**vimgolf-gym** is an [OpenAI Gym](/wiki/openai_gym) style customizable environment and benchmark built around [VimGolf](https://www.vimgolf.com), the long running keystroke counting puzzle game for [Vim](/wiki/vim). The package wraps a real Vim instance, replays keystrokes against fixed input and output pairs, and reports whether a [large language model](/wiki/large_language_model) or scripted agent produced the correct output and how many keystrokes it used. It is part of the Cybergod AGI Research stack created by James Brown (GitHub handle James4Ever0), alongside CTF-Gym and a planned Cybergod-Gym for autonomous economic agents. [1] [2]

## Background: VimGolf the game

VimGolf is a website (vimgolf.com) where players compete to transform a fixed input file into a fixed output file using as few Vim keystrokes as possible. It was created by Ilya Grigorik (igrigorik on GitHub) as a holiday project in 2010, with the first tagged release of the companion `vimgolf` Ruby gem dated December 30, 2010. The site uses the slogan "Real Vim ninjas count every keystroke," and the open source command line client lives in the [igrigorik/vimgolf](https://github.com/igrigorik/vimgolf) repository under the [MIT License](/wiki/mit_license). [3] [4]

The rules are deliberately spare. Anyone can submit a challenge, entries are ranked by total keystrokes (lowest score wins) with ties broken by submission time, and the client launches a Vim session with a stock `.vimrc` so plugins and personal mappings are unavailable. Every keystroke, including motions, normal mode commands, insertions, and the final save and quit (for example `:wq` or `ZZ`), counts toward the score. As of 2026 the site lists more than 600 active challenges and roughly half a million submitted entries from tens of thousands of registered golfers, which is what gives vimgolf-gym its raw material. [3] [4]

Typical challenges include reformatting tabular data, sorting lines, swapping words, transforming JSON or CSV layouts, or rewriting variable names across a small file. Many reference solutions run under twenty keystrokes, so a model that emits even a few stray characters loses to a strong human golfer by a wide margin. That gap is the signal vimgolf-gym is built to measure.

| Aspect | VimGolf game | vimgolf-gym benchmark |
| --- | --- | --- |
| Creator | Ilya Grigorik | James Brown / Cybergod AGI Research |
| First release | 2010 | 2025 |
| Player | Human | LLM, [agent](/wiki/ai_agent), or RL policy |
| Interface | Ruby CLI plus website | Python `gym.make` style API plus Docker image |
| Scoring | Keystrokes (lowest wins) | Keystrokes plus relative inverse score plus correctness |
| Source code | [igrigorik/vimgolf](https://github.com/igrigorik/vimgolf) | [James4Ever0/vimgolf-gym](https://github.com/James4Ever0/vimgolf-gym) |
| License | MIT | The Unlicense |

## Why use VimGolf as an AI benchmark

Most agent benchmarks for [coding](/wiki/code_generation) reward models for producing a final program that passes tests. VimGolf rewards something different: precise, efficient text manipulation. The only output that counts is the sequence of bytes the agent sends to a [terminal](/wiki/terminal) running Vim, which makes VimGolf a useful complement to benchmarks like [SWE-bench](/wiki/swe_bench), [HumanEval](/wiki/humaneval), and [Terminal-Bench](/wiki/terminal_bench). It also lines up with what a [coding agent](/wiki/coding_agent) actually does when it edits a file.

A few things make the task hard for current models. Vim has multiple modes (normal, insert, visual, command line, replace, operator pending), and the meaning of every key depends on the active mode, so mode confusion cascades into garbled output. Short solutions rely on motions and operators that a model has to plan jointly, for example `ci"` to change the text inside double quotes, or `:%s/\v(\w+)_(\w+)/\u\1\u\2/g` to turn a snake case name such as `foo_bar` into `FooBar`. Solutions are scored keystroke by keystroke, so a single extra keystroke worsens the score even when the output is correct. And the agent never sees the effect of its own typing unless it explicitly pulls the buffer state.
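To make the substitution in the paragraph above concrete, here is a rough Python analogue of `:%s/\v(\w+)_(\w+)/\u\1\u\2/g`. It is only a sketch: the names `upper_first` and `vim_like_substitute` are made up for illustration, and it assumes Python's `re` module backtracks the same way Vim's engine does for this particular pattern.

```python
import re


def upper_first(text: str) -> str:
    # Uppercase only the first character, which is what \u does
    # to the group that follows it in a Vim replacement string.
    return text[:1].upper() + text[1:]


def vim_like_substitute(line: str) -> str:
    # Greedy (\w+)_(\w+): group 1 grabs as much as it can, then backs off
    # until an underscore follows it.
    return re.sub(
        r"(\w+)_(\w+)",
        lambda m: upper_first(m.group(1)) + upper_first(m.group(2)),
        line,
    )


print(vim_like_substitute("foo_bar = 1"))      # one underscore
print(vim_like_substitute("foo_bar_baz = 1"))  # two underscores
```

The first call prints `FooBar = 1`. The second prints `Foo_barBaz = 1`, because the greedy first group swallows `foo_bar`. This kind of interaction between pattern and input is what a model has to anticipate before it commits to a keystroke sequence.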

## Architecture

vimgolf-gym is structured like a small reinforcement learning environment. The Python package exposes a `make()` factory that returns an environment object whose state is a live Vim process, usually launched inside a Docker container for reproducibility.

| Component | Description |
| --- | --- |
| Challenge environment | Vim instance with the challenge input loaded into a buffer and the expected output retained for verification |
| Local challenge dataset | Cached VimGolf challenges stored at `~/.cache/cybergod-vimgolf-challenges/` |
| Docker execution | Containerized Vim environment available as `agile4im/cybergod_vimgolf_gym`, used to isolate runs and freeze versions |
| Evaluation system | Replays the agent keystrokes, compares the resulting buffer to the target output, and reports the keystroke count |
| Screenshot module | Captures the current terminal as a PIL image, useful for vision language models or for human debugging |

### Challenge types

The package supports four challenge identifiers:

| Type | Identifier format | Description |
| --- | --- | --- |
| Test challenge | `vimgolf-test` | Built in "hello world" smoke test used to validate an installation |
| Local challenge | `vimgolf-local-<challenge_id>` | A challenge from the cached HuggingFace dataset |
| Online challenge | `vimgolf-online-<challenge_id>` | A challenge fetched directly from vimgolf.com |
| Custom challenge | `vimgolf-custom` | A user defined challenge supplied as YAML |

## Usage

### Python API

The environment uses a small, Gym style surface. An agent calls `act` to send key bytes, can optionally inspect the buffer or take a screenshot, and finally calls a verification helper that replays the full key sequence and grades it.

```python
import vimgolf_gym

# Create an environment for the smoke test challenge
env = vimgolf_gym.make("vimgolf-test")

# Send keystrokes to the running Vim instance
env.act("ihello world\n")

# Inspect the current buffer state
buffer = env.buffer

# Optional: take a screenshot of the terminal as a PIL image
img = env.screenshot()

# Verify a full solution sequence in VimGolf notation
success = env.verify_keys("ihello world<NL>hello world<Esc>:wq<NL>")

# Compute a normalized score relative to the worst public solution
relative_score = env.calculate_relative_inverse_score(score=100)
```


### Custom challenges

The `vimgolf-custom` mode accepts a YAML document. This is useful for evaluating a model on private challenges or on a fixed test set that mirrors real internal codebases.

```yaml
input: |
   The second line
   The first line

output: |
   The first line
   The second line

name: Swap lines
description: Swap the first and second lines of the input
solution: null
```

### Installation

vimgolf-gym ships through three channels. PyPI is the simplest, the Git install is useful when running off main, and the Docker image is the fastest way to get a clean Vim environment without touching the host.

```bash
# PyPI
pip install vimgolf-gym

# GitHub (latest commit)
pip install git+https://github.com/James4Ever0/vimgolf-gym.git

# Docker image
docker pull agile4im/cybergod_vimgolf_gym
```

The optional benchmark extras pin `litellm==1.76.2` for model inference and `vimgolf==0.1.1` for the upstream evaluation utilities, so the same versions can be reproduced across runs. [1]

## Evaluation methodology

### Metrics

vimgolf-gym reports three primary metrics. Keystroke score is the same metric the human leaderboard uses. Relative inverse score divides the keystroke count of the worst public human solution by the agent's keystroke count, so a value of 1.0 means the agent matched that worst solution and values above 1.0 mean it used fewer keystrokes. Success rate is the binary check that the buffer matches the target output exactly.

| Metric | Description | Formula |
| --- | --- | --- |
| Keystroke score | Raw keystroke count for a successful solution | Lower is better |
| Relative inverse score | Performance relative to the worst public human solution | estimated_worst_solution_score / agent_score |
| Success rate | Binary completion check | Buffer equals expected output |
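The relative inverse score formula in the table is a single division. The standalone function below is an illustrative sketch of that formula, not the package's own `calculate_relative_inverse_score` method, and its parameter names are assumptions.

```python
def relative_inverse_score(worst_solution_score: int, agent_score: int) -> float:
    """Return worst_solution_score / agent_score (higher is better)."""
    if agent_score <= 0:
        raise ValueError("agent_score must be a positive keystroke count")
    return worst_solution_score / agent_score


# An agent that needs 50 keystrokes where the worst public solution used 100.
print(relative_inverse_score(worst_solution_score=100, agent_score=50))
```

Halving the keystroke count doubles the score, and an agent that needs more keystrokes than the worst public solution lands below 1.0.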

### Keystroke budget

In the [Inspect AI](/wiki/inspect_ai) port of the benchmark (see below), the model is given a keystroke budget equal to the number of characters in the target output. This rules out the trivial solution in which the agent simply types the whole target text in insert mode, since the closing `<Esc>:wq` and any movement commands would push the total over the cap. It also keeps comparisons honest: a solution that uses more keystrokes than the output length is automatically rejected. [5]
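The budget rule can be sketched as a keystroke counter plus a length comparison. This is a simplified illustration rather than the Inspect AI implementation. It assumes that every `<Name>` token (such as `<Esc>` or `<NL>`) is one keystroke and every other character is one keystroke, and the function names are hypothetical.

```python
import re

# Assumption: special keys are written as <Name>, e.g. <Esc>, <NL>, <C-V>.
# Anything else is counted one character at a time.
KEY_TOKEN = re.compile(r"<[A-Za-z][A-Za-z0-9-]*>|.", re.DOTALL)


def count_keystrokes(keys: str) -> int:
    """Count keystrokes in a VimGolf style key string."""
    return len(KEY_TOKEN.findall(keys))


def within_budget(keys: str, target_output: str) -> bool:
    """True when the key count does not exceed the target output length."""
    return count_keystrokes(keys) <= len(target_output)


target = "hello world\n"
naive = "ihello world<Esc>:wq<NL>"  # type the text, then save and quit
print(count_keystrokes(naive), len(target), within_budget(naive, target))
```

Typing the text verbatim already spends roughly one keystroke per output character, so entering insert mode and saving pushes the naive attempt over the cap.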

### Inspect AI integration

The `vimgolf_challenges` task in the open source [Inspect AI evaluation framework](/wiki/inspect_ai) was contributed by James4Ever0 and uses the same 612 challenge HuggingFace dataset. It runs each challenge inside a Docker sandbox, applies the keystroke budget rule, and grades correctness against the target output. The Inspect AI page reports a sample run for `ollama/gpt-oss:20b` at 11.8% accuracy (72 of 612 challenges) with an average completion time of about 1.785 minutes per task and a standard error of 0.013 on accuracy. That number is illustrative rather than canonical: the task is configurable for model, temperature, parallelism, and task limits. [5]
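The reported accuracy and standard error are consistent with a plain binomial estimate, as the arithmetic below shows. This is only a sanity check on the published numbers. How Inspect AI computes its standard error internally is not stated here, so the formula is an assumption.

```python
import math

solved, total = 72, 612

# Accuracy is the fraction of challenges solved.
accuracy = solved / total

# Binomial standard error of a proportion: sqrt(p * (1 - p) / n).
stderr = math.sqrt(accuracy * (1 - accuracy) / total)

print(f"accuracy = {accuracy:.1%}")
print(f"stderr   = {stderr:.3f}")
```

With 72 of 612 solved this gives 11.8% accuracy and a standard error of 0.013, matching the figures quoted above.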

A separate community port by GitHub user `bicyclespokesperson`, called `vim_golf_benchmark`, runs three hand picked starter challenges (delete first line, swap two words, CSV to pipe conversion) through Neovim and supports both Ollama models and Anthropic [Claude](/wiki/claude) models such as Claude 3.5 Sonnet and Claude 3.5 Haiku, comparing accuracy and average keystrokes against known optimal solutions. The full numbers live in the project's REPORT.md. [6]

## Dataset

The HuggingFace dataset `James4Ever0/vimgolf_challenges_and_solutions` is the canonical source for both vimgolf-gym and the Inspect AI port. It contains the 612 public challenges scraped from vimgolf.com, organized as one folder per challenge hash. Each folder holds three files: `metadata.json` (title, detail, URL, hash), `challenge.json` (input, output, client version), and `worst_solution.json` (the highest scoring, and therefore worst, public solution along with its parsed header). The dataset is released under The Unlicense, and total file size is around 2.12 MB. [2] [7]

| Field family | Examples | Notes |
| --- | --- | --- |
| Metadata | `href`, `title`, `detail`, `challenge_hash` | Used to identify and look up challenges |
| Challenge body | `input`, `output`, `client` | The actual task and the version of the VimGolf client used |
| Worst solution | `rank`, `solution`, `header` | Anchor for the relative inverse score |
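Given the folder layout described above, one challenge can be read with the standard library alone. The loader below is a minimal sketch: `load_challenge` is a hypothetical helper, it assumes the dataset has already been downloaded to a local folder, and it makes no assumptions about the keys inside each JSON file.

```python
import json
from pathlib import Path

# The three per-challenge files named in the dataset description.
CHALLENGE_FILES = ("metadata.json", "challenge.json", "worst_solution.json")


def load_challenge(folder: Path) -> dict:
    """Read one challenge folder into a dict keyed by file stem."""
    record = {}
    for filename in CHALLENGE_FILES:
        with open(folder / filename, encoding="utf-8") as handle:
            record[Path(filename).stem] = json.load(handle)
    return record
```

Calling `load_challenge(Path(dataset_root) / challenge_hash)` returns a dict with the keys `metadata`, `challenge`, and `worst_solution`, where `dataset_root` and `challenge_hash` are whatever local path and folder name the download produced.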

The "worst public solution" anchor matters because VimGolf normally hides solutions until a player submits an attempt, and only exposes the bottom 20% of public submissions to logged in users. Using that worst solution as the denominator gives every model a real human baseline, even if it is far from the optimum. [3] [4]

## Use cases

vimgolf-gym has shown up in three kinds of projects: straight model evaluation across the 612 challenge set, agent training where the environment plays the role of a verifier or [reward model](/wiki/reward_modeling) for a [reinforcement learning](/wiki/reinforcement_learning) loop, and integration testing for [computer use](/wiki/computer_use_agent) and terminal agent stacks where VimGolf challenges act as a regression suite. The Cybergod AGI roadmap places vimgolf-gym alongside CTF-Gym for security tasks and a planned Cybergod-Gym for end to end economic agents. [1]

## Limitations

A few caveats apply when using vimgolf-gym. The 612 challenge set is small compared to benchmarks like SWE-bench, so variance between runs can be high for models that solve only a handful of tasks. Some challenges depend on Vim's pattern engine (which differs from PCRE), penalizing models exposed only to standard regular expressions. The keystroke score also rewards trickery as much as understanding: a model that has memorized idiomatic Vim incantations may outperform a stronger reasoner that produces longer but correct sequences. Single turn evaluation does not reflect how a real coding agent would use Vim either, since it never lets the model observe an intermediate buffer state, although the native Python API supports interactive use.

## Related projects

| Project | Relationship |
| --- | --- |
| [igrigorik/vimgolf](https://github.com/igrigorik/vimgolf) | The original Ruby CLI and website code that has powered VimGolf since 2010 |
| [vimgolf](https://pypi.org/project/vimgolf/) PyPI package | Python reimplementation of the VimGolf client used internally by vimgolf-gym |
| [Inspect AI vimgolf_challenges](https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/vimgolf_challenges/) | Single turn dialogue evaluation built on the same dataset |
| [bicyclespokesperson/vim_golf_benchmark](https://github.com/bicyclespokesperson/vim_golf_benchmark) | Community Claude and Ollama benchmark using three starter challenges |
| [Terminal-Bench](/wiki/terminal_bench) | Broader terminal agent benchmark that vimgolf-gym complements |
| [SWE-bench](/wiki/swe_bench) | Code editing benchmark on real GitHub issues; different scale and granularity |
| [HumanEval](/wiki/humaneval) | Function level code generation benchmark with no editor in the loop |
| [vimgolf.el](https://github.com/timvisher/vimgolf.el) | Emacs interface for the same challenges, reflecting the cross editor culture VimGolf seeded |

## License

vimgolf-gym is released under [The Unlicense](https://unlicense.org), a public domain dedication. The HuggingFace dataset is released under the same terms. The original VimGolf game and Ruby client are MIT licensed. [1] [2] [4]

## See also

- [Vim](/wiki/vim)
- [OpenAI Gym](/wiki/openai_gym)
- [Inspect AI](/wiki/inspect_ai)
- [Terminal-Bench](/wiki/terminal_bench)
- [SWE-bench](/wiki/swe_bench)
- [Coding agent](/wiki/coding_agent)
- [Computer use agent](/wiki/computer_use_agent)
- [Large language model](/wiki/large_language_model)

## References

1. James4Ever0. "vimgolf-gym: OpenAI gym style Vimgolf environment and benchmark for AI." GitHub. https://github.com/James4Ever0/vimgolf-gym
2. James4Ever0. "vimgolf_challenges_and_solutions." HuggingFace Datasets. https://huggingface.co/datasets/James4Ever0/vimgolf_challenges_and_solutions
3. VimGolf. "Real Vim ninjas count every keystroke." https://www.vimgolf.com/
4. Grigorik, Ilya. "igrigorik/vimgolf: Real Vim ninjas count every keystroke - do you?" GitHub. https://github.com/igrigorik/vimgolf
5. UK AI Safety Institute. "VimGolf: Evaluating LLMs in Vim Editing Proficiency." Inspect Evals documentation. https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/vimgolf_challenges/
6. bicyclespokesperson. "vim_golf_benchmark: Are LLMs any good at vim golf?" GitHub. https://github.com/bicyclespokesperson/vim_golf_benchmark
7. James4Ever0. "agi_computer_control: scrape_vimgolf_challenges_and_solutions." GitHub. https://github.com/James4Ever0/agi_computer_control/tree/master/scrape_vimgolf_challenges_and_solutions
8. Benchflow. "James4ever0/Vimgolf benchmark." https://www.benchflow.ai/benchmarks/James4ever0/Vimgolf
9. PyPI. "vimgolf-gym package." https://pypi.org/project/vimgolf-gym/
10. Docker Hub. "agile4im/cybergod_vimgolf_gym image." https://hub.docker.com/r/agile4im/cybergod_vimgolf_gym

