Vimgolf
Last reviewed
May 10, 2026
Sources
10 citations
Review status
Source-backed
Revision
v2 · 2,370 words
| vimgolf-gym | |
|---|---|
| Overview | |
| Full name | Vimgolf Gym Environment |
| Abbreviation | vimgolf-gym |
| Description | OpenAI Gym style customizable environment and benchmark for VimGolf challenges, used to evaluate AI agents and large language models on Vim text editing tasks |
| Latest version | 0.1.1 |
| Authors | James Brown (James4Ever0) |
| Organization | Cybergod AGI Research |
| First release | 2025 |
| Technical Details | |
| Type | Vim editing challenge evaluation |
| Modality | Text editing, Vim commands, terminal |
| Task format | VimGolf challenges (input text to output text via keystrokes) |
| Evaluation metric | Keystroke count (lower is better), relative inverse score, accuracy |
| Dataset size | 612 public VimGolf challenges scraped from vimgolf.com |
| Domains | Text editing, Vim proficiency, terminal interaction |
| Languages | Python (primary), Rust |
| Performance | |
| Saturated | False |
| Reported example result | ollama/gpt-oss:20b at 11.8% accuracy (72/612) on the Inspect AI single-turn variant |
| Resources | |
| Website | Official site |
| GitHub | Repository |
| Dataset | HuggingFace |
| Inspect AI eval | vimgolf_challenges |
| License | The Unlicense (public domain dedication) |
vimgolf-gym is an OpenAI Gym-style customizable environment and benchmark built around VimGolf, the long-running keystroke-counting puzzle game for Vim. The package wraps a real Vim instance, replays keystrokes against fixed input and output pairs, and reports whether a large language model or scripted agent produced the correct output and how many keystrokes it used. It is part of the Cybergod AGI Research stack created by James Brown (GitHub handle James4Ever0), alongside CTF-Gym and a planned Cybergod-Gym for autonomous economic agents. [1] [2]
VimGolf is a website (vimgolf.com) where players compete to transform a fixed input file into a fixed output file using as few Vim keystrokes as possible. It was created by Ilya Grigorik (igrigorik on GitHub) as a holiday project in 2010, with the first tagged release of the companion vimgolf Ruby gem dated December 30, 2010. The site uses the slogan "Real Vim ninjas count every keystroke," and the open source command line client lives in the igrigorik/vimgolf repository under the MIT License. [3] [4]
The rules are deliberately spare. Anyone can submit a challenge, entries are ranked by total keystrokes (lowest score wins) with ties broken by submission time, and the client launches a Vim session with a stock .vimrc so personal plugins and custom mappings cannot be used. Every keystroke, including motions, normal mode commands, insertions, and the final :wq, counts toward the score. As of 2026 the site lists more than 600 active challenges and roughly half a million submitted entries from tens of thousands of registered golfers, which is what gives vimgolf-gym its raw material. [3] [4]
Typical challenges include reformatting tabular data, sorting lines, swapping words, transforming JSON or CSV layouts, or rewriting variable names across a small file. Many reference solutions run under twenty keystrokes, so a model that emits even a few stray characters loses to a strong human golfer by a wide margin. That gap is the signal vimgolf-gym is built to measure.
| Aspect | VimGolf game | vimgolf-gym benchmark |
|---|---|---|
| Creator | Ilya Grigorik | James Brown / Cybergod AGI Research |
| First release | 2010 | 2025 |
| Player | Human | LLM, agent, or RL policy |
| Interface | Ruby CLI plus website | Python gym.make style API plus Docker image |
| Scoring | Keystrokes (lowest wins) | Keystrokes plus relative inverse score plus correctness |
| Source code | igrigorik/vimgolf | James4Ever0/vimgolf-gym |
| License | MIT | The Unlicense |
Most agent benchmarks for coding reward models for producing a final program that passes tests. VimGolf rewards something different: precise, efficient text manipulation. The only output that counts is the sequence of bytes the agent sends to a terminal running Vim, which makes VimGolf a useful complement to benchmarks like SWE-bench, HumanEval, and Terminal-Bench. It also lines up with what a coding agent actually does when it edits a file.
A few things make the task hard for current models. Vim has multiple modes (normal, insert, visual, command line, replace, operator pending), and the meaning of every key depends on the active mode, so mode confusion cascades into garbled output. Short solutions rely on motions and operators that a model has to plan jointly, for example ci" to change the text inside double quotes, or :%s/\v(\w+)_(\w+)/\u\1\u\2/g to convert snake case to upper camel case. Solutions are scored keystroke by keystroke, so a single extra keystroke worsens the score even when the output is correct. And the agent never sees the effect of its own typing unless it explicitly pulls the buffer state.
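To make the substitution above concrete, the following Python sketch mirrors what `:%s/\v(\w+)_(\w+)/\u\1\u\2/g` does to simple identifiers. It is an illustration written with Python's `re` module and is not part of vimgolf-gym; the function name `snake_to_camel` is hypothetical, and the Python pattern only approximates Vim's very magic pattern.

```python
import re


def snake_to_camel(text: str) -> str:
    # Hypothetical helper: approximates the Vim substitution for
    # identifiers that contain a single underscore.
    def upper_first(word: str) -> str:
        # Vim's \u uppercases only the next character of the replacement.
        return word[:1].upper() + word[1:]

    return re.sub(
        r"(\w+)_(\w+)",
        lambda m: upper_first(m.group(1)) + upper_first(m.group(2)),
        text,
    )


print(snake_to_camel("user_name = load_config()"))  # UserName = LoadConfig()
```

The whole transformation costs a few dozen keystrokes in Vim, which is why a model has to plan the pattern and the replacement together instead of editing each identifier by hand.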
vimgolf-gym is structured like a small reinforcement learning environment. The Python package exposes a make() factory that returns an environment object whose state is a live Vim process, usually launched inside a Docker container for reproducibility.
| Component | Description |
|---|---|
| Challenge environment | Vim instance with the challenge input loaded into a buffer and the expected output retained for verification |
| Local challenge dataset | Cached VimGolf challenges stored at ~/.cache/cybergod-vimgolf-challenges/ |
| Docker execution | Containerized Vim environment available as agile4im/cybergod_vimgolf_gym, used to isolate runs and freeze versions |
| Evaluation system | Replays the agent keystrokes, compares the resulting buffer to the target output, and reports the keystroke count |
| Screenshot module | Captures the current terminal as a PIL image, useful for vision language models or for human debugging |
The package supports four challenge identifiers:
| Type | Identifier format | Description |
|---|---|---|
| Test challenge | vimgolf-test | Built in "hello world" smoke test used to validate an installation |
| Local challenge | vimgolf-local-<challenge_id> | A challenge from the cached HuggingFace dataset |
| Online challenge | vimgolf-online-<challenge_id> | A challenge fetched directly from vimgolf.com |
| Custom challenge | vimgolf-custom | A user defined challenge supplied as YAML |
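The identifier formats in the table can be told apart with plain string matching. The sketch below is a hypothetical helper, not a function from the vimgolf-gym package, and the challenge ID in the example is a made-up placeholder.

```python
from typing import Optional, Tuple


def classify_challenge_id(env_id: str) -> Tuple[str, Optional[str]]:
    """Return (kind, challenge_id); challenge_id is None for test and custom."""
    if env_id == "vimgolf-test":
        return ("test", None)
    if env_id == "vimgolf-custom":
        return ("custom", None)
    for kind in ("local", "online"):
        prefix = f"vimgolf-{kind}-"
        # Require a non-empty challenge ID after the prefix.
        if env_id.startswith(prefix) and len(env_id) > len(prefix):
            return (kind, env_id[len(prefix):])
    raise ValueError(f"unrecognized challenge identifier: {env_id!r}")


print(classify_challenge_id("vimgolf-local-0123abcd"))  # ('local', '0123abcd')
```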
The environment uses a small, Gym style surface. An agent calls act to send key bytes, can optionally inspect the buffer or take a screenshot, and finally calls a verification helper that replays the full key sequence and grades it.
```python
import vimgolf_gym

# Create an environment for the smoke test challenge
env = vimgolf_gym.make("vimgolf-test")

# Send keystrokes to the running Vim instance
env.act("ihello world\n")

# Inspect the current buffer state
buffer = env.buffer

# Optional: take a screenshot of the terminal as a PIL image
img = env.screenshot()

# Verify a full solution sequence in VimGolf notation
success = env.verify_keys("ihello world<NL>hello world<Esc>:wq<NL>")

# Compute a normalized score relative to the worst public solution
relative_score = env.calculate_relative_inverse_score(score=100)
```
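The solution string passed to the verification helper uses angle-bracketed names such as `<Esc>` and `<NL>` for special keys. As a rough sketch of how such a string maps to a keystroke count, the snippet below treats every bracketed name as one keystroke and every other character as one keystroke. This tokenization is an assumption made for illustration; the package and the VimGolf site may count some keys differently, and the helper names are hypothetical.

```python
import re
from typing import List

# One token is either an angle-bracketed key name such as <Esc> or <NL>,
# or any single character (including a literal newline).
_KEY_TOKEN = re.compile(r"<[A-Za-z][A-Za-z0-9-]*>|.", re.DOTALL)


def split_keys(keys: str) -> List[str]:
    """Split a VimGolf-style key string into individual keystrokes."""
    return _KEY_TOKEN.findall(keys)


def count_keystrokes(keys: str) -> int:
    """Count keystrokes under the one-token-per-key assumption."""
    return len(split_keys(keys))


solution = "ihello world<NL>hello world<Esc>:wq<NL>"
print(count_keystrokes(solution))  # 29
```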
The vimgolf-custom mode accepts a YAML document. This is useful for evaluating a model on private challenges or on a fixed test set that mirrors real internal codebases.
```yaml
input: |
  The second line
  The first line
output: |
  The first line
  The second line
name: Swap lines
description: Swap the first and second lines of the input
solution: null
```
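For a private test set it can be convenient to generate these documents programmatically. The following sketch renders the five fields shown above using only the standard library. The function name `render_custom_challenge` is hypothetical, and the layout assumes the custom mode needs nothing beyond these fields.

```python
import json
import textwrap


def render_custom_challenge(input_text, output_text, name, description):
    """Render a custom challenge as YAML text with block scalars."""

    def block(text):
        # Block scalars need every content line indented under the "|".
        body = text if text.endswith("\n") else text + "\n"
        return textwrap.indent(body, "  ", lambda line: True)

    return (
        "input: |\n" + block(input_text)
        + "output: |\n" + block(output_text)
        # JSON strings are also valid YAML double-quoted scalars.
        + "name: " + json.dumps(name) + "\n"
        + "description: " + json.dumps(description) + "\n"
        + "solution: null\n"
    )


print(render_custom_challenge(
    "The second line\nThe first line\n",
    "The first line\nThe second line\n",
    "Swap lines",
    "Swap the first and second lines of the input",
))
```

One limitation of this sketch: text whose first line starts with spaces would need an explicit YAML indentation indicator, which the helper does not emit.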
vimgolf-gym ships through three channels. PyPI is the simplest, the Git install is useful when running off main, and the Docker image is the fastest way to get a clean Vim environment without touching the host.
```shell
# PyPI
pip install vimgolf-gym

# GitHub (latest commit)
pip install git+https://github.com/James4Ever0/vimgolf-gym.git

# Docker image
docker pull agile4im/cybergod_vimgolf_gym
```
The optional benchmark extras pin litellm==1.76.2 for model inference and vimgolf==0.1.1 for the upstream evaluation utilities, so the same versions can be reproduced across runs. [1]
vimgolf-gym reports three primary metrics. Keystroke score is the same metric the human leaderboard uses. Relative inverse score is normalized against the worst public solution: a value close to 1.0 means the agent matched that solution's keystroke count, while a value above 1.0 means it used fewer keystrokes than that solution. Success rate is the binary check that the buffer matches the target output exactly.
| Metric | Description | Formula |
|---|---|---|
| Keystroke score | Raw keystroke count for a successful solution | Lower is better |
| Relative inverse score | Performance relative to the worst public human solution | estimated_worst_solution_score / agent_score |
| Success rate | Binary completion check | Buffer equals expected output |
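The formulas in the table are simple enough to restate as code. This is a minimal sketch of the arithmetic only, with hypothetical function names; it is not the package's own implementation.

```python
def relative_inverse_score(worst_solution_score: int, agent_score: int) -> float:
    """Worst public keystroke count divided by the agent's keystroke count."""
    if worst_solution_score <= 0 or agent_score <= 0:
        raise ValueError("keystroke counts must be positive")
    return worst_solution_score / agent_score


def success_rate(results) -> float:
    """Fraction of challenges whose final buffer matched the expected output."""
    results = list(results)
    if not results:
        return 0.0
    return sum(1 for ok in results if ok) / len(results)


# An agent that needs 50 keystrokes where the worst public solution needs 100.
print(relative_inverse_score(100, 50))           # 2.0
print(success_rate([True, False, False, True]))  # 0.5
```

Because the agent's keystroke count is the denominator, halving the keystrokes doubles the score.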
In the Inspect AI port of the benchmark (see below), the model is given a keystroke budget equal to the number of characters in the target output. This rules out trivial solutions where the agent simply retypes the output text in insert mode, since the closing <Esc>:wq and any movement commands would push the total over the cap. It also keeps comparisons honest: a solution that uses more keystrokes than the output length is automatically rejected. [5]
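The budget rule can be sketched in a few lines. The helper names below are hypothetical. The naive cost is only a rough lower bound: it ignores the keystrokes needed to delete the original input and assumes the final newline is not typed.

```python
def keystroke_budget(target_output: str) -> int:
    """Budget described above: one keystroke per character of the target."""
    return len(target_output)


def naive_retype_cost(target_output: str) -> int:
    """Rough cost of retyping the target in insert mode and saving."""
    typed = len(target_output.rstrip("\n"))
    # "i" to enter insert mode, the text, <Esc>, ":wq" (3 keys), then <NL>.
    return 1 + typed + 1 + 3 + 1


def within_budget(keystroke_count: int, target_output: str) -> bool:
    return keystroke_count <= keystroke_budget(target_output)


target = "hello world\n"
print(keystroke_budget(target))                          # 12
print(naive_retype_cost(target))                         # 17
print(within_budget(naive_retype_cost(target), target))  # False
```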
The vimgolf_challenges task in the open source Inspect AI evaluation framework was contributed by James4Ever0 and uses the same 612 challenge HuggingFace dataset. It runs each challenge inside a Docker sandbox, applies the keystroke budget rule, and grades correctness against the target output. The Inspect AI page reports a sample run for ollama/gpt-oss:20b at 11.8% accuracy (72 of 612 challenges) with an average completion time of about 1.785 minutes per task and a standard error of 0.013 on accuracy. That number is illustrative rather than canonical: the task is configurable for model, temperature, parallelism, and task limits. [5]
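The reported figures are consistent with one another. As a quick check, the sketch below recomputes the accuracy and a binomial standard error from the 72 of 612 result. The binomial formula is an assumption about how the standard error was derived, not a statement of what Inspect AI computes internally.

```python
import math

solved, total = 72, 612
accuracy = solved / total

# Standard error of a proportion under a binomial model.
stderr = math.sqrt(accuracy * (1 - accuracy) / total)

print(f"accuracy={accuracy:.3f} stderr={stderr:.3f}")  # accuracy=0.118 stderr=0.013
```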
A separate community port by GitHub user bicyclespokesperson, called vim_golf_benchmark, runs three hand picked starter challenges (delete first line, swap two words, CSV to pipe conversion) through Neovim and supports both Ollama models and Anthropic Claude models such as Claude 3.5 Sonnet and Claude 3.5 Haiku, comparing accuracy and average keystrokes against known optimal solutions. The full numbers live in the project's REPORT.md. [6]
The HuggingFace dataset James4Ever0/vimgolf_challenges_and_solutions is the canonical source for both vimgolf-gym and the Inspect AI port. It contains the 612 public challenges scraped from vimgolf.com, organized one folder per challenge hash, with three files inside: metadata.json (title, detail, URL, hash), challenge.json (input, output, client version), and worst_solution.json (the public solution with the highest keystroke count, plus its parsed header). The dataset is released under The Unlicense, and total file size is around 2.12 MB. [2] [7]
| Field family | Examples | Notes |
|---|---|---|
| Metadata | href, title, detail, challenge_hash | Used to identify and look up challenges |
| Challenge body | input, output, client | The actual task and the version of the VimGolf client used |
| Worst solution | rank, solution, header | Anchor for the relative inverse score |
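Given the one-folder-per-challenge layout, a single challenge can be read with the standard library alone. The sketch below is a hypothetical loader, not part of vimgolf-gym. It relies only on the three file names listed above and makes no assumption about how fields are nested inside each JSON file.

```python
import json
from pathlib import Path

CHALLENGE_FILES = ("metadata.json", "challenge.json", "worst_solution.json")


def load_challenge(folder):
    """Load the three JSON files of one challenge folder into a dict."""
    folder = Path(folder)
    record = {}
    for filename in CHALLENGE_FILES:
        with open(folder / filename, encoding="utf-8") as handle:
            # Key each parsed file by its stem, e.g. "worst_solution".
            record[Path(filename).stem] = json.load(handle)
    return record
```

Calling `load_challenge` on a challenge folder from a local copy of the dataset returns a dictionary with the keys `metadata`, `challenge`, and `worst_solution`.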
The "worst public solution" anchor matters because VimGolf normally hides solutions until a player submits an attempt, and only exposes the bottom 20% of public submissions to logged in users. Using that worst solution as the denominator gives every model a real human baseline, even if it is far from the optimum. [3] [4]
vimgolf-gym has shown up in three kinds of projects: straight model evaluation across the 612 challenge set, agent training where the environment plays the role of a verifier or reward model for a reinforcement learning loop, and integration testing for computer use and terminal agent stacks where VimGolf challenges act as a regression suite. The Cybergod AGI roadmap places vimgolf-gym alongside CTF-Gym for security tasks and a planned Cybergod-Gym for end to end economic agents. [1]
A few caveats apply when using vimgolf-gym. The 612 challenge set is small compared to benchmarks like SWE-bench, so variance between runs can be high for models that solve only a handful of tasks. Some challenges depend on Vim's pattern engine (which differs from PCRE), penalizing models exposed only to standard regular expressions. The keystroke score also rewards trickery as much as understanding: a model that has memorized idiomatic Vim incantations may outperform a stronger reasoner that produces longer but correct sequences. Single turn evaluation does not reflect how a real coding agent would use Vim either, since it never lets the model observe an intermediate buffer state, although the native Python API supports interactive use.
| Project | Relationship |
|---|---|
| igrigorik/vimgolf | The original Ruby CLI and website that VimGolf has run on since 2010 |
| vimgolf PyPI package | Python reimplementation of the VimGolf client used internally by vimgolf-gym |
| Inspect AI vimgolf_challenges | Single turn dialogue evaluation built on the same dataset |
| bicyclespokesperson/vim_golf_benchmark | Community Claude and Ollama benchmark using three starter challenges |
| Terminal-Bench | Broader terminal agent benchmark that vimgolf-gym complements |
| SWE-bench | Code editing benchmark on real GitHub issues; different scale and granularity |
| HumanEval | Function level code generation benchmark with no editor in the loop |
| vimgolf.el | Emacs interface for the same challenges, reflecting the cross editor culture VimGolf seeded |
vimgolf-gym is released under The Unlicense, a public domain dedication. The HuggingFace dataset is released under the same terms. The original VimGolf game and Ruby client are MIT licensed. [1] [2] [4]