Vimgolf

vimgolf-gym
Overview
Full name	Vimgolf Gym Environment
Abbreviation	vimgolf-gym
Description	OpenAI Gym style customizable environment and benchmark for VimGolf challenges, used to evaluate AI agents and large language models on Vim text editing tasks
Latest version	0.1.1
Authors	James Brown (James4Ever0)
Organization	Cybergod AGI Research
First release	2025
Technical Details
Type	Vim editing challenge evaluation
Modality	Text editing, Vim commands, terminal
Task format	VimGolf challenges (input text to output text via keystrokes)
Evaluation metric	Keystroke count (lower is better), relative inverse score, accuracy
Dataset size	612 public VimGolf challenges scraped from vimgolf.com
Domains	Text editing, Vim proficiency, terminal interaction
Languages	Python (primary), Rust
Performance
Saturated	False
Reported example result	ollama/gpt-oss:20b at 11.8% accuracy (72/612) on the Inspect AI single-turn variant
Resources
Website	Official site
GitHub	Repository
Dataset	HuggingFace
Inspect AI eval	vimgolf_challenges
License	The Unlicense (public domain dedication)

vimgolf-gym is an OpenAI Gym style customizable environment and benchmark built around VimGolf, the long running keystroke counting puzzle game for Vim. The package wraps a real Vim instance, replays keystrokes against fixed input and output pairs, and reports whether a large language model or scripted agent produced the correct output and how many keystrokes it used. It is part of the Cybergod AGI Research stack created by James Brown (GitHub handle James4Ever0), alongside CTF-Gym and a planned Cybergod-Gym for autonomous economic agents. ^[1] ^[2]

Background: VimGolf the game

VimGolf is a website (vimgolf.com) where players compete to transform a fixed input file into a fixed output file using as few Vim keystrokes as possible. It was created by Ilya Grigorik (igrigorik on GitHub) as a holiday project in 2010, with the first tagged release of the companion vimgolf Ruby gem dated December 30, 2010. The site uses the slogan "Real Vim ninjas count every keystroke," and the open source command line client lives in the igrigorik/vimgolf repository under the MIT License. ^[3] ^[4]

The rules are deliberately spare. Anyone can submit a challenge, entries are ranked by total keystrokes (lowest score wins) with ties broken by submission time, and the client launches a Vim session with a stock .vimrc so personal plugins and custom mappings cannot be used. Every keystroke, including motions, normal mode commands, insertions, and the final :wq, counts toward the score. As of 2026 the site lists more than 600 active challenges and roughly half a million submitted entries from tens of thousands of registered golfers, which is what gives vimgolf-gym its raw material. ^[3] ^[4]
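
The ranking rule can be sketched in a few lines of Python. The `Entry` record and its field names are hypothetical, chosen for illustration; they are not taken from the VimGolf site or client.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Entry:
    """Hypothetical leaderboard record; field names are illustrative only."""
    golfer: str
    keystrokes: int
    submitted_at: datetime


def rank_entries(entries):
    # Lowest keystroke count first; the earlier submission wins a tie.
    return sorted(entries, key=lambda e: (e.keystrokes, e.submitted_at))


entries = [
    Entry("alice", 14, datetime(2024, 5, 2)),
    Entry("bob", 12, datetime(2024, 5, 3)),
    Entry("carol", 12, datetime(2024, 5, 1)),
]
ranking = [e.golfer for e in rank_entries(entries)]
```

Here `carol` and `bob` tie on keystrokes, so the earlier submission places first.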

Typical challenges include reformatting tabular data, sorting lines, swapping words, transforming JSON or CSV layouts, or rewriting variable names across a small file. Many reference solutions run under twenty keystrokes, so a model that emits even a few stray characters loses to a strong human golfer by a wide margin. That gap is the signal vimgolf-gym is built to measure.

Aspect	VimGolf game	vimgolf-gym benchmark
Creator	Ilya Grigorik	James Brown / Cybergod AGI Research
First release	2010	2025
Player	Human	LLM, agent, or RL policy
Interface	Ruby CLI plus website	Python `gym.make` style API plus Docker image
Scoring	Keystrokes (lowest wins)	Keystrokes plus relative inverse score plus correctness
Source code	igrigorik/vimgolf	James4Ever0/vimgolf-gym
License	MIT	The Unlicense

Why use VimGolf as an AI benchmark

Most coding benchmarks for agents reward a model for producing a final program that passes tests. VimGolf rewards something different: precise, efficient text manipulation. The only output that counts is the sequence of keystrokes the agent sends to a terminal running Vim, which makes VimGolf a useful complement to benchmarks like SWE-bench, HumanEval, and Terminal-Bench. It also resembles what a coding agent actually does when it edits a file.

A few things make the task hard for current models. Vim has multiple modes (normal, insert, visual, command line, replace, operator pending), and the meaning of every key depends on the active mode, so mode confusion cascades into garbled output. Short solutions rely on operators and motions that a model has to plan jointly, for example ci" to change the text inside double quotes, or :%s/\v(\w+)_(\w+)/\u\1\u\2/g to turn a two-part snake case name into upper camel case (foo_bar becomes FooBar). Solutions are scored keystroke by keystroke, so a single extra keystroke worsens the score even when the output is correct. And the agent never sees the effect of its own typing unless it explicitly reads back the buffer state.
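
To make the substitute command above concrete, the sketch below reproduces its effect with Python's `re` module. This is an emulation, not Vim's pattern engine: in a Vim replacement string `\u` uppercases the next character, which the hypothetical helper `upper_first` imitates. Note that both captured halves end up capitalized.

```python
import re


def upper_first(text: str) -> str:
    # Imitates Vim's \u: uppercase only the first character.
    return text[:1].upper() + text[1:]


def snake_to_upper_camel(line: str) -> str:
    # Same pattern as the Vim command, written in Python regex syntax.
    pattern = re.compile(r"(\w+)_(\w+)")
    return pattern.sub(
        lambda m: upper_first(m.group(1)) + upper_first(m.group(2)), line
    )


result = snake_to_upper_camel("total = item_count + tax_rate")
# result == "total = ItemCount + TaxRate"
```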

Architecture

vimgolf-gym is structured like a small reinforcement learning environment. The Python package exposes a make() factory that returns an environment object whose state is a live Vim process, usually launched inside a Docker container for reproducibility.

Component	Description
Challenge environment	Vim instance with the challenge input loaded into a buffer and the expected output retained for verification
Local challenge dataset	Cached VimGolf challenges stored at `~/.cache/cybergod-vimgolf-challenges/`
Docker execution	Containerized Vim environment available as `agile4im/cybergod_vimgolf_gym`, used to isolate runs and freeze versions
Evaluation system	Replays the agent keystrokes, compares the resulting buffer to the target output, and reports the keystroke count
Screenshot module	Captures the current terminal as a PIL image, useful for vision language models or for human debugging

Challenge types

The package supports four challenge identifiers:

Type	Identifier format	Description
Test challenge	`vimgolf-test`	Built in "hello world" smoke test used to validate an installation
Local challenge	`vimgolf-local-<challenge_id>`	A challenge from the cached HuggingFace dataset
Online challenge	`vimgolf-online-<challenge_id>`	A challenge fetched directly from vimgolf.com
Custom challenge	`vimgolf-custom`	A user defined challenge supplied as YAML
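
The identifier formats in the table can be told apart with a simple prefix check. The sketch below is a hypothetical helper (`parse_challenge_identifier` is not part of vimgolf-gym, and `abc123` is a made-up challenge id) showing how the four formats split into a kind and an optional challenge id.

```python
from typing import NamedTuple, Optional


class ChallengeRef(NamedTuple):
    kind: str                    # "test", "local", "online", or "custom"
    challenge_id: Optional[str]  # only set for local and online challenges


def parse_challenge_identifier(identifier: str) -> ChallengeRef:
    # Fixed identifiers carry no challenge id.
    if identifier == "vimgolf-test":
        return ChallengeRef("test", None)
    if identifier == "vimgolf-custom":
        return ChallengeRef("custom", None)
    # Prefixed identifiers embed the challenge id after the prefix.
    for kind in ("local", "online"):
        prefix = f"vimgolf-{kind}-"
        if identifier.startswith(prefix) and len(identifier) > len(prefix):
            return ChallengeRef(kind, identifier[len(prefix):])
    raise ValueError(f"unrecognized challenge identifier: {identifier!r}")


ref = parse_challenge_identifier("vimgolf-local-abc123")
```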

Usage

Python API

The environment uses a small, Gym style surface. An agent calls act to send key bytes, can optionally inspect the buffer or take a screenshot, and finally calls a verification helper that replays the full key sequence and grades it.

```python
import vimgolf_gym

# Create an environment for the smoke test challenge
env = vimgolf_gym.make("vimgolf-test")

# Send keystrokes to the running Vim instance
env.act("ihello world\n")

# Inspect the current buffer state
buffer = env.buffer

# Optional: take a screenshot of the terminal as a PIL image
img = env.screenshot()

# Verify a full solution sequence in VimGolf notation
success = env.verify_keys("ihello world<NL>hello world<Esc>:wq<NL>")

# Compute a normalized score relative to the worst public solution
relative_score = env.calculate_relative_inverse_score(score=100)
```

Custom challenges

The vimgolf-custom mode accepts a YAML document. This is useful for evaluating a model on private challenges or on a fixed test set that mirrors real internal codebases.

```yaml
input: |
   The second line
   The first line

output: |
   The first line
   The second line

name: Swap lines
description: Swap the first and second lines of the input
solution: null
```

Installation

vimgolf-gym ships through three channels. PyPI is the simplest, the Git install is useful when running off main, and the Docker image is the fastest way to get a clean Vim environment without touching the host.

```shell
# PyPI
pip install vimgolf-gym

# GitHub (latest commit)
pip install git+https://github.com/James4Ever0/vimgolf-gym.git

# Docker image
docker pull agile4im/cybergod_vimgolf_gym
```

The optional benchmark extras pin litellm==1.76.2 for model inference and vimgolf==0.1.1 for the upstream evaluation utilities, so the same versions can be reproduced across runs. ^[1]

Evaluation methodology

Metrics

vimgolf-gym reports three primary metrics. Keystroke score is the same metric the human leaderboard uses. Relative inverse score divides the keystroke count of the worst public solution by the agent's count, so a value of 1.0 means the agent matched the worst public solution, and values above 1.0 mean it used fewer keystrokes than that solution. Success rate is the fraction of challenges on which the final buffer matches the target output exactly.

Metric	Description	Formula
Keystroke score	Raw keystroke count for a successful solution	Lower is better
Relative inverse score	Performance relative to the worst public human solution	estimated_worst_solution_score / agent_score
Success rate	Binary completion check	Buffer equals expected output
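
As a worked example of the formula in the table, the sketch below computes the relative inverse score for three hypothetical outcomes. The function name and the numbers are illustrative; the package's own `calculate_relative_inverse_score` may handle edge cases differently.

```python
def relative_inverse_score(worst_solution_score: int, agent_score: int) -> float:
    # Formula from the metrics table: worst public score divided by agent score.
    if agent_score <= 0:
        raise ValueError("agent_score must be a positive keystroke count")
    return worst_solution_score / agent_score


# Hypothetical challenge whose worst public solution took 40 keystrokes.
matched_worst = relative_inverse_score(40, 40)   # 1.0
beat_worst = relative_inverse_score(40, 16)      # 2.5
slower = relative_inverse_score(40, 100)         # 0.4
```

Because the agent's count is in the denominator, shorter solutions score higher, which is the opposite direction from the raw keystroke score.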

Keystroke budget

In the Inspect AI port of the benchmark (see below), the model is given a keystroke budget equal to the number of characters in the target output. This rules out the trivial solution in which the agent simply types the target text in insert mode, since the entry into insert mode, the closing <Esc>:wq, and any movement commands would push the total over the cap. It also keeps comparisons honest: a solution that uses more keystrokes than the output length is automatically rejected. ^[5]
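
A minimal sketch of the budget check follows. It assumes that keys written in angle-bracket notation such as `<Esc>` or `<NL>` count as one keystroke each and that every other character counts as one; the tokenizer in the Inspect AI task may differ, and `count_keystrokes` and `within_budget` are hypothetical names.

```python
import re

# Assumption: "<Name>" tokens such as <Esc> or <NL> are single keystrokes.
_KEY_TOKEN = re.compile(r"<[A-Za-z][A-Za-z0-9-]*>|.", re.DOTALL)


def count_keystrokes(keys: str) -> int:
    return len(_KEY_TOKEN.findall(keys))


def within_budget(keys: str, target_output: str) -> bool:
    # Budget rule from the text: at most len(target_output) keystrokes.
    return count_keystrokes(keys) <= len(target_output)


target = "hello world\n"
# Typing the whole target verbatim: i + text + <NL> + <Esc> + :wq + <NL>.
naive = "ihello world<NL><Esc>:wq<NL>"
naive_ok = within_budget(naive, target)  # False: 18 keystrokes vs. a budget of 12
```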

Inspect AI integration

The vimgolf_challenges task in the open source Inspect AI evaluation framework was contributed by James4Ever0 and uses the same 612 challenge HuggingFace dataset. It runs each challenge inside a Docker sandbox, applies the keystroke budget rule, and grades correctness against the target output. The Inspect AI page reports a sample run for ollama/gpt-oss:20b at 11.8% accuracy (72 of 612 challenges) with an average completion time of about 1.785 minutes per task and a standard error of 0.013 on accuracy. That number is illustrative rather than canonical: the task is configurable for model, temperature, parallelism, and task limits. ^[5]
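
The reported figures are consistent with a simple binomial calculation, shown below as a sanity check. Treating each challenge as an independent pass/fail trial is an assumption made here; the exact estimator Inspect AI uses is not stated in this article.

```python
import math

solved, total = 72, 612

accuracy = solved / total
# Binomial standard error of a proportion: sqrt(p * (1 - p) / n).
std_error = math.sqrt(accuracy * (1 - accuracy) / total)

print(f"accuracy = {accuracy:.3f}, standard error = {std_error:.3f}")
# accuracy = 0.118, standard error = 0.013
```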

A separate community port by GitHub user bicyclespokesperson, called vim_golf_benchmark, runs three hand picked starter challenges (delete first line, swap two words, CSV to pipe conversion) through Neovim and supports both Ollama models and Anthropic Claude models such as Claude 3.5 Sonnet and Claude 3.5 Haiku, comparing accuracy and average keystrokes against known optimal solutions. The full numbers live in the project's REPORT.md. ^[6]

Dataset

The HuggingFace dataset James4Ever0/vimgolf_challenges_and_solutions is the canonical source for both vimgolf-gym and the Inspect AI port. It contains the 612 public challenges scraped from vimgolf.com, organized one folder per challenge hash, with three files inside: metadata.json (title, detail, URL, hash), challenge.json (input, output, client version), and worst_solution.json (highest scoring public solution and parsed header). The dataset is released under The Unlicense, and total file size is around 2.12 MB. ^[2] ^[7]

Field family	Examples	Notes
Metadata	`href`, `title`, `detail`, `challenge_hash`	Used to identify and look up challenges
Challenge body	`input`, `output`, `client`	The actual task and the version of the VimGolf client used
Worst solution	`rank`, `solution`, `header`	Anchor for the relative inverse score

The "worst public solution" anchor matters because VimGolf normally hides solutions until a player submits an attempt, and only exposes the bottom 20% of public submissions to logged in users. Using that worst solution as the denominator gives every model a real human baseline, even if it is far from the optimum. ^[3] ^[4]
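
Reading one challenge folder can be sketched as follows. The three file names come from the dataset description above; the helper name `load_challenge` and the choice to return a single merged dictionary are assumptions for illustration, and the exact JSON keys inside each file should be checked against the dataset itself.

```python
import json
from pathlib import Path

CHALLENGE_FILES = ("metadata.json", "challenge.json", "worst_solution.json")


def load_challenge(folder: Path) -> dict:
    """Load the three JSON files of one challenge folder into a dict."""
    record = {}
    for filename in CHALLENGE_FILES:
        path = folder / filename
        with path.open(encoding="utf-8") as handle:
            # Key each payload by file stem, e.g. record["metadata"].
            record[path.stem] = json.load(handle)
    return record
```

Given a local copy of the dataset, a call such as `load_challenge(dataset_root / challenge_hash)` would return a dictionary with the keys `metadata`, `challenge`, and `worst_solution`, where `dataset_root` and `challenge_hash` are placeholders for the caller's own values.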

Use cases

vimgolf-gym has shown up in three kinds of projects: straight model evaluation across the 612 challenge set, agent training where the environment plays the role of a verifier or reward model for a reinforcement learning loop, and integration testing for computer use and terminal agent stacks where VimGolf challenges act as a regression suite. The Cybergod AGI roadmap places vimgolf-gym alongside CTF-Gym for security tasks and a planned Cybergod-Gym for end to end economic agents. ^[1]

Limitations

A few caveats apply when using vimgolf-gym. The 612 challenge set is small compared to benchmarks like SWE-bench, so variance between runs can be high for models that solve only a handful of tasks. Some challenges depend on Vim's pattern engine (which differs from PCRE), penalizing models exposed only to standard regular expressions. The keystroke score also rewards trickery as much as understanding: a model that has memorized idiomatic Vim incantations may outperform a stronger reasoner that produces longer but correct sequences. Single turn evaluation does not reflect how a real coding agent would use Vim either, since it never lets the model observe an intermediate buffer state, although the native Python API supports interactive use.

Related projects

Project	Relationship
igrigorik/vimgolf	The original Ruby CLI and website that VimGolf has run on since 2010
vimgolf PyPI package	Python reimplementation of the VimGolf client used internally by vimgolf-gym
Inspect AI vimgolf_challenges	Single turn dialogue evaluation built on the same dataset
bicyclespokesperson/vim_golf_benchmark	Community Claude and Ollama benchmark using three starter challenges
Terminal-Bench	Broader terminal agent benchmark that vimgolf-gym complements
SWE-bench	Code editing benchmark on real GitHub issues; different scale and granularity
HumanEval	Function level code generation benchmark with no editor in the loop
vimgolf.el	Emacs interface for the same challenges, reflecting the cross editor culture VimGolf seeded

License

vimgolf-gym is released under The Unlicense, a public domain dedication. The HuggingFace dataset is released under the same terms. The original VimGolf game and Ruby client are MIT licensed. ^[1] ^[2] ^[4]

References

James4Ever0. "vimgolf-gym: OpenAI gym style Vimgolf environment and benchmark for AI." GitHub. https://github.com/James4Ever0/vimgolf-gym
James4Ever0. "vimgolf_challenges_and_solutions." HuggingFace Datasets. https://huggingface.co/datasets/James4Ever0/vimgolf_challenges_and_solutions
VimGolf. "Real Vim ninjas count every keystroke." https://www.vimgolf.com/
Grigorik, Ilya. "igrigorik/vimgolf: Real Vim ninjas count every keystroke - do you?" GitHub. https://github.com/igrigorik/vimgolf
UK AI Safety Institute. "VimGolf: Evaluating LLMs in Vim Editing Proficiency." Inspect Evals documentation. https://ukgovernmentbeis.github.io/inspect_evals/evals/reasoning/vimgolf_challenges/
bicyclespokesperson. "vim_golf_benchmark: Are LLMs any good at vim golf?" GitHub. https://github.com/bicyclespokesperson/vim_golf_benchmark
James4Ever0. "agi_computer_control: scrape_vimgolf_challenges_and_solutions." GitHub. https://github.com/James4Ever0/agi_computer_control/tree/master/scrape_vimgolf_challenges_and_solutions
Benchflow. "James4ever0/Vimgolf benchmark." https://www.benchflow.ai/benchmarks/James4ever0/Vimgolf
PyPI. "vimgolf-gym package." https://pypi.org/project/vimgolf-gym/
Docker Hub. "agile4im/cybergod_vimgolf_gym image." https://hub.docker.com/r/agile4im/cybergod_vimgolf_gym
