BALROG

Overview
Full name	Benchmarking Agentic LLM and VLM Reasoning On Games
Abbreviation	BALROG
Description	A benchmark that evaluates agentic LLM and VLM capabilities through six diverse, procedurally generated game environments
Initial release	20 November 2024 (arXiv v1)
Conference	ICLR 2025 (Singapore, April 2025)
Latest paper revision	1 April 2025 (arXiv v2)
Lead author	Davide Paglieri (UCL DARK Lab)
Co-authors	Bartlomiej Cupial, Sam Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Lukasz Kucinski, Lerrel Pinto, Rob Fergus, Jakob N. Foerster, Jack Parker-Holder, Tim Rocktaschel
Affiliated organizations	University College London (DARK Lab), IDEAS NCBR, University of Warsaw, University of Oxford, New York University, Anthropic, Polish Academy of Sciences
Technical details
Type	Agentic reasoning, long-horizon planning, sequential decision making
Modality	Language only (LLM) and vision plus language (VLM)
Task format	Interactive reinforcement learning environments wrapped for natural language action output
Number of environments	6 (BabyAI, Crafter, TextWorld, Baba Is AI, MiniHack, NetHack Learning Environment)
Total tasks	Procedurally generated, unbounded
Primary metric	Average progress percentage (0 to 100), aggregated across environments
Languages	English
Performance
Original SOTA (Nov 2024)	Claude 3.5 Sonnet, 32.6 percent (LLM)
Strong reasoner baseline	DeepSeek-R1, 34.9 percent (Jan 2025)
2026 LLM leader	Gemini 3 Pro family, around 58 percent (Feb 2026)
VLM leader	Gemini 2.5 Pro Exp, 35.7 percent (Apr 2025)
Saturated	No, especially NetHack and MiniHack Boxoban
Resources
Website	balrogai.com
Paper	arXiv:2411.13543
GitHub	balrog-ai/BALROG
Docs	balrog-ai.github.io
License	MIT

BALROG (Benchmarking Agentic LLM and VLM Reasoning On Games) is a benchmark suite for evaluating the agentic capabilities of large language models and vision language models inside long-horizon, procedurally generated game environments. The framework was introduced by Davide Paglieri and collaborators from the UCL DARK Lab, IDEAS NCBR, the University of Warsaw, the University of Oxford, New York University, and Anthropic in a paper first posted to arXiv on 20 November 2024 and accepted as a poster at ICLR 2025. BALROG aggregates six existing reinforcement learning environments, namely BabyAI, Crafter, TextWorld, Baba Is AI, MiniHack, and the NetHack Learning Environment, into a single testbed where models output natural language actions over hundreds or thousands of steps. The headline finding from the original paper is that frontier models complete only a fraction of these tasks and that several vision language systems perform worse when given a picture of the environment than when given a textual description of the same state.

Background and motivation

The motivation for BALROG is rooted in a gap that the authors identify between claims about agentic AI and the empirical evaluation of those claims. Standard agent benchmarks such as WebArena, SWE-bench, and OSWorld tend to test sequences of a few dozen interactions inside a single domain, while realistic autonomous behavior requires planning over orders of magnitude more steps, with credit assigned across long action sequences. The team argues that games offer a natural laboratory for this kind of evaluation because they impose long horizons, sparse rewards, stochastic dynamics, and clear win conditions that cannot be passed by memorizing a static answer set.

A second motivation is the difficulty curve of the chosen environments. The authors deliberately span tasks that a non-expert can finish in seconds, such as picking up a key in BabyAI, to tasks that take expert humans years to master, such as ascending in NetHack. This range allows the benchmark to remain unsaturated as models improve. Performance is reported per environment on a 0 to 100 scale, so the per environment breakdown shows whether a model is making consistent progress everywhere or only on the easiest games, while the average condenses this into a single headline number.

A third motivation is to study modality. BALROG runs every environment in two modes, a language only mode where the model receives a textual rendering of the state, and a vision language mode where it receives the rendered image alongside text. Because the underlying state is identical, the difference between scores in the two modes isolates the contribution of visual perception to agentic reasoning. This design has produced some of the benchmark's most discussed empirical results.

Authors and institutional origin

BALROG was conceived inside the UCL DARK Lab, an artificial intelligence research group led by Tim Rocktaschel and known previously for the NetHack Learning Environment and MiniHack. Davide Paglieri, a PhD student at UCL advised by Rocktaschel and Jack Parker-Holder, is the lead author. The remaining authors include Bartlomiej Cupial and Maciej Wolczyk from IDEAS NCBR and the University of Warsaw, Sam Coward and Jakob N. Foerster from the University of Oxford, Ulyana Piterbarg, Lerrel Pinto, and Rob Fergus from New York University, Akbir Khan who is affiliated with UCL and Anthropic, Eduardo Pignatelli at UCL, and Lukasz Kucinski from IDEAS NCBR and the Institute of Mathematics of the Polish Academy of Sciences. The corresponding author email on the BALROG website is d.paglieri at cs.ucl.ac.uk.

This combination of authors mirrors the lineage of the benchmark, since several of the environments included in BALROG originated in research papers by overlapping groups. The NetHack Learning Environment and MiniHack were originally released by teams at FAIR and UCL DARK. BabyAI is the work of Maxime Chevalier-Boisvert and collaborators at Mila. Crafter is an environment by Danijar Hafner. TextWorld is a Microsoft Research framework. Baba Is AI is a 2024 environment by Nathan Cloos and colleagues. BALROG does not modify the games themselves. Instead, it provides a standardized wrapper, prompting, parsing, and scoring layer on top of them.

The six environments

The BALROG suite intentionally covers a wide spectrum of skills. The following table provides a structured summary of each environment and the abilities it stresses.

Environment	Origin	Visual style	Mastery time for humans	Skills tested
BabyAI	Chevalier-Boisvert et al., 2019	2D grid	Seconds to minutes	Language grounding, instruction following, simple navigation
Crafter	Hafner, 2021	2D pixel art	Hours	Survival, resource gathering, crafting, exploration
TextWorld	Cote et al., 2018 (Microsoft Research)	Pure text	Minutes to hours	Language understanding, spatial mental models, puzzle solving
Baba Is AI	Cloos et al., 2024	2D grid with movable text tiles	Hours	Abstract rule manipulation, compositional generalization
MiniHack	Samvelyan et al., 2021	ASCII or tiles	Hours to days	Tactical combat, navigation, item use, planning
NetHack Learning Environment	Kuttler et al., 2020	ASCII or tiles	Years	Long-horizon strategy, vast game knowledge, credit assignment

BabyAI

BabyAI places an agent in a 2D gridworld with colored objects such as keys, balls, boxes, and doors. The agent receives a natural language mission, for example "pick up the red key", and must complete it inside a small set of rooms. BALROG selects five navigation task types from BabyAI. The action space is small, containing six primitives: turn left, turn right, move forward, pick up, drop, and toggle. Each episode is scored as 0 or 100 based on whether the mission is completed within the step limit. BabyAI is the most approachable environment in BALROG, so it acts as a sanity check on basic language to action grounding.

Crafter

Crafter is a 2D Minecraft-style survival sandbox developed by Danijar Hafner in 2021. The world is procedurally generated and tracks 22 achievements that range from collect wood and place stone to collect diamond, defeat zombie, and wake up rested. The agent must avoid starvation, dehydration, and combat death while gathering resources and climbing the crafting tech tree. In BALROG, Crafter is scored on a continuous 0 to 100 scale equal to the percentage of achievements unlocked. Crafter exposes a model's ability to plan a hierarchical sequence of resource acquisition steps under stochastic conditions.

TextWorld

TextWorld is a Microsoft Research framework for generating text adventures. BALROG uses three game types from TextWorld, namely Treasure Hunter, The Cooking Game, and Coin Collector. Treasure Hunter is a room based navigation game across twenty rooms with locked doors and keys. The Cooking Game requires multi-step recipes where the agent must chop, fry, or roast specific ingredients in a particular order. Coin Collector is a long horizon navigation task spanning forty rooms and is widely used as an exploration probe. TextWorld actions are short natural language strings such as "go east", "examine apple", "take key", or "unlock door with brass key". Scores reflect the fraction of subtasks solved.

Baba Is AI

Baba Is AI, introduced by Nathan Cloos and colleagues at ICML 2024, is a research version of the indie puzzle game Baba Is You. The defining feature of the original game is that the rules themselves are objects in the world. Sentences like ROCK IS PUSH or WALL IS STOP are made of movable text blocks, and the agent can rearrange them to alter physics. BALROG includes the forty puzzles from the public Baba Is AI release, focusing on compositional generalization rather than memorization of specific solutions. The environment is binary scored. The Cloos et al. paper found that models such as GPT-4o, Gemini 1.5 Pro, and Gemini 1.5 Flash failed dramatically on puzzles that require manipulating and combining rules to win, and this finding is reproduced inside BALROG.

MiniHack

MiniHack is a flexible framework built on top of the NetHack Learning Environment by Mikayel Samvelyan and collaborators. It allows researchers to design controlled tasks that use NetHack's underlying engine without requiring the agent to play the full dungeon. BALROG selects task types in three categories. Navigation tasks include Maze and Corridor. Skill acquisition tasks include Quest Easy, Medium, and Hard, which require using items to cross lava or defeat a guardian wielding a wand of death. Puzzle tasks are built on Boxoban, a Sokoban variant adapted to the NetHack engine. The MiniHack action space includes eight directional moves plus extras such as search, kick, open, and eat. Observations are produced through the NetHack Language Wrapper, which translates the ASCII map and message log into natural language.

NetHack Learning Environment

The NetHack Learning Environment, or NLE, exposes the classic 1987 roguelike NetHack to AI research. It is widely considered one of the hardest open game benchmarks because dungeons are procedurally generated, mechanics are deep and idiosyncratic, and a winning ascension can require hundreds of thousands of steps and detailed game knowledge. BALROG uses a novel data informed progression metric, defined in Appendix F.2 of the paper, that combines current dungeon level, experience level, and other in-game proxies to produce a continuous score on a 0 to 100 scale. The action space exposes around eighty primitives covering movement, combat, item use, prayer, casting, eating, dropping, wearing, and dungeon navigation. NetHack serves as the difficulty ceiling of the benchmark and is the environment where almost every model still scores close to zero.

Evaluation methodology

BALROG keeps the evaluation protocol intentionally simple so that different models, prompting strategies, and inference time methods can be compared head to head.

Observation rendering

Each environment produces both a textual description and an image at every step. In LLM mode the model only sees the text. In VLM mode the model sees the image plus the same text. For ASCII heavy environments such as NetHack and MiniHack, BALROG additionally feeds the ASCII map representation alongside any visual rendering, because pure image observations are too low resolution for game-relevant glyphs to be reliably read.
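The difference between the two modes can be illustrated with a minimal sketch. The names Observation and build_model_input are hypothetical and are not taken from the BALROG codebase. The sketch only shows the idea that both modes share the same text and that VLM mode adds the rendered frame.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """One step of environment output (hypothetical container, not BALROG's)."""
    text: str                           # textual description of the state
    image_png: Optional[bytes] = None   # rendered frame, if one is available


def build_model_input(obs: Observation, mode: str) -> dict:
    """Select what the model sees: text only, or text plus image."""
    if mode == "llm":
        # Language only mode drops the image entirely.
        return {"text": obs.text}
    if mode == "vlm":
        if obs.image_png is None:
            raise ValueError("VLM mode needs a rendered image")
        # Vision language mode keeps the same text and adds the frame.
        return {"text": obs.text, "image_png": obs.image_png}
    raise ValueError(f"unknown mode: {mode!r}")
```

Because the text field is identical in both branches, any score difference between modes in this kind of setup can be attributed to the added image.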

Action interface

At each timestep the model is asked for a natural language action. The framework parses the output into the discrete action that the underlying environment expects. If the output is malformed or invalid, the system logs the failure, executes a noop or fallback action, and continues. This design supports detailed trajectory analysis of error types, for example whether a model is failing because of bad reasoning, bad parsing, or refused completions.
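A parse and fallback step of this kind can be sketched as follows. This is an illustration under stated assumptions, not the parser that BALROG ships: the action vocabulary reuses the six BabyAI primitives listed earlier, while the "noop" fallback label, the ParseLog class, and the normalization rules are hypothetical.

```python
from dataclasses import dataclass, field

# Action vocabulary modelled on the six BabyAI primitives.
ALLOWED_ACTIONS = (
    "turn left", "turn right", "move forward", "pick up", "drop", "toggle",
)
FALLBACK_ACTION = "noop"  # assumption: placeholder name for the fallback


@dataclass
class ParseLog:
    """Collects raw outputs that could not be mapped to an allowed action."""
    invalid_outputs: list[str] = field(default_factory=list)


def parse_action(model_output: str, log: ParseLog) -> str:
    """Map free-form model text to an allowed action, or fall back."""
    # Normalize case, surrounding punctuation, and repeated whitespace.
    cleaned = " ".join(model_output.lower().strip().strip(".!\"'").split())
    if cleaned in ALLOWED_ACTIONS:
        return cleaned
    # Record the failure so error types can be analysed afterwards.
    log.invalid_outputs.append(model_output)
    return FALLBACK_ACTION
```

Keeping the raw invalid strings, rather than only a counter, is what makes it possible to separate malformed formatting from refusals in later trajectory analysis.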

Scoring

Environment	Metric class	Range	Notes
BabyAI	Binary	0 or 100	Score per episode, then averaged across episodes
Baba Is AI	Binary	0 or 100	One score per puzzle
MiniHack	Binary	0 or 100	Per task type, averaged
Crafter	Continuous	0 to 100	Fraction of 22 achievements unlocked
TextWorld	Continuous	0 to 100	Fraction of subtasks completed
NetHack	Continuous	0 to 100	Data informed progression score combining dungeon depth and experience level

The headline number reported on the leaderboard is the unweighted mean of these per-environment scores. The paper reports each score with a 95 percent confidence interval estimated from five to twenty five seeded episodes depending on the environment.
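The aggregation can be sketched in a few lines. The function names and the example scores below are illustrative and not taken from the paper. The confidence interval uses a normal approximation (1.96 standard errors), which is an assumption, since the exact estimator is not described here.

```python
from math import sqrt
from statistics import mean, stdev


def environment_score(episode_scores):
    """Average the 0 to 100 scores of all episodes in one environment."""
    return mean(episode_scores)


def confidence_halfwidth_95(episode_scores):
    """Half-width of a 95 percent interval, normal approximation (assumption)."""
    n = len(episode_scores)
    if n < 2:
        return 0.0
    return 1.96 * stdev(episode_scores) / sqrt(n)


def average_progress(per_environment):
    """Unweighted mean of per-environment scores: every game counts equally."""
    return mean(per_environment.values())


# Illustrative numbers only, not results from the paper.
babyai_episodes = [100, 0, 100, 100, 0]   # binary environment: 0 or 100
example_scores = {
    "BabyAI": 60.0, "Crafter": 30.0, "TextWorld": 40.0,
    "Baba Is AI": 20.0, "MiniHack": 10.0, "NetHack": 2.0,
}
print(environment_score(babyai_episodes))   # 60
print(average_progress(example_scores))     # 27.0
```

Because the mean is unweighted, a near zero NetHack score pulls the headline number down as much as a strong BabyAI score lifts it.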

Models tested in the original paper

The ICLR 2025 paper evaluates a deliberately broad slate of frontier systems, both closed and open weights.

Provider	Closed source models	Open weights models
OpenAI	GPT-4o, GPT-4o-mini, o1-preview (NetHack only)	None
Anthropic	Claude 3.5 Sonnet, Claude 3.5 Haiku	None
Google DeepMind	Gemini 1.5 Pro, Gemini 1.5 Flash	None
Meta	None	Llama 3.1 8B and 70B, Llama 3.2 1B, 3B, 11B Vision, 90B Vision

Original headline results

The table below reproduces the average progress reported in the November 2024 paper for the language only setting. Confidence intervals are 95 percent and come from the published Table 1.

Rank	Model	Average progress, LLM mode
1	Claude 3.5 Sonnet	32.6 plus or minus 1.9 percent
2	GPT-4o	32.3 plus or minus 1.5 percent
3	Llama 3.1 70B	27.9 plus or minus 1.4 percent
4	Llama 3.2 90B	27.3 plus or minus 1.4 percent
5	Gemini 1.5 Pro	21.0 plus or minus 1.2 percent
6	Claude 3.5 Haiku	19.3 plus or minus 1.8 percent
7	GPT-4o-mini	17.4 plus or minus 1.4 percent
8	Llama 3.2 11B Vision	16.8 plus or minus 1.5 percent

Per environment scores at launch

The per environment scores reveal that the bulk of progress comes from the easier games and that all systems collapse near zero on the hardest ones.

Model	BabyAI	Crafter	TextWorld	Baba Is AI	MiniHack	NetHack progression
GPT-4o	77.6	33.1	39.3	36.7	5.7	around 0
Claude 3.5 Sonnet	68.0	32.7	42.1	36.7	0.0	around 0
Llama 3.1 70B	73.2	26.6	15.0	23.3	0.0	around 0
Llama 3.2 90B	70.0	31.7	14.5	16.7	0.0	around 0
Gemini 1.5 Pro	71.2	24.6	0.0*	33.3	5.7	around 0
o1-preview (NetHack only)	n/a	n/a	n/a	n/a	n/a	1.6

*The Gemini result of 0 on TextWorld in the original paper is an artifact of API safety filters that returned refusals on a fraction of trajectories. The BALROG authors flagged this as a measurement issue and not a capability claim.

Vision language mode results

In vision language mode the rankings shift in unexpected ways. The key headline is that Claude 3.5 Sonnet and Gemini 1.5 Pro improved with vision while GPT-4o and Llama 3.2 90B Vision degraded substantially.

Model	Average progress, VLM mode	Change versus LLM mode
Claude 3.5 Sonnet	35.5 plus or minus 2.0 percent	up 2.9
Gemini 1.5 Pro	25.8 plus or minus 1.4 percent	up 4.8
GPT-4o	22.6 plus or minus 1.4 percent	down 9.7
Llama 3.2 90B Vision	21.0 plus or minus 1.6 percent	down 6.3

The authors call this the vision deficiency paradox. The most prominent example is GPT-4o on Crafter, where adding the rendered image dropped scores from 33.1 percent in LLM mode to 26.8 percent in VLM mode. Llama 3.2 90B fell from 31.7 to 14.5 on the same environment. Claude 3.5 Sonnet ran counter to this trend, which the authors speculate is related to its training on computer use traces. Gemini 1.5 Pro also posted a higher average in VLM mode.
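The change column in the table above is simply the VLM average minus the LLM average for the same model. The short sketch below recomputes it from the figures quoted in this article. The function name modality_delta is illustrative.

```python
# Average progress figures as quoted in the two headline tables of this article.
LLM_AVERAGE = {
    "Claude 3.5 Sonnet": 32.6,
    "Gemini 1.5 Pro": 21.0,
    "GPT-4o": 32.3,
    "Llama 3.2 90B Vision": 27.3,  # listed as "Llama 3.2 90B" in the LLM table
}
VLM_AVERAGE = {
    "Claude 3.5 Sonnet": 35.5,
    "Gemini 1.5 Pro": 25.8,
    "GPT-4o": 22.6,
    "Llama 3.2 90B Vision": 21.0,
}


def modality_delta(llm, vlm):
    """VLM score minus LLM score per model; negative means vision hurt."""
    return {model: round(vlm[model] - llm[model], 1) for model in vlm}


for model, delta in modality_delta(LLM_AVERAGE, VLM_AVERAGE).items():
    direction = "up" if delta > 0 else "down"
    print(f"{model}: {direction} {abs(delta)}")
```

A negative delta marks the models for which adding the image lowered average progress.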

Key empirical findings

The paper distills several qualitative findings beyond the raw numbers.

The knowing doing gap

The most cited finding from BALROG is the gap between what models say they know and what they do. When queried out of context about NetHack mechanics, models such as GPT-4o or Claude 3.5 Sonnet correctly identify that eating rotten food can kill the character, that descending to deeper levels without proper preparation is fatal, and that prayer cooldowns must be respected. Inside the game, the same models repeatedly perform these exact suicidal actions. The authors interpret this as evidence that frontier LLMs hold relevant world knowledge but fail to apply it in context during sequential decision making.

Vision deficiency paradox

As noted above, several VLMs degrade in performance when given an image alongside text rather than the textual rendering alone. The authors hypothesize that current VLMs are optimized for descriptive captioning and visual question answering rather than action selection, that grid based scenes with small glyphs are out of distribution for natural image training data, and that the additional tokens consumed by image features push relevant text further out of effective attention windows. Claude 3.5 Sonnet, which improved with vision, is a notable counterexample and suggests that training on computer use trajectories may close this gap.

Long horizon planning collapse

No model in the original paper completed a single MiniHack Boxoban puzzle or MiniHack Quest Hard task, and none achieved a NetHack ascension. Models can win Boxoban-like Sokoban puzzles when given them as one-shot text problems with chain of thought, but inside the BALROG loop they fail to plan reversible moves and end up in unrecoverable states. On NetHack, even the o1-preview reasoning model topped out around 1.6 percent progression in the original paper.

Exploration is shallow

In TextWorld Coin Collector, which is a forty room maze, models often revisit explored rooms and miss unvisited ones. A simple depth first search with a visited set would solve it quickly, but the models do not maintain such state explicitly and fail at implicit retrieval over long histories.
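The kind of explicit bookkeeping described above can be sketched as a depth first traversal with a visited set. The toy room map and function name are hypothetical. The sketch also assumes the full room graph is known up front, whereas in the game exits are discovered only as rooms are entered and backtracking costs real steps.

```python
def explore_rooms(start, neighbours):
    """Iterative depth first search; returns rooms in first-visit order."""
    visited = set()
    order = []
    stack = [start]
    while stack:
        room = stack.pop()
        if room in visited:
            continue  # never spend a visit on an already explored room
        visited.add(room)
        order.append(room)
        # Push in reverse so exits are tried in their listed order.
        for nxt in reversed(neighbours.get(room, [])):
            if nxt not in visited:
                stack.append(nxt)
    return order


# Hypothetical four room map, far smaller than the forty room game.
TOY_MAP = {
    "kitchen": ["hall"],
    "hall": ["kitchen", "cellar", "study"],
    "cellar": ["hall"],
    "study": ["hall"],
}
print(explore_rooms("kitchen", TOY_MAP))
# ['kitchen', 'hall', 'cellar', 'study']
```

The visited set is the piece of state that the models fail to maintain: without it, nothing prevents the agent from choosing an already explored exit again.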

Spatial reasoning errors

In BabyAI tasks that require placing an object adjacent to another object, models often miscount cells. In MiniHack CorridorBattle, agents become cornered because they cannot reason about which tile movements would let them avoid surrounding monsters in tight corridors.

Evolution of the leaderboard

The BALROG leaderboard at balrogai.com is updated on a weekly cadence, every Monday, and tracks both new model submissions and new agent strategies. The numbers below summarize public records from the leaderboard.

Notable LLM submissions through 2025 and 2026

Date posted	Model	Average progress (LLM)	Notes
Nov 2024	Claude 3.5 Sonnet	32.6 percent	Original SOTA at paper release
Nov 2024	GPT-4o	32.3 percent	Statistically tied with Claude 3.5 Sonnet (overlapping confidence intervals)
Jan 2025	DeepSeek R1 671B (via NVIDIA NIM)	34.9 percent	First reasoning model to top leaderboard
Mid 2025	Grok 4	43.6 percent	Reported on the BALROG leaderboard
Late 2025	Claude Opus 4.5	43.5 percent	Strong improvement over 3.5 generation
Feb 2026	Gemini 3 Pro	58.1 percent	Top of the leaderboard as of the Feb 2026 listing
Feb 2026	Gemini 3.1 Pro Thinking	57.0 percent	98 percent on Crafter
Feb 2026	Gemini 3.1 Pro	56.9 percent	First model to reach 100 percent Crafter

Progress on the easier environments has begun to plateau near the human ceiling. Crafter, in particular, has been effectively solved by the Gemini 3 family. The hardest environments, NetHack and MiniHack Boxoban, are still very far from human expert performance.

Notable VLM submissions

Date posted	Model	Average progress (VLM)
Nov 2024	Claude 3.5 Sonnet	35.5 percent
Nov 2024	Gemini 1.5 Pro 002	25.8 percent
Apr 2025	Gemini 2.5 Pro Experimental 03-25	35.7 percent

The VLM track has accumulated fewer submissions than the LLM track because most vendors evaluate their flagship multimodal model in both tracks but submit the language only result first.

NetHack remains the wall

A 2026 community write up by an independent evaluator using a modified BALROG style harness reported that GPT-5.2 achieved a NetHack progression score of 12.6 percent and reached dungeon level 10, which was the deepest any LLM had reached at the time. The same evaluation reported 9.8 percent for Gemini 3 Flash and 2.9 percent each for Gemini 3 Pro and Claude Opus 4.5 on NetHack specifically. No public LLM submission has produced a NetHack ascension.

Technical setup

BALROG is distributed as an open source Python package. The recommended installation is to create a conda environment with Python 3.10, clone the repository at github.com/balrog-ai/BALROG, install with pip in editable mode, and run the post install script to download environment binaries that are not redistributable through PyPI such as the NetHack data files.

conda create -n balrog python=3.10 -y
conda activate balrog
git clone https://github.com/balrog-ai/BALROG.git
cd BALROG
pip install -e .
balrog-post-install

Model access can be either via API or local serving. The package ships with adapters for the OpenAI, Anthropic, and Google Gemini APIs, configured through environment variables or a SECRETS file. For self hosted evaluation, BALROG supports vLLM serving and a worker pool architecture for running episodes in parallel across many random seeds. Researchers can plug in custom agents that wrap any callable into the same observation in, action out loop.
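The observation in, action out loop can be illustrated with a self contained sketch. Every name below (Env, CallableAgent, run_episode, CountdownEnv) is hypothetical and chosen for illustration. None of them is BALROG's actual agent or environment interface, which should be taken from the project documentation.

```python
from typing import Callable, Protocol


class Env(Protocol):
    """Minimal text environment interface assumed by this sketch."""
    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, bool]: ...


class CallableAgent:
    """Wraps any function from observation text to action text."""
    def __init__(self, policy: Callable[[str], str]):
        self.policy = policy

    def act(self, observation: str) -> str:
        return self.policy(observation)


def run_episode(env: Env, agent: CallableAgent, max_steps: int) -> int:
    """Run until the episode ends or the step limit is hit; return steps used."""
    observation = env.reset()
    for step in range(1, max_steps + 1):
        observation, done = env.step(agent.act(observation))
        if done:
            return step
    return max_steps


class CountdownEnv:
    """Toy environment: the episode ends after n 'advance' actions."""
    def __init__(self, n: int):
        self.n = n
        self.remaining = n

    def reset(self) -> str:
        self.remaining = self.n
        return f"{self.remaining} steps remain"

    def step(self, action: str) -> tuple[str, bool]:
        if action == "advance":
            self.remaining -= 1
        return f"{self.remaining} steps remain", self.remaining == 0


print(run_episode(CountdownEnv(3), CallableAgent(lambda obs: "advance"), 10))  # 3
```

In a real evaluation the lambda would be replaced by a call to a hosted or locally served model, with the rest of the loop unchanged.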

Submission types

The BALROG leaderboard tracks two distinct submission categories. The first is new models, where vendors or community contributors run a stock model through the official agent. The second is new inference strategies, where the model is held fixed and a new prompting, memory, planning, or search wrapper is contributed. This separation is intended to disentangle progress driven by stronger base models from progress driven by smarter agent design.

Limitations acknowledged by the authors

The paper is unusually candid about limitations.

Cost of few shot prompting. Naive few shot demonstrations are infeasible at NetHack scale. A single complete NetHack demonstration trajectory can exceed seven hundred thousand input tokens, which is too long for cost-sensitive evaluation. The authors suggest retrieval based demonstration selection as future work.
Vision pipeline. Image observations were paired with low temperature sampling and the rendered frames were limited in resolution. The authors note that future versions could include video and higher resolution renderings once efficient video VLMs are available.
Limited multi agent coverage. All six environments are single agent. The authors mention Overcooked and Hanabi as candidates for a future multi agent extension.
Single reasoning model in the original paper. Only o1-preview was tested in the original paper, and only on NetHack. Subsequent leaderboard entries by DeepSeek R1 and the Gemini 3 series fill in part of this gap.
English only. All prompts and observations are in English, which leaves multilingual agent capabilities out of scope.
Knowing doing gap analysis is qualitative. The paper documents the phenomenon through transcript analysis but does not propose a quantitative metric specifically for it.
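The few shot cost limitation can be made concrete with rough arithmetic. The sketch below assumes that the full demonstration is resent with every step's prompt and that no prompt caching applies. The step count and the price per million tokens are hypothetical placeholders, not quoted vendor prices.

```python
def few_shot_input_cost(demo_tokens, steps, usd_per_million_tokens):
    """Input cost when one demonstration is prepended to every step's prompt."""
    total_tokens = demo_tokens * steps
    return total_tokens / 1_000_000 * usd_per_million_tokens


# 700,000 tokens is the demonstration length cited above;
# 1,000 steps and 1.00 USD per million tokens are illustrative assumptions.
print(few_shot_input_cost(700_000, 1_000, 1.00))  # 700.0
```

Under these assumptions a single episode would already consume 700 million input tokens, which illustrates why the authors point to retrieval based selection of shorter demonstration segments.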

Comparison with other agentic benchmarks

BALROG is one of a growing family of benchmarks aimed at agentic capabilities rather than single shot prediction. The table below contrasts it with several related efforts.

Benchmark	Domain	Horizon	Procedural generation	Vision required	Released
BALROG	Six varied games	10 to 100,000 steps	Yes	Optional	Nov 2024
Factorio Learning Environment	Industrial automation, factory design	Very long, open ended	Yes	Optional	2025
SWE-bench and SWE-bench Verified	Real world software engineering bugs	Tens of file operations	No, fixed issue set	No	2023 to 2024
Voyager	Minecraft skill acquisition	Open ended	Partial	Yes	2023
WebArena	Web navigation tasks	Tens of clicks	No, scripted	No	2023
OSWorld	Computer use across applications	Tens of steps	Partial	Yes	2024
MineDojo	Minecraft tasks	Long	Partial	Yes	2022
ALE	Atari games	Thousands of frames	No	Yes	2013

Voyager is the most similar in spirit, because it also embeds an LLM agent inside a procedurally generated game world, but Voyager focuses on a single environment, Minecraft, and on lifelong skill library construction rather than on direct progress scoring. The Factorio Learning Environment is a complementary effort that emphasizes industrial scale automation and quantitative production targets over diverse game mechanics. SWE-bench evaluates real engineering tasks but with a much shorter horizon than NetHack. Within this landscape BALROG is distinctive for combining six different games of widely varying difficulty under one scoring metric, with first class support for both language and vision input.

Reception and community

BALROG was first presented at the NeurIPS 2024 Open World Agents workshop and the Language Gamification workshop, then accepted at ICLR 2025 in Singapore. The benchmark received substantial coverage in the AI press in November 2024 and again in January 2025 when NVIDIA published a technical blog showing that DeepSeek R1 served through NIM microservices set a new state of the art on the language only track. The leaderboard at balrogai.com remains active, with weekly updates and verified status flags for submissions where the BALROG team has reproduced the reported result.

The BALROG GitHub repository, balrog-ai/BALROG, is MIT licensed and the documentation site at balrog-ai.github.io provides per environment task descriptions, observation schemas, and contribution guidelines. There is an active community Discord linked from the official website, and a separate balrog-ai/experiments repository hosts trajectory artifacts and reproduction scripts.

Failure mode taxonomy from trajectory analysis

Beyond the headline averages, the authors and community have produced trajectory analyses that catalog the dominant error types. The following table summarizes them.

Failure mode	Environment most affected	Description
Suicidal exploration	NetHack, MiniHack Quest	Descending without preparation, fighting beyond capability, ignoring known dangers
Loop and revisit	TextWorld Coin Collector	Returning to explored rooms without acquiring new information
Inventory neglect	NetHack, MiniHack	Failing to wear, wield, or apply items that the model can name
Rule blindness	Baba Is AI	Treating the rule blocks as inert decorations rather than as the puzzle
Adjacency miscount	BabyAI	Off by one errors in placing objects adjacent to others
Format hallucination	All	Outputting an action that is not in the allowed set, requiring a fallback
Refusal	TextWorld with some safety tuned models	API filters returning refusals on benign in-game content

These categories are useful when interpreting per model differences. Closed reasoning models tend to reduce loop and revisit errors but still suffer from suicidal exploration. Smaller open weights models accumulate format hallucination at a far higher rate than larger ones, which inflates their failure rate independent of their underlying decision quality.

Why BALROG matters

BALROG occupies a useful middle ground in AI benchmark design. Static question answering benchmarks like MMLU are saturating and can be partially solved by retrieval rather than reasoning. Web and computer use benchmarks like WebArena and OSWorld are realistic but short. Open ended environments like Minecraft are realistic and long but expensive and hard to score consistently. BALROG reuses six well studied games, each with a clean reward structure, and bundles them with a unified protocol so that a single number can be compared across years of model progress.

The benchmark has also become a useful diagnostic for vendor claims. The vision deficiency paradox in particular invites skepticism about marketing material that emphasizes multimodality without measuring action quality. As frontier models like the Gemini 3 series push average BALROG progress past 55 percent in 2026, the harder environments such as NetHack and MiniHack Boxoban will remain the place where genuine long horizon reasoning is being tested.

References

Paglieri, Davide, Bartlomiej Cupial, Sam Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Lukasz Kucinski, Lerrel Pinto, Rob Fergus, Jakob N. Foerster, Jack Parker-Holder, and Tim Rocktaschel. BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games. arXiv:2411.13543. November 2024, revised April 2025. https://arxiv.org/abs/2411.13543
BALROG project website and leaderboard. https://balrogai.com/
BALROG documentation. https://balrog-ai.github.io/docs/
BALROG source code. https://github.com/balrog-ai/BALROG
OpenReview discussion for ICLR 2025. https://openreview.net/forum?id=fp6t3F669F
ICLR 2025 poster page. https://iclr.cc/virtual/2025/poster/28856
NVIDIA Developer Blog. Benchmarking Agentic LLM and VLM Reasoning for Gaming with NVIDIA NIM. January 2025. https://developer.nvidia.com/blog/benchmarking-agentic-llm-and-vlm-reasoning-for-gaming-with-nvidia-nim/
Epoch AI benchmark page for BALROG. https://epoch.ai/benchmarks/balrog
Cloos, Nathan, et al. Baba Is AI: Break the Rules to Beat the Benchmark. ICML 2024. arXiv:2407.13729. https://arxiv.org/abs/2407.13729
Kuttler, Heinrich, et al. The NetHack Learning Environment. NeurIPS 2020. https://arxiv.org/abs/2006.13760
Samvelyan, Mikayel, et al. MiniHack the Planet. NeurIPS 2021 Datasets and Benchmarks. https://arxiv.org/abs/2109.13202
Hafner, Danijar. Benchmarking the Spectrum of Agent Capabilities. ICLR 2022. (Crafter environment.) https://arxiv.org/abs/2109.06780
Chevalier-Boisvert, Maxime, et al. BabyAI: A Platform to Study the Sample Efficiency of Grounded Language Learning. ICLR 2019. https://arxiv.org/abs/1810.08272
Cote, Marc-Alexandre, et al. TextWorld: A Learning Environment for Text-based Games. arXiv:1806.11532. https://arxiv.org/abs/1806.11532
