BALROG
Last reviewed
May 16, 2026
Sources
14 citations
Review status
Source-backed
Revision
v2 · 5,146 words
| BALROG | |
|---|---|
| Overview | |
| Full name | Benchmarking Agentic LLM and VLM Reasoning On Games |
| Abbreviation | BALROG |
| Description | A benchmark that evaluates agentic LLM and VLM capabilities through six diverse, procedurally generated game environments |
| Initial release | 20 November 2024 (arXiv v1) |
| Conference | ICLR 2025 (Singapore, April 2025) |
| Latest paper revision | 1 April 2025 (arXiv v2) |
| Lead author | Davide Paglieri (UCL DARK Lab) |
| Co-authors | Bartlomiej Cupial, Sam Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Lukasz Kucinski, Lerrel Pinto, Rob Fergus, Jakob N. Foerster, Jack Parker-Holder, Tim Rocktaschel |
| Affiliated organizations | University College London (DARK Lab), IDEAS NCBR, University of Warsaw, University of Oxford, New York University, Anthropic, Polish Academy of Sciences |
| Technical details | |
| Type | Agentic reasoning, long-horizon planning, sequential decision making |
| Modality | Language only (LLM) and vision plus language (VLM) |
| Task format | Interactive reinforcement learning environments wrapped for natural language action output |
| Number of environments | 6 (BabyAI, Crafter, TextWorld, Baba Is AI, MiniHack, NetHack Learning Environment) |
| Total tasks | Procedurally generated, unbounded |
| Primary metric | Average progress percentage (0 to 100), aggregated across environments |
| Languages | English |
| Performance | |
| Original SOTA (Nov 2024) | Claude 3.5 Sonnet, 32.6 percent (LLM) |
| Strong reasoner baseline | DeepSeek-R1, 34.9 percent (Jan 2025) |
| 2026 LLM leader | Gemini 3 Pro family, around 58 percent (Feb 2026) |
| 2026 VLM leader | Gemini 2.5 Pro Exp, 35.7 percent (Apr 2025) |
| Saturated | No, especially NetHack and MiniHack Boxoban |
| Resources | |
| Website | balrogai.com |
| Paper | arXiv:2411.13543 |
| GitHub | balrog-ai/BALROG |
| Docs | balrog-ai.github.io |
| License | MIT |
BALROG (Benchmarking Agentic LLM and VLM Reasoning On Games) is a benchmark suite for evaluating the agentic capabilities of large language models and vision language models inside long-horizon, procedurally generated game environments. The framework was introduced by Davide Paglieri and collaborators from the UCL DARK Lab, IDEAS NCBR, the University of Warsaw, the University of Oxford, New York University, and Anthropic in a paper first posted to arXiv on 20 November 2024 and accepted as a poster at ICLR 2025. BALROG aggregates six existing reinforcement learning environments, namely BabyAI, Crafter, TextWorld, Baba Is AI, MiniHack, and the NetHack Learning Environment, into a single testbed where models output natural language actions over hundreds or thousands of steps. The headline finding from the original paper is that frontier models complete only a fraction of these tasks and that several vision language systems perform worse when given a picture of the environment than when given a textual description of the same state.
The motivation for BALROG is rooted in a gap that the authors identify between claims about agentic AI and the empirical evaluation of those claims. Standard agent benchmarks such as WebArena, SWE-bench, and OSWorld tend to test sequences of a few dozen interactions inside a single domain, while realistic autonomous behaviors require planning over orders of magnitude more steps, with reliable credit assignment across them. The team argues that games offer a natural laboratory for this kind of evaluation because they impose long horizons, sparse rewards, stochastic dynamics, and clear win conditions that cannot be passed by memorizing a static answer set.
A second motivation is the difficulty curve of the chosen environments. The authors deliberately span a range from tasks that a non-expert can finish in seconds, such as picking up a key in BabyAI, to tasks that take expert humans years to master, such as ascending in NetHack. This range allows the benchmark to remain unsaturated as models improve. Performance is reported per environment on a 0 to 100 scale and then averaged into a single number, so the per-environment breakdown shows whether a model is making consistent progress everywhere or only on the easiest games.
A third motivation is to study modality. BALROG runs every environment in two modes, a language only mode where the model receives a textual rendering of the state, and a vision language mode where it receives the rendered image alongside text. Because the underlying state is identical, the difference between scores in the two modes isolates the contribution of visual perception to agentic reasoning. This design has produced some of the benchmark's most discussed empirical results.
BALROG was conceived inside the UCL DARK Lab, an artificial intelligence research group led by Tim Rocktaschel and known previously for the NetHack Learning Environment and MiniHack. Davide Paglieri, a PhD student at UCL advised by Rocktaschel and Jack Parker-Holder, is the lead author. The remaining authors include Bartlomiej Cupial and Maciej Wolczyk from IDEAS NCBR and the University of Warsaw, Sam Coward and Jakob N. Foerster from the University of Oxford, Ulyana Piterbarg, Lerrel Pinto, and Rob Fergus from New York University, Akbir Khan who is affiliated with UCL and Anthropic, Eduardo Pignatelli at UCL, and Lukasz Kucinski from IDEAS NCBR and the Institute of Mathematics of the Polish Academy of Sciences. The corresponding author email on the BALROG website is d.paglieri at cs.ucl.ac.uk.
This combination of authors mirrors the lineage of the benchmark, since several of the environments included in BALROG originated in research papers by overlapping groups. The NetHack Learning Environment and MiniHack were originally released by teams at FAIR and UCL DARK. BabyAI is the work of Maxime Chevalier-Boisvert and collaborators at Mila. Crafter is an environment by Danijar Hafner. TextWorld is a Microsoft Research framework. Baba Is AI is a 2024 environment by Nathan Cloos and colleagues. BALROG does not modify the games themselves. Instead, it provides a standardized wrapper, prompting, parsing, and scoring layer on top of them.
The BALROG suite intentionally covers a wide spectrum of skills. The following table provides a structured summary of each environment and the abilities it stresses.
| Environment | Origin | Visual style | Mastery time for humans | Skills tested |
|---|---|---|---|---|
| BabyAI | Chevalier-Boisvert et al., 2019 | 2D grid | Seconds to minutes | Language grounding, instruction following, simple navigation |
| Crafter | Hafner, 2021 | 2D pixel art | Hours | Survival, resource gathering, crafting, exploration |
| TextWorld | Cote et al., 2018 (Microsoft Research) | Pure text | Minutes to hours | Language understanding, spatial mental models, puzzle solving |
| Baba Is AI | Cloos et al., 2024 | 2D grid with movable text tiles | Hours | Abstract rule manipulation, compositional generalization |
| MiniHack | Samvelyan et al., 2021 | ASCII or tiles | Hours to days | Tactical combat, navigation, item use, planning |
| NetHack Learning Environment | Kuttler et al., 2020 | ASCII or tiles | Years | Long-horizon strategy, vast game knowledge, credit assignment |
BabyAI places an agent in a 2D gridworld with colored objects such as keys, balls, boxes, and doors. The agent receives a natural language mission, for example "pick up the red key", and must complete it inside a small set of rooms. BALROG selects five navigation task types from BabyAI. The action space is small, containing six primitives: turn left, turn right, move forward, pick up, drop, and toggle. Each episode is scored as 0 or 100 based on whether the mission is completed within the step limit. BabyAI is the most approachable environment in BALROG, so it acts as a sanity check on basic language-to-action grounding.
Crafter is a 2D Minecraft-style survival sandbox developed by Danijar Hafner in 2021. The world is procedurally generated and tracks 22 achievements that range from collect wood and place stone to collect diamond, defeat zombie, and wake up rested. The agent must avoid starvation, dehydration, and combat death while gathering resources and climbing the crafting tech tree. In BALROG, Crafter is scored on a continuous 0 to 100 scale equal to the percentage of achievements unlocked. Crafter exposes a model's ability to plan a hierarchical sequence of resource acquisition steps under stochastic conditions.
TextWorld is a Microsoft Research framework for generating text adventures. BALROG uses three game types from TextWorld, namely Treasure Hunter, The Cooking Game, and Coin Collector. Treasure Hunter is a room-based navigation game across twenty rooms with locked doors and keys. The Cooking Game requires multi-step recipes where the agent must chop, fry, or roast specific ingredients in a particular order. Coin Collector is a long-horizon navigation task spanning forty rooms and is widely used as an exploration probe. TextWorld actions are short natural language strings such as go east, examine apple, take key, or unlock door with brass key. Scores reflect the fraction of subtasks solved.
Baba Is AI, introduced by Nathan Cloos and colleagues at ICML 2024, is a research version of the indie puzzle game Baba Is You. The defining feature of the original game is that the rules themselves are objects in the world. Sentences like ROCK IS PUSH or WALL IS STOP are made of movable text blocks, and the agent can rearrange them to alter physics. BALROG includes the forty puzzles from the public Baba Is AI release, focusing on compositional generalization rather than memorization of specific solutions. The environment is binary scored. The Cloos et al. paper found that models such as GPT-4o, Gemini 1.5 Pro, and Gemini 1.5 Flash failed dramatically on puzzles that require manipulating and combining rules to win, and this finding is reproduced inside BALROG.
MiniHack is a flexible framework built on top of the NetHack Learning Environment by Mikayel Samvelyan and collaborators. It allows researchers to design controlled tasks that use NetHack's underlying engine without requiring the agent to play the full dungeon. BALROG selects task types in three categories: navigation tasks such as Maze and Corridor; skill acquisition tasks such as Quest Easy, Medium, and Hard, which require using items to cross lava or defeat a guardian wielding a wand of death; and puzzle tasks built on Boxoban, a Sokoban variant adapted to the NetHack engine. The MiniHack action space includes eight directional moves plus extras such as search, kick, open, and eat. Observations are produced through the NetHack Language Wrapper, which translates the ASCII map and message log into natural language.
The NetHack Learning Environment, or NLE, exposes the classic 1987 roguelike NetHack to AI research. It is widely considered the hardest open game benchmark because dungeons are procedurally generated, mechanics are deep and idiosyncratic, and a winning ascension can require hundreds of thousands of steps and detailed game knowledge. BALROG uses a novel data informed progression metric, defined in Appendix F.2 of the paper, that combines current dungeon level, experience level, and other in-game proxies to produce a continuous score on a 0 to 100 scale. The action space exposes around eighty primitives covering movement, combat, item use, prayer, casting, eating, dropping, wearing, and dungeon navigation. NetHack functions as the upper bound of the benchmark and the place where almost every model still flatlines.
BALROG keeps the evaluation protocol intentionally simple so that different models, prompting strategies, and inference time methods can be compared head to head.
Each environment produces both a textual description and an image at every step. In LLM mode the model only sees the text. In VLM mode the model sees the image plus the same text. For ASCII heavy environments such as NetHack and MiniHack, BALROG additionally feeds the ASCII map representation alongside any visual rendering, because pure image observations are too low resolution for game-relevant glyphs to be reliably read.
At each timestep the model is asked for a natural language action. The framework parses the output into the discrete action that the underlying environment expects. If the output is malformed or invalid, the system logs the failure, executes a noop or fallback action, and continues. This design supports detailed trajectory analysis of error types, for example whether a model is failing because of bad reasoning, bad parsing, or refused completions.
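The parse-then-fallback step described above can be illustrated with a minimal sketch. This is not BALROG's actual parser: the action list reuses the six BabyAI primitives named earlier, while `ParseLog`, `parse_action`, and the `"noop"` placeholder are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Illustrative vocabulary: the six BabyAI primitives listed in this article.
ALLOWED_ACTIONS = ("turn left", "turn right", "move forward", "pick up", "drop", "toggle")
FALLBACK_ACTION = "noop"  # hypothetical placeholder; the real fallback is environment specific


@dataclass
class ParseLog:
    """Records outputs that could not be mapped to an allowed action."""
    invalid: int = 0
    raw_failures: list = field(default_factory=list)


def parse_action(model_output: str, log: ParseLog) -> str:
    """Map free-form model text to one allowed action, or log and fall back."""
    text = model_output.strip().lower()
    if text in ALLOWED_ACTIONS:
        return text
    # Tolerate reasoning text followed by the action on the final line.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if lines and lines[-1] in ALLOWED_ACTIONS:
        return lines[-1]
    log.invalid += 1
    log.raw_failures.append(model_output)
    return FALLBACK_ACTION
```

Keeping the raw failing outputs, rather than only a counter, is what makes it possible to separate parsing errors from refusals or poor reasoning when trajectories are reviewed later.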
| Environment | Metric class | Range | Notes |
|---|---|---|---|
| BabyAI | Binary | 0 or 100 | Score per episode, then averaged across episodes |
| Baba Is AI | Binary | 0 or 100 | One score per puzzle |
| MiniHack | Binary | 0 or 100 | Per task type, averaged |
| Crafter | Continuous | 0 to 100 | Fraction of 22 achievements unlocked |
| TextWorld | Continuous | 0 to 100 | Fraction of subtasks completed |
| NetHack | Continuous | 0 to 100 | Data informed progression score combining dungeon depth and experience level |
The headline number reported on the leaderboard is the unweighted mean of these per-environment scores. The paper reports each score with a 95 percent confidence interval estimated from five to twenty five seeded episodes depending on the environment.
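As a rough illustration of the aggregation, the sketch below computes a per-environment mean with a normal-approximation 95 percent interval and then the unweighted mean across environments. The interval formula is an assumption, since the article does not state which estimator the paper uses, and the function names and example scores are hypothetical.

```python
import math
import statistics


def mean_with_ci(episode_scores, z=1.96):
    """Return (mean, half_width) using a normal approximation (an assumption)."""
    n = len(episode_scores)
    mean = statistics.fmean(episode_scores)
    if n < 2:
        return mean, 0.0  # no spread can be estimated from one episode
    stderr = statistics.stdev(episode_scores) / math.sqrt(n)
    return mean, z * stderr


def average_progress(per_env_scores):
    """Unweighted mean: every environment counts once, regardless of episode count."""
    return statistics.fmean(per_env_scores.values())


# Made-up per-environment scores, for illustration only.
example = {"BabyAI": 80.0, "Crafter": 30.0, "TextWorld": 40.0,
           "Baba Is AI": 30.0, "MiniHack": 0.0, "NetHack": 0.0}
print(average_progress(example))  # 30.0
```

Because the mean is unweighted, two environments that sit near zero pull the headline number down as much as two easy environments pull it up, which is why the per-environment breakdown matters when reading the leaderboard.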
The ICLR 2025 paper evaluates a deliberately broad slate of frontier systems, both closed and open weights.
| Provider | Closed source models | Open weights models |
|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini, o1-preview (NetHack only) | None |
| Anthropic | Claude 3.5 Sonnet, Claude 3.5 Haiku | None |
| Google DeepMind | Gemini 1.5 Pro, Gemini 1.5 Flash | None |
| Meta | None | Llama 3.1 8B and 70B, Llama 3.2 1B, 3B, 11B Vision, 90B Vision |
The table below reproduces the average progress reported in the November 2024 paper for the language only setting. Confidence intervals are 95 percent and come from the published Table 1.
| Rank | Model | Average progress, LLM mode |
|---|---|---|
| 1 | Claude 3.5 Sonnet | 32.6 plus or minus 1.9 percent |
| 2 | GPT-4o | 32.3 plus or minus 1.5 percent |
| 3 | Llama 3.1 70B | 27.9 plus or minus 1.4 percent |
| 4 | Llama 3.2 90B | 27.3 plus or minus 1.4 percent |
| 5 | Gemini 1.5 Pro | 21.0 plus or minus 1.2 percent |
| 6 | Claude 3.5 Haiku | 19.3 plus or minus 1.8 percent |
| 7 | GPT-4o-mini | 17.4 plus or minus 1.4 percent |
| 8 | Llama 3.2 11B Vision | 16.8 plus or minus 1.5 percent |
The per environment scores reveal that the bulk of progress comes from the easier games and that all systems collapse near zero on the hardest ones.
| Model | BabyAI | Crafter | TextWorld | Baba Is AI | MiniHack | NetHack progression |
|---|---|---|---|---|---|---|
| GPT-4o | 77.6 | 33.1 | 39.3 | 36.7 | 5.7 | around 0 |
| Claude 3.5 Sonnet | 68.0 | 32.7 | 42.1 | 36.7 | 0.0 | around 0 |
| Llama 3.1 70B | 73.2 | 26.6 | 15.0 | 23.3 | 0.0 | around 0 |
| Llama 3.2 90B | 70.0 | 31.7 | 14.5 | 16.7 | 0.0 | around 0 |
| Gemini 1.5 Pro | 71.2 | 24.6 | 0.0* | 33.3 | 5.7 | around 0 |
| o1-preview (NetHack only) | n/a | n/a | n/a | n/a | n/a | 1.6 |
*The Gemini result of 0 on TextWorld in the original paper is an artifact of API safety filters that returned refusals on a fraction of trajectories. The BALROG authors flagged this as a measurement issue and not a capability claim.
In vision language mode the rankings shift in unexpected ways. Claude 3.5 Sonnet and Gemini 1.5 Pro improved with vision, while GPT-4o and Llama 3.2 90B Vision degraded substantially.
| Model | Average progress, VLM mode | Change versus LLM mode |
|---|---|---|
| Claude 3.5 Sonnet | 35.5 plus or minus 2.0 percent | up 2.9 |
| Gemini 1.5 Pro | 25.8 plus or minus 1.4 percent | up 4.8 |
| GPT-4o | 22.6 plus or minus 1.4 percent | down 9.7 |
| Llama 3.2 90B Vision | 21.0 plus or minus 1.6 percent | down 6.3 |
The authors call this the vision deficiency paradox. The most prominent example is GPT-4o on Crafter, where adding the rendered image dropped scores from 33.1 percent in LLM mode to 26.8 percent in VLM mode. Llama 3.2 90B fell from 31.7 to 14.5 on the same environment. Claude 3.5 Sonnet was the exception, which the authors speculate is related to its training on computer use traces.
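The "change versus LLM mode" column is the VLM average minus the LLM average for the same model. A short check using only the averages quoted in the two tables above (the function name is illustrative):

```python
# Average progress in percent, copied from the LLM and VLM tables in this article.
llm = {"Claude 3.5 Sonnet": 32.6, "Gemini 1.5 Pro": 21.0,
       "GPT-4o": 32.3, "Llama 3.2 90B": 27.3}
vlm = {"Claude 3.5 Sonnet": 35.5, "Gemini 1.5 Pro": 25.8,
       "GPT-4o": 22.6, "Llama 3.2 90B": 21.0}


def modality_delta(llm_scores, vlm_scores):
    """Positive values mean the model scored higher when it also saw the image."""
    return {model: round(vlm_scores[model] - llm_scores[model], 1)
            for model in llm_scores}


for model, delta in modality_delta(llm, vlm).items():
    print(f"{model}: {delta:+.1f}")
```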
The paper distills several qualitative findings beyond the raw numbers.
The most cited finding from BALROG is the gap between what models say they know and what they do. When queried out of context about NetHack mechanics, models such as GPT-4o or Claude 3.5 Sonnet correctly identify that eating rotten food can kill the character, that descending to deeper levels without proper preparation is fatal, and that prayer cooldowns must be respected. Inside the game, the same models repeatedly perform these exact suicidal actions. The authors interpret this as evidence that frontier LLMs hold relevant world knowledge but fail to apply it in context during sequential decision making.
As noted above, several VLMs degrade in performance when given an image alongside text rather than the textual rendering alone. The authors hypothesize that current VLMs are optimized for descriptive captioning and visual question answering rather than action selection, that grid based scenes with small glyphs are out of distribution for natural image training data, and that the additional tokens consumed by image features push relevant text further out of effective attention windows. The exception of Claude 3.5 Sonnet, which improved with vision, suggests that training on computer use trajectories may close this gap.
No model in the original paper produced a single successful trajectory on MiniHack Boxoban or MiniHack Quest Hard, and none achieved a NetHack ascension. Models can win Boxoban-like Sokoban puzzles when given them as one-shot text problems with chain of thought, but inside the BALROG loop they fail to plan reversible moves and end up in unrecoverable states. Even the o1-preview reasoning model topped out around 1.6 percent NetHack progression in the original paper.
In TextWorld Coin Collector, which is a forty room maze, models often revisit explored rooms and miss unvisited ones. A simple depth first search with a visited set would solve it quickly, but the models do not maintain such state explicitly and fail at implicit retrieval over long histories.
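To make the depth first search remark concrete, here is a minimal sketch of an explorer that keeps an explicit visited set and backtracks when a branch is exhausted. The room map, the `explore` function, and the assumption that every exit has a reverse exit are illustrative only. A real agent would discover exits on entering each room instead of reading them from a prepared dictionary.

```python
OPPOSITE = {"north": "south", "south": "north", "east": "west", "west": "east"}


def explore(rooms, start, goal):
    """Return the moves a DFS explorer makes until it reaches `goal`, or None."""
    visited = {start}
    moves = []

    def dfs(room):
        if room == goal:
            return True
        for direction, neighbour in rooms[room].items():
            if neighbour in visited:
                continue  # never re-enter a room that was already explored
            visited.add(neighbour)
            moves.append(f"go {direction}")
            if dfs(neighbour):
                return True
            moves.append(f"go {OPPOSITE[direction]}")  # dead end: walk back
        return False

    return moves if dfs(start) else None


# Hypothetical four-room map; the goal room is D.
rooms = {
    "A": {"east": "B", "south": "C"},
    "B": {"west": "A"},
    "C": {"north": "A", "east": "D"},
    "D": {"west": "C"},
}
print(explore(rooms, "A", "D"))
# ['go east', 'go west', 'go south', 'go east']
```

The visited set is exactly the piece of state that the models fail to maintain: without it, nothing stops the explorer from re-entering room B indefinitely.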
In BabyAI tasks that require placing an object adjacent to another object, models often miscount cells. In MiniHack CorridorBattle, agents become cornered because they cannot reason about which tile movements would let them avoid surrounding monsters in tight corridors.
The BALROG leaderboard at balrogai.com is updated on a weekly cadence, every Monday, and tracks both new model submissions and new agent strategies. The numbers below summarize public records from the leaderboard.
| Date posted | Model | Average progress (LLM) | Notes |
|---|---|---|---|
| Nov 2024 | Claude 3.5 Sonnet | 32.6 percent | Original SOTA at paper release |
| Nov 2024 | GPT-4o | 32.3 percent | Tied with Claude inside CI |
| Jan 2025 | DeepSeek R1 671B (via NVIDIA NIM) | 34.9 percent | First reasoning model to top leaderboard |
| Mid 2025 | Grok 4 | 43.6 percent | Reported on the BALROG leaderboard |
| Late 2025 | Claude Opus 4.5 | 43.5 percent | Strong improvement over 3.5 generation |
| Feb 2026 | Gemini 3 Pro | 58.1 percent | Gold medal as of Feb 2026 listing |
| Feb 2026 | Gemini 3.1 Pro Thinking | 57.0 percent | 98 percent on Crafter |
| Feb 2026 | Gemini 3.1 Pro | 56.9 percent | First model to reach 100 percent Crafter |
Progress on the easier environments has begun to plateau near the human ceiling. Crafter, in particular, has been effectively solved by the Gemini 3 family. The hardest environments, NetHack and MiniHack Boxoban, are still very far from human expert performance.
| Date posted | Model | Average progress (VLM) |
|---|---|---|
| Nov 2024 | Claude 3.5 Sonnet | 35.5 percent |
| Nov 2024 | Gemini 1.5 Pro 002 | 25.8 percent |
| Apr 2025 | Gemini 2.5 Pro Experimental 03-25 | 35.7 percent |
The VLM track has accumulated fewer submissions than the LLM track because most vendors evaluate their flagship multimodal model in both tracks but submit the language only result first.
A 2026 community write up by an independent evaluator using a modified BALROG style harness reported that GPT-5.2 achieved a NetHack progression score of 12.6 percent and reached dungeon level 10, which was the deepest any LLM had reached at the time. The same evaluation reported 9.8 percent for Gemini 3 Flash and 2.9 percent each for Gemini 3 Pro and Claude Opus 4.5 on NetHack specifically. No public LLM submission has produced a NetHack ascension.
BALROG is distributed as an open source Python package. The recommended installation is to create a conda environment with Python 3.10, clone the repository at github.com/balrog-ai/BALROG, install with pip in editable mode, and run the post install script to download environment binaries that are not redistributable through PyPI such as the NetHack data files.
```bash
conda create -n balrog python=3.10 -y
conda activate balrog
git clone https://github.com/balrog-ai/BALROG.git
cd BALROG
pip install -e .
balrog-post-install
```
Model access can be either via API or local serving. The package ships with adapters for the OpenAI, Anthropic, and Google Gemini APIs, configured through environment variables or a SECRETS file. For self hosted evaluation, BALROG supports vLLM serving and a worker pool architecture for running episodes in parallel across many random seeds. Researchers can plug in custom agents that wrap any callable into the same observation in, action out loop.
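The "observation in, action out" loop can be pictured with the sketch below. It does not use BALROG's real classes or method names: `Env`, `CallableAgent`, `run_episode`, and `ToyCorridor` are hypothetical stand-ins meant only to show how an arbitrary callable can be wrapped as an agent.

```python
from typing import Callable, Protocol


class Env(Protocol):
    """Hypothetical text environment interface (not BALROG's actual API)."""
    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, bool]: ...


class CallableAgent:
    """Wraps any function (observation, history) -> action string."""
    def __init__(self, policy: Callable[[str, list], str]):
        self.policy = policy
        self.history = []  # (observation, action) pairs seen so far

    def act(self, observation: str) -> str:
        action = self.policy(observation, self.history)
        self.history.append((observation, action))
        return action


def run_episode(env: Env, agent: CallableAgent, max_steps: int = 100) -> int:
    """Run one episode and return the number of steps taken."""
    observation = env.reset()
    for step in range(1, max_steps + 1):
        observation, done = env.step(agent.act(observation))
        if done:
            return step
    return max_steps


class ToyCorridor:
    """Stand-in environment: the episode ends after `length` moves east."""
    def __init__(self, length: int = 3):
        self.length = length
        self.position = 0

    def reset(self) -> str:
        self.position = 0
        return f"You are {self.length} rooms west of the exit."

    def step(self, action: str) -> tuple[str, bool]:
        if action == "go east":
            self.position += 1
        remaining = self.length - self.position
        return f"You are {remaining} rooms west of the exit.", remaining <= 0


agent = CallableAgent(lambda observation, history: "go east")
print(run_episode(ToyCorridor(3), agent))  # 3
```

In practice the callable would wrap an API client or a locally served model, and the history list is where a memory or planning strategy would plug in.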
The BALROG leaderboard tracks two distinct submission categories. The first is new models, where vendors or community contributors run a stock model through the official agent. The second is new inference strategies, where the model is held fixed and a new prompting, memory, planning, or search wrapper is contributed. This separation is intended to disentangle progress driven by stronger base models from progress driven by smarter agent design.
The paper is unusually candid about limitations.
BALROG is one of a growing family of benchmarks aimed at agentic capabilities rather than single shot prediction. The table below contrasts it with several related efforts.
| Benchmark | Domain | Horizon | Procedural generation | Vision required | Released |
|---|---|---|---|---|---|
| BALROG | Six varied games | 10 to 100,000 steps | Yes | Optional | Nov 2024 |
| Factorio Learning Environment | Industrial automation, factory design | Very long, open ended | Yes | Optional | 2025 |
| SWE-bench and SWE-bench Verified | Real world software engineering bugs | Tens of file operations | No, fixed issue set | No | 2023 to 2024 |
| Voyager | Minecraft skill acquisition | Open ended | Partial | Yes | 2023 |
| WebArena | Web navigation tasks | Tens of clicks | No, scripted | No | 2023 |
| OSWorld | Computer use across applications | Tens of steps | Partial | Yes | 2024 |
| MineDojo | Minecraft tasks | Long | Partial | Yes | 2022 |
| ALE | Atari games | Thousands of frames | No | Yes | 2013 |
Voyager is the most similar in spirit, because it also embeds an LLM agent inside a procedurally generated game world, but Voyager focuses on a single environment, Minecraft, and on lifelong skill library construction rather than on direct progress scoring. The Factorio Learning Environment is a complementary effort that emphasizes industrial scale automation and quantitative production targets over diverse game mechanics. SWE-bench evaluates real engineering tasks but with a much shorter horizon than NetHack. Within this landscape BALROG is distinctive for combining six different games of widely varying difficulty under one scoring metric, with first class support for both language and vision input.
BALROG was first presented at the NeurIPS 2024 Open World Agents workshop and the Language Gamification workshop, then accepted at ICLR 2025 in Singapore. The benchmark received substantial coverage in the AI press in November 2024 and again in January 2025 when NVIDIA published a technical blog showing that DeepSeek R1 served through NIM microservices set a new state of the art on the language only track. The leaderboard at balrogai.com remains active, with weekly updates and verified status flags for submissions where the BALROG team has reproduced the reported result.
The BALROG GitHub repository, balrog-ai/BALROG, is MIT licensed and the documentation site at balrog-ai.github.io provides per environment task descriptions, observation schemas, and contribution guidelines. There is an active community Discord linked from the official website, and a separate balrog-ai/experiments repository hosts trajectory artifacts and reproduction scripts.
Beyond the headline averages, the authors and community have produced trajectory analyses that catalog the dominant error types. The following table summarizes them.
| Failure mode | Environment most affected | Description |
|---|---|---|
| Suicidal exploration | NetHack, MiniHack Quest | Descending without preparation, fighting beyond capability, ignoring known dangers |
| Loop and revisit | TextWorld Coin Collector | Returning to explored rooms without acquiring new information |
| Inventory neglect | NetHack, MiniHack | Failing to wear, wield, or apply items that the model can name |
| Rule blindness | Baba Is AI | Treating the rule blocks as inert decorations rather than as the puzzle |
| Adjacency miscount | BabyAI | Off by one errors in placing objects adjacent to others |
| Format hallucination | All | Outputting an action that is not in the allowed set, requiring a fallback |
| Refusal | TextWorld with some safety tuned VLMs | API filters returning refusals on benign in-game content |
These categories are useful when interpreting per model differences. Closed reasoning models tend to reduce loop and revisit errors but still suffer from suicidal exploration. Smaller open weights models accumulate format hallucination at a far higher rate than larger ones, which inflates their failure rate independent of their underlying decision quality.
BALROG occupies a useful middle ground in AI benchmark design. Static question answering benchmarks like MMLU are saturating and can be partially solved by retrieval rather than reasoning. Web and computer use benchmarks like WebArena and OSWorld are realistic but short. Open ended environments like Minecraft are realistic and long but expensive and hard to score consistently. BALROG reuses six well studied games, each with a clean reward structure, and bundles them with a unified protocol so that a single number can be compared across years of model progress.
The benchmark has also become a useful diagnostic for vendor claims. The vision deficiency paradox in particular invites skepticism about marketing material that emphasizes multimodality without measuring action quality. As frontier models like the Gemini 3 series push average BALROG progress past 55 percent in 2026, the harder environments such as NetHack and MiniHack Boxoban will remain the place where genuine long horizon reasoning is being tested.