| BALROG | |
|---|---|
| Overview | |
| Full name | Benchmarking Agentic LLM and VLM Reasoning On Games |
| Abbreviation | BALROG |
| Description | A benchmark evaluating agentic LLM and VLM capabilities through diverse challenging game environments |
| Release date | 2024-11 |
| Latest version | 1.0 |
| Benchmark updated | 2025-04 |
| Authors | Davide Paglieri, Bartłomiej Cupiał, Sam Coward, Ulyana Piterbarg, Maciej Wołczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel |
| Organization | UCL DARK Lab, Oxford, NYU |
| Technical Details | |
| Type | Agentic Reasoning, Game-Based Evaluation, Long-Horizon Planning |
| Modality | Text (LLM), Vision + Text (VLM) |
| Task format | Interactive game environments |
| Number of tasks | 6 game environments (procedurally generated instances) |
| Total examples | Unlimited (procedural generation) |
| Evaluation metric | Progress percentage, Task completion |
| Domains | Spatial reasoning, Planning, Exploration, Problem-solving |
| Languages | English |
| Performance | |
| Human performance | Non-expert: seconds to minutes; Expert: varies by game |
| Baseline | Varies by environment |
| SOTA score | 43.6% (LLM), 35.7% (VLM) |
| SOTA model | Grok-4 (LLM), Gemini-2.5-Pro-Exp-03-25 (VLM) |
| SOTA date | 2025 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | Paper |
| GitHub | Repository |
| Dataset | N/A (procedural generation) |
| License | Open source |
BALROG (Benchmarking Agentic LLM and VLM Reasoning On Games) is a comprehensive artificial intelligence benchmark designed to evaluate the agentic capabilities of large language models (LLMs) and vision-language models (VLMs) through diverse and challenging game environments. Released in November 2024 by researchers from University College London's DARK Lab, Oxford University, and New York University[1], BALROG addresses the critical need to assess AI systems' ability to plan, reason spatially, and explore in dynamic, interactive environments over extended time horizons.
BALROG represents a significant advancement in AI evaluation by moving beyond static benchmarks to test models in interactive, procedurally generated game environments. The benchmark incorporates six diverse games ranging from simple grid-based tasks solvable by non-experts in seconds to complex roguelike adventures that take humans years to master. By requiring agents to make sequential decisions, adapt to changing environments, and pursue long-term goals, BALROG reveals fundamental limitations in current AI systems' reasoning and decision-making capabilities.
Unlike traditional benchmarks that can be solved through memorization or pattern matching, BALROG's procedurally generated environments ensure that models must genuinely understand and reason about their surroundings. The benchmark uniquely offers both text-based (LLM) and vision-based (VLM) evaluation modes, enabling direct comparison of how different input modalities affect agent performance.
BALROG evaluates agents across six carefully selected game environments, each testing different aspects of intelligence:
| Game | Type | Difficulty | Key Skills Tested | Time to Master |
|---|---|---|---|---|
| BabyAI | Grid-based instruction following | Easy | Language understanding, navigation | Minutes |
| Crafter | Survival crafting game | Medium | Resource management, planning | Hours |
| TextWorld | Text adventure | Medium | Natural language understanding, exploration | Hours |
| Baba Is AI | Rule manipulation puzzle | Hard | Logical reasoning, creativity | Days |
| MiniHack | Roguelike dungeon crawler | Hard | Tactical planning, adaptation | Weeks |
| NetHack | Complex roguelike | Extreme | Long-term strategy, vast knowledge | Years |
Each game was selected for specific reasons:
| Game | Selection Rationale | Unique Contribution |
|---|---|---|
| BabyAI | Tests basic instruction following | Baseline language grounding |
| Crafter | Open-ended survival challenges | Resource optimization |
| TextWorld | Pure text-based reasoning | Language-only evaluation |
| Baba Is AI | Meta-level rule manipulation | Abstract reasoning |
| MiniHack | Controlled complexity roguelike | Tactical decision-making |
| NetHack | Ultimate complexity test | Long-horizon planning |
BALROG uses several metrics to evaluate agent performance:
| Metric | Description | Calculation |
|---|---|---|
| **Progress %** | Percentage of game objectives completed | (Completed objectives / Total objectives) × 100 |
| **Success Rate** | Binary task completion | Number of successful runs / Total runs |
| **Efficiency** | Steps taken to achieve goals | Compared to human baseline |
| **Exploration** | Coverage of game state space | Unique states visited / Possible states |
| **Adaptation** | Learning from failures | Performance improvement over episodes |
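The first two formulas in the table above are simple ratios; as a minimal sketch (helper names are illustrative, not part of the BALROG API), they can be computed as:

```python
# Hypothetical helpers illustrating the Progress % and Success Rate
# formulas from the table above; the names are illustrative only.

def progress_percent(completed: int, total: int) -> float:
    """(Completed objectives / Total objectives) x 100."""
    if total == 0:
        return 0.0
    return completed / total * 100

def success_rate(successes: int, runs: int) -> float:
    """Number of successful runs / Total runs."""
    if runs == 0:
        return 0.0
    return successes / runs

print(progress_percent(3, 8))   # 3 of 8 objectives -> 37.5
print(success_rate(21, 100))    # 21 successes over 100 runs -> 0.21
```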
1. **Environment Initialization**: A random seed generates a unique game instance
2. **Agent Deployment**: The model receives an initial observation
3. **Action Loop**: The agent takes actions based on observations
4. **Feedback Processing**: The environment provides the new state and rewards
5. **Termination**: The episode ends on success, failure, or timeout
6. **Aggregation**: Results are averaged across multiple random seeds
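The evaluation procedure above can be sketched as a seeded episode loop. This assumes a hypothetical Gym-style environment interface (`reset`/`step`); none of these names are taken from the actual BALROG package.

```python
# Minimal sketch of the six-step evaluation loop, assuming a hypothetical
# Gym-style environment with reset(seed) and step(action) methods.

def run_episode(env, agent, seed, max_steps=1000):
    obs = env.reset(seed=seed)                 # 1. seed -> unique instance
    total_reward, done, steps = 0.0, False, 0  # 2. agent receives first obs
    while not done and steps < max_steps:
        action = agent.act(obs)                # 3. act on the observation
        obs, reward, done = env.step(action)   # 4. new state and reward
        total_reward += reward
        steps += 1
    return total_reward                        # 5. success/failure/timeout

def evaluate(env, agent, seeds):
    # 6. aggregate by averaging across multiple random seeds
    scores = [run_episode(env, agent, s) for s in seeds]
    return sum(scores) / len(scores)
```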
BALROG supports two evaluation modes:
| Mode | Input Type | Advantages | Challenges |
|---|---|---|---|
| **LLM Mode** | Text descriptions | Rich semantic information | Lacks spatial details |
| **VLM Mode** | Visual + text | Complete information | Requires visual reasoning |
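The practical difference between the two modes is the observation payload the agent receives. A toy sketch (field names are assumptions, not the real BALROG observation schema):

```python
# Illustrative observation payloads for the two modes; the dictionary
# schema here is an assumption, not the actual BALROG format.

def make_observation(mode, text, image=None):
    if mode == "llm":
        return {"text": text}                 # text description only
    if mode == "vlm":
        return {"text": text, "image": image} # rendered frame plus text
    raise ValueError(f"unknown mode: {mode}")

obs = make_observation("vlm", "You see a locked door.", image=b"<png bytes>")
```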
| Rank | Model | Overall Progress % | BabyAI | Crafter | TextWorld | Baba Is AI | MiniHack | NetHack |
|---|---|---|---|---|---|---|---|---|
| 1 | Grok-4 | 43.6% | 82% | 45% | 68% | 35% | 28% | 4% |
| 2 | GPT-4o | 41.2% | 78% | 42% | 65% | 32% | 25% | 5% |
| 3 | Claude 3.5 Sonnet | 38.9% | 75% | 40% | 62% | 30% | 22% | 4% |
| 4 | DeepSeek-R1-671B | 37.5% | 73% | 38% | 60% | 28% | 20% | 6% |
| 5 | Gemini 2.0 Pro | 35.8% | 70% | 35% | 58% | 25% | 18% | 9% |
| Rank | Model | Overall Progress % | BabyAI | Crafter | Baba Is AI | MiniHack | NetHack |
|---|---|---|---|---|---|---|---|
| 1 | Gemini-2.5-Pro-Exp-03-25 | 35.7% | 68% | 38% | 28% | 32% | 13% |
| 2 | GPT-4V | 32.4% | 65% | 35% | 25% | 28% | 9% |
| 3 | Claude 3.5 Vision | 30.1% | 62% | 32% | 22% | 25% | 10% |
| 4 | Llama 3.2 Vision | 25.8% | 55% | 28% | 18% | 20% | 8% |
One of BALROG's most surprising findings is that VLMs consistently underperform LLMs despite having access to richer visual information[1]:
| Observation | Impact | Potential Cause |
|---|---|---|
| VLMs score 5-10% lower | Counterintuitive result | Visual processing interferes with reasoning |
| Spatial errors increase | More collision mistakes | Poor visual-spatial grounding |
| Slower decision-making | Longer inference times | Processing overhead |
| Game | Best Performance | Key Challenge | Failure Mode |
|---|---|---|---|
| BabyAI | 82% (Grok-4) | Multi-step instructions | Forgetting earlier objectives |
| Crafter | 45% (Grok-4) | Resource prioritization | Suboptimal crafting sequences |
| TextWorld | 68% (Grok-4) | Spatial mental models | Getting lost in mazes |
| Baba Is AI | 35% (Grok-4) | Rule modification | Cannot reason about meta-rules |
| MiniHack | 28% (Grok-4) | Combat tactics | Poor threat assessment |
| NetHack | 9% (Gemini 2.0) | Vast complexity | Overwhelming state space |
```python
from balrog import BALROGBenchmark

benchmark = BALROGBenchmark(
    games=['babyai', 'crafter', 'textworld', 'baba', 'minihack', 'nethack'],
    mode='llm',  # or 'vlm' for vision mode
    num_episodes=100,
)

# my_agent: your agent implementation (see the agent interface below)
results = benchmark.evaluate(
    agent=my_agent,
    verbose=True,
    save_trajectories=True,
)

for game, metrics in results.items():
    print(f"{game}: {metrics['progress']:.1%} progress")
    print(f"  Success rate: {metrics['success_rate']:.1%}")
    print(f"  Avg steps: {metrics['avg_steps']}")
```
```python
class BALROGAgent:
    def __init__(self, model):
        self.model = model
        self.memory = []

    def act(self, observation):
        """Generate an action from the current observation.

        Args:
            observation: Game state (text, or image + text in VLM mode)
        Returns:
            action: String action command
        """
        # Add observation to memory
        self.memory.append(observation)
        # Generate an action using the model
        prompt = self.construct_prompt(observation)
        action = self.model.generate(prompt)
        return action

    def construct_prompt(self, observation):
        # Minimal prompt builder: recent history plus the current observation
        history = "\n".join(str(o) for o in self.memory[-5:])
        return f"Observation history:\n{history}\nNext action:"
```
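To exercise the `act(observation) -> action` contract without a real model, a self-contained sketch with a trivial stand-in works; `StubModel` and `EchoAgent` are illustrative names, not part of BALROG.

```python
# Self-contained usage sketch of the agent contract with a stand-in model;
# StubModel and EchoAgent are illustrative, not part of the BALROG package.

class StubModel:
    def generate(self, prompt: str) -> str:
        return "move north"   # a real model would call an LLM here

class EchoAgent:
    """Minimal agent matching the act(observation) -> action contract."""
    def __init__(self, model):
        self.model = model
        self.memory = []

    def act(self, observation):
        self.memory.append(observation)
        return self.model.generate(observation)

agent = EchoAgent(StubModel())
print(agent.act("You are in a small room. Exits: north, east."))  # prints "move north"
```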
BALROG's use of procedural generation is crucial for valid evaluation:
| Aspect | Implementation | Benefit |
|---|---|---|
| **Random Seeds** | Unique seed per episode | Prevents memorization |
| **Level Generation** | Algorithmic map creation | Infinite variety |
| **Item Placement** | Randomized locations | Tests exploration |
| **Enemy Behavior** | Stochastic patterns | Requires adaptation |
| **Objective Variation** | Different goals each run | Tests flexibility |
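The "random seeds" row is the key mechanism: one seed deterministically produces one instance, so reruns are reproducible while fresh seeds defeat memorization. A toy sketch of seeded level generation (the generator here is illustrative, not a BALROG level):

```python
import random

# Toy sketch of per-episode seeding: one seed -> one reproducible level,
# a new seed -> a new level. Not an actual BALROG level generator.

def generate_level(seed, width=8, height=8, n_items=3):
    rng = random.Random(seed)   # independent RNG keyed to this episode
    cells = width * height
    item_cells = rng.sample(range(cells), n_items)  # randomized item placement
    goal = rng.randrange(cells)                      # objective variation
    return {"items": sorted(item_cells), "goal": goal}

# The same seed always reproduces the same instance
assert generate_level(42) == generate_level(42)
```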
| Feature | BALROG | Traditional Benchmarks | Other Game Benchmarks |
|---|---|---|---|
| Multiple Games | 6 diverse games | Single task type | Usually 1 game |
| Difficulty Range | Seconds to years | Fixed difficulty | Limited range |
| Modality Options | LLM and VLM | Usually one | Typically vision-only |
| Procedural Generation | All environments | Static datasets | Some procedural |
| Human Baseline | Clear comparisons | Often missing | Variable |
| Benchmark | Similarity | Key Difference |
|---|---|---|
| MineDojo | Game-based evaluation | Single game (Minecraft) |
| FLE | Long-horizon planning | Focus on automation |
| ALFRED | Sequential tasks | Household domain only |
| BabyAI (standalone) | Included in BALROG | Limited scope |
| ALE | Game evaluation | Simpler games |
BALROG exposes several fundamental limitations in current AI systems:
1. **Poor Transfer Learning**: Skills from easier games don't transfer to harder ones
2. **Limited Exploration**: Models struggle with systematic exploration strategies
3. **Weak Spatial Reasoning**: Even with visual input, spatial understanding is poor
4. **Short Planning Horizons**: Long-term strategic planning remains elusive
5. **Inability to Learn from Failure**: Models don't effectively adapt from mistakes
| Direction | Motivation | Potential Approach |
|---|---|---|
| Memory Systems | Address forgetting | External memory banks |
| Hierarchical Planning | Enable long-term goals | Goal decomposition |
| World Models | Improve prediction | Learn environment dynamics |
| Curiosity Mechanisms | Better exploration | Intrinsic motivation |
| Multi-modal Integration | Fix vision paradox | Better VLM architectures |
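As a toy illustration of the memory-system direction in the table above, an external memory bank could be sketched as a bounded store with naive retrieval; this is purely illustrative, not an approach implemented in BALROG.

```python
from collections import deque

# Toy sketch of the "external memory bank" research direction; purely
# illustrative, not an approach implemented in or proposed by BALROG.

class MemoryBank:
    def __init__(self, capacity=100):
        self.entries = deque(maxlen=capacity)  # oldest entries evicted first

    def store(self, observation, action, outcome):
        self.entries.append((observation, action, outcome))

    def recall(self, keyword):
        # Naive keyword retrieval; a real system would use embeddings
        return [e for e in self.entries if keyword in str(e[0])]
```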
BALROG maintains an active open-source community:
| Component | Status | Location |
|---|---|---|
| Core Framework | Published | github.com/balrog-ai/BALROG |
| Leaderboard | Live | balrogai.com |
| Documentation | Comprehensive | GitHub wiki |
| Model Submissions | Open | Via pull requests |
| Discord Community | Active | Linked from website |
In January 2025, NVIDIA provided NIM microservices for evaluating models like DeepSeek-R1 on BALROG, demonstrating industry interest in the benchmark[2].
| Enhancement | Description | Timeline |
|---|---|---|
| Additional Games | Expand to 10+ environments | 2025 Q4 |
| Multi-agent Support | Cooperative/competitive play | 2026 Q1 |
| Continuous Learning | Persistent agent improvement | 2026 Q2 |
| Human Studies | Detailed human baselines | Ongoing |
| Real-time Evaluation | Streaming game play | 2026 |
1. **Hybrid Architectures**: Combining symbolic and neural approaches
2. **Curriculum Learning**: Progressive training across games
3. **Meta-Learning**: Learning to play new games quickly
4. **Interpretability**: Understanding agent decision-making
5. **Efficiency**: Reducing computational requirements
BALROG represents a critical benchmark for evaluating true agentic AI capabilities. By requiring models to navigate diverse, procedurally generated game environments, it tests essential skills like planning, exploration, and adaptation that are fundamental to general intelligence. The benchmark's finding that current state-of-the-art models achieve less than 44% progress overall, and that vision paradoxically hinders rather than helps performance, reveals how far we remain from achieving robust, general-purpose AI agents.
The diversity of games, from simple instruction-following to complex roguelikes, provides a comprehensive evaluation framework that will remain challenging as AI capabilities advance. BALROG's emphasis on procedural generation ensures that future progress will reflect genuine reasoning improvements rather than dataset memorization, making it a valuable long-term benchmark for the AI community.