Factorio Learning Environment
| Factorio Learning Environment | |
|---|---|
| Overview | |
| Full name | Factorio Learning Environment |
| Abbreviation | FLE |
| Description | An open-ended AI evaluation environment, based on the game Factorio, that tests long-term planning, program synthesis, and resource optimization |
| Release date | 2025-03-06 |
| Latest version | 1.0 |
| Benchmark updated | 2025-03 |
| Authors | Jack Hopkins, Mart Bakler, Akbir Khan |
| Organization | Independent researchers |
| Technical Details | |
| Type | Long-term Planning, Resource Optimization, Program Synthesis |
| Modality | Text, Code, Spatial reasoning |
| Task format | Game-based automation challenges |
| Number of tasks | 8 lab-play tasks + open-ended factory building |
| Total examples | N/A (interactive environment) |
| Evaluation metric | Task completion, Factory size, Resource throughput |
| Domains | Automation, Manufacturing, Resource management, Spatial reasoning |
| Languages | English, Python API |
| Performance | |
| Human performance | Expert players achieve millions of resources/second |
| Baseline | Basic automation only |
| SOTA score | Electric drilling automation |
| SOTA model | Claude 3.5-Sonnet |
| SOTA date | 2025-03 |
| Saturated | No |
| Resources | |
| Website | Official website |
| Paper | arXiv:2503.09617 |
| GitHub | Repository |
| Dataset | N/A (procedurally generated) |
| License | Open source |
Factorio Learning Environment (FLE) is an innovative artificial intelligence evaluation environment based on the popular automation game Factorio, designed to test large language models (LLMs) on complex, open-ended tasks involving long-term planning, program synthesis, and resource optimization. Released in March 2025 by Jack Hopkins, Mart Bakler, and Akbir Khan[1], FLE addresses the critical need for new benchmarks as existing evaluations become saturated by rapidly improving language models. The environment provides exponentially scaling challenges ranging from basic automation to complex factories processing millions of resources per second, revealing fundamental limitations in current AI systems' spatial reasoning and long-term planning capabilities.
Overview
FLE leverages the rich complexity of Factorio, a game centered on building and optimizing automated production chains, to create an evaluation environment that tests capabilities poorly measured by traditional benchmarks. Unlike static question-answering or code generation tasks, FLE requires agents to navigate a dynamic world, manage resources, plan complex production pipelines, and continuously optimize their strategies over extended time horizons.
Motivation
The creation of FLE was motivated by several factors:
- **Benchmark Saturation**: LLMs are rapidly achieving high scores on existing benchmarks
- **Limited Open-Endedness**: Most benchmarks have fixed upper bounds on performance
- **Poor Long-Term Evaluation**: Few benchmarks test planning over thousands of steps
- **Lack of Spatial Reasoning Tests**: Traditional benchmarks poorly evaluate spatial understanding
- **Real-World Complexity Gap**: Need for evaluations that mirror real-world complexity
Environment Design
Core Game Mechanics
FLE is built on Factorio's core mechanics, which provide a rich problem space:
| Mechanic | Description | AI Challenge |
|---|---|---|
| Resource Extraction | Mining ore and gathering materials | Optimization and placement |
| Crafting System | Combining items to create complex products | Recipe planning and sequencing |
| Automation | Building machines and conveyor systems | Spatial layout and logistics |
| Research Tree | Unlocking new technologies | Strategic planning and prioritization |
| Power Management | Generating and distributing electricity | Resource allocation |
| Scaling | Exponential growth of production | Long-term optimization |
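The optimization pressure these mechanics create can be made concrete with a little throughput arithmetic: a linear production chain runs only as fast as its slowest stage. The sketch below is illustrative only; the rates are hypothetical numbers, not taken from the game or from FLE.

```python
# Illustrative only: bottleneck arithmetic for a simple Factorio-style
# production chain. Rates are hypothetical, not actual game values.

def chain_throughput(stages):
    """Items/second through a linear chain, limited by the slowest stage.

    Each stage is (machine_count, items_per_second_per_machine).
    """
    return min(count * rate for count, rate in stages)

# 2 drills at 0.5/s feed 3 furnaces at 0.3125/s: the furnaces
# (3 * 0.3125 = 0.9375/s) are the bottleneck, not the drills (1.0/s).
stages = [(2, 0.5), (3, 0.3125)]
print(chain_throughput(stages))  # 0.9375
```

Scaling a factory amounts to repeatedly finding and widening the current bottleneck, which is exactly the kind of sustained optimization FLE measures.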
Two Evaluation Settings
FLE provides two distinct evaluation modes to test different aspects of intelligence:
Lab-Play Mode
| Aspect | Details |
|---|---|
| **Tasks** | 8 structured challenges |
| **Resources** | Fixed, limited resources |
| **Objective** | Complete specific goals |
| **Focus** | Problem-solving under constraints |
| **Time Horizon** | Short to medium term |
| **Difficulty** | Progressive, from basic to complex |
The eight lab-play tasks test specific skills:
1. **Basic Crafting**: Create simple items from raw materials
2. **Automation Setup**: Build basic automated production
3. **Power Systems**: Establish electricity generation
4. **Circuit Manufacturing**: Create electronic circuits
5. **Advanced Automation**: Complex multi-stage production
6. **Logistics Networks**: Optimize material transport
7. **Research Goals**: Unlock specific technologies
8. **Efficiency Challenge**: Maximize output with minimal resources
Open-Play Mode
| Aspect | Details |
|---|---|
| **Task** | Build the largest possible factory |
| **Resources** | Procedurally generated infinite map |
| **Objective** | Maximize production throughput |
| **Focus** | Open-ended optimization |
| **Time Horizon** | Unbounded long-term |
| **Difficulty** | Self-scaling based on progress |
Technical Implementation
Agent Interface
FLE provides a Python API for agent interaction:
```python
from factorio_env import FactorioEnv

# Initialize environment
env = FactorioEnv(mode='lab-play', task_id=1)

# Agent interaction loop
observation = env.reset()
done = False
while not done:
    # Agent decides action based on observation
    action = agent.get_action(observation)
    # Execute action in environment
    observation, reward, done, info = env.step(action)

# Action types include:
# - move(direction)
# - mine(position)
# - craft(item, quantity)
# - place(building, position)
# - connect(pos1, pos2)  # For belts/pipes
# - research(technology)
```
Observation Space
Agents receive rich observations including:
| Observation Type | Information Provided | Format |
|---|---|---|
| Map View | Local area around agent | 2D grid array |
| Inventory | Current items and quantities | Dictionary |
| Production Stats | Resource flow rates | Time series data |
| Research Status | Available and completed technologies | Tree structure |
| Power Grid | Electricity generation and consumption | Network graph |
| Recipe Book | Available crafting recipes | Structured database |
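To make the table concrete, the dictionary below sketches one plausible shape for an observation payload. The field names and values are inferred from the table above and are assumptions, not the verified FLE API.

```python
# Hypothetical observation payload; field names are illustrative,
# inferred from the observation-space table, not the real FLE schema.
observation = {
    "map_view": [[0, 0, 1], [0, 2, 1], [0, 0, 0]],   # 2D grid of entity ids
    "inventory": {"iron-plate": 24, "stone-furnace": 2},
    "production_stats": {"iron-plate": [0.0, 0.4, 0.9]},  # items/s over time
    "research": {"completed": ["automation"], "available": ["logistics"]},
    "power": {"generated_kw": 900, "consumed_kw": 720},
}

# An agent might branch on such fields, e.g. checking power headroom:
load = observation["power"]["consumed_kw"] / observation["power"]["generated_kw"]
low_power = load > 0.9
print(low_power)  # False: the grid is at 80% load
```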
Action Space
| Action Category | Examples | Complexity |
|---|---|---|
| Movement | Walk, rotate | Low |
| Resource Gathering | Mine ore, chop trees | Low |
| Crafting | Create items from inventory | Medium |
| Building Placement | Place machines, belts, inserters | High |
| Configuration | Set machine recipes, circuit conditions | High |
| Research | Select technology to unlock | Strategic |
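A trivial scripted baseline helps illustrate how these action categories compose into behavior. Everything here is hypothetical: the real FLE agents synthesize Python programs against the environment API rather than returning tuples from a fixed-priority policy.

```python
# A minimal rule-based baseline: pick the next action from fixed
# priorities given a dict-style observation. All names are hypothetical
# illustrations, not part of the real FLE interface.

def baseline_policy(obs):
    """Return a (verb, *args) action tuple for the current observation."""
    inv = obs.get("inventory", {})
    if inv.get("iron-ore", 0) < 10:
        return ("mine", "nearest-iron-ore")      # gather before anything else
    if inv.get("iron-plate", 0) < 5:
        return ("craft", "iron-plate", 5)        # smelt a small buffer
    return ("place", "stone-furnace", (0, 1))    # then start automating

print(baseline_policy({"inventory": {"iron-ore": 3}}))
# ('mine', 'nearest-iron-ore')
```

Hand-written priority lists like this plateau quickly, which is precisely why FLE is framed as a program-synthesis and planning benchmark rather than a scripting exercise.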
Evaluation Results
Model Performance (March 2025)
| Model | Lab-Play Success | Open-Play Achievement | Key Strengths | Key Weaknesses |
|---|---|---|---|---|
| Claude 3.5-Sonnet | 3/8 tasks | Electric drilling automation | Complex crafting, research investment | Spatial layout errors |
| GPT-4o | 2/8 tasks | Basic automation only | Short-term planning | Poor error recovery |
| GPT-4o-Mini | 1/8 tasks | Manual operation only | Basic crafting | Limited planning horizon |
| Deepseek-v3 | 2/8 tasks | Basic automation | Resource management | Spatial reasoning |
| Gemini-2-Flash | 1/8 tasks | Manual operation | Fast execution | Complex automation failure |
| Llama-3.3-70B | 1/8 tasks | Manual operation | Best open-source result | Constrained reasoning |
Key Findings
Spatial Reasoning Limitations
All evaluated models demonstrated significant spatial reasoning deficits[1]:
| Issue | Description | Frequency |
|---|---|---|
| Collision Errors | Attempting to place objects in occupied spaces | 45% |
| Connectivity Failures | Improper belt/pipe connections | 38% |
| Layout Inefficiency | Suboptimal spatial arrangements | 62% |
| Distance Misjudgment | Incorrect range calculations | 31% |
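The dominance of collision errors is easy to understand: a legal placement must verify every tile in a building's footprint against the existing layout, a check models frequently skip or get wrong. The sketch below uses a plain set of occupied tiles, not any real FLE data structure.

```python
# Sketch of the footprint check behind collision errors. A set of
# occupied (x, y) tiles stands in for the factory grid; this is an
# illustration, not FLE's actual placement logic.

def can_place(occupied, origin, width, height):
    """True if a width x height footprint at origin hits no occupied tile."""
    ox, oy = origin
    return all(
        (ox + dx, oy + dy) not in occupied
        for dx in range(width)
        for dy in range(height)
    )

occupied = {(2, 2), (3, 2)}                  # e.g. a belt segment
print(can_place(occupied, (0, 0), 2, 2))     # True: footprint is clear
print(can_place(occupied, (2, 1), 2, 2))     # False: overlaps tile (2, 2)
```

An agent that fails to run this kind of check before every `place` call accumulates exactly the 45% collision rate reported above.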
Temporal Planning Challenges
| Planning Horizon | Success Rate | Limiting Factor |
|---|---|---|
| Immediate (1-10 steps) | 85% | Good performance |
| Short-term (10-100 steps) | 45% | Error accumulation |
| Medium-term (100-1000 steps) | 12% | Lost context |
| Long-term (1000+ steps) | <5% | No coherent strategy |
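The steep drop-off with horizon length is what compounding error predicts: if each step succeeds independently with probability p, an n-step plan succeeds with probability p^n. The numbers below are a back-of-the-envelope illustration, not values fitted to the table above.

```python
# Back-of-the-envelope: why success rates collapse with plan length.
# With independent per-step success probability p, an n-step plan
# succeeds with probability p ** n. Illustrative only.
p = 0.99  # hypothetical 99% per-step reliability
for n in (10, 100, 1000):
    print(n, round(p ** n, 3))
# 10 -> 0.904, 100 -> 0.366, 1000 -> ~0.0
```

Even 99% per-step reliability leaves almost no chance of completing a thousand-step plan without recovery, which is why error correction matters as much as planning.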
Open-Play Performance Analysis
In the unbounded open-play setting, Claude 3.5-Sonnet achieved the best results:
| Milestone | Step Count | Achievement |
|---|---|---|
| Initial Setup | 0-500 | Manual resource gathering |
| Basic Automation | 500-1500 | Stone furnaces, manual feeding |
| Research Investment | 1500-3000 | Technology unlocking |
| **Electric Drilling** | 3000+ | Automated ore extraction |
| Production Boost | 3000-5000 | Increased iron plate output |
| Plateau | 5000+ | Failed to achieve next automation level |
Despite this progress, no model achieved complex automation milestones that human players routinely reach, such as:
- Electronic circuit manufacturing chains
- Oil processing and refining
- Robot-based logistics networks
- Nuclear power generation
Comparison with Other Benchmarks
Unique Characteristics
| Feature | FLE | Traditional Benchmarks |
|---|---|---|
| Time Horizon | Thousands of steps | Single step or few steps |
| State Space | Continuous, high-dimensional | Discrete, limited |
| Objective | Open-ended optimization | Fixed target |
| Feedback | Delayed rewards | Immediate scoring |
| Complexity Growth | Exponential | Linear or fixed |
| Spatial Reasoning | Critical | Minimal or absent |
Complementary Evaluations
| Benchmark | Focus | Similarity to FLE |
|---|---|---|
| MineDojo | Minecraft tasks | Game-based, open-world |
| NetHack Learning Environment | Roguelike gameplay | Sequential decision-making |
| ALFRED | Household tasks | Spatial reasoning, planning |
| BabyAI | Instruction following | Multi-step goals |
| Crafter | Survival game | Resource management |
Insights and Implications
Revealed Limitations
FLE exposes several fundamental limitations in current LLMs:
1. **Spatial Reasoning Deficit**: Models struggle with 2D spatial relationships despite handling complex language tasks
2. **Error Recovery**: Poor ability to diagnose and correct mistakes in constrained environments
3. **Long-Term Coherence**: Strategies deteriorate over extended time horizons
4. **Compositional Planning**: Difficulty combining simple skills into complex behaviors
5. **Resource Optimization**: Suboptimal allocation and prioritization decisions
Promising Capabilities
Despite limitations, models demonstrated some encouraging abilities:
| Capability | Evidence | Implications |
|---|---|---|
| Basic Automation Understanding | Successful simple crafting | Foundation for complex systems |
| Research Prioritization | Strategic technology unlocking | Goal-directed behavior |
| Incremental Improvement | Production increases over time | Learning from environment |
| Tool Use | Utilizing game mechanics | Adaptation to interfaces |
Installation and Usage
Installation
FLE can be installed via pip:
```bash
# Install via pip
pip install factorio-learning-environment

# Or using uv
uv add factorio-learning-environment
```
Basic Usage Example
```python
import factorio_env as fle

# Create environment
env = fle.FactorioEnv(
    mode='lab-play',
    task_id=1,
    render_mode='human'  # For visualization
)

# Run evaluation
observation = env.reset()
total_reward = 0
for step in range(10000):
    # Your agent logic here
    action = agent.policy(observation)
    observation, reward, done, info = env.step(action)
    total_reward += reward
    if done:
        print(f"Task completed! Total reward: {total_reward}")
        break
```
Future Directions
Planned Enhancements
| Enhancement | Description | Timeline |
|---|---|---|
| Multi-Agent Support | Cooperative factory building | 2025 Q4 |
| Combat Scenarios | Defense against biters | 2026 Q1 |
| Mod Support | Custom content and rules | 2026 Q2 |
| Curriculum Learning | Progressive difficulty tasks | Ongoing |
| Human Demonstrations | Learning from expert play | Research phase |
Research Opportunities
1. **Hybrid Approaches**: Combining LLMs with specialized spatial reasoning modules
2. **Hierarchical Planning**: Multi-level goal decomposition strategies
3. **Memory Systems**: External memory for long-term planning
4. **Program Synthesis**: Generating automation blueprints
5. **Transfer Learning**: Applying Factorio skills to other domains
Significance
FLE represents a crucial step toward more realistic AI evaluation, moving beyond saturated benchmarks to test capabilities essential for real-world applications. By requiring agents to plan over extended time horizons, manage complex spatial relationships, and continuously optimize in an open-ended environment, FLE reveals that despite impressive progress in language understanding, current AI systems still lack fundamental reasoning capabilities that humans take for granted.
The benchmark's exponentially scaling challenges ensure it will remain relevant as AI capabilities improve, while its open-ended nature provides a rich playground for developing and testing new approaches to machine intelligence. As models struggle with tasks that experienced Factorio players find routine, FLE serves as both a humbling reminder of current limitations and a concrete target for advancing toward more capable AI systems.
See Also
- Factorio
- Long-term Planning
- Resource Optimization
- Game-Based AI Evaluation
- Spatial Reasoning
- Open-Ended Learning
- MineDojo
- NetHack Learning Environment
References
- Hopkins, J., Bakler, M., & Khan, A. (2025). "Factorio Learning Environment". arXiv:2503.09617. Retrieved from https://arxiv.org/abs/2503.09617