# Factorio Learning Environment

> Source: https://aiwiki.ai/wiki/factorio_learning_environment
> Updated: 2026-05-16
> Categories: AI Benchmarks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

| Factorio Learning Environment | |
| --- | --- |
| Overview |
| Full name | Factorio Learning Environment |
| Abbreviation | FLE |
| Description | An open-ended evaluation framework that uses the industrial automation game Factorio to test long-horizon planning, spatial reasoning, program synthesis, and resource optimization in [large language model](/wiki/large_language_model) agents |
| First arXiv release | 2025-03-12 (paper 2503.09617) |
| Conference | NeurIPS 2025, Datasets and Benchmarks Track |
| Latest version | v0.4.x (2026) |
| Authors | Jack Hopkins, Mart Bakler, Akbir Khan |
| Lead affiliations | Anthropic, University College London |
| Technical Details |
| Type | Long-horizon agent benchmark, program synthesis, resource optimization |
| Modality | Code (Python REPL), text observations, optional pixel renderer |
| Task format | Lab-play production challenges and unbounded open-play factory building |
| Number of tasks | 24 lab-play target entities plus open-play |
| Total examples | Procedurally generated; unbounded in open-play |
| Evaluation metrics | Production Score (PS), Milestones, lab-play task success rate |
| Domains | Industrial automation, logistics, spatial layout, research progression |
| Languages | English prompts, Python action space |
| Performance |
| Best published lab-play (v1 paper) | Claude 3.5 Sonnet, 21.9% (7 of 24 tasks fully automated) |
| Best published open-play PS (v1) | Claude 3.5 Sonnet, 293,206 |
| Best published milestones (v1) | Claude 3.5 Sonnet, 28 milestones |
| Human ceiling | Expert players reach factories processing millions of items per second |
| Saturated | No, intentionally non-saturating |
| Resources |
| Website | [Official site](https://jackhopkins.github.io/factorio-learning-environment/) |
| Paper | [arXiv 2503.09617](https://arxiv.org/abs/2503.09617) |
| GitHub | [JackHopkins/factorio-learning-environment](https://github.com/JackHopkins/factorio-learning-environment) |
| Leaderboard | [FLE leaderboard](https://jackhopkins.github.io/factorio-learning-environment/leaderboard/) |
| Epoch AI page | [Epoch benchmark entry](https://epoch.ai/benchmarks/factorio-learning-environment) |
| License | MIT for code, CC BY 4.0 for the paper |

The **Factorio Learning Environment** (**FLE**) is an open source benchmark and research framework that uses the industrial automation game [Factorio](/wiki/factorio) to evaluate the long-horizon agentic capabilities of [large language model](/wiki/large_language_model) systems. It was introduced in March 2025 by Jack Hopkins, Mart Bakler, and Akbir Khan in the paper *Factorio Learning Environment* (arXiv:2503.09617), and it was later accepted to the NeurIPS 2025 Datasets and Benchmarks Track. FLE is designed as a deliberately non-saturating environment: rather than asking a model to pick the right multiple choice answer or close a single ticket, it asks an agent to grow a working factory whose throughput can scale across roughly six orders of magnitude, from a handful of items per minute to the millions of items per second that experienced human players routinely achieve in late game play.

The environment is built around a Python read-eval-print loop (REPL) in which an agent observes the game state, writes Python programs that call a typed Factorio API, executes them against a running game server, and reads back structured feedback. It supplies two evaluation modes: a structured **lab-play** suite of 24 production tasks with fixed starting resources, and an **open-play** mode where a single agent is given a procedurally generated map and the instruction to build the largest factory it can. The first release reported that even the strongest frontier model of that era, [Claude 3.5 Sonnet](/wiki/claude_3_5_sonnet), fully automated only 7 of the 24 lab-play tasks and reached a Production Score of 293,206 in open-play, while smaller and weaker models stalled at much earlier rungs of the technology tree.

## Background and motivation

Many classic AI benchmarks have collapsed under the pace of progress in [foundation models](/wiki/foundation_model). [MMLU](/wiki/mmlu), [HumanEval](/wiki/humaneval), [GSM8K](/wiki/gsm8k), and other multiple choice or short answer suites are now scored above 90% by several models, leaving little headroom to differentiate frontier systems. The authors of FLE argue that this saturation hides deep gaps in capabilities that matter for real world deployment: long-horizon planning over thousands of decisions, spatial reasoning over a 2D grid, error diagnosis under partial information, and the ability to compose simple primitives into reusable abstractions.

Factorio is a natural fit for these questions for several reasons. The game has a fully deterministic engine, an unambiguous notion of progress measured in items produced per minute, and a technology tree that grows exponentially in complexity. A starting factory needs only two machines to mine iron ore, but a fully optimized end game base for a single unit of utility science requires coordinating close to a hundred machines across many subassemblies. The same skill set (breaking a goal into subgoals, laying out machines on a grid, debugging bottlenecks) applies at every level. That makes Factorio an environment where the gap between novice and expert behavior is enormous, where progress is easiest to read on a logarithmic scale, and where the same engine can challenge agents far above the capabilities of today's models.

The authors note that this is also why Factorio became a cult object among engineering teams. The game encodes industrial planning problems that resemble logistics, manufacturing operations research, and distributed systems engineering. FLE therefore positions itself not only as an [AI benchmark](/wiki/ai_benchmark) but also as a sandbox where researchers can study how language model agents learn to plan, refactor code, and recover from errors over horizons measured in hours of wall clock time.

### Why a non-saturating benchmark matters

A persistent concern in the [AI evaluation](/wiki/ai_evaluation) community is that successive model generations rapidly close the gap between random performance and the human ceiling, after which differences between models stop being legible. FLE takes a different approach. Production Score spans several orders of magnitude, so results are most naturally compared on a logarithmic axis, where each new tier of automation adds a comparable interval. Going from manual ore mining to electric drills moves a model by one band, while moving from green circuits to red circuits to blue circuits adds two more. Because there is no natural completion state, even the best contemporary model can be beaten by a future system that simply scales further.

## Authors and provenance

The FLE paper credits three authors. Jack Hopkins and Mart Bakler are listed as joint first authors, with Akbir Khan as the third author. Hopkins works at [Anthropic](/wiki/anthropic), and Khan is affiliated with University College London. The project was developed primarily in 2024 and early 2025 and released as an open source repository at github.com/JackHopkins/factorio-learning-environment under an MIT style license, with the paper distributed under CC BY 4.0.

The code base consists of roughly eighty thousand lines, including a Python client library, a Lua mod that exposes Factorio internals to that client, an evaluation harness with logging via Weights and Biases, and CLI tooling for spinning up Docker based clusters of Factorio servers. The repository continues to be actively developed: v0.3.0 in late 2025 added a headless renderer, an OpenAI Gym compatible interface, and an adapter for [Claude Code](/wiki/claude_code), and the v0.4 series in 2026 extended these with additional tasks and Model Context Protocol integration.

## Environment design

### Core game mechanics

FLE is built directly on Factorio, an industrial automation game by the Czech studio Wube Software. The game's core loop forces the player to gather raw materials, build machines that craft intermediate products, and string those machines together with belts, pipes, and electrical infrastructure. The environment exposes these elements to an agent through a typed Python interface.

| Mechanic | Description | Why it matters for agents |
| --- | --- | --- |
| Resource extraction | Mining ore, harvesting trees, pumping crude oil | Tests perception of resource patches and decisions about siting |
| Crafting | Combining inputs in furnaces, assemblers, chemical plants | Forces recipe planning over a directed graph of ingredients |
| Belts and inserters | Moving items between machines | Requires precise spatial layout and throughput matching |
| Pipes and fluids | Routing liquids between refineries and plants | Adds a second logistics network with different rules |
| Power | Burner, steam, solar, and nuclear generation | Introduces constraints that interact with every other system |
| Research | Unlocking technologies by feeding science packs into labs | Rewards strategic investment of throughput into research |
| Biters | Hostile insect-like creatures (optional in FLE) | Adds defensive and military objectives in later expansions |

A factory in Factorio is essentially a dataflow graph laid out on a 2D grid, with each node consuming inputs at a fixed rate and producing outputs at another fixed rate. The agent's job is to assemble that graph from primitive building blocks while keeping its surface area, latency, and bottlenecks in check.
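The dataflow view can be made concrete with a short sketch. It assumes a purely linear chain with one-to-one item ratios, and the machine counts and rates are illustrative placeholders rather than real Factorio recipe values. The point is only that such a chain runs at the rate of its slowest stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One node in a linear production chain (all numbers are illustrative)."""
    name: str
    machines: int             # identical machines running in parallel
    rate_per_machine: float   # items per minute each machine can process


def stage_capacity(stage: Stage) -> float:
    """Total items per minute the stage can handle."""
    return stage.machines * stage.rate_per_machine


def bottleneck(chain: list[Stage]) -> Stage:
    """Return the stage with the lowest capacity; it caps the whole chain."""
    return min(chain, key=stage_capacity)


chain = [
    Stage("drills", machines=4, rate_per_machine=15.0),
    Stage("furnaces", machines=2, rate_per_machine=18.0),
    Stage("assemblers", machines=1, rate_per_machine=40.0),
]

slowest = bottleneck(chain)
print(f"Chain capped at {stage_capacity(slowest):.0f}/min by {slowest.name}")
```

In this toy chain, adding more drills would not raise output, because the furnaces are the limiting stage.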

### Lab-play

The **lab-play** mode is the primary structured benchmark. It consists of 24 production tasks, each pinned to a specific target item from the Factorio technology tree. Each task starts the agent on a small map with a fixed inventory and the relevant prerequisite technologies already researched, so the only challenge is to build a production line that reaches a target throughput within a 60-second in-game holdout window. Success thresholds are 16 items per minute for solid items and 250 units per minute for fluids, with a budget of 128 API calls per task (the v0.3 release reduced this to 64-step trajectories with early stopping).
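The pass or fail rule can be written down in a few lines. This is a sketch of the rule as described here, not the harness's own code: the function name is hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
SOLID_THRESHOLD_PER_MIN = 16    # items per minute, from the lab-play rules
FLUID_THRESHOLD_PER_MIN = 250   # fluid units per minute, from the lab-play rules
HOLDOUT_SECONDS = 60            # length of the in-game holdout window


def meets_threshold(units_in_window: float, is_fluid: bool,
                    window_seconds: float = HOLDOUT_SECONDS) -> bool:
    """Convert a count seen in the holdout window to a per-minute rate and compare."""
    per_minute = units_in_window * 60.0 / window_seconds
    threshold = FLUID_THRESHOLD_PER_MIN if is_fluid else SOLID_THRESHOLD_PER_MIN
    return per_minute >= threshold  # inclusive comparison is an assumption


solid_ok = meets_threshold(18, is_fluid=False)   # 18 plates in 60 s
fluid_ok = meets_threshold(200, is_fluid=True)   # 200 fluid units in 60 s
print(solid_ok, fluid_ok)
```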

The lab-play target entities span the full game progression. Although the paper does not enumerate all 24 in a single list, the tasks cover the following categories drawn from its descriptions:

| Tier | Example target entities | Approximate machine count |
| --- | --- | --- |
| Tier 1, raw extraction | Iron ore, copper ore, coal, stone | 1 to 2 |
| Tier 2, basic smelting | Iron plate, copper plate, stone brick | 2 to 6 |
| Tier 3, intermediate parts | Iron gear wheel, copper cable, automation science pack | 6 to 12 |
| Tier 4, electronics | Electronic circuit (green chip), advanced circuit (red chip) | 10 to 25 |
| Tier 5, chemicals and fluids | Plastic bar, sulfur, sulfuric acid, lubricant, batteries | 15 to 40 |
| Tier 6, mechanical assemblies | Engine unit, electric engine, steel plate | 20 to 50 |
| Tier 7, late game | Military science pack, utility science pack, processing unit (blue chip) | 50 to 100 |

Lab-play is the closest thing FLE has to a traditional benchmark because each task has a clean pass or fail outcome. Aggregating across all 24 tasks produces the lab-play success rate that drives the published leaderboard.

### Open-play

The **open-play** mode is more open ended. The agent receives a procedurally generated map, a single instruction to build the largest possible factory, and a budget of up to 5,000 environment steps. Each model is evaluated across eight independent runs, and median Production Score and Milestones are reported. Because there is no fixed target item and no time limit beyond the step budget, open-play rewards agents that can set their own subgoals, scale infrastructure proactively, and recover from mistakes without external guidance.

Open-play exposes capabilities that lab-play deliberately suppresses. In lab-play the relevant technologies are pre-unlocked, so the model never has to decide whether to invest throughput into research. In open-play it has to choose between building more iron furnaces today and pushing for electric drilling tomorrow, an explicitly long-horizon trade off.
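A small worked example shows why this trade-off is long-horizon. All rates and the research cost below are made-up numbers chosen for illustration; only the 5,000-step budget comes from the open-play setup. The sketch assumes that researching produces nothing for a fixed number of steps and then raises the per-step output rate.

```python
STEP_BUDGET = 5000  # open-play step budget


def output_without_research(steps: int, base_rate: float) -> float:
    """Cumulative output when the agent never invests in research."""
    return base_rate * steps


def output_with_research(steps: int, boosted_rate: float, research_steps: int) -> float:
    """Cumulative output when the first `research_steps` steps yield nothing."""
    return boosted_rate * max(0, steps - research_steps)


def break_even_step(base_rate, boosted_rate, research_steps, budget=STEP_BUDGET):
    """First step at which the research strategy catches up, or None if it never does."""
    for t in range(1, budget + 1):
        if output_with_research(t, boosted_rate, research_steps) >= \
                output_without_research(t, base_rate):
            return t
    return None


# Hypothetical numbers: 10 items/step by hand, 25 items/step after 300 steps of research.
catch_up = break_even_step(base_rate=10, boosted_rate=25, research_steps=300)
final_manual = output_without_research(STEP_BUDGET, 10)
final_research = output_with_research(STEP_BUDGET, 25, 300)
print(catch_up, final_manual, final_research)
```

Under these assumed numbers the research strategy is behind for the first several hundred steps and well ahead by the end of the budget, which is the kind of delayed payoff a short-sighted agent misses.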

### Evaluation metrics

FLE reports three families of numbers, each designed to capture a different facet of agent behavior.

#### Production Score

The **Production Score** (PS) is a continuous measure of economic activity. For each item in the game, the system assigns a value V(i) derived from the recipe's complexity, ingredient depth, and energy cost. The Production Score at time t is then a weighted sum:

```
PS(t) = sum over items i of V(i) * (P_i(t) - C_i(t))
```

where P_i(t) is the cumulative quantity of item i produced up to time t and C_i(t) is the cumulative quantity consumed. Raw items such as iron ore have V near 3, while a single processing unit (blue chip) is valued in the thousands. Because the same factory can extend its score by simply running longer or producing more advanced items, Production Score varies across orders of magnitude and never saturates.
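The formula translates directly into code. In the sketch below, the value of iron ore follows the "near 3" figure above, while the iron plate value and all quantities are placeholders invented for the example. The function name is likewise hypothetical.

```python
def production_score(values: dict, produced: dict, consumed: dict) -> float:
    """PS = sum over items of V(i) * (P_i - C_i), using cumulative counts."""
    return sum(
        value * (produced.get(item, 0) - consumed.get(item, 0))
        for item, value in values.items()
    )


values = {"iron-ore": 3.0, "iron-plate": 7.0}   # iron-plate value is a placeholder
produced = {"iron-ore": 100, "iron-plate": 40}  # cumulative production
consumed = {"iron-ore": 40}                     # ore smelted into plates

score = production_score(values, produced, consumed)
print(score)
```

Subtracting consumption means that ore turned into plates is not counted twice: only the 60 units of ore still on hand contribute at the ore value.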

#### Milestones

A **Milestone** is hit the first time an agent successfully produces a particular item or unlocks a particular technology. The paper defines a fixed list of milestones tied to the major tiers of the Factorio progression, from gathering wood through researching electric energy distribution and into late game logistics. Where Production Score answers the question "how much economic activity did the agent generate," milestones answer "how broad was the technology tree it covered."

#### Lab-play success rate

The lab-play **success rate** is the fraction of the 24 target entities for which the agent built a production line meeting the throughput threshold during the holdout window. The published numbers are means over multiple seeds with standard error.
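A mean with standard error can be computed as follows. The per-seed counts are invented for illustration and are not results from the paper. The sketch also assumes the rate is computed per seed and then averaged, which is one plausible reading of the description above.

```python
import math
import statistics

TOTAL_TASKS = 24

# Hypothetical number of tasks solved in each of four seeds.
solved_per_seed = [5, 6, 5, 4]

rates = [solved / TOTAL_TASKS for solved in solved_per_seed]
mean_rate = statistics.mean(rates)
# Standard error of the mean: sample standard deviation over sqrt(n).
std_err = statistics.stdev(rates) / math.sqrt(len(rates))

print(f"{100 * mean_rate:.1f} ± {100 * std_err:.1f}%")
```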

## Technical architecture

### Agent loop and REPL

FLE deliberately rejects the classical reinforcement learning interface in favor of a Python REPL pattern. At each step the agent receives the standard output and standard error of its last program along with any structured results, then writes the next program in a persistent Python namespace. Variables, helper functions, and even cached state survive across steps, so an agent can define a `place_assembly_line` helper early on and reuse it later. This pattern echoes the way human programmers write provisional code at an interactive shell, observe the result, and iterate.
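The persistent-namespace idea can be illustrated with the standard library alone. The class below is a toy stand-in, not FLE's implementation: it runs each program with `exec` in one shared dictionary and returns captured standard output and standard error as the next observation.

```python
import contextlib
import io
import traceback


class PersistentRepl:
    """Toy stateful REPL: one namespace shared by every step."""

    def __init__(self):
        self.namespace = {}

    def step(self, program: str) -> tuple[str, str]:
        """Run a program and return (stdout, stderr) as the observation."""
        out, err = io.StringIO(), io.StringIO()
        with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
            try:
                exec(program, self.namespace)
            except Exception:
                traceback.print_exc()  # errors become feedback, not crashes
        return out.getvalue(), err.getvalue()


repl = PersistentRepl()
repl.step("def double(x):\n    return 2 * x")   # helper defined in step 1
stdout, _ = repl.step("print(double(21))")       # helper reused in step 2
_, stderr = repl.step("print(undefined_name)")   # failure is reported as text
print(stdout.strip(), "NameError" in stderr)
```

Because the namespace outlives each call, the helper defined in the first step is still available in the second, and an exception in the third step reaches the agent as text rather than ending the episode.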

The agent program runs in a Python client that communicates synchronously over TCP to a Lua server embedded in the Factorio game itself, using the RCON protocol that Factorio normally exposes for multiplayer admin tools. The round trip latency is low enough that the system averages 218 operations per second on standard hardware, with the most expensive operations (pathfinding and large entity scans) running at 25 to 48 operations per second.

### Action and observation API

FLE exposes a typed object model of Factorio entities and a set of 23 core methods divided into three groups.

| Category | Representative methods | Purpose |
| --- | --- | --- |
| Pure queries | get_entities, inspect_inventory, get_research_progress, get_resource_patch, get_prototype_recipe | Read game state without modifying it |
| State modifications | place_entity, place_entity_next_to, pickup_entity, rotate_entity, connect_entities, set_entity_recipe | Build and edit factory layout |
| Resource management | insert_item, extract_item, harvest_resource, craft_item, set_research | Move items, perform crafts, drive research |

Returns are strongly typed, so an agent can call `nearest(Resource.IronOre)` and receive a position object with x and y fields, or call `get_entities(Furnace)` and iterate over a list of typed furnace records with attributes like position, status, output inventory, and burner inventory. This typing turns the environment into something closer to a domain specific language than a raw game API and gives the model strong hooks for compositional programming.
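To show what typed returns buy the agent, the sketch below mocks the two calls mentioned above against a small in-memory world. Everything in it is a stand-in written for this example: the class fields, the status strings, and the world data are assumptions, and the real FLE objects carry more attributes.

```python
from dataclasses import dataclass, field
from enum import Enum


class Resource(Enum):
    IronOre = "iron-ore"
    Coal = "coal"


@dataclass(frozen=True)
class Position:
    x: float
    y: float


@dataclass
class Furnace:
    position: Position
    status: str                                   # placeholder status strings
    output_inventory: dict = field(default_factory=dict)


# Hypothetical world state standing in for the live game.
PATCHES = {
    Resource.IronOre: [Position(10, 4), Position(-3, 2)],
    Resource.Coal: [Position(6, -8)],
}
ENTITIES = [
    Furnace(Position(0, 1), "working", {"iron-plate": 12}),
    Furnace(Position(2, 1), "no_fuel"),
]


def nearest(resource: Resource, origin: Position = Position(0, 0)) -> Position:
    """Return the closest known patch of the given resource."""
    return min(
        PATCHES[resource],
        key=lambda p: (p.x - origin.x) ** 2 + (p.y - origin.y) ** 2,
    )


def get_entities(kind: type) -> list:
    """Return every tracked entity that is an instance of `kind`."""
    return [e for e in ENTITIES if isinstance(e, kind)]


ore = nearest(Resource.IronOre)
stalled = [f.position for f in get_entities(Furnace) if f.status != "working"]
print(ore, stalled)
```

With typed records, the filter over furnaces is an ordinary attribute check rather than string parsing of a game log.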

### Memory and long-context handling

Long-horizon episodes generate enormous logs. FLE addresses this with a hierarchical memory scheme: at every step, the most recent 32 observations remain verbatim in the context window, while older observations are summarized into 1,024 token reports. This keeps the prompt tractable even after thousands of steps, which would otherwise exhaust a model's context window. The summarization is itself performed by an LLM, which becomes a subtle hyperparameter of the evaluation.
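The scheme can be sketched as a function that keeps the most recent observations verbatim and hands everything older to a summarizer. The function names are hypothetical, and the summarizer here is a placeholder that only records how much it compressed. A real one would call an LLM and enforce the 1,024-token cap.

```python
from collections.abc import Callable

RECENT_WINDOW = 32  # observations kept verbatim


def build_context(observations: list[str],
                  summarize: Callable[[list[str]], str],
                  recent_window: int = RECENT_WINDOW) -> list[str]:
    """Return [summary of older observations] + the most recent ones verbatim."""
    older = observations[:-recent_window]
    recent = observations[-recent_window:]
    if not older:
        return list(recent)
    return [summarize(older)] + list(recent)


def placeholder_summarize(older: list[str]) -> str:
    """Stand-in for the LLM summarizer; a real one would cap output length."""
    return f"[summary of {len(older)} earlier observations]"


history = [f"obs {i}" for i in range(100)]
context = build_context(history, placeholder_summarize)
print(len(context), context[0])
```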

### Running the environment

The v0.3 release packages the system for easy use. Installation is via PyPI:

```bash
pip install factorio-learning-environment
# Optional extras for evaluation, MCP, or PostgreSQL logging
pip install "factorio-learning-environment[eval,mcp,psql]"
```

The CLI then provides commands to start a Docker based cluster of Factorio servers and run an evaluation sweep:

```bash
fle cluster start
fle eval --config configs/gym_run_config.json
```

The headless renderer in v0.3 lets agents run without the official Factorio game client, making large parallel sweeps cheap on cloud hardware. Headless mode also exposes a pixel observation channel for multimodal experiments, although the headline benchmark remains text and code only.

### Gym and MCP interfaces

v0.3 added an OpenAI Gym style interface, so FLE can be plugged into off the shelf RL pipelines that use step, reset, and reward semantics. The environment also ships with a Model Context Protocol server, which allows [Claude Code](/wiki/claude_code), other MCP clients, and IDE-style agent harnesses to drive the environment without writing custom glue code. This bridge was the headline demonstration of v0.3.0: a livestream that showed Claude Code building factories interactively over many hours of play.

## Evaluated models and headline results

The original paper evaluated six frontier and open weight models in March 2025. Subsequent updates added results for newer systems, including models from the [GPT-5](/wiki/gpt_5), [Claude Opus 4.1](/wiki/claude_opus_4_1), [Gemini 2.5](/wiki/gemini_2_5_pro), and [Grok 4](/wiki/grok_4) families. The v1 leaderboard remains the most widely cited result because it offers a clean comparison across a single moment in time.

### Lab-play success rate, v1 paper

| Model | Lab-play success rate | Tasks solved (of 24) |
| --- | --- | --- |
| Claude 3.5 Sonnet | 21.9 ± 1.3% | 7 |
| GPT-4o | 16.6 ± 1.4% | 5 to 6 |
| DeepSeek v3 | 15.1 ± 1.7% | 4 to 5 |
| Gemini 2 Flash | 13.0 ± 1.3% | 4 |
| Llama 3.3 70B | 6.3 ± 1.0% | 2 |
| GPT-4o mini | 5.2 ± 0.6% | 1 to 2 |

Claude 3.5 Sonnet was the clear leader. It was the only model to consistently complete intermediate electronics tasks like green circuits, and it occasionally automated steel plate production, a task that requires coordinating fuel, ore, and a second smelting stage. Even so, no v1 era model came close to automating any of the late game lab-play targets in the 50 to 100 machine range.

### Open-play Production Score and Milestones, v1 paper

| Model | Open-play Production Score (median) | Milestones reached |
| --- | --- | --- |
| Claude 3.5 Sonnet | 293,206 | 28 |
| GPT-4o | mid five figures | mid twenties |
| DeepSeek v3 | lower five figures | low twenties |
| Gemini 2 Flash | lower five figures | low twenties |
| Llama 3.3 70B | 54,998 | 26 |
| GPT-4o mini | low four figures | mid teens |

Claude 3.5 Sonnet's most notable open-play accomplishment was discovering that investing science packs into the electric mining drill technology unlocks a much higher steady state throughput than any amount of manual ore harvesting can match. Llama 3.3 70B, despite a much lower lab-play score, posted a strong open-play milestone count by chaining many simple actions over its 5,000 step budget, an early hint that exploration depth and lab-play planning skill are not perfectly correlated.

### Updates since v0.3.0

With v0.3.0 the authors re-ran the benchmark on a broader set of frontier models and reported qualitative results that show the ordering Claude > GPT > Gemini > Grok in lab-play, with absolute scores improving across the board. They note that the latest generation of open weight models has now caught up with the previously state of the art Claude 3.5 Sonnet number, while [Grok 4](/wiki/grok_4) tends to enter degenerate debug loops where it repeats the same failing action many times, and [GPT-5](/wiki/gpt_5) recovers more gracefully than its predecessors. Claude Opus 4.1, the strongest model in the v0.3 evaluation, displayed an error rate of around 23% on lab-play with essentially zero syntactic errors, meaning its failures came from pragmatic mistakes in how it used the environment rather than from malformed code.

The v0.3 evaluations also revealed a recurring pattern: even when frontier models can build simple structures, they tend to fall back on semi-manual strategies, hand-feeding furnaces or directly crafting items rather than designing true automation pipelines that scale.

## How agents actually play Factorio

The FLE paper devotes considerable space to qualitative analysis. Watching transcripts of a successful Claude 3.5 Sonnet run, the authors identify a recurring sequence of behaviors that begins with manual gathering, transitions into ad hoc crafting, and only later attempts to set up belts and assemblers. Even strong models rarely plan an entire factory before starting; instead they iteratively bolt new subsections onto the side of an existing layout, which mirrors how many human players begin the game.

### Strengths observed

- **Compositional code reuse**, where an agent writes helper functions early in a run and calls them across many later steps to place machines or run inventory checks.
- **Recipe planning**, where the model correctly decomposes a target item into prerequisites, even when the chain runs five or six layers deep.
- **Research prioritization**, where Claude 3.5 Sonnet explicitly trades early throughput for unlocking the electric mining drill, and benefits compoundingly from doing so.
- **Local error recovery**, where a placement collision triggers a query for occupied tiles followed by an offset and a retry.
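The local error recovery pattern in the last item can be sketched against a toy grid. The function and the set of occupied tiles are hypothetical and do not use the FLE API. The sketch only shows the shape of the behavior: detect the collision, change the target, and retry a bounded number of times.

```python
def place_with_offset_retry(target, occupied, max_attempts=5):
    """Try to claim a tile; on collision, shift one tile east and retry."""
    x, y = target
    for _ in range(max_attempts):
        if (x, y) not in occupied:
            occupied.add((x, y))
            return (x, y)
        x += 1  # vary the parameters instead of repeating the same call
    return None  # give up after a bounded number of attempts


occupied_tiles = {(0, 0), (1, 0)}
placed_at = place_with_offset_retry((0, 0), occupied_tiles)
print(placed_at)
```

Changing the target on each attempt and capping the number of attempts is what separates this pattern from repeating an identical failing call.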

### Failure modes

- **Spatial misjudgment**, where the model misestimates the distance between a drill and a furnace and produces a belt that does not actually connect them.
- **Throughput mismatch**, where it builds a single assembler downstream of a fully saturated belt and never notices the bottleneck.
- **Debug loops**, where it repeats a failing operation many times without varying parameters; the v0.3 evaluation reports this as the dominant failure mode for [Grok 4](/wiki/grok_4).
- **Strategic shortsightedness**, where it scales the easiest substep to absurd lengths instead of moving on to the next tier.
- **API misunderstandings**, where the model believes a method exists or behaves differently from its actual signature, a problem present in all tested models.
- **Forgetting context**, where after a memory summarization step the model loses track of where it placed a key machine and starts a duplicate.

The authors interpret these patterns as evidence that the bottleneck is not knowledge of the game (most models have absorbed Factorio wiki content during pretraining) but the ability to translate that knowledge into stable plans, executable code, and bug-free spatial layouts.

## Comparison with other agentic benchmarks

FLE is part of a wider movement to evaluate language models in interactive environments rather than on static datasets. Each comparable benchmark stresses a different axis of capability.

| Benchmark | Environment | Focus | Time horizon | Closest analogue to FLE |
| --- | --- | --- | --- | --- |
| [BALROG](/wiki/balrog) | Six games including NetHack, BabyAI, Crafter | Multi-game agentic reasoning across small environments | Hundreds of steps | Closest in spirit, but no single game scales to FLE's complexity |
| [Voyager](/wiki/voyager) | [Minecraft](/wiki/minecraft) via Mineflayer | Open-ended skill library acquisition with GPT-4 | Hours of in game time | Shares the open-ended ethos; uses code generation but no formal score |
| [MineDojo](/wiki/minedojo) | Minecraft via internet-scale data | Multi-task internet-grounded learning | Variable | Game-based and open world, but less industrial |
| [SWE-bench](/wiki/swe_bench) | GitHub issues in real Python repositories | Coding patch generation against test suites | Single PR per task | Both stress program synthesis, but SWE-bench has fixed solutions |
| [MLE-bench](/wiki/mle_bench) | Kaggle competitions | End to end machine learning pipelines | Hours per task | Both reward building working systems under unbounded score caps |
| [AgentBench](/wiki/agentbench) | Eight code, web, and game environments | Broad coverage of agentic tasks | Tens to hundreds of steps | Broader but shallower than FLE |
| [Crafter](/wiki/crafter) | 2D survival game | Achievement based exploration | Short | A toy ancestor of FLE in style |
| [NetHack Learning Environment](/wiki/nethack_learning_environment) | The roguelike NetHack | Procedural exploration and survival | Thousands of steps | Shares the procedurally generated map and unbounded difficulty |
| [ALFWorld](/wiki/alfworld) | Text and visual household tasks | Embodied instruction following | Tens of steps | Tests grounded planning but at a much smaller scale |

FLE differentiates itself on three axes. First, its action space is Python code over a typed entity API rather than a finite discrete action list or natural language commands, which puts unusual pressure on program synthesis. Second, its score scales over six orders of magnitude, a far wider range than the benchmarks compared above. Third, it requires sustained competence over thousands of decisions, where most existing benchmarks max out at a few hundred steps.

In the framing of the paper, FLE is closer to a flight simulator for [agentic AI](/wiki/agentic_ai) research than to a multiple choice test. Models with similar scores on MMLU or HumanEval can produce wildly different factories, and the FLE leaderboard often aligns more closely with practical benchmarks like GDPval that estimate economic productivity than with academic exam suites.

## Limitations

The authors are forthright about FLE's limitations. The first is **incomplete entity coverage**. Mid and late game elements such as trains, logistics robots, and programmable circuit networks are partially modeled, which restricts the very late tiers of factory design. The second is the **absence of a human baseline**. Although expert players routinely build factories thousands of times more productive than the best LLM, none have attempted to do so through the Python REPL with a strict step budget, so it is unclear what a top human would actually score in the same conditions.

A third concern is **reward hacking**. Because Production Score depends on cumulative production minus consumption, an agent that exploits a degenerate production loop could in principle inflate its score without building a useful factory. The authors report that no current model exploits this, but they flag it as a risk that may emerge in stronger systems.

A fourth limitation is that FLE is **single agent only** in its current form, even though Factorio has well developed multiplayer support. The authors list cooperative and competitive multiplayer scenarios as natural directions for future work. Finally, the benchmark imposes a particular **memory and summarization scheme** that interacts with model performance, which complicates direct comparisons across different model families with different context windows.

More subtly, FLE inherits idiosyncrasies from Factorio itself, including biters, surface mechanics, and recipe quirks that may not generalize to other automation problems. The system is therefore best understood as one slice of agentic capability rather than a universal yardstick.

## Implications for AI research

FLE has had visible influence on how the field thinks about agent evaluation. The benchmark shows that even models that excel at coding interview problems and pass advanced math exams collapse when asked to maintain a coherent industrial plan across thousands of API calls. That signal has fed into several active research threads.

- **Hierarchical planning** for LLM agents, where models propose long-horizon outlines before generating detailed code, has shown improved FLE performance in subsequent work.
- **Tool augmented memory**, including retrieval over past program transcripts, mitigates some of the context loss observed in the original paper.
- **Reinforcement fine tuning** is being explored as a way to teach models the specific calling conventions and spatial heuristics of the Factorio API.
- **Process supervision**, where reward models score intermediate reasoning steps rather than only final outcomes, fits naturally into FLE's step level feedback loop.
- **Multimodal extensions** that pair the Python API with the v0.3 pixel renderer offer a path to test whether vision conditioning improves spatial layout.

FLE has also attracted attention outside academia. Tech press coverage in The Decoder, Gigazine, and AI newsletters has framed it as a sober antidote to the cycle of benchmark saturation, and the project's Discord and YouTube tutorials have built a small community of practitioners experimenting with custom agents. Public livestreams of [Claude Code](/wiki/claude_code) running FLE drew large audiences, and several research labs now use the environment for internal model regression testing.

### Connection to evaluation of long-horizon agents

The themes that FLE highlights (sustained coherence, spatial reasoning, and debugging across long sessions) are increasingly central to discussions of [AI agents](/wiki/ai_agent) more broadly. Benchmarks aimed at end to end software engineering, like SWE-bench and [SWE-Lancer](/wiki/swe_lancer), or at scientific workflows, like [MLE-bench](/wiki/mle_bench), share much of FLE's emphasis on translating high level goals into executable code over many steps. FLE is unusual in offering an environment where the goal itself is open ended and the metric is continuous, which lets researchers study how agents allocate effort without external instruction.

## Installation and quick start

The following snippet shows a minimal lab-play run against a running Factorio cluster.

```bash
# Install the package
pip install factorio-learning-environment

# Start a local cluster of Factorio servers via Docker
fle cluster start

# Run an evaluation sweep with a configured agent and task list
fle eval --config configs/gym_run_config.json
```

Writing a custom agent is a matter of subclassing the supplied agent interface and emitting Python programs in response to observations. A simplified loop looks like the following.

```python
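from fle import FactorioEnvironment


class ScriptedAgent:
    """Placeholder agent; a real agent would query a language model here."""

    def respond(self, obs) -> str:
        # Return the next Python program as a string, based on the observation.
        return "print(inspect_inventory())"


my_agent = ScriptedAgent()

env = FactorioEnvironment(mode="lab-play", task="iron_plate")
obs = env.reset()

while not obs.done:
    # The agent writes a Python program to execute against the live game.
    program = my_agent.respond(obs)
    obs = env.step(program)

print("Final Production Score:", obs.production_score)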
```

The full repository contains example agents that wrap [OpenAI](/wiki/openai), [Anthropic](/wiki/anthropic), and open weight model APIs, as well as tracing utilities that produce HTML transcripts of every API call and observation, useful for both debugging and qualitative analysis.

## Roadmap and future work

The FLE roadmap as of 2026 calls out several directions.

| Direction | Status | Notes |
| --- | --- | --- |
| Expanded entity coverage | Ongoing | Adding trains, logistics robots, programmable circuits |
| Multiplayer and cooperative settings | Planned | Letting multiple agents share a factory |
| Adversarial and biter combat | Planned | Defensive engineering against the game's hostile creatures |
| Pixel and multimodal observations | Available since v0.3 | Used for early experiments with vision models |
| MCP and IDE agent integrations | Available since v0.3 | Enables Claude Code, Cursor, and other clients to run FLE directly |
| Reinforcement learning baselines | Research phase | Compares specialized policies against language model agents |
| Human expert baselines | Open question | The authors invite the community to run controlled human trials |
| Curriculum learning | Open | Sequencing lab-play tasks for training rather than evaluation |

The authors emphasize that FLE is meant to be a moving target. They expect the lab-play leaderboard to be largely saturated by the most capable models within a few years, at which point the focus is likely to shift to longer open-play horizons, multi-agent coordination, and integration with the game's combat systems.

## Reception

FLE was published at NeurIPS 2025 in the Datasets and Benchmarks Track and has been broadly well received in the [machine learning](/wiki/machine_learning) community. Researchers have praised its non-saturating design, its emphasis on long horizons, and its careful API design. Some have noted that the choice of Python as an action space favors models with strong code generation capabilities, which may understate the abilities of systems that excel at natural language or visual reasoning. Others have pointed out that the lab-play setup, with pre-unlocked technologies and fixed starting resources, simplifies away some of the most interesting strategic questions that open-play surfaces.

The game's developer and publisher, Wube Software, has not officially endorsed FLE, but the project's compatibility with Factorio 2.0 and its respect for the game's terms of service have allowed it to coexist with the official game community. Discussion threads on the official Factorio forums and on r/factorio have responded positively, with some experienced players proposing additional lab-play targets and others volunteering to run human baseline trials.

In the broader [AI safety](/wiki/ai_safety) and capability evaluation discussion, FLE has been cited as a useful counterweight to coding benchmarks like [SWE-bench](/wiki/swe_bench) because it isolates planning and execution rather than rewarding pure code synthesis ability. Several frontier labs have publicly reported running internal versions of FLE as part of their model release evaluations.

## Significance

FLE captures a moment in AI evaluation when the field is moving away from saturated multiple-choice tests toward open-ended, environment-grounded benchmarks. By embedding agents in a game that humans have spent thousands of hours optimizing, it provides a window into capabilities that matter for real-world automation: planning over long horizons, programming against typed APIs, recovering from spatial mistakes, and choosing what to build next when no one is supervising the choice.

Its headline finding, that even the strongest publicly available models fully automate only a small fraction of lab-play tasks and reach only the early game in open-play, is now widely cited as evidence that current LLMs have a planning ceiling well below human industrial competence. At the same time, the rapid progress between the results in the original paper (v1) and those reported with the v0.3 software release shows that the gap is closing fast. FLE will likely remain a useful instrument for charting that progress for several more years, both as a research benchmark and as a public dashboard for tracking how language model agents handle a hard, scalable industrial problem.

## See also

- [Factorio](/wiki/factorio)
- [Voyager](/wiki/voyager)
- [BALROG](/wiki/balrog)
- [MLE-bench](/wiki/mle_bench)
- [SWE-bench](/wiki/swe_bench)
- [NetHack Learning Environment](/wiki/nethack_learning_environment)
- [Crafter](/wiki/crafter)
- [ALFWorld](/wiki/alfworld)
- [MineDojo](/wiki/minedojo)
- [AgentBench](/wiki/agentbench)
- [AI agent](/wiki/ai_agent)
- [Large language model](/wiki/large_language_model)
- [AI benchmark](/wiki/ai_benchmark)
- [Long horizon planning](/wiki/long_horizon_planning)
- [Program synthesis](/wiki/program_synthesis)
- [Spatial reasoning](/wiki/spatial_reasoning)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [Claude Code](/wiki/claude_code)
- [Model Context Protocol](/wiki/model_context_protocol)

## References

- Hopkins, Jack; Bakler, Mart; Khan, Akbir. *Factorio Learning Environment*. arXiv:2503.09617, March 2025. https://arxiv.org/abs/2503.09617
- Hopkins, Jack; Bakler, Mart; Khan, Akbir. *Factorio Learning Environment*. NeurIPS 2025 Datasets and Benchmarks Track, OpenReview. https://openreview.net/forum?id=652Q6jBFMZ
- FLE source code repository. https://github.com/JackHopkins/factorio-learning-environment
- Official FLE website and documentation. https://jackhopkins.github.io/factorio-learning-environment/
- FLE leaderboard. https://jackhopkins.github.io/factorio-learning-environment/leaderboard/
- FLE v0.3.0 release notes. https://jackhopkins.github.io/factorio-learning-environment/versions/0.3.0.html
- Epoch AI benchmark page for FLE. https://epoch.ai/benchmarks/factorio-learning-environment
- Hopkins, Jack. *Lecture 85: Factorio Learning Environment*. GPU MODE, 2025. https://www.youtube.com/watch?v=iXvYa2oIMbA
- The Decoder. *Factorio joins growing list of video games doubling as AI benchmarking tools*. 2025. https://the-decoder.com/factorio-joins-growing-list-of-video-games-doubling-as-ai-benchmarking-tools/
- Gigazine. *Factorio Learning Environment (FLE) is now available, a learning environment that evaluates the performance of AI models*. March 2025. https://gigazine.net/gsc_news/en/20250313-factorio-learning-environment/
- Wang, Guanzhi et al. *Voyager: An Open-Ended Embodied Agent with Large Language Models*. arXiv:2305.16291, 2023. https://arxiv.org/abs/2305.16291
- Paglieri, Davide et al. *BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games*. arXiv:2411.13543, 2024. https://arxiv.org/abs/2411.13543
- Chan, Jun Shern et al. *MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering*. arXiv:2410.07095, 2024. https://arxiv.org/abs/2410.07095
- Jimenez, Carlos E. et al. *SWE-bench: Can Language Models Resolve Real-World GitHub Issues?* arXiv:2310.06770, 2023. https://arxiv.org/abs/2310.06770
- Hacker News. *Show HN: FLE v0.3 - Claude Code Plays Factorio*. https://news.ycombinator.com/item?id=45466865
- Factorio Forums discussion thread. https://forums.factorio.com/viewtopic.php?t=127390

