Factorio Learning Environment

Updated Jul 16, 2026

Factorio Learning Environment
Overview
Full name	Factorio Learning Environment
Abbreviation	FLE
Description	An open-ended evaluation framework that uses the industrial automation game Factorio to test long-horizon planning, spatial reasoning, program synthesis, and resource optimization in large language model agents
First arXiv release	2025-03-12 (paper 2503.09617)
Conference	NeurIPS 2025, Datasets and Benchmarks Track
Latest version	v0.4.x (2026)
Authors	Jack Hopkins, Mart Bakler, Akbir Khan
Lead affiliations	Anthropic, University College London
Technical Details
Type	Long-horizon agent benchmark, program synthesis, resource optimization
Modality	Code (Python REPL), text observations, optional pixel renderer
Task format	Lab-play production challenges and unbounded open-play factory building
Number of tasks	24 lab-play target entities plus open-play
Total examples	Procedurally generated; unbounded in open-play
Evaluation metrics	Production Score (PS), Milestones, lab-play task success rate
Domains	Industrial automation, logistics, spatial layout, research progression
Languages	English prompts, Python action space
Performance
Best published lab-play (v1 paper)	Claude 3.5 Sonnet, 21.9% (7 of 24 tasks fully automated)
Best published open-play PS (v1)	Claude 3.5 Sonnet, 293,206
Best published milestones (v1)	Claude 3.5 Sonnet, 28 milestones
Human ceiling	Expert players reach factories processing millions of items per second
Saturated	No, intentionally non-saturating
Resources
Website	Official site
Paper	arXiv 2503.09617
GitHub	JackHopkins/factorio-learning-environment
Leaderboard	FLE leaderboard
Epoch AI page	Epoch benchmark entry
License	MIT for code, CC BY 4.0 for the paper

The Factorio Learning Environment (FLE) is an open source benchmark and research framework that uses the industrial automation game Factorio to evaluate the long-horizon agentic capabilities of large language model systems.^[1] It was introduced in March 2025 by Jack Hopkins, Mart Bakler, and Akbir Khan in the paper Factorio Learning Environment (arXiv:2503.09617),^[1] and it was later accepted to the NeurIPS 2025 Datasets and Benchmarks Track.^[2] FLE is designed as a deliberately non-saturating environment: rather than asking a model to pick the right multiple choice answer or close a single ticket, it asks an agent to grow a working factory whose throughput can scale across roughly six orders of magnitude, from a handful of items per minute to the millions of items per second that experienced human players routinely achieve in late game play.^[1]

The environment is built around a Python read eval print loop, or REPL, in which an agent observes the game state, writes Python programs that call a typed Factorio API, executes them against a running game server, and reads back structured feedback. It supplies two evaluation modes: a structured lab-play suite of 24 production tasks with fixed starting resources, and an open-play mode where a single agent is given a procedurally generated map and the instruction to build the largest factory it can.^[1] The first release reported that even the strongest frontier model of that era, Claude 3.5 Sonnet, only fully automated 7 of the 24 lab-play tasks and reached a Production Score of 293,206 in open-play, while smaller and weaker models collapsed at much earlier rungs of the technology tree.^[1]

Background and motivation

Many classic AI benchmarks have collapsed under the pace of progress in foundation models. MMLU, HumanEval, GSM8K, and other multiple choice or short answer suites are now scored above 90% by several models, leaving little headroom to differentiate frontier systems. The authors of FLE argue that this saturation hides deep gaps in capabilities that matter for real world deployment: long-horizon planning over thousands of decisions, spatial reasoning over a 2D grid, error diagnosis under partial information, and the ability to compose simple primitives into reusable abstractions.^[1]

Factorio is a particularly natural fit for these questions for several reasons. The game has a fully deterministic engine, an unambiguous notion of progress measured in items produced per minute, and a technology tree that grows exponentially in complexity. A starting factory needs only two machines to mine iron ore, but a fully optimized end game base for a single unit of utility science requires coordinating close to a hundred machines across many subassemblies.^[1] The same skill set (breaking a goal into subgoals, laying out machines on a grid, debugging bottlenecks) applies at every level. That makes Factorio an environment where the gap between novice and expert behavior is enormous, where the score climbs in clear steps on a logarithmic scale, and where the same engine can challenge agents far above the capabilities of today's models.

The authors note that this is also why Factorio became a cult object among engineering teams. The game encodes industrial planning problems that resemble logistics, manufacturing operations research, and distributed systems engineering. FLE therefore positions itself not only as an AI benchmark but also as a sandbox where researchers can study how language model agents learn to plan, refactor code, and recover from errors over horizons measured in hours of wall clock time.^[1]

Why a non-saturating benchmark matters

A persistent concern in the AI evaluation community is that successive model generations rapidly close the gap between random performance and the human ceiling, after which differences between models stop being legible. FLE takes a different approach. The Production Score spans several orders of magnitude and is most naturally read on a logarithmic scale, where each new tier of automation adds a comparable interval on the score axis.^[1] Going from manual ore mining to electric drills moves a model by one band, while moving from green circuits to red circuits to blue circuits adds two more. Because there is no natural completion state, even a perfect contemporary model can be beaten by a future system that simply scales further.^[1]

Authors and provenance

The FLE paper credits three authors. Jack Hopkins and Mart Bakler are listed as joint first authors, with Akbir Khan as the third author. Hopkins works at Anthropic, and Khan is affiliated with University College London.^[1] The project was developed primarily in 2024 and early 2025 and released as an open source repository at github.com/JackHopkins/factorio-learning-environment under an MIT style license, with the paper distributed under CC BY 4.0.^[1]^[3]

The code base consists of roughly eighty thousand lines, including a Python client library, a Lua mod that exposes Factorio internals to that client, an evaluation harness with logging via Weights and Biases, and CLI tooling for spinning up Docker based clusters of Factorio servers.^[1]^[3] The repository continues to be actively developed: v0.3.0 in late 2025 added a headless renderer, an OpenAI Gym compatible interface, and an adapter for Claude Code,^[6] and the v0.4 series in 2026 extended these with additional tasks and Model Context Protocol integration.^[3]

Environment design

Core game mechanics

FLE is built directly on Factorio, an industrial automation game by the Czech studio Wube Software. The game's core loop forces the player to gather raw materials, build machines that craft intermediate products, and string those machines together with belts, pipes, and electrical infrastructure. The environment exposes these elements to an agent through a typed Python interface.^[1]

Mechanic	Description	Why it matters for agents
Resource extraction	Mining ore, harvesting trees, pumping crude oil	Tests perception of resource patches and decisions about siting
Crafting	Combining inputs in furnaces, assemblers, chemical plants	Forces recipe planning over a directed graph of ingredients
Belts and inserters	Moving items between machines	Requires precise spatial layout and throughput matching
Pipes and fluids	Routing liquids between refineries and plants	Adds a second logistics network with different rules
Power	Burner, steam, solar, and nuclear generation	Introduces constraints that interact with every other system
Research	Unlocking technologies by feeding science packs into labs	Rewards strategic investment of throughput into research
Biters	Hostile insect-like creatures (optional in FLE)	Adds defensive and military objectives in later expansions

A factory in Factorio is essentially a dataflow graph laid out on a 2D grid, with each node consuming inputs at a fixed rate and producing outputs at another fixed rate. The agent's job is to assemble that graph from primitive building blocks while keeping its surface area, latency, and bottlenecks in check.
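
The dataflow view above can be illustrated with a short sketch. It assumes a purely linear chain with one-to-one item conversion, and every rate and machine count is invented for illustration rather than taken from Factorio's recipe data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One node of the factory graph (all numbers here are hypothetical)."""
    name: str
    machines: int
    items_per_minute_each: float  # output rate of a single machine

    @property
    def capacity(self) -> float:
        # Total output the stage can sustain if it is never starved.
        return self.machines * self.items_per_minute_each


def bottleneck(chain: list[Stage]) -> Stage:
    """Return the stage that limits a linear chain's throughput."""
    return min(chain, key=lambda stage: stage.capacity)


chain = [
    Stage("drill", machines=4, items_per_minute_each=30.0),
    Stage("furnace", machines=2, items_per_minute_each=18.0),
    Stage("assembler", machines=1, items_per_minute_each=60.0),
]
limit = bottleneck(chain)
print(f"{limit.name} caps the line at {limit.capacity:.0f} items/min")
```

In this toy chain the furnaces are the limiting stage, so adding drills or assemblers would not raise output until more furnaces are placed.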

Lab-play

The lab-play mode is the primary structured benchmark. It consists of 24 production tasks, each pinned to a specific target item from the Factorio technology tree. Each task starts the agent on a small map with a fixed inventory and the relevant prerequisite technologies already researched, so the only challenge is to build a production line that achieves a target throughput within a 60 second in game holdout window. Success thresholds are 16 items per minute for solid items and 250 units per minute for fluids, with a budget of 128 API calls per task (the v0.3 release reduced this to 64 step trajectories with early stopping).^[1]^[6]
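
As a rough illustration of the pass criterion, the sketch below converts a count measured during the holdout window into a per-minute rate and compares it with the thresholds quoted above. The function name is hypothetical, and treating the threshold as inclusive is an assumption.

```python
SOLID_THRESHOLD_PER_MIN = 16    # solids, as quoted in the article
FLUID_THRESHOLD_PER_MIN = 250   # fluids, as quoted in the article
HOLDOUT_SECONDS = 60


def passes_lab_play(produced_in_holdout: float, is_fluid: bool,
                    holdout_seconds: float = HOLDOUT_SECONDS) -> bool:
    """Compare a holdout-window count against the per-minute threshold."""
    per_minute = produced_in_holdout * 60.0 / holdout_seconds
    threshold = FLUID_THRESHOLD_PER_MIN if is_fluid else SOLID_THRESHOLD_PER_MIN
    return per_minute >= threshold  # inclusive comparison is an assumption


print(passes_lab_play(20, is_fluid=False))   # 20 plates in the window
print(passes_lab_play(180, is_fluid=True))   # 180 units of a fluid
```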

The lab-play target entities span the full game progression. Although the paper does not enumerate all 24 in a single list, the tasks cover the following categories drawn from its descriptions:

Tier	Example target entities	Approximate machine count
Tier 1, raw extraction	Iron ore, copper ore, coal, stone	1 to 2
Tier 2, basic smelting	Iron plate, copper plate, stone brick	2 to 6
Tier 3, intermediate parts	Iron gear wheel, copper cable, automation science pack	6 to 12
Tier 4, electronics	Electronic circuit (green chip), advanced circuit (red chip)	10 to 25
Tier 5, chemicals and fluids	Plastic bar, sulfur, sulfuric acid, lubricant, batteries	15 to 40
Tier 6, mechanical assemblies	Engine unit, electric engine, steel plate	20 to 50
Tier 7, late game	Military science pack, utility science pack, processing unit (blue chip)	50 to 100

Lab-play is the closest thing FLE has to a traditional benchmark because each task has a clean pass or fail outcome. Aggregating across all 24 tasks produces the lab-play success rate that drives the published leaderboard.^[5]

Open-play

The open-play mode is more open ended. The agent receives a procedurally generated map, a single instruction to build the largest possible factory, and a budget of up to 5,000 environment steps. Each model is evaluated across eight independent runs, and median Production Score and Milestones are reported.^[1] Because there is no fixed target item and no time limit beyond the step budget, open-play rewards agents that can set their own subgoals, scale infrastructure proactively, and recover from mistakes without external guidance.
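
The aggregation over eight runs can be sketched as follows. The run values are invented for illustration; only the use of the median over eight independent runs comes from the description above.

```python
from statistics import median

# Eight hypothetical open-play runs for one model (invented numbers).
run_scores = [41_000, 52_500, 38_200, 97_300, 45_100, 60_800, 49_900, 55_400]
run_milestones = [18, 21, 17, 24, 19, 22, 20, 21]

median_ps = median(run_scores)
median_milestones = median(run_milestones)

print("median Production Score:", median_ps)
print("median milestones:", median_milestones)
```

A median is less sensitive than a mean to a single unusually good or bad run, such as the 97,300 outlier in this made-up sample.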

Open-play exposes capabilities that lab-play deliberately suppresses. In lab-play the relevant technologies are pre-unlocked, so the model never has to decide whether to invest throughput into research. In open-play it has to choose between building more iron furnaces today and pushing for electric drilling tomorrow, an explicitly long-horizon trade off.

Evaluation metrics

FLE reports three families of numbers, each designed to capture a different facet of agent behavior.

Production Score

The Production Score (PS) is a continuous measure of economic activity. For each item in the game, the system assigns a value V(i) derived from the recipe's complexity, ingredient depth, and energy cost. The Production Score at time t is then a weighted sum:

PS(t) = sum over items i of V(i) * (P_i(t) - C_i(t))

where P_i(t) is the cumulative quantity of item i produced up to time t and C_i(t) is the cumulative quantity consumed. Raw items such as iron ore have V near 3, while a single processing unit (blue chip) is valued in the thousands.^[1] Because the same factory can extend its score by simply running longer or producing more advanced items, Production Score varies across orders of magnitude and never saturates.
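
A worked instance of the formula may help. In this sketch the item values are placeholders (only the value of roughly 3 for iron ore is taken from the text), and the helper name is hypothetical.

```python
def production_score(values, produced, consumed):
    """PS = sum over items of V(i) * (P_i - C_i), using cumulative counts."""
    return sum(
        value * (produced.get(item, 0) - consumed.get(item, 0))
        for item, value in values.items()
    )


# Placeholder item values; only iron ore's value of about 3 is from the text.
values = {"iron-ore": 3.0, "iron-plate": 8.0, "iron-gear-wheel": 20.0}
produced = {"iron-ore": 100, "iron-plate": 80, "iron-gear-wheel": 10}
consumed = {"iron-ore": 80, "iron-plate": 20}

score = production_score(values, produced, consumed)
print(score)  # 3*20 + 8*60 + 20*10
```

Ore that is smelted into plates is subtracted again as consumption, so an agent is credited for the net output it leaves behind at each level of the chain.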

Milestones

A Milestone is hit the first time an agent successfully produces a particular item or unlocks a particular technology. The paper defines a fixed list of milestones tied to the major tiers of the Factorio progression, from gathering wood through researching electric energy distribution and into late game logistics.^[1] Where Production Score answers the question "how much economic activity did the agent generate," milestones answer "how broad was the technology tree it covered."
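
The first-time-only nature of milestones can be sketched like this. The event stream and the milestone list are made up for illustration; the paper's actual list is not reproduced here.

```python
def reached_milestones(events, milestone_list):
    """Return milestones in the order first hit; repeats add nothing."""
    reached = []
    for item in events:
        if item in milestone_list and item not in reached:
            reached.append(item)
    return reached


# Hypothetical milestone list and production events.
milestone_list = {"wood", "iron-ore", "iron-plate", "electric-mining-drill"}
events = ["wood", "iron-ore", "iron-ore", "iron-plate", "stone-furnace", "iron-plate"]

hit = reached_milestones(events, milestone_list)
print(len(hit), hit)
```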

Lab-play success rate

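The lab-play success rate is the fraction of the 24 target entities for which the agent built a production line meeting the throughput threshold during the holdout window. The published numbers are means over multiple seeds with standard error.^[1] This averaging is presumably why a reported rate can sit below the count of tasks a model has solved: Claude 3.5 Sonnet's 21.9% is lower than 7 of 24 (about 29.2%).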
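
A minimal sketch of the mean and standard error calculation, assuming one success fraction per seed and the usual sample standard deviation. The per-seed values are invented.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(rates):
    """Mean and standard error of the mean over per-seed success rates."""
    return mean(rates), stdev(rates) / sqrt(len(rates))


# Hypothetical per-seed fractions: 6, 3, 6, and 3 tasks solved out of 24.
per_seed = [6 / 24, 3 / 24, 6 / 24, 3 / 24]
avg, sem = mean_and_standard_error(per_seed)
print(f"{avg:.1%} plus or minus {sem:.1%}")
```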

Technical architecture

Agent loop and REPL

FLE deliberately rejects the classical reinforcement learning interface in favor of a Python REPL pattern. At each step the agent receives the standard output and standard error of its last program along with any structured results, then writes the next program in a persistent Python namespace.^[1] Variables, helper functions, and even cached state survive across steps, so an agent can define a place_assembly_line helper early on and reuse it later. This pattern echoes the way human programmers write provisional code at an interactive shell, observe the result, and iterate.
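
The persistent-namespace idea can be sketched with the standard library alone. This is not FLE's implementation: the class name is hypothetical, and running model-written code through a bare exec like this would be unsafe outside a sandbox.

```python
import contextlib
import io
import traceback


class PersistentRepl:
    """Runs program strings in one shared namespace, capturing output."""

    def __init__(self):
        self.namespace = {}

    def step(self, program: str) -> tuple[str, str]:
        stdout, stderr = io.StringIO(), io.StringIO()
        with contextlib.redirect_stdout(stdout), contextlib.redirect_stderr(stderr):
            try:
                exec(program, self.namespace)  # definitions persist in namespace
            except Exception:
                traceback.print_exc()  # written to the captured stderr
        return stdout.getvalue(), stderr.getvalue()


repl = PersistentRepl()
repl.step("def double(x):\n    return 2 * x")   # step 1 defines a helper
out, err = repl.step("print(double(21))")        # step 2 reuses it
_, err2 = repl.step("undefined_name")            # step 3 fails, error is captured
print(repr(out), "NameError" in err2)
```

Because the helper defined in the first step is still available in the second, the agent can build up a small library over the course of an episode, and a failing step returns a traceback instead of ending the run.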

The agent program runs in a Python client that communicates synchronously over TCP to a Lua server embedded in the Factorio game itself, using the RCON protocol that Factorio normally exposes for multiplayer admin tools. The round trip latency is low enough that the system averages 218 operations per second on standard hardware, with the most expensive operations (pathfinding and large entity scans) running at 25 to 48 operations per second.^[1]
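
For readers unfamiliar with RCON, the sketch below encodes and decodes a single packet, assuming Factorio follows the Source RCON wire format (little-endian size, request id, and type fields, then a null-terminated body and a trailing null byte). It only builds bytes in memory and does not open a connection; the Lua command in the example is purely illustrative.

```python
import struct

SERVERDATA_AUTH = 3          # packet type used to send the password
SERVERDATA_EXECCOMMAND = 2   # packet type used to run a console command


def encode_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Build one RCON packet: size, id, type, body, two null bytes."""
    payload = (struct.pack("<ii", request_id, packet_type)
               + body.encode("utf-8") + b"\x00\x00")
    return struct.pack("<i", len(payload)) + payload


def decode_packet(data: bytes) -> tuple[int, int, str]:
    """Parse one complete packet back into (id, type, body)."""
    (size,) = struct.unpack_from("<i", data, 0)
    request_id, packet_type = struct.unpack_from("<ii", data, 4)
    body = data[12:4 + size - 2].decode("utf-8")  # strip the two nulls
    return request_id, packet_type, body


command = "/c game.print('hi')"  # illustrative console command
packet = encode_packet(7, SERVERDATA_EXECCOMMAND, command)
print(len(packet), decode_packet(packet))
```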

Action and observation API

FLE exposes a typed object model of Factorio entities and a set of 23 core methods divided into three groups.^[1]

Category	Representative methods	Purpose
Pure queries	get_entities, inspect_inventory, get_research_progress, get_resource_patch, get_prototype_recipe	Read game state without modifying it
State modifications	place_entity, place_entity_next_to, pickup_entity, rotate_entity, connect_entities, set_entity_recipe	Build and edit factory layout
Resource management	insert_item, extract_item, harvest_resource, craft_item, set_research	Move items, perform crafts, drive research

Returns are strongly typed, so an agent can call nearest(Resource.IronOre) and receive a position object with x and y fields, or call get_entities(Furnace) and iterate over a list of typed furnace records with attributes like position, status, output inventory, and burner inventory. This typing turns the environment into something closer to a domain specific language than a raw game API and gives the model strong hooks for compositional programming.
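
The benefit of typed returns can be shown with stand-in records. The classes, field names, and status strings below are hypothetical and are not the FLE object model; they only mirror the kind of attributes described above.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Position:
    x: float
    y: float


@dataclass(frozen=True)
class FurnaceRecord:
    position: Position
    status: str          # status strings here are hypothetical
    output_count: int


def nearest_record(origin: Position, records: list[FurnaceRecord]) -> FurnaceRecord:
    """Pick the record closest to origin by straight-line distance."""
    return min(records, key=lambda r: hypot(r.position.x - origin.x,
                                            r.position.y - origin.y))


furnaces = [
    FurnaceRecord(Position(10, 0), "working", 12),
    FurnaceRecord(Position(3, 4), "no_fuel", 0),
]
closest = nearest_record(Position(0, 0), furnaces)
stalled = [f for f in furnaces if f.status != "working"]
print(closest.position, len(stalled))
```

With named fields, a query result can be filtered or sorted in one line, which is the kind of composition the typed interface is meant to encourage.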

Memory and long-context handling

Long-horizon episodes generate enormous logs. FLE addresses this with a hierarchical memory scheme: at every step, the most recent 32 observations remain verbatim in the context window, while older observations are summarized into 1,024 token reports.^[1] This keeps the prompt tractable even after thousands of steps, which would otherwise exhaust a model's context window. The summarization is itself performed by an LLM, which becomes a subtle hyperparameter of the evaluation.
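
A minimal sketch of the windowing logic, assuming a single summary entry for everything older than the window. The summarizer here is a stub standing in for the LLM call, and the function names are hypothetical.

```python
RECENT_WINDOW = 32  # number of verbatim observations, as quoted in the article


def summarize(observations: list[str]) -> str:
    """Stub for the LLM summarizer; a real one would target about 1,024 tokens."""
    return f"[summary of {len(observations)} earlier observations]"


def build_context(history: list[str], window: int = RECENT_WINDOW) -> list[str]:
    """Keep the last `window` observations verbatim and compress the rest."""
    if len(history) <= window:
        return list(history)
    older, recent = history[:-window], history[-window:]
    return [summarize(older)] + recent


history = [f"obs {i}" for i in range(100)]
context = build_context(history)
print(len(context), context[0], context[1])
```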

Running the environment

The v0.3 release packages the system for easy use. Installation is via PyPI:

pip install factorio-learning-environment
# Optional extras for evaluation, MCP, or PostgreSQL logging
pip install "factorio-learning-environment[eval,mcp,psql]"

The CLI then provides commands to start a Docker based cluster of Factorio servers and run an evaluation sweep:

fle cluster start
fle eval --config configs/gym_run_config.json

The headless renderer in v0.3 lets agents run without the official Factorio game client, making large parallel sweeps cheap on cloud hardware. Headless mode also exposes a pixel observation channel for multimodal experiments, although the headline benchmark remains text and code only.^[6]

Gym and MCP interfaces

v0.3 added an OpenAI Gym style interface, so FLE can be plugged into off the shelf RL pipelines that use step, reset, and reward semantics.^[6] The environment also ships with a Model Context Protocol server, which allows Claude Code, other MCP clients, and IDE-style agent harnesses to drive the environment without writing custom glue code.^[6] This bridge was the headline demonstration of v0.3.0: a livestream that showed Claude Code building factories interactively over many hours of play.^[15]
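
The step, reset, and reward semantics can be sketched with a toy environment. This class is a stand-in written for illustration and is not the FLE environment; it follows the five-value step return used by recent Gym and Gymnasium releases, which is an assumption about the interface style.

```python
class ToyFactoryEnv:
    """Tiny stand-in with Gym-style reset/step; not the FLE environment."""

    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps
        self.steps = 0

    def reset(self):
        self.steps = 0
        return {"stdout": "", "score": 0.0}, {}

    def step(self, program: str):
        self.steps += 1
        reward = float(len(program.splitlines()))  # placeholder reward
        obs = {"stdout": f"ran step {self.steps}", "score": reward}
        terminated = False                          # the toy task never "finishes"
        truncated = self.steps >= self.max_steps    # step budget exhausted
        return obs, reward, terminated, truncated, {}


env = ToyFactoryEnv()
obs, info = env.reset()
total, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, info = env.step("print('hello')")
    total += reward
    done = terminated or truncated
print(total, env.steps)
```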

Evaluated models and headline results

The original paper evaluated six frontier and open weight models in March 2025.^[1] Subsequent updates added results for newer systems, including models from the GPT-5, Claude Opus 4.1, Gemini 2.5, and Grok 4 families.^[5]^[7] The v1 leaderboard remains the most widely cited result because it offers a clean comparison across a single moment in time.

Lab-play success rate, v1 paper

Model	Lab-play success rate	Tasks solved (of 24)
Claude 3.5 Sonnet	21.9 ± 1.3%	7
GPT-4o	16.6 ± 1.4%	5 to 6
DeepSeek v3	15.1 ± 1.7%	4 to 5
Gemini 2 Flash	13.0 ± 1.3%	4
Llama 3.3 70B	6.3 ± 1.0%	2
GPT-4o mini	5.2 ± 0.6%	1 to 2

Claude 3.5 Sonnet was the clear leader. It was the only model to consistently complete intermediate electronics tasks like green circuits, and it occasionally automated steel plate production, a task that requires coordinating fuel, ore, and a second smelting stage. Even so, no v1 era model came close to automating any of the late game lab-play targets in the 50 to 100 machine range.^[1]

Open-play Production Score and Milestones, v1 paper

Model	Open-play Production Score (median)	Milestones reached
Claude 3.5 Sonnet	293,206	28
GPT-4o	mid five figures	mid twenties
DeepSeek v3	lower five figures	low twenties
Gemini 2 Flash	lower five figures	low twenties
Llama 3.3 70B	54,998	26
GPT-4o mini	low four figures	mid teens

Claude 3.5 Sonnet's most notable open-play accomplishment was discovering that investing science packs into the electric mining drill technology unlocks a much higher steady state throughput than any amount of manual ore harvesting can match. Llama 3.3 70B, despite a much lower lab-play score, posted a strong open-play milestone count by chaining many simple actions over its 5,000 step budget, an early hint that exploration depth and lab-play planning skill are not perfectly correlated.^[1]

Updates in v0.3.0

With v0.3.0 the authors re-ran the benchmark on a broader set of frontier models and reported qualitative results that show the ordering Claude > GPT > Gemini > Grok in lab-play, with absolute scores improving across the board.^[6] They note that the latest generation of open weight models has now caught up with the previously state of the art Claude 3.5 Sonnet number, while Grok 4 tends to enter degenerate debug loops where it repeats the same failing action many times, and GPT-5 recovers more gracefully than its predecessors. Claude Opus 4.1, the strongest model in the v0.3 evaluation, displayed an error rate of around 23% on lab-play with essentially zero syntactic errors, meaning its failures were pragmatic rather than malformed.^[6]

The v0.3 evaluations also revealed a recurring pattern: even when frontier models can build simple structures, they tend to fall back on semi-manual strategies, hand-feeding furnaces or directly crafting items rather than designing true automation pipelines that scale.^[6]

How agents actually play Factorio

The FLE paper devotes considerable space to qualitative analysis. Watching transcripts of a successful Claude 3.5 Sonnet run, the authors identify a recurring sequence of behaviors that begins with manual gathering, transitions into ad hoc crafting, and only later attempts to set up belts and assemblers. Even strong models rarely plan an entire factory before starting; instead they iteratively bolt new subsections onto the side of an existing layout, which mirrors how many human players begin the game.^[1]

Strengths observed

Compositional code reuse, where an agent writes helper functions early in a run and calls them across many later steps to place machines or run inventory checks.
Recipe planning, where the model correctly decomposes a target item into prerequisites, even when the chain runs five or six layers deep.
Research prioritization, where Claude 3.5 Sonnet explicitly trades early throughput for unlocking the electric mining drill, and the gain compounds over the rest of the run.^[1]
Local error recovery, where a placement collision triggers a query for occupied tiles followed by an offset and a retry.

Failure modes

Spatial misjudgment, where the model misestimates the distance between a drill and a furnace and produces a belt that does not actually connect them.
Throughput mismatch, where it builds a single assembler downstream of a fully saturated belt and never notices the bottleneck.
Debug loops, where it repeats a failing operation many times without varying parameters, the dominant failure mode for Grok 4.^[6]
Strategic shortsightedness, where it scales the easiest substep to absurd lengths instead of moving on to the next tier.
API misunderstandings, where the model believes a method exists or behaves differently from its actual signature, a problem present in all tested models.
Forgetting context, where after a memory summarization step the model loses track of where it placed a key machine and starts a duplicate.

The authors interpret these patterns as evidence that the bottleneck is not knowledge of the game (most models have absorbed Factorio wiki content during pretraining) but the ability to translate that knowledge into stable plans, executable code, and bug-free spatial layouts.^[1]

Comparison with other agentic benchmarks

FLE is part of a wider movement to evaluate language models in interactive environments rather than on static datasets. Each comparable benchmark stresses a different axis of capability.

Benchmark	Environment	Focus	Time horizon	Closest analogue to FLE
BALROG^[12]	Six games including NetHack, BabyAI, Crafter	Multi-game agentic reasoning across small environments	Hundreds of steps	Closest in spirit, but no single game scales to FLE's complexity
Voyager^[11]	Minecraft via Mineflayer	Open-ended skill library acquisition with GPT-4	Hours of in game time	Shares the open-ended ethos; uses code generation but no formal score
MineDojo	Minecraft via internet-scale data	Multi-task internet-grounded learning	Variable	Game-based and open world, but less industrial
SWE-bench^[14]	GitHub issues in real Python repositories	Coding patch generation against test suites	Single PR per task	Both stress program synthesis, but SWE-bench has fixed solutions
MLE-bench^[13]	Kaggle competitions	End to end machine learning pipelines	Hours per task	Both reward building working systems under unbounded score caps
AgentBench	Eight code, web, and game environments	Broad coverage of agentic tasks	Tens to hundreds of steps	Broader but shallower than FLE
Crafter	2D survival game	Achievement based exploration	Short	A toy ancestor of FLE in style
NetHack Learning Environment	The roguelike NetHack	Procedural exploration and survival	Thousands of steps	Shares the procedurally generated map and unbounded difficulty
ALFWorld	Text and visual household tasks	Embodied instruction following	Tens of steps	Tests grounded planning but at a much smaller scale

FLE differentiates itself on three axes. First, its action space is Python code over a typed entity API rather than a finite discrete action list or natural language commands, which puts unusual pressure on program synthesis. Second, its score scales over six orders of magnitude, far more than any other published agentic benchmark. Third, it requires sustained competence over thousands of decisions, where most existing benchmarks max out at a few hundred steps.^[1]

In the framing of the paper, FLE is closer to a flight simulator for agentic AI research than to a multiple choice test. Models with similar scores on MMLU or HumanEval can produce wildly different factories, and the FLE leaderboard often aligns more closely with practical benchmarks like GDPval that estimate economic productivity than with academic exam suites.

Limitations

The authors are forthright about FLE's limitations. The first is incomplete entity coverage. Mid and late game elements such as trains, logistics robots, and programmable circuit networks are partially modeled, which restricts the very late tiers of factory design. The second is the absence of a human baseline. Although expert players routinely build factories thousands of times more productive than the best LLM, none have attempted to do so through the Python REPL with a strict step budget, so it is unclear what a top human would actually score in the same conditions.^[1]

A third concern is reward hacking. Because Production Score depends on cumulative production minus consumption, an agent that found a degenerate production loop could in principle inflate its score without building a useful factory. The authors report that no current model exploits this, but they flag it as a risk that may emerge in stronger systems.^[1]

A fourth limitation is that FLE is single agent only in its current form, even though Factorio has well developed multiplayer support. The authors list cooperative and competitive multiplayer scenarios as natural directions for future work. Finally, the benchmark imposes a particular memory and summarization scheme that interacts with model performance, which complicates direct comparisons across different model families with different context windows.^[1]

More subtly, FLE inherits idiosyncrasies from Factorio itself, including biters, surface mechanics, and recipe quirks that may not generalize to other automation problems. The system is therefore best understood as one slice of agentic capability rather than a universal yardstick.

Implications for AI research

FLE has had visible influence on how the field thinks about agent evaluation. The benchmark shows that even models that excel at coding interview problems and pass advanced math exams collapse when asked to maintain a coherent industrial plan across thousands of API calls. That signal has fed into several active research threads.

Hierarchical planning for LLM agents, where models propose long-horizon outlines before generating detailed code, has shown improved FLE performance in subsequent work.
Tool augmented memory, including retrieval over past program transcripts, mitigates some of the context loss observed in the original paper.
Reinforcement fine tuning is being explored as a way to teach models the specific calling conventions and spatial heuristics of the Factorio API.
Process supervision, where reward models score intermediate reasoning steps rather than only final outcomes, fits naturally into FLE's step level feedback loop.
Multimodal extensions that pair the Python API with the v0.3 pixel renderer offer a path to test whether vision conditioning improves spatial layout.

FLE has also attracted attention outside academia. Tech press coverage in The Decoder, Gigazine, and AI newsletters has framed it as a sober antidote to the cycle of benchmark saturation,^[9]^[10] and the project's Discord and YouTube tutorials have built a small community of practitioners experimenting with custom agents. Public livestreams of Claude Code running FLE drew large audiences,^[15] and several research labs now use the environment for internal model regression testing.

Connection to evaluation of long-horizon agents

The themes that FLE highlights (sustained coherence, spatial reasoning, debugging across long sessions) are increasingly central to discussions of AI agents more broadly. Benchmarks aimed at end to end software engineering, like SWE-bench and SWE-Lancer, or at scientific workflows, like MLE-bench, share much of FLE's emphasis on translating high level goals into executable code over many steps. FLE is unusual in offering an environment where the goal itself is open ended and the metric is continuous, which lets researchers study how agents allocate effort without external instruction.

Installation and quick start

The following commands install the package, start a local Factorio cluster, and launch an evaluation run.

# Install the package
pip install factorio-learning-environment

# Start a local cluster of Factorio servers via Docker
fle cluster start

# Run an evaluation sweep with a configured agent and task list
fle eval --config configs/gym_run_config.json

Writing a custom agent is a matter of subclassing the supplied agent interface and emitting Python programs in response to observations. A simplified loop looks like the following.

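# Simplified sketch: names such as FactorioEnvironment follow this article's
# illustration and may differ from the released package; check the repository.
from fle import FactorioEnvironment


class InventoryAgent:
    """Placeholder agent; a real one would ask a language model for code."""

    def respond(self, obs):
        # Return the next Python program, as a string, for the game REPL.
        return "print(inspect_inventory())"


my_agent = InventoryAgent()
env = FactorioEnvironment(mode="lab-play", task="iron_plate")
obs = env.reset()

while not obs.done:
    # The agent writes a Python program to execute against the live game.
    program = my_agent.respond(obs)
    obs = env.step(program)

print("Final Production Score:", obs.production_score)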

The full repository contains example agents that wrap OpenAI, Anthropic, and open weight model APIs, as well as tracing utilities that produce HTML transcripts of every API call and observation, useful for both debugging and qualitative analysis.^[3]^[4]

Roadmap and future work

The FLE roadmap as of 2026 calls out several directions.

Direction	Status	Notes
Expanded entity coverage	Ongoing	Adding trains, logistics robots, programmable circuits
Multiplayer and cooperative settings	Planned	Letting multiple agents share a factory
Adversarial and biter combat	Planned	Defensive engineering against the game's hostile creatures
Pixel and multimodal observations	Available since v0.3	Used for early experiments with vision models^[6]
MCP and IDE agent integrations	Available since v0.3	Enables Claude Code, Cursor, and other clients to run FLE directly^[6]
Reinforcement learning baselines	Research phase	Compares specialized policies against language model agents
Human expert baselines	Open question	The authors invite the community to run controlled human trials
Curriculum learning	Open	Sequencing lab-play tasks for training rather than evaluation

The authors emphasize that FLE is meant to be a moving target. They expect the lab-play leaderboard to be largely saturated by the most capable models within a few years, at which point the focus is likely to shift to longer open-play horizons, multi-agent coordination, and integration with the game's combat systems.^[1]

Reception

FLE was published at NeurIPS 2025 in the Datasets and Benchmarks Track and has been broadly well received in the machine learning community.^[2] Researchers have praised its non-saturating design, its emphasis on long horizons, and its careful API design. Some have noted that the choice of Python as an action space favors models with strong code generation capabilities, which may understate the abilities of systems that excel at natural language or visual reasoning. Others have pointed out that the lab-play setup, with pre-unlocked technologies and fixed starting resources, simplifies away some of the most interesting strategic questions that open-play surfaces.

The game's commercial publisher, Wube Software, has not endorsed FLE officially, but the project's compatibility with Factorio 2.0 and its respect for the game's terms of service have allowed it to coexist with the official game community. Discussion threads on the official Factorio forums and on r/factorio have responded positively, with some experienced players proposing additional lab-play targets and others volunteering to run human baseline trials.^[16]

In the broader AI safety and capability evaluation discussion, FLE has been cited as a useful counterweight to coding benchmarks like SWE-bench because it isolates planning and execution rather than rewarding pure code synthesis ability. Several frontier labs have publicly reported running internal versions of FLE as part of their model release evaluations.

Significance

FLE captures a moment in AI evaluation when the field is moving away from saturated multiple-choice tests toward open-ended, environment-grounded benchmarks. By embedding agents in a game that humans have spent thousands of hours optimizing, it provides a window into capabilities that matter for real-world automation: planning over long horizons, programming against typed APIs, recovering from spatial mistakes, and choosing what to build next when no one is supervising the choice.

Its headline finding, that even the strongest publicly available models fully automate only a small fraction of lab-play tasks and reach only the early game in open-play, is now widely cited as evidence that current LLMs have a planning ceiling well below human industrial competence.^[1] At the same time, the rapid progress between the original paper's results (v1) and the v0.3 release shows that the gap is closing fast.^[6] FLE will likely remain a useful instrument for charting that progress for several more years, both as a research benchmark and as a public dashboard for tracking how language model agents handle a hard, scalable industrial problem.

References

1. Hopkins, Jack; Bakler, Mart; Khan, Akbir. *Factorio Learning Environment*. arXiv:2503.09617, March 2025. https://arxiv.org/abs/2503.09617
2. Hopkins, Jack; Bakler, Mart; Khan, Akbir. *Factorio Learning Environment*. NeurIPS 2025 Datasets and Benchmarks Track, OpenReview. https://openreview.net/forum?id=652Q6jBFMZ
3. FLE source code repository. https://github.com/JackHopkins/factorio-learning-environment
4. Official FLE website and documentation. https://jackhopkins.github.io/factorio-learning-environment/
5. FLE leaderboard. https://jackhopkins.github.io/factorio-learning-environment/leaderboard/
6. FLE v0.3.0 release notes. https://jackhopkins.github.io/factorio-learning-environment/versions/0.3.0.html
7. Epoch AI benchmark page for FLE. https://epoch.ai/benchmarks/factorio-learning-environment
8. Hopkins, Jack. *Lecture 85: Factorio Learning Environment*. GPU MODE, 2025. https://www.youtube.com/watch?v=iXvYa2oIMbA
9. The Decoder. *Factorio joins growing list of video games doubling as AI benchmarking tools*. 2025. https://the-decoder.com/factorio-joins-growing-list-of-video-games-doubling-as-ai-benchmarking-tools/
10. Gigazine. *Factorio Learning Environment (FLE) is now available, a learning environment that evaluates the performance of AI models*. March 2025. https://gigazine.net/gsc_news/en/20250313-factorio-learning-environment/
11. Wang, Guanzhi et al. *Voyager: An Open-Ended Embodied Agent with Large Language Models*. arXiv:2305.16291, 2023.
12. Paglieri, Davide et al. *BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games*. arXiv:2411.13543, 2024. https://arxiv.org/abs/2411.13543
13. Chan, Jun Shern et al. *MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering*. arXiv:2410.07095, 2024.
14. Jimenez, Carlos E. et al. *SWE-bench: Can Language Models Resolve Real-World GitHub Issues?* arXiv:2310.06770, 2023.
15. Hacker News. *Show HN: FLE v0.3 - Claude Code Plays Factorio*. https://news.ycombinator.com/item?id=45466865
16. Factorio Forums discussion thread. https://forums.factorio.com/viewtopic.php?t=127390
