Factorio Learning Environment
Last reviewed
May 16, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 5,718 words
| Factorio Learning Environment | |
|---|---|
| Overview | |
| Full name | Factorio Learning Environment |
| Abbreviation | FLE |
| Description | An open-ended evaluation framework that uses the industrial automation game Factorio to test long-horizon planning, spatial reasoning, program synthesis, and resource optimization in large language model agents |
| First arXiv release | 2025-03-12 (paper 2503.09617) |
| Conference | NeurIPS 2025, Datasets and Benchmarks Track |
| Latest version | v0.4.x (2026) |
| Authors | Jack Hopkins, Mart Bakler, Akbir Khan |
| Lead affiliations | Anthropic, University College London |
| Technical Details | |
| Type | Long-horizon agent benchmark, program synthesis, resource optimization |
| Modality | Code (Python REPL), text observations, optional pixel renderer |
| Task format | Lab-play production challenges and unbounded open-play factory building |
| Number of tasks | 24 lab-play target entities plus open-play |
| Total examples | Procedurally generated; unbounded in open-play |
| Evaluation metrics | Production Score (PS), Milestones, lab-play task success rate |
| Domains | Industrial automation, logistics, spatial layout, research progression |
| Languages | English prompts, Python action space |
| Performance | |
| Best published lab-play (v1 paper) | Claude 3.5 Sonnet, 21.9% (7 of 24 tasks fully automated) |
| Best published open-play PS (v1) | Claude 3.5 Sonnet, 293,206 |
| Best published milestones (v1) | Claude 3.5 Sonnet, 28 milestones |
| Human ceiling | Expert players reach factories processing millions of items per second |
| Saturated | No, intentionally non-saturating |
| Resources | |
| Website | Official site |
| Paper | arXiv 2503.09617 |
| GitHub | JackHopkins/factorio-learning-environment |
| Leaderboard | FLE leaderboard |
| Epoch AI page | Epoch benchmark entry |
| License | MIT for code, CC BY 4.0 for the paper |
The Factorio Learning Environment (FLE) is an open source benchmark and research framework that uses the industrial automation game Factorio to evaluate the long-horizon agentic capabilities of large language model systems. It was introduced in March 2025 by Jack Hopkins, Mart Bakler, and Akbir Khan in the paper Factorio Learning Environment (arXiv:2503.09617), and it was later accepted to the NeurIPS 2025 Datasets and Benchmarks Track. FLE is designed as a deliberately non-saturating environment: rather than asking a model to pick the right multiple choice answer or close a single ticket, it asks an agent to grow a working factory whose throughput can scale across roughly six orders of magnitude, from a handful of items per minute to the millions of items per second that experienced human players routinely achieve in late game play.
The environment is built around a Python read-eval-print loop (REPL), in which an agent observes the game state, writes Python programs that call a typed Factorio API, executes them against a running game server, and reads back structured feedback. It supplies two evaluation modes: a structured lab-play suite of 24 production tasks with fixed starting resources, and an open-play mode where a single agent is given a procedurally generated map and the instruction to build the largest factory it can. The first release reported that even the strongest frontier model of that era, Claude 3.5 Sonnet, fully automated only 7 of the 24 lab-play tasks and reached a Production Score of 293,206 in open-play, while smaller and weaker models stalled at much earlier rungs of the technology tree.
Many classic AI benchmarks have collapsed under the pace of progress in foundation models. MMLU, HumanEval, GSM8K, and other multiple choice or short answer suites are now scored above 90% by several models, leaving little headroom to differentiate frontier systems. The authors of FLE argue that this saturation hides deep gaps in capabilities that matter for real world deployment: long-horizon planning over thousands of decisions, spatial reasoning over a 2D grid, error diagnosis under partial information, and the ability to compose simple primitives into reusable abstractions.
Factorio is a particularly natural fit for these questions for several reasons. The game has a fully deterministic engine, an unambiguous notion of progress measured in items produced per minute, and a technology tree that grows exponentially in complexity. A starting factory needs only two machines to mine iron ore, but a fully optimized end-game base for a single unit of utility science requires coordinating close to a hundred machines across many subassemblies. The same skill set (breaking a goal into subgoals, laying out machines on a grid, debugging bottlenecks) applies at every level. That makes Factorio an environment where the gap between novice and expert behavior is enormous, where the score climbs by orders of magnitude from one tier to the next, and where the same engine can challenge agents far above the capabilities of today's models.
The authors note that this is also why Factorio became a cult object among engineering teams. The game encodes industrial planning problems that resemble logistics, manufacturing operations research, and distributed systems engineering. FLE therefore positions itself not only as an AI benchmark but also as a sandbox where researchers can study how language model agents learn to plan, refactor code, and recover from errors over horizons measured in hours of wall clock time.
A persistent concern in the AI evaluation community is that successive model generations rapidly close the gap between random performance and the human ceiling, after which differences between models stop being legible. FLE takes a different approach. Production Score spans several orders of magnitude, so on a logarithmic axis each new tier of automation adds a comparable interval. Going from manual ore mining to electric drills moves a model by one band, while progressing through green, red, and blue circuits adds three more. Because there is no natural completion state, even the best contemporary model can be beaten by a future system that simply scales further.
The FLE paper credits three authors. Jack Hopkins and Mart Bakler are listed as joint first authors, with Akbir Khan as the third author. Hopkins works at Anthropic, and Khan is affiliated with University College London. The project was developed primarily in 2024 and early 2025 and released as an open source repository at github.com/JackHopkins/factorio-learning-environment under an MIT style license, with the paper distributed under CC BY 4.0.
The code base consists of roughly eighty thousand lines, including a Python client library, a Lua mod that exposes Factorio internals to that client, an evaluation harness with logging via Weights and Biases, and CLI tooling for spinning up Docker based clusters of Factorio servers. The repository continues to be actively developed: v0.3.0 in late 2025 added a headless renderer, an OpenAI Gym compatible interface, and an adapter for Claude Code, and the v0.4 series in 2026 extended these with additional tasks and Model Context Protocol integration.
FLE is built directly on Factorio, an industrial automation game by the Czech studio Wube Software. The game's core loop forces the player to gather raw materials, build machines that craft intermediate products, and string those machines together with belts, pipes, and electrical infrastructure. The environment exposes these elements to an agent through a typed Python interface.
| Mechanic | Description | Why it matters for agents |
|---|---|---|
| Resource extraction | Mining ore, harvesting trees, pumping crude oil | Tests perception of resource patches and decisions about siting |
| Crafting | Combining inputs in furnaces, assemblers, chemical plants | Forces recipe planning over a directed graph of ingredients |
| Belts and inserters | Moving items between machines | Requires precise spatial layout and throughput matching |
| Pipes and fluids | Routing liquids between refineries and plants | Adds a second logistics network with different rules |
| Power | Burner, steam, solar, and nuclear generation | Introduces constraints that interact with every other system |
| Research | Unlocking technologies by feeding science packs into labs | Rewards strategic investment of throughput into research |
| Biters | Hostile insect-like creatures (optional in FLE) | Adds defensive and military objectives in later expansions |
A factory in Factorio is essentially a dataflow graph laid out on a 2D grid, with each node consuming inputs at a fixed rate and producing outputs at another fixed rate. The agent's job is to assemble that graph from primitive building blocks while keeping its surface area, latency, and bottlenecks in check.
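The dataflow view can be made concrete with a small throughput calculation. The sketch below models a linear chain of stages, each with a machine count, a per-machine output rate, and the number of upstream items consumed per output. The `Stage` type, the stage names, and all rates are illustrative placeholders, not values taken from FLE or the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One node of a linear production chain (illustrative values only)."""
    name: str
    machines: int             # number of identical machines in this stage
    rate_per_machine: float   # items produced per second by one machine
    inputs_per_output: float  # upstream items consumed per item produced


def chain_throughput(stages: list[Stage]) -> tuple[float, str]:
    """Return (items/s leaving the last stage, name of the limiting stage)."""
    available = float("inf")  # the first stage has no upstream supply limit
    bottleneck = stages[0].name
    for stage in stages:
        capacity = stage.machines * stage.rate_per_machine
        supply_limited = available / stage.inputs_per_output
        if capacity <= supply_limited:
            # This stage cannot consume everything upstream offers,
            # so it becomes the limiting stage from here on.
            output, bottleneck = capacity, stage.name
        else:
            # Upstream supply is the limit; the earlier bottleneck stands.
            output = supply_limited
        available = output
    return available, bottleneck


# Hypothetical three-stage line: mining -> smelting -> gear assembly.
chain = [
    Stage("mining", machines=2, rate_per_machine=0.5, inputs_per_output=1.0),
    Stage("smelting", machines=2, rate_per_machine=0.3125, inputs_per_output=1.0),
    Stage("gear assembly", machines=1, rate_per_machine=0.5, inputs_per_output=2.0),
]
rate, limiting = chain_throughput(chain)
print(f"{rate} items/s, limited by {limiting}")
```

With these placeholder numbers the smelting stage caps the line, which is the kind of bottleneck an agent has to find and fix by adding machines or rebalancing the layout.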
The lab-play mode is the primary structured benchmark. It consists of 24 production tasks, each pinned to a specific target item from the Factorio technology tree. Each task starts the agent on a small map with a fixed inventory and the relevant prerequisite technologies already researched, so the only challenge is to build a production line that achieves a target throughput within a 60-second in-game holdout window. Success thresholds are 16 items per minute for solid items and 250 units per minute for fluids, with a budget of 128 API calls per task (the v0.3 release reduced this to 64-step trajectories with early stopping).
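The pass or fail rule described above can be sketched as a small check. The thresholds and window length come from the text; the function name `passes_lab_play` and the choice of an inclusive comparison are assumptions made for illustration, not details confirmed by the FLE code base.

```python
SOLID_THRESHOLD_PER_MIN = 16    # items per minute, from the task description
FLUID_THRESHOLD_PER_MIN = 250   # fluid units per minute, from the task description
HOLDOUT_SECONDS = 60            # length of the in-game holdout window


def passes_lab_play(produced_in_holdout: float, is_fluid: bool,
                    holdout_seconds: float = HOLDOUT_SECONDS) -> bool:
    """Return True if throughput over the holdout window meets the threshold.

    Assumption: the comparison is inclusive (>=); the source does not say.
    """
    per_minute = produced_in_holdout * 60.0 / holdout_seconds
    threshold = FLUID_THRESHOLD_PER_MIN if is_fluid else SOLID_THRESHOLD_PER_MIN
    return per_minute >= threshold


print(passes_lab_play(18, is_fluid=False))   # 18 solid items in the window
print(passes_lab_play(200, is_fluid=True))   # 200 fluid units in the window
```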
The lab-play target entities span the full game progression. Although the paper does not enumerate all 24 in a single list, the tasks cover the following categories drawn from its descriptions:
| Tier | Example target entities | Approximate machine count |
|---|---|---|
| Tier 1, raw extraction | Iron ore, copper ore, coal, stone | 1 to 2 |
| Tier 2, basic smelting | Iron plate, copper plate, stone brick | 2 to 6 |
| Tier 3, intermediate parts | Iron gear wheel, copper cable, automation science pack | 6 to 12 |
| Tier 4, electronics | Electronic circuit (green chip), advanced circuit (red chip) | 10 to 25 |
| Tier 5, chemicals and fluids | Plastic bar, sulfur, sulfuric acid, lubricant, batteries | 15 to 40 |
| Tier 6, mechanical assemblies | Engine unit, electric engine, steel plate | 20 to 50 |
| Tier 7, late game | Military science pack, utility science pack, processing unit (blue chip) | 50 to 100 |
Lab-play is the closest thing FLE has to a traditional benchmark because each task has a clean pass or fail outcome. Aggregating across all 24 tasks produces the lab-play success rate that drives the published leaderboard.
The open-play mode is more open ended. The agent receives a procedurally generated map, a single instruction to build the largest possible factory, and a budget of up to 5,000 environment steps. Each model is evaluated across eight independent runs, and median Production Score and Milestones are reported. Because there is no fixed target item and no time limit beyond the step budget, open-play rewards agents that can set their own subgoals, scale infrastructure proactively, and recover from mistakes without external guidance.
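Reporting a median over eight runs can be sketched as follows. The run data is entirely hypothetical, and taking the median of each metric independently is an assumption about how the aggregation is done.

```python
from statistics import median


def summarize_open_play(runs: list[tuple[float, int]]) -> tuple[float, float]:
    """Return (median Production Score, median milestone count) over runs.

    Each run is a (production_score, milestones) pair. The two medians are
    computed independently, which is an assumption of this sketch.
    """
    scores = [score for score, _ in runs]
    milestones = [count for _, count in runs]
    return median(scores), median(milestones)


# Hypothetical results for eight independent runs of one model.
example_runs = [
    (120_000, 20), (95_000, 18), (310_000, 28), (280_000, 26),
    (150_000, 22), (200_000, 24), (90_000, 17), (260_000, 25),
]
median_ps, median_milestones = summarize_open_play(example_runs)
print(median_ps, median_milestones)
```

The median is less sensitive than the mean to a single run that stalls early or snowballs, which matters when scores span orders of magnitude.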
Open-play exposes capabilities that lab-play deliberately suppresses. In lab-play the relevant technologies are pre-unlocked, so the model never has to decide whether to invest throughput into research. In open-play it has to choose between building more iron furnaces today and pushing for electric drilling tomorrow, an explicitly long-horizon trade off.
FLE reports three families of numbers, each designed to capture a different facet of agent behavior.
The Production Score (PS) is a continuous measure of economic activity. For each item in the game, the system assigns a value V(i) derived from the recipe's complexity, ingredient depth, and energy cost. The Production Score at time t is then a weighted sum:
$$\mathrm{PS}(t) = \sum_{i} V(i)\,\bigl(P_i(t) - C_i(t)\bigr)$$
where P_i(t) is the cumulative quantity of item i produced up to time t and C_i(t) is the cumulative quantity consumed. Raw items such as iron ore have V near 3, while a single unit of processing unit (blue chip) is valued in the thousands. Because the same factory can extend its score by simply running longer or producing more advanced items, Production Score varies across orders of magnitude and never saturates.
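The Production Score formula above can be evaluated directly once item values and cumulative counts are known. In the sketch below the item names, the value table, and the counts are hypothetical; only the iron ore value of roughly 3 echoes the text.

```python
def production_score(values: dict[str, float],
                     produced: dict[str, float],
                     consumed: dict[str, float]) -> float:
    """Compute PS = sum_i V(i) * (P_i - C_i) over all items seen so far."""
    items = set(produced) | set(consumed)
    return sum(
        values[item] * (produced.get(item, 0.0) - consumed.get(item, 0.0))
        for item in items
    )


# Hypothetical value table and cumulative counts at some time t.
item_values = {"iron-ore": 3.0, "iron-plate": 8.0, "iron-gear-wheel": 20.0}
produced_so_far = {"iron-ore": 100, "iron-plate": 60, "iron-gear-wheel": 10}
consumed_so_far = {"iron-ore": 60, "iron-plate": 20}

score = production_score(item_values, produced_so_far, consumed_so_far)
print(score)
```

Because consumption is subtracted, turning ore into plates only raises the score when the plates are valued above the ore they use up, which is why deeper recipes dominate the total.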
A Milestone is hit the first time an agent successfully produces a particular item or unlocks a particular technology. The paper defines a fixed list of milestones tied to the major tiers of the Factorio progression, from gathering wood through researching electric energy distribution and into late game logistics. Where Production Score answers the question "how much economic activity did the agent generate," milestones answer "how broad was the technology tree it covered."
The lab-play success rate is the fraction of the 24 target entities for which the agent built a production line meeting the throughput threshold during the holdout window. The published numbers are means over multiple seeds with standard error.
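A mean with standard error over seeds can be computed as in the sketch below. The per-seed task counts are invented for illustration, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text does not specify the estimator.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_standard_error(rates: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for per-seed success rates."""
    if len(rates) < 2:
        raise ValueError("need at least two seeds to estimate a standard error")
    # Assumption: sample standard deviation, divided by sqrt(number of seeds).
    return mean(rates), stdev(rates) / sqrt(len(rates))


TOTAL_TASKS = 24
tasks_solved_per_seed = [4, 6, 5, 5]  # hypothetical counts, one per seed
rates = [solved / TOTAL_TASKS for solved in tasks_solved_per_seed]

avg, sem = mean_and_standard_error(rates)
print(f"{100 * avg:.1f} ± {100 * sem:.1f}%")
```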
FLE deliberately rejects the classical reinforcement learning interface in favor of a Python REPL pattern. At each step the agent receives the standard output and standard error of its last program along with any structured results, then writes the next program in a persistent Python namespace. Variables, helper functions, and even cached state survive across steps, so an agent can define a place_assembly_line helper early on and reuse it later. This pattern echoes the way human programmers write provisional code at an interactive shell, observe the result, and iterate.
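The persistent-namespace pattern can be illustrated with the Python standard library alone. This is a minimal sketch of the general technique, not FLE's implementation: the `PersistentRepl` class is hypothetical, and it runs programs with `exec` in the current process, which is only appropriate for trusted code.

```python
import contextlib
import io
import traceback


class PersistentRepl:
    """Execute successive programs in one shared namespace (illustrative)."""

    def __init__(self) -> None:
        # Everything a program defines lands here and survives across steps.
        self.namespace: dict = {}

    def step(self, program: str) -> tuple[str, str]:
        """Run one program and return its captured (stdout, stderr)."""
        out, err = io.StringIO(), io.StringIO()
        with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
            try:
                exec(program, self.namespace)
            except Exception:
                # Report the failure as text instead of crashing the loop.
                traceback.print_exc()
        return out.getvalue(), err.getvalue()


repl = PersistentRepl()
repl.step("def double(x):\n    return 2 * x")       # define a helper
stdout, stderr = repl.step("print(double(21))")      # reuse it a step later
failed_out, failed_err = repl.step("undefined_name")  # errors come back as text
print(stdout, end="")
```

Returning the traceback as an observation rather than raising lets the agent read its own error and write a corrected program on the next step.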
The agent program runs in a Python client that communicates synchronously over TCP to a Lua server embedded in the Factorio game itself, using the RCON protocol that Factorio normally exposes for multiplayer admin tools. The round trip latency is low enough that the system averages 218 operations per second on standard hardware, with the most expensive operations (pathfinding and large entity scans) running at 25 to 48 operations per second.
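To give a sense of what travels over the wire, the sketch below frames a command using the Source RCON packet layout (little-endian size, request id, type, body, two terminating null bytes), which Factorio's RCON endpoint is generally understood to follow. The helper names are hypothetical, and the example Lua command is an assumption for illustration, not a command taken from the FLE client.

```python
import struct

SERVERDATA_AUTH = 3         # packet type used to send the RCON password
SERVERDATA_EXECCOMMAND = 2  # packet type used to run a console command


def encode_packet(request_id: int, packet_type: int, body: str) -> bytes:
    """Frame one RCON packet: size, id, type, body, and two null bytes."""
    payload = (
        struct.pack("<ii", request_id, packet_type)
        + body.encode("utf-8")
        + b"\x00\x00"  # body terminator plus the empty trailing string
    )
    # The size field counts everything after itself.
    return struct.pack("<i", len(payload)) + payload


def decode_packet(data: bytes) -> tuple[int, int, str]:
    """Parse one complete RCON packet into (request_id, type, body)."""
    (size,) = struct.unpack_from("<i", data, 0)
    request_id, packet_type = struct.unpack_from("<ii", data, 4)
    body = data[12:4 + size - 2].decode("utf-8")
    return request_id, packet_type, body


# Assumed example: ask the game to print the current tick back over RCON.
command = "/silent-command rcon.print(game.tick)"
packet = encode_packet(7, SERVERDATA_EXECCOMMAND, command)
print(decode_packet(packet))
```

A real client would also send a `SERVERDATA_AUTH` packet first and read responses from a TCP socket; both are omitted here to keep the framing logic isolated.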
FLE exposes a typed object model of Factorio entities and a set of 23 core methods divided into three groups.
| Category | Representative methods | Purpose |
|---|---|---|
| Pure queries | get_entities, inspect_inventory, get_research_progress, get_resource_patch, get_prototype_recipe | Read game state without modifying it |
| State modifications | place_entity, place_entity_next_to, pickup_entity, rotate_entity, connect_entities, set_entity_recipe | Build and edit factory layout |
| Resource management | insert_item, extract_item, harvest_resource, craft_item, set_research | Move items, perform crafts, drive research |
Returns are strongly typed, so an agent can call nearest(Resource.IronOre) and receive a position object with x and y fields, or call get_entities(Furnace) and iterate over a list of typed furnace records with attributes like position, status, output inventory, and burner inventory. This typing turns the environment into something closer to a domain specific language than a raw game API and gives the model strong hooks for compositional programming.
Long-horizon episodes generate enormous logs. FLE addresses this with a hierarchical memory scheme: at every step, the most recent 32 observations remain verbatim in the context window, while older observations are summarized into 1,024 token reports. This keeps the prompt tractable even after thousands of steps, which would otherwise exhaust a model's context window. The summarization is itself performed by an LLM, which becomes a subtle hyperparameter of the evaluation.
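The two-level memory described above can be sketched as a sliding window plus a rolling summary. The window size and token budget come from the text; the `HierarchicalMemory` class, the eviction-time summarization, and the truncating stand-in summarizer are assumptions of this sketch (FLE uses an LLM for summarization, and its exact trigger is not described here).

```python
from typing import Callable

RECENT_WINDOW = 32           # observations kept verbatim
SUMMARY_TOKEN_BUDGET = 1024  # size cap for the summary of older steps


def truncate_summarizer(texts: list[str], token_budget: int) -> str:
    """Stand-in for an LLM summarizer: keep the last whitespace tokens."""
    tokens = " ".join(texts).split()
    return " ".join(tokens[-token_budget:])


class HierarchicalMemory:
    """Keep recent observations verbatim and fold older ones into a summary."""

    def __init__(self,
                 summarize: Callable[[list[str], int], str] = truncate_summarizer,
                 window: int = RECENT_WINDOW,
                 budget: int = SUMMARY_TOKEN_BUDGET) -> None:
        self.summarize = summarize
        self.window = window
        self.budget = budget
        self.summary = ""
        self.recent: list[str] = []

    def add(self, observation: str) -> None:
        self.recent.append(observation)
        if len(self.recent) > self.window:
            # Fold the oldest verbatim observation into the running summary.
            evicted = self.recent.pop(0)
            parts = [self.summary, evicted] if self.summary else [evicted]
            self.summary = self.summarize(parts, self.budget)

    def context(self) -> str:
        """Build the prompt text: summary first, then recent observations."""
        header = [f"[summary of older steps] {self.summary}"] if self.summary else []
        return "\n".join(header + self.recent)


memory = HierarchicalMemory()
for step in range(40):
    memory.add(f"obs-{step}")
print(len(memory.recent), memory.summary)
```

Swapping `truncate_summarizer` for a model call changes what the agent remembers about early steps, which is why the choice of summarizer acts as a hyperparameter of the evaluation.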
The v0.3 release packages the system for easy use. Installation is via PyPI:
pip install factorio-learning-environment
# Optional extras for evaluation, MCP, or PostgreSQL logging
pip install "factorio-learning-environment[eval,mcp,psql]"
The CLI then provides commands to start a Docker based cluster of Factorio servers and run an evaluation sweep:
fle cluster start
fle eval --config configs/gym_run_config.json
The headless renderer in v0.3 lets agents run without the official Factorio game client, making large parallel sweeps cheap on cloud hardware. Headless mode also exposes a pixel observation channel for multimodal experiments, although the headline benchmark remains text and code only.
v0.3 added an OpenAI Gym style interface, so FLE can be plugged into off the shelf RL pipelines that use step, reset, and reward semantics. The environment also ships with a Model Context Protocol server, which allows Claude Code, other MCP clients, and IDE-style agent harnesses to drive the environment without writing custom glue code. This bridge was the headline demonstration of v0.3.0: a livestream that showed Claude Code building factories interactively over many hours of play.
The original paper evaluated six frontier and open weight models in March 2025. Subsequent updates added results for newer systems, including models from the GPT-5, Claude Opus 4.1, Gemini 2.5, and Grok 4 families. The v1 leaderboard remains the most widely cited result because it offers a clean comparison across a single moment in time.
| Model | Lab-play success rate | Tasks solved (of 24) |
|---|---|---|
| Claude 3.5 Sonnet | 21.9 ± 1.3% | 7 |
| GPT-4o | 16.6 ± 1.4% | 5 to 6 |
| DeepSeek v3 | 15.1 ± 1.7% | 4 to 5 |
| Gemini 2 Flash | 13.0 ± 1.3% | 4 |
| Llama 3.3 70B | 6.3 ± 1.0% | 2 |
| GPT-4o mini | 5.2 ± 0.6% | 1 to 2 |
Claude 3.5 Sonnet was the clear leader. It was the only model to consistently complete intermediate electronics tasks like green circuits, and it occasionally automated steel plate production, a task that requires coordinating fuel, ore, and a second smelting stage. Even so, no v1 era model came close to automating any of the late game lab-play targets in the 50 to 100 machine range.
| Model | Open-play Production Score (median) | Milestones reached |
|---|---|---|
| Claude 3.5 Sonnet | 293,206 | 28 |
| GPT-4o | mid five figures | mid twenties |
| DeepSeek v3 | lower five figures | low twenties |
| Gemini 2 Flash | lower five figures | low twenties |
| Llama 3.3 70B | 54,998 | 26 |
| GPT-4o mini | low four figures | mid teens |
Claude 3.5 Sonnet's most notable open-play accomplishment was discovering that investing science packs into the electric mining drill technology unlocks a much higher steady state throughput than any amount of manual ore harvesting can match. Llama 3.3 70B, despite a much lower lab-play score, posted a strong open-play milestone count by chaining many simple actions over its 5,000 step budget, an early hint that exploration depth and lab-play planning skill are not perfectly correlated.
With v0.3.0 the authors re-ran the benchmark on a broader set of frontier models and reported qualitative results that show the ordering Claude > GPT > Gemini > Grok in lab-play, with absolute scores improving across the board. They note that the latest generation of open weight models has now caught up with the previously state of the art Claude 3.5 Sonnet number, while Grok 4 tends to enter degenerate debug loops where it repeats the same failing action many times, and GPT-5 recovers more gracefully than its predecessors. Claude Opus 4.1, the strongest model in the v0.3 evaluation, displayed an error rate of around 23% on lab-play with essentially zero syntactic errors, meaning its failures were pragmatic rather than malformed.
The v0.3 evaluations also revealed a recurring pattern: even when frontier models can build simple structures, they tend to fall back on semi-manual strategies, hand-feeding furnaces or directly crafting items rather than designing true automation pipelines that scale.
The FLE paper devotes considerable space to qualitative analysis. Watching transcripts of a successful Claude 3.5 Sonnet run, the authors identify a recurring sequence of behaviors that begins with manual gathering, transitions into ad hoc crafting, and only later attempts to set up belts and assemblers. Even strong models rarely plan an entire factory before starting; instead they iteratively bolt new subsections onto the side of an existing layout, which mirrors how many human players begin the game.
The authors interpret these patterns as evidence that the bottleneck is not knowledge of the game (most models have absorbed Factorio wiki content during pretraining) but the ability to translate that knowledge into stable plans, executable code, and bug-free spatial layouts.
FLE is part of a wider movement to evaluate language models in interactive environments rather than on static datasets. Each comparable benchmark stresses a different axis of capability.
| Benchmark | Environment | Focus | Time horizon | Closest analogue to FLE |
|---|---|---|---|---|
| BALROG | Six games including NetHack, BabyAI, Crafter | Multi-game agentic reasoning across small environments | Hundreds of steps | Closest in spirit, but no single game scales to FLE's complexity |
| Voyager | Minecraft via Mineflayer | Open-ended skill library acquisition with GPT-4 | Hours of in game time | Shares the open-ended ethos; uses code generation but no formal score |
| MineDojo | Minecraft via internet-scale data | Multi-task internet-grounded learning | Variable | Game-based and open world, but less industrial |
| SWE-bench | GitHub issues in real Python repositories | Coding patch generation against test suites | Single PR per task | Both stress program synthesis, but SWE-bench has fixed solutions |
| MLE-bench | Kaggle competitions | End to end machine learning pipelines | Hours per task | Both reward building working systems under unbounded score caps |
| AgentBench | Eight code, web, and game environments | Broad coverage of agentic tasks | Tens to hundreds of steps | Broader but shallower than FLE |
| Crafter | 2D survival game | Achievement based exploration | Short | A toy ancestor of FLE in style |
| NetHack Learning Environment | The roguelike NetHack | Procedural exploration and survival | Thousands of steps | Shares the procedurally generated map and unbounded difficulty |
| ALFWorld | Text and visual household tasks | Embodied instruction following | Tens of steps | Tests grounded planning but at a much smaller scale |
FLE differentiates itself on three axes. First, its action space is Python code over a typed entity API rather than a finite discrete action list or natural language commands, which puts unusual pressure on program synthesis. Second, its score scales over six orders of magnitude, far more than any other published agentic benchmark. Third, it requires sustained competence over thousands of decisions, where most existing benchmarks max out at a few hundred steps.
In the framing of the paper, FLE is closer to a flight simulator for agentic AI research than to a multiple-choice test. Models with similar scores on MMLU or HumanEval can produce wildly different factories, and the FLE leaderboard often aligns more closely with practical benchmarks like GDPval that estimate economic productivity than with academic exam suites.
The authors are forthright about FLE's limitations. The first is incomplete entity coverage. Mid and late game elements such as trains, logistics robots, and programmable circuit networks are partially modeled, which restricts the very late tiers of factory design. The second is the absence of a human baseline. Although expert players routinely build factories thousands of times more productive than the best LLM, none have attempted to do so through the Python REPL with a strict step budget, so it is unclear what a top human would actually score in the same conditions.
A third concern is reward hacking. Because Production Score depends on cumulative production minus consumption, an agent that prints money via a degenerate loop could in principle inflate its score. The authors report that no current model exploits this, but they flag it as a risk that may emerge in stronger systems.
A fourth limitation is that FLE is single agent only in its current form, even though Factorio has well developed multiplayer support. The authors list cooperative and competitive multiplayer scenarios as natural directions for future work. Finally, the benchmark imposes a particular memory and summarization scheme that interacts with model performance, which complicates direct comparisons across different model families with different context windows.
More subtly, FLE inherits idiosyncrasies from Factorio itself, including biters, surface mechanics, and recipe quirks that may not generalize to other automation problems. The system is therefore best understood as one slice of agentic capability rather than a universal yardstick.
FLE has had visible influence on how the field thinks about agent evaluation. The benchmark shows that even models that excel at coding interview problems and pass advanced math exams collapse when asked to maintain a coherent industrial plan across thousands of API calls. That signal has fed into several active research threads.
FLE has also attracted attention outside academia. Tech press coverage in The Decoder, Gigazine, and AI newsletters has framed it as a sober antidote to the cycle of benchmark saturation, and the project's Discord and YouTube tutorials have built a small community of practitioners experimenting with custom agents. Public livestreams of Claude Code running FLE drew large audiences, and several research labs now use the environment for internal model regression testing.
The themes that FLE highlights (sustained coherence, spatial reasoning, and debugging across long sessions) are increasingly central to discussions of AI agents more broadly. Benchmarks aimed at end-to-end software engineering, like SWE-bench and SWE-Lancer, or at scientific workflows, like MLE-bench, share much of FLE's emphasis on translating high-level goals into executable code over many steps. FLE is unusual in offering an environment where the goal itself is open ended and the metric is continuous, which lets researchers study how agents allocate effort without external instruction.
The following snippet shows a minimal lab-play run against a running Factorio cluster.
# Install the package
pip install factorio-learning-environment
# Start a local cluster of Factorio servers via Docker
fle cluster start
# Run an evaluation sweep with a configured agent and task list
fle eval --config configs/gym_run_config.json
Writing a custom agent is a matter of subclassing the supplied agent interface and emitting Python programs in response to observations. A simplified loop looks like the following.
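from fle import FactorioEnvironment
from fle.entities import Position, Resource, Furnace, Drill
# Create a lab-play environment for a single target item.
env = FactorioEnvironment(mode="lab-play", task="iron_plate")
obs = env.reset()
# my_agent is a placeholder for a user-supplied agent object; it is not defined here.
while not obs.done:
    # The agent writes a Python program to execute against the live game.
    program = my_agent.respond(obs)
    # Run the program on the server and collect the next observation.
    obs = env.step(program)
print("Final Production Score:", obs.production_score)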
The full repository contains example agents that wrap OpenAI, Anthropic, and open weight model APIs, as well as tracing utilities that produce HTML transcripts of every API call and observation, useful for both debugging and qualitative analysis.
The FLE roadmap as of 2026 calls out several directions.
| Direction | Status | Notes |
|---|---|---|
| Expanded entity coverage | Ongoing | Adding trains, logistics robots, programmable circuits |
| Multiplayer and cooperative settings | Planned | Letting multiple agents share a factory |
| Adversarial and biter combat | Planned | Defensive engineering against the game's hostile creatures |
| Pixel and multimodal observations | Available since v0.3 | Used for early experiments with vision models |
| MCP and IDE agent integrations | Available since v0.3 | Enables Claude Code, Cursor, and other clients to run FLE directly |
| Reinforcement learning baselines | Research phase | Compares specialized policies against language model agents |
| Human expert baselines | Open question | The authors invite the community to run controlled human trials |
| Curriculum learning | Open | Sequencing lab-play tasks for training rather than evaluation |
The authors emphasize that FLE is meant to be a moving target. They expect the lab-play leaderboard to be largely saturated by the most capable models within a few years, at which point the focus is likely to shift to longer open-play horizons, multi-agent coordination, and integration with the game's combat systems.
FLE was published at NeurIPS 2025 in the Datasets and Benchmarks Track and has been broadly well received in the machine learning community. Researchers have praised its non-saturating design, its emphasis on long horizons, and its careful API design. Some have noted that the choice of Python as an action space favors models with strong code generation capabilities, which may understate the abilities of systems that excel at natural language or visual reasoning. Others have pointed out that the lab-play setup, with pre-unlocked technologies and fixed starting resources, simplifies away some of the most interesting strategic questions that open-play surfaces.
The game's commercial publisher, Wube Software, has not endorsed FLE officially, but the project's compatibility with Factorio 2.0 and its respect for the game's terms of service have allowed it to coexist with the official game community. Discussion threads on the official Factorio forums and on r/factorio have responded positively, with some experienced players proposing additional lab-play targets and others volunteering to run human baseline trials.
In the broader AI safety and capability evaluation discussion, FLE has been cited as a useful counterweight to coding benchmarks like SWE-bench because it isolates planning and execution rather than rewarding pure code synthesis ability. Several frontier labs have publicly reported running internal versions of FLE as part of their model release evaluations.
FLE captures a moment in AI evaluation when the field is moving away from saturated multiple choice tests toward open ended, environment-grounded benchmarks. By embedding agents in a game that humans have spent thousands of hours optimizing, it provides a window into capabilities that matter for real-world automation: planning over long horizons, programming against typed APIs, recovering from spatial mistakes, and choosing what to build next when no one is supervising the choice.
Its headline finding, that even the strongest publicly available models fully automate only a small fraction of lab-play tasks and reach only the early game in open-play, is now widely cited as evidence that current LLMs have a planning ceiling well below human industrial competence. At the same time, the rapid progress between the original paper's results and the v0.3 re-evaluation shows that the gap is closing fast. FLE will likely remain a useful instrument for charting that progress for several more years, both as a research benchmark and as a public dashboard for tracking how language model agents handle a hard, scalable industrial problem.