Voyager (Minecraft LLM agent)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,960 words
Voyager is an open-ended embodied agent that uses a large language model to play Minecraft by writing, executing, and storing JavaScript programs against the Mineflayer bot API. It was introduced in May 2023 by Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi "Jim" Fan, and Anima Anandkumar, working across NVIDIA, the California Institute of Technology, Stanford University, the University of Texas at Austin, and Arizona State University.[^1] The system combines three components: an automatic curriculum that proposes the next task to attempt, an iterative prompting mechanism that writes and refines code using environmental feedback, execution errors, and a self-verification critic, and a skill library that indexes successful code by an embedding of its natural-language description so that prior skills can be retrieved and composed into new behaviors.[^1][^2] Voyager runs against the live Minecraft game through a Mineflayer JavaScript controller and uses GPT-4 for code generation and curriculum proposals, with no model fine-tuning required.[^2][^3] In experiments reported in the paper, it discovered 3.3 times as many unique items, traveled 2.3 times farther, and unlocked the wooden, stone, and iron tech-tree milestones substantially faster than ReAct, Reflexion, and Auto-GPT baselines, and it was the only method to reach diamond tools.[^1][^4] The paper was later accepted to the Transactions on Machine Learning Research in March 2024 and received the journal-track certification for the International Conference on Learning Representations 2025.[^5]
| Field | Value |
|---|---|
| Project name | Voyager |
| Full title | Voyager: An Open-Ended Embodied Agent with Large Language Models |
| First arXiv preprint | 25 May 2023 (v1), revised 19 October 2023 (v2) |
| arXiv identifier | 2305.16291 |
| Venue | Transactions on Machine Learning Research (TMLR), accepted 19 March 2024 |
| Authors | Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar |
| Affiliations | NVIDIA, Caltech, UT Austin, Stanford, ASU |
| Project page | voyager.minedojo.org |
| Code repository | github.com/MineDojo/Voyager (MIT license) |
| Environment | Minecraft (Java Edition) via Mineflayer bot API |
| Backbone model | GPT-4 (gpt-4-0314) for reasoning and code; GPT-3.5-turbo and text-embedding-ada-002 for auxiliary roles |
| License | MIT (code), CC BY 4.0 (paper) |
The Voyager paper was uploaded to arXiv on 25 May 2023 as preprint 2305.16291, two days before a related effort, Ghost in the Minecraft (GITM), appeared as preprint 2305.17144 on 27 May 2023.[^1][^6] Both were part of a sudden wave of work in late spring and summer 2023 that asked whether the new generation of large language models could replace handcrafted policies in Minecraft, an environment that had been a benchmark for open-ended reinforcement learning since the MineRL competitions and OpenAI's Video PreTraining (VPT) demonstration in 2022. Earlier Minecraft work largely treated the game as an RL problem, training pixel-conditioned policies with imitation learning, behavior cloning from gameplay videos, or model-based methods, all of which required large amounts of compute and were typically specialized to a narrow set of tasks such as obtaining a diamond.[^6][^7]
Several of Voyager's authors had worked on the MineDojo platform, an open framework released in 2022 that exposes thousands of Minecraft tasks and a large internet-scraped corpus of Minecraft videos, wiki entries, and forum posts. MineDojo established the lab's interest in using internet-scale knowledge to drive agents that learn open-ended skills rather than optimizing a single reward function. Voyager extends that program by removing learned visual perception and learned low-level control entirely. Instead, it queries the environment through structured Mineflayer API calls (such as bot.dig, bot.placeBlock, or bot.craft) and treats the act of writing those calls in JavaScript as the entire learning loop, with GPT-4 as the only "policy" and a vector database of accumulated programs as long-term memory.[^1][^2]
The paper frames Voyager as the first LLM-powered "lifelong learning" agent in an open-ended world, where lifelong learning is taken to mean three properties: continual exploration of new tasks driven by the agent's own evolving capabilities, accumulation of reusable skills without catastrophic forgetting, and zero-shot transfer of those skills to new worlds and new goals.[^1] The authors argue that Minecraft is an unusually suitable testbed for three reasons: progression is genuinely open-ended (the world is procedurally generated and effectively unbounded); the technology tree imposes a partial order on capabilities (you cannot mine iron until you have a stone pickaxe, you cannot mine diamond until you have an iron pickaxe, and so on); and the game reports in-game events verbosely enough that a text-only model receives useful feedback without vision.[^1]
Voyager's loop is a continuous interaction between four pieces: the Minecraft server (modified for additional event reporting through Fabric mods), the Mineflayer JavaScript bot connected to the server, the language model that proposes tasks and writes code, and a persistent vector database that stores successfully verified programs as named skills.[^2][^3] The agent does not see pixels; everything reaches GPT-4 as text describing inventory, nearby blocks and entities, biome, time of day, and similar state, plus the result of any code that has just been executed.
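To make the text-only interface concrete, the following is a minimal sketch of how structured bot state might be flattened into a prompt fragment. The `AgentState` class, its field names, and the output layout are assumptions made for illustration; they are not taken from the Voyager codebase.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    """Hypothetical snapshot of what the bot reports; field names are illustrative."""
    biome: str
    time_of_day: str
    inventory: dict[str, int] = field(default_factory=dict)
    nearby_blocks: list[str] = field(default_factory=list)
    nearby_entities: list[str] = field(default_factory=list)


def render_observation(state: AgentState, last_result: str = "") -> str:
    """Flatten structured state into the plain text a language model would read."""
    inventory = ", ".join(
        f"{name}: {count}" for name, count in sorted(state.inventory.items())
    )
    lines = [
        f"Biome: {state.biome}",
        f"Time: {state.time_of_day}",
        f"Nearby blocks: {', '.join(state.nearby_blocks) or 'None'}",
        f"Nearby entities: {', '.join(state.nearby_entities) or 'None'}",
        f"Inventory: {inventory or 'Empty'}",
    ]
    if last_result:
        # Result of the most recently executed program, if any.
        lines.append(f"Last execution result: {last_result}")
    return "\n".join(lines)


state = AgentState(
    biome="forest",
    time_of_day="day",
    inventory={"oak_log": 3, "stick": 2},
    nearby_blocks=["oak_log", "grass_block"],
)
observation = render_observation(state, last_result="Mined 3 oak_log")
print(observation)
```

Because the model never sees pixels, anything omitted from a rendering like this is invisible to it.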
The automatic curriculum is a GPT-4 prompt that, given the agent's state, inventory, biome, recent completed and failed tasks, and a short instruction emphasizing the goal of maximizing discovery, returns the next task to attempt as a short natural-language description (for example, "mine 1 wood log" or "craft a stone pickaxe").[^1][^2] The curriculum uses chain-of-thought prompting, asking the model first to reason about what is novel and feasible given the inventory, and then to produce the next task. The authors describe this as "in-context novelty search," because the LLM, biased by its world knowledge, tends to suggest tasks that are reachable but not yet attempted, producing a diverse curriculum without explicit reward shaping.[^1][^4]
When a proposed task fails repeatedly, the curriculum is asked either to decompose it into easier subtasks or to skip it for now, which keeps the agent from getting stuck on a single hard task. Temperature is set to 0 for most prompts but raised slightly (to 0.1) for the curriculum to encourage diversity in proposed tasks.[^4] The lack of any hard-coded task tree is important to the paper's claim of open-endedness: the only goal given to the agent is "discover as much as you can," and the technology-tree milestones reported in evaluations are emergent consequences of the LLM's prior knowledge of the game.[^1]
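The bookkeeping behind such a curriculum can be sketched as follows. This is an illustrative outline, not the paper's prompt: the `MAX_FAILURES` threshold, the prompt wording, and the `Task:` reply convention are all assumptions, since the article only says that tasks which fail "repeatedly" are decomposed or skipped.

```python
from dataclasses import dataclass, field

MAX_FAILURES = 3  # assumed threshold; the article only says "fails repeatedly"


@dataclass
class CurriculumHistory:
    completed: list[str] = field(default_factory=list)
    failures: dict[str, int] = field(default_factory=dict)

    def record(self, task: str, success: bool) -> None:
        if success:
            self.completed.append(task)
            self.failures.pop(task, None)
        else:
            self.failures[task] = self.failures.get(task, 0) + 1

    def stuck_tasks(self) -> list[str]:
        """Tasks that have failed often enough to be decomposed or skipped."""
        return sorted(t for t, n in self.failures.items() if n >= MAX_FAILURES)


def build_curriculum_prompt(observation: str, history: CurriculumHistory) -> str:
    """Assemble the text sent to the model; the wording is illustrative only."""
    parts = [
        "Goal: discover as many diverse things as possible.",
        observation,
        f"Completed tasks: {', '.join(history.completed) or 'None'}",
        f"Failed tasks: {', '.join(sorted(history.failures)) or 'None'}",
    ]
    stuck = history.stuck_tasks()
    if stuck:
        parts.append(f"Decompose or skip for now: {', '.join(stuck)}")
    # Chain-of-thought: ask for reasoning first, then the task on its own line.
    parts.append("Reasoning: explain what is novel and feasible. Then give: Task: <next task>")
    return "\n".join(parts)


def parse_task(model_reply: str) -> str:
    """Extract the final 'Task:' line from a chain-of-thought reply."""
    for line in reversed(model_reply.splitlines()):
        if line.startswith("Task:"):
            return line[len("Task:"):].strip()
    raise ValueError("reply contained no 'Task:' line")


history = CurriculumHistory()
history.record("mine 1 wood log", success=True)
for _ in range(3):
    history.record("craft a stone pickaxe", success=False)

prompt = build_curriculum_prompt("Inventory: oak_log: 1", history)
next_task = parse_task("Reasoning: I have a log but no planks.\nTask: craft 4 wooden planks")
```

The model reply passed to `parse_task` is hand-written here; in a real run it would come from the language model at the slightly raised curriculum temperature.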
Once a task is proposed, Voyager enters the iterative prompting loop. GPT-4 receives the task, the agent's state, a description of the available Mineflayer primitives, and the top-five most relevant skills retrieved from the skill library. It is then asked to produce a JavaScript function that solves the task using those primitives and any retrieved helpers. The function is sent to the Mineflayer bot, executed in the running Minecraft world, and the bot reports back three streams of feedback: the environment state after execution, any execution errors raised by the JavaScript runtime or the Mineflayer API, and a self-verification verdict from a second GPT-4 instance acting as a critic.[^1][^2][^4]
The critic is given the task description, the post-execution agent state, and the most recent code, and is prompted to decide whether the task has been completed and, if not, to suggest concrete fixes. If the critic says the task is incomplete or the runtime raised errors, the original GPT-4 instance is re-prompted with the previous code, the error messages, and the critic's suggestions, and produces a revised program. The loop runs for up to four rounds per task in the experiments reported in the paper. If all four attempts fail, the task is returned to the curriculum, which may decide to decompose it or shelve it.[^1][^4] The combination of execution errors, environment state, and a separate critic call is what the paper means by "self-verification," and ablations in the paper show that removing the critic substantially degrades tech-tree progress.[^4]
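The control flow of this refinement loop can be sketched in a few lines. The sketch below assumes three injected callables (`generate`, `execute`, `verify`) standing in for the code-writing model, the Mineflayer bot, and the critic; these names, the data classes, and the stub behaviors are hypothetical and serve only to show how the three feedback channels feed the next round.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ROUNDS = 4  # refinement budget reported in the text


@dataclass
class Execution:
    state: str          # environment state after running the program
    errors: list[str]   # runtime or API errors, empty if none


@dataclass
class Verdict:
    success: bool
    critique: str


def refine_until_verified(
    task: str,
    generate: Callable[[str, str], str],
    execute: Callable[[str], Execution],
    verify: Callable[[str, Execution, str], Verdict],
) -> tuple[Optional[str], int]:
    """Return (program, rounds_used) on success, or (None, MAX_ROUNDS) on failure."""
    feedback = ""
    for round_number in range(1, MAX_ROUNDS + 1):
        program = generate(task, feedback)
        result = execute(program)
        verdict = verify(task, result, program)
        if verdict.success and not result.errors:
            return program, round_number
        # Feed previous code, errors, state, and critique into the next attempt.
        feedback = "\n".join([
            f"Previous code: {program}",
            f"Errors: {'; '.join(result.errors) or 'None'}",
            f"State: {result.state}",
            f"Critique: {verdict.critique}",
        ])
    return None, MAX_ROUNDS


# Stub callables standing in for the language model and the bot (hypothetical).
def fake_generate(task: str, feedback: str) -> str:
    return "mineBlock('oak_log', 1)" if feedback else "mineBlok('oak_log', 1)"


def fake_execute(program: str) -> Execution:
    if "mineBlok" in program:
        return Execution("inventory: empty", ["ReferenceError: mineBlok is not defined"])
    return Execution("inventory: oak_log x1", [])


def fake_verify(task: str, result: Execution, program: str) -> Verdict:
    done = "oak_log" in result.state
    return Verdict(done, "" if done else "Fix the helper name.")


program, rounds = refine_until_verified(
    "mine 1 wood log", fake_generate, fake_execute, fake_verify
)
```

In this toy run the first attempt raises an error, the error text is fed back, and the second attempt passes verification.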
The iterative prompting design is closely related to ReAct-style and Reflexion-style loops in that the model reflects on feedback and revises, but the action space is different. ReAct and Reflexion typically emit a single textual action per step. Voyager emits a complete program that may run for many in-game ticks and call many primitives, and the feedback is collected only after the program terminates. This trade-off lets the model amortize reasoning over many low-level actions and makes the resulting behaviors directly storable as reusable units.[^1][^4]
The skill library is the agent's long-term memory. After a task is judged successful by the self-verifier, the corresponding JavaScript function is paired with a short natural-language description (also generated by the LLM) and stored together with an embedding of that description.[^1][^2] The embedding is produced by OpenAI's text-embedding-ada-002 model. When a new task arrives in a future iteration, the description of that task is embedded and the top-five nearest skills in the vector database are retrieved and prepended to the prompt as documentation and as importable helpers. The new program can call these stored skills directly, which lets the agent compose previously verified behaviors (such as mineWoodLog or craftWoodenPickaxe) into larger tasks (such as craftStonePickaxe).[^1][^4]
The skill library is therefore an in-context learning mechanism layered on top of an external vector store. Unlike the parameters of the LLM, it grows over the agent's lifetime, and unlike a flat episodic memory it is indexed semantically. The paper presents this design as a way to bypass catastrophic forgetting: because skills are stored as verified, self-contained programs rather than embedded in model weights, adding new skills cannot corrupt old ones. The authors also note that storing skills as code rather than as text traces makes them deterministic to re-execute, which matters when the same skill is reused in a different world or against a different starting inventory.[^1][^4] Architecturally the system resembles a primitive form of retrieval-augmented generation where the retrieved documents are executable functions rather than encyclopedia passages.
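A minimal sketch of description-indexed retrieval is shown below. To stay self-contained it replaces the real embedding model with a toy bag-of-words vector (`toy_embed`), so the similarity scores are only illustrative. The class and method names are assumptions, and `killPig` is a made-up skill name; `mineWoodLog` and `craftWoodenPickaxe` are the examples mentioned above.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

TOP_K = 5  # number of skills retrieved per task, as described above


def toy_embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real embedding model (assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Skill:
    name: str
    description: str
    code: str
    embedding: Counter


class SkillLibrary:
    def __init__(self) -> None:
        self._skills: dict[str, Skill] = {}

    def add(self, name: str, description: str, code: str) -> None:
        """Store a verified program, indexed by an embedding of its description."""
        self._skills[name] = Skill(name, description, code, toy_embed(description))

    def retrieve(self, task: str, k: int = TOP_K) -> list[Skill]:
        """Return up to k skills whose descriptions are most similar to the task."""
        query = toy_embed(task)
        ranked = sorted(
            self._skills.values(),
            key=lambda skill: cosine(query, skill.embedding),
            reverse=True,
        )
        return ranked[:k]


library = SkillLibrary()
library.add("mineWoodLog", "mine a wood log from a nearby tree",
            "async function mineWoodLog(bot) { /* ... */ }")
library.add("craftWoodenPickaxe", "craft a wooden pickaxe from planks and sticks",
            "async function craftWoodenPickaxe(bot) { /* ... */ }")
library.add("killPig", "hunt a pig for food",
            "async function killPig(bot) { /* ... */ }")

top = library.retrieve("craft a stone pickaxe", k=2)
print([skill.name for skill in top])
```

Indexing by description rather than by code means a new task phrased in natural language can find a relevant helper even when the function names share no tokens with it.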
The reference implementation released on GitHub uses Python for orchestration and JavaScript for the Mineflayer environment. The repository reports being roughly 66% JavaScript and 33% Python by line count and is licensed under MIT.[^3] Minimum requirements are Python 3.9 or later, Node.js 16.13.0 or later, an OpenAI API key with access to GPT-4 (the original experiments used the gpt-4-0314 snapshot and gpt-3.5-turbo-0301 as an auxiliary model), and a running Minecraft Java Edition instance with LAN enabled and a set of Fabric mods installed for additional event reporting.[^3][^4] The bot connects to that instance through Mineflayer, which is a long-running open-source JavaScript library that exposes high-level Minecraft primitives such as block placement, mob attacks, crafting, and pathfinding.[^8] The Python side runs the curriculum, code-generation, and verification prompts and persists the skill library to disk.
The paper's main experiments used GPT-4 at temperature 0 for code generation and verification and temperature 0.1 for the curriculum. The skill library was reset between independent experimental runs but allowed to grow continuously within a run. The token budget per prompt depended on the number of retrieved skills and the size of the running context; the authors report that empirically the prompt rarely exceeded GPT-4's 8K context window during training.[^4]
The components above compose into a loop that the authors describe as lifelong learning. At each cycle, the curriculum proposes a task; the iterative prompting mechanism produces a candidate program, executes it, and revises it for up to four rounds; the self-verification critic determines whether the task is complete; on success, the verified program is added to the skill library with its embedding; on failure, the curriculum either decomposes the task or sets it aside. Over many cycles the skill library grows, the prompts to GPT-4 get richer because retrieval surfaces more relevant prior skills, and the curriculum is encouraged by its prompt to suggest tasks that exercise new biomes, new materials, and new tools.[^1][^4]
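The outer cycle can be outlined as a short driver function. In this sketch `propose` and `attempt` are hypothetical callables standing in for the curriculum and for the whole generate, execute, and verify step; the toy ladder and prerequisite table below exist only to make the example runnable and are not how the real curriculum chooses tasks.

```python
from typing import Callable, Optional


def lifelong_loop(
    propose: Callable[[list[str], list[str]], str],
    attempt: Callable[[str, list[str]], Optional[str]],
    cycles: int,
) -> tuple[dict[str, str], list[str]]:
    """Run the outer loop and return (skill library, failed tasks)."""
    skills: dict[str, str] = {}
    completed: list[str] = []
    failed: list[str] = []
    for _ in range(cycles):
        task = propose(completed, failed)
        program = attempt(task, list(skills))
        if program is None:
            failed.append(task)  # the curriculum may decompose or shelve it later
            continue
        skills[task] = program   # only verified programs enter the library
        completed.append(task)
    return skills, failed


# Toy stand-ins: a fixed ladder replaces the LLM curriculum, and a task
# "succeeds" only if its prerequisite skill is already in the library.
LADDER = ["mine wood log", "craft wooden pickaxe", "craft stone pickaxe"]
PREREQ = {
    "craft wooden pickaxe": "mine wood log",
    "craft stone pickaxe": "craft wooden pickaxe",
}


def toy_propose(completed: list[str], failed: list[str]) -> str:
    remaining = [task for task in LADDER if task not in completed]
    return remaining[0] if remaining else LADDER[-1]


def toy_attempt(task: str, known: list[str]) -> Optional[str]:
    need = PREREQ.get(task)
    return f"// program for {task}" if need is None or need in known else None


skills, failed = lifelong_loop(toy_propose, toy_attempt, cycles=3)
```

Nothing in the driver updates model weights: the only state that changes between cycles is the library and the task history.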
A key property of this loop is that no external reward is required. The only training signal is the binary success or failure judgment of the self-verifier and the structured event log returned by Mineflayer. This is closer in spirit to curriculum learning driven by intrinsic motivation than to standard reinforcement learning with a hand-designed reward. It is also distinct from reinforcement learning from human feedback: there is no human in the loop at training time, and the LLM weights are never updated. All "learning" lives in the skill library, in the curriculum's running history of completed and failed tasks, and in the LLM's in-context use of these.[^1][^4]
The paper evaluates Voyager along three axes: exploration breadth, tech-tree mastery, and zero-shot transfer to new worlds and new tasks. The three baselines are ReAct, Reflexion, and Auto-GPT, all adapted for Minecraft by giving them the same Mineflayer action space; the original ReAct and Reflexion papers used text-only NLP benchmarks, so they had to be embodied through an adaptation layer.[^1][^4]
In an exploration-oriented run lasting 160 outer iterations, Voyager discovered 63 unique items in the Minecraft inventory. The paper reports this as 3.3 times the number of unique items found by the best baseline within the same iteration budget; ReAct and Reflexion stalled early in the wood tier and Auto-GPT plateaued in the stone tier. The agent also traveled 2.3 times farther in cumulative distance than the baselines, which the authors take as evidence that the curriculum is genuinely pushing the agent into new biomes rather than circling near the spawn point.[^1][^4]
The authors then measure how quickly each method unlocks the wooden, stone, iron, and diamond tiers of the Minecraft technology tree, averaging over three random seeds. The reported figures are:[^4]
| Tier | Voyager (iterations) | Speedup vs. previous SOTA |
|---|---|---|
| Wooden tools | 6 ± 2 | 15.3 times faster |
| Stone tools | 11 ± 2 | 8.5 times faster |
| Iron tools | 21 ± 7 | 6.4 times faster |
| Diamond tools | 102 | Only Voyager reached diamond |
ReAct and Reflexion failed to reach any tech tier in the iteration budget given in the paper, and Auto-GPT reached wooden, stone, and iron tiers but did not unlock diamond. The "previous state of the art" in the speedup column therefore refers to Auto-GPT, the only one of the three baselines to unlock any tier within the iteration budget used in the paper.[^1][^4]
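As a rough consistency check, the speedup column can be turned back into approximate baseline iteration counts. The snippet assumes that each speedup is the ratio of the baseline's mean iterations to Voyager's mean iterations, which the table implies but does not state; the inputs are copied from the table, and the outputs are back-of-the-envelope estimates rather than figures reported in the paper.

```python
# Voyager's mean iterations and the reported speedups, copied from the table.
voyager_iterations = {"wooden": 6, "stone": 11, "iron": 21}
reported_speedup = {"wooden": 15.3, "stone": 8.5, "iron": 6.4}

# Assumption: speedup = baseline mean iterations / Voyager mean iterations.
implied_baseline = {
    tier: round(voyager_iterations[tier] * reported_speedup[tier], 1)
    for tier in voyager_iterations
}
print(implied_baseline)
```

Under that assumption the baseline would need on the order of 92, 94, and 134 iterations for the wooden, stone, and iron tiers respectively.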
The third evaluation tests whether the skill library learned in one world transfers to fresh worlds with novel goals. The authors place the learned library into an empty-inventory, freshly seeded world and ask Voyager to solve four held-out tasks: obtain a diamond pickaxe, craft a golden sword, fill a lava bucket, and craft a compass. Voyager solved all four within 50 iterations (mean iterations to first success: 19 ± 3, 18 ± 7, 21 ± 5, and 18 ± 2 respectively), while ReAct, Reflexion, and Auto-GPT solved none of the four within the same budget.[^1][^4] The paper highlights this result as the clearest evidence that the skill library is not just memorizing the training world but is composing reusable, world-agnostic behaviors.
Ablations in the paper remove each of the three components in turn. Removing the skill library (forcing the agent to write every program from scratch) substantially slows tech-tree progress and causes diamond to become unreachable in the iteration budget. Removing the self-verifier (using only environment errors and state) causes failure rates on intermediate tasks to climb because the agent does not notice "silent" failures where the code ran without errors but did not actually achieve the goal. Removing the automatic curriculum (replacing it with a fixed task list) forces the authors to hand-engineer a sequence that ends up looking like a primitive technology tree, which constrains the agent's exploration and reduces unique-item counts.[^1][^4] The ablations are presented as evidence that all three components are required: any two of them without the third produce significantly weaker agents.
The headline comparison in the paper places Voyager alongside three text-agent baselines that were each prominent in the broader AI agent discourse of early 2023.
| Method | Action space | Memory | Learns to compose skills? | Diamond reached in budget? |
|---|---|---|---|---|
| Voyager | JavaScript programs over Mineflayer API | Vector-indexed skill library, persisted across tasks | Yes, via retrieved-and-composed code | Yes (102 iterations) |
| ReAct (adapted) | Single textual action per step | Recent context only | No | No |
| Reflexion (adapted) | Single textual action per step plus reflection traces | Reflection trace appended to prompt | No | No |
| Auto-GPT (adapted) | Single textual goal-decomposed action | Append-only conversation history | Partial (only via goal decomposition) | No |
ReAct (Yao et al., 2023) interleaves reasoning and actions as alternating "Thought" and "Action" tokens, designed originally for question answering and web navigation tasks. Adapting it to Minecraft means asking the LLM to emit a single Mineflayer call per step rather than a complete program, which the Voyager paper argues is too low a level of abstraction for a verbose game environment.[^1][^4]
Reflexion (Shinn et al., 2023) extends ReAct by maintaining a verbal reflection on previous attempts and appending it to the prompt for the next attempt. In Minecraft, this helps the agent avoid repeating identical mistakes but does not give it any way to reuse a successful behavior in a new task, because the reflections are textual descriptions of episode outcomes rather than callable code.[^1][^4]
Auto-GPT (released in March 2023 as an open-source project) uses an LLM to maintain a persistent goal queue and to decompose goals into subgoals, with a long-running event loop. In the Voyager evaluation it performs better than ReAct or Reflexion because its decomposition is closer to what a Minecraft technology tree requires, but it still lacks a structured skill library, so its append-only history bloats the prompt over time.[^1][^4]
A common thread across the baselines is that none of them treats the agent's policy as a growing library of executable code. Voyager's central design decision is to make code the unit of skill and to retrieve those code units by embedding similarity, which is the property the ablations attribute the largest performance gains to.[^1][^4]
Voyager became one of the most cited Minecraft-as-benchmark agent papers of 2023 and a frequent reference point for any subsequent work that proposed a "skill library" for LLM agents. Several reasons are worth noting.
First, it provided a concrete demonstration that a single frozen LLM, given the right scaffolding, could make sustained progress on a hard exploration benchmark without any reward shaping or gradient-based training, clearly outperforming other LLM-agent baselines. This had been argued in the abstract by tool-use proponents, but Voyager grounded the claim in a difficult, well-known game with a quantitative tech-tree comparison.[^1][^4]
Second, Voyager popularized "skill library" as an architectural pattern. Subsequent agent papers, including ones outside Minecraft, adopted the idea that successful behaviors should be stored as named, embedded, retrievable code units rather than as freeform text. The skill library can also be viewed as an early example of agentic RAG where the retrieved content is executable.[^1][^9]
Third, the system showed that environment-grounded execution errors are a useful, almost free, training signal for an LLM agent. Voyager's iterative refinement loop uses runtime errors as one of three feedback channels and the paper's ablations show that even just the error messages, without the self-verifier, suffice to recover much of the wooden- and stone-tier performance. This contributed to a broader 2023 to 2024 trend of "execute then revise" LLM coding workflows.[^4]
The paper has been widely cited as one of the inspirations for follow-on Minecraft and embodied-agent systems. The next subsection covers the most directly comparable works in that line.
Ghost in the Minecraft (Zhu et al., 2023) appeared on arXiv as preprint 2305.17144 two days after Voyager and proposed an alternative LLM-driven architecture composed of an LLM Decomposer, an LLM Planner, and an LLM Interface that emits structured keyboard and mouse actions.[^6] GITM relies more heavily on text-based knowledge collected from the Minecraft wiki to seed its planning module, and unlike Voyager it does not maintain a vector-indexed skill library of executable programs. GITM reports reaching 67.5% success on the "ObtainDiamond" task and unlocking 100% of the Overworld technology tree items, and the authors highlight that they require only a single 32-core CPU node for the agent's planning, in contrast to the thousands of GPU days that OpenAI's Video PreTraining (VPT) used.[^6] The two papers are usually read together as the inception of the LLM-driven Minecraft agent line.
Steve-Eye (Zheng et al., 2023, arXiv 2310.13255) extended the LLM-Minecraft line to multimodality by attaching a visual encoder to a large language model and training the combined model on roughly 850,000 multimodal Minecraft instruction pairs.[^10] Unlike Voyager, Steve-Eye fine-tunes the model rather than freezing it, and unlike Voyager it consumes pixel observations from the game rather than structured event logs. The paper, presented at ICLR 2024, reports improved environmental captioning and reasoning over BLIP-2 style baselines and explicitly cites Voyager as motivation, framing visual perception as the next axis along which an LLM Minecraft agent must grow.[^10] Steve-Eye is the first entry in a longer "STEVE" sequence of Minecraft agent papers that build progressively more elaborate planning, memory, and visual modules on top of an LLM.
OmniJARVIS (Wang et al., NeurIPS 2024, arXiv 2407.00114) proposed a unified vision-language-action model for Minecraft trained on a single tokenizer covering visual, textual, and behavior trajectories.[^11] Where Voyager treats action selection as code generation and Steve-Eye treats it as a multimodal text decision, OmniJARVIS treats it as decoding tokens from a single learned vocabulary, with a self-supervised behavior encoder producing discrete action tokens and an imitation-learning policy decoder consuming them. OmniJARVIS was published as a NeurIPS 2024 poster and is the direct descendant of the earlier JARVIS-1 and MP5 Minecraft systems by the CraftJarvis group.[^11] It explicitly cites Voyager and GITM as the language-only baselines that it sets out to improve upon by integrating vision and action under one model.
Voyager has been referenced as a methodological precedent across the broader AI agent literature: the design pattern of "LLM plus growing code library plus self-verifier" appears in software-engineering agents, robotics policy programs (where the action space is a robotics-API call instead of a Mineflayer call), and household-task agents in simulated kitchens. The paper's central claim that skill composition is the missing ingredient for open-ended LLM agents has been adopted, with adaptations, by many of those downstream systems. Voyager is also frequently used as a teaching example for tool use in LLM agents and for AI code generation as an action space.[^9]
The Voyager paper is candid about several limitations.
The most obvious is the dependence on GPT-4. The authors report that GPT-3.5 was substantially worse at producing correct Mineflayer code and at acting as a verifier, and that performance with smaller open models was not sufficient to reproduce the tech-tree results. This makes Voyager expensive to run for long horizons and ties its capabilities to a specific commercial API; the original experiments used the gpt-4-0314 snapshot that has since been retired by OpenAI, so exact reproduction of the original numbers is no longer possible against the same model.[^1][^4]
A second limitation is the lack of vision. Voyager interacts only with the Mineflayer event stream and structured state representations. Tasks that require visual reasoning, such as recognizing unusual structures, reading natural-language signs placed in the world, or interpreting the appearance of mobs, are out of scope. Steve-Eye and OmniJARVIS, among others, were positioned in part as responses to this constraint.[^10][^11]
A third limitation is the brittleness of the iterative prompting loop. The four-attempts-per-task budget is a hard cap, and failures that require more than four refinements (for example, tasks that have rare but critical edge cases) are simply abandoned. The authors mention that performance on certain late-game tasks would likely improve with a larger refinement budget but that token costs grow rapidly.[^4]
Fourth, the curriculum and the self-verifier are themselves LLM prompts and can fail silently. The self-verifier can produce false positives, marking a task as completed when it was not, especially when the agent state representation is incomplete (for example, when a relevant block is just out of the agent's reported viewing distance). The skill library then stores a program tagged with a misleading description, which can poison future retrievals. The paper notes this failure mode and suggests stricter verification heuristics as future work but does not provide a definitive solution.[^4]
Finally, although Voyager outperforms its 2023-era LLM baselines, it does not compare against the strongest learned RL baselines from earlier Minecraft work, such as VPT or DreamerV3, on identical task setups. Ghost in the Minecraft did report such comparisons and noted vastly better sample efficiency for LLM-based agents, but Voyager itself focuses on the ReAct/Reflexion/Auto-GPT axis. The paper's contributions to the Minecraft literature should therefore be read primarily as a demonstration about LLM agents rather than as a definitive benchmark against learned policies.[^4][^6]
The Voyager codebase reached widespread visibility on GitHub shortly after release. As of mid-2026 the MineDojo/Voyager repository reports roughly 6,900 stars and over 670 forks, distributed under the MIT license with separate releases of learned skill libraries for task-specific inference.[^3] The official project page hosts videos of the agent unlocking the tech tree, the camera-ready paper, and pointers to the codebase. The paper has been an assigned reading in multiple university courses on LLM-based agents and is one of the canonical references in survey articles on embodied AI and LLM-driven game agents written since 2023.[^9]
The TMLR camera-ready version, accepted on 19 March 2024, is the formal publication of record.[^5] The journal-track certification for ICLR 2025 (denoted "iclr.cc/ICLR/2025/Journal_Track") added a conference presentation in 2025 to the paper's record. The associated OpenReview page lists Aleksandra Faust as the action editor and reports submission number 1701.[^5]