Droidlet
Last reviewed
Jun 3, 2026
Sources
7 citations
Review status
Source-backed
Revision
v1 · 1,532 words
Droidlet is an open-source platform from Facebook AI Research (now Meta AI) for building embodied AI agents. Released in 2021, it is a modular framework for prototyping agents that connect natural language understanding, perception, memory, and action, so that an agent can map human language commands to actions in a world. Droidlet is designed to work both in virtual environments (a Minecraft-like setting derived from Meta's CraftAssist project) and on physical robots (such as a LoCoBot), with individual components that can be swapped in and out. Its stated goal is to lower the engineering effort needed to combine machine-learning models and hand-written heuristics into a single working embodied AI agent.[1][2]
The platform was described in the paper "droidlet: modular, heterogenous, multi-modal agents" by Anurag Pratik, Soumith Chintala, Kavya Srinet, Dhiraj Gandhi, Rebecca Qian, Yuxuan Sun, Ryan Drew, Sara Elkafrawy, Anoushka Tiwari, Tucker Hart, Mary Williamson, Abhinav Gupta, and Arthur Szlam. It was posted to arXiv on January 25, 2021, with most authors at Facebook AI Research and one affiliation at Carnegie Mellon University.[1] Meta AI announced the public, open-source release in a blog post on July 28, 2021.[2]
The droidlet authors argue that most large-scale machine-learning systems are isolated to a single modality (perception, speech, or language) and trained on static datasets, while robotics struggles because real-world supervision is hard to gather and physical interactions are slow and expensive. Their proposed resolution is to stop treating "the" agent as a monolith and instead build "a family of agents, made up of a collection of modules, some of which may be heuristic, and others learned."[1] A carefully designed architecture, in this view, lets each component train on large static data when such data exists, or use a programmer's heuristics when good ones are available, without requiring every part to learn from the same experience. The broader research aim is to bring perception, language, and action onto one platform so that agents can eventually learn from the richness of real-world interaction rather than only from internet data.[1]
Droidlet separates an agent into a small number of cooperating parts so that any one can be replaced with a learned model or a scripted heuristic. The core event loop runs fast perceptual modules, updates memory, steps a controller (which checks for incoming commands and steps the highest-priority dialogue object), and then steps the highest-priority task.[1] The principal pieces are summarized below.
| Component | Role |
|---|---|
| Memory system | Stores and organizes what the agent perceives and does. It is the interface between perception and the controller, built from "memory nodes" backed by SQL tables, with support for semantic triples (for example, that a particular chair has the color red) and stored floating-point vectors such as locations, poses, and ML features.[1] |
| Perceptual API and modules | Process information from the world and write it into memory. Bundled modules are mostly visual: object detection and instance segmentation, real-time 2D object tracking, human pose estimation, and face detection and recognition. A deduplication step decides whether a detection is a new object or another view of a known one.[1] |
| Task queue | Holds lower-level, mostly self-contained world interactions such as moving to a coordinate, turning to face a direction, pointing, or grasping with an arm.[1][2] |
| Controller | Decides which tasks to run based on the state of memory. In the example agents it is split into a dialogue controller and a task controller; the dialogue controller places dialogue objects on a dialogue queue, which interpret human utterances and add tasks (or ask for clarification).[1] |
| Neural semantic parser | Translates human utterances into partially specified programs in a domain-specific language (DSL). The droidlet version reuses the BERT-based architecture from the earlier CraftAssist parser, with training data extended to cover new DSL constructs such as grasping and more powerful memory queries.[1] |
| Interpreter | The main dialogue object in the example agents. It reads the parsed logical form, queries memory through the DSL, and places concrete tasks on the task queue. It is heuristic in the released agents but is structured so its modules can be replaced with learned versions.[1] |
| Dashboard | A web interface for operating and inspecting the agent. It supports remote control (text or speech), diagnostics and state visualization (a 2D map of remembered objects, plus views of the task and dialogue queues and parsed logical forms), and data-annotation tools for labeling perception and language data.[1][2] |
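The memory row above describes memory nodes backed by SQL tables, with semantic triples attached to them. The following is a minimal sketch of that idea using Python's built-in `sqlite3` module. The table names (`nodes`, `triples`), the column layout, the `has_color` predicate, and the helper functions are all hypothetical and chosen for illustration; they are not droidlet's actual schema or API.

```python
import sqlite3


def make_memory():
    """Create an in-memory SQLite store with hypothetical node and triple tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE nodes (
            id   INTEGER PRIMARY KEY,
            kind TEXT,
            x REAL, y REAL, z REAL      -- stored location of the object
        );
        CREATE TABLE triples (
            subj INTEGER REFERENCES nodes(id),
            pred TEXT,
            obj  TEXT
        );
        """
    )
    return conn


def add_object(conn, kind, pos, **props):
    """Insert one memory node and one triple per keyword property."""
    cur = conn.execute(
        "INSERT INTO nodes (kind, x, y, z) VALUES (?, ?, ?, ?)", (kind, *pos)
    )
    node_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO triples (subj, pred, obj) VALUES (?, ?, ?)",
        [(node_id, pred, value) for pred, value in props.items()],
    )
    return node_id


def find_objects(conn, pred, obj):
    """Return (id, kind, x, y, z) rows for nodes that carry the given triple."""
    return conn.execute(
        "SELECT n.id, n.kind, n.x, n.y, n.z "
        "FROM nodes n JOIN triples t ON t.subj = n.id "
        "WHERE t.pred = ? AND t.obj = ? ORDER BY n.id",
        (pred, obj),
    ).fetchall()


conn = make_memory()
chair_id = add_object(conn, "chair", (1.0, 0.0, 2.5), has_color="red")
red_things = find_objects(conn, "has_color", "red")
```

Here `red_things` holds a single row for the chair, including its stored location. Keeping properties in a separate triple table is one way to let perception modules attach new attributes without changing the node schema.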
When a person types a command such as "go to the chair," the message passes through the dashboard. The semantic parser labels it as a command and emits a logical form; the interpreter then searches memory for an object tagged "chair" and, if one is found, places a movement task with the object's coordinates on the task queue. If no matching object is in memory, the interpreter instead places a clarification dialogue object on the dialogue queue.[1] This division means that a contributor who improves one module, for example with a better object detector, improves every agent that uses that module rather than only their own.[1][2]
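The event loop and the "go to the chair" flow can be sketched in a few lines of Python. This is a toy illustration only: the `ToyAgent` class, its method names, the keyword-based `parse` rule (standing in for the neural semantic parser), and the first-in-first-out queues (standing in for priority ordering) are all assumptions made for this sketch, not droidlet's real classes.

```python
from collections import deque


class ToyAgent:
    """Hypothetical stand-in for a droidlet-style agent; not the real API."""

    def __init__(self):
        self.memory = {}               # object tag -> (x, y, z) coordinates
        self.incoming = deque()        # raw utterances arriving from the dashboard
        self.dialogue_queue = deque()  # pending dialogue objects (FIFO here)
        self.task_queue = deque()      # pending low-level tasks (FIFO here)
        self.executed = []             # tasks that have been stepped
        self.said = []                 # messages sent back to the human

    def parse(self, utterance):
        """Keyword rule standing in for the neural semantic parser."""
        words = utterance.lower().strip(" .!?").split()
        if len(words) >= 3 and words[:2] == ["go", "to"]:
            return {"action": "MOVE", "target": words[-1]}
        return {"action": "UNKNOWN", "target": None}

    def step_dialogue(self, obj):
        """Step one dialogue object: interpret a logical form or ask for help."""
        kind, payload = obj
        if kind == "interpret":
            target = payload["target"]
            if payload["action"] == "MOVE" and target in self.memory:
                self.task_queue.append(("move", self.memory[target]))
            else:
                self.dialogue_queue.append(("clarify", target))
        elif kind == "clarify":
            if payload:
                self.said.append(f"I don't know of any {payload}.")
            else:
                self.said.append("Sorry, I did not understand.")

    def step(self, detections=None):
        # 1. Perception writes its results into memory.
        self.memory.update(detections or {})
        # 2. Controller: check for a command, then step one dialogue object.
        if self.incoming:
            logical_form = self.parse(self.incoming.popleft())
            self.dialogue_queue.append(("interpret", logical_form))
        if self.dialogue_queue:
            self.step_dialogue(self.dialogue_queue.popleft())
        # 3. Step one task from the task queue.
        if self.task_queue:
            self.executed.append(self.task_queue.popleft())


agent = ToyAgent()
agent.incoming.append("go to the chair")
agent.step(detections={"chair": (1.0, 0.0, 2.5)})
```

After this single step, `agent.executed` contains one move task carrying the chair's coordinates. If the chair had not been detected, the same step would instead leave a clarification object on the dialogue queue, and the next call to `step` would produce a message in `agent.said`.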
Droidlet also ships a model and data "zoo" so contributors can post new models or heuristics and share training data. Examples include a modified Mask R-CNN object detector with an added head predicting object properties (pre-trained on LVIS and fine-tuned on a locally collected dataset of 38 object classes and 298 properties), a SORT-based 2D tracker, a COCO-trained pose detector, face recognition combining dlib with Multi-task Cascaded Convolutional Networks (MTCNN), a laser-pointer handler for "look at where I am pointing," and a learning-based grasping model carried over from PyRobot.[1]
A central design claim is that the same agent abstractions can be instantiated in any setting where the perceptual and low-level action APIs can be implemented.[1] The released examples cover one physical and two virtual or simulated settings.
| Environment | Description |
|---|---|
| LoCoBot (real world) | A low-cost mobile robot driven through the PyRobot API as the low-level controller, used to demonstrate droidlet on physical hardware.[1] |
| AI Habitat (simulation) | Meta's simulator, used to run the same LoCoBot-style agent in simulation.[1] |
| Minecraft / CraftAssist (virtual) | A voxel-world assistant in Minecraft, used as the low-level controller for the virtual agent and inherited from Meta's earlier CraftAssist work.[1][3] |
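The idea that one agent can run wherever the low-level action API is implemented can be illustrated with a small interface and two interchangeable backends. Everything named here (`LowLevelBackend`, `VoxelWorldBackend`, `PlanarRobotBackend`, `go_to`) is hypothetical, and the choice of which coordinates a ground robot keeps is an assumption for the sketch; none of it reflects the PyRobot, AI Habitat, or CraftAssist interfaces.

```python
from abc import ABC, abstractmethod


class LowLevelBackend(ABC):
    """Hypothetical minimal action API that each environment must implement."""

    @abstractmethod
    def move_to(self, x, y, z):
        """Move the agent's body toward the given coordinates."""

    @abstractmethod
    def position(self):
        """Return the agent's current position."""


class VoxelWorldBackend(LowLevelBackend):
    """Toy voxel world: positions snap to integer block coordinates."""

    def __init__(self):
        self._pos = (0, 0, 0)

    def move_to(self, x, y, z):
        self._pos = (round(x), round(y), round(z))

    def position(self):
        return self._pos


class PlanarRobotBackend(LowLevelBackend):
    """Toy ground robot: keeps (x, y) and drops the third coordinate (assumed)."""

    def __init__(self):
        self._pos = (0.0, 0.0)

    def move_to(self, x, y, z):
        self._pos = (float(x), float(y))

    def position(self):
        return self._pos


def go_to(backend, target):
    """Agent-level task written once against the shared interface."""
    backend.move_to(*target)
    return backend.position()


target = (1.2, 2.6, 0.4)
voxel_result = go_to(VoxelWorldBackend(), target)
robot_result = go_to(PlanarRobotBackend(), target)
```

The `go_to` task does not know which backend it is driving, which is the property the table above attributes to droidlet's agent abstractions: the higher-level code stays the same while the environment-specific layer changes.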
Droidlet grew out of Meta's CraftAssist project, described in the 2019 paper "CraftAssist: A Framework for Dialogue-enabled Interactive Agents" by Jonathan Gray, Kavya Srinet, Yacine Jernite, Haonan Yu, Zhuoyuan Chen, Demi Guo, Siddharth Goyal, C. Lawrence Zitnick, and Arthur Szlam. CraftAssist built a collaborative assistant bot inside Minecraft that could carry out tasks specified by player dialogue, and it shipped tools for recording those interactions to study grounded dialogue and interactive learning.[3] CraftAssist is effectively an earlier, Minecraft-specific predecessor of droidlet, and its Minecraft agent is included in the droidlet source so it can run alongside the platform's other modules. Several droidlet pieces, including the semantic parser architecture and DSL, are direct descendants of CraftAssist work.[1][3]
Droidlet was open-sourced under the MIT license and published at the GitHub repository facebookresearch/droidlet. The repository was later folded into a unified Meta AI robotics project called Fairo, which collects droidlet alongside Polymetis (a PyTorch-based real-time controller manager) and the Meta Robotics Platform for orchestrating heterogeneous robots; within Fairo, droidlet is described as an early research project for exploring grounded dialogue, interactive learning, and human-computer interfaces.[4] The project provided basic agents intended for researchers and hobbyists rather than a finished product, and Meta framed it as a substrate for new research in self-supervised, multi-modal, interactive, and lifelong learning.[1][2]
Coverage at the July 2021 release described droidlet as a modular platform sitting at the intersection of natural language processing, computer vision, and robotics, and as a way to prototype embodied agents faster by making it easy to swap one perception or language model for another.[5][6] Later Meta AI work built directly on the platform; the 2022 paper "Many Episode Learning in a Modular Embodied Agent via End-to-End Interaction" used a droidlet-style modular agent to study learning across many interaction episodes.[7]