Large Behavior Model
Last reviewed
Sources
19 citations
Review status
Source-backed
Revision
v2 · 2,928 words
A Large Behavior Model (LBM) is a single neural network for robotics that is pretrained on large, diverse datasets of robot demonstrations and outputs robot actions, learning to perform many dexterous manipulation tasks with one set of weights rather than a separate policy per task. An LBM takes camera images, robot proprioception, and a natural-language task description as input and produces sequences of low-level motor commands, making it the robotics analogue of a large language model. Toyota Research Institute (TRI) coined the term on September 19, 2023, framing the goal as building "'Large Behavior Models (LBMs)' for robots, analogous to the Large Language Models (LLMs) that have recently revolutionized conversational AI." [5]
Where an LLM is trained on internet-scale text to model language, an LBM is trained on large corpora of robot behavior to model action. LBMs sit inside the broader robot foundation model category and overlap heavily with vision-language-action (VLA) models, but the LBM label is most closely associated with TRI's specific line of work, which builds on the Diffusion Policy method rather than on a fine-tuned vision-language backbone. The defining bet behind LBMs, as a class of robot foundation model, is that pretraining on many tasks at once makes a robot policy more capable and more data-efficient when adapted to a new task.
LBMs belong to the wider effort in embodied AI and robot learning to replace hand-engineered, task-specific controllers with general-purpose policies learned from data. The defining characteristics, as used by TRI and adopted by collaborators such as Boston Dynamics, are:
Because the term originated as a deliberate parallel to LLMs, "Large Behavior Model" is sometimes used loosely as a synonym for any large robotics model that maps perception to action. In the more precise sense used in TRI's published research, an LBM is a scaled, multitask, diffusion-based visuomotor policy. The wiki maintains separate articles for the general robot foundation model and embodied AI categories; this article covers the LBM term specifically.
TRI introduced the LBM framing on September 19, 2023, alongside a generative-AI method, developed with Professor Shuran Song's group at Columbia University, for teaching robots dexterous skills from demonstration. [5] The robot learned each skill from a few dozen teleoperated demonstrations paired with a language description of the goal, using Diffusion Policy to model the demonstrated behavior. TRI reported that it had already taught its robots more than 60 difficult, dexterous skills, including pouring liquids, using tools, and manipulating deformable objects such as cloth, "without writing a single line of new code," with the only change being the supplied data. [5] The institute set targets of hundreds of skills by the end of 2023 and 1,000 by the end of 2024. [5] Russ Tedrake, who leads robotics research at TRI and is a professor at MIT, said at the time: "What is so exciting about this new approach is the rate and reliability with which we can add new skills." [5]
Gill Pratt, then TRI's CEO and Chief Scientist for Toyota Motor Corporation, positioned the work within Toyota's human-amplification mission rather than as labor replacement. [5] That 2023 announcement established LBMs as a research vision; the substantive empirical evidence for the approach arrived in 2025 with a large study, described below.
The technical foundation of TRI's LBMs is Diffusion Policy, introduced in the paper "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" by Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song, presented at Robotics: Science and Systems (RSS) in 2023 and later extended in the International Journal of Robotics Research. [11][12] Diffusion Policy represents a robot's visuomotor policy as a conditional denoising diffusion model over action sequences: instead of regressing a single action, the network iteratively denoises a sampled action trajectory conditioned on recent observations. The authors argued this is well suited to robot control because it gracefully represents multimodal action distributions (situations where several different motions are all valid), scales to high-dimensional action spaces, and trains stably. As the project reports, "Diffusion Policy outperforms prior state-of-the-art on 12 tasks across 4 benchmarks with an average success-rate improvement of 46.9%." [12] Several of the same authors, including Feng, Cousineau, Burchfiel, and Tedrake, are affiliated with TRI, which is why the LBM program is built directly on this method.
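The iterative denoising described above can be sketched with a toy example. This is a minimal illustration, not the authors' implementation: the linear noise schedule, the 1-D ramp "demonstration", and the closed-form `toy_noise_predictor` (standing in for the learned network) are all assumptions made so the loop runs without any training.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not taken from the paper)."""
    betas = [beta_start + (beta_end - beta_start) * k / (num_steps - 1)
             for k in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def demonstrated_chunk(goal, horizon):
    """Toy 'demonstration': a 1-D ramp that reaches `goal` at the last step."""
    return [goal * (t + 1) / horizon for t in range(horizon)]


def toy_noise_predictor(noisy, k, goal, alpha_bars):
    """Hypothetical stand-in for the learned network eps_theta(a_k, k, obs).

    The toy data has exactly one valid chunk per goal, so the ideal noise
    estimate can be written in closed form instead of being learned.
    """
    target = demonstrated_chunk(goal, len(noisy))
    signal = math.sqrt(alpha_bars[k])
    sigma = math.sqrt(1.0 - alpha_bars[k])
    return [(a - signal * t) / sigma for a, t in zip(noisy, target)]


def sample_action_chunk(goal, horizon=16, num_steps=50, seed=0):
    """Reverse diffusion: start from Gaussian noise, denoise into an action chunk."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    actions = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    for k in reversed(range(num_steps)):
        eps = toy_noise_predictor(actions, k, goal, alpha_bars)
        coef = betas[k] / math.sqrt(1.0 - alpha_bars[k])
        mean = [(a - coef * e) / math.sqrt(alphas[k])
                for a, e in zip(actions, eps)]
        if k > 0:
            # Re-inject a little noise on every step except the last.
            noise_scale = math.sqrt(betas[k])
            actions = [m + noise_scale * rng.gauss(0.0, 1.0) for m in mean]
        else:
            actions = mean
    return actions
```

Calling `sample_action_chunk(goal=0.5)` starts from pure noise and ends on the ramp returned by `demonstrated_chunk(0.5, 16)`. In a real policy the noise predictor is a trained network conditioned on images and proprioception, and the data distribution can contain several valid trajectories, which is where the multimodality argument applies.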
LBMs are trained by imitation learning (behavioral cloning) over teleoperated demonstrations rather than by reinforcement learning or reward design. Following Diffusion Policy and related work such as Action Chunking with Transformers, the model predicts short "action chunks" (a block of future timesteps at once) instead of a single next action, which improves temporal consistency and robustness. In TRI's 2025 study the policy outputs 20-dimensional actions over a 16-timestep horizon (about 1.6 seconds of motion), and at deployment executes 8 of those timesteps at 10 Hz (0.8 seconds) before recomputing. [1] Language conditioning is supplied through a natural-language prompt encoded alongside the visual and proprioceptive inputs, letting one network select among many learned behaviors.
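The predict-16, execute-8 pattern can be sketched as a receding-horizon loop. The three constants come from the figures quoted above; `toy_policy` and the one-dimensional "position" dynamics are hypothetical stand-ins for the real model and robot.

```python
HORIZON = 16      # timesteps predicted per chunk (figure quoted from the study)
EXECUTE = 8       # timesteps run before recomputing (figure quoted from the study)
CONTROL_HZ = 10   # control rate in Hz (figure quoted from the study)


def toy_policy(position, goal=1.0):
    """Hypothetical stand-in for the LBM: plan equal steps that would
    reach `goal` if the whole chunk were executed."""
    step = (goal - position) / HORIZON
    return [step] * HORIZON


def run_receding_horizon(policy, position, total_steps):
    """Execute the first EXECUTE actions of each chunk, then replan."""
    executed, inferences = [], 0
    while len(executed) < total_steps:
        chunk = policy(position)            # one inference call
        inferences += 1
        for action in chunk[:EXECUTE]:      # discard the second half of the chunk
            if len(executed) == total_steps:
                break
            position += action              # toy dynamics: action is a displacement
            executed.append(action)
    return position, executed, inferences
```

Running `run_receding_horizon(toy_policy, 0.0, 40)` performs five inference calls for 40 control steps. `HORIZON / CONTROL_HZ` and `EXECUTE / CONTROL_HZ` reproduce the 1.6-second and 0.8-second durations in the text. Replanning before the chunk is exhausted lets the policy react to new observations while still committing to a temporally consistent block of motion.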
The central hypothesis of the LBM program, mirroring the pretraining-then-finetuning recipe of LLMs, is that pretraining a single policy on a large, diverse mixture of tasks yields a model that is more capable and more data-efficient when adapted to a new task than a policy trained from scratch on that task alone. [1]
In 2025 the TRI Large Behavior Model team published "A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation" (arXiv:2507.05331, posted July 7, 2025), a large empirical study with 82 authors including Jose Barreiros, Rares Ambrus, Hadas Kress-Gazit, Siyuan Feng, and Russ Tedrake. [1] A peer-reviewed version subsequently appeared in Science Robotics (DOI 10.1126/scirobotics.aea6201). [2] The paper is notable both for its scale and for an unusually rigorous evaluation protocol, which the authors present as a corrective to the loosely controlled comparisons common in robot learning. Its headline conclusion is that "multi-task pretraining makes the policies more successful and robust, and enables teaching complex new tasks more quickly, using a fraction of the data when compared to single-task baselines." [1]
The LBMs studied are scaled multitask diffusion policies with multimodal Vision Transformer (ViT) vision-language encoders and a transformer denoising head conditioned on the encoded observations via Adaptive Layer Normalization (AdaLN). Inputs are wrist and scene camera images, robot proprioception, and a language prompt; the output is the 20-dimensional, 16-timestep (1.6 second) action chunk described above. [1] The models were deployed on bimanual stations built from dual Franka Panda FR3 arms with parallel grippers and six cameras (two FRAMOS scene cameras plus two FLIR wrist cameras per arm). [1]
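As a rough illustration of AdaLN conditioning, the sketch below normalizes a feature vector and then applies a scale and shift regressed from a conditioning vector. The `1 + scale` parameterization follows a formulation common in diffusion transformers and is an assumption here; the paper's exact variant, dimensions, and weights are not reproduced, and all names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, vec):
    """Plain matrix-vector product plus bias; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def adaln(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Adaptive layer norm: scale and shift are functions of the conditioning vector."""
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    normed = layer_norm(x)
    return [n * (1.0 + s) + t for n, s, t in zip(normed, scale, shift)]
```

With all-zero projection weights and biases, `adaln` reduces to plain layer normalization. Non-zero projections let the encoded observation (here `cond`) modulate every feature of the denoising head without being concatenated into its token sequence.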
The models were pretrained on roughly 1,700 hours (about 1,695 hours) of robot demonstration data drawn from four sources, as reported by TRI: [1]
| Source | Approx. hours |
|---|---|
| Internal bimanual robot teleoperation | 468 |
| Simulation-collected teleoperation | 45 |
| Universal Manipulation Interface (UMI) data | 32 |
| Public internet data (Open X-Embodiment) | ~1,150 |
| Total | ~1,700 |
The internally collected portion (called TRI-Ramen, roughly 545 hours) spanned 532 high-diversity tasks. [1]
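The totals in the table can be checked with a few lines of arithmetic. The hour figures are copied from the table; treating the first three rows as the internally collected portion is an assumption that happens to reproduce the roughly 545 hours quoted above.

```python
# Approximate hours per source, copied from the table above.
HOURS = {
    "Internal bimanual robot teleoperation": 468,
    "Simulation-collected teleoperation": 45,
    "Universal Manipulation Interface (UMI) data": 32,
    "Public internet data (Open X-Embodiment)": 1150,
}

# Assumption: the internally collected portion is the first three rows.
INTERNAL_SOURCES = [
    "Internal bimanual robot teleoperation",
    "Simulation-collected teleoperation",
    "Universal Manipulation Interface (UMI) data",
]

total_hours = sum(HOURS.values())
internal_hours = sum(HOURS[name] for name in INTERNAL_SOURCES)
shares = {name: hours / total_hours for name, hours in HOURS.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: {total_hours} h, internal: {internal_hours} h")
```

On these figures the public Open X-Embodiment data makes up about two thirds of the pretraining mixture by hours, with the internally collected data supplying the remaining third.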
The study evaluated 29 tasks (16 simulated tasks seen during pretraining, 3 real-world seen tasks, 5 unseen simulated tasks, and 5 unseen real-world tasks) across more than 47,000 simulation rollouts and roughly 1,800 rigorously controlled real-world trials, using 200 rollouts per simulation task and 50 per real-world task (per policy, per condition). [1] Crucially, the real-world comparisons were run as blind, randomized A/B tests against single-task baselines, with statistical significance assessed via sequential hypothesis testing and Clopper-Pearson confidence intervals. [1] This level of statistical control is the "careful examination" the title refers to.
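To make the statistics concrete, here is a minimal standard-library sketch of a Clopper-Pearson (exact binomial) confidence interval for a success rate. It is not the study's analysis code. The trial count of 50 matches the per-task real-world figure quoted above, while any success counts passed in are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i)
               for i in range(k + 1))


def _root_of_increasing(f, iters=60):
    """Bisection on [0, 1] for a function that rises from negative to positive."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(successes, trials, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    k, n = successes, trials
    if k == 0:
        lower = 0.0
    else:
        # Smallest p for which seeing >= k successes has probability alpha / 2.
        lower = _root_of_increasing(
            lambda p: (1.0 - binom_cdf(k - 1, n, p)) - alpha / 2.0)
    if k == n:
        upper = 1.0
    else:
        # Largest p for which seeing <= k successes has probability alpha / 2.
        upper = _root_of_increasing(
            lambda p: alpha / 2.0 - binom_cdf(k, n, p))
    return lower, upper
```

For example, `clopper_pearson(40, 50)` returns an interval around the observed 80% success rate. With only 50 trials per task such intervals are wide, which is one reason the study pairs them with blind A/B testing and sequential hypothesis tests rather than comparing raw success rates.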
The study's headline result is that multitask pretraining helps: a pretrained LBM, after fine-tuning, learns complex new tasks more successfully, more robustly, and with substantially less task-specific data than a policy trained from scratch. The paper states that "to achieve similar performance in simulation, when finetuning an LBM we require less than 30% of the data needed for training from scratch," [1] a roughly 3 to 5 times reduction in the demonstration data needed in challenging settings; TRI summarized this in popular communications as up to about 80% less data for new skills. [7][8] The benefit appeared "far before internet-scale data," emerging with just a few hundred diverse hours, and fine-tuned performance improved smoothly as pretraining data grew, with no evidence of sharp inflection points or emergent discontinuities of the kind sometimes claimed for language models. [1]
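The different ways of stating the data saving can be related with simple arithmetic. The sketch below only converts between the quoted figures; the function names are illustrative.

```python
def reduction_factor(fraction_of_data_needed):
    """How many times less data is needed, given the fraction still required."""
    return 1.0 / fraction_of_data_needed


def fraction_needed(percent_less):
    """Fraction of the original data still required after a 'percent less' saving."""
    return 1.0 - percent_less / 100.0


# "Less than 30% of the data" corresponds to a reduction of more than about 3.3x.
print(round(reduction_factor(0.30), 2))

# "Up to about 80% less data" leaves 20% of the data, a reduction of about 5x.
print(round(reduction_factor(fraction_needed(80)), 2))
```

The two quoted figures therefore bracket the "roughly 3 to 5 times" range: the 30% threshold marks the low end and the 80% saving marks the high end.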
The authors were also candid about limits. Pretrained models used zero-shot, without fine-tuning, showed mixed results and did not clearly beat single-task baselines. They also found that seemingly minor design choices, such as input and action normalization, could have large effects on performance, often dominating architectural or algorithmic changes. [1] These caveats are part of why the paper is framed as a sober examination rather than a breakthrough announcement.
On August 20, 2025, TRI and Boston Dynamics announced a demonstration of the electric Atlas humanoid robot driven by a single Large Behavior Model, the first public result of a research partnership the two organizations had announced in October 2024. [9][10] In the demonstration, one language-conditioned LBM provided whole-body control of Atlas across its roughly 50 degrees of freedom through a continuous sequence of packing, sorting, and organizing tasks. These tasks combined object manipulation with locomotion (walking, crouching, and lifting) and required adapting to mid-task perturbations such as a box being moved. [9][18] On the Atlas platform the policy predicts longer action chunks (length 48, about 1.6 seconds) and executes roughly half of each chunk per inference cycle. [18] Notably, the model treated hands and feet uniformly under a single policy rather than separating low-level walking and balancing from arm manipulation, an approach the partners highlighted as a departure from conventional humanoid software stacks. [18]
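The chunk figures quoted for Atlas can be set beside those of the bimanual stations in the 2025 study (16 timesteps over 1.6 seconds, 8 executed). The control rates below are derived from chunk length and duration rather than quoted, and the Atlas value of 24 executed actions assumes that "roughly half" means exactly half.

```python
from dataclasses import dataclass


@dataclass
class ChunkConfig:
    name: str
    chunk_len: int        # actions predicted per inference
    chunk_seconds: float  # duration the chunk covers
    executed: int         # actions run before the next inference

    @property
    def rate_hz(self):
        """Control rate implied by chunk length and duration."""
        return self.chunk_len / self.chunk_seconds

    @property
    def replanning_period_s(self):
        """Time between successive inference calls."""
        return self.executed / self.rate_hz


TABLETOP = ChunkConfig("bimanual Franka station", 16, 1.6, 8)
ATLAS = ChunkConfig("Atlas", 48, 1.6, 24)  # 24 is an assumption: exactly half of 48

for cfg in (TABLETOP, ATLAS):
    print(f"{cfg.name}: {cfg.rate_hz:.0f} Hz, "
          f"replan every {cfg.replanning_period_s:.1f} s")
```

On these assumptions both platforms plan over the same 1.6-second window and replan about every 0.8 seconds; the longer Atlas chunk reflects a higher implied control rate rather than a longer planning horizon.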
Scott Kuindersma, Vice President of Robotics Research at Boston Dynamics, said: "Training a single neural network to perform many long-horizon manipulation tasks will lead to better generalization, and highly capable robots like Atlas present the fewest barriers to data collection for tasks requiring whole-body precision, dexterity, and strength." [9] Russ Tedrake, by then Senior Vice President of Large Behavior Models at TRI, described the appeal of the approach this way: "Large Behavior Models address this opportunity in a fundamentally new way: skills are added quickly via demonstrations from humans, and as the LBMs get stronger, they require less and less demonstrations to achieve more and more robust behaviors." [9] The partners again stressed that new behaviors were added through demonstrations rather than by writing new control code.
LBMs are closely related to, but not identical with, vision-language-action models and other robot foundation models. The categories overlap, and the distinctions are partly about architecture and partly about which research group uses which label.
| System | Developer | Common label | Action mechanism | Notes |
|---|---|---|---|---|
| TRI LBMs | Toyota Research Institute | Large Behavior Model | Diffusion policy (denoising) | Origin of the LBM term; bimanual Franka arms; also drives Atlas with Boston Dynamics |
| pi-zero | Physical Intelligence | VLA / robot foundation model | Flow matching action expert on a VLM backbone | Built on the PaliGemma VLM; outputs continuous actions at high frequency |
| RT-2 | Google DeepMind | VLA | Actions emitted as discrete tokens by a VLM | Co-fine-tunes a large vision-language model on web and robot data |
| RT-1 / RT-X | Google DeepMind | Robotics transformer | Discrete action tokens | RT-X trained on the pooled Open X-Embodiment dataset across many labs |
| Gemini Robotics | Google DeepMind | VLA on a frontier VLM | Action decoding on a Gemini backbone | Brings a general-purpose multimodal model into embodied control |
| GR00T N1 | NVIDIA | Humanoid foundation model / VLA | Dual-system (VLM planner plus fast action model) | Open model trained on robot trajectories, human video, and synthetic data |
| Helix | Figure AI | VLA | Dual-system architecture | Controls full humanoid upper body; runs onboard |
The clearest technical fault line is architectural. The mainstream VLA recipe, exemplified by RT-2, Gemini Robotics, GR00T, and Helix, starts from a pretrained vision-language model and adds an action head, so the policy inherits semantic and reasoning priors from internet-scale image and text data. TRI's LBMs instead scale up the Diffusion Policy line of imitation learning, using a multimodal ViT encoder trained primarily on robot data rather than a large pretrained language backbone. Physical Intelligence's pi-zero sits between these poles: it uses a VLM backbone (PaliGemma) but, like the diffusion-based LBMs and unlike token-emitting VLAs, generates continuous actions, in its case via flow matching rather than denoising diffusion. Other companies such as Skild AI pursue general-purpose robot "brains" under their own branding without adopting the LBM term.
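The difference between token-emitting and continuous-action policies can be illustrated with a small round trip through a discrete action vocabulary. This is a sketch under stated assumptions: uniform binning, the 256-bin vocabulary, and the [-1, 1] action range are illustrative choices, not a description of any particular model's tokenizer.

```python
def tokenize(action, low=-1.0, high=1.0, bins=256):
    """Map a continuous action value to one of `bins` discrete tokens (assumed uniform bins)."""
    clipped = min(max(action, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # the upper edge falls into the last bin


def detokenize(token, low=-1.0, high=1.0, bins=256):
    """Map a token back to the centre of its bin."""
    width = (high - low) / bins
    return low + (token + 0.5) * width


def round_trip_error(action, **kwargs):
    """Quantization error introduced by emitting the action as a token."""
    return abs(detokenize(tokenize(action, **kwargs), **kwargs) - action)
```

For actions inside the range, the round-trip error is bounded by half a bin width. A diffusion or flow-matching head emits real-valued actions directly and so avoids this quantization step, at the cost of an iterative sampling procedure at inference time.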
In short, every LBM is a robot foundation model, and TRI's LBMs perform the same mapping from vision and language to action that defines a VLA. But not every VLA is called an LBM: the label specifically signals the TRI-style, diffusion-based, demonstration-trained lineage.
The LBM program shares the unresolved challenges of the broader embodied-AI field, several of which TRI's own evaluation makes explicit:
Robot foundation model · Foundation model · Embodied AI · Robot learning · Robot manipulation · Robotics models · Diffusion Policy · Diffusion model · Vision-language-action · pi-zero · RT-2 · Gemini Robotics · GR00T · Mobile ALOHA · Toyota Research Institute · Atlas