Large Behavior Model
Last reviewed
Jun 4, 2026
Sources
18 citations
Review status
Source-backed
Revision
v1 · 2,632 words
A Large Behavior Model (LBM) is a class of AI model for robotics that is pretrained on large, diverse datasets of robot demonstrations and outputs robot actions, learning to perform many manipulation tasks with a single network rather than one policy per task. An LBM takes camera images, robot proprioception, and a natural-language task description as input and produces sequences of low-level motor commands. The term was coined by Toyota Research Institute (TRI) in September 2023, when TRI's then-CEO Gill Pratt framed the goal as building "Large Behavior Models for robots, analogous to the Large Language Models that have recently revolutionized conversational AI." In that analogy, where a large language model is trained on internet-scale text to model language, an LBM is trained on large corpora of robot behavior to model action. LBMs sit inside the broader robot foundation model category and overlap heavily with vision-language-action (VLA) models, but the LBM label is most closely associated with TRI's specific line of work, which builds on the Diffusion Policy method rather than on a fine-tuned vision-language backbone.
LBMs belong to the wider effort in embodied AI and robot learning to replace hand-engineered, task-specific controllers with general-purpose policies learned from data. The defining characteristics, as used by TRI and adopted by collaborators such as Boston Dynamics, are:
Because the term originated as a deliberate parallel to LLMs, "Large Behavior Model" is sometimes used loosely as a synonym for any large robotics model that maps perception to action. In the more precise sense used in TRI's published research, an LBM is a scaled, multitask, diffusion-based visuomotor policy. The wiki maintains separate articles for the general robot foundation model and embodied AI categories; this article covers the LBM term specifically.
TRI introduced the LBM framing on September 19, 2023, alongside a generative-AI method, developed with Columbia University, for teaching robots dexterous skills from demonstration. The robot learned each skill from a few dozen teleoperated demonstrations paired with a language description of the goal, using Diffusion Policy to model the demonstrated behavior. TRI reported that it had already taught its robots more than 60 difficult, dexterous skills, including pouring liquids, using tools, and manipulating deformable objects such as cloth, "without writing a single line of new code," with the only change being the supplied data. The institute set targets of hundreds of skills by the end of 2023 and 1,000 by the end of 2024. Russ Tedrake, who leads robotics research at TRI and is a professor at MIT, said at the time that "what is so exciting about this new approach is the rate and reliability with which we can add new skills."
That 2023 announcement established LBMs as a research vision. The substantive evidence for the approach arrived in 2025 with a large empirical study, described below.
The technical foundation of TRI's LBMs is Diffusion Policy, introduced in the paper "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" by Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song, presented at Robotics: Science and Systems (RSS) in 2023 and later extended in the International Journal of Robotics Research. Diffusion Policy represents a robot's visuomotor policy as a conditional denoising diffusion model over action sequences: instead of regressing a single action, the network iteratively denoises a sampled action trajectory conditioned on recent observations. The authors argued this is well suited to robot control because it gracefully represents multimodal action distributions (situations where several different motions are all valid), scales to high-dimensional action spaces, and trains stably. Benchmarked across 12 tasks from 4 manipulation benchmarks, Diffusion Policy reported an average improvement of 46.9% over prior state-of-the-art methods. Several of the same authors, including Feng, Cousineau, Burchfiel, and Tedrake, are affiliated with TRI, which is why the LBM program is built directly on this method.
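The iterative denoising described above can be illustrated with a minimal DDPM-style sampling loop. This is a sketch under stated assumptions, not the paper's or TRI's implementation: the linear noise schedule, the step count, the two-dimensional action, and the closed-form `toy_noise_predictor` (a hypothetical stand-in for the trained network) are all chosen here for illustration.

```python
import math
import random

HORIZON = 16      # actions per sampled trajectory (illustrative)
ACTION_DIM = 2    # assumed toy action dimension
NUM_STEPS = 50    # assumed number of denoising steps

# Assumed linear noise schedule; not taken from the source.
BETAS = [1e-4 + (0.02 - 1e-4) * k / (NUM_STEPS - 1) for k in range(NUM_STEPS)]
ALPHAS = [1.0 - b for b in BETAS]
ALPHA_BARS = []
_running = 1.0
for _alpha in ALPHAS:
    _running *= _alpha
    ALPHA_BARS.append(_running)


def toy_noise_predictor(noisy_actions, observation, k):
    """Hypothetical stand-in for a trained network eps_theta(a_k, obs, k).

    It assumes the clean trajectory holds the observed target at every
    timestep and returns the noise implied by that assumption.
    """
    scale = math.sqrt(ALPHA_BARS[k])
    sigma = math.sqrt(1.0 - ALPHA_BARS[k])
    return [[(a - scale * target) / sigma for a, target in zip(step, observation)]
            for step in noisy_actions]


def sample_action_sequence(observation, rng):
    """Start from Gaussian noise and iteratively denoise an action trajectory."""
    actions = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
               for _ in range(HORIZON)]
    for k in reversed(range(NUM_STEPS)):
        eps = toy_noise_predictor(actions, observation, k)
        coef = BETAS[k] / math.sqrt(1.0 - ALPHA_BARS[k])
        inv_sqrt_alpha = 1.0 / math.sqrt(ALPHAS[k])
        noise_scale = math.sqrt(BETAS[k]) if k > 0 else 0.0  # no noise on the final step
        actions = [
            [inv_sqrt_alpha * (a - coef * e) + noise_scale * rng.gauss(0.0, 1.0)
             for a, e in zip(step, eps_step)]
            for step, eps_step in zip(actions, eps)
        ]
    return actions
```

Because the stand-in predictor always points at a single target, this toy collapses to one trajectory; a trained network conditioned on real observations is what would let the same loop sample from several valid motions.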
LBMs are trained by imitation learning (behavioral cloning) over teleoperated demonstrations rather than by reinforcement learning or reward design. Following Diffusion Policy and related work such as Action Chunking with Transformers, the model predicts short "action chunks" (a block of future timesteps at once) instead of a single next action, which improves temporal consistency and robustness. TRI's LBMs predict 16-timestep chunks, corresponding to about 1.6 seconds of motion. Language conditioning is supplied through a natural-language prompt encoded alongside the visual and proprioceptive inputs, letting one network select among many learned behaviors.
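A rough sketch of how chunked prediction is typically consumed at run time follows. The chunk length and duration come from the text; everything else is an assumption made for illustration, including the hypothetical `toy_policy`, the toy dynamics, and the `execute_per_chunk` setting (the source does not state how many actions of each chunk are executed before re-planning).

```python
CHUNK_LEN = 16          # timesteps per predicted chunk (from the text)
CHUNK_SECONDS = 1.6     # duration of one chunk (from the text)
STEP_SECONDS = CHUNK_SECONDS / CHUNK_LEN   # implied 0.1 s per action, about 10 Hz


def toy_policy(observation, prompt):
    """Hypothetical policy: returns CHUNK_LEN future actions for one observation.

    The prompt is accepted but ignored by this toy.
    """
    return [observation + (i + 1) * 0.01 for i in range(CHUNK_LEN)]


def run_episode(initial_observation, prompt, total_steps, execute_per_chunk=8):
    """Predict a chunk, execute a prefix of it, then re-plan from the new observation."""
    observation = initial_observation
    executed = []
    policy_calls = 0
    while len(executed) < total_steps:
        chunk = toy_policy(observation, prompt)
        policy_calls += 1
        for action in chunk[:execute_per_chunk]:
            if len(executed) == total_steps:
                break
            executed.append(action)
            observation = action   # toy dynamics: the state simply tracks the command
    return executed, policy_calls
```

Executing only part of each chunk before re-planning trades some temporal consistency for faster reaction to new observations; executing the full chunk does the opposite.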
The central hypothesis of the LBM program, mirroring the pretraining-then-finetuning recipe of LLMs, is that pretraining a single policy on a large, diverse mixture of tasks yields a model that is more capable and more data-efficient when adapted to a new task than a policy trained from scratch on that task alone.
In 2025 the TRI Large Behavior Model team published "A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation" (arXiv:2507.05331, posted July 7, 2025), a large empirical study with more than 80 authors including Jose Barreiros, Rares Ambrus, Hadas Kress-Gazit, Siyuan Feng, and Russ Tedrake. A peer-reviewed version subsequently appeared in Science Robotics (DOI 10.1126/scirobotics.aea6201). The paper is notable both for its scale and for an unusually rigorous evaluation protocol, which the authors present as a corrective to the loosely controlled comparisons common in robot learning.
The LBMs studied are scaled multitask diffusion policies with multimodal Vision Transformer (ViT) vision-language encoders and a transformer denoising head conditioned on the encoded observations via Adaptive Layer Normalization (AdaLN). Inputs are wrist and scene camera images, robot proprioception, and a language prompt; the output is a 16-timestep (1.6 second) action chunk. The models were deployed on bimanual stations built from dual Franka Panda FR3 arms with up to six cameras (two per wrist and two static scene cameras).
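The AdaLN conditioning mentioned above can be sketched in a few lines. This is a generic illustration rather than TRI's code: the function names are hypothetical, and the `(1 + scale)` convention is an assumed choice borrowed from common diffusion-transformer practice.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, bias, c):
    """Dense layer: weights is a list of rows, one row per output feature."""
    return [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(weights, bias)]


def adaln(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Adaptive LayerNorm: scale and shift are regressed from the conditioning vector.

    Uses the assumed (1 + scale) convention, so zero-valued projections
    reduce the layer to a plain LayerNorm.
    """
    scale = linear(w_scale, b_scale, cond)
    shift = linear(w_shift, b_shift, cond)
    return [n * (1.0 + s) + t for n, s, t in zip(layer_norm(x), scale, shift)]
```

Here `cond` plays the role of the encoded observations: the denoising head's features `x` are normalized, then modulated per feature by values computed from the conditioning.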
The models were pretrained on roughly 1,700 hours of robot demonstration data drawn from four sources, as reported by TRI:
| Source | Approx. hours |
|---|---|
| Internal bimanual robot teleoperation | 468 |
| Simulation-collected teleoperation | 45 |
| Universal Manipulation Interface (UMI) data | 32 |
| Public internet data (Open X-Embodiment) | ~1,150 |
| Total | ~1,700 |
The internally collected portion spanned over 500 high-diversity tasks.
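As a quick arithmetic check on the table, the snippet below sums the four sources and computes each one's share of the mixture. The hour values are the approximate figures from the table; the dictionary keys and function name are labels chosen here.

```python
# Approximate hours per source, copied from the table above.
HOURS = {
    "internal_teleoperation": 468,
    "simulation_teleoperation": 45,
    "umi": 32,
    "open_x_embodiment": 1150,   # listed as approximate (~1,150)
}


def mixture_shares(hours):
    """Return the total hours and each source's fraction of that total."""
    total = sum(hours.values())
    return total, {name: value / total for name, value in hours.items()}


total_hours, shares = mixture_shares(HOURS)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total: {total_hours} hours")
```

The rows sum to 1,695 hours, consistent with the rounded total of about 1,700, and the public data accounts for roughly two thirds of the mixture.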
The study evaluated 29 tasks (16 simulated tasks seen during pretraining, 3 real-world seen tasks, 5 unseen simulated tasks, and 5 unseen real-world tasks) across more than 47,000 simulation rollouts and roughly 1,800 controlled real-world trials, using 200 rollouts per simulation task and 50 per real-world task. Crucially, the real-world comparisons were run as blind, randomized A/B tests against single-task baselines, with statistical significance assessed via sequential hypothesis testing and Clopper-Pearson confidence intervals. This level of statistical control is the "careful examination" the title refers to.
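For readers unfamiliar with Clopper-Pearson intervals, the following sketch computes the exact binomial confidence interval for a success rate using only the standard library. It illustrates the general method, not the paper's evaluation code; the function names, the bisection solver, and the 95% level are assumptions made here.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def clopper_pearson(successes, trials, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial success rate."""

    def solve(f):
        # f must be increasing on [0, 1]; bisect for the point where f = alpha / 2.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) < alpha / 2:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: the p at which seeing at least `successes` has probability alpha / 2.
    lower = 0.0 if successes == 0 else solve(
        lambda p: 1 - binom_cdf(successes - 1, trials, p))
    # Upper bound: the p at which seeing at most `successes` has probability alpha / 2.
    # Solved in q = 1 - p so that the function handed to `solve` is increasing.
    upper = 1.0 if successes == trials else 1 - solve(
        lambda q: binom_cdf(successes, trials, 1 - q))
    return lower, upper


# Example: a hypothetical policy that succeeds in 40 of 50 real-world trials.
low, high = clopper_pearson(40, 50)
print(f"success rate 0.80, 95% CI [{low:.3f}, {high:.3f}]")
```

With only 50 trials per real-world task the interval is wide, which is one reason overlapping intervals between two policies should not be read as evidence that they perform equally.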
The study's headline result is that multitask pretraining helps: a pretrained LBM, after fine-tuning, learns complex new tasks more successfully, more robustly, and with substantially less task-specific data than a policy trained from scratch. TRI summarized the data-efficiency gain in popular communications as up to about 80% less data for new skills, and the paper reports that new tasks can be learned with roughly 3 to 5 times less data in challenging settings that demand robustness. The benefit appeared "far before internet-scale data," emerging with just a few hundred diverse hours, and fine-tuned performance improved smoothly as pretraining data grew, with no evidence of sharp inflection points or emergent discontinuities of the kind sometimes claimed for language models.
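The two ways of stating the data-efficiency gain can be related with simple arithmetic: needing k times less data is the same as saving a fraction 1 - 1/k of it. The helper below is a hypothetical illustration of that conversion, and reading the "80%" figure as the upper end of the "3 to 5 times" range is an interpretation made here rather than a statement from the source.

```python
def fraction_saved(reduction_factor):
    """Convert 'k times less data' into the fraction of data saved."""
    return 1.0 - 1.0 / reduction_factor


for factor in (3, 5):
    print(f"{factor}x less data = {fraction_saved(factor):.0%} less data")
```

Under that reading, a 3-fold reduction corresponds to about 67% less data and a 5-fold reduction to 80% less.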
The authors were also candid about limits. Pretrained models used zero-shot, without fine-tuning, showed mixed results and did not clearly beat single-task baselines. They also found that seemingly minor design choices, such as input and action normalization, could have large effects on performance, often dominating architectural or algorithmic changes. These caveats are part of why the paper is framed as a sober examination rather than a breakthrough announcement.
On August 20, 2025, TRI and Boston Dynamics announced a demonstration of the electric Atlas humanoid robot driven by a single Large Behavior Model, the first public fruit of a research partnership the two organizations had announced in October 2024. In the demonstration, one LBM provided whole-body control of Atlas through a continuous sequence of packing, sorting, and organizing tasks that combined object manipulation with whole-body motion (walking, crouching, and lifting) while adapting to mid-task perturbations such as a box being moved. Notably, the model treated hands and feet uniformly under a single policy rather than separating low-level walking and balancing from arm manipulation, an approach the partners highlighted as a departure from conventional humanoid software stacks.
Scott Kuindersma of Boston Dynamics said that "training a single neural network to perform many long-horizon manipulation tasks will lead to better generalization, and highly capable robots like Atlas present the fewest barriers to data collection." Russ Tedrake described the appeal of the approach as one in which "skills are added quickly via demonstrations from humans, and as the LBMs get stronger, they require less and less demonstrations." The partners again stressed that new behaviors were added through demonstrations rather than by writing new control code.
LBMs are closely related to, but not identical with, vision-language-action models and other robot foundation models. The categories overlap, and the distinctions are partly about architecture and partly about which research group uses which label.
| System | Developer | Common label | Action mechanism | Notes |
|---|---|---|---|---|
| TRI LBMs | Toyota Research Institute | Large Behavior Model | Diffusion policy (denoising) | Origin of the LBM term; bimanual Franka arms; also drives Atlas with Boston Dynamics |
| π₀ (pi-zero) | Physical Intelligence | VLA / robot foundation model | Flow matching action expert on a VLM backbone | Built on the PaliGemma VLM; outputs continuous actions at high frequency |
| RT-2 | Google DeepMind | VLA | Actions emitted as discrete tokens by a VLM | Co-fine-tunes a large vision-language model on web and robot data |
| RT-1 / RT-X | Google DeepMind | Robotics transformer | Discrete action tokens | RT-X trained on the pooled Open X-Embodiment dataset across many labs |
| Gemini Robotics | Google DeepMind | VLA on a frontier VLM | Action decoding on a Gemini backbone | Brings a general-purpose multimodal model into embodied control |
| GR00T N1 | NVIDIA | Humanoid foundation model / VLA | Dual-system (VLM planner plus fast action model) | Open model trained on robot trajectories, human video, and synthetic data |
| Helix | Figure AI | VLA | Dual-system architecture | Controls full humanoid upper body; runs onboard |
The clearest technical fault line is architectural. The mainstream VLA recipe, exemplified by RT-2, Gemini Robotics, GR00T, and Helix, starts from a pretrained vision-language model and adds an action head, so the policy inherits semantic and reasoning priors from internet-scale image and text data. TRI's LBMs instead scale up the Diffusion Policy line of imitation learning, using a multimodal ViT encoder trained primarily on robot data rather than a large pretrained language backbone. Physical Intelligence's π₀ sits between these poles: it uses a VLM backbone (PaliGemma) but, like the diffusion-based LBMs and unlike token-emitting VLAs, generates continuous actions, in its case via flow matching rather than denoising diffusion. Other companies such as Skild AI pursue general-purpose robot "brains" under their own branding without adopting the LBM term.
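The difference between token-emitting and continuous-action policies can be made concrete with a sketch of uniform action binning, the general idea behind emitting actions as discrete tokens. The bin count, the normalized action range, and the function names are assumptions chosen for illustration and are not taken from the source.

```python
NUM_BINS = 256          # assumed number of bins per action dimension
LOW, HIGH = -1.0, 1.0   # assumed normalized action range
BIN_WIDTH = (HIGH - LOW) / NUM_BINS


def encode(action):
    """Map each continuous action value to an integer bin index (a 'token')."""
    return [min(NUM_BINS - 1, max(0, int((a - LOW) / BIN_WIDTH))) for a in action]


def decode(tokens):
    """Map bin indices back to continuous values at the bin centers."""
    return [LOW + (t + 0.5) * BIN_WIDTH for t in tokens]


action = [-1.0, -0.3, 0.0, 0.42, 1.0]
tokens = encode(action)
recovered = decode(tokens)
max_error = max(abs(a - r) for a, r in zip(action, recovered))
print(tokens, f"max round-trip error {max_error:.4f}")
```

A tokenized policy can represent an action only to within half a bin width, whereas diffusion and flow-matching heads output real-valued actions directly, at the cost of an iterative sampling procedure.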
In short, every LBM is a robot foundation model, and TRI's LBMs perform the same vision-and-language-to-action mapping that defines a VLA. Not every VLA is called an LBM, however: the label specifically signals the TRI-style, diffusion-based, demonstration-trained lineage.
The LBM program shares the unresolved challenges of the broader embodied-AI field, several of which TRI's own evaluation makes explicit:
Robot foundation model · Embodied AI · Robot learning · Robot manipulation · Robotics models · Diffusion model · Vision-language-action · π₀ (pi-zero) · RT-2 · Gemini Robotics · GR00T · Mobile ALOHA · Toyota Research Institute · Atlas