# Large Behavior Model

> Source: https://aiwiki.ai/wiki/large_behavior_models
> Updated: 2026-06-27
> Categories: Embodied AI, Machine Learning, Robotics
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **Large Behavior Model (LBM)** is a single neural network for robotics that is pretrained on large, diverse datasets of robot demonstrations and outputs robot actions, learning to perform many dexterous manipulation tasks with one set of weights rather than a separate policy per task. An LBM consumes streams of sensor data (camera images and robot proprioception) together with a natural-language task description, and produces sequences of low-level motor commands, making it the robotics analogue of a [large language model](/wiki/large_language_model). [Toyota Research Institute](/wiki/toyota_research_institute) (TRI) coined the term on September 19, 2023, framing the goal as building "'Large Behavior Models (LBMs)' for robots, analogous to the Large Language Models (LLMs) that have recently revolutionized conversational AI." [5]

Where an LLM is trained on internet-scale text to model language, an LBM is trained on large corpora of robot behavior to model action. LBMs sit inside the broader [robot foundation model](/wiki/robot_foundation_model) category and overlap heavily with [vision-language-action](/wiki/vision_language_action_model) (VLA) models, but the LBM label is most closely associated with TRI's specific line of work, which builds on the [Diffusion Policy](/wiki/diffusion_policy) method rather than on a fine-tuned vision-language backbone. The defining bet behind LBMs, as a class of robotics [foundation model](/wiki/foundation_model), is that pretraining on many tasks at once makes a robot policy more capable and more data-efficient when adapted to a new task.

## What is a Large Behavior Model?

LBMs belong to the wider effort in [embodied AI](/wiki/embodied_ai) and [robot learning](/wiki/robot_learning) to replace hand-engineered, task-specific controllers with general-purpose policies learned from data. The defining characteristics, as used by TRI and adopted by collaborators such as Boston Dynamics, are:

- **Behavior generation, not text generation.** The model's output space is robot actions (for example, end-effector or joint targets over a short horizon), so it can be executed directly on hardware. This is the property that distinguishes a "behavior" model from a language or vision model.
- **Multitask by construction.** A single set of weights is trained across hundreds of distinct tasks and, often, multiple robot embodiments, rather than fitting a separate model to each skill.
- **Learned from demonstrations.** Training data comes overwhelmingly from imitation learning over teleoperated human demonstrations, supplemented with simulation and pooled public datasets.
- **Language-conditioned.** A natural-language prompt selects which behavior the model should perform, so one network can be steered to many goals at inference time.

Because the term originated as a deliberate parallel to LLMs, "Large Behavior Model" is sometimes used loosely as a synonym for any large [robotics model](/wiki/robotics_models) that maps perception to action. In the more precise sense used in TRI's published research, an LBM is a scaled, multitask, diffusion-based visuomotor policy. The wiki maintains separate articles for the general [robot foundation model](/wiki/robot_foundation_model) and [embodied AI](/wiki/embodied_ai) categories; this article covers the LBM term specifically.

## When was the term coined?

TRI introduced the LBM framing on September 19, 2023, alongside a generative-AI method, developed with Professor Shuran Song's group at Columbia University, for teaching robots dexterous skills from demonstration. [5] The robot learned each skill from a few dozen teleoperated demonstrations paired with a language description of the goal, using [Diffusion Policy](/wiki/diffusion_policy) to model the demonstrated behavior. TRI reported that it had already taught its robots more than 60 difficult, dexterous skills, including pouring liquids, using tools, and manipulating deformable objects such as cloth, "without writing a single line of new code," with the only change being the supplied data. [5] The institute set targets of hundreds of skills by the end of 2023 and 1,000 by the end of 2024. [5] Russ Tedrake, who leads robotics research at TRI and is a professor at MIT, said at the time: "What is so exciting about this new approach is the rate and reliability with which we can add new skills." [5]

Gill Pratt, then TRI's CEO and Chief Scientist for Toyota Motor Corporation, positioned the work within Toyota's human-amplification mission rather than as labor replacement. [5] That 2023 announcement established LBMs as a research vision; the substantive empirical evidence for the approach arrived in 2025 with a large study, described below.

## How do LBMs work?

### Diffusion Policy

The technical foundation of TRI's LBMs is [Diffusion Policy](/wiki/diffusion_policy), introduced in the paper "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" by Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song, presented at Robotics: Science and Systems (RSS) in 2023 and later extended in the International Journal of Robotics Research. [11][12] Diffusion Policy represents a robot's visuomotor policy as a conditional denoising [diffusion model](/wiki/diffusion_model) over action sequences: instead of regressing a single action, the network iteratively denoises a sampled action trajectory conditioned on recent observations. The authors argued this is well suited to robot control because it gracefully represents multimodal action distributions (situations where several different motions are all valid), scales to high-dimensional action spaces, and trains stably. As the project reports, "Diffusion Policy outperforms prior state-of-the-art on 12 tasks across 4 benchmarks with an average success-rate improvement of 46.9%." [12] Several of the same authors, including Feng, Cousineau, Burchfiel, and Tedrake, are affiliated with TRI, which helps explain why the LBM program builds directly on this method.
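The iterative denoising described above can be sketched as a toy DDPM-style reverse loop. This is a minimal illustration under stated assumptions, not TRI's or the paper's implementation: the noise predictor `predict_noise` is a closed-form oracle standing in for a trained network, `expert_chunk` is a made-up straight-line behavior, and the schedule, step count, and dimensions are arbitrary toy values.

```python
import numpy as np

HORIZON, ACTION_DIM, NUM_STEPS = 16, 2, 50  # toy sizes chosen for illustration

# Toy linear noise schedule and its cumulative products (standard DDPM notation).
betas = np.linspace(1e-4, 0.2, NUM_STEPS)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)


def expert_chunk(obs):
    """Hypothetical demonstrated behavior: a straight line from obs to the origin."""
    fractions = np.linspace(1.0, 0.0, HORIZON)[:, None]
    return fractions * obs[None, :]


def predict_noise(noisy_actions, step, obs):
    """Oracle stand-in for the trained denoiser (NOT a learned model).

    It returns the exact noise that separates noisy_actions from the
    observation-conditioned expert chunk at this noise level.
    """
    clean = expert_chunk(obs)
    return (noisy_actions - np.sqrt(alpha_bars[step]) * clean) / np.sqrt(1.0 - alpha_bars[step])


def sample_action_chunk(obs, rng):
    """Start from Gaussian noise and iteratively denoise it into an action chunk."""
    actions = rng.standard_normal((HORIZON, ACTION_DIM))
    for step in reversed(range(NUM_STEPS)):
        eps = predict_noise(actions, step, obs)
        # DDPM posterior mean of the less-noisy sample given the predicted noise.
        mean = (actions - betas[step] / np.sqrt(1.0 - alpha_bars[step]) * eps) / np.sqrt(alphas[step])
        # Fresh noise is injected at every step except the last one.
        noise = rng.standard_normal(actions.shape) if step > 0 else 0.0
        actions = mean + np.sqrt(betas[step]) * noise
    return actions


obs = np.array([0.4, -0.2])  # e.g. a 2-D end-effector position
chunk = sample_action_chunk(obs, np.random.default_rng(0))
```

Because the oracle always points at one trajectory, the loop collapses onto that single chunk. A trained denoiser instead learns a distribution over chunks, which is where the multimodality argument comes from.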

### Imitation learning, action chunking, and language conditioning

LBMs are trained by **imitation learning** (behavioral cloning) over teleoperated demonstrations rather than by reinforcement learning or reward design. Following Diffusion Policy and related work such as Action Chunking with Transformers, the model predicts short "action chunks" (a block of future timesteps at once) instead of a single next action, which improves temporal consistency and robustness. In TRI's 2025 study the policy outputs 20-dimensional actions over a 16-timestep horizon (about 1.6 seconds of motion), and at deployment executes 8 of those timesteps at 10 Hz (0.8 seconds) before recomputing. [1] Language conditioning is supplied through a natural-language prompt encoded alongside the visual and proprioceptive inputs, letting one network select among many learned behaviors.
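The predict-then-partially-execute pattern can be sketched as a receding-horizon loop. The constants come from the figures quoted above. The function names, the callback interface, and the toy one-dimensional robot are assumptions made for illustration.

```python
PREDICT_STEPS = 16  # timesteps predicted per chunk (from the study)
EXECUTE_STEPS = 8   # timesteps executed before re-planning (from the study)
CONTROL_HZ = 10     # command rate at deployment (from the study)


def run_receding_horizon(policy, get_observation, send_action, total_steps):
    """Predict a full chunk, execute only its first part, then re-plan."""
    executed = 0
    replans = 0
    while executed < total_steps:
        chunk = policy(get_observation())
        assert len(chunk) == PREDICT_STEPS
        replans += 1
        for action in chunk[:EXECUTE_STEPS]:
            if executed == total_steps:
                break
            send_action(action)  # on hardware: one command every 1 / CONTROL_HZ seconds
            executed += 1
    return replans


# Toy 1-D "robot": the state is a counter and each action sets the next state.
state = {"x": 0}
log = []


def toy_policy(obs):
    """Hypothetical policy that plans to increment the state at every step."""
    return [obs + i + 1 for i in range(PREDICT_STEPS)]


def toy_send(action):
    state["x"] = action
    log.append(action)


replans = run_receding_horizon(toy_policy, lambda: state["x"], toy_send, total_steps=20)
chunk_seconds = PREDICT_STEPS / CONTROL_HZ    # 1.6 s predicted
execute_seconds = EXECUTE_STEPS / CONTROL_HZ  # 0.8 s executed before recomputing
```

Discarding the second half of each chunk means every executed action was predicted from an observation at most 0.8 seconds old, while each prediction still covers a longer, temporally consistent motion.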

### Multitask pretraining

The central hypothesis of the LBM program, mirroring the pretraining-then-finetuning recipe of LLMs, is that pretraining a single policy on a large, diverse mixture of tasks yields a model that is more capable and more data-efficient when adapted to a new task than a policy trained from scratch on that task alone. [1]

## What did TRI's 2025 study find?

In 2025 the TRI Large Behavior Model team published "A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation" (arXiv:2507.05331, posted July 7, 2025), a large empirical study with 82 authors including Jose Barreiros, Rares Ambrus, Hadas Kress-Gazit, Siyuan Feng, and Russ Tedrake. [1] A peer-reviewed version subsequently appeared in Science Robotics (DOI 10.1126/scirobotics.aea6201). [2] The paper is notable both for its scale and for an unusually rigorous evaluation protocol, which the authors present as a corrective to the loosely controlled comparisons common in robot learning. Its headline conclusion is that "multi-task pretraining makes the policies more successful and robust, and enables teaching complex new tasks more quickly, using a fraction of the data when compared to single-task baselines." [1]

### Architecture

The LBMs studied are scaled multitask diffusion policies with multimodal Vision Transformer (ViT) vision-language encoders and a transformer denoising head conditioned on the encoded observations via Adaptive Layer Normalization (AdaLN). Inputs are wrist and scene camera images, robot proprioception, and a language prompt; the output is the 20-dimensional, 16-timestep (1.6 second) action chunk described above. [1] The models were deployed on bimanual stations built from dual Franka Panda FR3 arms with parallel grippers and six cameras (two FRAMOS scene cameras plus two FLIR wrist cameras per arm). [1]
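As a rough illustration of AdaLN conditioning, the sketch below normalizes each action token and then applies a scale and shift computed from a conditioning vector. It follows the common `(1 + scale)` formulation. The dimensions, the random weights, and the single linear projection are assumptions, not details taken from the paper.

```python
import numpy as np


def adaptive_layer_norm(tokens, cond, w_scale, w_shift, eps=1e-5):
    """Layer-normalize each token, then modulate it with cond-dependent scale and shift."""
    mean = tokens.mean(axis=-1, keepdims=True)
    var = tokens.var(axis=-1, keepdims=True)
    normalized = (tokens - mean) / np.sqrt(var + eps)
    scale = cond @ w_scale  # shape: (feature_dim,)
    shift = cond @ w_shift  # shape: (feature_dim,)
    return normalized * (1.0 + scale) + shift


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 32))  # 16 action tokens with 32 features (toy sizes)
cond_a = rng.standard_normal(64)        # hypothetical encoded observation
cond_b = rng.standard_normal(64)        # a different observation
w_scale = 0.1 * rng.standard_normal((64, 32))
w_shift = 0.1 * rng.standard_normal((64, 32))

out_a = adaptive_layer_norm(tokens, cond_a, w_scale, w_shift)
out_b = adaptive_layer_norm(tokens, cond_b, w_scale, w_shift)

# With zero projection weights the layer reduces to plain layer normalization.
zero = np.zeros((64, 32))
plain = adaptive_layer_norm(tokens, cond_a, zero, zero)
```

The same noisy action tokens produce different activations under `cond_a` and `cond_b`, which is how the observation steers the denoising head without being concatenated to its input.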

### Data

The models were pretrained on roughly 1,700 hours of robot demonstration data (the four sources below sum to about 1,695 hours), as reported by TRI: [1]

| Source | Approx. hours |
| --- | --- |
| Internal bimanual robot teleoperation | 468 |
| Simulation-collected teleoperation | 45 |
| Universal Manipulation Interface (UMI) data | 32 |
| Public internet data ([Open X-Embodiment](/wiki/open_x_embodiment)) | ~1,150 |
| **Total** | **~1,700** |

The internally collected portion (called TRI-Ramen, roughly 545 hours) spanned 532 high-diversity tasks. [1]
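The totals in the table can be checked with a few lines of arithmetic. The hour figures are the approximate values listed above, and the dictionary keys are shorthand labels chosen for this sketch.

```python
# Approximate pretraining hours per source, as listed in the table above.
hours = {
    "internal_teleop": 468,
    "sim_teleop": 45,
    "umi": 32,
    "open_x_embodiment": 1150,  # approximate figure
}

total_hours = sum(hours.values())
tri_ramen_hours = total_hours - hours["open_x_embodiment"]
share_percent = {name: round(100 * h / total_hours) for name, h in hours.items()}
```

On these figures the three non-public sources account for the roughly 545 hours attributed to TRI-Ramen, and the Open X-Embodiment data makes up about two thirds of the mixture by duration.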

### Evaluation

The study evaluated 29 tasks (16 simulated tasks seen during pretraining, 3 real-world seen tasks, 5 unseen simulated tasks, and 5 unseen real-world tasks) across more than 47,000 simulation rollouts and roughly 1,800 rigorously controlled real-world trials, using 200 rollouts per simulation task and 50 per real-world task (per policy, per condition). [1] Crucially, the real-world comparisons were run as blind, randomized A/B tests against single-task baselines, with statistical significance assessed via sequential hypothesis testing and Clopper-Pearson confidence intervals. [1] This level of statistical control is the "careful examination" the title refers to.
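To make the interval estimate concrete, the sketch below computes an exact (Clopper-Pearson) binomial confidence interval in pure Python by bisection on the binomial tail probabilities. The trial count of 50 matches the per-task real-world budget quoted above. The success count of 40 is a made-up example, and the helper names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, target, iters=60):
    """Find p in [0, 1] with g(p) == target, for g increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(successes, trials, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a success rate."""
    k, n = successes, trials
    # Lower bound: the p at which observing >= k successes has probability alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: the p at which observing <= k successes has probability alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


low, high = clopper_pearson(successes=40, trials=50)  # hypothetical 40/50 outcome
```

With only 50 trials per task the interval around an observed 80% success rate is wide, which is one reason small real-world comparisons need this kind of statistical treatment before a difference between two policies is reported.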

### Findings

The study's headline result is that multitask pretraining helps: a pretrained LBM, after fine-tuning, learns complex new tasks more successfully, more robustly, and with substantially less task-specific data than a policy trained from scratch. The paper states that "to achieve similar performance in simulation, when finetuning an LBM we require less than 30% of the data needed for training from scratch," [1] which corresponds to a roughly 3 to 5 times reduction in the demonstration data needed in challenging settings. TRI summarized this in popular communications as up to about 80% less data for new skills. [7][8] The benefit appeared "far before internet-scale data," emerging with just a few hundred diverse hours, and fine-tuned performance improved smoothly as pretraining data grew, with no evidence of sharp inflection points or emergent discontinuities of the kind sometimes claimed for language models. [1]
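The relationship between the quoted percentages and the "3 to 5 times" figure is simple arithmetic, sketched below. The function name is hypothetical. The inputs are the 30% and 80% figures quoted above.

```python
def reduction_factor(fraction_of_data_needed):
    """How many times less data is needed, given the fraction still required."""
    return 1.0 / fraction_of_data_needed


# "Less than 30% of the data" means a reduction of more than about 3.3 times.
factor_at_30_percent = reduction_factor(0.30)

# "Up to about 80% less data" means 20% is still needed, a reduction of about 5 times.
factor_at_80_percent_less = reduction_factor(1.0 - 0.80)
```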

The authors were also candid about limits. Pretrained models used zero-shot, without fine-tuning, showed mixed results and did not clearly beat single-task baselines. They also found that seemingly minor design choices, such as input and action normalization, could have large effects on performance, often dominating architectural or algorithmic changes. [1] These caveats are part of why the paper is framed as a sober examination rather than a breakthrough announcement.

## How does an LBM control the Atlas humanoid?

On August 20, 2025, TRI and Boston Dynamics announced a demonstration of the electric [Atlas](/wiki/atlas_robot) humanoid robot driven by a single Large Behavior Model, the first public result of a research partnership the two organizations had announced in October 2024. [9][10] In the demonstration, one language-conditioned LBM provided whole-body control of Atlas across its roughly 50 degrees of freedom through a continuous sequence of packing, sorting, and organizing tasks. These tasks combined object [manipulation](/wiki/robot_manipulation) with locomotion (walking, crouching, and lifting), and the robot adapted to mid-task perturbations such as a box being moved. [9][18] On the Atlas platform the policy predicts action chunks with more steps (length 48) that still span about 1.6 seconds, which implies a control rate of roughly 30 Hz, and it executes roughly half of each chunk per inference cycle. [18] Notably, the model treated hands and feet uniformly under a single policy rather than separating low-level walking and balancing from arm manipulation, an approach the partners highlighted as a departure from conventional humanoid software stacks. [18]

Scott Kuindersma, Vice President of Robotics Research at Boston Dynamics, said: "Training a single neural network to perform many long-horizon manipulation tasks will lead to better generalization, and highly capable robots like Atlas present the fewest barriers to data collection for tasks requiring whole-body precision, dexterity, and strength." [9] Russ Tedrake, by then Senior Vice President of Large Behavior Models at TRI, described the appeal of the approach this way: "Large Behavior Models address this opportunity in a fundamentally new way: skills are added quickly via demonstrations from humans, and as the LBMs get stronger, they require less and less demonstrations to achieve more and more robust behaviors." [9] The partners again stressed that new behaviors were added through demonstrations rather than by writing new control code.

## How do LBMs differ from VLA models?

LBMs are closely related to, but not identical with, [vision-language-action](/wiki/vision_language_action_model) models and other robot foundation models. The categories overlap, and the distinctions are partly about architecture and partly about which research group uses which label.

| System | Developer | Common label | Action mechanism | Notes |
| --- | --- | --- | --- | --- |
| TRI LBMs | Toyota Research Institute | Large Behavior Model | Diffusion policy (denoising) | Origin of the LBM term; bimanual Franka arms; also drives Atlas with Boston Dynamics |
| [pi-zero](/wiki/pi_zero) | Physical Intelligence | VLA / robot foundation model | Flow matching action expert on a VLM backbone | Built on the PaliGemma VLM; outputs continuous actions at high frequency |
| [RT-2](/wiki/rt_2) | Google DeepMind | VLA | Actions emitted as discrete tokens by a VLM | Co-fine-tunes a large vision-language model on web and robot data |
| RT-1 / RT-X | Google DeepMind | Robotics transformer | Discrete action tokens | RT-X trained on the pooled Open X-Embodiment dataset across many labs |
| [Gemini Robotics](/wiki/gemini_robotics) | Google DeepMind | VLA on a frontier VLM | Action decoding on a Gemini backbone | Brings a general-purpose multimodal model into embodied control |
| GR00T N1 | NVIDIA | Humanoid foundation model / VLA | Dual-system (VLM planner plus fast action model) | Open model trained on robot trajectories, human video, and synthetic data |
| Helix | Figure AI | VLA | Dual-system architecture | Controls full humanoid upper body; runs onboard |

The clearest technical fault line is architectural. The mainstream VLA recipe, exemplified by RT-2, [Gemini Robotics](/wiki/gemini_robotics), GR00T, and Helix, starts from a pretrained [vision-language model](/wiki/large_language_model) and adds an action head, so the policy inherits semantic and reasoning priors from internet-scale image and text data. TRI's LBMs instead scale up the [Diffusion Policy](/wiki/diffusion_policy) line of imitation learning, using a multimodal ViT encoder trained primarily on robot data rather than a large pretrained language backbone. [Physical Intelligence's](/wiki/pi_zero) pi-zero sits between these poles: it uses a VLM backbone (PaliGemma) but, like the diffusion-based LBMs and unlike token-emitting VLAs, generates continuous actions, in its case via flow matching rather than denoising diffusion. Other companies such as Skild AI pursue general-purpose robot "brains" under their own branding without adopting the LBM term.
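The difference between token-emitting and continuous-action policies can be illustrated with uniform binning of a single action dimension. This is a sketch of the general idea only. The 256-bin vocabulary and the `[-1, 1]` action range are assumptions chosen for illustration, and the function names are hypothetical.

```python
NUM_BINS = 256  # assumed vocabulary size per action dimension


def tokenize(value, low=-1.0, high=1.0):
    """Map a continuous action value to a discrete token id by uniform binning."""
    clipped = min(max(value, low), high)
    token = int((clipped - low) / (high - low) * NUM_BINS)
    return min(token, NUM_BINS - 1)  # the upper edge falls into the last bin


def detokenize(token, low=-1.0, high=1.0):
    """Map a token id back to the centre of its bin."""
    return low + (token + 0.5) * (high - low) / NUM_BINS


bin_width = 2.0 / NUM_BINS
values = [i / 500.0 - 1.0 for i in range(1001)]  # a grid over [-1, 1]
max_error = max(abs(detokenize(tokenize(v)) - v) for v in values)
```

A token-based policy can only command bin centres, so its resolution is bounded by the bin width. Diffusion and flow-matching heads output real-valued actions directly and avoid that quantization step.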

In short, every LBM is a robot foundation model, and TRI's LBMs perform the same mapping from vision and language to action that defines a VLA, but not every VLA is called an LBM, and the LBM label specifically signals the TRI-style, diffusion-based, demonstration-trained lineage.

## What are the limitations of LBMs?

The LBM program shares the unresolved challenges of the broader embodied-AI field, several of which TRI's own evaluation makes explicit:

- **Data scarcity.** Unlike text or images, robot demonstration data does not exist at internet scale and must be physically collected, largely through teleoperation. Building even ~1,700 hours required dedicated data-collection stations and pooling public datasets, and there is no robot equivalent of the web for pretraining.
- **Evaluation.** Robot policies are hard to compare rigorously because success depends on physical setups, object placement, and run-to-run variation. TRI's emphasis on blind A/B trials and sequential hypothesis testing is itself a response to a field where weaker evaluation can overstate gains. [1]
- **Generalization.** Pretrained LBMs did not reliably outperform single-task baselines without fine-tuning, indicating that robust zero-shot transfer to genuinely novel tasks and environments remains unsolved. [1]
- **Sensitivity to engineering details.** Performance was found to hinge on choices like data normalization, which complicates fair comparison and reproducibility across labs. [1]
- **No demonstrated emergence.** Performance scaled smoothly with data rather than exhibiting the sharp capability jumps sometimes attributed to large language models, so claims of LLM-like emergent behavior in robotics are not yet supported by this evidence. [1]

## Related

[Robot foundation model](/wiki/robot_foundation_model) · [Foundation model](/wiki/foundation_model) · [Embodied AI](/wiki/embodied_ai) · [Robot learning](/wiki/robot_learning) · [Robot manipulation](/wiki/robot_manipulation) · [Robotics models](/wiki/robotics_models) · [Diffusion Policy](/wiki/diffusion_policy) · [Diffusion model](/wiki/diffusion_model) · [Vision-language-action](/wiki/vision_language_action_model) · [pi-zero](/wiki/pi_zero) · [RT-2](/wiki/rt_2) · [Gemini Robotics](/wiki/gemini_robotics) · [GR00T](/wiki/gr00t) · [Mobile ALOHA](/wiki/mobile_aloha) · [Toyota Research Institute](/wiki/toyota_research_institute) · [Atlas](/wiki/atlas_robot)

## References

1. "A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation." arXiv (TRI LBM Team), July 7, 2025. https://arxiv.org/abs/2507.05331
2. "A careful examination of large behavior models for multitask dexterous manipulation." Science Robotics, 2026. https://www.science.org/doi/10.1126/scirobotics.aea6201
3. "A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation." TRI project page. https://toyotaresearchinstitute.github.io/lbm1/
4. "Large Behavior Models." Toyota Research Institute. https://www.tri.global/our-work/large-behavior-models
5. "Toyota Research Institute Unveils Breakthrough in Teaching Robots New Behaviors." Toyota USA Newsroom, September 19, 2023. https://pressroom.toyota.com/toyota-research-institute-unveils-breakthrough-in-teaching-robots-new-behaviors/
6. "Toyota Research Institute makes generative AI breakthrough to teach robots new behaviors; Diffusion Policy." Green Car Congress, September 20, 2023. https://www.greencarcongress.com/2023/09/20230920-tri.html
7. "TRI: pretrained large behavior models accelerate robot learning." The Robot Report, July 2025. https://www.therobotreport.com/tri-pretrained-large-behavior-models-accelerate-robot-learning/
8. "Toyota Research Institute reveals major advance in Large Behavior Models for robotics." Robotics & Automation News, July 14, 2025. https://roboticsandautomationnews.com/2025/07/14/toyota-research-institute-unveils-breakthrough-in-large-behavior-models-that-requires-80-percent-less-data/93039/
9. "AI-Powered Robot by Boston Dynamics and Toyota Research Institute Takes a Key Step Towards General-Purpose Humanoids." Toyota USA Newsroom, August 20, 2025. https://pressroom.toyota.com/ai-powered-robot-by-boston-dynamics-and-toyota-research-institute-takes-a-key-step-towards-general-purpose-humanoids/
10. "Boston Dynamics and Toyota Research Institute show Atlas humanoid robot powered by large behavior model." Robotics & Automation News, August 26, 2025. https://roboticsandautomationnews.com/2025/08/26/boston-dynamics-and-toyota-research-institute-demonstrate-humanoid-robot-powered-by-large-behaviour-model/93922/
11. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." arXiv (Chi et al.), 2023. https://arxiv.org/abs/2303.04137
12. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." Project website, Columbia University. https://diffusion-policy.cs.columbia.edu/
13. "Vision-language-action model." Wikipedia. https://en.wikipedia.org/wiki/Vision-language-action_model
14. "pi0 and pi0-FAST: Vision-Language-Action Models for General Robot Control." Hugging Face blog. https://huggingface.co/blog/pi0
15. "RT-2: New model translates vision and language into action." Google DeepMind, 2023. https://deepmind.google/blog/rt-2-new-model-translates-vision-and-language-into-action/
16. "Open X-Embodiment: Robotic Learning Datasets and RT-X Models." arXiv, 2023. https://arxiv.org/html/2310.08864v5
17. "NVIDIA Announces Isaac GR00T N1, the World's First Open Humanoid Robot Foundation Model." NVIDIA Newsroom, March 2025. https://nvidianews.nvidia.com/news/nvidia-isaac-gr00t-n1-open-humanoid-robot-foundation-model-simulation-frameworks
18. "Large Behavior Models and Atlas Find New Footing." Boston Dynamics blog, August 2025. https://bostondynamics.com/blog/large-behavior-models-atlas-find-new-footing/
19. "GR00T N1: An Open Foundation Model for Generalist Humanoid Robots." arXiv, March 2025. https://arxiv.org/abs/2503.14734