Isaac GR00T N1.5
Last reviewed: May 31, 2026
Sources: 10 citations
Review status: Source-backed
Revision: v5 · 2,070 words
Isaac GR00T N1.5 is an open vision-language-action model from NVIDIA built as a generalist foundation model for humanoid robots. It is the first major update to Isaac GR00T N1, which NVIDIA introduced at its GTC conference in March 2025 and described as the first open foundation model for generalized humanoid reasoning and skills. NVIDIA announced N1.5 in May 2025 at the Computex trade show in Taipei and released the open weights soon after. The update keeps the dual-system design of the original model and adds a stronger vision-language backbone, a frozen language model during training, and changes aimed at adapting the model to new robots and new tasks with less data. [1][2][3]
GR00T N1.5 sits inside a wider NVIDIA program for what the company calls physical AI. The model ships alongside synthetic-data tooling, the Isaac Sim and Isaac Lab simulation stack, and the GR00T-Dreams and Isaac GR00T-Mimic data generators. The combination is meant to let a robotics team take a pretrained checkpoint, generate large amounts of training data in simulation or from a few human demonstrations, and then fine-tune the model for a specific humanoid. [2][4]
NVIDIA framed GR00T N1 as a base model that any developer could download, inspect, and customize, rather than a closed system tied to one robot. The first release came with model weights, training scripts, and evaluation tools, and it was paired with a published technical report describing the architecture and training recipe. NVIDIA positioned this as a foundation layer for the humanoid industry, similar to how a large language model serves as a starting point that downstream teams adapt. [1][5]
The pitch rests on the idea of a single model that controls more than one kind of robot. Humanoid hardware varies a lot, with different arms, hands, sensors, and degrees of freedom, and collecting enough teleoperation data for each new machine is slow and expensive. A cross-embodiment model trained on a mixture of many robots, human video, and simulation aims to give each new humanoid a useful starting point before any robot-specific data is added. NVIDIA gave early access to GR00T N1 to several humanoid makers, including 1X Technologies, Agility Robotics, Boston Dynamics, Mentee Robotics, and Neura Robotics. At the GTC keynote, a 1X NEO Gamma humanoid ran a GR00T N1 policy to tidy household objects. [1][6]
GR00T N1 and N1.5 use a two-part design that NVIDIA describes with an analogy to fast and slow human thinking. System 2 is the slow, deliberate part, and System 1 is the fast, reflexive part. The two run at different rates, and in GR00T N1 they are trained jointly end to end. [1][5]
System 2 is a vision-language model that reads camera images and a language instruction, then forms a plan for what the robot should do. It runs at a lower frequency, about 10 hertz, because reasoning about a scene and a goal does not need to happen on every motor cycle. In GR00T N1 this module was built on NVIDIA's Eagle-2 vision-language model, itself fine-tuned from a SmolLM2 language model and a SigLIP-2 image encoder. [1][5]
System 1 is the action part. It is a diffusion transformer that turns the plan from System 2 into a stream of continuous motor commands. It uses flow matching, a generative method related to diffusion, to produce smooth action sequences, and it runs at a much higher rate, about 120 hertz in GR00T N1, so the robot can move in real time. The action module cross-attends to the tokens that System 2 produces, which is how the slow plan steers the fast controller. To handle robots with different bodies, the model wraps the shared core in per-embodiment encoders and decoders that map each robot's specific state and action format into and out of the common representation. [1][5]
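The sampling side of flow matching can be illustrated with a toy example. The sketch below is not NVIDIA's action head: the function `velocity` is a hand-written stand-in for the learned network, `sample_action` and the target vector are hypothetical names and values, and the step count is illustrative. It only shows the general idea of starting from noise and integrating a velocity field with a few Euler steps until the sample lands on an action.

```python
import random


def velocity(x, t, target):
    """Toy stand-in for a learned velocity field (an assumption, not the real model).

    On the straight-line path x_t = (1 - t) * noise + t * target, the velocity
    that carries x_t to the target by t = 1 is (target - x_t) / (1 - t).
    A real model would predict this with a network conditioned on System 2 tokens.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, num_steps=4, seed=0):
    """Integrate the velocity field from Gaussian noise with fixed Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 3-dimensional action the "plan" is steering toward.
target = [0.2, -0.5, 0.9]
action = sample_action(target)
```

Because the toy velocity field is exact, the sample reaches the target after the last step. A trained network only approximates that field, which is why the number of integration steps becomes a trade-off between speed and accuracy.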
The original GR00T N1 had roughly 2 billion to 2.2 billion parameters, and N1.5 is distributed as a model of about 3 billion parameters under the name GR00T-N1.5-3B. [5][3]
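The two-rate arrangement described in this section can be sketched as a simple loop. The rates are the approximate figures given above for GR00T N1. Everything else, including `slow_planner`, `fast_controller`, and the string used as a "plan", is a hypothetical placeholder. The real action module also emits chunks of actions rather than one command per call, so this is a simplified picture of the timing only.

```python
PLAN_HZ = 10     # approximate System 2 rate given for GR00T N1
ACTION_HZ = 120  # approximate System 1 rate given for GR00T N1
TICKS_PER_PLAN = ACTION_HZ // PLAN_HZ  # action ticks between plan refreshes


def slow_planner(observation, instruction):
    """Hypothetical stand-in for System 2: returns an opaque plan."""
    return f"{instruction}@obs{observation}"


def fast_controller(plan, tick):
    """Hypothetical stand-in for System 1: returns one motor command."""
    return (plan, tick)


def run(duration_s, instruction):
    """Simulate the loop for a whole number of seconds."""
    commands = []
    plan = None
    plans_made = 0
    for tick in range(duration_s * ACTION_HZ):
        if tick % TICKS_PER_PLAN == 0:
            # Refresh the slow plan only on every 12th action tick.
            plan = slow_planner(observation=tick, instruction=instruction)
            plans_made += 1
        # The fast controller always acts on the most recent plan.
        commands.append(fast_controller(plan, tick))
    return plans_made, commands


plans_made, commands = run(duration_s=1, instruction="pick up the cup")
```

With these rates, one simulated second yields 10 plan refreshes and 120 motor commands, so each plan is reused for 12 consecutive control ticks.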
The headline change in N1.5 is how the language model is treated during training. In N1, the vision-language module was tuned along with the rest of the model. In N1.5, NVIDIA froze the vision-language model during both pretraining and fine-tuning. Keeping the language backbone fixed preserved the grounding it already had and improved how reliably the robot followed written instructions. [2][3]
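What freezing means in practice can be shown with a toy optimizer step. The parameter groups, values, and gradients below are invented for illustration and have nothing to do with the real checkpoint. In PyTorch the same effect is usually obtained by calling `requires_grad_(False)` on the frozen module, but this sketch uses plain Python so it runs anywhere.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """Apply one SGD step, skipping every parameter group named in `frozen`."""
    updated = {}
    for group, values in params.items():
        if group in frozen:
            updated[group] = list(values)  # copied through unchanged
        else:
            updated[group] = [w - lr * g for w, g in zip(values, grads[group])]
    return updated


# Hypothetical two-group model: a vision-language backbone and an action head.
params = {"vision_language": [0.5, -1.0], "action_head": [0.3, 0.8]}
grads = {"vision_language": [1.0, 1.0], "action_head": [1.0, -2.0]}

# N1-style: everything is tuned together.
n1_style = sgd_step(params, grads, frozen=set())

# N1.5-style: the vision-language group is held fixed.
n15_style = sgd_step(params, grads, frozen={"vision_language"})
```

In the second call the action head receives exactly the same update as before, while the backbone weights come out identical to their starting values, which is how the pretrained grounding is preserved.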
N1.5 also upgraded the backbone itself to the Eagle 2.5 vision-language model, which NVIDIA reports has better visual grounding, meaning a tighter link between what the model sees and the actions it picks. The team simplified the adapter that connects the vision-language module to the action module, and it added a training objective called FLARE, short for future latent representation alignment. FLARE asks the model to predict future latent states, which lets it learn from human first-person video in addition to robot data, a useful way to scale up training material. [2][3]
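A future-latent alignment objective of this kind can be sketched in a few lines. The version below assumes the alignment is scored with cosine similarity between a predicted latent and the embedding of a future observation. That choice, the function names, and the example vectors are assumptions made for illustration, not a description of NVIDIA's exact loss.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(predicted_latent, future_latent):
    """1 - cosine similarity: 0 when aligned, 1 when orthogonal, 2 when opposite."""
    return 1.0 - cosine_similarity(predicted_latent, future_latent)


# Hypothetical embedding of a future camera frame.
future = [1.0, 2.0, 2.0]

# Two hypothetical predictions: one pointing the same way, one orthogonal.
good_prediction = [2.0, 4.0, 4.0]
poor_prediction = [2.0, -1.0, 0.0]

good_loss = alignment_loss(good_prediction, future)
poor_loss = alignment_loss(poor_prediction, future)
```

Note that the loss needs only observations and their embeddings, not motor commands. That is consistent with the point above that such an objective can draw on human first-person video, where no robot actions are recorded.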
NVIDIA reports concrete gains from these changes. On a GR-1 pick-and-place task, the rate at which the robot correctly followed the language command rose from 46.6 percent with N1 to 93.3 percent with N1.5. The ability to handle objects it had never seen, measured zero-shot, went from 0 percent to 15 percent. Across a set of 12 tasks generated with the DreamGen pipeline, success climbed from 13.1 percent with N1 to 38.3 percent with N1.5. The upgraded Eagle 2.5 backbone scored 40.4 on a grounding metric for the GR-1 robot, against 35.5 for a Qwen2.5-VL baseline, where a higher score means a tighter match between the language and the region of the image it refers to. The company also reports better data efficiency, so the model reaches good performance with fewer demonstrations, and improved results on new embodiments such as the low-cost SO-100 robotic arm. [2][3]
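The reported figures are easier to compare as absolute and relative changes. The short script below does only that arithmetic on the numbers quoted above. The dictionary keys and the `summarize` helper are names chosen for this example.

```python
# (N1, N1.5) pairs as reported by NVIDIA, in percent.
reported = {
    "GR-1 language following": (46.6, 93.3),
    "Zero-shot novel objects": (0.0, 15.0),
    "DreamGen 12-task success": (13.1, 38.3),
}


def summarize(before, after):
    """Return (gain in percentage points, ratio after/before or None)."""
    gain = round(after - before, 1)
    # A ratio is undefined when the baseline is zero.
    ratio = round(after / before, 2) if before > 0 else None
    return gain, ratio


summary = {name: summarize(*pair) for name, pair in reported.items()}

for name, (gain, ratio) in summary.items():
    ratio_text = f"{ratio}x" if ratio is not None else "n/a (baseline is 0)"
    print(f"{name}: +{gain} points, {ratio_text}")
```

Language following roughly doubles (a gain of 46.7 points), and DreamGen task success rises by 25.2 points to about 2.92 times the N1 figure. The zero-shot result has no meaningful ratio because the baseline is zero.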
The table below summarizes the main differences as NVIDIA describes them.
| Aspect | GR00T N1 | GR00T N1.5 |
|---|---|---|
| Announced | GTC, March 2025 | Computex, May 2025 |
| Vision-language backbone | Eagle-2 with SmolLM2 | Eagle 2.5 |
| Language model during training | Tuned | Frozen |
| Extra training objective | None reported | FLARE future latent alignment |
| Reported language following on GR-1 pick-and-place | 46.6 percent | 93.3 percent |
| Distributed parameter size | About 2 billion to 2.2 billion | About 3 billion |
| Action module | Diffusion transformer with flow matching | Diffusion transformer with flow matching |
A large part of the GR00T story is about data rather than the network itself. NVIDIA organizes training material as a pyramid, with broad internet-scale video at the base, simulation and synthetic data in the middle, and real robot teleoperation at the top. The two data tools that feed this pyramid are Isaac GR00T-Mimic and GR00T-Dreams. [2][5]
Isaac GR00T-Mimic takes a small number of human demonstrations and expands them into a much larger set of synthetic motion trajectories inside Isaac Sim and Isaac Lab, NVIDIA's robotics simulation and reinforcement learning environments. GR00T-Dreams is a blueprint that generates what NVIDIA calls neural trajectories, which are synthetic action sequences produced with the help of the Cosmos world foundation models. GR00T-Dreams is meant to teach a robot new skills and help it adapt to new settings without collecting fresh teleoperation data for every case. NVIDIA reports that it used this synthetic data, generated with the DreamGen method behind GR00T-Dreams, to develop GR00T N1.5 in about 36 hours of data generation, work it says would have taken close to three months by manual collection. [2][4]
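The idea of expanding a handful of demonstrations into many synthetic trajectories can be shown with a deliberately simple toy. This is not the GR00T-Mimic algorithm, which runs inside a physics simulator. The sketch only translates 2D waypoint paths so they end at new, randomly placed object positions. The functions `retarget` and `expand`, the demonstrations, and the sampling range are all hypothetical.

```python
import random


def retarget(demo, old_object, new_object):
    """Shift every waypoint by the object's displacement (translation only)."""
    dx = new_object[0] - old_object[0]
    dy = new_object[1] - old_object[1]
    return [(x + dx, y + dy) for x, y in demo]


def expand(demos, variants_per_demo, seed=0):
    """Create several retargeted copies of each (path, object position) pair."""
    rng = random.Random(seed)
    synthetic = []
    for demo, obj in demos:
        for _ in range(variants_per_demo):
            new_obj = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
            synthetic.append((retarget(demo, obj, new_obj), new_obj))
    return synthetic


# Two hypothetical demonstrations: a waypoint path ending at the object,
# paired with that object's position.
demos = [
    ([(0.0, 0.0), (0.2, 0.1), (0.5, 0.5)], (0.5, 0.5)),
    ([(0.0, 0.0), (-0.1, 0.3), (-0.4, 0.6)], (-0.4, 0.6)),
]

synthetic = expand(demos, variants_per_demo=50)
```

Two demonstrations become 100 trajectories here. A realistic pipeline would also have to check each generated trajectory for physical validity, for example by replaying it in simulation and keeping only the successful ones.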
The original GR00T N1 technical report describes evaluations in simulation and on a real Fourier GR-1 humanoid, where the dual-system model outperformed an imitation-learning baseline across a set of manipulation tasks. The report presents GR00T N1 as a strong starting point for imitation learning and robot learning rather than a finished product for any single robot. [5]
For N1.5, the public results center on the instruction-following jump on GR-1 pick-and-place, the DreamGen task gains, the data-efficiency improvements, and the better transfer to the SO-100 arm and to new objects. These numbers come from NVIDIA's own blog and model documentation, and as of the 2025 release they had not been independently reproduced at scale by outside groups, so they are best read as vendor-reported figures. NVIDIA listed a broad set of humanoid and robotics developers as part of the ecosystem around the Isaac platform and the GR00T models, including 1X Technologies, Agility Robotics, Boston Dynamics, Fourier, Galbot, Mentee Robotics, Neura Robotics, Skild AI, and XPENG Robotics. [2][3][6]
GR00T N1.5 is published on Hugging Face as GR00T-N1.5-3B, and the code lives in the public NVIDIA Isaac-GR00T repository on GitHub. The weights for this version are released under an NVIDIA non-commercial license, so the checkpoint can be downloaded and studied but not used in a commercial product without separate terms. The accompanying code in the repository is offered under the Apache 2.0 license. NVIDIA later moved newer checkpoints in the same family, starting with the N1.7 release, to the Apache 2.0 license for the weights as well, but the N1.5 weights themselves remain under the non-commercial terms. The repository provides scripts for fine-tuning the model on a new dataset and for running inference, built on the usual PyTorch tooling and distributed through Hugging Face. Running and tuning the model in practice depends on having a capable NVIDIA GPU. [3][7]
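Fetching the checkpoint files might look like the sketch below. It assumes the `huggingface_hub` package is installed and that the model sits under the `nvidia` organization. The article gives only the model name, so the organization prefix and the local directory are assumptions. The function is defined but not called, since the download is large and the non-commercial license terms should be read first.

```python
# Assumed repository id: the "nvidia/" prefix is not stated in the article.
REPO_ID = "nvidia/GR00T-N1.5-3B"


def download_checkpoint(local_dir="checkpoints/GR00T-N1.5-3B"):
    """Download all files of the checkpoint and return the local path.

    The import is kept inside the function so this module loads even when
    huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_checkpoint()` would place the files in the chosen directory. The fine-tuning and inference scripts themselves come from the Isaac-GR00T repository, and their exact entry points should be taken from that repository's documentation rather than from this sketch.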
GR00T N1.5 matters mostly because of what it tries to standardize. Before the GR00T releases, most humanoid control policies were built per robot and per lab, which made progress hard to share. By shipping an open transformer-based multimodal model with weights, training code, and a data pipeline, NVIDIA gave the field a common reference point that smaller teams can build on instead of starting from scratch. The cross-embodiment design and the heavy use of synthetic data target the central bottleneck in humanoid robotics, which is the cost of collecting enough real-world demonstrations. [1][2]
N1.5 also fits a broader pattern in robot learning toward large pretrained models that follow language instructions, a direction shared by other vision-language-action systems. NVIDIA's particular bet is that a model trained across many robots and grounded in a strong vision-language backbone will transfer better to each new humanoid than a model trained narrowly on one. [2][5]
The model is an early foundation layer, not a turnkey controller. The strongest reported numbers, such as the rise in instruction following, come from specific pick-and-place settings and from NVIDIA's own testing, so they may not carry over to harder long-horizon tasks or to cluttered real homes. The training recipe also assumes a steady supply of synthetic data and simulation, and the quality of a fine-tuned policy still depends on how well that data matches the target robot and environment. [2][5]
There are practical constraints as well. The model is tuned for NVIDIA hardware and the Isaac stack, which ties the most efficient workflows to NVIDIA GPUs and software. The N1.5 weights sit under a non-commercial license rather than a standard open-source one, which limits direct commercial deployment of this specific checkpoint, so teams that need to ship a product should read the terms or look to the later Apache-licensed releases. And like all current humanoid foundation models, GR00T N1.5 is aimed mainly at manipulation and reasoning over short skills, not yet at fully reliable general-purpose autonomy. [3][5]