NVIDIA Isaac GR00T N1
Last reviewed
May 16, 2026
Sources
19 citations
Review status
Source-backed
Revision
v1 · 3,991 words
NVIDIA Isaac GR00T N1 is an open foundation model for humanoid robots developed by NVIDIA and unveiled by Jensen Huang on March 18, 2025 at the company's annual GTC conference in San Jose, California. NVIDIA describes it as the world's first open, fully customizable foundation model for generalized humanoid reasoning and skills, and the first of a family of pretrained checkpoints released to the wider robotics community [1][2]. The base GR00T N1 model has roughly 2 billion parameters and uses a dual-system architecture inspired by Daniel Kahneman's split between fast and slow thinking, pairing a vision-language model (VLM) backbone with a flow-matching diffusion transformer that produces continuous-value motor actions; together the two form a Vision-Language-Action (VLA) model [3][4].
GR00T N1 was released alongside its training data and benchmark tasks on Hugging Face and GitHub on March 17 to 18, 2025, and the project has since received three major updates: N1.5 in June 2025, N1.6 in September 2025, and N1.7 in April 2026. The successive versions changed the vision-language model backbone twice (from Eagle 2 to Eagle 2.5 to Cosmos-Reason variants), doubled the size of the action transformer, replaced absolute joint targets with state-relative action chunks, and expanded training data from a few thousand hours of teleoperation to more than 20,000 hours of human egocentric video [5][6][7]. The model weights are distributed under the NVIDIA Open Model License Agreement, which permits commercial use with attribution, and the surrounding code is licensed under Apache 2.0 [8].
NVIDIA's interest in humanoid robotics predates GR00T N1 by exactly one year. At GTC 2024 on March 18, 2024, Jensen Huang devoted a portion of his keynote to a project called GR00T, an acronym for Generalist Robot 00 Technology. The initial framing was that humanoids were the most exciting open problem in AI and that NVIDIA would attempt to play the role of a horizontal platform supplier across competing humanoid programs rather than building its own robot. The 2024 announcement named 1X Technologies, Agility Robotics, Apptronik, Boston Dynamics, Figure AI, Fourier Intelligence, Sanctuary AI, Unitree Robotics, and XPENG Robotics as early collaborators on the project [9].
The 2024 announcement was vague on what GR00T actually contained, and IEEE Spectrum's coverage by Evan Ackerman noted that the demonstrations were mostly aspirational and that fundamental questions, including whether the foundation model was trained on real robot data or on simulation, had not been answered [10]. Over the following year NVIDIA filled in the gap with a string of smaller releases on the Isaac platform: Isaac Lab for parallel reinforcement learning, OSMO for compute orchestration, Isaac Manipulator and Isaac Perceptor for robot arms and mobile robots, and the GR00T-Mimic synthetic data blueprint. By the time Huang took the stage at GTC 2025 in March, the company had a real model to put behind the GR00T name.
The broader strategic context is also worth keeping in mind. By early 2025 the humanoid race had crowded considerably. Tesla had begun shipping Optimus prototypes inside its factories, Figure 02 and Apptronik Apollo were running pilot deployments at BMW and Mercedes-Benz respectively, and Chinese makers such as Unitree, XPENG, and Fourier had pushed prices for entry-level humanoids below 20,000 USD. NVIDIA's pitch with GR00T N1 was that none of these companies wanted to build a generalist brain in-house and that a shared foundation model, trained on data from many embodiments and shipped with simulation tooling, would let them focus on hardware and on the last mile of task fine-tuning [2].
GR00T N1 is structured as a Vision-Language-Action model with two coupled networks. The vision-language module, labeled System 2 in NVIDIA's documentation, interprets camera images and natural-language instructions and produces tokens that summarize scene understanding and high-level intent. The diffusion module, labeled System 1, takes those tokens and the robot's current proprioceptive state and denoises them into continuous motor commands. The labels are a deliberate reference to Daniel Kahneman's Thinking, Fast and Slow, with System 2 doing deliberate reasoning and System 1 producing reflexive motion [3].
In the original N1 release, the vision-language module is built on Eagle 2, an open vision-language model that pairs a SigLIP-2 image encoder with a small SmolLM2 language model. Images and language are encoded into a shared token sequence and run through the VLM transformer to produce per-token embeddings that act as conditioning signals for the action head [3][4]. The proprioceptive state of the robot, including joint positions, joint velocities, and end-effector poses, is encoded by a separate multilayer perceptron whose weights are indexed by an embodiment tag, so one model can serve different robot bodies with their own joint counts and limb layouts.
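The embodiment-indexed state encoder can be illustrated with a short sketch. This is not NVIDIA's implementation: the class name `EmbodimentStateEncoder`, the tags, the state sizes, and the embedding width are all hypothetical, and a single linear layer with a ReLU stands in for the full MLP. The point is only that each tag selects its own weights while every robot maps into the same embedding width.

```python
import random


class EmbodimentStateEncoder:
    """Toy per-embodiment state encoder (hypothetical, for illustration only).

    Each embodiment tag owns its own weight matrix, so robots with different
    state dimensions all map into one shared embedding width.
    """

    def __init__(self, state_dims, embed_dim, seed=0):
        rng = random.Random(seed)
        self.embed_dim = embed_dim
        # One (embed_dim x state_dim) weight matrix and one bias vector per tag.
        self.weights = {
            tag: [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(embed_dim)]
            for tag, dim in state_dims.items()
        }
        self.biases = {tag: [0.0] * embed_dim for tag in state_dims}

    def encode(self, tag, state):
        w, b = self.weights[tag], self.biases[tag]
        if len(state) != len(w[0]):
            raise ValueError(f"{tag} expects {len(w[0])} state values, got {len(state)}")
        # Linear layer followed by ReLU; a real MLP would stack several of these.
        return [
            max(0.0, sum(wi * si for wi, si in zip(row, state)) + bi)
            for row, bi in zip(w, b)
        ]


# Hypothetical tags and state sizes, not taken from the GR00T code base.
encoder = EmbodimentStateEncoder({"humanoid_a": 44, "arm_b": 8}, embed_dim=16)
humanoid_embedding = encoder.encode("humanoid_a", [0.0] * 44)
arm_embedding = encoder.encode("arm_b", [0.1] * 8)
```

Both calls return a 16-element vector even though the two robots report different numbers of state values, which is what lets a shared action head consume either one.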
The action head is a diffusion transformer trained with flow matching rather than the more common DDPM objective. During training, ground-truth action chunks are corrupted by random interpolation between the clean action and Gaussian noise, and the network learns to predict the velocity that points from the noise toward the clean action; at inference, the model starts from Gaussian noise and repeatedly steps along its predicted velocity until it reaches a clean trajectory. The transformer interleaves self-attention over proprioception and action tokens with cross-attention to the vision-language embeddings, and diffusion step conditioning is handled through adaptive layer normalization, a common pattern in DiT-style image generators [4]. The output is a sequence of action vectors mapped to the relevant robot's degrees of freedom, decoded by another per-embodiment MLP.
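The flow-matching recipe can be sketched in a few lines. The sketch assumes one common convention, which may differ in sign or direction from the one in NVIDIA's code: the noisy sample is `x_t = (1 - t) * noise + t * action`, the regression target is `action - noise`, and sampling uses plain Euler steps from `t = 0` to `t = 1`. The function names are hypothetical, and an oracle that already knows the clean action stands in for the trained transformer so the example runs without any learning.

```python
import random


def make_training_pair(action, rng):
    """Build one flow-matching training example for an action chunk.

    Assumed convention: x_t = (1 - t) * noise + t * action,
    so the network's regression target is velocity = action - noise.
    """
    t = rng.random()
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return t, x_t, target_velocity


def sample_actions(velocity_fn, dim, num_steps, rng):
    """Integrate the velocity field from t=0 (pure noise) to t=1 with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Stand-in for the trained transformer: an oracle that knows the clean action.
clean_action = [0.5, -0.25, 1.0, 0.0]


def oracle_velocity(x, t):
    # For a straight-line path the ideal velocity is (action - x_t) / (1 - t).
    return [(a - xi) / (1.0 - t) for a, xi in zip(clean_action, x)]


rng = random.Random(0)
t, x_t, target_velocity = make_training_pair(clean_action, rng)
denoised = sample_actions(oracle_velocity, dim=len(clean_action), num_steps=4, rng=rng)
```

With the oracle, four Euler steps recover the clean action up to floating-point error. A trained network only approximates that velocity field, so the number of integration steps becomes a trade-off between latency and accuracy.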
The design borrows several ideas from earlier VLA models, most notably Physical Intelligence's π0 and ALOHA-style imitation learning systems built around action chunking. What is distinct about GR00T N1 is that the entire stack is open and that NVIDIA explicitly designed it for cross-embodiment use: a single checkpoint can be fine-tuned on a Fourier GR-1 humanoid, a Franka Panda arm, a Galaxea R1 mobile manipulator, or a WidowX research arm without changing the core model code [3][8].
The data strategy for GR00T N1 is what NVIDIA calls a heterogeneous mixture. The model is trained on three classes of trajectories: real robot teleoperation collected on Fourier GR-1 humanoids and other partner platforms; egocentric video of humans performing manipulation tasks; and synthetic robot data generated in Isaac Lab with the GR00T-Mimic and GR00T-Dreams blueprints. NVIDIA reports that 780,000 synthetic trajectories were produced in 11 hours of GPU time using GR00T-Mimic, which the company says is equivalent to about nine months of human teleoperation, and that adding synthetic data to a real-data baseline improved performance on internal benchmarks by roughly 40 percent [1][2].
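The synthetic-data figures can be sanity-checked with a little arithmetic. The calculation below assumes that "nine months" means continuous collection over 30-day months and that the 11 hours is elapsed generation time; both readings are assumptions rather than anything the sources spell out.

```python
synthetic_trajectories = 780_000
generation_hours = 11
# Assumption: nine months of continuous collection, using 30-day months.
human_equivalent_hours = 9 * 30 * 24

trajectories_per_hour = synthetic_trajectories / generation_hours
seconds_per_trajectory = human_equivalent_hours * 3600 / synthetic_trajectories
speedup = human_equivalent_hours / generation_hours

print(f"{trajectories_per_hour:,.0f} trajectories generated per hour")
print(f"{seconds_per_trajectory:.1f} s of human time implied per trajectory")
print(f"{speedup:.0f}x faster than the human-equivalent collection time")
```

Under those assumptions the claim implies roughly 30 seconds of human demonstration per trajectory and a generation rate of about 71,000 trajectories per hour, which is consistent with short pick-and-place style episodes.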
GR00T-Mimic, released alongside N1 in March 2025, is a teleoperation amplifier. A human operator wearing an Apple Vision Pro headset teleoperates a simulated robot in NVIDIA Isaac Lab, and GR00T-Mimic takes a small number of those demonstrations and procedurally generates many more variants by randomizing scene geometry, lighting, friction, and object placements. GR00T-Dreams, announced two months later at Computex 2025, goes further: it uses the NVIDIA Cosmos family of world foundation models, specifically Cosmos Predict and Cosmos Reason, to generate completely novel manipulation trajectories from a single image and language prompt rather than augmenting an existing demonstration [11][12].
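The amplification idea behind GR00T-Mimic can be sketched as simple domain randomization. This is a toy illustration only: `SceneParams`, `randomize_scene`, `amplify`, and the perturbation ranges are hypothetical, it covers lighting, friction, and object placement but not scene geometry, and the real blueprint also regenerates robot trajectories in simulation, which this sketch does not attempt.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneParams:
    """Hypothetical scene description attached to one demonstration."""
    light_intensity: float
    friction: float
    object_xy: tuple


def randomize_scene(base, rng):
    """Return a perturbed copy of the scene; ranges are illustrative guesses."""
    return replace(
        base,
        light_intensity=base.light_intensity * rng.uniform(0.5, 1.5),
        friction=min(1.0, max(0.1, base.friction + rng.uniform(-0.2, 0.2))),
        object_xy=(
            base.object_xy[0] + rng.uniform(-0.05, 0.05),
            base.object_xy[1] + rng.uniform(-0.05, 0.05),
        ),
    )


def amplify(demo_scenes, variants_per_demo, seed=0):
    """Turn a handful of demonstration scenes into many randomized variants."""
    rng = random.Random(seed)
    return [
        randomize_scene(scene, rng)
        for scene in demo_scenes
        for _ in range(variants_per_demo)
    ]


base_scenes = [
    SceneParams(light_intensity=1.0, friction=0.6, object_xy=(0.30, 0.10)),
    SceneParams(light_intensity=0.8, friction=0.4, object_xy=(0.25, -0.05)),
]
variants = amplify(base_scenes, variants_per_demo=100)
```

Two demonstrations become 200 scene variants here. Seeding the generator keeps the synthetic set reproducible, which matters when comparing training runs.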
The full pretraining run for the first N1 checkpoint was carried out on H100 GPUs. NVIDIA has not published the exact compute budget for the original N1 release, but the more detailed N1.5 model card lists 250,000 steps on roughly 1,000 H100s with a global batch size of 16,384 tokens, which gives a rough sense of the scale at which subsequent versions have been trained [5]. Embodiment-specific post-training is typically much cheaper, on the order of 10,000 to 30,000 steps with a smaller batch, and NVIDIA recommends starting from as few as 20 to 40 demonstrations of a new task.
The GR00T N series has gone through four numbered releases, counting the original N1. Each update swapped or upgraded major architectural components while keeping the embodiment tags and dataset format stable so existing post-training pipelines kept working.
| Version | Release date | Headline changes |
|---|---|---|
| GR00T N1 | March 17 to 18, 2025 (GTC) | Initial 2B-parameter VLA on Eagle 2 VLM and a 16-layer flow-matching DiT, trained on GR-1 teleop, human egocentric video, and Isaac Lab synthetic data [1][3] |
| GR00T N1.5 | June 11, 2025 | VLM upgraded to Eagle 2.5 (2.1B parameters) and frozen during training, simplified MLP adapter with layer norm, new Future Latent Representation Alignment (FLARE) objective added to flow matching, large jumps in language following on real GR-1 (46.6 to 93.3 percent) [5][13] |
| GR00T N1.6 | September 29, 2025 | VLM replaced with an internal Cosmos-Reason-2B variant supporting native aspect ratios, DiT doubled in size to 32 layers, action space switched to state-relative action chunks for most embodiments, top 4 VLM layers unfrozen during pretraining [6][14] |
| GR00T N1.7 | April 17, 2026 (Early Access) | VLM upgraded to Cosmos-Reason2-2B, EgoScale pretraining on 20,854 hours of human egocentric video across 20+ task categories, relative end-effector action space shared between humans and robots, support for 22 degree-of-freedom dexterous hands [7][8] |
The N1.5 update is interesting because of how it was made rather than what it added. NVIDIA Research reported that the entire N1.5 development cycle, including new data generation, took about 36 hours of wall-clock time using the GR00T-Dreams blueprint. The same work, the team argued, would have taken close to three months with manual teleoperation collection. That figure became one of the central talking points at Computex 2025 and has been repeated by Huang in several subsequent keynotes as evidence that synthetic data has reached a tipping point in robotics [11][12].
N1.6 is the first version to replace the Eagle backbone with a Cosmos-Reason model at the front of the stack. NVIDIA's Cosmos-Reason model family was originally released for video reasoning and scene description, but a 2-billion parameter variant of Cosmos-Reason was repurposed as the System 2 component in N1.6, which let the team support flexible image resolutions natively and unfreeze the top layers of the VLM during pretraining without destabilizing training [6]. The action head was doubled in depth and switched to predicting state-relative chunks rather than absolute joint targets, which the team says produces less jittery motion and adapts better to imperfect starting positions.
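The difference between absolute targets and state-relative chunks can be shown with a small sketch. It assumes that "state-relative" means each step in the chunk is stored as an offset from the joint state at the start of the chunk; the sources do not give the exact definition, and the function names are hypothetical.

```python
def to_state_relative(current_state, absolute_chunk):
    """Express every target in the chunk as an offset from the current joint state."""
    return [
        [target - q for target, q in zip(step, current_state)]
        for step in absolute_chunk
    ]


def to_absolute(current_state, relative_chunk):
    """Recover joint targets by adding the offsets back onto a joint state."""
    return [
        [delta + q for delta, q in zip(step, current_state)]
        for step in relative_chunk
    ]


state = [0.10, -0.50, 1.20]
absolute_chunk = [[0.12, -0.48, 1.20], [0.15, -0.45, 1.18]]
relative_chunk = to_state_relative(state, absolute_chunk)

# Replaying the same offsets from a slightly different start keeps the motion
# shape but shifts the targets along with the robot.
shifted_state = [q + 0.03 for q in state]
replayed = to_absolute(shifted_state, relative_chunk)
```

Because the chunk encodes motion rather than destinations, a robot that starts a few hundredths of a radian off simply performs the same motion from where it is, instead of jumping toward the originally recorded targets.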
N1.7 then pushed in two directions: deeper reasoning by upgrading the VLM to Cosmos-Reason2-2B, and dramatically more pretraining data by adding 20,854 hours of human egocentric video. NVIDIA calls this EgoScale pretraining, and the accompanying technical blog reports what the team describes as the first scaling law for robot dexterity: average task completion rates rise approximately linearly when the egocentric video budget goes from one thousand to twenty thousand hours, with a roughly 2x improvement across that range [7]. N1.7 also introduces a relative end-effector action space that is shared between human videos and robot bodies, which is what makes large-scale egocentric video usable as training data in the first place.
NVIDIA positions GR00T N1 and its successors as generalist robot models that can pick objects up, put them down, transfer items between hands, follow language instructions, and chain those primitives into longer multi-step tasks. The reference scenarios in the launch materials are warehouse-style: bin picking, sorting, packaging, kitting, and inspection. The internal benchmarks the company reports against include RoboCasa (24 simulated mobile manipulation tasks), Digital Cousin GR-1 (24 GR-1 humanoid manipulation tasks), Language Table, DexMG (dexterous manipulation), and DreamGen (12 new manipulation verbs introduced specifically to stress generalization) [3][5].
The original N1 paper reported that GR00T N1 outperformed several state-of-the-art imitation learning baselines, including ACT, Diffusion Policy, and OpenVLA, on these benchmarks when fine-tuned with comparable amounts of data, and that the model showed reasonable zero-shot transfer to embodiments it had seen during pretraining [3]. With N1.5, NVIDIA reported that the success rate on Language Table climbed from 52.8 percent to 93.2 percent, that real GR-1 language following went from 46.6 percent to 93.3 percent, and that RoboCasa improved from 17.4 percent to 47.5 percent with 30 demonstrations. The DreamGen benchmark, which is designed to test new verbs, went from 13.1 percent to 38.3 percent [5][13].
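The N1 to N1.5 figures are easier to compare as absolute and relative gains. The short calculation below uses only the percentages quoted above; the dictionary keys and the `gains` helper are illustrative names.

```python
# Success rates in percent, as (N1, N1.5) pairs taken from the text above.
n1_vs_n15 = {
    "Language Table": (52.8, 93.2),
    "Real GR-1 language following": (46.6, 93.3),
    "RoboCasa (30 demos)": (17.4, 47.5),
    "DreamGen": (13.1, 38.3),
}


def gains(before, after):
    """Return (percentage-point gain, multiplicative gain), both rounded."""
    return round(after - before, 1), round(after / before, 2)


summary = {name: gains(*pair) for name, pair in n1_vs_n15.items()}

for name, (points, ratio) in summary.items():
    print(f"{name}: +{points} points, {ratio}x")
```

The largest percentage-point jump is in real GR-1 language following (46.7 points), while the largest relative jumps are on the benchmarks that started lowest, DreamGen (about 2.9x) and RoboCasa (about 2.7x).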
Live demos at GTC and Computex tended to feature 1X Neo and Fourier GR-1 humanoids running short manipulation sequences in lightly cluttered environments rather than fully autonomous open-ended tasks. The most-cited public demo from GTC 2025 used a Disney BDX-style robot, similar in form factor to the Star Wars droids that the BDX series is modeled on, walking onto the stage with Huang as a deliberate reference to the project name. The on-stage segment was a tele-controlled demonstration rather than full autonomy, a fact that NVIDIA was upfront about in its press materials [1][10].
The public partner list has shifted over the four releases. The 2024 Project GR00T launch named nine humanoid programs as collaborators. The March 2025 GR00T N1 announcement listed 1X Technologies, Agility Robotics, Boston Dynamics, Mentee Robotics, and NEURA Robotics as the early-access partners with hands-on integration of the N1 weights. By Computex 2025, the list of companies adopting Isaac and GR00T technologies had expanded to include Fourier, Foxlink, Galbot, General Robotics, Skild AI, XPENG Robotics, AeiRobot, and Lightwheel in addition to the original partners. By GTC fall 2025, when N1.6 was announced, NVIDIA was naming Figure AI, Franka Robotics, Hexagon, Solomon, and Techman Robot as additional adopters of Isaac Lab and Cosmos tooling [11][14].
It is worth distinguishing two kinds of partnership. The first is companies that explicitly use GR00T N1 weights as a starting point for their robots' policies, which from public statements appears to be a smaller and more academic group, currently led by Fourier Intelligence (whose GR-1 humanoid is NVIDIA's main internal test platform), 1X Technologies, Agility Robotics, and Mentee Robotics. The second is companies that use the broader NVIDIA Isaac stack, including Isaac Sim for simulation, Isaac Lab for reinforcement learning training, Newton for physics, and Jetson Thor as the on-robot inference computer, without necessarily building on top of GR00T weights. The second group is much larger and includes most of the major humanoid programs, including Figure AI, whose own Helix model is a competing in-house VLA system [1][11][14].
| Robot platform | Maker | Relationship to GR00T |
|---|---|---|
| Fourier GR-1 | Fourier Intelligence | Primary internal test platform; benchmarks for every release run on GR-1 [1][5] |
| 1X Neo | 1X Technologies | Early-access partner since N1; collaborates on Cosmos and Isaac Lab integration [1][11] |
| Agility Digit | Agility Robotics | Early-access partner; commercially deployed at GXO warehouses on a separate policy stack [1][9] |
| Boston Dynamics Atlas Electric | Boston Dynamics | Project GR00T collaborator since 2024; uses Isaac Sim and Cosmos [9][14] |
| Mentee MenteeBot | Mentee Robotics | Early-access partner for N1 [1] |
| NEURA 4NE-1 | NEURA Robotics | Early-access partner for N1 [1] |
| Galaxea R1 Pro | Galaxea | Included in N1.6 training mix via BEHAVIOR suite [6] |
| Unitree G1 | Unitree Robotics | Included in N1.6 and N1.7 training mix [6][7] |
| AGIBot Genie 1 | AGIBot | Included in N1.6 and N1.7 training mix [6][7] |
| Bimanual YAM | Open hardware | Included in N1.6 and N1.7 training mix [6][7] |
| Franka Panda | Franka Robotics | Supported via LIBERO benchmark and DROID dataset checkpoints [8] |
Figure AI is named as an Isaac Lab adopter but its own VLA work runs on a separate stack. Tesla has not been publicly involved with GR00T at any point, and Apple, Samsung, and the major Korean and Japanese robotics firms have not been listed as partners either, although NVIDIA Jetson hardware shows up in many of those programs.
GR00T N1 is one piece of a wider NVIDIA stack for what the company has started calling physical AI. The most important sibling components are Isaac Sim, Isaac Lab, Cosmos, Newton, GR00T-Mimic, GR00T-Dreams, and Jetson Thor.
Isaac Sim is the underlying simulation environment, built on Omniverse and capable of running thousands of simulated robots in parallel on a single GPU node. Isaac Lab is the higher-level reinforcement learning and imitation learning framework that uses Isaac Sim as its physics backend, and Isaac Lab 2.3 added a dexterous grasping workflow specifically aimed at humanoid hands and a policy evaluation framework called Isaac Lab Arena [14]. Newton, announced jointly by NVIDIA, Google DeepMind, and Disney Research at GTC 2025, is an open-source GPU-accelerated physics engine designed to be more accurate than the older MuJoCo and PhysX simulators on contact-rich manipulation. NVIDIA also released MuJoCo-Warp around the same time, claiming a 70x speedup on robotics machine learning workloads compared with the reference CPU MuJoCo implementation [1][14].
GR00T-Mimic and GR00T-Dreams are the two synthetic data blueprints discussed earlier. GR00T-Mimic amplifies a small number of human teleoperation demonstrations into a much larger synthetic set inside Isaac Lab; GR00T-Dreams generates entirely new trajectories from a single image and a text prompt using Cosmos world models. Cosmos itself is a family of NVIDIA foundation models for physical AI, separate from GR00T, that includes Cosmos Predict (a world model that predicts future video frames from past frames), Cosmos Reason (a reasoning VLM for scene description and synthetic data curation), and Cosmos Transfer (a sim-to-real photorealism model). Cosmos Predict 2.5 and Cosmos Transfer 2.5 were released alongside N1.6, with claims of longer 30-second video horizons and a 3.5x smaller Transfer model [12][14].
On the inference side, Jetson AGX Thor is the on-robot computer that NVIDIA sells for running the complete GR00T stack at runtime. Thor is a Blackwell-class chip in a small form factor, and NVIDIA recommends running the System 1 diffusion head locally on Thor while optionally offloading System 2 reasoning to a nearby cloud GPU when latency budgets allow. The base GR00T N1.7 model supports NVIDIA Ampere, Hopper, Lovelace, Blackwell, and Jetson hardware, which covers basically the full generational range that NVIDIA currently sells [7][11].
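The split between a fast local action head and a slower, optionally remote reasoning module amounts to a dual-rate control loop. The sketch below is a hypothetical illustration: the function names, the 12-to-1 rate ratio, and the stand-in callables are assumptions and do not come from NVIDIA's runtime.

```python
def run_control_loop(num_ticks, system2_period, system2, system1):
    """Call the slow planner every `system2_period` ticks and the fast policy every tick."""
    context = None
    actions = []
    system2_calls = 0
    for tick in range(num_ticks):
        if tick % system2_period == 0:
            # Slow path: could be a network call to a nearby GPU.
            context = system2(tick)
            system2_calls += 1
        # Fast path: stays on the robot and reuses the latest context.
        actions.append(system1(context, tick))
    return actions, system2_calls


def fake_system2(tick):
    """Stand-in for vision-language reasoning; records when the plan was made."""
    return {"plan_from_tick": tick}


def fake_system1(context, tick):
    """Stand-in for the action head; returns (plan age marker, current tick)."""
    return (context["plan_from_tick"], tick)


actions, system2_calls = run_control_loop(
    num_ticks=24, system2_period=12, system2=fake_system2, system1=fake_system1
)
```

Over 24 ticks the slow module runs twice while the fast module runs every tick, so the action head keeps producing commands from the most recent plan even when the reasoning step is slow or remote.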
GR00T N1 was the first major robot foundation model to be released under what is effectively a commercial open license. The model weights ship under the NVIDIA Open Model License Agreement, which permits commercial use with attribution and a small set of restrictions around model identification and acceptable use. The surrounding training, inference, and fine-tuning code in the NVIDIA/Isaac-GR00T GitHub repository is licensed under Apache 2.0 [8].
The one wrinkle is that the original GR00T-N1-2B model card on Hugging Face was initially published under NVIDIA's older non-commercial license before being relicensed to the Open Model License Agreement. Later checkpoints, including N1.5, N1.6, and N1.7, were published under the Open Model License from day one. NVIDIA also publishes evaluation datasets and synthetic training corpora, most prominently the PhysicalAI-Robotics-GR00T-X-Embodiment-Sim dataset on Hugging Face, under release-specific data licenses that are generally permissive for research and commercial use [4][8].
The licensing posture is the obvious contrast with the competition. Tesla's Optimus stack is closed; Figure's Helix is closed; Physical Intelligence released checkpoints for some of its earlier models including π0 and π0.5, but its newer policies are gated; and most academic VLAs, including OpenVLA, are open under research licenses that complicate commercial deployment. GR00T is the one mainstream commercial humanoid VLA that anyone can download, fine-tune, and ship in a product, which is much of the reason it has been so widely adopted as a starting point in the field [2][8].
Reaction to GR00T N1 has been broadly positive but not uncritical. The initial 2024 Project GR00T announcement was widely seen as more aspirational than substantive; IEEE Spectrum and several other outlets pointed out that the demos were largely tele-operated and that the underlying model was not yet public [10]. The March 2025 N1 release, with weights, training data, and benchmarks on Hugging Face and GitHub, changed that conversation substantially. The Robot Report, Hackster.io, and most of the major robotics newsletters treated the release as a serious technical contribution and the first concrete evidence that NVIDIA was willing to commit to the platform role it had described a year earlier [2][12].
Within the academic robot learning community, the response has been more nuanced. The dual-system VLM plus diffusion-head architecture is not new (Physical Intelligence's π0 had used a similar split a few months earlier), and several researchers noted that GR00T N1's benchmark wins were narrow or depended on data mixes that overlapped with the test sets. The successive updates have addressed some of those criticisms. N1.7's EgoScale pretraining and the dexterity scaling law in particular have been treated as one of the more interesting empirical results in robot learning in 2026, since they suggest that adding more human video produces predictable improvements rather than diminishing returns [7][15].
The deeper structural reaction has been about NVIDIA's positioning. The company is simultaneously the dominant supplier of training hardware (H100 and Blackwell GPUs), the dominant supplier of inference hardware for robots (Jetson Thor), the publisher of the leading open simulation stack (Isaac), and now the publisher of the leading open foundation model. That vertical position worries some observers, who note that even competitors who would prefer not to standardize on NVIDIA tooling have very limited alternatives, especially for the simulation and synthetic data half of the stack. Others view it as a useful counterweight to closed efforts by Tesla, Figure, and Physical Intelligence, especially given that GR00T weights are genuinely downloadable rather than just "open" in the more diluted sense some other large companies use [2][11].
The model has also become a common starting point in research papers. By early 2026 dozens of arXiv submissions cited GR00T N1 or N1.5 as a base, and several follow-on systems including SmolVLA and various open-source clones from university labs were explicitly framed as smaller or specialized variants. Whether GR00T N1 ends up being remembered as the BERT moment for humanoid robotics or just as a useful intermediate step depends mainly on whether the deployment claims attached to it, particularly around warehouse and manufacturing labor, hold up in production over the next several years.