NVIDIA Cosmos 3
Last reviewed
Jun 2, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,390 words
NVIDIA Cosmos 3 is an open family of "world foundation models" for physical AI that NVIDIA announced on June 1, 2026 at GTC Taipei, held alongside COMPUTEX 2026. NVIDIA describes it as the first fully open "omnimodel," meaning a single model that can both reason about the physical world and generate it across text, images, video, ambient sound and action. Cosmos 3 is built on a Mixture-of-Transformers (MoT) architecture that pairs a reasoning transformer with a generation transformer, so the model can understand object interactions, motion and spatial-temporal relationships before it generates video and action trajectories.[1][2][3] It is the third major generation of the NVIDIA Cosmos platform, which NVIDIA positions as the data and simulation layer for robots, autonomous vehicles and other machines that perceive and act in the real world.
The launch consolidated what had previously been several separate Cosmos models into one. NVIDIA also used the announcement to introduce the NVIDIA Cosmos Coalition, a group of AI labs and robotics companies committed to advancing open world models.[1][4]
| Attribute | Detail |
|---|---|
| Developer | NVIDIA (Cosmos research lab) |
| Announced | June 1, 2026, at GTC Taipei / COMPUTEX 2026 (Jensen Huang keynote) |
| Type | Open world foundation model / "omnimodel" for physical AI |
| Architecture | Mixture-of-Transformers (autoregressive reasoner tower + diffusion generator tower) |
| Variants | Cosmos 3 Super (64B), Cosmos 3 Nano (16B), Cosmos 3 Edge (coming soon) |
| Modalities | Text, image, video, ambient sound (audio), action |
| License | OpenMDW 1.1 (Open Model Development and Weights License), Linux Foundation |
| Availability | build.nvidia.com, Hugging Face, GitHub; deployable as NVIDIA NIM microservices |
| Released checkpoints | nvidia/Cosmos3-Nano, nvidia/Cosmos3-Super (May 31, 2026) |
| Tested precision | BF16 (FP4, FP8 and FP16 not officially supported) |
| Supported GPUs | NVIDIA Ampere, Hopper and Blackwell |
Most large language models learn from text written from a human point of view. Robots and self-driving cars need something different: data captured from their own perspective, including how objects move, how surfaces behave and what happens when a gripper closes on something. In his keynote, Jensen Huang called this one of the hardest data problems in computing, because the real world rarely hands you labeled examples of every edge case a machine might encounter.[2] Cosmos 3 is NVIDIA's attempt to attack that problem with a single model that can understand a scene and then synthesize new, physically grounded versions of it.
NVIDIA frames Cosmos 3 as an "omnimodel" because it folds together capabilities that used to require a small stack of specialized models. Earlier in the Cosmos line, developers worked with separate components: Cosmos Predict for world generation, Cosmos Transfer for controlled generation, Cosmos Reason for scene understanding, and Cosmos Policy for action and policy generation. Cosmos 3 does all of that in one model that can reason and generate across modalities in a unified forward pass, which removes a lot of the orchestration glue that linking four models together normally demands.[3][5] For the history of those earlier components and the broader platform, see the main NVIDIA Cosmos article.
The headline pitch is "think before it acts." The model first interprets what is happening in a scene, then uses that understanding to produce outputs, whether that output is a predicted video of what comes next or a set of numerical actions for a robot to execute.[2] Jensen Huang summed up the ambition this way: "The big bang of physical AI is just around the corner thanks to breakthroughs in multimodal reasoning language, vision and world models. The Cosmos 3 family of open, frontier omnimodels gives developers a generational leap in ability to build robots, autonomous vehicles and vision AI that perceive, reason, plan and act in the physical world."[1]
The technical core of Cosmos 3 is a Mixture-of-Transformers (MoT) backbone built from two complementary "towers" that live inside one model.[3][6]
The first is the Reasoner, an autoregressive vision-language model. It reads multimodal observations such as images, video and text and produces text: captions, plans, spatial and temporal reasoning, and judgments about physical plausibility. It works the way a language model does, predicting the next token in a sequence, which lets it describe motion, object interactions and other physical context.[3][6]
The second is the Generator, a diffusion transformer. Conditioned on the Reasoner's understanding, it produces the continuous, non-text outputs: images, video, audio and action sequences, generated through iterative denoising rather than token-by-token decoding. In practice that means the Generator can handle text-to-image, text-to-video, image-to-video, video-to-video, audio synthesis and action generation, including forward dynamics, inverse dynamics and policy outputs for robots.[3][6]
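The contrast between the two decoding styles can be sketched with a toy example. This is a minimal illustration, not NVIDIA's sampler: `next_token` and `denoiser` are hypothetical stand-in callables, and the update rule is a simple interpolation chosen for readability.

```python
def autoregressive_decode(next_token, prompt, max_new_tokens):
    """Reasoner-style decoding: emit one discrete token at a time,
    each conditioned on everything produced so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def iterative_denoise(denoiser, noisy, steps):
    """Generator-style decoding: refine a whole continuous sample
    in parallel over a fixed number of denoising steps."""
    x = list(noisy)
    for step in range(steps, 0, -1):
        predicted_clean = denoiser(x)
        # Move 1/step of the way toward the current clean estimate;
        # the last iteration (step == 1) lands on the estimate itself.
        x = [xi + (ci - xi) / step for xi, ci in zip(x, predicted_clean)]
    return x


# Toy stand-ins (assumptions for illustration only).
count_up = lambda tokens: tokens[-1] + 1          # "model" that counts upward
target_trajectory = [0.0, 0.5, 1.0]               # pretend clean action values
toy_denoiser = lambda x: target_trajectory        # always predicts the target

text_like = autoregressive_decode(count_up, [0], max_new_tokens=3)
action_like = iterative_denoise(toy_denoiser, [3.0, -2.0, 7.0], steps=4)
```

The autoregressive loop grows the sequence by one element per step, while the denoising loop keeps the output length fixed and updates every element at each step, which is why the latter suits continuous outputs such as pixels, audio and trajectories.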
What makes this a single model rather than two stitched together is how the towers share a sequence. NVIDIA's technical report describes splitting the input into two subsequences: an autoregressive (AR) subsequence that handles reasoning via next-token prediction, and a diffusion (DM) subsequence that handles generation via iterative denoising. The AR and DM tokens use separate parameter sets within each transformer layer, but they interact through joint attention, so the two towers keep their own weights yet can attend to each other inside the same forward pass.[7] Before entering that shared sequence, every modality passes through a dedicated encoder and is projected into a shared representation space: a vision transformer (ViT) for visual understanding, a variational autoencoder (VAE) for visual and audio generation, and domain-aware vectors for actions.[8]
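The "separate weights, joint attention" idea can be sketched in a few lines of dependency-free Python. This is an illustrative single-head layer under several assumptions that are not in NVIDIA's materials: the function names, the tiny dimension, the random weights, and the absence of any attention mask, normalization or feed-forward block are all simplifications.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_tower(rng, d):
    """One private parameter set: query, key, value and output projections."""
    return {
        name: [[rng.gauss(0.0, 0.5) for _ in range(d)] for _ in range(d)]
        for name in ("q", "k", "v", "o")
    }


def mot_joint_attention(tokens, kinds, towers):
    """tokens: list of d-dim vectors; kinds: "ar" or "dm" for each token."""
    d = len(tokens[0])
    # Each token is projected with the weights of its own tower.
    q = [matvec(towers[kind]["q"], x) for x, kind in zip(tokens, kinds)]
    k = [matvec(towers[kind]["k"], x) for x, kind in zip(tokens, kinds)]
    v = [matvec(towers[kind]["v"], x) for x, kind in zip(tokens, kinds)]
    outputs = []
    for i, kind in enumerate(kinds):
        # Joint attention: token i scores against every token, AR and DM alike.
        scores = [sum(a * b for a, b in zip(q[i], kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        mixed = [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)]
        # The output projection again comes from the token's own tower.
        outputs.append(matvec(towers[kind]["o"], mixed))
    return outputs


rng = random.Random(0)
D = 4
towers = {"ar": make_tower(rng, D), "dm": make_tower(rng, D)}
kinds = ["ar", "ar", "dm", "dm", "dm"]  # two reasoning tokens, three generation tokens
tokens = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in kinds]
outputs = mot_joint_attention(tokens, kinds, towers)
```

In this sketch, changing a DM token changes the outputs at AR positions because attention is computed over the whole sequence, whereas replacing the DM tower's output projection leaves the AR outputs untouched because those weights are never applied to AR tokens.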
This design is a deliberate bet against forcing every output through the same decoder. Reasoning and language are well suited to autoregressive prediction; high-fidelity pixels, audio and smooth trajectories are better served by diffusion. By keeping both mechanisms in one framework, Cosmos 3 can reason and generate heterogeneous modalities without pretending they should all be produced the same way.
The two towers expose different interfaces. The published model card for Cosmos 3 Nano lists the following.[8]
| Tower | Inputs | Outputs |
|---|---|---|
| Reasoner (understanding) | Text; text plus image; text plus video | Text (reasoning, planning, spatial and temporal reasoning) |
| Generator (generation) | Text; image; video (with or without audio); action trajectory | Image (JPG), video (MP4), audio (stereo AAC, 48 kHz), action (JSON), text |
Action data is represented numerically, for example as joint angles, gripper positions and trajectory points, which is the format a robot controller can actually consume.[2]
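As a rough picture of what a numeric action payload could look like, the sketch below serializes a short trajectory to JSON. The schema is hypothetical: the field names, units and the 0-to-1 gripper convention are assumptions made for illustration and are not taken from the Cosmos 3 model card.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ActionStep:
    """One timestep of a hypothetical robot action trajectory."""
    time_s: float                    # seconds since the start of the trajectory
    joint_angles_rad: List[float]    # one angle per joint, in radians
    gripper_opening: float           # assumed convention: 0.0 closed, 1.0 open
    end_effector_xyz_m: List[float]  # trajectory point in metres


def encode_trajectory(steps):
    """Serialize a list of ActionStep objects to a JSON string."""
    return json.dumps({"steps": [asdict(step) for step in steps]}, indent=2)


def decode_trajectory(payload):
    """Parse a JSON string back into ActionStep objects."""
    return [ActionStep(**item) for item in json.loads(payload)["steps"]]


trajectory = [
    ActionStep(0.0, [0.0, 0.5, -0.5], 1.0, [0.30, 0.00, 0.20]),
    ActionStep(0.1, [0.1, 0.6, -0.4], 0.0, [0.32, 0.01, 0.15]),
]
payload = encode_trajectory(trajectory)
```

A controller-side consumer would call `decode_trajectory(payload)` and step through the resulting list; the real field layout should be taken from the published model card or repository rather than from this sketch.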
NVIDIA released Cosmos 3 in a tiered family so the same architecture can run from a workstation up to a data center, with a third tier aimed at the edge still to come.[1][9]
| Variant | Total parameters | Tower split | Positioning | Status |
|---|---|---|---|---|
| Cosmos 3 Super | 64B | ~32B reasoner + ~32B generator | Highest physics accuracy; for post-training robotics and AV models | Available |
| Cosmos 3 Nano | 16B | ~8B reasoner + ~8B generator | High-quality video and action reasoning in fractions of a second | Available |
| Cosmos 3 Edge | Not disclosed | Not disclosed | Real-time inference at the edge | Coming soon |
Cosmos 3 Super targets data-center Hopper and Blackwell GPUs, while Nano is sized for workstation-class hardware such as the RTX PRO 6000 Blackwell card.[5][9] NVIDIA's testing was done at BF16 precision on GB200 and H100 systems; the model card notes that FP4, FP8 and FP16 are not officially supported at launch.[8]
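A back-of-the-envelope calculation shows why the two variants land on different hardware tiers. The sketch assumes the nominal total parameter counts from the table above and 2 bytes per parameter for BF16; it counts weights only and ignores activations, attention caches, the modality encoders and any framework overhead.

```python
BYTES_PER_BF16_PARAM = 2  # BF16 stores each value in 16 bits

# Nominal total parameter counts from the variant table.
VARIANT_PARAMS = {
    "Cosmos 3 Super": 64_000_000_000,
    "Cosmos 3 Nano": 16_000_000_000,
}


def bf16_weight_gb(num_params):
    """Approximate weight footprint in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_BF16_PARAM / 1e9


weight_gb = {name: bf16_weight_gb(n) for name, n in VARIANT_PARAMS.items()}

for name, gb in weight_gb.items():
    print(f"{name}: about {gb:.0f} GB of BF16 weights")
```

Under these assumptions the weights alone come to roughly 128 GB for Super and 32 GB for Nano, which should be read as a lower bound on the memory actually needed at inference time.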
Beyond the two base models, the GitHub release also includes task-specialized checkpoints in the same family, such as text-to-image and image-to-video variants of Super and a vision-language robot policy model (Cosmos3-Nano-Policy-DROID) trained for manipulation.[6]
Cosmos 3 is trained on large-scale multimodal data rather than text alone. According to the Cosmos 3 Nano model card, the model was trained on roughly 1.3 billion data points drawn from 393 dataset entries collected between 2024 and 2026. The breakdown spans the modalities the model has to master: about 22 million reasoning (text) samples, 767 million images, 348 million videos, 139 million audio samples and 8 million action samples.[8]
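The per-modality figures can be checked against the headline total with a few lines of arithmetic. The counts below are the rounded values quoted above, in millions of samples; nothing else is assumed.

```python
# Rounded per-modality sample counts, in millions, as quoted from the model card.
SAMPLES_MILLIONS = {
    "reasoning (text)": 22,
    "image": 767,
    "video": 348,
    "audio": 139,
    "action": 8,
}

total_millions = sum(SAMPLES_MILLIONS.values())
total_billions = round(total_millions / 1000, 1)

# Share of each modality in the overall mix, as a percentage.
shares = {
    name: 100 * count / total_millions
    for name, count in SAMPLES_MILLIONS.items()
}

print(f"total: {total_millions} million samples (about {total_billions} billion)")
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

The rounded counts sum to 1,284 million, consistent with the "roughly 1.3 billion" figure. Images account for about 60 percent of samples, while action data makes up well under 1 percent.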
The two towers learn in complementary ways. The Reasoner is trained on paired vision-language data such as image-text and video-text pairs to support question answering, spatial grounding, temporal reasoning and action understanding. The Generator is trained on large multimodal corpora of images, video, audio and action using reconstruction-based objectives, so it learns to synthesize the world rather than to label it.[7] To address the shortage of robot-perspective data, NVIDIA says Cosmos 3 also learns from teleoperation recordings, simulation, and third-person video that is re-projected into a first-person view.[4]
A figure of "20 trillion training tokens" was widely repeated in press coverage of the launch, but that number does not appear in NVIDIA's official model cards, technical blog or technical report, so it is not treated as a confirmed figure here. NVIDIA's own documentation describes the data in terms of samples per modality, as above, rather than a single token count.
NVIDIA presents Cosmos 3 as a leaderboard-topping open model across the three things it is built to do: physical-AI reasoning, world simulation and action generation. Performance claims below are NVIDIA's own.
The company says Cosmos 3 ranks first among open models on a spread of physical-AI and generation leaderboards, including Artificial Analysis (for text-to-image and image-to-video), Physics-IQ, PAI-Bench, R-Bench, RoboLab, RoboArena, VANTAGE-Bench and TAR.[1] NVIDIA's developer materials add that Cosmos 3 leads VANTAGE-Bench at both the 32B tier (Super) and the 8B tier (Nano), framing the comparison by the size of the individual towers rather than the combined model.[5] NVIDIA has not published a full table of numerical scores in its launch blogs; the detailed results live in the accompanying technical report.[7]
Practically, the model is aimed at a few concrete jobs. It can generate synthetic training data and rare edge cases for robots and self-driving systems, predict how a scene will unfold, reason about what a camera is seeing for video-analytics agents in factories and cities, and output the actual control actions for robot manipulation such as pick-and-place.[2][9] Because the Reasoner and Generator share one model, a developer can ask Cosmos 3 to look at a scene, decide what should happen, and then produce both the predicted video and the action trajectory to get there.
Cosmos 3 Super and Cosmos 3 Nano are available now; Cosmos 3 Edge is listed as coming soon.[1] The models can be tried as hosted endpoints on build.nvidia.com, downloaded as open weights from Hugging Face (the nvidia/Cosmos3-Nano and nvidia/Cosmos3-Super collections, both dated May 31, 2026), and built on through the NVIDIA Cosmos repository on GitHub, which also ships training and post-training scripts.[1][6][8] For production, the models can be deployed as NVIDIA NIM microservices, and NVIDIA names cloud and inference partners including Baseten, CoreWeave, Microsoft Azure, Nebius, Deep Infra and Classmethod for hosting them.[1]
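For the open-weights route, a download might look like the sketch below. It assumes the checkpoint names listed in this article are usable as Hugging Face repository IDs and that the `huggingface_hub` package is installed; `download_checkpoint` is a hypothetical helper, and access may additionally require logging in or accepting license terms.

```python
# Checkpoint names as listed in this article; assumed to be valid repo IDs.
CHECKPOINTS = ("nvidia/Cosmos3-Nano", "nvidia/Cosmos3-Super")


def download_checkpoint(repo_id, local_dir):
    """Fetch every file of one checkpoint into local_dir and return the path."""
    if repo_id not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {repo_id!r}")
    # Imported lazily so this module loads even without the package installed
    # (pip install huggingface_hub).
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Example call (downloads many gigabytes, so it is left commented out):
# path = download_checkpoint("nvidia/Cosmos3-Nano", "./cosmos3-nano")
```

The inference and post-training entry points themselves live in the NVIDIA Cosmos repository on GitHub and are not reproduced here.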
The weights are released under the OpenMDW 1.1 license (the Open Model Development and Weights License) stewarded by the Linux Foundation, which is a permissive license intended to allow commercial use and the distribution of derivative models.[2][9] NVIDIA also lists a contact address (cosmos-license@nvidia.com) for custom arrangements.[6]
Developers typically adapt Cosmos 3 to their own robots or vehicles through supervised fine-tuning on custom datasets and through action post-training that covers forward dynamics, inverse dynamics and policy generation. NVIDIA's tooling supports BF16, FP8 and NVFP4 precision for NIM deployment and integrates with the vLLM inference engine for higher throughput, although the base checkpoints themselves were validated only at BF16 and the model card lists FP4, FP8 and FP16 as not officially supported at launch.[5][8]
Alongside the model, NVIDIA announced the NVIDIA Cosmos Coalition, described as a group of leading AI labs and robotics companies working together to advance the next generation of open world models. The founding members named at launch are Agile Robots, Black Forest Labs, Generalist, LTX, Runway and Skild AI.[1][4] The grouping is notable for mixing robotics specialists (Agile Robots, Generalist, Skild AI) with generative-media labs (Black Forest Labs, LTX, Runway), which fits Cosmos 3's dual nature as both a robot-action model and a high-fidelity video generator.
Cosmos 3 is the model layer in a larger physical-AI strategy NVIDIA has been assembling for several years. The company connects it to its simulation and robotics software, NVIDIA Omniverse and the Isaac platform, where synthetic worlds are built and robot policies are tested before deployment. On the robotics side it lines up with Isaac GR00T, NVIDIA's foundation-model effort for humanoid robots, and with simulation tools like Isaac Sim and Isaac Lab. For autonomous vehicles it connects to the NVIDIA DRIVE Hyperion platform, and for on-robot inference it pairs with the Jetson Thor edge computer.[2]
That positioning maps onto NVIDIA's "three computers" view of robotics: one computer (in the data center, on Blackwell and the upcoming Vera Rubin generation) to train the models, a second running Omniverse and Cosmos to generate data and simulate, and a third (Jetson Thor) inside the robot to run the policy in the real world. Cosmos 3 is designed to feed the middle of that loop, turning scarce real-world data into the abundant, physically grounded data that the other two computers need.