NVIDIA Cosmos 3

Updated Jun 27, 2026

NVIDIA Cosmos 3 is an open family of "world foundation models" for physical AI that NVIDIA launched on June 1, 2026 at GTC Taipei, held alongside COMPUTEX 2026. NVIDIA describes it as the first fully open "omnimodel," meaning a single model that can both reason about the physical world and generate it across text, images, video, ambient sound and action. The company says the model reduces physical-AI training and evaluation cycles "from months to days."^[1]^[2]^[3] Cosmos 3 is built on a Mixture-of-Transformers (MoT) architecture that pairs a reasoning transformer with an expert generation transformer, so the model can understand object interactions, motion and spatial-temporal relationships before it generates video and action trajectories.^[1]^[2]^[3] It is the third major generation of the NVIDIA Cosmos platform, which NVIDIA positions as the data and simulation layer for robots, autonomous vehicles and other machines that perceive and act in the real world.

The launch consolidated what had previously been several separate Cosmos models into one. NVIDIA also used the announcement to introduce the NVIDIA Cosmos Coalition, a group of AI labs and robotics companies committed to advancing open world models.^[1]^[4]

Key facts

Attribute	Detail
Developer	NVIDIA (Cosmos research lab)
Announced	June 1, 2026, at GTC Taipei / COMPUTEX 2026 (Jensen Huang keynote)
Type	Open world foundation model / "omnimodel" for physical AI
Architecture	Mixture-of-Transformers (autoregressive reasoner tower + diffusion generator tower)
Variants	Cosmos 3 Super (64B), Cosmos 3 Nano (16B), Cosmos 3 Edge (coming soon)
Modalities	Text, image, video, ambient sound (audio), action
License	OpenMDW 1.1 (Open Model Development and Weights License), Linux Foundation
Availability	build.nvidia.com, Hugging Face, GitHub; deployable as NVIDIA NIM microservices
Released checkpoints	nvidia/Cosmos3-Nano, nvidia/Cosmos3-Super (May 31, 2026)
Tested precision	BF16 (FP4, FP8 and FP16 not officially supported)
Supported GPUs	NVIDIA Ampere, Hopper and Blackwell

What is NVIDIA Cosmos 3?

Most large language models learn from text written from a human point of view. Robots and self-driving cars need something different: data captured from their own perspective, including how objects move, how surfaces behave and what happens when a gripper closes on something. In his keynote, Jensen Huang called this one of the hardest data problems in computing, because the real world rarely hands you labeled examples of every edge case a machine might encounter.^[2] Cosmos 3 is NVIDIA's attempt to attack that problem with a single model that can understand a scene and then synthesize new, physically grounded versions of it.

NVIDIA frames Cosmos 3 as an "omnimodel" because it folds together capabilities that used to require a small stack of specialized models. Earlier in the Cosmos line, developers worked with separate components: Cosmos Predict for world generation, Cosmos Transfer for controlled generation, Cosmos Reason for scene understanding, and Cosmos Policy for action and policy generation. Cosmos 3 does all of that in one model that can reason and generate across modalities in a unified forward pass, which removes much of the orchestration code that chaining four models together normally requires.^[3]^[5] For the history of those earlier components and the broader platform, see the main NVIDIA Cosmos article.

The headline pitch is "think before it acts." The model first interprets what is happening in a scene, then uses that understanding to produce outputs, whether that output is a predicted video of what comes next or a set of numerical actions for a robot to execute.^[2] Ming-Yu Liu, NVIDIA's vice president of research who leads the Cosmos Lab, describes the design as one that "first harnesses a reasoning block to interpret what is happening in a scene, then harnesses a generation block to use that context to create physically grounded outputs."^[2] Jensen Huang summed up the ambition this way: "The big bang of physical AI is just around the corner thanks to breakthroughs in multimodal reasoning language, vision and world models. The Cosmos 3 family of open, frontier omnimodels gives developers a generational leap in ability to build robots, autonomous vehicles and vision AI that perceive, reason, plan and act in the physical world."^[1]

What is a world foundation model?

A world foundation model (WFM) is a model pretrained to understand and generate the dynamics of the physical world, so that other systems can be built on top of it the way applications are built on a large language model. NVIDIA positions Cosmos as a WFM platform whose job is to supply the scarce ingredient in robotics and autonomy: realistic, physics-aware data. Rather than collecting and labeling every real-world scenario, developers can ask a WFM to generate diverse, physically plausible scenes, including rare edge cases, and use that synthetic output to train and evaluate their machines.^[4] What makes Cosmos 3 unusual among WFMs is that it unifies world generation, physical reasoning and action generation inside one open model, where earlier approaches kept those as separate models.^[5]

Mixture-of-Transformers architecture

The technical core of Cosmos 3 is a Mixture-of-Transformers (MoT) backbone built from two complementary "towers" that live inside one model.^[3]^[6]

The first is the Reasoner, an autoregressive vision-language model. It reads multimodal observations such as images, video and text (with a context window of up to 256K tokens) and produces text: captions, plans, spatial and temporal reasoning, and judgments about physical plausibility. It works the way a language model does, predicting the next token in a sequence, which lets it describe motion, object interactions and other physical context.^[3]^[6]^[8]

The second is the Generator, a diffusion transformer. Conditioned on the Reasoner's understanding, it produces the continuous, non-text outputs: images, video, audio and action sequences, generated through iterative denoising rather than token-by-token decoding. In practice that means the Generator can handle text-to-image, text-to-video, image-to-video, video-to-video, audio synthesis and action generation, including forward dynamics, inverse dynamics and policy outputs for robots.^[3]^[6]

What makes this a single model rather than two stitched together is how the towers share a sequence. NVIDIA's technical report describes splitting the input into two subsequences: an autoregressive (AR) subsequence that handles reasoning via next-token prediction, and a diffusion (DM) subsequence that handles generation via iterative denoising. The AR and DM tokens use separate parameter sets within each transformer layer, but they interact through joint attention, and information flows one way, from the Reasoner to the Generator. So the two towers keep their own weights, yet the Generator can attend to the Reasoner's tokens inside the same forward pass.^[7]^[10] A 3D multimodal rotary position embedding (mRoPE) aligns video, audio and action tokens on a single temporal axis so the model can keep them in sync.^[10] Every modality first passes through a dedicated encoder before being projected into a shared representation space: a vision transformer (ViT) for visual understanding, a variational autoencoder (VAE) for visual and audio generation, and domain-aware vectors for actions.^[8]
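The one-way information flow described above can be pictured as an attention mask over a sequence that holds the AR tokens followed by the DM tokens. The sketch below is illustrative only and is not taken from NVIDIA's code: the function name, the token ordering, the causal masking inside the AR subsequence and the fully bidirectional attention inside the DM subsequence are all assumptions. The source states only that the two subsequences use separate parameters, share joint attention, and pass information from the Reasoner to the Generator.

```python
def build_joint_attention_mask(num_ar: int, num_dm: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed token order: [AR (Reasoner) tokens..., DM (Generator) tokens...].
    """
    total = num_ar + num_dm
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < num_ar:
                # Reasoner query: causal over earlier Reasoner tokens only,
                # never reading Generator tokens (one-way flow).
                mask[q][k] = k < num_ar and k <= q
            else:
                # Generator query: sees every Reasoner token and, by
                # assumption, every Generator token (bidirectional denoising).
                mask[q][k] = True
    return mask


def render(mask: list[list[bool]]) -> list[str]:
    """Draw the mask with 'x' for allowed attention and '.' for blocked."""
    return ["".join("x" if allowed else "." for allowed in row) for row in mask]


for row in render(build_joint_attention_mask(num_ar=3, num_dm=2)):
    print(row)
# x....
# xx...
# xxx..
# xxxxx
# xxxxx
```

The empty upper-right block is the part that encodes "Reasoner to Generator only": no Reasoner row ever reads a Generator column.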

This design is a deliberate bet against forcing every output through the same decoder. Reasoning and language are well suited to autoregressive prediction; high-fidelity pixels, audio and smooth trajectories are better served by diffusion. By keeping both mechanisms in one framework, Cosmos 3 can reason and generate heterogeneous modalities without pretending they should all be produced the same way.

What are the inputs and outputs?

The two towers expose different interfaces. The published model card for Cosmos 3 Nano lists the following.^[8]

Tower	Inputs	Outputs
Reasoner (understanding)	Text; text plus image; text plus video (up to 256K-token context)	Text (reasoning, planning, spatial and temporal reasoning)
Generator (generation)	Text; image; video (with or without audio); action trajectory	Image (JPG), video (MP4), audio (stereo AAC, 48 kHz), action (JSON), text

Action data is represented numerically, for example as joint angles, gripper positions and trajectory points, which is the format a robot controller can actually consume.^[2] On the action side, Cosmos 3 supports multiple embodiments, including camera, vehicle, egocentric (first-person), single-arm, dual-arm and humanoid robots.^[10]
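To make "action as JSON" concrete, the sketch below parses and sanity-checks a small trajectory. The schema is hypothetical: the field names `embodiment`, `steps`, `joint_angles` and `gripper`, the three-joint arm and the 0-to-1 gripper range are invented for illustration and are not taken from the Cosmos 3 model card, which should be consulted for the real output format.

```python
import json

# Hypothetical payload: one entry per time step for a three-joint arm.
SAMPLE = """
{
  "embodiment": "single-arm",
  "steps": [
    {"joint_angles": [0.00, -0.52, 1.10], "gripper": 1.0},
    {"joint_angles": [0.05, -0.48, 1.05], "gripper": 0.6},
    {"joint_angles": [0.10, -0.45, 1.00], "gripper": 0.0}
  ]
}
"""


def load_trajectory(text: str) -> dict:
    """Parse a trajectory and check that every step has a consistent shape."""
    data = json.loads(text)
    steps = data["steps"]
    if not steps:
        raise ValueError("trajectory has no steps")
    dof = len(steps[0]["joint_angles"])  # degrees of freedom, from step 0
    for i, step in enumerate(steps):
        if len(step["joint_angles"]) != dof:
            raise ValueError(f"step {i}: expected {dof} joint angles")
        if not 0.0 <= step["gripper"] <= 1.0:
            raise ValueError(f"step {i}: gripper opening must be in [0, 1]")
    return data


traj = load_trajectory(SAMPLE)
print(traj["embodiment"], len(traj["steps"]), "steps")
```

A check like this is worth running before any generated trajectory reaches a real controller, whatever the actual schema turns out to be.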

What are the Cosmos 3 model variants?

NVIDIA released Cosmos 3 in a tiered family so the same architecture can run from a workstation up to a data center, with a third tier aimed at the edge still to come.^[1]^[9]

Variant	Total parameters	Tower split	Positioning	Status
Cosmos 3 Super	64B	~32B reasoner + ~32B generator	Highest physics accuracy; for post-training robotics and AV models	Available
Cosmos 3 Nano	16B	~8B reasoner + ~8B generator	High-quality video and action reasoning in fractions of a second	Available
Cosmos 3 Edge	Not disclosed	Not disclosed	Real-time inference at the edge	Coming soon

Cosmos 3 Super targets data-center Hopper and Blackwell GPUs, while Nano is sized for workstation-class hardware such as the RTX PRO 6000 Blackwell card.^[5]^[9] NVIDIA's testing was done at BF16 precision on GB200 and H100 systems; the model card notes that FP4, FP8 and FP16 are not officially supported at launch.^[8]
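A rough way to see why the two variants target different hardware is to estimate the memory needed just to hold the weights. BF16 stores each parameter in 2 bytes, so the arithmetic is simple. The sketch below uses the article's rounded parameter counts and gives a lower bound only: it ignores activations, attention caches, the modality encoders and any framework overhead.

```python
BYTES_PER_BF16_PARAM = 2  # bfloat16 is a 16-bit format

# Rounded total parameter counts from the variants table above.
VARIANTS = {
    "Cosmos 3 Super": 64e9,
    "Cosmos 3 Nano": 16e9,
}


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_BF16_PARAM) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, params in VARIANTS.items():
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights at BF16")
# Cosmos 3 Super: ~128 GB of weights at BF16
# Cosmos 3 Nano: ~32 GB of weights at BF16
```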

Beyond the two base models, the GitHub release also includes task-specialized checkpoints in the same family, such as text-to-image and image-to-video variants of Super and a vision-language robot policy model (Cosmos3-Nano-Policy-DROID) trained for manipulation.^[6]

How is Cosmos 3 trained, and on what data?

Cosmos 3 is trained on large-scale multimodal data rather than text alone. According to the Cosmos 3 Nano model card, the model was trained on roughly 1.3 billion data points drawn from 393 dataset entries collected between 2024 and 2026. The breakdown spans the modalities the model has to master: about 22 million reasoning (text) samples, 767 million images, 348 million videos, 139 million audio samples and 8 million action samples.^[8]
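The per-modality figures can be checked against the "roughly 1.3 billion" total with a few lines of arithmetic. The sketch below uses only the rounded counts quoted above, so the shares it prints are approximate.

```python
# Sample counts in millions, as quoted from the Cosmos 3 Nano model card.
SAMPLES_MILLIONS = {
    "reasoning (text)": 22,
    "image": 767,
    "video": 348,
    "audio": 139,
    "action": 8,
}

total_millions = sum(SAMPLES_MILLIONS.values())
print(f"total: {total_millions} million (~{total_millions / 1000:.1f} billion)")
# total: 1284 million (~1.3 billion)

# Share of each modality in the overall mix.
for name, count in SAMPLES_MILLIONS.items():
    print(f"{name:>16}: {count / total_millions:6.1%}")
```

The sum, 1,284 million, rounds to the 1.3 billion stated in the model card. Images and video together account for the large majority of samples, while action data is by far the smallest slice.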

The two towers learn in complementary ways. The Reasoner is trained on paired vision-language data such as image-text and video-text pairs to support question answering, spatial grounding, temporal reasoning and action understanding. The Generator is trained on large multimodal corpora of images, video, audio and action using reconstruction-based objectives, so it learns to synthesize the world rather than to label it.^[7] To address the shortage of robot-perspective data, NVIDIA says Cosmos 3 also learns from teleoperation recordings, simulation, and third-person video that is re-projected into a first-person view.^[4]

A widely repeated figure of "20 trillion training tokens" circulated in some press coverage of the launch, but that number does not appear in NVIDIA's official model cards, technical blog or technical report, so it is omitted here. NVIDIA's own documentation describes the data in terms of samples per modality, as above, rather than a single token count.

What can Cosmos 3 do?

NVIDIA presents Cosmos 3 as a leaderboard-topping open model across the three things it is built to do: physical-AI reasoning, world simulation and action generation. Performance claims below are NVIDIA's own.

The company says Cosmos 3 ranks first among open models on a spread of physical-AI and generation leaderboards, including Artificial Analysis (for text-to-image and image-to-video), Physics-IQ, PAI-Bench, R-Bench, RoboLab, RoboArena, VANTAGE-Bench and TAR.^[1] NVIDIA's developer materials add that Cosmos 3 leads VANTAGE-Bench at both the 32B tier (Super) and the 8B tier (Nano), framing the comparison by the size of the individual towers rather than the combined model.^[5] NVIDIA has not published a full table of numerical scores in its launch blogs; the detailed results live in the accompanying technical report.^[7]

Practically, the model is aimed at a few concrete jobs. It can generate synthetic training data and rare edge cases for robots and self-driving systems, predict how a scene will unfold, reason about what a camera is seeing for video-analytics agents in factories and cities, and output the actual control actions for robot manipulation such as pick-and-place.^[2]^[9] Because the Reasoner and Generator share one model, a developer can ask Cosmos 3 to look at a scene, decide what should happen, and then produce both the predicted video and the action trajectory to get there. NVIDIA's pitch is that this compresses the physical-AI development loop "from months to days," by replacing demonstrations that would otherwise have to be captured by hand.^[1]^[2]

Is Cosmos 3 open source, and how do you get it?

Cosmos 3 Super and Cosmos 3 Nano are available now; Cosmos 3 Edge is listed as coming soon.^[1] The models can be tried as hosted endpoints on build.nvidia.com, downloaded as open weights from Hugging Face (the nvidia/Cosmos3-Nano and nvidia/Cosmos3-Super collections, both dated May 31, 2026), and built on through the NVIDIA Cosmos repository on GitHub, which also ships training and post-training scripts.^[1]^[6]^[8] For production, the models can be deployed as NVIDIA NIM microservices, and NVIDIA names cloud and inference partners including Baseten, CoreWeave, Microsoft Azure, Nebius, Deep Infra and Classmethod for hosting them.^[1]
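For the open-weights route, a download might look like the sketch below, which uses `snapshot_download` from the `huggingface_hub` library. The helper names and the `checkpoints` directory are illustrative. Treating `nvidia/Cosmos3-Super` as a model repository id is an assumption based on the checkpoint names listed above (only the Nano model card URL appears in the references). Access may also require a Hugging Face login or accepting license terms first.

```python
from pathlib import Path

# Checkpoint names as listed in this article; assumed to be valid repo ids.
REPOS = {
    "nano": "nvidia/Cosmos3-Nano",
    "super": "nvidia/Cosmos3-Super",
}


def local_dir_for(variant: str, target_dir: str = "checkpoints") -> Path:
    """Directory where a variant's files would be stored locally."""
    return Path(target_dir) / REPOS[variant].split("/")[-1]


def download_checkpoint(variant: str, target_dir: str = "checkpoints") -> Path:
    """Fetch every file of the chosen checkpoint and return the local path."""
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=REPOS[variant],
        local_dir=str(local_dir_for(variant, target_dir)),
    )
    return Path(path)
```

Calling `download_checkpoint("nano")` would pull the full Nano checkpoint. Given the BF16 weight sizes involved, check free disk space before running it.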

The weights are released under the OpenMDW 1.1 license (the Open Model Development and Weights License) stewarded by the Linux Foundation, which is a permissive license intended to allow commercial use and the distribution of derivative models.^[2]^[8]^[9] NVIDIA also lists a contact address (cosmos-license@nvidia.com) for custom arrangements.^[6]

Developers typically adapt Cosmos 3 to their own robots or vehicles through supervised fine-tuning on custom datasets and through action post-training that covers forward dynamics, inverse dynamics and policy generation. For NIM deployment, NVIDIA's tooling supports BF16 along with FP8 and NVFP4 quantization, and it integrates with the vLLM inference engine for higher throughput, although the base checkpoints themselves were validated at BF16.^[5]

What is the NVIDIA Cosmos Coalition?

Alongside the model, NVIDIA announced the NVIDIA Cosmos Coalition, described as a group of leading AI labs and robotics companies working together to advance the next generation of open world models. The founding members named at launch are Agile Robots, Black Forest Labs, Generalist, LTX, Runway and Skild AI.^[1]^[4] The grouping is notable for mixing robotics specialists (Agile Robots, Generalist, Skild AI) with generative-media labs (Black Forest Labs, LTX, Runway), which fits Cosmos 3's dual nature as both a robot-action model and a high-fidelity video generator.

Where does Cosmos 3 fit in NVIDIA's physical-AI stack?

Cosmos 3 is the model layer in a larger physical-AI strategy NVIDIA has been assembling for several years. The company connects it to its simulation and robotics software, NVIDIA Omniverse and the Isaac platform, where synthetic worlds are built and robot policies are tested before deployment. On the robotics side it lines up with Isaac GR00T, NVIDIA's foundation-model effort for humanoid robots, and with simulation tools like Isaac Sim and Isaac Lab. For autonomous vehicles it connects to the NVIDIA DRIVE Hyperion platform, and for on-robot inference it pairs with the Jetson Thor edge computer.^[2]

That positioning maps onto NVIDIA's "three computers" view of robotics: one computer (in the data center, on Blackwell and the upcoming Vera Rubin generation) to train the models, a second running Omniverse and Cosmos to generate data and simulate, and a third (Jetson Thor) inside the robot to run the policy in the real world. Cosmos 3 is designed to feed the middle of that loop, turning scarce real-world data into the abundant, physically grounded data that the other two computers need.

ELI5: what is Cosmos 3?

Imagine you want to teach a robot to pick things up, but you do not have the time to show it a million examples by hand. Cosmos 3 is like a very good imagination for machines. You can describe or show it a scene, it figures out what is going on and what should happen next, and then it draws a realistic video of it, or even writes down the exact movements a robot arm should make. Because it can dream up endless practice scenes, including weird ones that rarely happen in real life, robots and self-driving cars can learn faster and more safely. And because NVIDIA gave the model away openly, anyone can download it and build on it.

References

1. NVIDIA Newsroom. "NVIDIA Launches Cosmos 3, the Open Frontier Foundation Model for Physical AI." June 1, 2026. https://nvidianews.nvidia.com/news/nvidia-launches-cosmos-3-the-open-frontier-foundation-model-for-physical-ai
2. NVIDIA Blog. "How Cosmos 3 Helps Physical AI Think Before It Acts." June 1, 2026. https://blogs.nvidia.com/blog/cosmos-3-physical-ai-open-world-foundation-model/
3. NVIDIA Technical Blog. "Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3." June 1, 2026. https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3/
4. NVIDIA. "NVIDIA Cosmos: World Foundation Models Powering Physical AI" (product page). https://www.nvidia.com/en-us/ai/cosmos/
5. Hugging Face. "Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action." 2026. https://huggingface.co/blog/nvidia/cosmos-3-for-physical-ai
6. GitHub. "NVIDIA/Cosmos" repository. https://github.com/nvidia/cosmos
7. NVIDIA Research, Cosmos Lab. "Cosmos 3: Omnimodal World Models for Physical AI" (technical report). 2026. https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf
8. Hugging Face. "nvidia/Cosmos3-Nano" model card. May 31, 2026. https://huggingface.co/nvidia/Cosmos3-Nano
9. NVIDIA Blog. "NVIDIA GTC Taipei at COMPUTEX: Live Updates on What's Next in AI." June 1, 2026. https://blogs.nvidia.com/blog/nvidia-gtc-taipei-computex-2026-news/
10. MarkTechPost. "NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation." June 3, 2026. https://www.marktechpost.com/2026/06/03/nvidia-releases-cosmos-3-a-two-tower-mixture-of-transformers-foundation-model-unifying-physical-reasoning-world-generation-and-action-generation/
