NVIDIA Cosmos
Last reviewed
May 7, 2026
Sources
16 citations
Review status
Source-backed
Revision
v2 · 4,411 words
NVIDIA Cosmos is a world foundation model platform developed by NVIDIA for physical AI applications, including autonomous vehicles and robotics. Announced by CEO Jensen Huang at CES on January 6, 2025, Cosmos provides a suite of pre-trained generative models, video tokenizers, safety guardrails, and a data processing pipeline that together allow developers to generate physics-grounded synthetic training data at scale. Models are released under the NVIDIA Open Model License and are publicly accessible on Hugging Face and the NVIDIA NGC catalog.
The platform centers on three families of fine-tunable models: Cosmos Predict for simulating future world states as video, Cosmos Transfer for converting structured simulation data into photorealistic footage, and Cosmos Reason for chain-of-thought physical reasoning from video. Together these address the long-standing shortage of diverse, physics-accurate training data that has slowed progress in embodied AI.
Training capable embodied AI systems requires video that is dense, diverse, and physically consistent. Real-world collection is slow, expensive, and difficult to scale to the range of edge cases a robot or autonomous vehicle will encounter. Simulation environments such as NVIDIA Omniverse can generate controlled scenarios, but historically the resulting footage looked synthetic enough to cause a domain gap when policies trained on it were deployed on real hardware. This sim-to-real gap remains one of the central challenges in robotics.
World foundation models offer a potential solution: train a large generative model on massive quantities of real video so that it learns the visual statistics and physical dynamics of the actual world, then use that model to produce synthetic footage that is both controllable and photorealistic. The generated data can augment or replace costly real-world collection, letting developers cover rare scenarios, vary lighting and weather conditions, and label trajectories automatically.
World models as a research concept go back to David Ha and Jürgen Schmidhuber's 2018 paper, which introduced models that learn compressed representations of environment dynamics. In the years that followed, reinforcement learning researchers used smaller learned world models for planning, but these operated in low-dimensional or game-like settings rather than on high-resolution video of the physical world.
Between 2023 and 2024 a new generation of large-scale video generation models emerged, and several groups began framing them explicitly as world models for physical AI. NVIDIA Cosmos arrived alongside related efforts such as Genie 3 from Google DeepMind, World Labs' Marble, and Decart's Oasis. NVIDIA's approach differed in its explicit physical-AI focus, its open release of model weights, and its integration with NVIDIA's broader hardware and software stack.
NVIDIA announced Cosmos at its CES 2025 keynote on January 6, 2025. Jensen Huang demonstrated three use cases: video search and scenario identification, physics-based synthetic data generation from 3D scenes built in NVIDIA Omniverse, and a "multiverse" simulation mode in which the model generates multiple plausible continuations of a scenario to help robots or vehicles plan under uncertainty.
On January 7, 2025, NVIDIA released the initial Cosmos 1.0 model weights on Hugging Face and the NGC catalog under the NVIDIA Open Model License. This initial release included four autoregressive models (4B, 5B, 12B, 13B parameters) and four diffusion models (7B and 14B parameters, in both Text2World and Video2World configurations), along with the Cosmos Tokenizer.
On March 18, 2025, NVIDIA announced a major release that added the Cosmos Predict and Cosmos Transfer model families, introduced Cosmos Reason in early access, and expanded the platform's integration with Google Cloud Vertex AI and additional robotics partner toolchains.
Subsequent releases in late 2025 brought Cosmos-Predict2 and Cosmos-Predict2.5 (a 2B/14B flow-based model that unifies text-to-world and video-to-world generation), as well as Cosmos-Transfer2.5 (a multi-controlnet variant accepting simultaneous RGB, depth, and segmentation inputs).
Cosmos is a platform rather than a single model. Its components work together in a pipeline:
The models are designed to be post-trained. NVIDIA reports that domain-specific post-training can achieve up to 10x higher accuracy on downstream tasks compared to using the base models directly.
The Cosmos Tokenizer converts raw images and video into compact token representations that the world foundation models consume. It supports both continuous tokens (for diffusion-based models) and discrete tokens (for autoregressive models), and its causal design means it can process streaming video without needing the entire sequence in advance.
The tokenizer achieves spatial compression ratios of 8x8 or 16x16 for images, and spatio-temporal compression ratios of 4x8x8, 8x8x8, or 8x16x16 for video. The most aggressive compression (8x temporal combined with 16x16 spatial) results in a total compression factor of up to 2048x. NVIDIA benchmarks show the tokenizer delivers +4 dB PSNR improvement on the DAVIS video dataset compared to prior methods, runs 12x faster, and uses fewer parameters than competing approaches.
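The total compression factors follow from multiplying the temporal ratio by the two spatial ratios. The short Python sketch below reproduces that arithmetic for the three video configurations quoted above. The dictionary keys and the helper name are illustrative only and are not part of any Cosmos API; the factor counts positions (pixels per token location), not bytes.

```python
from math import prod

# Spatio-temporal compression ratios quoted above, as (time, height, width).
VIDEO_RATIOS = {
    "4x8x8": (4, 8, 8),
    "8x8x8": (8, 8, 8),
    "8x16x16": (8, 16, 16),
}


def total_compression(ratio):
    """Overall reduction in the number of spatio-temporal positions."""
    return prod(ratio)


factors = {name: total_compression(r) for name, r in VIDEO_RATIOS.items()}

for name, factor in factors.items():
    print(f"{name}: {factor}x")
```

The most aggressive setting, 8x16x16, works out to 8 * 16 * 16 = 2048, which matches the figure quoted for the tokenizer.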
On MS-COCO and ImageNet-1K image benchmarks, Cosmos image tokenizers outperform FLUX and LlamaGen baselines. On video, continuous tokenizers outperform CogVideoX and Omni-tokenizer across PSNR, SSIM, and rFVD metrics. Discrete tokenizers show better compression-quality tradeoffs than alternatives at high compression rates.
The tokenizer is released separately on Hugging Face under the NVIDIA Cosmos Tokenizer collection, allowing developers to use it independently of the full world foundation models.
The Cosmos 1.0 models were pre-trained on approximately 20 million hours of raw video representing roughly 9,000 trillion tokens. NVIDIA does not fully disclose the specific sources of this data. The paper accompanying the models describes the curated training set as approximately 100 million clips drawn from the following broad categories:
| Category | Share of training clips |
|---|---|
| Nature dynamics | 20% |
| Hand and object manipulation | 16% |
| Spatial awareness | 16% |
| Driving | 11% |
| Human motion | 10% |
| First-person point of view | 8% |
| Dynamic camera | 8% |
| Synthetic rendering | 4% |
| Other | 7% |
This distribution reflects the physical-AI focus: driving, manipulation, and spatial awareness collectively account for over 40% of training data. The strong representation of manipulation and first-person footage is intended to make the models useful for robotic arm control and humanoid motion tasks.
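As a quick check on the figures in the table, the following Python sketch sums the category shares and the three categories named above. The percentages are copied from the table; the variable names are illustrative.

```python
# Share of training clips per category, in percent, copied from the table.
shares = {
    "Nature dynamics": 20,
    "Hand and object manipulation": 16,
    "Spatial awareness": 16,
    "Driving": 11,
    "Human motion": 10,
    "First-person point of view": 8,
    "Dynamic camera": 8,
    "Synthetic rendering": 4,
    "Other": 7,
}

# The three categories singled out in the text as the physical-AI core.
physical_ai_categories = [
    "Driving",
    "Hand and object manipulation",
    "Spatial awareness",
]

total_share = sum(shares.values())
physical_ai_share = sum(shares[name] for name in physical_ai_categories)

print(f"All categories: {total_share}%")
print(f"Driving + manipulation + spatial awareness: {physical_ai_share}%")
```

The shares sum to 100%, and the three named categories account for 43% of clips, consistent with the "over 40%" figure.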
Data curation used NVIDIA NeMo Curator, a CUDA-accelerated pipeline that filters, clips, and labels raw footage. Processing the full 20 million hours takes 14 days on NVIDIA Blackwell GPUs and approximately 40 days on Hopper GPUs, compared to more than three years on a CPU-only pipeline of equivalent power consumption.
The Cosmos-Predict2.5 generation (released October 2025) expanded the post-training data to 200 million high-quality clips.
Cosmos Predict is the world simulation component of the platform. It models future world states as video from multimodal inputs: text descriptions, single images, video sequences, or start-and-end frame pairs. The model can predict what happens next in a scene, interpolate between two keyframes, or generate video continuations from a text prompt alone.
Autoregressive Cosmos Predict models use a GPT-style decoder architecture trained to predict the next discrete video token given the preceding sequence. The architecture is built on Llama3-style transformer blocks with:
The autoregressive models use the discrete variant of the Cosmos Tokenizer (DV8x16x16), which maps video to integer tokens. Because pure discrete decoding limits visual quality, NVIDIA trains a diffusion decoder that maps discrete DV8x16x16 tokens back to the higher-fidelity continuous CV8x8x8 token space before final pixel rendering.
The four autoregressive model sizes released in Cosmos 1.0 are:
| Model | Parameters | Type | Conditioning |
|---|---|---|---|
| Cosmos-1.0-Autoregressive-4B | 4B | Base (video-only) | Video in, video out |
| Cosmos-1.0-Autoregressive-5B-Video2World | 5B | Text + Video | Text + video in, video out |
| Cosmos-1.0-Autoregressive-12B | 12B | Base (video-only) | Video in, video out |
| Cosmos-1.0-Autoregressive-13B-Video2World | 13B | Text + Video | Text + video in, video out |
The 5B and 13B Video2World variants are derived from the 4B and 12B base models by adding cross-attention layers and performing additional Stage 2 training on text-video pairs. The base models acquire no language understanding during pre-training; in the Video2World variants, all textual information enters only through T5 embeddings of the prompt.
Generation throughput for the 4B model on eight H100 GPUs at 320x512 resolution (10 FPS) is approximately 806 tokens per second, producing a 24-frame (2.4-second) clip from a 9-frame context in about 2.38 seconds.
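The throughput and latency figures can be related with a rough token count. The Python sketch below assumes the DV8x16x16 tokenizer (8x temporal, 16x16 spatial compression) and a causal layout in which the first frame is encoded on its own and every following group of 8 frames becomes one latent frame. That layout is an assumption made for illustration, as are the function and variable names.

```python
import math

HEIGHT, WIDTH = 320, 512           # resolution quoted above
SPATIAL, TEMPORAL = 16, 8          # DV8x16x16 compression ratios
TOKENS_PER_SECOND = 806            # throughput quoted above


def latent_frames(num_frames, temporal=TEMPORAL):
    """Assumed causal layout: first frame alone, then one latent frame
    per `temporal` input frames."""
    return 1 + math.ceil((num_frames - 1) / temporal)


tokens_per_latent_frame = (HEIGHT // SPATIAL) * (WIDTH // SPATIAL)

context_frames, generated_frames = 9, 24
context_latents = latent_frames(context_frames)
total_latents = latent_frames(context_frames + generated_frames)

# Only the tokens after the context have to be generated.
new_tokens = (total_latents - context_latents) * tokens_per_latent_frame
seconds = new_tokens / TOKENS_PER_SECOND

print(f"{tokens_per_latent_frame} tokens per latent frame")
print(f"{new_tokens} new tokens, about {seconds:.2f} s at {TOKENS_PER_SECOND} tok/s")
```

Under these assumptions the 24 generated frames correspond to 3 latent frames of 640 tokens each, or 1,920 tokens, which at 806 tokens per second takes roughly 2.38 seconds. The agreement with the quoted latency suggests the assumed layout is close to what was measured, but it should not be read as a confirmed description of the implementation.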
Cosmos Predict diffusion models use a latent diffusion architecture derived from the DiT (Diffusion Transformer) design. The forward process progressively adds noise to latent video tokens, and the reverse process denoises using a transformer that is conditioned on text.
Key architectural choices include:
Joint image-video training proceeds with domain normalization, progressing from 512p to 720p resolution using multi-aspect training buckets (1:1 and 16:9 ratios). Training uses BF16/FP32 mixed precision.
The four diffusion model sizes released in Cosmos 1.0 are:
| Model | Parameters | Output | Frames | Resolution |
|---|---|---|---|---|
| Cosmos-1.0-Diffusion-7B-Text2World | 7B | Video from text | 121 frames | 1280x704 @ 24 FPS |
| Cosmos-1.0-Diffusion-14B-Text2World | 14B | Video from text | 121 frames | 1280x704 @ 24 FPS |
| Cosmos-1.0-Diffusion-7B-Video2World | 7B | Video continuation | 120 frames | 1280x704 @ 24 FPS |
| Cosmos-1.0-Diffusion-14B-Video2World | 14B | Video continuation | 120 frames | 1280x704 @ 24 FPS |
Text2World models generate a full 121-frame (~5 second) clip from a text description alone. Video2World models take an initial image frame plus a text description and predict the subsequent 120 frames, which is well-suited to simulation use cases where a starting state is known.
On 3D consistency benchmarks, Cosmos Diffusion Text2World 7B achieves a Sampson error of 0.355 and a pose estimation success rate of 62.60%, compared to VideoLDM's 0.841 and 4.40%. On physics alignment metrics, the Video2World 7B model with 9-frame conditioning achieves a PSNR of 21.06, SSIM of 0.69, and IoU of 0.592.
Cosmos Transfer addresses the sim-to-real domain gap. Simulation engines like NVIDIA Omniverse can generate precise, labeled 3D scenes quickly, but the rendered footage looks visibly synthetic. Policies trained on purely synthetic data often fail when deployed on real hardware because the visual distribution shifts.
Cosmos Transfer takes structured inputs such as segmentation maps, depth maps, edge maps, LiDAR scans, pose estimation data, trajectory maps, and HD maps, and generates photorealistic video that matches the structure and physics of the simulation while looking like real-world footage.
The architecture uses a ControlNet approach: control signals are processed by an encoder that injects them into the main diffusion backbone without overwriting its pre-trained visual knowledge. This preserves the realism the backbone learned from 20 million hours of real video while forcing the output to conform to the structural constraints of the simulation.
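The idea of injecting control features without disturbing a pre-trained backbone can be illustrated with a toy example. The pure-Python sketch below follows the general ControlNet pattern, in which encoded control features are added to the backbone output as a residual through a projection that starts at zero. All weights, function names, and the two-element "latent" are hypothetical stand-ins, and the source does not state whether Cosmos Transfer uses zero-initialized projections.

```python
def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def backbone_block(x):
    """Stand-in for one frozen, pre-trained backbone layer."""
    frozen_weights = [[0.5, 0.1], [-0.2, 0.8]]
    return linear(frozen_weights, x)


def control_encoder(control):
    """Stand-in for the trainable encoder that reads a control signal."""
    encoder_weights = [[1.0, 0.0], [0.0, 1.0]]
    return linear(encoder_weights, control)


def controlled_block(x, control, injection_weights):
    """Backbone output plus a projected control residual."""
    base = backbone_block(x)
    residual = linear(injection_weights, control_encoder(control))
    return [b + r for b, r in zip(base, residual)]


latent = [1.0, 2.0]          # toy latent token
depth_hint = [0.3, -0.7]     # toy control signal, e.g. from a depth map

zero_init = [[0.0, 0.0], [0.0, 0.0]]   # projection before any training
trained = [[0.2, 0.0], [0.0, 0.2]]     # projection after some training

out_zero = controlled_block(latent, depth_hint, zero_init)
out_trained = controlled_block(latent, depth_hint, trained)

print("backbone only:   ", backbone_block(latent))
print("zero injection:  ", out_zero)
print("trained injection:", out_trained)
```

With the projection at zero the block reproduces the frozen backbone exactly, so training starts from the backbone's existing behavior. As the projection learns nonzero weights, the control signal begins to steer the output.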
For robotics, Cosmos Transfer is integrated into the Isaac GR00T Blueprint for synthetic manipulation motion generation, where it converts simulator-rendered arm trajectories into photorealistic training footage. For autonomous vehicle development, it plugs into the Omniverse Blueprint for AV Simulation, transforming geometric driving scenarios into realistic urban environments with varied lighting, weather, and surface textures.
Cosmos-Transfer2.5 (released October 2025) extended the design to a multi-controlnet that accepts simultaneous inputs of RGB, depth, segmentation, and other modalities configured via JSON-based controlnet_specs, enabling more fine-grained control over the output.
Cosmos Reason is a vision-language model that applies chain-of-thought reasoning to physical scenarios. Where Cosmos Predict generates video and Cosmos Transfer converts simulation to reality, Cosmos Reason understands what is happening in video and predicts whether actions or events are physically plausible.
The model processes video at 604x480 resolution and generates step-by-step textual reasoning before producing a final decision or annotation. It understands object motion, affordances, spatial constraints, and multi-step interactions across humans, objects, and environments.
Training proceeds in three stages:
Cosmos Reason 2 (released mid-2025) added extended context support up to 256K input tokens and introduced 2D/3D point localization with bounding box coordinates. It achieves an average score of 65.7 across robotics video question answering benchmarks including BridgeData V2, RoboVQA, and Agibot.
Primary uses of Cosmos Reason within the platform include:
The 256K-token context lets the model process long video sequences or reason over an entire episode of robot behavior at once.
Robots need enormous volumes of demonstration data to learn generalizable manipulation and locomotion skills. Human teleoperation is slow and expensive; Cosmos Predict and Transfer together offer a scalable alternative.
NVIDIA's GR00T-Dreams blueprint, built on top of Cosmos Predict, generates synthetic robot trajectories from a single image and a language prompt. In one internal evaluation, the pipeline produced 780,000 synthetic trajectories in 11 hours (equivalent to roughly 6,500 hours of human demonstration data), and combining these with real data improved Isaac GR00T N1 policy performance by 40%.
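The scale of these figures is easier to see with a little arithmetic. The Python sketch below uses only the three numbers quoted above; the derived quantities are simple ratios and the variable names are illustrative.

```python
trajectories = 780_000           # synthetic trajectories produced
generation_hours = 11            # wall-clock time to produce them
human_equivalent_hours = 6_500   # quoted human-demonstration equivalent

# Implied average length of one demonstration, in seconds.
seconds_per_trajectory = human_equivalent_hours * 3600 / trajectories

# Hours of demonstration data obtained per hour of generation.
speedup = human_equivalent_hours / generation_hours

trajectories_per_hour = trajectories / generation_hours

print(f"{seconds_per_trajectory:.0f} s of demonstration per trajectory")
print(f"about {speedup:.0f}x more data per hour than a single live demonstrator")
print(f"about {trajectories_per_hour:,.0f} trajectories per hour")
```

The quoted equivalence implies an average of 30 seconds per trajectory, and roughly 590 hours of demonstration-equivalent data for each hour of generation.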
The MimicGen NIM microservice integrates with Cosmos Transfer: developers record a small number of human demonstrations in NVIDIA Isaac Sim, use MimicGen to generate thousands of synthetic trajectory variants, and then run Cosmos Transfer to make those trajectories photorealistic. RoboCasa provides simulation-ready kitchen environments in OpenUSD format that serve as the starting geometry for this pipeline.
AV developers need rare edge cases such as unusual weather, pedestrian behavior, and sensor failure scenarios that are difficult and dangerous to collect from real driving. Cosmos allows teams to generate photorealistic footage of these scenarios at scale.
The Cosmos Three-Computer solution announced at CES 2025 integrates Cosmos models with NVIDIA Drive Hyperion (in-vehicle sensing), NVIDIA Drive AGX (real-time in-vehicle inference), and NVIDIA DGX (data center training). Cosmos generates synthetic camera, LiDAR, and radar sensor data that feeds the training loop, while Cosmos Reason annotates edge case clips automatically.
Waabi and Uber both cited Cosmos as part of their pipelines for accelerating autonomous driving development. Foretellix uses Cosmos to stress-test its AV simulation scenarios with rare events.
Beyond robotics and autonomous driving, Cosmos models can be used for understanding and searching large video collections. Milestone Systems, a video analytics platform, uses Cosmos to search for specific scenario patterns across large sensor networks. Linker Vision and Nexar apply it to traffic analysis and driver behavior monitoring.
Cosmos includes a two-stage safety system.
Pre-guard: Input text prompts are first screened against a blocklist of prohibited terms, then passed through NVIDIA's Aegis AI Content Safety model, which classifies prompts for harmful content. Prompts that pass both checks are forwarded to the generation model.
Post-guard: Generated video frames are evaluated by a content classifier. Faces in output footage are detected using RetinaFace and automatically blurred for privacy. Content classified as harmful is filtered before the video is returned to the caller.
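The control flow of the two stages can be sketched as follows. This Python example is a schematic only: the classifiers, the face-blurring step, and the generator are passed in as hypothetical callables rather than calls to the real Aegis or RetinaFace models, the blocklist entry is a placeholder, and rejecting the whole clip when any frame is flagged is an assumption about how filtering is applied.

```python
from typing import Callable, List, Optional

BLOCKLIST = {"forbidden_term"}  # placeholder entry, not the real list


def pre_guard(prompt: str, is_harmful_prompt: Callable[[str], bool]) -> bool:
    """Return True if the prompt passes the blocklist and the classifier."""
    if set(prompt.lower().split()) & BLOCKLIST:
        return False
    return not is_harmful_prompt(prompt)


def post_guard(
    frames: List[str],
    is_harmful_frame: Callable[[str], bool],
    blur_faces: Callable[[str], str],
) -> Optional[List[str]]:
    """Reject the clip if any frame is flagged, otherwise blur faces."""
    if any(is_harmful_frame(frame) for frame in frames):
        return None
    return [blur_faces(frame) for frame in frames]


def guarded_generate(prompt, generate, is_harmful_prompt,
                     is_harmful_frame, blur_faces):
    """Run pre-guard, generation, then post-guard; None means blocked."""
    if not pre_guard(prompt, is_harmful_prompt):
        return None
    return post_guard(generate(prompt), is_harmful_frame, blur_faces)


# Toy stand-ins so the sketch runs end to end.
def fake_generate(prompt):
    return [f"frame{i}:{prompt}" for i in range(3)]


def never_harmful(_):
    return False


def fake_blur(frame):
    return frame + "|blurred"


result = guarded_generate("a robot arm stacking boxes", fake_generate,
                          never_harmful, never_harmful, fake_blur)
blocked = guarded_generate("forbidden_term in a kitchen", fake_generate,
                           never_harmful, never_harmful, fake_blur)

print(result)
print(blocked)
```

Passing the models in as callables keeps the sketch independent of any particular classifier and makes the ordering explicit: a prompt that fails the pre-guard never reaches the generator.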
The license terms also require that users not bypass, disable, or reduce the efficacy of any guardrail or safety mechanism. Circumventing these controls terminates the license.
Generated videos carry invisible watermarks. NVIDIA has not published the details of the watermarking scheme, but the stated purpose is to allow identification of synthetically generated footage if it is redistributed.
Cosmos models are distributed under the NVIDIA Open Model License (NOML). The license permits commercial use, modification, and redistribution subject to several conditions:
NVIDIA has not disclosed the specific datasets used to train Cosmos, nor has it made the full training pipeline publicly available. Critics have noted that this limits reproducibility and makes it impossible to audit the training data for copyright or consent concerns. NVIDIA refers to the models as "open" based on the availability of weights, rather than on full process transparency.
The world model space attracted multiple major players around the same time as the Cosmos announcement. The following table summarizes the main alternatives as of late 2025:
| System | Developer | Release | Access | Parameters | Physical AI focus | License |
|---|---|---|---|---|---|---|
| Cosmos | NVIDIA | Jan 2025 | Public weights | 4B to 14B (1.0); 2B/14B (2.5) | Yes (robotics, AV) | NVIDIA Open Model License |
| Genie 3 | Google DeepMind | Aug 2025 | Research preview only | Not disclosed | Partial | Proprietary |
| Marble | World Labs | Nov 2025 | API ($20/month+) | Not disclosed | No (3D environment creation) | Proprietary SaaS |
| Oasis | Decart | Oct 2024 | API | Not disclosed | No (interactive game worlds) | Proprietary |
| Wan | Alibaba | Feb 2025 | Public weights | 1.3B to 14B | Limited | Apache 2.0 |
Cosmos is distinct from this peer group in three ways. First, it is the only platform that explicitly targets physical AI throughout its design: the training data distribution, the structured control inputs in Transfer, and the physics-reasoning capabilities of Reason are all oriented toward robotics and autonomous vehicles rather than general creative video generation. Second, it ships with an integrated toolchain (Tokenizer, NeMo Curator, Isaac Sim integration) that the others lack. Third, alongside Alibaba's Wan, it is one of only two systems in this group whose model weights are publicly available under a license that permits commercial use.
Genie 3 from Google DeepMind pursues real-time interactive generation of navigable 3D worlds at 24 FPS, a different objective than Cosmos's batch synthetic data generation. Genie 3 remained in limited research preview through mid-2026 and has not been released under a public license.
World Labs' Marble focuses on generating persistent, downloadable 3D environments from diverse inputs including text, photos, and panoramic images. It is a commercial product with API pricing rather than an open platform.
Decart's Oasis was originally demonstrated as a playable Minecraft-style world generated in real-time. Decart has explored porting Oasis to custom inference hardware to reduce latency but has not positioned the system for physical AI training.
NVIDIA announced a broad set of industry partners at CES 2025 and at the March 2025 major release.
| Partner | Domain | Use of Cosmos |
|---|---|---|
| 1X | Humanoid robotics | Cosmos Predict and Transfer for training NEO Gamma humanoid |
| Agility Robotics | Humanoid robotics | Scaling photorealistic training data beyond real-world collection |
| Figure AI | Humanoid robotics | Synthetic training data generation |
| Skild AI | Robot brain models | Cosmos Transfer to augment synthetic datasets |
| Uber | Autonomous vehicles | Accelerating autonomous driving model development |
| Waabi | Autonomous vehicles | Synthetic data for long-haul AV |
| Foretellix | AV simulation | Stress-testing with rare scenarios |
| Parallel Domain | Synthetic data | Photorealistic AV data generation |
| Nexar | Traffic AI | Driver and traffic pattern analysis |
| Virtual Incision | Surgical robotics | Surgical simulation data |
| XPENG | EVs and robots | AV and humanoid training data |
| Agile Robots | Industrial robotics | Manipulation training data |
| Fourier | Humanoid robotics | General training data generation |
| Neura Robotics | Humanoid robotics | Synthetic scenario generation |
| Oxa | Autonomous vehicles | Unstructured environment simulation |
| Wayve | Embodied driving AI | Synthetic data augmentation |
Foxconn announced that it is using the NVIDIA Omniverse blueprint (which integrates Cosmos Transfer) to simulate industrial manipulators, humanoids, and mobile robots in its manufacturing facilities, though the company's direct use of Cosmos model weights has not been separately confirmed.
Cosmos is designed to slot into NVIDIA's broader physical AI stack:
Cosmos models are also available in the Vertex AI Model Garden on Google Cloud and through the NVIDIA API catalog as NIM microservices, allowing deployment without managing GPU infrastructure directly.
Several limitations of Cosmos have been noted:
Training data opacity: NVIDIA has not disclosed the specific sources used to build the 20-million-hour training dataset. This limits independent audits for copyright issues, privacy violations, or demographic bias in the training distribution.
License restrictions: Although NVIDIA markets Cosmos as "open," the NOML includes a guardrail circumvention clause, which has no counterpart in permissive open-source licenses such as Apache 2.0, in addition to an attribution display requirement and a patent litigation termination clause. Bypassing safety guardrails automatically terminates all rights under the license.
Domain gap not fully eliminated: Cosmos Transfer reduces but does not eliminate the sim-to-real gap. The quality of photorealization depends on the quality of the underlying simulation geometry and the fidelity of the control signals. Poorly specified segmentation maps or inaccurate depth information produce unrealistic output.
Long video coherence: Like most video generation models, Cosmos Predict models lose temporal coherence over very long sequences. The 1.0 autoregressive models can attend to at most 33 frames (about 3.3 seconds at 10 FPS, matching the 9-frame context plus 24 generated frames), which limits their utility for tasks requiring understanding of long episodes.
Hardware requirements: Running the 14B diffusion models at full resolution requires multiple high-end GPUs. The reported throughput for the 4B autoregressive base model was measured on eight H100 GPUs. This limits accessibility for smaller research groups and startups without access to high-end GPU clusters.
Physical accuracy: Cosmos models learn statistical regularities in video rather than explicit physics. The generated footage looks physically plausible in most cases, but the models can produce physically incorrect events in unusual scenarios. Cosmos Reason's physics-filtering capability partially mitigates this, but does not guarantee physical correctness of generated data.