# NVIDIA Cosmos

> Source: https://aiwiki.ai/wiki/nvidia_cosmos
> Updated: 2026-06-21
> Categories: AI Models, Embodied AI, Generative AI, NVIDIA, Robotics, World Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**NVIDIA Cosmos** is a [world foundation model](/wiki/world_model) platform developed by [NVIDIA](/wiki/nvidia) for [physical AI](/wiki/physical_ai) applications, including autonomous vehicles and robotics. Announced by CEO Jensen Huang at CES on January 6, 2025, Cosmos provides a suite of pre-trained generative models, video tokenizers, safety guardrails, and a data processing pipeline that together let developers generate physics-grounded synthetic training data at scale. [1][3] The first models were pre-trained on roughly 20 million hours of real-world video (about 9,000 trillion tokens), and are released under the NVIDIA Open Model License with public weights on Hugging Face and the NVIDIA NGC catalog. [2][3] At launch Huang framed the goal directly: "We created Cosmos to democratize physical AI and put general robotics in reach of every developer." [1]

The platform centers on three families of fine-tunable models: **Cosmos Predict** for simulating future world states as video, **Cosmos Transfer** for converting structured simulation data into photorealistic footage, and **Cosmos Reason** for chain-of-thought physical reasoning from video. [4] Together these address the long-standing shortage of diverse, physics-accurate training data that has slowed progress in [embodied AI](/wiki/embodied_ai). In June 2026, NVIDIA consolidated all three roles into a single unified model, [NVIDIA Cosmos 3](/wiki/nvidia_cosmos_3). [17]

## What is NVIDIA Cosmos used for?

Cosmos exists to solve the data bottleneck in physical AI: it generates controllable, photoreal video that can train robots and autonomous vehicles without the cost and risk of collecting equivalent real-world footage. Developers use it to manufacture rare edge cases, augment scarce demonstration data, vary lighting and weather, and automatically label trajectories. NVIDIA reports that domain-specific post-training of the base models can deliver up to 10x higher accuracy on downstream tasks compared with using the base models directly. [3][5]

## Background

### Why is training data the bottleneck in physical AI?

Training capable [embodied AI](/wiki/embodied_ai) systems requires video that is dense, diverse, and physically consistent. Real-world collection is slow, expensive, and difficult to scale to the range of edge cases a robot or autonomous vehicle will encounter. Simulation environments such as [NVIDIA Omniverse](/wiki/nvidia_omniverse) can generate controlled scenarios, but historically the resulting footage looked synthetic enough to cause a domain gap when policies trained on it were deployed on real hardware. This sim-to-real gap remains one of the central challenges in robotics.

World foundation models offer a potential solution: train a large generative model on massive quantities of real video so that it learns the visual statistics and physical dynamics of the actual world, then use that model to produce synthetic footage that is both controllable and photoreal. The generated data can augment or replace costly real-world collection, letting developers cover rare scenarios, vary lighting and weather conditions, and label trajectories automatically. NVIDIA positions the approach as the physical-AI analog of large language models: "Like large language models, world foundation models are fundamental to advancing robot and AV development, yet not all developers have the expertise and resources to train their own," Huang said at CES. [1]

### Prior work

World models as a research concept go back to David Ha and Jürgen Schmidhuber's 2018 paper introducing models that learn compressed representations of environment dynamics. In the years that followed, reinforcement learning researchers used smaller learned world models for planning, but these operated in low-dimensional or game-like settings rather than on high-resolution video of the physical world.

Between 2023 and 2024 a new generation of large-scale video generation models emerged, and several groups began framing them explicitly as world models for physical AI. [Genie 3](/wiki/genie_3) from Google DeepMind, World Labs' Marble, and Decart's Oasis were among the efforts that appeared around the same time as NVIDIA Cosmos. NVIDIA's approach differed in its explicit physical-AI focus, its open release of model weights, and its integration with NVIDIA's broader hardware and software stack.

## When was NVIDIA Cosmos released?

NVIDIA announced Cosmos at its CES 2025 keynote on January 6, 2025. [1] Jensen Huang demonstrated three use cases: video search and scenario identification, physics-based synthetic data generation from 3D scenes built in NVIDIA Omniverse, and a "multiverse" simulation mode in which the model generates multiple plausible continuations of a scenario to help robots or vehicles plan under uncertainty. He set the stakes in market terms, predicting that "physical AI will revolutionize the $50 trillion manufacturing and logistics industries" and that "everything that moves, from cars and trucks to factories and warehouses, will be robotic and embodied by AI." [1]

On January 7, 2025, NVIDIA released the initial Cosmos 1.0 model weights on Hugging Face and the NGC catalog under the NVIDIA Open Model License. [2] This initial release included four autoregressive models (4B, 5B, 12B, 13B parameters) and four diffusion models (7B and 14B parameters, in both Text2World and Video2World configurations), along with the Cosmos Tokenizer. [3]

On March 18, 2025, NVIDIA announced a major release that added the Cosmos Predict and Cosmos Transfer model families, introduced Cosmos Reason in early access, and expanded the platform's integration with Google Cloud Vertex AI and additional robotics partner toolchains. [4]

Subsequent releases in late 2025 brought Cosmos-Predict2 and Cosmos-Predict2.5 (a 2B/14B flow-based model that unifies text-to-world and video-to-world generation), as well as Cosmos-Transfer2.5 (a multi-controlnet variant accepting simultaneous RGB, depth, and segmentation inputs). [16] The following table summarizes the release timeline:

| Date | Release | Highlights |
|---|---|---|
| Jan 6, 2025 | Cosmos announced at CES 2025 | Platform unveiled with Predict, Transfer, Reason concept and Cosmos Three-Computer AV solution [1] |
| Jan 7, 2025 | Cosmos 1.0 weights | 4 autoregressive (4B-13B) + 4 diffusion (7B/14B) models + Tokenizer, public under NOML [2][3] |
| Mar 18, 2025 | Major release | Cosmos Predict and Transfer families, Cosmos Reason early access, Vertex AI integration [4] |
| Oct 2025 | Cosmos-Predict2.5 / Transfer2.5 | Unified 2B/14B flow model; multi-controlnet Transfer; post-training data expanded to 200M clips [16] |
| Jun 1, 2026 | Cosmos 3 | Unified open omnimodel (Nano 16B / Super 64B); OpenMDW license; Cosmos Coalition [17][20] |

## Platform architecture

Cosmos is a platform rather than a single model. Its components work together in a pipeline:

1. Raw video is ingested and tokenized by the **Cosmos Tokenizer**.
2. Pre-trained **world foundation models** (autoregressive or diffusion) generate new video conditioned on text, images, video, or structured control signals.
3. **Cosmos Guardrails** filter inputs and outputs for safety.
4. **NVIDIA NeMo Curator** handles large-scale data curation and labeling.
5. Developers use **NeMo Framework** to fine-tune models on proprietary or domain-specific data.

The models are designed to be post-trained. NVIDIA reports that domain-specific post-training can achieve up to 10x higher accuracy on downstream tasks compared to using the base models directly. [3][5]

## Cosmos Tokenizer

The Cosmos Tokenizer converts raw images and video into compact token representations that the world foundation models consume. It supports both continuous tokens (for diffusion-based models) and discrete tokens (for autoregressive models), and its causal design means it can process streaming video without needing the entire sequence in advance.

The tokenizer achieves spatial compression ratios of 8x8 or 16x16 for images, and spatio-temporal compression ratios of 4x8x8, 8x8x8, or 8x16x16 for video. The most aggressive compression (8x temporal combined with 16x16 spatial) results in a total compression factor of up to 2048x. NVIDIA reports that the Cosmos tokenizers deliver 8x more total compression and run 12x faster than prior state-of-the-art methods, and benchmarks show a +4 dB PSNR improvement on the DAVIS video dataset while using fewer parameters than competing approaches. [2][3]
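
The compression factors quoted above are simply the product of the per-axis ratios. The following is a minimal arithmetic sketch; the function and dictionary names are illustrative and are not part of any Cosmos API.

```python
from math import prod


def compression_factor(*ratios: int) -> int:
    """Total compression is the product of the per-axis ratios."""
    return prod(ratios)


# Image tokenizers compress (height, width).
image_factors = {
    "8x8": compression_factor(8, 8),
    "16x16": compression_factor(16, 16),
}

# Video tokenizers compress (time, height, width).
video_factors = {
    "4x8x8": compression_factor(4, 8, 8),
    "8x8x8": compression_factor(8, 8, 8),
    "8x16x16": compression_factor(8, 16, 16),  # the 2048x case in the text
}

for name, factor in {**image_factors, **video_factors}.items():
    print(f"{name}: one latent position per {factor} pixel positions")
```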

On MS-COCO and ImageNet-1K image benchmarks, Cosmos image tokenizers outperform FLUX and LlamaGen baselines. On video, continuous tokenizers outperform CogVideoX and Omni-tokenizer across PSNR, SSIM, and rFVD metrics. Discrete tokenizers show better compression-quality tradeoffs than alternatives at high compression rates. [3]

The tokenizer is released separately on Hugging Face under the NVIDIA Cosmos Tokenizer collection, allowing developers to use it independently of the full world foundation models. [9]

## Training data

The Cosmos 1.0 models were pre-trained on approximately 20 million hours of raw video representing roughly 9,000 trillion tokens. [2][3] NVIDIA does not fully disclose the specific sources of this data. The paper accompanying the models describes the curated training set as approximately 100 million clips drawn from the following broad categories: [3]

| Category | Share of training clips |
|---|---|
| Nature dynamics | 20% |
| Hand and object manipulation | 16% |
| Spatial awareness | 16% |
| Driving | 11% |
| Human motion | 10% |
| First-person point of view | 8% |
| Dynamic camera | 8% |
| Synthetic rendering | 4% |
| Other | 7% |

This distribution reflects the physical-AI focus: driving, manipulation, and spatial awareness collectively account for over 40% of training data. The strong representation of manipulation and first-person footage is intended to make the models useful for robotic arm control and humanoid motion tasks.
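
As a quick check on the table, the shares can be turned into approximate clip counts. This sketch assumes the rounded figure of 100 million clips from the paper; the variable names are illustrative.

```python
TOTAL_CLIPS = 100_000_000  # approximate curated set size reported in the paper

shares_pct = {
    "Nature dynamics": 20,
    "Hand and object manipulation": 16,
    "Spatial awareness": 16,
    "Driving": 11,
    "Human motion": 10,
    "First-person point of view": 8,
    "Dynamic camera": 8,
    "Synthetic rendering": 4,
    "Other": 7,
}

# The published shares should cover the whole training set.
total_pct = sum(shares_pct.values())

# Approximate number of clips per category.
clips = {name: TOTAL_CLIPS * pct // 100 for name, pct in shares_pct.items()}

# Combined share of the three categories singled out in the text.
physical_ai_pct = (
    shares_pct["Driving"]
    + shares_pct["Hand and object manipulation"]
    + shares_pct["Spatial awareness"]
)

print(f"Shares sum to {total_pct}%")
print(f"Driving + manipulation + spatial awareness: {physical_ai_pct}%")
```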

Data curation used NVIDIA NeMo Curator, a CUDA-accelerated pipeline that filters, clips, and labels raw footage. Processing the full 20 million hours took 14 days on NVIDIA Blackwell GPUs, compared to more than three years if run on a CPU-only pipeline of equivalent power consumption. On Hopper GPUs the same task takes approximately 40 days. [2][5]
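
The processing times above imply the following rough throughput and speedup figures. This is a back-of-the-envelope sketch: "more than three years" is treated as a lower bound of 3 x 365 days, which is an assumption about how to read the claim.

```python
TOTAL_HOURS = 20_000_000  # raw video processed

days_on_gpu = {"Blackwell": 14, "Hopper": 40}
CPU_DAYS_LOWER_BOUND = 3 * 365  # "more than three years", taken as a minimum

# Hours of video curated per day of wall-clock time.
video_hours_per_day = {gpu: TOTAL_HOURS / d for gpu, d in days_on_gpu.items()}

# Relative speedups implied by the reported durations.
blackwell_vs_hopper = days_on_gpu["Hopper"] / days_on_gpu["Blackwell"]
blackwell_vs_cpu_min = CPU_DAYS_LOWER_BOUND / days_on_gpu["Blackwell"]

for gpu, rate in video_hours_per_day.items():
    print(f"{gpu}: about {rate:,.0f} hours of video per day")
print(f"Blackwell vs Hopper: about {blackwell_vs_hopper:.1f}x faster")
print(f"Blackwell vs CPU-only: at least {blackwell_vs_cpu_min:.0f}x faster")
```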

The Cosmos-Predict2.5 generation (released October 2025) expanded the post-training data to 200 million high-quality clips. [16]

## Cosmos Predict

Cosmos Predict is the world simulation component of the platform. It models future world states as video from multimodal inputs: text descriptions, single images, video sequences, or start-and-end frame pairs. The model can predict what happens next in a scene, interpolate between two keyframes, or generate video continuations from a text prompt alone. [4]

### Autoregressive variants

Autoregressive Cosmos Predict models use a GPT-style decoder architecture trained to predict the next discrete video token given the preceding sequence. The architecture is built on Llama3-style transformer blocks with:

- Absolute positional embeddings combined with 3D Rotary Position Embeddings (RoPE) that encode spatial and temporal dimensions separately.
- Self-attention layers over the video token sequence.
- Cross-attention layers that inject T5-XXL text embeddings, allowing text conditioning without requiring the text tokens to be part of the main autoregressive sequence.
- QK-normalization using RMSNorm for training stability.
- Progressive context extension from 17 frames (Stage 1) to 34 frames (Stage 1.1) via YaRN.
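
To illustrate the idea of encoding temporal and spatial positions separately, here is a minimal sketch of a factorized 3D rotary embedding in plain Python. It is not NVIDIA's implementation: the channel split, the frequency base, and the function names are assumptions chosen for readability, and the absolute positional embeddings mentioned above are omitted.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to `pos`."""
    half = len(vec) // 2
    out = []
    for i in range(half):
        theta = pos * base ** (-i / half)  # lower frequency for later pairs
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [
            x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta),
        ]
    return out


def rope_3d(vec, t, h, w, split=(4, 2, 2)):
    """Apply independent rotary embeddings for time, height, and width.

    `split` says how many channels each axis receives. The values are
    hypothetical and only need to be even and sum to len(vec).
    """
    assert sum(split) == len(vec) and all(s % 2 == 0 for s in split)
    out, start = [], 0
    for size, pos in zip(split, (t, h, w)):
        out += rope_1d(vec[start:start + size], pos)
        start += size
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors with 8 channels each.
q = [0.1 * (i + 1) for i in range(8)]
k = [0.05 * (8 - i) for i in range(8)]

# Same relative offset (3, -1, -3) at two different absolute locations.
score_a = dot(rope_3d(q, 1, 2, 3), rope_3d(k, 4, 1, 0))
score_b = dot(rope_3d(q, 11, 7, 5), rope_3d(k, 14, 6, 2))
print(abs(score_a - score_b) < 1e-9)
```

The property the example checks is the reason rotary embeddings are attractive for video: the attention score between two tokens depends only on their relative offset along each axis, not on where in the clip they sit.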

The autoregressive models use the discrete variant of the Cosmos Tokenizer (DV8x16x16), which maps video to integer tokens. Because pure discrete decoding limits visual quality, NVIDIA trains a diffusion decoder that maps discrete DV8x16x16 tokens back to the higher-fidelity continuous CV8x8x8 token space before final pixel rendering. [3]

The four autoregressive model sizes released in Cosmos 1.0 are:

| Model | Parameters | Type | Conditioning |
|---|---|---|---|
| Cosmos-1.0-Autoregressive-4B | 4B | Base (video-only) | Video in, video out |
| Cosmos-1.0-Autoregressive-5B-Video2World | 5B | Text + Video | Text + video in, video out |
| Cosmos-1.0-Autoregressive-12B | 12B | Base (video-only) | Video in, video out |
| Cosmos-1.0-Autoregressive-13B-Video2World | 13B | Text + Video | Text + video in, video out |

The 5B and 13B Video2World variants are derived from the 4B and 12B base models by adding cross-attention layers and performing additional Stage 2 training on text-video pairs. The base models acquire no language understanding during pre-training; all textual information enters only through the T5 embeddings fed to the added cross-attention layers.

Generation throughput for the 4B model on eight H100 GPUs at 320x512 resolution (10 FPS) is approximately 806 tokens per second, producing a 24-frame (2.4-second) clip from a 9-frame context in about 2.38 seconds. [3]
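
The throughput and latency figures are mutually consistent under one reading of the tokenizer. The sketch below assumes that the DV8x16x16 tokenizer turns each group of 8 generated frames into one latent frame and each 16x16 pixel patch into one token; the source does not show this derivation, so treat it as a plausibility check rather than NVIDIA's own calculation.

```python
HEIGHT, WIDTH = 320, 512        # generation resolution
T_RATIO, S_RATIO = 8, 16        # DV8x16x16 compression ratios
FPS = 10
TOKENS_PER_SECOND = 806         # reported throughput on eight H100 GPUs
GENERATED_FRAMES = 24

# Tokens needed to represent one latent frame.
tokens_per_latent_frame = (HEIGHT // S_RATIO) * (WIDTH // S_RATIO)

# Assumption: 8 generated frames correspond to one latent frame.
latent_frames = GENERATED_FRAMES // T_RATIO

tokens_to_generate = tokens_per_latent_frame * latent_frames
generation_seconds = tokens_to_generate / TOKENS_PER_SECOND
clip_seconds = GENERATED_FRAMES / FPS

print(f"{tokens_to_generate} tokens in about {generation_seconds:.2f} s "
      f"for a {clip_seconds:.1f} s clip")
```

Under these assumptions generation runs at roughly real time: the clip takes about as long to produce as it does to play back.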

### Diffusion variants

Cosmos Predict diffusion models use a latent diffusion architecture derived from the DiT (Diffusion Transformer) design. The forward process progressively adds noise to latent video tokens, and the reverse process denoises using a transformer that is conditioned on text.

Key architectural choices include:

- **3D patchification** of the latent token volume, which preserves spatial and temporal structure throughout the transformer stack.
- **FPS-aware 3D RoPE** that handles variable resolutions, aspect ratios, and frame rates within a single model.
- **T5-XXL text encoding** with embeddings zero-padded to a fixed length of 512 tokens.
- **AdaLN-LoRA** (adaptive layer normalization combined with low-rank adaptation): this replaces full adaptive layer normalization and achieves a 36% reduction in parameter count (from a naive 11B to the released 7B) while maintaining generation quality.
- **Query-key RMSNorm** for attention stability during training.
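
The parameter saving from AdaLN-LoRA comes from factorizing the projection that produces the modulation vectors. The following sketch counts parameters for a dense projection versus a low-rank one. All dimensions are illustrative assumptions (the source does not list them), biases are ignored, and the result is not meant to reproduce the 11B-to-7B figure exactly.

```python
def adaln_dense_params(d_model: int, n_mod: int) -> int:
    """Dense map from a d_model conditioning vector to n_mod modulation vectors."""
    return d_model * (n_mod * d_model)


def adaln_lora_params(d_model: int, n_mod: int, rank: int) -> int:
    """The same map factorized through a bottleneck of size `rank`."""
    return d_model * rank + rank * (n_mod * d_model)


# Hypothetical configuration, chosen only for illustration.
D_MODEL = 4096   # hidden size
N_MOD = 9        # e.g. shift, scale, gate for three sub-layers per block
RANK = 256       # low-rank bottleneck
N_BLOCKS = 28    # transformer blocks

dense_total = N_BLOCKS * adaln_dense_params(D_MODEL, N_MOD)
lora_total = N_BLOCKS * adaln_lora_params(D_MODEL, N_MOD, RANK)

# Reduction reported in the text: from 11B to 7B parameters.
reported_reduction = (11 - 7) / 11

print(f"dense modulation params: {dense_total / 1e9:.2f}B")
print(f"low-rank modulation params: {lora_total / 1e9:.2f}B")
print(f"reported overall reduction: {reported_reduction:.0%}")
```

Because the dense projection grows with the square of the hidden size while the low-rank version grows linearly, the saving becomes larger as the model widens.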

Joint image-video training proceeds with domain normalization, progressing from 512p to 720p resolution using multi-aspect training buckets (1:1 and 16:9 ratios). Training uses BF16/FP32 mixed precision. [3]

The four diffusion model sizes released in Cosmos 1.0 are:

| Model | Parameters | Output | Frames | Resolution |
|---|---|---|---|---|
| Cosmos-1.0-Diffusion-7B-Text2World | 7B | Video from text | 121 frames | 1280x704 @ 24 FPS |
| Cosmos-1.0-Diffusion-14B-Text2World | 14B | Video from text | 121 frames | 1280x704 @ 24 FPS |
| Cosmos-1.0-Diffusion-7B-Video2World | 7B | Video continuation | 120 frames | 1280x704 @ 24 FPS |
| Cosmos-1.0-Diffusion-14B-Video2World | 14B | Video continuation | 120 frames | 1280x704 @ 24 FPS |

Text2World models generate a full 121-frame (~5 second) clip from a text description alone. Video2World models take an initial image frame plus a text description and predict the subsequent 120 frames, which is well-suited to simulation use cases where a starting state is known.

On 3D consistency benchmarks, Cosmos Diffusion Text2World 7B achieves a Sampson error of 0.355 and a pose estimation success rate of 62.60%, compared to VideoLDM's 0.841 and 4.40%. On physics alignment metrics, the Video2World 7B model with 9-frame conditioning achieves a PSNR of 21.06, SSIM of 0.69, and IoU of 0.592. [3]

## Cosmos Transfer

Cosmos Transfer addresses the sim-to-real domain gap. Simulation engines like [NVIDIA Omniverse](/wiki/nvidia_omniverse) can generate precise, labeled 3D scenes quickly, but the rendered footage looks visibly synthetic. Policies trained on purely synthetic data often fail when deployed on real hardware because the visual distribution shifts.

Cosmos Transfer takes structured inputs such as segmentation maps, depth maps, edge maps, LiDAR scans, pose estimation data, trajectory maps, and HD maps, and generates photorealistic video that matches the structure and physics of the simulation while looking like real-world footage. [4][13]

The architecture uses a ControlNet approach: control signals are processed by an encoder that injects them into the main diffusion backbone without overwriting its pre-trained visual knowledge. This preserves the realism the backbone learned from 20 million hours of real video while forcing the output to conform to the structural constraints of the simulation. [13]
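
A toy sketch of how a control branch can steer a frozen backbone without disturbing it at the start of training. It uses the zero-initialized injection from the original ControlNet recipe; the source does not confirm that Cosmos Transfer uses exactly this mechanism, and every function here is a hypothetical stand-in rather than real model code.

```python
def backbone_block(x):
    """Stand-in for a frozen, pre-trained backbone block."""
    return [2.0 * v + 1.0 for v in x]


def control_encoder(control):
    """Stand-in for the trainable encoder of a control signal (e.g. depth)."""
    return [0.5 * c for c in control]


class ZeroInitProjection:
    """Scalar gate starting at zero, standing in for a zero-initialized layer."""

    def __init__(self):
        self.weight = 0.0

    def __call__(self, features):
        return [self.weight * f for f in features]


def controlled_block(x, control, proj):
    """Backbone output plus a gated residual derived from the control signal."""
    base = backbone_block(x)
    residual = proj(control_encoder(control))
    return [b + r for b, r in zip(base, residual)]


x = [0.2, 0.4, 0.6]
depth = [1.0, 0.0, 2.0]
proj = ZeroInitProjection()

# At initialization the control branch contributes nothing.
at_init = controlled_block(x, depth, proj)

# After training moves the gate away from zero, the control shapes the output.
proj.weight = 1.0
after_training = controlled_block(x, depth, proj)

print(at_init == backbone_block(x), after_training != at_init)
```

Starting from an exact copy of the backbone's behavior is what lets the control branch learn structure without erasing the realism learned in pre-training.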

For robotics, Cosmos Transfer is integrated into the Isaac GR00T Blueprint for synthetic manipulation motion generation, where it converts simulator-rendered arm trajectories into photorealistic training footage. For autonomous vehicle development, it plugs into the Omniverse Blueprint for AV Simulation, transforming geometric driving scenarios into realistic urban environments with varied lighting, weather, and surface textures. [11]

Cosmos-Transfer2.5 (released October 2025) extended the design to a multi-controlnet that accepts simultaneous inputs of RGB, depth, segmentation, and other modalities configured via JSON-based controlnet_specs, enabling more fine-grained control over the output. [16]

## Cosmos Reason

Cosmos Reason is a vision-language model that applies chain-of-thought reasoning to physical scenarios. Where Cosmos Predict generates video and Cosmos Transfer converts simulation to reality, Cosmos Reason understands what is happening in video and predicts whether actions or events are physically plausible. [6][7]

The model processes video at 604x480 resolution and generates step-by-step textual reasoning before producing a final decision or annotation. It understands object motion, affordances, spatial constraints, and multi-step interactions across humans, objects, and environments.

Training proceeds in three stages:

1. **Pre-training**: A Vision Transformer processes video frames into embeddings aligned with text.
2. **Supervised fine-tuning (SFT)**: The model is specialized on curated datasets covering object affordances, action sequences, and spatial reasoning. SFT boosts base benchmark performance by approximately 10%.
3. **Reinforcement learning**: The model is trained with verifiable physical rewards such as "arrow-of-time" dynamics (detecting whether a video is physically plausible or time-reversed). RL adds approximately 5% on top of the SFT baseline. [6][7]
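
The "arrow-of-time" signal is verifiable because the correct answer is known by construction. A minimal sketch of how such a reward could be built follows; the function names, label strings, and answer parsing are assumptions for illustration, not NVIDIA's training code.

```python
import random


def make_arrow_of_time_example(frames, rng):
    """Return (clip, label); the clip is time-reversed with probability 0.5."""
    if rng.random() < 0.5:
        return list(reversed(frames)), "reversed"
    return list(frames), "forward"


def reward(prediction: str, label: str) -> float:
    """Verifiable reward: 1.0 if the predicted direction is right, else 0.0."""
    return 1.0 if prediction.strip().lower() == label else 0.0


rng = random.Random(0)           # seeded for repeatability
frames = ["f0", "f1", "f2", "f3"]  # placeholder for decoded video frames

clip, label = make_arrow_of_time_example(frames, rng)

# A model's textual answer would be scored against the known label.
print(label, reward("forward", label), reward("reversed", label))
```

No human annotation is needed: the label comes for free from the data-generation step, which is what makes this kind of reward cheap to scale.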

Cosmos Reason 2 (released mid-2025) added extended context support up to 256K input tokens and introduced 2D/3D point localization with bounding box coordinates. It achieves an average score of 65.7 across robotics video question answering benchmarks including BridgeData V2, RoboVQA, and Agibot. [7]

Primary uses of Cosmos Reason within the platform include:

- Critiquing the quality of synthetically generated clips before they enter a training set.
- Filtering and curating large video datasets by text-based queries.
- Generating natural-language annotations for robot demonstration data.
- Serving as a reasoning backbone in vision-language-action (VLA) models.

The 256K-token context window lets Cosmos Reason process long video sequences or reason over an entire episode of robot behavior at once. [7]

## NVIDIA Cosmos 3 (2026)

*Main article: [NVIDIA Cosmos 3](/wiki/nvidia_cosmos_3)*

**NVIDIA Cosmos 3** is a major new generation of the platform announced by NVIDIA at GTC Taipei at COMPUTEX 2026, with the official launch dated June 1, 2026. NVIDIA describes it as an open frontier world foundation model for physical AI and "the world's first fully open omnimodel that can natively understand and generate text, images, video, ambient sound and actions." [17] Rather than shipping the three earlier model families as separate components, Cosmos 3 consolidates them into a single model: NVIDIA notes that previous Cosmos releases separated world generation, physical understanding, and controlled scene generation (the roles of Cosmos Predict, Cosmos Reason, and Cosmos Transfer) into different models and workflows, and that this release unifies those capabilities. [18][19]

The model is built on a Mixture-of-Transformers (MoT) architecture organized around two towers. A Reasoner tower is a vision-language model that interprets multimodal observations such as images, video, and text autoregressively to understand object interactions, motion, and spatial-temporal relationships. A Generator tower then uses a diffusion-based process to produce physics-aware video and action outputs conditioned on the reasoner's understanding, an approach NVIDIA frames as letting physical AI "think before it acts." [18][19] Cosmos 3 was trained on what NVIDIA calls one of the largest multimodal physical AI datasets, including billions of samples across text, image, video, sound, and action trajectories. [17]

Cosmos 3 launched in two open variants, with a third positioned for the edge:

| Variant | Parameters | Role |
|---|---|---|
| Cosmos 3 Nano | 16B | Compact model optimized for efficient inference and high-quality video and action reasoning in fractions of a second [19] |
| Cosmos 3 Super | 64B | Larger model for maximum quality and capability, aimed at post-training robotics and autonomous-vehicle models [19] |
| Cosmos 3 Edge | Coming soon | Variant targeted at real-time inference at the edge [17] |

NVIDIA reports that the unified design reduces physical AI training and evaluation cycles from months to days. [17] The company is open-sourcing the Cosmos 3 models, training scripts, deployment tools, and datasets, with checkpoints for Nano and Super available on Hugging Face, code and examples on GitHub, hosted experiences on build.nvidia.com, and packaged deployment through NVIDIA NIM microservices. [19] The release uses a single model-centric open license, OpenMDW, intended to keep weights, architecture, documentation, datasets, benchmarks, and code under one set of terms. [20]

Alongside the model, NVIDIA introduced the **NVIDIA Cosmos Coalition**, a collaboration among world-model builders, AI developers, and physical AI leaders to advance open world models. Founding members include Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI, who can contribute models, research, and evaluation techniques while building on Cosmos 3 technologies, NVIDIA training tools, and DGX Cloud infrastructure. [17][21]

## Use cases

### Robotics training

Robots need enormous volumes of demonstration data to learn generalizable manipulation and locomotion skills. Human teleoperation is slow and expensive; Cosmos Predict and Transfer together offer a scalable alternative.

NVIDIA's GR00T-Dreams blueprint, built on top of Cosmos Predict, generates synthetic robot trajectories from a single image and a language prompt. In one internal evaluation, the pipeline produced 780,000 synthetic trajectories in 11 hours (equivalent to roughly 6,500 hours of human demonstration data), and combining these with real data improved [Isaac GR00T](/wiki/isaac_gr00t) N1 policy performance by 40%. [11][12] NVIDIA Research later used the same blueprint to generate the synthetic training data for GR00T N1.5 in about 36 hours, a process the company says would have taken nearly three months by manual human data collection. [12]
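
The GR00T-Dreams figures can be unpacked with simple arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
TRAJECTORIES = 780_000
GENERATION_HOURS = 11
EQUIVALENT_HUMAN_HOURS = 6_500

# Synthetic trajectories produced per second of wall-clock time.
trajectories_per_second = TRAJECTORIES / (GENERATION_HOURS * 3600)

# Implied average length of one trajectory in human-demonstration time.
seconds_per_trajectory = EQUIVALENT_HUMAN_HOURS * 3600 / TRAJECTORIES

# Ratio of demonstration time represented to time spent generating.
speedup = EQUIVALENT_HUMAN_HOURS / GENERATION_HOURS

print(f"about {trajectories_per_second:.1f} trajectories per second")
print(f"about {seconds_per_trajectory:.0f} s of demonstration per trajectory")
print(f"about {speedup:.0f}x faster than real-time collection")
```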

The [MimicGen](/wiki/mimicgen) NIM microservice integrates with Cosmos Transfer: developers record a small number of human demonstrations in [NVIDIA Isaac Sim](/wiki/nvidia_omniverse), use MimicGen to generate thousands of synthetic trajectory variants, and then run Cosmos Transfer to make those trajectories photorealistic. [RoboCasa](/wiki/robocasa) provides simulation-ready kitchen environments in OpenUSD format that serve as the starting geometry for this pipeline.

### Autonomous vehicle development

AV developers need rare edge cases such as unusual weather, pedestrian behavior, and sensor failure scenarios that are difficult and dangerous to collect from real driving. Cosmos allows teams to generate photorealistic footage of these scenarios at scale.

The Cosmos Three-Computer solution announced at CES 2025 integrates Cosmos models with NVIDIA Drive Hyperion (in-vehicle sensing), NVIDIA Drive AGX (real-time in-vehicle inference), and NVIDIA DGX (data center training). [1] Cosmos generates synthetic camera, LiDAR, and radar sensor data that feeds the training loop, while Cosmos Reason annotates edge case clips automatically.

Waabi and Uber both cited Cosmos as part of their pipelines for accelerating autonomous driving development. Foretellix uses Cosmos to stress-test its AV simulation scenarios with rare events. [4]

### Video analytics and surveillance

Beyond robotics and autonomous driving, Cosmos models can be used for understanding and searching large video collections. Milestone Systems, a video analytics platform, uses Cosmos to search for specific scenario patterns across large sensor networks. Linker Vision and Nexar apply it to traffic analysis and driver behavior monitoring. [4]

## Guardrails and safety

Cosmos includes a two-stage safety system.

**Pre-guard**: Input text prompts are first screened against a blocklist of prohibited terms, then passed through NVIDIA's Aegis AI Content Safety model, which classifies prompts for harmful content. Prompts that pass both checks are forwarded to the generation model. [3]

**Post-guard**: Generated video frames are evaluated by a content classifier. Faces in output footage are detected using RetinaFace and automatically blurred for privacy. Content classified as harmful is filtered before the video is returned to the caller. [3]
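
The two stages can be sketched as a wrapper around a generation call. Everything below is a hypothetical illustration of the control flow described above: the blocklist entry is a placeholder, and the prompt classifier, frame classifier, and face blurring are passed in as plain callables rather than calling Aegis or RetinaFace.

```python
from typing import Callable, List, Optional

BLOCKLIST = {"forbidden_term"}  # placeholder; the real list is not public here


def pre_guard(prompt: str, prompt_is_harmful: Callable[[str], bool]) -> bool:
    """Return True if the prompt may proceed to generation."""
    if set(prompt.lower().split()) & BLOCKLIST:
        return False                      # stage 1a: blocklist match
    return not prompt_is_harmful(prompt)  # stage 1b: safety classifier


def post_guard(
    frames: List[str],
    frame_is_harmful: Callable[[str], bool],
    blur_faces: Callable[[str], str],
) -> Optional[List[str]]:
    """Drop the video if any frame is harmful; otherwise blur faces."""
    if any(frame_is_harmful(f) for f in frames):
        return None
    return [blur_faces(f) for f in frames]


def guarded_generate(prompt, generate, prompt_is_harmful,
                     frame_is_harmful, blur_faces):
    """Run pre-guard, generation, then post-guard; None means blocked."""
    if not pre_guard(prompt, prompt_is_harmful):
        return None
    return post_guard(generate(prompt), frame_is_harmful, blur_faces)


# Stub components standing in for the real model and classifiers.
def fake_generate(prompt):
    return ["frame0", "frame1"]


def never_harmful(_):
    return False


def fake_blur(frame):
    return frame + ":blurred"


print(guarded_generate("a robot arm stacks boxes", fake_generate,
                       never_harmful, never_harmful, fake_blur))
```

Returning `None` for a blocked request is a design choice of this sketch; a production system would more likely return a structured refusal so callers can tell which stage rejected the request.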

The license terms also require that users not bypass, disable, or reduce the efficacy of any guardrail or safety mechanism. Circumventing these controls terminates the license. [10]

Generated videos carry invisible watermarks. NVIDIA has not published the details of the watermarking scheme, but the stated purpose is to allow identification of synthetically generated footage if it is redistributed.

## Is NVIDIA Cosmos open source?

Cosmos models are distributed under the NVIDIA Open Model License (NOML). The license permits commercial use, modification, and redistribution subject to several conditions: [10]

- Products or services that incorporate Cosmos models must display "Built on NVIDIA Cosmos" in a visible location such as a website, about page, or product documentation.
- Guardrails must not be circumvented; doing so automatically terminates the license.
- If a licensee initiates patent or copyright litigation against any entity, all licenses granted under the agreement terminate on the date the suit is filed.
- The license does not require disclosure of training data or model code, which means it does not meet the Open Source Initiative's definition of open source.

NVIDIA has not disclosed the specific datasets used to train Cosmos, nor has it made the full training pipeline publicly available. Critics have noted that this limits reproducibility and makes it impossible to audit the training data for copyright or consent concerns. NVIDIA refers to the models as "open" based on the availability of weights, rather than on full process transparency. [10][14] The 2026 Cosmos 3 release moves to the OpenMDW license, which packages weights, architecture, documentation, datasets, benchmarks, and code under one set of terms. [20]

## How does NVIDIA Cosmos differ from other world models?

The world model space attracted multiple major players around the same time as the Cosmos announcement. The following table summarizes the main alternatives as of late 2025:

| System | Developer | Release | Access | Parameters | Physical AI focus | License |
|---|---|---|---|---|---|---|
| Cosmos | NVIDIA | Jan 2025 | Public weights | 4B to 14B (1.0); 2B/14B (2.5) | Yes (robotics, AV) | NVIDIA Open Model License |
| [Genie 3](/wiki/genie_3) | Google DeepMind | Aug 2025 | Research preview only | Not disclosed | Partial | Proprietary |
| Marble | World Labs | Nov 2025 | API ($20/month+) | Not disclosed | No (3D environment creation) | Proprietary SaaS |
| Oasis | Decart | Oct 2024 | API | Not disclosed | No (interactive game worlds) | Proprietary |
| Wan | Alibaba | Feb 2025 | Public weights | 1.3B to 14B | Limited | Apache 2.0 |

Cosmos is distinct from this peer group in three ways. First, it is the only platform that explicitly targets physical AI throughout its design: the training data distribution, the structured control inputs in Transfer, and the physics-reasoning capabilities of Reason are all oriented toward robotics and autonomous vehicles rather than general creative video generation. Second, it ships with an integrated toolchain (Tokenizer, NeMo Curator, Isaac Sim integration) that the others lack. Third, it released model weights publicly at launch under a license that permits commercial use; within this group only Alibaba's Wan, a general-purpose video model, offers comparably open weights.

[Genie 3](/wiki/genie_3) from Google DeepMind pursues real-time interactive generation of navigable 3D worlds at 24 FPS, a different objective from Cosmos's batch synthetic data generation. Genie 3 remained in limited research preview through mid-2026 and has not been released under a public license.

World Labs' Marble focuses on generating persistent, downloadable 3D environments from diverse inputs including text, photos, and panoramic images. It is a commercial product with API pricing rather than an open platform.

Decart's Oasis was originally demonstrated as a playable Minecraft-style world generated in real time. Decart has explored porting Oasis to custom inference hardware to reduce latency but has not positioned the system for physical AI training.

## Early adopters and partners

NVIDIA announced a broad set of industry partners at CES 2025 and at the March 2025 major release. [1][4]

| Partner | Domain | Use of Cosmos |
|---|---|---|
| [1X](/wiki/1x_technologies) | Humanoid robotics | Cosmos Predict and Transfer for training NEO Gamma humanoid |
| Agility Robotics | Humanoid robotics | Scaling photorealistic training data beyond real-world collection |
| Figure AI | Humanoid robotics | Synthetic training data generation |
| Skild AI | Robot brain models | Cosmos Transfer to augment synthetic datasets |
| Uber | Autonomous vehicles | Accelerating autonomous driving model development |
| Waabi | Autonomous vehicles | Synthetic data for long-haul AV |
| Foretellix | AV simulation | Stress-testing with rare scenarios |
| Parallel Domain | Synthetic data | Photorealistic AV data generation |
| Nexar | Traffic AI | Driver and traffic pattern analysis |
| Virtual Incision | Surgical robotics | Surgical simulation data |
| XPENG | EVs and robots | AV and humanoid training data |
| Agile Robots | Industrial robotics | Manipulation training data |
| Fourier | Humanoid robotics | General training data generation |
| Neura Robotics | Humanoid robotics | Synthetic scenario generation |
| Oxa | Autonomous vehicles | Unstructured environment simulation |
| Wayve | Embodied driving AI | Synthetic data augmentation |

Foxconn announced that it is using the NVIDIA Omniverse blueprint (which integrates Cosmos Transfer) to simulate industrial manipulators, humanoids, and mobile robots in its manufacturing facilities, though the company's direct use of Cosmos model weights has not been separately confirmed.

## Integration with NVIDIA ecosystem

Cosmos is designed to slot into NVIDIA's broader physical AI stack:

- **NVIDIA Omniverse**: Provides the 3D simulation environment and USD asset pipeline that generates the structured inputs Cosmos Transfer consumes. Photorealistic output from Cosmos Transfer feeds back into Omniverse for rendering or further simulation.
- **Isaac GR00T**: The humanoid robot foundation model uses Cosmos Transfer to convert Isaac Sim trajectories into photorealistic training footage, and Cosmos Reason for data annotation. The GR00T-Dreams blueprint is built directly on Cosmos Predict.
- **[MimicGen](/wiki/mimicgen)**: Generates synthetic robot trajectories that are then passed through Cosmos Transfer to add visual realism.
- **[RoboCasa](/wiki/robocasa)**: Provides OpenUSD kitchen environments used as starting geometry for the Cosmos Transfer pipeline.
- **NeMo Framework**: Handles distributed fine-tuning of Cosmos models on proprietary datasets with dataset sharding, deterministic data loading, and bandwidth optimization across GPU clusters.
- **NeMo Curator**: Curates, clips, and labels the large video datasets used for pre-training or domain-specific post-training.
- **DGX Cloud**: Provides cloud computing infrastructure for running Cosmos workloads without on-premise hardware.
- **NVIDIA AI Enterprise**: Offers enterprise support and compliance tooling for production deployments of Cosmos models.
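
The NeMo Framework entry above mentions dataset sharding and deterministic data loading. As a rough illustration of that general idea, and not of NeMo's actual API, the sketch below splits a list of video clips across workers so that every run with the same seed and epoch produces the same assignment. The function name `shard_for_rank`, its parameters, and the clip file names are all hypothetical.

```python
import random


def shard_for_rank(clips, rank, world_size, seed=0, epoch=0):
    """Return the clips assigned to worker `rank` for one epoch.

    Hypothetical helper for illustration only; not part of NeMo or Cosmos.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # Sort first so the result does not depend on the order of the input list.
    order = sorted(clips)
    # A generator seeded from (seed, epoch) yields the same permutation on
    # every worker and on every rerun, which makes the loading reproducible.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    # Strided slicing gives disjoint shards that together cover every clip.
    return order[rank::world_size]


if __name__ == "__main__":
    clips = [f"clip_{i:03d}.mp4" for i in range(10)]
    for rank in range(4):
        print(rank, shard_for_rank(clips, rank, world_size=4, seed=7, epoch=1))
```

Because each worker derives its shard from the same seeded permutation, no coordination between workers is needed to keep the shards disjoint.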

Cosmos models are also available in the Vertex AI Model Garden on Google Cloud and through the NVIDIA API catalog as NIM microservices, allowing deployment without managing GPU infrastructure directly. [4]

## Limitations

Several limitations of Cosmos have been noted:

**Training data opacity**: NVIDIA has not disclosed the specific sources used to build the 20-million-hour training dataset. This limits independent audits for copyright issues, privacy violations, or demographic bias in the training distribution. [14]

**License restrictions**: Although NVIDIA markets Cosmos as "open," the NOML includes the guardrail circumvention clause, which has no counterpart in standard open-source licenses such as Apache 2.0, alongside the attribution display requirement and the patent litigation termination clause. Under the guardrail clause, bypassing safety guardrails automatically terminates all rights under the license. [10]

**Domain gap not fully eliminated**: Cosmos Transfer reduces but does not eliminate the sim-to-real gap. The quality of photorealization depends on the quality of the underlying simulation geometry and the fidelity of the control signals. Poorly specified segmentation maps or inaccurate depth information produce unrealistic output.

**Long video coherence**: Like most video generation models, Cosmos Predict models lose temporal coherence over very long sequences. The 1.0 autoregressive models can attend to at most 34 frames (about 3.4 seconds at 10 FPS), which limits their usefulness for tasks that require understanding long episodes.
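
To make the frame limit concrete, the following sketch converts a frame budget into seconds and counts how many fixed-size windows would be needed to cover a longer episode. The sliding-window scheme, the function names, and the `overlap_frames` parameter are assumptions made for illustration; they do not describe a documented Cosmos procedure.

```python
import math


def context_seconds(max_frames, fps):
    """Duration in seconds covered by one context window."""
    return max_frames / fps


def windows_needed(episode_seconds, max_frames, fps, overlap_frames=0):
    """Count the windows of `max_frames` frames needed to cover an episode.

    Consecutive windows share `overlap_frames` frames (hypothetical scheme).
    """
    total_frames = math.ceil(episode_seconds * fps)
    if total_frames <= max_frames:
        return 1
    stride = max_frames - overlap_frames
    if stride <= 0:
        raise ValueError("overlap_frames must be smaller than max_frames")
    # The first window covers max_frames; each later window adds `stride` frames.
    return 1 + math.ceil((total_frames - max_frames) / stride)


if __name__ == "__main__":
    print(context_seconds(34, 10))      # 3.4
    print(windows_needed(60, 34, 10))   # 18
```

Under these assumptions, a one-minute episode at 10 FPS spans 600 frames and needs 18 non-overlapping windows, so any dependency longer than a single window has to be carried across window boundaries by other means.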

**Hardware requirements**: Running the 14B diffusion models at full resolution requires multiple high-end GPUs. The 4B autoregressive base model requires at least eight H100 GPUs for reasonable throughput. This limits accessibility for smaller research groups and startups without access to high-end GPU clusters.

**Physical accuracy**: Cosmos models learn statistical regularities in video rather than explicit physics. The generated footage looks physically plausible in most cases, but the models can produce physically incorrect events in unusual scenarios. Cosmos Reason's physics-filtering capability partially mitigates this, but does not guarantee physical correctness of generated data.
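
The filtering step mentioned above can be pictured as a simple threshold on a plausibility score. The sketch below is a generic illustration: `score_fn` is a hypothetical stand-in for whatever critic produces the score (it is not a Cosmos Reason API), and the clip names and scores are made-up example values.

```python
def filter_clips(clips, score_fn, threshold=0.5):
    """Split clips into kept and rejected lists by plausibility score.

    `score_fn` is a hypothetical callable returning a value in [0, 1];
    higher means the clip looks more physically plausible to the critic.
    """
    kept, rejected = [], []
    for clip in clips:
        if score_fn(clip) >= threshold:
            kept.append(clip)
        else:
            rejected.append(clip)
    return kept, rejected


if __name__ == "__main__":
    # Made-up scores standing in for a learned critic's output.
    example_scores = {"pour_water.mp4": 0.92, "floating_cup.mp4": 0.18}
    kept, rejected = filter_clips(example_scores, example_scores.get, threshold=0.5)
    print("kept:", kept)
    print("rejected:", rejected)
```

A filter of this shape only removes clips that the critic scores below the threshold, so an implausible clip that the critic rates highly still reaches the training set. This is why filtering reduces, but cannot rule out, physically incorrect data.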

## See also

- [World model](/wiki/world_model)
- [Embodied AI](/wiki/embodied_ai)
- [Physical AI](/wiki/physical_ai)
- [Isaac GR00T](/wiki/isaac_gr00t)
- [NVIDIA Omniverse](/wiki/nvidia_omniverse)
- [MimicGen](/wiki/mimicgen)
- [RoboCasa](/wiki/robocasa)
- [Genie 3](/wiki/genie_3)
- [Diffusion model](/wiki/diffusion_model)
- [Transformer](/wiki/transformer)
- [Synthetic data](/wiki/synthetic_data)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [Generative AI](/wiki/generative_ai)

## References

1. NVIDIA Newsroom. "NVIDIA Launches Cosmos World Foundation Model Platform to Accelerate Physical AI Development." January 6, 2025. https://nvidianews.nvidia.com/news/nvidia-launches-cosmos-world-foundation-model-platform-to-accelerate-physical-ai-development
2. NVIDIA Blog. "NVIDIA Makes Cosmos World Foundation Models Openly Available to Physical AI Developer Community." https://blogs.nvidia.com/blog/cosmos-world-foundation-models/
3. Alhaija, Hassan Abu, et al. "Cosmos World Foundation Model Platform for Physical AI." arXiv:2501.03575, January 2025. https://arxiv.org/abs/2501.03575
4. NVIDIA Newsroom. "NVIDIA Announces Major Release of Cosmos World Foundation Models and Physical AI Data Tools." March 18, 2025. https://nvidianews.nvidia.com/news/nvidia-announces-major-release-of-cosmos-world-foundation-models-and-physical-ai-data-tools
5. NVIDIA Technical Blog. "Advancing Physical AI with NVIDIA Cosmos World Foundation Model Platform." https://developer.nvidia.com/blog/advancing-physical-ai-with-nvidia-cosmos-world-foundation-model-platform/
6. NVIDIA Technical Blog. "Scale Synthetic Data and Physical AI Reasoning with NVIDIA Cosmos World Foundation Models." https://developer.nvidia.com/blog/scale-synthetic-data-and-physical-ai-reasoning-with-nvidia-cosmos-world-foundation-models/
7. NVIDIA Technical Blog. "Curating Synthetic Datasets to Train Physical AI Models with NVIDIA Cosmos Reason." https://developer.nvidia.com/blog/curating-synthetic-datasets-to-train-physical-ai-models-with-nvidia-cosmos-reason/
8. Hugging Face Blog. "Announcing NVIDIA Cosmos World Foundation Models." https://huggingface.co/blog/mingyuliutw/nvidia-cosmos
9. NVIDIA Cosmos GitHub Repository. https://github.com/nvidia-cosmos
10. NVIDIA Open Model License Agreement. https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
11. NVIDIA Technical Blog. "Building a Synthetic Motion Generation Pipeline for Humanoid Robot Learning." https://developer.nvidia.com/blog/building-a-synthetic-motion-generation-pipeline-for-humanoid-robot-learning/
12. NVIDIA Technical Blog. "Enhance Robot Learning with Synthetic Trajectory Data Generated by World Foundation Models." https://developer.nvidia.com/blog/enhance-robot-learning-with-synthetic-trajectory-data-generated-by-world-foundation-models/
13. VentureBeat. "Nvidia's Cosmos-Transfer1 makes robot training freakishly realistic." https://venturebeat.com/ai/nvidias-cosmos-transfer1-makes-robot-training-freakishly-realistic-and-that-changes-everything
14. TechCrunch. "Nvidia releases its own brand of world models." January 6, 2025. https://techcrunch.com/2025/01/06/nvidia-releases-its-own-brand-of-world-models/
15. Robotics 24/7. "CES 2025: NVIDIA launches Cosmos world foundation model, expands Omniverse." https://www.robotics247.com/article/ces_2025_nvidia_launches_cosmos_world_foundation_model_expands_omniverse
16. NVIDIA Research. "Cosmos-Predict2.5: Improved World Simulation with Video Foundation Models for Physical AI." https://research.nvidia.com/labs/cosmos-lab/cosmos-predict2.5/
17. NVIDIA Newsroom. "NVIDIA Launches Cosmos 3, the Open Frontier Foundation Model for Physical AI." June 1, 2026. https://nvidianews.nvidia.com/news/nvidia-launches-cosmos-3-the-open-frontier-foundation-model-for-physical-ai
18. NVIDIA Blog. "How Cosmos 3 Helps Physical AI Think Before It Acts." June 1, 2026. https://blogs.nvidia.com/blog/cosmos-3-physical-ai-open-world-foundation-model/
19. NVIDIA Technical Blog. "Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3." June 1, 2026. https://developer.nvidia.com/blog/develop-physical-ai-reasoning-world-and-action-models-with-nvidia-cosmos-3/
20. WinBuzzer. "NVIDIA Launches Cosmos 3 With OpenMDW for Physical AI." June 1, 2026. https://winbuzzer.com/2026/06/01/nvidia-launches-cosmos-3-with-openmdw-for-physical-ai-xcxwbn/
21. NVIDIA Blog. "NVIDIA GTC Taipei at COMPUTEX: Live Updates on What's Next in AI." May 31, 2026. https://blogs.nvidia.com/blog/nvidia-gtc-taipei-computex-2026-news/