NVIDIA Cosmos

NVIDIA Cosmos is a world foundation model platform developed by NVIDIA for physical AI applications, including autonomous vehicles and robotics. Announced by CEO Jensen Huang at CES on January 6, 2025, Cosmos provides a suite of pre-trained generative models, video tokenizers, safety guardrails, and a data processing pipeline that together allow developers to generate physics-grounded synthetic training data at scale. Models are released under the NVIDIA Open Model License and are publicly accessible on Hugging Face and the NVIDIA NGC catalog.

The platform centers on three families of fine-tunable models: Cosmos Predict for simulating future world states as video, Cosmos Transfer for converting structured simulation data into photorealistic footage, and Cosmos Reason for chain-of-thought physical reasoning from video. Together these address the long-standing shortage of diverse, physics-accurate training data that has slowed progress in embodied AI.

Background

The data bottleneck in physical AI

Training capable embodied AI systems requires video that is dense, diverse, and physically consistent. Real-world collection is slow, expensive, and difficult to scale to the range of edge cases a robot or autonomous vehicle will encounter. Simulation environments such as NVIDIA Omniverse can generate controlled scenarios, but historically the resulting footage looked synthetic enough to cause a domain gap when policies trained on it were deployed on real hardware. This sim-to-real gap remains one of the central challenges in robotics.

World foundation models offer a potential solution: train a large generative model on massive quantities of real video so that it learns the visual statistics and physical dynamics of the actual world, then use that model to produce synthetic footage that is both controllable and photoreal. The generated data can augment or replace costly real-world collection, letting developers cover rare scenarios, vary lighting and weather conditions, and label trajectories automatically.

Prior work

World models as a research concept go back to David Ha and Jürgen Schmidhuber's 2018 paper introducing models that learn compressed representations of environment dynamics. In the following years, reinforcement learning researchers used smaller learned world models for planning, but these operated in low-dimensional or game-like settings rather than on high-resolution video of the physical world.

Between 2023 and 2025 a new generation of large-scale video generation models emerged, and several groups began framing them explicitly as world models. Google DeepMind's Genie 3, World Labs' Marble, and Decart's Oasis were among the efforts from the same period as NVIDIA Cosmos. NVIDIA's approach differed in its explicit physical-AI focus, its open release of model weights, and its integration with NVIDIA's broader hardware and software stack.

Announcement and release history

NVIDIA announced Cosmos at its CES 2025 keynote on January 6, 2025. Jensen Huang demonstrated three use cases: video search and scenario identification, physics-based synthetic data generation from 3D scenes built in NVIDIA Omniverse, and a "multiverse" simulation mode in which the model generates multiple plausible continuations of a scenario to help robots or vehicles plan under uncertainty.

On January 7, 2025, NVIDIA released the initial Cosmos 1.0 model weights on Hugging Face and the NGC catalog under the NVIDIA Open Model License. This initial release included four autoregressive models (4B, 5B, 12B, 13B parameters) and four diffusion models (7B and 14B parameters, in both Text2World and Video2World configurations), along with the Cosmos Tokenizer.

On March 18, 2025, NVIDIA announced a major release that added the Cosmos Predict and Cosmos Transfer model families, introduced Cosmos Reason in early access, and expanded the platform's integration with Google Cloud Vertex AI and additional robotics partner toolchains.

Subsequent releases in late 2025 brought Cosmos-Predict2 and Cosmos-Predict2.5 (a 2B/14B flow-based model that unifies text-to-world and video-to-world generation), as well as Cosmos-Transfer2.5 (a multi-controlnet variant accepting simultaneous RGB, depth, and segmentation inputs).

Platform architecture

Cosmos is a platform rather than a single model. Its components work together in a pipeline:

Raw video is ingested and tokenized by the Cosmos Tokenizer.
Pre-trained world foundation models (autoregressive or diffusion) generate new video conditioned on text, images, video, or structured control signals.
Cosmos Guardrails filter inputs and outputs for safety.
NVIDIA NeMo Curator handles large-scale data curation and labeling.
Developers use NeMo Framework to fine-tune models on proprietary or domain-specific data.

The models are designed to be post-trained. NVIDIA reports that domain-specific post-training can achieve up to 10x higher accuracy on downstream tasks compared to using the base models directly.

Cosmos Tokenizer

The Cosmos Tokenizer converts raw images and video into compact token representations that the world foundation models consume. It supports both continuous tokens (for diffusion-based models) and discrete tokens (for autoregressive models), and its causal design means it can process streaming video without needing the entire sequence in advance.
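The practical meaning of a causal design can be illustrated with a generic causal temporal convolution. The sketch below is not the Cosmos Tokenizer's actual layer; the function name, kernel, and shapes are illustrative assumptions. It only shows the property that matters for streaming: the output for frame t is computed from frames up to t, so later frames never change earlier outputs.

```python
import numpy as np

def causal_temporal_conv(frames, kernel):
    """1D convolution over time that reads only the current and past frames.

    frames: array of shape (T, C), one feature vector per frame
    kernel: array of shape (K,), weights ordered from oldest to newest frame
    """
    k = len(kernel)
    # Left-pad with zeros so output[t] never reads frames after t.
    padded = np.concatenate([np.zeros((k - 1, frames.shape[1])), frames], axis=0)
    return np.stack([kernel @ padded[t:t + k] for t in range(frames.shape[0])])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.normal(size=(6, 4))
    kernel = np.array([0.2, 0.3, 0.5])
    full = causal_temporal_conv(clip, kernel)
    prefix = causal_temporal_conv(clip[:4], kernel)
    # Outputs for the first four frames are identical with or without the later frames.
    print(np.allclose(full[:4], prefix))
```

Because the prefix outputs are unaffected by later frames, a stream can be processed chunk by chunk without recomputing earlier results.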

The tokenizer achieves spatial compression ratios of 8x8 or 16x16 for images, and spatio-temporal compression ratios of 4x8x8, 8x8x8, or 8x16x16 for video. The most aggressive compression (8x temporal combined with 16x16 spatial) results in a total compression factor of up to 2048x. NVIDIA benchmarks show the tokenizer delivers +4 dB PSNR improvement on the DAVIS video dataset compared to prior methods, runs 12x faster, and uses fewer parameters than competing approaches.
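The total compression factors follow directly from multiplying the temporal and spatial ratios. A minimal sketch of that arithmetic (the function and dictionary names are illustrative, not part of any Cosmos API):

```python
def compression_factor(temporal, height, width):
    """Number of input pixels (per channel) represented by one latent position."""
    return temporal * height * width

# Spatio-temporal configurations listed for video, as (temporal, height, width).
VIDEO_CONFIGS = {
    "4x8x8": (4, 8, 8),
    "8x8x8": (8, 8, 8),
    "8x16x16": (8, 16, 16),
}

factors = {name: compression_factor(*cfg) for name, cfg in VIDEO_CONFIGS.items()}

if __name__ == "__main__":
    for name, factor in factors.items():
        print(f"{name}: {factor}x")
```

The 8x16x16 configuration gives 8 × 16 × 16 = 2048, which is the maximum figure quoted above.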

On MS-COCO and ImageNet-1K image benchmarks, Cosmos image tokenizers outperform FLUX and LlamaGen baselines. On video, continuous tokenizers outperform CogVideoX and Omni-tokenizer across PSNR, SSIM, and rFVD metrics. Discrete tokenizers show better compression-quality tradeoffs than alternatives at high compression rates.

The tokenizer is released separately on Hugging Face under the NVIDIA Cosmos Tokenizer collection, allowing developers to use it independently of the full world foundation models.

Training data

The Cosmos 1.0 models were pre-trained on approximately 20 million hours of raw video representing roughly 9,000 trillion tokens. NVIDIA does not fully disclose the specific sources of this data. The paper accompanying the models describes the curated training set as approximately 100 million clips drawn from the following broad categories:

Category	Share of training clips
Nature dynamics	20%
Hand and object manipulation	16%
Spatial awareness	16%
Driving	11%
Human motion	10%
First-person point of view	8%
Dynamic camera	8%
Synthetic rendering	4%
Other	7%

This distribution reflects the physical-AI focus: driving, manipulation, and spatial awareness together account for 43% of training clips. The strong representation of manipulation and first-person footage is intended to make the models useful for robotic arm control and humanoid motion tasks.
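The shares in the table can be checked with a few lines of arithmetic. The sketch below simply transcribes the table; the grouping of three categories as "physical-AI focused" mirrors the sentence above and is not an official NVIDIA classification.

```python
# Share of training clips per category, in percent, transcribed from the table.
TRAINING_MIX = {
    "Nature dynamics": 20,
    "Hand and object manipulation": 16,
    "Spatial awareness": 16,
    "Driving": 11,
    "Human motion": 10,
    "First-person point of view": 8,
    "Dynamic camera": 8,
    "Synthetic rendering": 4,
    "Other": 7,
}

# Categories singled out in the text as reflecting the physical-AI focus.
PHYSICAL_AI = ("Driving", "Hand and object manipulation", "Spatial awareness")

total_share = sum(TRAINING_MIX.values())
physical_ai_share = sum(TRAINING_MIX[name] for name in PHYSICAL_AI)

if __name__ == "__main__":
    print(f"total: {total_share}%")
    print(f"driving + manipulation + spatial awareness: {physical_ai_share}%")
```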

Data curation used NVIDIA NeMo Curator, a CUDA-accelerated pipeline that filters, clips, and labels raw footage. NVIDIA reports that processing the full 20 million hours takes about 14 days on NVIDIA Blackwell GPUs and approximately 40 days on Hopper GPUs, compared to more than three years on a CPU-only pipeline of equivalent power consumption.

The Cosmos-Predict2.5 generation (released October 2025) expanded the curated training set to 200 million high-quality clips.

Cosmos Predict

Cosmos Predict is the world simulation component of the platform. It models future world states as video from multimodal inputs: text descriptions, single images, video sequences, or start-and-end frame pairs. The model can predict what happens next in a scene, interpolate between two keyframes, or generate video continuations from a text prompt alone.

Autoregressive variants

Autoregressive Cosmos Predict models use a GPT-style decoder architecture trained to predict the next discrete video token given the preceding sequence. The architecture is built on Llama3-style transformer blocks with:

Absolute positional embeddings combined with 3D Rotary Position Embeddings (RoPE) that encode spatial and temporal dimensions separately.
Self-attention layers over the video token sequence.
Cross-attention layers that inject T5-XXL text embeddings, allowing text conditioning without requiring the text tokens to be part of the main autoregressive sequence.
QK-normalization using RMSNorm for training stability.
Progressive context extension from 17 frames (Stage 1) to 34 frames (Stage 1.1) via YaRN.
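The idea behind 3D RoPE in the list above can be sketched in NumPy. This is a generic factorized rotary embedding, not NVIDIA's implementation: the way the head dimension is split across the time, height, and width axes, the frequency base, and all function names are assumptions made for illustration. Absolute embeddings, YaRN scaling, and attention itself are omitted.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles for one axis; result has shape (len(positions), dim // 2)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)

def rope_3d(x, grid, dims):
    """Rotate token features by their (t, h, w) position.

    x:    (T*H*W, D) features, tokens ordered with t slowest and w fastest
    grid: (T, H, W) size of the token volume
    dims: (dt, dh, dw) even channel counts per axis, summing to D (assumed split)
    """
    T, H, W = grid
    dt, dh, dw = dims
    assert x.shape == (T * H * W, dt + dh + dw)
    t, h, w = np.meshgrid(np.arange(T), np.arange(H), np.arange(W), indexing="ij")
    # Each axis gets its own block of channel pairs, so time and space are encoded separately.
    angles = np.concatenate(
        [rope_angles(t.ravel(), dt), rope_angles(h.ravel(), dh), rope_angles(w.ravel(), dw)],
        axis=1,
    )
    cos, sin = np.cos(angles), np.sin(angles)
    even, odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = even * cos - odd * sin
    out[:, 1::2] = even * sin + odd * cos
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(2 * 2 * 3, 12))
    rotated = rope_3d(tokens, grid=(2, 2, 3), dims=(4, 4, 4))
    print(rotated.shape)
```

Because each channel pair is only rotated, vector norms are preserved, and the dot product between a rotated query and a rotated key depends on the relative offset between their positions rather than on the absolute positions.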

The autoregressive models use the discrete variant of the Cosmos Tokenizer (DV8x16x16), which maps video to integer tokens. Because pure discrete decoding limits visual quality, NVIDIA trains a diffusion decoder that maps discrete DV8x16x16 tokens back to the higher-fidelity continuous CV8x8x8 token space before final pixel rendering.

The four autoregressive model sizes released in Cosmos 1.0 are:

Model	Parameters	Type	Conditioning
Cosmos-1.0-Autoregressive-4B	4B	Base (video-only)	Video in, video out
Cosmos-1.0-Autoregressive-5B-Video2World	5B	Text + Video	Text + video in, video out
Cosmos-1.0-Autoregressive-12B	12B	Base (video-only)	Video in, video out
Cosmos-1.0-Autoregressive-13B-Video2World	13B	Text + Video	Text + video in, video out

The 5B and 13B Video2World variants are derived from the 4B and 12B base models by adding cross-attention layers and performing additional Stage 2 training on text-video pairs. The base models acquire no language understanding during pre-training; all textual information enters through the T5 embeddings fed to the cross-attention layers.

Generation throughput for the 4B model on eight H100 GPUs at 320x512 resolution (10 FPS) is approximately 806 tokens per second, producing a 24-frame (2.4-second) clip from a 9-frame context in about 2.38 seconds.
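The quoted latency is consistent with the token rate under one simple assumption: with the DV8x16x16 tokenizer, every 8 new frames map to one latent frame of (height / 16) × (width / 16) tokens. The sketch below works through that arithmetic; the function name is illustrative, and the mapping ignores any special handling of the first frame or the context frames.

```python
def tokens_for_new_frames(height, width, new_frames, spatial=16, temporal=8):
    """Tokens to generate for `new_frames` frames under 8x16x16 compression (assumed mapping)."""
    tokens_per_latent_frame = (height // spatial) * (width // spatial)
    latent_frames = new_frames // temporal
    return tokens_per_latent_frame * latent_frames

TOKENS_PER_SECOND = 806  # throughput quoted above for the 4B model on eight H100 GPUs

tokens = tokens_for_new_frames(height=320, width=512, new_frames=24)
generation_seconds = tokens / TOKENS_PER_SECOND

if __name__ == "__main__":
    print(f"{tokens} tokens, about {generation_seconds:.2f} s")
```

Under this assumption the clip needs 20 × 32 × 3 = 1920 tokens, and 1920 / 806 gives roughly 2.38 seconds, matching the reported figure.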

Diffusion variants

Cosmos Predict diffusion models use a latent diffusion architecture derived from the DiT (Diffusion Transformer) design. The forward process progressively adds noise to latent video tokens, and the reverse process denoises using a transformer that is conditioned on text.

Key architectural choices include:

3D patchification of the latent token volume, which preserves spatial and temporal structure throughout the transformer stack.
FPS-aware 3D RoPE that handles variable resolutions, aspect ratios, and frame rates within a single model.
T5-XXL text encoding with embeddings zero-padded to a fixed length of 512 tokens.
AdaLN-LoRA (adaptive layer normalization combined with low-rank adaptation): this replaces full adaptive layer normalization and achieves a 36% reduction in parameter count (from a naive 11B to the released 7B) while maintaining generation quality.
Query-key RMSNorm for attention stability during training.
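To see why a low-rank factorization of the adaptive layer normalization projection saves parameters, the sketch below compares a dense conditioning projection with a rank-limited one and recomputes the overall reduction quoted in the list above. The hidden size, rank, and number of modulation vectors per block are illustrative assumptions, not the released models' configuration, and biases are ignored.

```python
def adaln_dense_params(d_model, n_mod=6):
    """Dense projection from a conditioning vector to n_mod modulation vectors."""
    return d_model * (n_mod * d_model)

def adaln_lora_params(d_model, rank, n_mod=6):
    """Same mapping factorized through a rank-`rank` bottleneck (down then up)."""
    return d_model * rank + rank * (n_mod * d_model)

def relative_reduction(before, after):
    return (before - after) / before

# Whole-model figures quoted in the text: a naive 11B design versus the released 7B.
model_reduction = relative_reduction(11e9, 7e9)

if __name__ == "__main__":
    d_model, rank = 4096, 256  # hypothetical sizes for illustration only
    dense = adaln_dense_params(d_model)
    lora = adaln_lora_params(d_model, rank)
    print(f"per-block modulation parameters: dense={dense:,} low-rank={lora:,}")
    print(f"whole-model reduction: {model_reduction:.0%}")
```

The per-block saving grows as the rank shrinks relative to the hidden size; the 36% figure is the whole-model reduction from 11B to 7B.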

Joint image-video training proceeds with domain normalization, progressing from 512p to 720p resolution using multi-aspect training buckets (1:1 and 16:9 ratios). Training uses BF16/FP32 mixed precision.

The four diffusion model sizes released in Cosmos 1.0 are:

Model	Parameters	Output	Frames	Resolution
Cosmos-1.0-Diffusion-7B-Text2World	7B	Video from text	121 frames	1280x704 @ 24 FPS
Cosmos-1.0-Diffusion-14B-Text2World	14B	Video from text	121 frames	1280x704 @ 24 FPS
Cosmos-1.0-Diffusion-7B-Video2World	7B	Video continuation	120 frames	1280x704 @ 24 FPS
Cosmos-1.0-Diffusion-14B-Video2World	14B	Video continuation	120 frames	1280x704 @ 24 FPS

Text2World models generate a full 121-frame (~5 second) clip from a text description alone. Video2World models take an initial image frame plus a text description and predict the subsequent 120 frames, which is well-suited to simulation use cases where a starting state is known.

On 3D consistency benchmarks, Cosmos Diffusion Text2World 7B achieves a Sampson error of 0.355 and a pose estimation success rate of 62.60%, compared to VideoLDM's 0.841 and 4.40%. On physics alignment metrics, the Video2World 7B model with 9-frame conditioning achieves a PSNR of 21.06, SSIM of 0.69, and IoU of 0.592.

Cosmos Transfer

Cosmos Transfer addresses the sim-to-real domain gap. Simulation engines like NVIDIA Omniverse can generate precise, labeled 3D scenes quickly, but the rendered footage looks visibly synthetic. Policies trained on purely synthetic data often fail when deployed on real hardware because the visual distribution shifts.

Cosmos Transfer takes structured inputs such as segmentation maps, depth maps, edge maps, LiDAR scans, pose estimation data, trajectory maps, and HD maps, and generates photorealistic video that matches the structure and physics of the simulation while looking like real-world footage.

The architecture uses a ControlNet approach: control signals are processed by an encoder that injects them into the main diffusion backbone without overwriting its pre-trained visual knowledge. This preserves the realism the backbone learned from 20 million hours of real video while forcing the output to conform to the structural constraints of the simulation.
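One common way to inject control signals without disturbing a pre-trained backbone is the zero-initialized projection used in the original ControlNet design. The NumPy sketch below illustrates that general idea only; the class, the toy backbone block, and all shapes are hypothetical, and the source does not confirm that Cosmos Transfer uses exactly this mechanism.

```python
import numpy as np

class ControlBranch:
    """Toy control encoder followed by a zero-initialized output projection."""

    def __init__(self, control_dim, hidden_dim, rng):
        self.encoder = rng.normal(size=(control_dim, hidden_dim)) * 0.1
        # Zero init: before any training, the branch contributes nothing.
        self.zero_proj = np.zeros((hidden_dim, hidden_dim))

    def __call__(self, control):
        return np.tanh(control @ self.encoder) @ self.zero_proj

def backbone_block(x, weight):
    """Stand-in for one frozen, pre-trained transformer block."""
    return x + np.tanh(x @ weight)

def controlled_block(x, control, weight, branch):
    """Backbone output plus the residual produced by the control branch."""
    return backbone_block(x, weight) + branch(control)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 8))          # 5 latent tokens, hidden size 8
    depth_map = rng.normal(size=(5, 3))  # hypothetical per-token control features
    weight = rng.normal(size=(8, 8)) * 0.1
    branch = ControlBranch(control_dim=3, hidden_dim=8, rng=rng)
    same = np.allclose(controlled_block(x, depth_map, weight, branch), backbone_block(x, weight))
    print("matches frozen backbone at initialization:", same)
```

At initialization the controlled block reproduces the frozen backbone exactly, so training the branch can only add structure on top of what the backbone already knows.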

For robotics, Cosmos Transfer is integrated into the Isaac GR00T Blueprint for synthetic manipulation motion generation, where it converts simulator-rendered arm trajectories into photorealistic training footage. For autonomous vehicle development, it plugs into the Omniverse Blueprint for AV Simulation, transforming geometric driving scenarios into realistic urban environments with varied lighting, weather, and surface textures.

Cosmos-Transfer2.5 (released October 2025) extended the design to a multi-controlnet that accepts simultaneous inputs of RGB, depth, segmentation, and other modalities configured via JSON-based controlnet_specs, enabling more fine-grained control over the output.

Cosmos Reason

Cosmos Reason is a vision-language model that applies chain-of-thought reasoning to physical scenarios. Where Cosmos Predict generates video and Cosmos Transfer converts simulation to reality, Cosmos Reason understands what is happening in video and predicts whether actions or events are physically plausible.

The model processes video at 604x480 resolution and generates step-by-step textual reasoning before producing a final decision or annotation. It understands object motion, affordances, spatial constraints, and multi-step interactions across humans, objects, and environments.

Training proceeds in three stages:

Pre-training: A Vision Transformer processes video frames into embeddings aligned with text.
Supervised fine-tuning (SFT): The model is specialized on curated datasets covering object affordances, action sequences, and spatial reasoning. SFT boosts base benchmark performance by approximately 10%.
Reinforcement learning: The model is trained with verifiable physical rewards such as "arrow-of-time" dynamics (detecting whether a video is physically plausible or time-reversed). RL adds approximately 5% on top of the SFT baseline.
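An "arrow-of-time" reward is verifiable because the label can be produced automatically: a clip is either left as-is or reversed, and the model is rewarded when it identifies which. The sketch below shows one way such examples and rewards could be constructed; the function names, label strings, and 50/50 sampling are assumptions for illustration, not NVIDIA's training code.

```python
import random

def make_arrow_of_time_example(frames, rng):
    """Return (clip, label), where the clip is reversed with probability 0.5."""
    is_reversed = rng.random() < 0.5
    clip = list(reversed(frames)) if is_reversed else list(frames)
    label = "reversed" if is_reversed else "forward"
    return clip, label

def arrow_of_time_reward(model_answer, label):
    """Binary reward: 1.0 when the model's final answer matches the known label."""
    return 1.0 if model_answer.strip().lower() == label else 0.0

if __name__ == "__main__":
    rng = random.Random(0)
    frames = ["frame_0", "frame_1", "frame_2", "frame_3"]
    clip, label = make_arrow_of_time_example(frames, rng)
    print(clip, label, arrow_of_time_reward("forward", label))
```

No human annotation is needed, since the label comes from the data transformation itself.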

Cosmos Reason 2 (released mid-2025) added extended context support up to 256K input tokens and introduced 2D/3D point localization with bounding box coordinates. It achieves an average score of 65.7 across robotics video question answering benchmarks including BridgeData V2, RoboVQA, and Agibot.

Primary uses of Cosmos Reason within the platform include:

Critiquing the quality of synthetically generated clips before they enter a training set.
Filtering and curating large video datasets by text-based queries.
Generating natural-language annotations for robot demonstration data.
Serving as a reasoning backbone in vision-language-action (VLA) models.

The 256K-token context window lets the model process long video sequences or reason over an entire episode of robot behavior at once.

Use cases

Robotics training

Robots need enormous volumes of demonstration data to learn generalizable manipulation and locomotion skills. Human teleoperation is slow and expensive; Cosmos Predict and Transfer together offer a scalable alternative.

NVIDIA's GR00T-Dreams blueprint, built on top of Cosmos Predict, generates synthetic robot trajectories from a single image and a language prompt. In one internal evaluation, the pipeline produced 780,000 synthetic trajectories in 11 hours (equivalent to roughly 6,500 hours of human demonstration data), and combining these with real data improved Isaac GR00T N1 policy performance by 40%.
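The figures in this evaluation imply an average trajectory length and an effective speed-up over real-time collection. The sketch below derives both from the numbers quoted above; the derived quantities are back-of-the-envelope estimates, not values reported by NVIDIA.

```python
TRAJECTORIES = 780_000        # synthetic trajectories generated
GENERATION_HOURS = 11         # wall-clock generation time
EQUIVALENT_HUMAN_HOURS = 6_500  # equivalent human demonstration time

seconds_per_trajectory = EQUIVALENT_HUMAN_HOURS * 3600 / TRAJECTORIES
trajectories_per_hour = TRAJECTORIES / GENERATION_HOURS
speedup_over_real_time = EQUIVALENT_HUMAN_HOURS / GENERATION_HOURS

if __name__ == "__main__":
    print(f"average trajectory length: {seconds_per_trajectory:.0f} s")
    print(f"throughput: {trajectories_per_hour:,.0f} trajectories per hour")
    print(f"speed-up over real-time demonstration: {speedup_over_real_time:.0f}x")
```

On these numbers, each trajectory corresponds to about 30 seconds of demonstration, and generation runs several hundred times faster than collecting the same footage in real time.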

The MimicGen NIM microservice integrates with Cosmos Transfer: developers record a small number of human demonstrations in NVIDIA Isaac Sim, use MimicGen to generate thousands of synthetic trajectory variants, and then run Cosmos Transfer to make those trajectories photorealistic. RoboCasa provides simulation-ready kitchen environments in OpenUSD format that serve as the starting geometry for this pipeline.

Autonomous vehicle development

AV developers need rare edge cases such as unusual weather, pedestrian behavior, and sensor failure scenarios that are difficult and dangerous to collect from real driving. Cosmos allows teams to generate photorealistic footage of these scenarios at scale.

The Cosmos Three-Computer solution announced at CES 2025 integrates Cosmos models with NVIDIA Drive Hyperion (in-vehicle sensing), NVIDIA Drive AGX (real-time in-vehicle inference), and NVIDIA DGX (data center training). Cosmos generates synthetic camera, LiDAR, and radar sensor data that feeds the training loop, while Cosmos Reason annotates edge case clips automatically.

Waabi and Uber both cited Cosmos as part of their pipelines for accelerating autonomous driving development. Foretellix uses Cosmos to stress-test its AV simulation scenarios with rare events.

Video analytics and surveillance

Beyond robotics and autonomous driving, Cosmos models can be used for understanding and searching large video collections. Milestone Systems, a video analytics platform, uses Cosmos to search for specific scenario patterns across large sensor networks. Linker Vision and Nexar apply it to traffic analysis and driver behavior monitoring.

Guardrails and safety

Cosmos includes a two-stage safety system.

Pre-guard: Input text prompts are first screened against a blocklist of prohibited terms, then passed through NVIDIA's Aegis AI Content Safety model, which classifies prompts for harmful content. Prompts that pass both checks are forwarded to the generation model.

Post-guard: Generated video frames are evaluated by a content classifier. Faces in output footage are detected using RetinaFace and automatically blurred for privacy. Content classified as harmful is filtered before the video is returned to the caller.
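The two stages can be pictured as wrappers around the generation call. The sketch below is a structural illustration only: the blocklist entry, the classifier and blur callables, and all function names are hypothetical placeholders rather than the Cosmos Guardrails API, and real checks would call the actual safety models.

```python
from typing import Callable, List, Optional

BLOCKLIST = {"forbidden_term"}  # placeholder; the real blocklist is not reproduced here

def pre_guard(prompt: str, is_harmful_text: Callable[[str], bool]) -> bool:
    """Return True when the prompt passes the blocklist and the text classifier."""
    if set(prompt.lower().split()) & BLOCKLIST:
        return False
    return not is_harmful_text(prompt)

def post_guard(
    frames: List[str],
    is_harmful_frame: Callable[[str], bool],
    blur_faces: Callable[[str], str],
) -> Optional[List[str]]:
    """Reject the video if any frame is flagged; otherwise blur faces in every frame."""
    if any(is_harmful_frame(frame) for frame in frames):
        return None
    return [blur_faces(frame) for frame in frames]

def guarded_generate(prompt, generate, is_harmful_text, is_harmful_frame, blur_faces):
    """Run pre-guard, generation, then post-guard; None means the request was blocked."""
    if not pre_guard(prompt, is_harmful_text):
        return None
    return post_guard(generate(prompt), is_harmful_frame, blur_faces)

if __name__ == "__main__":
    result = guarded_generate(
        "a robot arm stacks blocks",
        generate=lambda prompt: ["frame_0", "frame_1"],  # stand-in for the world model
        is_harmful_text=lambda prompt: False,            # stand-in text classifier
        is_harmful_frame=lambda frame: False,            # stand-in frame classifier
        blur_faces=lambda frame: frame + "_blurred",     # stand-in face blurring
    )
    print(result)
```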

The license terms also require that users must not bypass, disable, or reduce the efficacy of any guardrail or safety mechanism. Circumventing these controls terminates the license.

Generated videos carry invisible watermarks. NVIDIA has not published the details of the watermarking scheme, but the stated purpose is to allow identification of synthetically generated footage if it is redistributed.

Licensing

Cosmos models are distributed under the NVIDIA Open Model License (NOML). The license permits commercial use, modification, and redistribution subject to several conditions:

Products or services that incorporate Cosmos models must display "Built on NVIDIA Cosmos" in a visible location such as a website, about page, or product documentation.
Guardrails must not be circumvented; doing so automatically terminates the license.
If a licensee initiates patent or copyright litigation against any entity, all licenses granted under the agreement terminate on the date the suit is filed.
The license does not require disclosure of training data or model code, which means it does not meet the Open Source Initiative's definition of open source.

NVIDIA has not disclosed the specific datasets used to train Cosmos, nor has it made the full training pipeline publicly available. Critics have noted that this limits reproducibility and makes it impossible to audit the training data for copyright or consent concerns. NVIDIA refers to the models as "open" based on the availability of weights, rather than on full process transparency.

Comparison with other world models

The world model space attracted multiple major players around the same time as the Cosmos announcement. The following table summarizes the main alternatives and their release dates:

System	Developer	Release	Access	Parameters	Physical AI focus	License
Cosmos	NVIDIA	Jan 2025	Public weights	4B to 14B (1.0); 2B/14B (2.5)	Yes (robotics, AV)	NVIDIA Open Model License
Genie 3	Google DeepMind	Aug 2025	Research preview only	Not disclosed	Partial	Proprietary
Marble	World Labs	Nov 2025	API ($20/month+)	Not disclosed	No (3D environment creation)	Proprietary SaaS
Oasis	Decart	Oct 2024	API	Not disclosed	No (interactive game worlds)	Proprietary
Wan	Alibaba	Feb 2025	Public weights	1.3B to 14B	Limited	Apache 2.0

Cosmos is distinct from this peer group in three ways. First, it is the only platform whose design targets physical AI throughout: the training data distribution, the structured control inputs in Transfer, and the physics-reasoning capabilities of Reason are all oriented toward robotics and autonomous vehicles rather than general creative video generation. Second, it ships with an integrated toolchain (Tokenizer, NeMo Curator, Isaac Sim integration) that the others lack. Third, it is one of only two systems in this group, alongside Alibaba's Wan, with publicly released model weights under a license that permits commercial use.

Genie 3 from Google DeepMind pursues real-time interactive generation of navigable 3D worlds at 24 FPS, a different objective from Cosmos's batch synthetic data generation. Genie 3 remained in limited research preview through mid-2026 and has not been released under a public license.

World Labs' Marble focuses on generating persistent, downloadable 3D environments from diverse inputs including text, photos, and panoramic images. It is a commercial product with API pricing rather than an open platform.

Decart's Oasis was originally demonstrated as a playable Minecraft-style world generated in real-time. Decart has explored porting Oasis to custom inference hardware to reduce latency but has not positioned the system for physical AI training.

Early adopters and partners

NVIDIA announced a broad set of industry partners at CES 2025 and at the March 2025 major release.

Partner	Domain	Use of Cosmos
1X	Humanoid robotics	Cosmos Predict and Transfer for training NEO Gamma humanoid
Agility Robotics	Humanoid robotics	Scaling photorealistic training data beyond real-world collection
Figure AI	Humanoid robotics	Synthetic training data generation
Skild AI	Robot brain models	Cosmos Transfer to augment synthetic datasets
Uber	Autonomous vehicles	Accelerating autonomous driving model development
Waabi	Autonomous vehicles	Synthetic data for long-haul AV
Foretellix	AV simulation	Stress-testing with rare scenarios
Parallel Domain	Synthetic data	Photorealistic AV data generation
Nexar	Traffic AI	Driver and traffic pattern analysis
Virtual Incision	Surgical robotics	Surgical simulation data
XPENG	EVs and robots	AV and humanoid training data
Agile Robots	Industrial robotics	Manipulation training data
Fourier	Humanoid robotics	General training data generation
Neura Robotics	Humanoid robotics	Synthetic scenario generation
Oxa	Autonomous vehicles	Unstructured environment simulation
Wayve	Embodied driving AI	Synthetic data augmentation

Foxconn announced that it is using the NVIDIA Omniverse blueprint (which integrates Cosmos Transfer) to simulate industrial manipulators, humanoids, and mobile robots in its manufacturing facilities, though the company's direct use of Cosmos model weights has not been separately confirmed.

Integration with NVIDIA ecosystem

Cosmos is designed to slot into NVIDIA's broader physical AI stack:

NVIDIA Omniverse: Provides the 3D simulation environment and USD asset pipeline that generates the structured inputs Cosmos Transfer consumes. Photorealistic output from Cosmos Transfer feeds back into Omniverse for rendering or further simulation.
Isaac GR00T: The humanoid robot foundation model uses Cosmos Transfer to convert Isaac Sim trajectories into photorealistic training footage, and Cosmos Reason for data annotation. The GR00T-Dreams blueprint is built directly on Cosmos Predict.
MimicGen: Generates synthetic robot trajectories that are then passed through Cosmos Transfer to add visual realism.
RoboCasa: Provides OpenUSD kitchen environments used as starting geometry for the Cosmos Transfer pipeline.
NeMo Framework: Handles distributed fine-tuning of Cosmos models on proprietary datasets with dataset sharding, deterministic data loading, and bandwidth optimization across GPU clusters.
NeMo Curator: Curates, clips, and labels the large video datasets used for pre-training or domain-specific post-training.
DGX Cloud: Provides cloud computing infrastructure for running Cosmos workloads without on-premise hardware.
NVIDIA AI Enterprise: Offers enterprise support and compliance tooling for production deployments of Cosmos models.

Cosmos models are also available in the Vertex AI Model Garden on Google Cloud and through the NVIDIA API catalog as NIM microservices, allowing deployment without managing GPU infrastructure directly.

Limitations

Several limitations of Cosmos have been noted:

Training data opacity: NVIDIA has not disclosed the specific sources used to build the 20-million-hour training dataset. This limits independent audits for copyright issues, privacy violations, or demographic bias in the training distribution.

License restrictions: Although NVIDIA markets Cosmos as "open," the NOML includes a guardrail circumvention clause and an attribution display requirement that have no counterpart in open-source licenses such as Apache 2.0. Its litigation termination clause, as described above, is also broader than Apache 2.0's patent-retaliation provision, since it covers copyright as well as patent suits. Bypassing safety guardrails automatically terminates all rights under the license.

Domain gap not fully eliminated: Cosmos Transfer reduces but does not eliminate the sim-to-real gap. The quality of photorealization depends on the quality of the underlying simulation geometry and the fidelity of the control signals. Poorly specified segmentation maps or inaccurate depth information produce unrealistic output.

Long video coherence: Like most video generation models, Cosmos Predict models lose temporal coherence over very long sequences. The 1.0 autoregressive models can attend to at most 34 frames (about 3.4 seconds at 10 FPS), which limits their utility for tasks requiring understanding of long episodes.

Hardware requirements: Running the 14B diffusion models at full resolution requires multiple high-end GPUs, and the throughput reported for the 4B autoregressive base model was measured on eight H100 GPUs. This limits accessibility for smaller research groups and startups without access to high-end GPU clusters.

Physical accuracy: Cosmos models learn statistical regularities in video rather than explicit physics. The generated footage looks physically plausible in most cases, but the models can produce physically incorrect events in unusual scenarios. Cosmos Reason's physics-filtering capability partially mitigates this, but does not guarantee physical correctness of generated data.

References

NVIDIA Newsroom. "NVIDIA Launches Cosmos World Foundation Model Platform to Accelerate Physical AI Development." January 6, 2025. https://nvidianews.nvidia.com/news/nvidia-launches-cosmos-world-foundation-model-platform-to-accelerate-physical-ai-development
NVIDIA Blog. "NVIDIA Makes Cosmos World Foundation Models Openly Available to Physical AI Developer Community." https://blogs.nvidia.com/blog/cosmos-world-foundation-models/
Alhaija, Hassan Abu, et al. "Cosmos World Foundation Model Platform for Physical AI." arXiv:2501.03575, January 2025. https://arxiv.org/abs/2501.03575
NVIDIA Newsroom. "NVIDIA Announces Major Release of Cosmos World Foundation Models and Physical AI Data Tools." March 18, 2025. https://nvidianews.nvidia.com/news/nvidia-announces-major-release-of-cosmos-world-foundation-models-and-physical-ai-data-tools
NVIDIA Technical Blog. "Advancing Physical AI with NVIDIA Cosmos World Foundation Model Platform." https://developer.nvidia.com/blog/advancing-physical-ai-with-nvidia-cosmos-world-foundation-model-platform/
NVIDIA Technical Blog. "Scale Synthetic Data and Physical AI Reasoning with NVIDIA Cosmos World Foundation Models." https://developer.nvidia.com/blog/scale-synthetic-data-and-physical-ai-reasoning-with-nvidia-cosmos-world-foundation-models/
NVIDIA Technical Blog. "Curating Synthetic Datasets to Train Physical AI Models with NVIDIA Cosmos Reason." https://developer.nvidia.com/blog/curating-synthetic-datasets-to-train-physical-ai-models-with-nvidia-cosmos-reason/
Hugging Face Blog. "Announcing NVIDIA Cosmos World Foundation Models." https://huggingface.co/blog/mingyuliutw/nvidia-cosmos
NVIDIA Cosmos GitHub Repository. https://github.com/nvidia-cosmos
NVIDIA Open Model License Agreement. https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
NVIDIA Technical Blog. "Building a Synthetic Motion Generation Pipeline for Humanoid Robot Learning." https://developer.nvidia.com/blog/building-a-synthetic-motion-generation-pipeline-for-humanoid-robot-learning/
NVIDIA Technical Blog. "Enhance Robot Learning with Synthetic Trajectory Data Generated by World Foundation Models." https://developer.nvidia.com/blog/enhance-robot-learning-with-synthetic-trajectory-data-generated-by-world-foundation-models/
VentureBeat. "Nvidia's Cosmos-Transfer1 makes robot training freakishly realistic." https://venturebeat.com/ai/nvidias-cosmos-transfer1-makes-robot-training-freakishly-realistic-and-that-changes-everything
TechCrunch. "Nvidia releases its own brand of world models." January 6, 2025. https://techcrunch.com/2025/01/06/nvidia-releases-its-own-brand-of-world-models/
Robotics 24/7. "CES 2025: NVIDIA launches Cosmos world foundation model, expands Omniverse." https://www.robotics247.com/article/ces_2025_nvidia_launches_cosmos_world_foundation_model_expands_omniverse
NVIDIA Research. "Cosmos-Predict2.5: Improved World Simulation with Video Foundation Models for Physical AI." https://research.nvidia.com/labs/cosmos-lab/cosmos-predict2.5/
