NVIDIA Cosmos is a platform of generative world foundation models, video tokenizers, safety guardrails, and an accelerated data curation pipeline built by NVIDIA for the development of physical AI. The platform was first unveiled by NVIDIA founder and CEO Jensen Huang during his CES 2025 keynote on January 6, 2025, and was positioned as a counterpart to large language models for the physical world. Where text-only models such as GPT predict the next token in a sentence, Cosmos models predict the next frame in a video conditioned on text, images, sensor data, or robot actions, letting them act as simulators for robotics and autonomous driving systems.[1][2]
The platform is released under the permissive NVIDIA Open Model License Agreement, with source code under Apache 2.0, and weights distributed through Hugging Face and the NVIDIA NGC catalog. As of early 2026, Cosmos sits alongside NVIDIA Isaac Sim, Isaac GR00T, and the DRIVE Hyperion stack as a central component of NVIDIA's physical AI strategy. The first release shipped diffusion and autoregressive video models in the 4 to 14 billion parameter range, and by CES 2026 the platform had expanded to include Cosmos Predict 2.5, Cosmos Transfer 2.5, Cosmos Reason 2, and the Cosmos Tokenizer suite.[3][4]
The origins of Cosmos trace back to NVIDIA's long-running interest in foundation models for visual data, including the StyleGAN family from NVIDIA Research and the Picasso generative media platform announced at GTC 2023. By late 2024, NVIDIA leadership had begun describing physical AI as the next major wave after generative AI for text and images, with Huang repeatedly using the phrase "the ChatGPT moment for robotics is coming." Cosmos was meant to deliver that moment by giving robotics and automotive companies a pretrained world model they could fine tune for their own embodiments, much as startups fine tune Llama or Mistral for chat.[1]
The official launch took place at CES 2025 in Las Vegas, where Huang dedicated a major segment of his 90 minute keynote to physical AI and confirmed that the models would be released openly. The announcement was paired with an arXiv preprint titled "Cosmos World Foundation Model Platform for Physical AI," submitted on January 7, 2025 as arXiv:2501.03575, with NVIDIA Research scientist Ming-Yu Liu among the lead authors and a credit list of 77 contributors from the NVIDIA Research, NeMo, and Omniverse teams.[2][5]
NVIDIA expanded the Cosmos rollout at GTC 2025 on March 18, 2025, announcing Cosmos Transfer for controllable scene generation and an early access version of Cosmos Reason. Further updates landed at SIGGRAPH 2025, GTC Paris 2025, and the CES 2026 special presentation on January 5, 2026, where Cosmos Predict 2.5, Cosmos Transfer 2.5, and Cosmos Reason 2 were made openly available alongside expanded partnerships.[6][7]
Cosmos is structured as three families of generative models plus shared infrastructure. The model families are Cosmos Predict, Cosmos Transfer, and Cosmos Reason. The supporting infrastructure includes the Cosmos Tokenizer, the Cosmos Guardrail safety stack, and a data curation pipeline that runs on top of NVIDIA NeMo Curator.[3][8]
The central design idea is the world foundation model, or WFM. A WFM takes some combination of past video frames, an image, a text prompt, sensor data, or robot action commands and predicts how the world will evolve over the next few seconds. For an autonomous vehicle this could mean predicting the trajectory of a pedestrian about to cross the street; for a humanoid robot it could mean predicting how a held object will tilt under gravity; for a video analytics agent it could mean explaining why a sequence of events in a warehouse looks anomalous. Because the same backbone can be conditioned on many input types and fine tuned for many downstream tasks, NVIDIA frames Cosmos as the physical AI counterpart to a general purpose LLM.[2][9]
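The multi-modal conditioning interface described above can be sketched in a few lines. This is an illustrative stand-in, not the Cosmos API: the class and function names are hypothetical, and the "predictor" only emits placeholder frames where a real WFM would run a diffusion or autoregressive transformer over video tokens.

```python
from dataclasses import dataclass, field
from typing import Optional, Sequence

@dataclass
class WorldModelInput:
    """Hypothetical conditioning bundle for a world foundation model.
    Field names are illustrative, not the Cosmos API."""
    text_prompt: Optional[str] = None
    past_frames: Sequence = field(default_factory=list)    # prior video frames
    robot_actions: Sequence = field(default_factory=list)  # e.g. (dx, dy, grip)

def predict_next_frames(cond: WorldModelInput, horizon: int = 16) -> list:
    """Stand-in predictor: a real WFM would run a diffusion or
    autoregressive backbone here; we just emit placeholder frames."""
    if not (cond.text_prompt or cond.past_frames or cond.robot_actions):
        raise ValueError("need at least one conditioning signal")
    return [f"frame_{i}" for i in range(horizon)]

frames = predict_next_frames(
    WorldModelInput(text_prompt="a pedestrian steps off the curb")
)
print(len(frames))  # 16
```

The point of the sketch is the signature: any subset of text, past frames, and actions can condition the same backbone, which is what lets one pretrained model serve driving, manipulation, and analytics tasks.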
Cosmos Predict is the original world generation family. Cosmos Predict 1 (originally released as Cosmos 1.0) shipped in January 2025 with both diffusion based and autoregressive variants. The diffusion side included Cosmos-1.0-Diffusion-7B-Text2World and Cosmos-1.0-Diffusion-14B-Text2World, latent diffusion models on a Diffusion Transformer backbone, plus Video2World fine tunes that take an existing clip plus a text prompt and predict future frames. The autoregressive side included Cosmos-1.0-Autoregressive-4B and Cosmos-1.0-Autoregressive-12B, Llama 3 style transformer decoders trained from scratch on discrete video tokens. A 5B Video2World variant was derived from the 4B base by adding cross attention layers conditioned on T5 text embeddings.[10][11]
Cosmos Predict 2, released in mid 2025, shipped two scales (2B and 14B) each with Text2Image and Video2World variants. The 2B variant targets low latency robotics workflows and edge deployment on hardware like NVIDIA Jetson Thor, while the 14B variant is intended for higher fidelity synthetic data generation on H100 or B200 class GPUs.[12]
Cosmos Predict 2.5, announced at CES 2026, unified Text2World, Image2World, and Video2World into a single flow based model with one set of weights. Predict 2.5 added trajectory conditioning, allowing the model to receive waypoints, gripper positions, or steering commands as control input, which makes it usable as a physics aware policy simulator for vision-language-action models.[7][13]
Cosmos Transfer is the controllable style and domain transfer family. Whereas Predict generates futures from sparse inputs, Transfer takes a richly specified scene (typically from an NVIDIA Isaac Sim or NVIDIA Omniverse simulation) along with structural controls such as depth maps, segmentation masks, edge maps, or HD maps, and re renders it as photorealistic video. This is the workhorse for synthetic data generation: a developer can author one scenario in Omniverse and use Transfer to produce thousands of variations under different lighting, weather, vehicle types, road textures, and times of day while preserving the exact ground truth labels needed for supervised training.[6][14]
The first Cosmos Transfer model was a 7B parameter network announced at GTC 2025 with multi control conditioning. Cosmos Transfer 2.5, released in January 2026, is built on Cosmos Predict 2.5 and inherits its unified flow based backbone, with improved temporal coherence, additional control modalities, and better preservation of fine geometric detail under domain randomization.[7][15]
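The "one scenario, thousands of variations" workflow reduces to a cross product over appearance axes while the ground truth labels ride along unchanged. The sketch below uses illustrative variation axes and label names; the essential property is that every variant reuses the same label reference, which is exactly why Transfer-style augmentation preserves supervision.

```python
import itertools

# One authored scenario crossed with appearance variations. The ground
# truth label reference is copied unchanged into every variant:
# appearance changes, labels do not. Axes and names are illustrative.
weathers = ["clear", "rain", "fog", "snow"]
times_of_day = ["dawn", "noon", "dusk", "night"]
lighting = ["overcast", "direct sun", "streetlights"]

scenario = {"id": "intersection_042", "labels": "seg_masks_042.npz"}
variants = [
    {**scenario, "weather": w, "time": t, "lighting": l}
    for w, t, l in itertools.product(weathers, times_of_day, lighting)
]
print(len(variants))  # 4 * 4 * 3 = 48 renders from one labeled scenario
```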
Cosmos Reason is the reasoning vision-language model family. Where Predict and Transfer generate video, Reason analyzes it: it takes a video or image plus a natural language question and produces a long chain of thought trace that combines perception, prior knowledge, and physical common sense. NVIDIA positions it as the planning brain that pairs with Predict's simulation engine.[16][17]
Cosmos Reason 1 was released in March 2025 in early access and made fully available later that year as a 7B parameter open VLM on Hugging Face. Cosmos Reason 2 was announced at CES 2026 in 2B and 8B parameter scales. Reason 2 is trained with reinforcement learning from human feedback over physical AI tasks and reaches the top of the Physical AI Bench leaderboard with an average score of 65.7. Typical use cases include video data curation, robot task planning (decomposing a high level instruction such as "clean the kitchen" into subtasks), and video analytics agents.[18][19]
The Cosmos Tokenizer compresses raw video into the discrete or continuous token sequences that Predict, Transfer, and Reason consume. NVIDIA published a suite of tokenizers under the same open license, including continuous image (CI), continuous video (CV), discrete image (DI), and discrete video (DV) variants at multiple compression ratios. The flagship discrete video tokenizer, Cosmos-1.0-Tokenizer-DV8x16x16, achieves 16x spatial compression in each dimension and 8x temporal compression for a combined factor of 2,048x relative to raw RGB. NVIDIA reports that Cosmos Tokenizer delivers up to 8x more total compression than prior state of the art video tokenizers while maintaining higher reconstruction quality and running up to 12x faster.[20]
Architecturally, the encoder uses a 2 level Haar wavelet transform that downsamples by 4x in space and time, followed by causal temporal convolution and causal temporal attention layers. The causal mechanism preserves frame order so a single network can tokenize both still images and video. Continuous tokenizers pair with diffusion models; discrete tokenizers pair with autoregressive transformers trained with cross entropy.[20][2]
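The compression figures above follow directly from the tokenizer name: DV8x16x16 means 8x temporal and 16x per spatial dimension. The arithmetic below checks the combined factor and works out a token grid for an illustrative clip, assuming the common causal layout in which the first frame is tokenized alone and subsequent frames in groups of eight (the clip dimensions are examples, not a documented Cosmos configuration).

```python
# Token-grid arithmetic for the DV8x16x16 discrete video tokenizer:
# one token covers 8 * 16 * 16 = 2,048 pixels worth of raw video.
temporal, spatial = 8, 16
compression = temporal * spatial * spatial
print(compression)  # 2048

# Illustrative 121-frame 1024x1024 clip, assuming a causal layout
# (first frame alone, then groups of `temporal` frames):
frames, height, width = 121, 1024, 1024
token_frames = 1 + (frames - 1) // temporal
token_grid = (token_frames, height // spatial, width // spatial)
print(token_grid)  # (16, 64, 64)
```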
Cosmos Guardrail is the safety layer that wraps the generative models, split into pre guard and post guard stages. Pre guard runs on text input, combining a keyword blocklist with the NVIDIA Aegis content safety model (a fine tuned LLM that classifies prompts for categories such as violence, harassment, and profanity). Post guard applies a video content safety classifier to generated frames before they are returned. The components are distributed as nvidia/Cosmos-1.0-Guardrail on Hugging Face.[21]
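The two-stage structure can be sketched as a pair of filter functions. Everything here is a minimal stand-in: the blocklist words, thresholds, and stub classifiers are illustrative, not the Cosmos Guardrail implementation, and the stubs mark where the Aegis prompt classifier and the frame-level video classifier would actually run.

```python
# Minimal sketch of a two-stage guardrail. Names, blocklist entries,
# and stub classifiers are illustrative, not the shipped implementation.
BLOCKLIST = {"gore", "explosive"}

def aegis_stub_is_safe(prompt: str) -> bool:
    # A real deployment would call the Aegis content safety model here.
    return True

def frame_stub_is_safe(frame) -> bool:
    # A real deployment would run a video content safety classifier here.
    return True

def pre_guard(prompt: str) -> bool:
    """Stage 1: keyword blocklist first, then the prompt classifier."""
    if any(word in prompt.lower() for word in BLOCKLIST):
        return False
    return aegis_stub_is_safe(prompt)

def post_guard(frames: list) -> list:
    """Stage 2: drop any generated frame the classifier flags."""
    return [f for f in frames if frame_stub_is_safe(f)]

assert pre_guard("a robot stacking boxes in a warehouse")
assert not pre_guard("an explosive device")
```

The blocklist-before-classifier ordering is the cheap-check-first pattern: a string match costs nothing, so the LLM classifier only runs on prompts that survive it.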
The table below summarizes the principal Cosmos generative models released between January 2025 and January 2026. Sizes refer to parameter counts in the main backbone, not including the tokenizer or text encoder.
| Model | Family | Parameters | Architecture | Inputs | Output | First released |
|---|---|---|---|---|---|---|
| Cosmos-1.0-Diffusion-7B-Text2World | Predict 1 | 7B | Latent diffusion transformer | Text | Video | Jan 2025 |
| Cosmos-1.0-Diffusion-14B-Text2World | Predict 1 | 14B | Latent diffusion transformer | Text | Video | Jan 2025 |
| Cosmos-1.0-Diffusion-7B-Video2World | Predict 1 | 7B | Latent diffusion transformer | Text + video | Video | Jan 2025 |
| Cosmos-1.0-Diffusion-14B-Video2World | Predict 1 | 14B | Latent diffusion transformer | Text + video | Video | Jan 2025 |
| Cosmos-1.0-Autoregressive-4B | Predict 1 | 4B | Llama 3 style decoder | Discrete video tokens | Video | Jan 2025 |
| Cosmos-1.0-Autoregressive-12B | Predict 1 | 12B | Llama 3 style decoder | Discrete video tokens | Video | Jan 2025 |
| Cosmos-1.0-Autoregressive-5B-Video2World | Predict 1 | 5B | Llama 3 decoder + cross attention | Text + video | Video | Jan 2025 |
| Cosmos-Predict2-2B-Text2Image | Predict 2 | 2B | Diffusion transformer | Text | Image | 2025 |
| Cosmos-Predict2-14B-Text2Image | Predict 2 | 14B | Diffusion transformer | Text | Image | 2025 |
| Cosmos-Predict2-2B-Video2World | Predict 2 | 2B | Diffusion transformer | Text + image + video | Video | 2025 |
| Cosmos-Predict2-14B-Video2World | Predict 2 | 14B | Diffusion transformer | Text + image + video | Video | 2025 |
| Cosmos-Predict2.5-2B | Predict 2.5 | 2B | Unified flow based transformer | Text + image + video + trajectory | Video | Jan 2026 |
| Cosmos-Predict2.5-14B | Predict 2.5 | 14B | Unified flow based transformer | Text + image + video + trajectory | Video | Jan 2026 |
| Cosmos-Transfer1-7B | Transfer 1 | 7B | Diffusion with multi control | Text + structural controls | Video | Mar 2025 |
| Cosmos-Transfer2.5 | Transfer 2.5 | 14B class | Flow based on Predict 2.5 | Text + multi spatial controls | Video | Jan 2026 |
| Cosmos-Reason1-7B | Reason 1 | 7B | VLM with chain of thought | Video + text query | Text reasoning trace | Mar 2025 (EA), 2025 (GA) |
| Cosmos-Reason2-2B | Reason 2 | 2B | VLM with RLHF chain of thought | Video + text query | Text reasoning trace | Dec 2025 |
| Cosmos-Reason2-8B | Reason 2 | 8B | VLM with RLHF chain of thought | Video + text query | Text reasoning trace | Dec 2025 |
| Cosmos-1.0-Guardrail | Safety | n/a | Aegis + content classifier | Text and video | Safety verdict | Jan 2025 |
| Cosmos-1.0-Tokenizer suite | Tokenizer | n/a | Causal Haar + conv + attention | Image or video | Discrete or continuous tokens | Jan 2025 |
NVIDIA has been unusually open about Cosmos' training scale. The first generation models were trained on roughly 9,000 trillion tokens (9 quadrillion) extracted from approximately 20 million hours of curated real world video. The corpus combines internal NVIDIA datasets, partner contributions covering driving, manufacturing, surgical, and warehouse domains, and licensed third party video. The Cosmos data curation pipeline uses NeMo Curator to deduplicate, filter, and label this material, and on a Blackwell class system it can process the full 20 million hours in about 14 days, compared to 40 days on Hopper hardware and an estimated three years on a CPU only pipeline.[3][9]
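Dividing the quoted figures gives a sense of the curation throughput involved. This is back-of-envelope arithmetic on the numbers reported above (it ignores pipeline details like overlap or restarts):

```python
# Throughput implied by the reported curation times for the ~20 million
# hour corpus; simple division of the quoted figures.
hours_of_video = 20_000_000
days = {"Blackwell": 14, "Hopper": 40, "CPU only": 3 * 365}

for system, d in days.items():
    print(f"{system}: {hours_of_video / d:,.0f} video-hours per day")

speedup = days["Hopper"] / days["Blackwell"]
print(f"Blackwell vs Hopper: {speedup:.1f}x")  # ~2.9x
```

At roughly 1.4 million video-hours per day on Blackwell, the pipeline processes about two and a half months of video every second.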
The initial training runs used over 10,000 NVIDIA H100 GPUs in parallel, with later generations including Predict 2 and Predict 2.5 leveraging H200 and B200 (Blackwell) systems. Inference for the smaller models such as Cosmos Predict 2 2B and Cosmos Reason 2 2B targets single GPU or Jetson Thor edge configurations, while 14B class models typically require multi GPU inference. Cosmos-1.0-Autoregressive-4B is documented as supporting efficient inference on an 8x H100 node.[9][22]
Cosmos is a horizontal platform for physical AI. The principal application areas are synthetic data generation for autonomous driving, synthetic data and policy training for robotics, simulation based policy evaluation, sensor data curation, and reasoning agents for video analytics.[2][6]
For autonomous vehicles, Cosmos Transfer takes simulated driving scenarios authored in Omniverse and re renders them under endless variations of weather, lighting, traffic density, and road geometry, producing labeled training video at a scale that is impractical to collect from real fleets. In robotics, Cosmos Predict generates plausible video continuations from a handful of human teleoperated demonstrations, which become imitation learning targets for the GR00T humanoid foundation models. For data curation, Cosmos Reason filters, scores, and captions petabytes of raw sensor video. In analytics, Cosmos Reason is deployed as a video understanding agent that can answer free form questions about retail, security, traffic, or industrial footage.[14][16][23]
The table below summarizes how some of the major adopters disclosed by NVIDIA are using Cosmos as of CES 2026.
| Partner | Sector | Use of Cosmos |
|---|---|---|
| 1X Technologies | Humanoid robotics | Released the 1X World Model Challenge dataset using Cosmos Tokenizer; Cosmos used in policy training for the Neo humanoid |
| Figure AI | Humanoid robotics | Synthetic motion data generation for humanoid manipulation policies |
| Agility Robotics | Humanoid robotics | Sim to real training data for the Digit warehouse humanoid |
| Boston Dynamics | Robotics | Cosmos Transfer for synthetic data and Cosmos Reason for task planning on Atlas and Spot platforms (CES 2026) |
| Skild AI | General purpose robotics | Fast tracking development of a foundation model for arbitrary robot embodiments |
| Galbot, Hillbot, IntBot, Fourier, NEURA Robotics, Agile Robots | Robotics | Synthetic data and policy training for service and industrial robots |
| XPENG | Automotive and humanoid | Cosmos used to accelerate development of the XPENG humanoid robot |
| Waabi | Autonomous trucking | Evaluation of Cosmos for AV data curation and simulation |
| Aurora Innovation | Autonomous trucking | Cosmos used as part of the joint NVIDIA DRIVE stack for L4 truck deployment |
| Foretellix | AV simulation | Generation of safety critical scenario variations for AV verification |
| Uber | Ridesharing and AV | Joint AI data factory built on Cosmos to support a 100,000 vehicle robotaxi fleet starting 2027 |
| Toyota | Automotive | Standardizing future vehicles on NVIDIA DriveOS integrated with Cosmos (CES 2026) |
| Hyundai | Automotive and robotics | Exploring Alpamayo and Cosmos integration; humanoid manufacturing robots planned for Savannah EV plant by 2028 |
| Lucid Motors | Automotive | Full stack NVIDIA AV software including Cosmos workflows for next generation passenger vehicles |
| Stellantis | Automotive | DRIVE Hyperion customer with Cosmos based simulation in development pipeline |
| Mercedes-Benz | Automotive | First passenger car with Alpamayo on NVIDIA DRIVE in the all new CLA sedan |
| Foxconn | Manufacturing and EV | Hardware and systems integration partner with Stellantis and other OEMs in the Cosmos backed AV stack |
| Caterpillar | Heavy equipment | Cosmos based development of next generation autonomous machinery (CES 2026) |
| LG Electronics | Consumer and home robotics | Cosmos powered next generation home and service robots (CES 2026) |
| Franka Robotics | Manipulation | Cosmos based training pipelines for robotic arms (CES 2026) |
| Virtual Incision | Medical robotics | Cosmos for synthetic data in surgical robotics |
NVIDIA also announced a Halos Certified Program in 2025, the industry's first formal scheme to evaluate and certify physical AI safety for autonomous vehicles and robotics. Cosmos Guardrail is a building block of that certification program.[24]
Cosmos is one layer in NVIDIA's broader physical AI stack. The others are NVIDIA Isaac Sim and Isaac Lab for physics simulation and reinforcement learning, Isaac GR00T for humanoid robot foundation models, NVIDIA Omniverse for digital twin authoring, NVIDIA NeMo Curator for data curation, and NVIDIA DRIVE Hyperion and Halos for automotive deployment and safety.[14][23]
The canonical workflow is: (1) author a digital twin in Omniverse with Universal Scene Description; (2) use Isaac Sim or Isaac Lab to run physics simulation and capture trajectories from human demonstrations; (3) use Cosmos Transfer to convert those trajectories into photorealistic synthetic video; (4) use Cosmos Predict to generate additional plausible futures; (5) use Cosmos Reason to label and curate the dataset; (6) fine tune Isaac GR00T or the DRIVE Hyperion stack; (7) deploy the resulting policy. NVIDIA calls this loop a physical AI data factory, and Uber's October 2025 announcement describes a Cosmos powered data factory to support its planned 100,000 vehicle robotaxi network.[14][23][24]
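The seven-step loop reads naturally as a data pipeline. The skeleton below is purely structural: every function is a placeholder with a hypothetical name (these are not NVIDIA APIs), but the composition order matches the workflow above, and the final comprehension shows how one authored twin fans out into a curated dataset.

```python
# The physical AI data factory loop as a minimal pipeline skeleton.
# All function names are illustrative placeholders, not NVIDIA APIs.
def author_digital_twin(scene):            # step 1: Omniverse + USD
    return {"usd": scene}

def simulate(twin):                        # step 2: Isaac Sim trajectories
    return [{"trajectory": i, "twin": twin} for i in range(3)]

def transfer_render(traj):                 # step 3: photoreal re-render
    return {**traj, "video": "photoreal.mp4"}

def predict_futures(clip):                 # step 4: extra plausible futures
    return [clip, {**clip, "video": "alt_future.mp4"}]

def reason_label(clip):                    # step 5: label and curate
    return {**clip, "caption": "labeled", "keep": True}

twin = author_digital_twin("warehouse.usd")
dataset = [
    labeled
    for traj in simulate(twin)
    for clip in predict_futures(transfer_render(traj))
    for labeled in [reason_label(clip)]
    if labeled["keep"]
]
print(len(dataset))  # 3 trajectories x 2 futures = 6 curated clips
```

Steps 6 and 7 (fine tuning and deployment) consume `dataset` downstream; the multiplicative fan-out between steps 2 and 4 is what turns a handful of authored scenarios into training-scale data.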
Cosmos models are also distributed as NVIDIA NIM microservices: containerized inference endpoints that run on premise, in the cloud, or on Jetson Thor at the edge. The build.nvidia.com developer portal exposes them for free interactive trial, with paid commercial inference available through NVIDIA Enterprise.[16]
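Calling a hosted NIM endpoint amounts to an authenticated JSON POST. The sketch below only constructs the request: the URL path, model id, and payload schema are placeholders (each model's build.nvidia.com page documents the real request format), and the elided key and path are deliberately left as stubs.

```python
import json
import urllib.request

# Hedged sketch of a hosted NIM invocation. URL path, model id, and
# payload fields are illustrative placeholders; consult the model card
# on build.nvidia.com for the actual request schema.
API_KEY = "nvapi-..."  # issued via build.nvidia.com
URL = "https://integrate.api.nvidia.com/v1/..."  # actual path varies by model

payload = {
    "model": "nvidia/cosmos-predict-example",  # placeholder model id
    "prompt": "a forklift lifts a pallet in a dim warehouse",
}
request = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(request) would submit the job; whether the
# response is the video itself or a handle to poll depends on the model.
```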
Cosmos is sometimes grouped with consumer facing video generation systems such as Sora from OpenAI, Veo from Google DeepMind, and Gen 4 from Runway, but the design goals differ. Sora, Veo, and Gen 4 are optimized for cinematic short videos, with closed weights and hosted-only APIs. Cosmos is optimized for physically plausible simulation of robotics and driving scenarios, with open weights, multiple input modalities including sensor and trajectory data, and licenses that permit commercial fine tuning and redistribution.
| Property | NVIDIA Cosmos | OpenAI Sora 2 | Google Veo 3 | Runway Gen 4.5 |
|---|---|---|---|---|
| Primary purpose | Physical AI simulation and synthetic data | Creative short video | Creative short video, advertising | Creative short video, filmmaking |
| License | NVIDIA Open Model License (commercial use OK) | Closed, hosted API only | Closed, hosted API only | Closed, hosted API only |
| Weights available | Yes, on Hugging Face and NGC | No | No | No |
| Source code | Apache 2.0 on GitHub | No | No | No |
| Input modalities | Text, image, video, sensor, trajectory, structural controls | Text, image | Text, image | Text, image, video |
| Typical output length | Multi second video clips, extendable | Up to ~60 seconds | Up to ~60 seconds | Up to ~20 seconds |
| Audio output | No | Yes (Sora 2) | Yes (Veo 3) | Limited |
| Physics emphasis | High (driving and robotics priority) | Improved in Sora 2 | Improved in Veo 3 | Cinematic over physical |
| Fine tuning | Supported via official toolchain | No public fine tuning | Limited via Vertex AI | Limited |
| Tokenizer released openly | Yes (Cosmos Tokenizer) | No | No | No |
| Reasoning companion model | Yes (Cosmos Reason) | No equivalent | No equivalent | No equivalent |
| Pricing | Free for self hosted use | Subscription tiers | Vertex AI usage based | Subscription and credit based |
In benchmarks tracked by independent video generation leaderboards, Sora 2, Veo 3, and Runway Gen 4.5 generally lead on cinematic quality, prompt adherence, and audio video alignment, while Cosmos is the only platform of the four that publishes Physical AI Bench scores and that is routinely deployed inside robotics and AV training pipelines. The closed video systems are not used by AV or humanoid programs as serious training data generators because their licenses forbid it, their weights are unavailable for fine tuning to specific embodiments, and their conditioning interfaces do not include sensor or trajectory inputs.[25][26]
Cosmos models are released under the NVIDIA Open Model License Agreement, which NVIDIA describes as a permissive license that allows commercial use and the creation and distribution of derivative models. The associated source code (Cosmos Tokenizer, training framework, inference code) is released under Apache 2.0. Both sets of artifacts are mirrored on GitHub under the NVIDIA and nvidia-cosmos organizations and on Hugging Face under the nvidia namespace. Developers can also access Cosmos through the build.nvidia.com NIM API catalog and the NVIDIA NGC catalog. Custom licensing terms beyond the open model license are available via cosmos-license@nvidia.com.[3][20][27]
The January 2025 launch was widely covered as one of the headline announcements of CES 2025. TechCrunch described Cosmos as NVIDIA's bid to build the foundation models that the robotics and AV industries had previously had to assemble piecemeal. The Robot Report and Robotics 24/7 framed it as the missing world model layer that would let robotics startups bypass years of data collection. Constellation Research analyst Ray Wang argued that Cosmos was a strategic move to lock in NVIDIA's central position in physical AI in the same way CUDA had locked in its position in deep learning: by offering an open and free starting point that nonetheless runs best on NVIDIA hardware.[28][29][30]
The arXiv preprint and open weights were well received by the academic community, with several research groups publishing fine tunes within weeks. Critics noted that the term world model is contested and that early Cosmos models, while visually plausible, sometimes produced physics violations such as objects passing through each other. NVIDIA addressed many of these issues in later releases, with Cosmos Predict 2 and Predict 2.5 reporting improved temporal coherence and physics consistency on the NVIDIA Physical AI Bench, and Cosmos Reason 2 topping that leaderboard in December 2025.[18][19]
By CES 2026, financial media were describing Cosmos as the operating system for the physical world, with Toyota's decision to standardize its next generation vehicles on NVIDIA DriveOS plus Cosmos cited as evidence that the physical AI stack was consolidating around NVIDIA in the same way the training stack had consolidated around CUDA a decade earlier.[7][31]