NVIDIA Cosmos Reason
Last reviewed
Jun 3, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,659 words
NVIDIA Cosmos Reason is an open, customizable reasoning vision-language model (VLM) for physical AI and robotics developed by NVIDIA. It is part of the NVIDIA Cosmos platform of world foundation models (WFMs) for physical AI, where it occupies the "reasoning" role. Rather than generating future video frames, Cosmos Reason interprets video and text, applies physical common sense and embodied knowledge, and produces structured chain-of-thought reasoning that culminates in decisions, plans, captions, or critiques expressed in natural language. The underlying models, released as Cosmos-Reason1, are designed to understand space, time, physics, and embodiment so that robots, autonomous vehicles, and vision AI agents can reason about the real world before acting in it.
Cosmos Reason was first unveiled at NVIDIA GTC in March 2025 as part of a major release of new Cosmos world foundation models, and the open model weights for the 7-billion-parameter variant were published in May 2025 under the NVIDIA Open Model License. It is documented in the research paper "Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning."
The Cosmos world-foundation-model platform was launched at CES in January 2025 as a suite of pre-trained models, tokenizers, guardrails, and a data-processing pipeline built to accelerate physical AI development. NVIDIA organized the platform's models into named families that perform distinct functions. Cosmos Reason is the reasoning member, distinguished from the two generative members:
| Cosmos model | Function | Typical input -> output |
|---|---|---|
| Cosmos Predict | World-state generation and prediction | Text, image, or video, plus motion or sensor data, to predicted future video frames and trajectories |
| Cosmos Transfer | Controllable augmentation and sim-to-real | Structured video (segmentation maps, depth maps, lidar scans, poses) to photoreal video |
| Cosmos Reason | Physical reasoning and decision-making | Video and text to chain-of-thought reasoning, decisions, plans, and captions |
Cosmos Predict and Cosmos Transfer generate or transform pixels; Cosmos Reason is the layer that understands what is happening in a scene and reasons about it. In NVIDIA's described workflows the three are complementary: Cosmos Reason can generate diverse, realistic prompts for Cosmos Predict, curate and caption the synthetic video those generative models produce, and act as a critic that filters physically implausible outputs, distinguishing, for example, a plausible accident from a physically impossible event. NVIDIA later introduced NVIDIA Cosmos 3, an omni-model that combines vision reasoning and multimodal generation in a single model, unifying capabilities that the earlier Cosmos Reason, Predict, and Transfer models provided separately.
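The critic role described above can be pictured as a filter over generated clips. The following is a minimal sketch, not NVIDIA's pipeline: `Clip`, `filter_plausible`, and `stub_critic` are hypothetical names, and the stub merely stands in for a real call to a reasoning model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Clip:
    """A synthetic video clip and the prompt that produced it (hypothetical type)."""
    path: str
    prompt: str


def filter_plausible(clips: Iterable[Clip], critic: Callable[[Clip], bool]) -> List[Clip]:
    """Keep only the clips that the critic judges physically plausible."""
    return [clip for clip in clips if critic(clip)]


def stub_critic(clip: Clip) -> bool:
    """Placeholder for a reasoning-model call (assumption): rejects prompts
    that are explicitly marked as impossible."""
    return "impossible" not in clip.prompt


clips = [
    Clip("a.mp4", "car brakes late and bumps a traffic cone"),
    Clip("b.mp4", "impossible: box floats upward off the shelf"),
]
kept = filter_plausible(clips, stub_critic)
```

Passing the critic as a callable keeps the filtering logic independent of how the model is actually served.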
Cosmos Reason is a spatiotemporally aware reasoning model. It accepts a video (or images) together with a text prompt, analyzes the visual content in the context of that prompt, runs an explicit chain-of-thought reasoning process, and emits an answer. The model is prompted to separate its reasoning from its conclusion, wrapping intermediate thinking in <think> tags and the final response in <answer> tags, which makes the reasoning trace inspectable. Because chain-of-thought traces can be long, the model card recommends allowing 4,096 or more output tokens to avoid truncating the reasoning.
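Because the reasoning and the conclusion are wrapped in separate tags, a response can be split with a small parser. The sketch below assumes the output contains at most one `<think>` block and one `<answer>` block; `parse_response` and `ParsedResponse` are hypothetical names, and treating a missing `<answer>` block as a sign of truncation is a heuristic, not documented behavior.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    reasoning: Optional[str]  # text inside <think>...</think>, if complete
    answer: Optional[str]     # text inside <answer>...</answer>, if complete
    truncated: bool           # heuristic: True when no complete answer was found


_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(text: str) -> ParsedResponse:
    """Split a model response into its reasoning trace and final answer."""
    think = _THINK.search(text)
    answer = _ANSWER.search(text)
    return ParsedResponse(
        reasoning=think.group(1).strip() if think else None,
        answer=answer.group(1).strip() if answer else None,
        truncated=answer is None,
    )


sample = (
    "<think>\nThe cup is near the edge.\n</think>\n\n"
    "<answer>\nMove the cup first.\n</answer>"
)
parsed = parse_response(sample)

# A response cut off mid-reasoning, as can happen when the token budget is too small.
cut_off = parse_response("<think>\nThe cup is near the")
```

A caller could retry with a larger output-token limit whenever `truncated` is true.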
The capability set spans three broad application areas that NVIDIA highlights:
The research release comprises two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B. Both use a decoder-only multimodal architecture in which a vision encoder feeds a projector that aligns visual tokens with text embeddings, after which a language-model backbone performs reasoning over the combined token stream. The two sizes differ substantially in their components:
| Variant | Vision encoder | LLM backbone | Layers | Model dimension | Public release |
|---|---|---|---|---|---|
| Cosmos-Reason1-7B | Built on Qwen2.5-VL | Standard Transformer | 28 | 3,584 | Yes (Hugging Face, GitHub) |
| Cosmos-Reason1-56B | InternViT-300M-V2.5 | Nemotron-H (hybrid Mamba-MLP-Transformer) | 118 | 8,192 | Described in paper |
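The encoder-projector-backbone data flow can be illustrated with a toy example. This is a sketch under strong assumptions, not the released architecture: a single bias-free linear layer stands in for the projector, the dimensions are tiny placeholders rather than the values in the table, visual tokens are simply placed before the text tokens, and all function names are hypothetical.

```python
import random
from typing import List

Vector = List[float]


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Apply a bias-free linear map; weight holds one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def project_visual_tokens(visual: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Map each vision-encoder token into the language model's embedding space."""
    return [linear(token, weight) for token in visual]


def build_token_stream(visual: List[Vector], text: List[Vector],
                       weight: List[Vector]) -> List[Vector]:
    """Concatenate projected visual tokens with text embeddings for the backbone."""
    return project_visual_tokens(visual, weight) + text


random.seed(0)
VISION_DIM, MODEL_DIM = 4, 6  # toy sizes chosen for illustration only

weight = [[random.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(MODEL_DIM)]
visual_tokens = [[random.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(3)]
text_embeddings = [[0.0] * MODEL_DIM for _ in range(2)]

stream = build_token_stream(visual_tokens, text_embeddings, weight)
```

After projection every token has the same width, which is what lets the backbone treat visual and text tokens as one sequence.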
The 7B model is built on the Qwen2.5-VL vision-language model, with the model card reporting a total of roughly 8.3 billion parameters once the vision transformer (about 0.68B), the language model (about 7.07B), and the output projection layer are counted. The 56B model adopts a hybrid Mamba-MLP-Transformer LLM backbone from NVIDIA's Nemotron-H line paired with an InternViT vision encoder. The publicly downloadable model is Cosmos-Reason1-7B; it accepts text plus video (mp4, recommended at 4 frames per second) or images, and produces text. Inference runs in BF16 on NVIDIA Hopper and Blackwell hardware, with testing reported on H100, GB200, and the Ampere-generation A100, and is supported through runtimes such as vLLM and SGLang.
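The parameter figures above leave the size of the output projection layer implicit. The arithmetic below is a rough sketch that recovers it from the quoted, rounded numbers. It assumes the three components account for the whole total and that the projection is a dense, bias-free matrix with one row per vocabulary entry. Both are assumptions rather than statements from the model card, so the results are only order-of-magnitude estimates.

```python
# Figures quoted from the model card, in billions of parameters (rounded).
TOTAL_B = 8.3
VISION_B = 0.68
LANGUAGE_B = 7.07

# Model dimension of the 7B variant, from the comparison table.
MODEL_DIM = 3584

# Remainder attributed to the output projection layer (assumption: nothing else counted).
implied_projection_b = round(TOTAL_B - VISION_B - LANGUAGE_B, 2)

# Rows of a dense (rows x MODEL_DIM) matrix with that many parameters
# (assumption: no bias term).
implied_vocab_rows = round(implied_projection_b * 1e9 / MODEL_DIM)
```

Under these assumptions the projection accounts for roughly half a billion parameters, which corresponds to a vocabulary on the order of 150 thousand entries.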
To represent physical common sense, the models use a hierarchical ontology that captures fundamental knowledge about space, time, and physics. For embodied reasoning, they rely on a two-dimensional ontology designed to generalize across different physical embodiments, so that reasoning learned on one robot form factor transfers to others.
Cosmos-Reason1 is trained in two stages on top of its pre-trained base: Physical AI supervised fine-tuning (SFT) followed by Physical AI reinforcement learning (RL). The SFT stage teaches the model to reason over physical-world video using curated demonstration data, and the RL stage further sharpens decision quality and the chain-of-thought reasoning process.
The supervised fine-tuning data, totaling roughly 300 GB, is drawn from a mix of robotics, egocentric, and driving sources collected through automatic, sensor-based, and human methods. Reported components include RoboVQA, BridgeData V2, AgiBot, HoloAssist, RoboFail, and an autonomous-vehicle (AV) set, with later additions covering human-centric video, dashcam video, dense captioning, and intelligent-transportation and warehouse footage. Smaller reinforcement-learning and benchmark datasets accompany the SFT corpus. NVIDIA released the SFT, RL, and benchmark datasets alongside the model.
Alongside the models, the authors constructed benchmarks aligned to their ontologies to measure progress in physical reasoning:
The paper reports that both the SFT and RL stages yield significant gains. Averaged across the physical-common-sense and the six embodied-reasoning datasets, the 7B model rises from 60.7 after supervised fine-tuning to 65.7 after Physical AI reinforcement learning, a gain of about 5.0 points. On the Intuitive Physics benchmark, which averages arrow-of-time, spatial-puzzle, and object-permanence tasks, the 7B model reaches roughly 74.5 after SFT, far above a random baseline near 41.7 that the authors note general VLMs barely exceed, and improves further with RL. The 56B model posts comparable or stronger results on the physical-common-sense and embodied-reasoning suites. (Reported figures vary slightly by checkpoint and by the exact mix of benchmark datasets averaged; the published Cosmos-Reason1-7B model card lists a 65.1 average across its six embodied-reasoning datasets.)
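The headline numbers can be checked with a few lines of arithmetic. The sketch below recomputes the reinforcement-learning gain from the two reported averages and shows one combination of answer formats that is consistent with a random baseline near 41.7. The choice counts per task (two binary tasks and one four-way task) are an assumption made for illustration, not figures taken from the text above.

```python
# Reported 7B averages across the physical-common-sense and embodied-reasoning suites.
sft_average = 60.7
rl_average = 65.7
rl_gain = round(rl_average - sft_average, 1)


def chance_accuracy(num_choices: int) -> float:
    """Expected accuracy (in percent) of uniform random guessing."""
    return 100.0 / num_choices


# Assumed number of answer options per Intuitive Physics task (illustrative only).
assumed_choices = {
    "arrow_of_time": 2,
    "spatial_puzzle": 4,
    "object_permanence": 2,
}

random_baseline = round(
    sum(chance_accuracy(n) for n in assumed_choices.values()) / len(assumed_choices),
    1,
)
```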
NVIDIA released Cosmos Reason as an open model. The Cosmos-Reason1-7B weights are distributed on Hugging Face under the NVIDIA Open Model License, which permits commercial use, and the accompanying source code is released under the Apache 2.0 license. The GitHub repository provides inference scripts, post-training tutorials for both SFT and reinforcement learning, and FP8 quantization utilities. After launch, the model was updated iteratively, adding an enhanced critic for physical plausibility and temporal timestamp captioning, and gaining spatial-temporal reasoning aimed at city-scale and industrial operations.
Cosmos Reason is significant as one of the first openly released reasoning models purpose-built for physical AI rather than for text or general multimodal tasks. By packaging physical common sense, embodied reasoning, and inspectable chain-of-thought into a customizable VLM, it gives robotics and autonomous-vehicle developers a high-level "thinking" component that complements the generative Cosmos models and the broader physical AI tooling NVIDIA assembles around simulation, synthetic data, and policy training. Its role as a data curator and critic also addresses a central bottleneck in the field, the cost and quality of the video data needed to train embodied systems, positioning it within the larger pipeline that NVIDIA later consolidated into the omni-model design of Cosmos 3.