NVIDIA Cosmos Reason

Embodied AI Multimodal AI NVIDIA

9 min read

Updated Jun 28, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 28, 2026

Fact-checked

In review queue

Sources

9 citations

Revision

v2 · 1,835 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

NVIDIA Cosmos Reason is an open, customizable, 7-billion-parameter reasoning vision-language model (VLM) for physical AI and robotics developed by Nvidia. It enables robots, autonomous vehicles, and vision AI agents to reason about the physical world using prior knowledge, physics understanding, and common sense, producing structured chain-of-thought reasoning that ends in a decision, plan, caption, or critique expressed in natural language.^[1]^[3] It is part of the NVIDIA Cosmos platform of world foundation models (WFMs) for physical AI, where it occupies the platform's "reasoning" role: rather than generating future video frames, Cosmos Reason interprets video and text, applies physical common sense and embodied knowledge, and explains what is happening in a scene before an agent acts in it. The underlying models, released as Cosmos-Reason1, are designed to understand space, time, physics, and embodiment so that machines can reason about the real world before taking action.^[1]

Cosmos Reason was first unveiled at NVIDIA GTC on March 18, 2025, as part of a major release of new Cosmos world foundation models, and the open model weights for the 7-billion-parameter variant were published in May 2025 under the NVIDIA Open Model License.^[5]^[3] It is documented in the research paper "Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning."^[1]

What is Cosmos Reason?

Cosmos Reason is a spatiotemporally aware reasoning model. It accepts a video (or images) together with a text prompt, analyzes the visual content in the context of that prompt, runs an explicit chain-of-thought reasoning process, and emits an answer.^[7]^[3] NVIDIA describes it as "an open, fully customizable WFM with spatiotemporal awareness that uses chain-of-thought reasoning to understand video data and predict the outcomes of interactions, such as a person stepping into a crosswalk or a box falling from a shelf, in natural language."^[5] The model is prompted to separate its reasoning from its conclusion, wrapping intermediate thinking in <think> tags and the final response in <answer> tags, which makes the reasoning trace inspectable. Because chain-of-thought traces can be long, the model card recommends allowing 4,096 or more output tokens to avoid truncating the reasoning.^[3]

The capability set spans three broad application areas that NVIDIA highlights:

Robot and autonomous-vehicle planning. As a high-level planner, Cosmos Reason can reason about a scene using prior knowledge, physics understanding, and common sense, then decide on a next step or action in natural language.^[3] A canonical example posed to the model is asking whether it is safe to turn right given a dashcam video. Through post-training, developers can build vision-language-action style planners that direct a robot or vehicle toward completing a task.^[8] The model is relevant to humanoid robotics efforts such as Isaac GR00T and policy models like GR00T N1.
Data curation and annotation. Physical AI development is bottlenecked by the cost of curating and labeling large video datasets. Cosmos Reason can automatically caption video, generate temporal timestamp captions, annotate synthetic data, and critique data quality, building curated training sets for embodied systems and acting as a reasoning filter over massive video corpora.^[7]
Video analytics AI agents. The model can serve as the reasoning engine for agents that extract insights from large volumes of recorded or live video, answering questions about events and behaviors in a scene.^[9]

What is Physical AI reasoning?

Physical AI reasoning is the ability of a model to apply physical common sense and embodied knowledge, an understanding of space, time, physics, and how an agent's body interacts with its environment, to interpret real-world scenes and decide what to do next. To represent physical common sense, Cosmos-Reason1 uses a hierarchical ontology that captures fundamental knowledge about space, time, and physics.^[1] For embodied reasoning, the models rely on a two-dimensional ontology designed to generalize across different physical embodiments, so that reasoning learned on one robot form factor transfers to others.^[1] This focus distinguishes Cosmos Reason from general-purpose VLMs: NVIDIA reports that on intuitive-physics tasks, general VLMs barely exceed a random baseline near 41.7, whereas Cosmos-Reason1 reaches roughly 74.5 after supervised fine-tuning.^[1]

How does Cosmos Reason fit into the Cosmos platform?

The Cosmos world-foundation-model platform was launched at CES on January 6, 2025, as a suite of pre-trained models, tokenizers, guardrails, and a data-processing pipeline built to accelerate physical AI development.^[6] NVIDIA organized the generative members of the platform into named families that perform distinct functions, and Cosmos Reason is the reasoning member, distinguished from the two generative members:

Cosmos model	Function	Typical input -> output
Cosmos Predict	World-state generation and prediction	Text, image, or video, plus motion or sensor data, to predicted future video frames and trajectories
Cosmos Transfer	Controllable augmentation and sim-to-real	Structured video (segmentation maps, depth maps, lidar scans, poses) to photoreal video
Cosmos Reason	Physical reasoning and decision-making	Video and text to chain-of-thought reasoning, decisions, plans, and captions

Cosmos Predict and Cosmos Transfer generate or transform pixels; Cosmos Reason is the layer that understands what is happening in a scene and reasons about it.^[5] In NVIDIA's described workflows the three are complementary: Cosmos Reason can generate diverse, realistic prompts for Cosmos Predict, curate and caption the synthetic video those generative models produce, and act as a critic that filters physically implausible outputs, distinguishing, for example, a plausible accident from a physically impossible event.^[7] NVIDIA later introduced NVIDIA Cosmos 3, an omni-model that combines vision reasoning and multimodal generation in a single model, unifying capabilities that the earlier Cosmos Reason, Predict, and Transfer models provided separately.

What architecture does Cosmos-Reason1 use?

The research release comprises two multimodal large language models, Cosmos-Reason1-7B and Cosmos-Reason1-56B.^[1] Both use a decoder-only multimodal architecture in which a vision encoder feeds a projector that aligns visual tokens with text embeddings, after which a language-model backbone performs reasoning over the combined token stream. The two sizes differ substantially in their components:

Variant	Vision encoder	LLM backbone	Layers	Model dimension	Public release
Cosmos-Reason1-7B	Built on Qwen2.5-VL	Standard Transformer	28	3,584	Yes (Hugging Face, GitHub)
Cosmos-Reason1-56B	InternViT-300M-V2.5	Nemotron-H (hybrid Mamba-MLP-Transformer)	118	8,192	Described in paper

The 7B model is built on the Qwen2.5-VL vision-language model, with the model card reporting a total of roughly 8.3 billion parameters once the vision transformer (about 675.76M), the language model (about 7.07B), and the output projection layer (about 545.00M) are counted.^[3] The 56B model adopts a hybrid Mamba-MLP-Transformer LLM backbone from NVIDIA's Nemotron-H line paired with an InternViT vision encoder.^[1] The publicly downloadable model is Cosmos-Reason1-7B; it accepts text plus video (mp4, recommended at 4 frames per second) or images, and produces text.^[3] Inference runs in BF16 on NVIDIA Hopper and Blackwell hardware (tested on H100, A100, and GB200) and is supported through runtimes such as vLLM and SGLang.^[3]

How is Cosmos Reason trained?

Cosmos-Reason1 is trained in two stages on top of its pre-trained base: Physical AI supervised fine-tuning (SFT) followed by Physical AI reinforcement learning (RL).^[1] The SFT stage teaches the model to reason over physical-world video using curated demonstration data, and the RL stage further sharpens decision quality and the chain-of-thought reasoning process.

The supervised fine-tuning data, totaling roughly 300 GB, is drawn from a mix of robotics, egocentric, and driving sources collected through automatic, sensor-based, and human methods.^[1] Reported components include RoboVQA, BridgeData V2, AgiBot, HoloAssist, RoboFail, and an autonomous-vehicle (AV) set, with later additions covering human-centric video, dashcam video, dense captioning, and intelligent-transportation and warehouse footage. Smaller reinforcement-learning and benchmark datasets accompany the SFT corpus. NVIDIA released the SFT, RL, and benchmark datasets alongside the model.^[1]

How well does Cosmos Reason perform on benchmarks?

Alongside the models, the authors constructed benchmarks aligned to their ontologies to measure progress in physical reasoning:^[1]

A Physical Common Sense benchmark with Space, Time, and other-physics categories.
An Embodied Reasoning benchmark spanning robotics and autonomous-vehicle domains, evaluated across datasets such as RoboVQA, BridgeData V2, AgiBot, HoloAssist, RoboFail, and AV.
An Intuitive Physics benchmark probing the arrow of time, spatial puzzles, and object permanence.

The paper reports that both the SFT and RL stages yield significant gains. Averaged across the physical-common-sense and the six embodied-reasoning datasets, the 7B model rises from 60.7 after supervised fine-tuning to 65.7 after Physical AI reinforcement learning, a gain of about 5.0 points.^[1] On the Intuitive Physics benchmark, which averages arrow-of-time, spatial-puzzle, and object-permanence tasks, the 7B model reaches roughly 74.5 after SFT, far above a random baseline near 41.7 that the authors note general VLMs barely exceed, and improves further with RL.^[1] The 56B model posts comparable or stronger results on the physical-common-sense and embodied-reasoning suites. (Reported figures vary slightly by checkpoint and by the exact mix of benchmark datasets averaged; the published Cosmos-Reason1-7B model card lists a 65.1 average across its six embodied-reasoning datasets.)^[3]

Is Cosmos Reason open source?

NVIDIA released Cosmos Reason as an open model. The Cosmos-Reason1-7B weights are distributed on Hugging Face under the NVIDIA Open Model License, which permits commercial use, while the accompanying source code is under the Apache 2.0 license, and the GitHub repository provides inference scripts, post-training tutorials for both SFT and reinforcement learning, and FP8 quantization utilities.^[3]^[4] The model was iteratively updated after launch, adding an enhanced critic for physical plausibility and temporal timestamp captioning, and gaining spatial-temporal reasoning aimed at city-scale and industrial operations.^[7]

Why does Cosmos Reason matter?

Cosmos Reason is significant as one of the first openly released reasoning models purpose-built for physical AI rather than for text or general multimodal tasks.^[1] By packaging physical common sense, embodied reasoning, and inspectable chain-of-thought into a customizable VLM, it gives robotics and autonomous-vehicle developers a high-level "thinking" component that complements the generative Cosmos models and the broader physical AI tooling NVIDIA assembles around simulation, synthetic data, and policy training. Its role as a data curator and critic also addresses a central bottleneck in the field, the cost and quality of the video data needed to train embodied systems, positioning it within the larger pipeline that NVIDIA later consolidated into the omni-model design of Cosmos 3.

References

NVIDIA Cosmos research team. "Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning." arXiv:2503.15558, March 2025 (revised May 2025). https://arxiv.org/abs/2503.15558 ↩
NVIDIA Cosmos Lab. "Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning." NVIDIA Research. https://research.nvidia.com/labs/cosmos-lab/cosmos-reason1/
NVIDIA. "Cosmos-Reason1-7B" model card. Hugging Face. https://huggingface.co/nvidia/Cosmos-Reason1-7B ↩
NVIDIA. "cosmos-reason1" repository. GitHub. https://github.com/nvidia-cosmos/cosmos-reason1 ↩
NVIDIA Newsroom. "NVIDIA Announces Major Release of Cosmos World Foundation Models and Physical AI Data Tools." March 18, 2025. https://nvidianews.nvidia.com/news/nvidia-announces-major-release-of-cosmos-world-foundation-models-and-physical-ai-data-tools ↩
NVIDIA Newsroom. "NVIDIA Launches Cosmos World Foundation Model Platform to Accelerate Physical AI Development." January 6, 2025. https://nvidianews.nvidia.com/news/nvidia-launches-cosmos-world-foundation-model-platform-to-accelerate-physical-ai-development ↩
NVIDIA Technical Blog. "Curating Synthetic Datasets to Train Physical AI Models with NVIDIA Cosmos Reason." https://developer.nvidia.com/blog/curating-synthetic-datasets-to-train-physical-ai-models-with-nvidia-cosmos-reason/ ↩
NVIDIA Technical Blog. "Maximize Robotics Performance by Post-Training NVIDIA Cosmos Reason." https://developer.nvidia.com/blog/maximize-robotics-performance-by-post-training-nvidia-cosmos-reason/ ↩
NVIDIA. "NVIDIA Cosmos: World Foundation Models Powering Physical AI." https://www.nvidia.com/en-us/ai/cosmos/ ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

Imitation Learning Isaac GR00T Nemotron-H Sim-to-real transfer

What is Cosmos Reason?

What is Physical AI reasoning?

How does Cosmos Reason fit into the Cosmos platform?

What architecture does Cosmos-Reason1 use?

How is Cosmos Reason trained?

How well does Cosmos Reason perform on benchmarks?

Is Cosmos Reason open source?

Why does Cosmos Reason matter?

References

Improve this article

Related Articles

ERQA

Isaac GR00T

PaLM-E: An Embodied Multimodal Language Model

SmolVLA

NVIDIA Cosmos

MimicGen

What links here

Related Articles

ERQA

Isaac GR00T

PaLM-E: An Embodied Multimodal Language Model

SmolVLA

NVIDIA Cosmos

MimicGen

What links here