Spatial intelligence

Computer Vision Embodied AI Generative AI

Updated Jun 27, 2026

Spatial intelligence is the ability of an AI system to perceive, understand, reason about, generate, and interact with three-dimensional space rather than just text or two-dimensional pixels. Where large language models operate on sequences of tokens that describe the world in words, spatially intelligent systems operate on representations of the world itself, including its geometry, physics, dynamics, and the way agents move and act within it. The phrase has been popularized since 2024 by computer scientist Fei-Fei Li, her startup World Labs, and a growing community of researchers who argue that genuine progress toward general-purpose AI requires models that understand worlds, not just words ^[1]^[2].

Li has framed spatial intelligence as the natural successor to the language-centric era of AI. In her April 2024 TED talk and her November 2025 essay "From Words to Worlds," she argues that perception and action, not language, were the original drivers of biological intelligence, and that machines will only become truly useful in the physical world when they can build, manipulate, and predict spatial structure as fluently as today's large language models manipulate text ^[1]^[3]. The concept overlaps with, but is distinct from, the closely related notion of a world model. A world model is a learned representation of environmental dynamics used for prediction and planning. Spatial intelligence is broader: it names the overall cognitive capability of which world models, 3D reconstruction, geometric reasoning, and embodied control are component parts.

By 2026, spatial intelligence had become one of the most active and well-funded frontiers in AI. World Labs raised approximately $230 million in a 2024 Series A and a further $1 billion in a February 2026 Series B, bringing total funding to roughly $1.23 billion to build foundation models for 3D environments ^[4]^[5]. Google DeepMind, Meta, NVIDIA, and Wayve all released world model systems with explicit spatial reasoning ambitions ^[6]^[7]^[8]^[9], while Niantic Spatial pushed visual positioning into mainstream augmented reality ^[10]. The shared bet is that the next leap in AI capability will come from grounding intelligence in three-dimensional space rather than two-dimensional text.

What is spatial intelligence?

Spatial intelligence has both an old meaning in cognitive science and a newer technical meaning in AI. The older usage traces to developmental psychologist Howard Gardner, who in his 1983 book Frames of Mind listed spatial intelligence as one of the seven intelligences he originally proposed (an eighth was added later), defined as the capacity to think in images, perceive the visual world accurately, and mentally manipulate three-dimensional structures ^[11]. Architects, sculptors, surgeons, athletes, and chess players have long been described as spatially intelligent in this sense.

The AI usage adopted by Li and World Labs preserves the spirit of Gardner's definition but extends it to machines. In Li's framing, spatial intelligence in AI involves four interlinked capabilities ^[1]^[2]^[3]:

Perception of 3D structure. Building geometrically consistent representations of a scene from images, video, depth, or other sensor data.
Generative imagination of worlds. Synthesizing new 3D environments that respect geometry, physics, and semantics, conditioned on text, images, video, or other inputs.
Reasoning about space and dynamics. Predicting how the world will change under actions, how objects will collide, deform, or move, and how viewpoints relate.
Interaction with embodied agents. Supporting decision making by robots, vehicles, or virtual avatars that need to move and manipulate things in the world.

This is a deliberately broader formulation than "3D vision" or "world model." It treats geometry, physics, embodiment, and generation as facets of a single cognitive capability that current AI systems lack and that Li argues is required for AI to be useful outside the chat window ^[1]. In her TED talk, Li traces the capability back to the origins of vision in evolution: "Sight turned into insight; Seeing became understanding; Understanding led to action," and she stresses that "Simply seeing is not enough. Seeing is for doing and learning" ^[1].

Where does spatial intelligence come from?

Roots in computer vision

The technical foundations of spatial intelligence sit in classical and modern computer vision. Decades of research in structure-from-motion, multi-view stereo, simultaneous localization and mapping (SLAM), and photogrammetry created the mathematical machinery to recover 3D structure from 2D images. Li herself helped define an earlier era of the field by leading the creation of ImageNet in the late 2000s, a dataset that catalyzed the deep learning revolution in image recognition ^[12]. For much of the 2010s, vision research focused on recognition. Networks like AlexNet, ResNet, and later vision transformers learned to label images with increasing accuracy but did not generally produce explicit 3D scene representations. The shift toward spatially grounded vision accelerated in the late 2010s as researchers combined deep learning with classical geometry.

Neural radiance fields and Gaussian splatting

Two technical breakthroughs reshaped 3D representation in AI during the 2020s. The first was neural radiance fields (NeRFs), introduced by Ben Mildenhall and colleagues in 2020. A NeRF represents a 3D scene as a continuous function from spatial coordinates and viewing directions to color and density, parameterized by a small neural network. Given a set of input photos with known camera poses, a NeRF can be optimized to render novel views with photorealistic quality. NeRFs demonstrated that neural networks could capture rich 3D structure without explicit meshes or point clouds, and they triggered an explosion of follow-up work on faster training, larger scenes, dynamic content, and editing.
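As a rough illustration of how a radiance field turns per-sample color and density into a pixel, the sketch below applies the standard volume-rendering quadrature along a single ray. The sample values passed in are stand-ins for what the neural network would output, and the function name `render_ray` is hypothetical rather than taken from any NeRF codebase.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray, ordered from near to far.

    densities: non-negative volume density (sigma) at each sample
    colors:    (r, g, b) tuple at each sample
    deltas:    distance covered by each sample along the ray
    Returns the pixel color and the transmittance left after the last sample.
    """
    transmittance = 1.0  # fraction of light still reaching the current sample
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha            # light absorbed by this segment
    return tuple(pixel), transmittance
```

Calling it with an empty first sample and a very dense red second sample yields a red pixel with almost no remaining transmittance, which is the behavior expected of a ray that crosses free space and then hits an opaque surface.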

The second breakthrough was 3D Gaussian splatting, introduced in 2023 by researchers at Inria and the Max Planck Institute. Rather than encoding a scene implicitly in a neural network, Gaussian splatting represents a scene as millions of small anisotropic 3D Gaussians with position, color, opacity, and orientation, rendered through highly efficient rasterization. Gaussian splatting matched or exceeded NeRF quality while enabling true real-time rendering on commodity GPUs, and it quickly became the dominant 3D representation for industrial applications, including World Labs's Marble product ^[13]^[14].
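The word "anisotropic" can be made concrete with a small sketch. In the parameterization described in the original 3D Gaussian splatting paper, each primitive's covariance is built from a per-axis scale S and a rotation R as Σ = R S Sᵀ Rᵀ. The per-axis `scale` field, the class name `Gaussian3D`, and the helper names below are assumptions made for illustration; production renderers do this on the GPU.

```python
from dataclasses import dataclass


@dataclass
class Gaussian3D:
    position: tuple  # (x, y, z) center of the Gaussian
    scale: tuple     # per-axis standard deviations (sx, sy, sz); assumed field
    rotation: tuple  # orientation as a unit quaternion (w, x, y, z)
    color: tuple     # (r, g, b)
    opacity: float   # in [0, 1]


def rotation_matrix(q):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(g):
    """Return the 3x3 covariance Sigma = R S S^T R^T of one Gaussian."""
    r = rotation_matrix(g.rotation)
    # m = R @ S, with S = diag(scale): scale each column of R.
    m = [[r[i][j] * g.scale[j] for j in range(3)] for i in range(3)]
    # Sigma = m @ m^T, which is symmetric and positive semi-definite.
    return [
        [sum(m[i][k] * m[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
```

Building the covariance from a scale and a rotation, rather than storing its six free entries directly, keeps it a valid covariance matrix no matter how the parameters are adjusted during optimization.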

Mildenhall, the lead author of the original NeRF paper, is a co-founder of World Labs, alongside Christoph Lassner, another long-time graphics and vision researcher whose Pulsar differentiable renderer helped lay groundwork for Gaussian splatting. Their joint presence reflects how directly the spatial intelligence agenda builds on the NeRF and Gaussian splatting lineage ^[4]^[15].

Embodied AI and world models

The other major foundation is the field of embodied AI, which studies agents that learn by acting in physical or simulated environments. It established the importance of egocentric perception, manipulation, navigation, and the integration of vision with control, and revived interest in classical world models. In 2018, David Ha and Jürgen Schmidhuber popularized the modern term "world model" with a paper that trained a generative model of game environments and demonstrated agents that could plan inside their own dreams. By the early 2020s, world models had become central to model-based reinforcement learning and large-scale video prediction. The current wave of spatial intelligence work sits at the confluence of these two threads: NeRF-style and Gaussian-splatting-style 3D representations on the perception side, and world models on the dynamics side.
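The idea of planning inside a model can be shown with a toy example. The sketch below is not Ha and Schmidhuber's method, which paired a learned generative model with a separately trained controller. It substitutes exhaustive search over short action sequences, and `learned_dynamics` is a hypothetical stand-in for a trained model. What it shares with the original idea is that every candidate plan is scored by imagined rollouts rather than by acting in the real environment.

```python
from itertools import product


def learned_dynamics(state, action):
    """Hypothetical stand-in for a learned model: a 1D position moved by the action."""
    return state + action


def plan(state, goal, horizon=3, actions=(-1, 0, 1)):
    """Imagine every action sequence of length `horizon` and keep the best one."""
    best_sequence, best_cost = None, float("inf")
    for sequence in product(actions, repeat=horizon):
        imagined = state
        for action in sequence:
            # Each step happens inside the model; the real environment is never touched.
            imagined = learned_dynamics(imagined, action)
        cost = abs(goal - imagined)  # distance from the goal after the rollout
        if cost < best_cost:
            best_sequence, best_cost = sequence, cost
    return best_sequence, best_cost
```

Exhaustive search only works for tiny action spaces and short horizons. Practical systems replace it with sampling, gradient-based optimization, or a learned policy, but the imagined-rollout loop has the same shape.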

How does spatial intelligence differ from language models and world models?

A recurring theme in Li's writing is that language alone is an insufficient foundation for general intelligence. In her 2025 essay "From Words to Worlds," she describes current LLMs as "wordsmiths in the dark, eloquent but inexperienced, knowledgeable but ungrounded," able to manipulate symbols about the world but unable to estimate distances, mentally rotate objects, predict physical interactions, or maintain spatial memory in any reliable way ^[2]. Spatial intelligence is positioned as the complement that LLMs lack.

The relationship between spatial intelligence and world models is more subtle. World models are one technical implementation of part of spatial intelligence: specifically, the ability to predict future states given actions. In the same essay, Li defines world models through "three essential capabilities": they must be generative (able to "generate worlds with perceptual, geometrical, and physical consistency"), multimodal (able to process inputs across a wide range of forms), and interactive (able to "output the next states based on input actions") ^[2]. Spatial intelligence as a concept also includes perception, geometric reasoning, generation of static scenes, and the broader integration of all of these into agents. Many world-model systems in 2025-2026 are spatially intelligent by design, but some video-generation systems labeled as world models do not actually reason about 3D geometry and would not satisfy a strict definition of spatial intelligence.

The table below contrasts the three concepts as they are typically used.

| Concept | Primary substrate | Core capability | Key examples |
| --- | --- | --- | --- |
| Large language model | Text tokens | Predict next token; manipulate symbols and concepts in language | GPT-class models, Claude, Gemini |
| World model | Learned environment dynamics | Predict future states given actions; support planning | Ha and Schmidhuber 2018, Dreamer, GAIA, Genie, Cosmos |
| Spatial intelligence | 3D geometry, physics, embodiment | Perceive, generate, reason about, and act in 3D worlds | World Labs Marble, Genie 3, Gaussian splatting pipelines, embodied robot policies |

Li has been careful to note that spatial intelligence is not in opposition to language models. The intended endpoint is multimodal systems that combine the symbolic fluency of LLMs with the spatial and physical grounding of world models and 3D representations, an architecture sometimes informally called "large world models" or "large multimodal models" ^[2]^[16].

What are the technical pillars of spatial intelligence?

Research in spatial intelligence draws on several distinct but converging technical pillars. The table below summarizes the main pillars and representative research directions associated with each.

| Pillar | What it provides | Representative methods and systems |
| --- | --- | --- |
| Scene understanding | Semantic parsing of objects, surfaces, and relations in 2D and 3D | Open-vocabulary segmentation, 3D scene graphs, vision-language models with spatial grounding |
| 3D reconstruction | Geometric models of real or imagined scenes from images or video | Structure-from-motion, multi-view stereo, NeRFs, Gaussian splatting |
| Generative world synthesis | Creation of new 3D scenes from text, images, or video prompts | World Labs Marble, NVIDIA Lyra, video diffusion models distilled into 3D |
| Geometric and physical reasoning | Predictions that respect spatial layout, occlusion, contact, and physical law | Differentiable physics, neural simulators, physics-aware video models |
| Dynamics and prediction | Forecasts of how scenes evolve under time and action | World models, latent dynamics models, video prediction |
| Embodiment and control | Linking perception to action for robots and agents | Vision-language-action models, PaLM-E, policy learning in simulation, sim-to-real transfer |

A recurring engineering insight is that no single representation suffices. Industrial systems typically combine 2D foundation models, explicit 3D structure, and learned dynamics. A 2025 survey on embodied spatial intelligence argued that the most promising path is to integrate high-quality 3D structure with large-scale 2D foundation models rather than treat them as competing options ^[17].

Scene understanding

Spatially intelligent systems need to know not just what objects exist in an image but where they are, how they relate, and what affordances they offer. Research in open-vocabulary 3D segmentation, 3D scene graphs, and vision-language models with spatial grounding aims to fill this gap. Benchmarks measuring distance estimation, object rotation, and spatial memory have repeatedly shown that even state-of-the-art multimodal models struggle with tasks that are trivial for humans ^[18].
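A 3D scene graph is one way to hold the "what, where, and how related" information described above. The sketch below is a minimal data structure, not the format of any particular system: the class names, the use of object centers in a shared metric frame, and the free-text relation predicates are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    label: str
    center: tuple  # (x, y, z) in meters, in a hypothetical shared world frame


@dataclass
class SceneGraph:
    objects: dict = field(default_factory=dict)    # label -> SceneObject
    relations: list = field(default_factory=list)  # (subject, predicate, object)

    def add(self, obj):
        self.objects[obj.label] = obj

    def relate(self, subject, predicate, obj):
        self.relations.append((subject, predicate, obj))

    def distance(self, a, b):
        """Euclidean distance between two object centers."""
        return math.dist(self.objects[a].center, self.objects[b].center)

    def related_to(self, label):
        """All (predicate, object) pairs in which `label` is the subject."""
        return [(p, o) for s, p, o in self.relations if s == label]
```

With explicit positions in a metric frame, a question such as "how far is the lamp from the table?" becomes a lookup and a distance computation. Answering the same question from pixels alone is the kind of task the benchmarks above report multimodal models handling poorly.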

3D reconstruction

Reconstruction methods convert images and video into explicit or implicit 3D representations. Classical structure-from-motion and multi-view stereo remain workhorses for tasks like mapping. Neural approaches such as NeRFs and Gaussian splatting deliver higher fidelity for novel-view synthesis and have become standard outputs of generative pipelines. By 2025, Gaussian splatting had been put forward as a candidate addition to the glTF standard, leading some commentators to describe it as the emerging "JPEG of 3D" ^[14].
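For a sense of the geometry behind classical stereo reconstruction, the simplest case is a rectified two-camera rig, where depth follows from disparity as Z = f · B / d. The sketch below assumes an ideal pinhole camera with focal length in pixels and principal point (cx, cy); the function and parameter names are illustrative only.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in meters of a point seen by two rectified cameras.

    focal_px:     focal length in pixels
    baseline_m:   distance between the two camera centers in meters
    disparity_px: horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity_px


def backproject(u, v, depth_m, focal_px, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters little for nearby points and a great deal for distant ones. That sensitivity is one reason multi-view methods combine many images rather than relying on a single pair.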

Generative world synthesis

The defining new capability of the 2024-2026 era is generative spatial synthesis: producing entire 3D worlds from text, image, or video prompts. World Labs's Marble, launched in late 2025, generates persistent, downloadable 3D environments using Gaussian splatting representations and lets users export them as Gaussian splats, triangle meshes, or videos ^[13]^[14]. NVIDIA's research lab introduced Lyra, a system that distills video-diffusion knowledge into Gaussian splat scenes at inference time ^[19]. Earlier work like DreamFusion and Magic3D had shown that diffusion priors could be lifted into 3D, but the new wave aimed at producing entire navigable worlds rather than single objects.

Geometric and physical reasoning

A spatially intelligent system needs to understand not only what a scene looks like but how it behaves. This requires priors about occlusion, contact, friction, gravity, and other physical laws. Researchers have explored physics-aware video models, differentiable simulators, and hybrid approaches that combine learned dynamics with explicit physical equations. Whether large video generation models like Sora possess such priors remains contested; a 2025 analysis argued that Sora and V-JEPA had not yet learned a complete physical world model ^[20].
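One way to picture the hybrid approach is an explicit physics step followed by a learned correction. The sketch below uses a one-dimensional falling object, a semi-implicit Euler integrator, and a `residual` callable that stands in for a trained network. All names are hypothetical, and the default residual does nothing, so the function reduces to plain physics.

```python
GRAVITY = -9.81  # m/s^2, acting along the vertical axis


def physics_step(pos, vel, dt):
    """Semi-implicit Euler: update velocity first, then position."""
    vel = vel + GRAVITY * dt
    pos = pos + vel * dt
    return pos, vel


def no_residual(pos, vel):
    """Default correction: trust the explicit equations completely."""
    return 0.0, 0.0


def hybrid_step(pos, vel, dt, residual=no_residual):
    """Explicit physics plus a learned correction (a stand-in callable here)."""
    pos, vel = physics_step(pos, vel, dt)
    d_pos, d_vel = residual(pos, vel)  # e.g. unmodeled drag predicted by a network
    return pos + d_pos, vel + d_vel
```

The appeal of this split is that the known equations carry most of the prediction, so the learned part only has to model what they leave out.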

Dynamics and prediction

Dynamics is the home territory of world models. Methods like Dreamer in reinforcement learning, GAIA in autonomous driving, Genie in interactive environments, and Cosmos for robotics all learn to predict future frames or future latent states conditioned on actions ^[6]^[7]^[8]. These models can be evaluated on rollout fidelity, controllability, and downstream usefulness for planning or policy learning.
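Rollout fidelity can be illustrated with a toy comparison between a model's open-loop predictions and the true trajectory. The sketch below uses a scalar state and a deliberately biased `biased_model`; both are hypothetical stand-ins, since real systems compare frames or latent states with richer metrics.

```python
def true_step(state, action):
    """Ground-truth dynamics of the toy environment."""
    return state + action


def biased_model(state, action):
    """Hypothetical learned model that underestimates the effect of each action."""
    return state + 0.9 * action


def rollout(model, state, actions):
    """Apply the model repeatedly, feeding each prediction back in as the next input."""
    states = []
    for action in actions:
        state = model(state, action)
        states.append(state)
    return states


def rollout_error(model, initial_state, actions, true_states):
    """Absolute error at each step of an open-loop rollout."""
    predicted = rollout(model, initial_state, actions)
    return [abs(p - t) for p, t in zip(predicted, true_states)]
```

Because each prediction is fed back as input, a small per-step bias accumulates over the horizon. In this toy case the error grows with every step, which is why long rollouts are a harder test than single-step prediction.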

Embodiment and control

Finally, spatial intelligence is connected to embodiment. Vision-language-action models, such as Google's RT-2 and PaLM-E, fuse perception, language, and motor control in a single network. World models trained on robotics data, such as NVIDIA Cosmos, are designed to provide a shared simulation substrate for many embodied agents ^[8]. Wayve's GAIA series applies the same logic to autonomous driving, using generated worlds to expose vehicles to rare and dangerous scenarios at scale ^[9].

Which companies are building spatial intelligence?

The table below summarizes major organizations working in spatial intelligence and their representative products or research programs.

| Organization | Focus | Flagship system or program |
| --- | --- | --- |
| World Labs | Foundation models for 3D worlds | Marble (text/image/video to 3D scenes with Gaussian splatting) ^[13] |
| Google DeepMind | Interactive generative world models | Genie 3 (real-time playable 3D environments from prompts) ^[6] |
| Meta AI | Self-supervised video world models | V-JEPA 2 (video joint-embedding predictive architecture) ^[7] |
| NVIDIA | Physics-aware world foundation models | NVIDIA Cosmos platform (Predict, Transfer, Reason) ^[8] |
| Wayve | Driving-specific world models | GAIA-1, GAIA-2, GAIA-3 ^[9] |
| Niantic Spatial | Geospatial visual positioning | Visual Positioning System and Large Geospatial Model ^[10] |
| Stanford and academic labs | Foundational research | Embodied spatial intelligence surveys, SpaVLE workshop series ^[17]^[21] |

World Labs

World Labs is the company most strongly identified with the term spatial intelligence. It was co-founded in 2023 by Fei-Fei Li, Justin Johnson, Ben Mildenhall, and Christoph Lassner, and emerged from stealth in 2024 ^[4]^[15]. The team combines deep learning research, computer vision, and graphics, and it has explicitly positioned itself as building "frontier models that can perceive, generate, reason and interact with the 3D world" ^[15].

In September 2024, the company announced approximately $230 million in Series A funding at a valuation of about $1 billion, with investors including Andreessen Horowitz, NVIDIA's venture arm, and Radical Ventures ^[4]. On February 18, 2026, it raised a further $1 billion Series B, reportedly at a valuation near $5 billion, from a group that included NVIDIA, AMD, Autodesk (which alone contributed about $200 million), Andreessen Horowitz, and Fidelity, taking its total funding to roughly $1.23 billion and confirming spatial intelligence as a heavily capitalized category ^[5].

World Labs's first commercial product, Marble, opened in a limited release in September 2025 and launched publicly on November 12, 2025. Marble accepts text, image, video, panorama, or coarse 3D-layout prompts and produces persistent, downloadable 3D worlds represented as Gaussian splat scenes, with an in-app hybrid editor that lets users block out spatial structures before the model fills in visual detail ^[13]. Outputs can be exported as Gaussian splats, triangle meshes, or rendered videos, and Marble is positioned for applications in gaming, virtual production, design, and immersive media ^[13].

Google DeepMind

Google DeepMind has developed an influential series of generative world models under the Genie name. Genie 1 in early 2024 demonstrated playable 2D environments generated from images. Genie 2 in late 2024 extended this to richer 3D-feeling worlds for short rollouts. Genie 3, announced in August 2025, generates interactive 3D environments from text prompts that run in real time at 24 frames per second and 720p resolution for several minutes while maintaining spatial consistency, including a degree of emergent object permanence in which changes to a scene persist after the camera looks away ^[6]. In January 2026, Google launched Project Genie as a consumer-facing experiment on Google Labs.
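The real-time figures quoted for Genie 3 translate into a tight per-frame budget, which a few lines of arithmetic make explicit. The sketch below assumes 720p means a 1280 × 720 frame, and it uses three minutes as an illustrative session length, since the source says only "several minutes".

```python
FPS = 24                   # frame rate reported for Genie 3
WIDTH, HEIGHT = 1280, 720  # assumed 16:9 frame size for 720p


def frame_budget_ms(fps):
    """Maximum time available to produce one frame, in milliseconds."""
    return 1000.0 / fps


def frames_in(minutes, fps):
    """Number of frames in a session of the given length."""
    return int(minutes * 60 * fps)


def pixels_per_second(width, height, fps):
    """Raw pixel throughput needed to sustain the stream."""
    return width * height * fps
```

At 24 frames per second the model has a little under 42 milliseconds for each frame, and a three-minute session spans 4,320 frames that all have to stay mutually consistent.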

Meta AI

Meta's contribution to spatial intelligence has centered on self-supervised video models built around the Joint Embedding Predictive Architecture (JEPA) advocated by Yann LeCun. V-JEPA 2, released in June 2025, is a 1.2-billion-parameter model pre-trained on more than one million hours of internet video plus around one million images, then adapted with roughly 62 hours of robot trajectories to support embodied control of robot arms ^[7]. The JEPA family is designed to learn predictive world representations in a self-supervised, non-generative way, an alternative bet on how spatial intelligence might be built.

NVIDIA

NVIDIA has positioned itself as the infrastructure layer for spatial AI through its Cosmos platform, introduced in January 2025. Cosmos is a family of open world foundation models trained on tens of millions of hours of real-world video data and organized into three lines: Predict for future-state simulation, Transfer for sim-to-real bridging, and Reason for physics-aware reasoning ^[8]. Cosmos models are released openly through Hugging Face and have been widely adopted as a substrate for robotics and autonomous vehicle development.

Wayve

Wayve, a London-based autonomous driving company, has built a production-oriented spatial intelligence stack through its GAIA series of world models. GAIA-1 introduced a 9-billion-parameter generative world model for driving in 2023. GAIA-2 improved controllability and geographic diversity. GAIA-3, announced in 2025, is a 15-billion-parameter latent diffusion world model trained on roughly ten times more data than GAIA-2, designed for safety evaluation of autonomous driving policies at scale ^[9].

Niantic Spatial

The game and AR company Niantic spun out Niantic Spatial in 2025 to focus on visual positioning and geospatial AI. Its Visual Positioning System anchors content to physical locations with centimeter-level accuracy by matching live camera imagery to a Large Geospatial Model built from billions of crowdsourced images ^[10]. Niantic represents the geospatial wing of spatial intelligence, where the world to be modeled is the actual planet rather than a generated environment.

What is spatial intelligence used for?

Spatial intelligence is being applied across several industries.

Robotics and autonomous vehicles

The most direct application is to embodied systems. Robots and self-driving cars need to understand 3D space, predict how it will evolve, and act within it. World models like NVIDIA Cosmos and Wayve GAIA provide simulated environments in which policies can be trained safely and at scale, including rare and dangerous events that are hard to encounter in the real world ^[8]^[9]. Vision-language-action models like PaLM-E and RT-2 integrate spatial perception with language understanding so that robots can follow natural-language instructions in the physical world.

Gaming, film, and immersive media

Generative world synthesis is reshaping content creation. Marble, Genie 3, and similar systems let designers prototype environments from prompts in minutes rather than weeks, and the resulting Gaussian splat or volumetric outputs can be imported into existing engines. Visual effects pipelines have begun adopting NeRF and Gaussian splatting for set extension, virtual production, and post-production. Game studios are exploring world models as procedural content generators and as tools for non-player character behavior ^[13]^[22].

Augmented and virtual reality

Spatial intelligence is a precondition for compelling AR and VR. Persistent, shared world maps such as Niantic's Large Geospatial Model anchor AR objects to physical locations across users and sessions ^[10]. On the device side, vision-based positioning and scene understanding are required for occlusion, lighting, and plausible placement of virtual content. World Labs has named immersive experiences as a target application for Marble ^[15].

Design, architecture, and engineering

Generative 3D tools are starting to support architectural visualization, product design, and engineering simulation. The February 2026 World Labs funding round included Autodesk as a strategic investor with about $200 million, signaling integration plans with computer-aided design workflows ^[5]. Combined with physical reasoning, spatial intelligence systems could support iterative design where geometry, structural behavior, and aesthetics are co-optimized.

Science and healthcare

Li has consistently linked spatial intelligence to scientific discovery and medicine. Her own Stanford research has applied AI-driven sensing to clinical environments to monitor patient safety and reduce staff burnout ^[3]. More broadly, spatial AI is being applied to molecular simulation, microscopy, and 3D medical imaging, where the fundamental data is inherently spatial rather than textual ^[2].

Education

Spatial intelligence is being explored as a foundation for new educational tools, including interactive simulations that let learners explore complex concepts in physics, biology, and engineering through immersive 3D experiences. Li frames this as restoring the bodily and spatial dimensions of learning that text-based AI tutors miss ^[2].

What are the limitations of spatial intelligence?

Despite rapid progress, spatial intelligence remains far behind human capability in several respects.

Physical fidelity. Current generative world models often produce visually striking scenes that nonetheless violate basic physics, such as objects passing through each other, water that does not conserve volume, or shadows that do not match light sources. Independent evaluations have repeatedly shown that even leading video and 3D models struggle to enforce consistent physical laws ^[20].
Long-horizon consistency. Maintaining a coherent world over long timescales, especially when users revisit a location, remains an unsolved problem. Genie 3 made notable progress by maintaining consistency for several minutes ^[6], but persistent multi-hour or multi-session worlds are still an open challenge.
Spatial reasoning in language models. Multimodal LLMs continue to perform poorly on simple spatial benchmarks involving distance, rotation, and viewpoint changes, suggesting that spatial competence does not emerge simply from scaling language models and is hard to graft onto them after the fact ^[18].
Evaluation. There is no settled benchmark for spatial intelligence comparable to MMLU or GSM8K for language models. Researchers are developing new evaluation suites that cover 3D question answering, embodied navigation, and physics prediction, but the field lacks a single agreed-upon scoreboard ^[17].
Data and compute. Training large spatial models requires vast amounts of paired 3D, video, and action data, which is harder to scrape from the public web than text. The large funding rounds of 2026, including World Labs's $1 billion Series B, reflect both the perceived opportunity and the heavy infrastructure cost of building these systems.

A further open question is how spatial intelligence relates to broader debates about AGI. Li, Yann LeCun, and others have argued that spatial grounding is essential and that language-only paths are insufficient ^[16]. Skeptics counter that sufficiently large multimodal models might implicitly capture spatial regularities without explicit 3D representations. How this debate resolves will depend on how well spatial models scale in the second half of the 2020s.

References

1. Li, F. (2024). "With Spatial Intelligence, AI Will Understand the Real World." TED2024. https://www.ted.com/talks/fei_fei_li_with_spatial_intelligence_ai_will_understand_the_real_world
2. Li, F. (2025). "From Words to Worlds: Spatial Intelligence is AI's Next Frontier." https://drfeifei.substack.com/p/from-words-to-worlds-spatial-intelligence
3. Li, F. (2025). "Spatial Intelligence Is AI's Next Frontier." *Time*. https://time.com/7339693/fei-fei-li-ai/
4. AI Business. (2024). "$1B Funding for Spatial Intelligence Startup." https://aibusiness.com/generative-ai/-1-billion-funding-for-spatial-intelligence-startup
5. TechCrunch. (2026). "World Labs lands $1B, with $200M from Autodesk, to bring world models into 3D workflows." https://techcrunch.com/2026/02/18/world-labs-lands-200m-from-autodesk-to-bring-world-models-into-3d-workflows/
6. Google DeepMind. (2025). "Genie 3: A new frontier for world models." https://deepmind.google/blog/genie-3-a-new-frontier-for-world-models/
7. Meta AI. (2025). "V-JEPA 2: Self-supervised video world models for embodied AI." https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
8. NVIDIA. (2025). "NVIDIA Launches Cosmos World Foundation Model Platform to Accelerate Physical AI Development." https://nvidianews.nvidia.com/news/nvidia-launches-cosmos-world-foundation-model-platform-to-accelerate-physical-ai-development
9. Wayve. (2025). "GAIA-3: Scaling World Models to Power Safety and Evaluation." https://wayve.ai/thinking/gaia-3/
10. Niantic Spatial. (2025). "Visual Positioning System for Real-World Positioning." https://www.nianticspatial.com/products/visual-positioning-system
11. Gardner, H. (1983). *Frames of Mind: The Theory of Multiple Intelligences*. Basic Books.
12. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). "ImageNet: A Large-Scale Hierarchical Image Database." *CVPR*.
13. TechCrunch. (2025). "Fei-Fei Li's World Labs speeds up the world model race with Marble, its first commercial product." https://techcrunch.com/2025/11/12/fei-fei-lis-world-labs-speeds-up-the-world-model-race-with-marble-its-first-commercial-product/
14. Lee, M. (2025). "Gaussian Splats Are Becoming the JPEG of 3D: Why 2025 Is the Breakout Year." Medium. https://medium.com/@qsibmini123/gaussian-splats-are-becoming-the-jpeg-of-3d-why-2025-is-the-breakout-year-ac841ed39440
15. World Labs. (2025). "About World Labs." https://www.worldlabs.ai/about
16. Entropy Town. (2025). "Why Fei-Fei Li, Yann LeCun and DeepMind Are All Betting on World Models." https://entropytown.com/articles/2025-11-13-world-model-lecun-feifei-li/
17. arXiv. (2025). "Embodied Spatial Intelligence: From Implicit Scene Modeling to Spatial Reasoning." https://arxiv.org/abs/2509.00465
18. Roboflow. (2025). "Spatial Intelligence in AI: World Models, 3D Vision & Action." https://blog.roboflow.com/spatial-intelligence/
19. NVIDIA Research. (2025). "Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation." https://research.nvidia.com/labs/toronto-ai/lyra/
20. Motamed, S., et al. (2025). "Do generative video models understand physical principles?" https://arxiv.org/abs/2501.09038
21. SpaVLE Workshop. (2025). "Workshop on Space in Vision, Language, and Embodied AI." NeurIPS 2025. https://space-in-vision-language-embodied-ai.github.io/
22. Introl. (2026). "World Models Race 2026: How LeCun, DeepMind, and Others Compete." https://introl.com/blog/world-models-race-agi-2026
