Spatial intelligence
Last reviewed
May 16, 2026
Sources
22 citations
Review status
Source-backed
Revision
v1 · 4,000 words
Spatial intelligence is a term used in artificial intelligence research to describe the ability of machines to perceive, understand, reason about, generate, and interact with three-dimensional space. Where large language models operate on sequences of tokens that describe the world in words, spatially intelligent systems are expected to operate on representations of the world itself, including its geometry, physics, dynamics, and the way agents move and act within it. The phrase has been popularized since 2024 by computer scientist Fei-Fei Li, her startup World Labs, and a growing community of researchers who argue that genuine progress toward general-purpose AI requires models that understand worlds, not just words [1][2].
Li has framed spatial intelligence as the natural successor to the language-centric era of AI. In her 2024 TED talk and her 2025 essay "From Words to Worlds," she argues that perception and action, not language, were the original drivers of biological intelligence, and that machines will only become truly useful in the physical world when they can build, manipulate, and predict spatial structure as fluently as today's large language models manipulate text [1][3]. The concept overlaps with, but is distinct from, the closely related notion of a world model. A world model is a learned representation of environmental dynamics used for prediction and planning. Spatial intelligence is broader: it names the overall cognitive capability of which world models, 3D reconstruction, geometric reasoning, and embodied control are component parts.
By 2026, spatial intelligence had become one of the most active and well-funded frontiers in AI. World Labs raised approximately $230 million in 2024 and a further $1 billion in early 2026 to build foundation models for 3D environments [4][5]. Google DeepMind, Meta, NVIDIA, and Wayve all released world model systems with explicit spatial reasoning ambitions [6][7][8][9], while Niantic Spatial pushed visual positioning into mainstream augmented reality [10]. The shared bet is that the next leap in AI capability will come from grounding intelligence in three-dimensional space rather than two-dimensional text.
Spatial intelligence has both an old meaning in cognitive science and a newer technical meaning in AI. The older usage traces to developmental psychologist Howard Gardner, whose 1983 book Frames of Mind listed spatial intelligence as one of the seven types of human intelligence in his original theory of multiple intelligences, defined as the capacity to think in images, perceive the visual world accurately, and mentally manipulate three-dimensional structures [11]. Architects, sculptors, surgeons, athletes, and chess players have long been described as spatially intelligent in this sense.
The AI usage adopted by Li and World Labs preserves the spirit of Gardner's definition but extends it to machines. In Li's framing, spatial intelligence in AI involves four interlinked capabilities: perceiving, generating, reasoning about, and interacting with three-dimensional worlds [1][2][3].
This is a deliberately broader formulation than "3D vision" or "world model." It treats geometry, physics, embodiment, and generation as facets of a single cognitive capability that current AI systems lack and that Li argues is required for AI to be useful outside the chat window [1].
The technical foundations of spatial intelligence sit in classical and modern computer vision. Decades of research in structure-from-motion, multi-view stereo, simultaneous localization and mapping (SLAM), and photogrammetry created the mathematical machinery to recover 3D structure from 2D images. Li herself helped define an earlier era of the field by leading the creation of ImageNet in the late 2000s, a dataset that catalyzed the deep learning revolution in image recognition [12]. For much of the 2010s, vision research focused on recognition. Networks like AlexNet, ResNet, and later vision transformers learned to label images with increasing accuracy but did not generally produce explicit 3D scene representations. The shift toward spatially grounded vision accelerated in the late 2010s as researchers combined deep learning with classical geometry.
Two technical breakthroughs reshaped 3D representation in AI during the 2020s. The first was neural radiance fields (NeRFs), introduced by Ben Mildenhall and colleagues in 2020. A NeRF represents a 3D scene as a continuous function from spatial coordinates and viewing directions to color and density, parameterized by a small neural network. Given a set of input photos with known camera poses, a NeRF can be optimized to render novel views with photorealistic quality. NeRFs demonstrated that neural networks could capture rich 3D structure without explicit meshes or point clouds, and they triggered an explosion of follow-up work on faster training, larger scenes, dynamic content, and editing.
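The rendering step behind a radiance field can be illustrated with a small sketch. The code below is a simplified, pure-Python version of the volume-rendering sum commonly used with such fields, not the original implementation: `toy_field` is a hypothetical stand-in for the trained network (a dense red sphere), and the ray bounds and sample count are arbitrary assumptions.

```python
import math

def toy_field(point, view_dir):
    """Hypothetical stand-in for a trained radiance field: maps a 3D point
    and a viewing direction to (rgb, density). Here the scene is a dense red
    unit sphere at the origin; view_dir is ignored, so this toy scene has no
    view-dependent effects."""
    x, y, z = point
    inside = x * x + y * y + z * z <= 1.0
    density = 5.0 if inside else 0.0
    return (1.0, 0.0, 0.0), density

def render_ray(field, origin, direction, near=0.0, far=4.0, n_samples=64):
    """Approximate the color seen along one ray by numerical quadrature:
    C = sum_i T_i * (1 - exp(-sigma_i * delta)) * c_i,
    where T_i is the transmittance accumulated before sample i."""
    delta = (far - near) / n_samples
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for i in range(n_samples):
        t = near + (i + 0.5) * delta  # midpoint of the i-th interval
        point = tuple(o + t * d for o, d in zip(origin, direction))
        rgb, sigma = field(point, direction)
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            color[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    return tuple(color), 1.0 - transmittance  # color, accumulated opacity

# One ray through the sphere, one ray that misses it entirely.
hit_color, hit_opacity = render_ray(toy_field, (0.0, 0.0, -2.0), (0.0, 0.0, 1.0))
miss_color, miss_opacity = render_ray(toy_field, (3.0, 0.0, -2.0), (0.0, 0.0, 1.0))
```

In an actual NeRF, `toy_field` would be a neural network, and its weights would be optimized so that rays rendered this way reproduce the pixels of the input photos.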
The second breakthrough was 3D Gaussian splatting, introduced in 2023 by researchers at Inria and the Max Planck Institute. Rather than encoding a scene implicitly in a neural network, Gaussian splatting represents a scene as millions of small anisotropic 3D Gaussians with position, color, opacity, and orientation, rendered through highly efficient rasterization. Gaussian splatting matched or exceeded NeRF quality while enabling true real-time rendering on commodity GPUs, and it quickly became the dominant 3D representation for industrial applications, including World Labs's Marble product [13][14].
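The two ingredients named above, anisotropic Gaussians and rasterized blending, can be sketched in a few lines. The example below is a simplified illustration under stated assumptions rather than the published renderer: it builds one Gaussian's covariance from a per-axis scale and a rotation quaternion (the commonly used factorization Sigma = R S S^T R^T), and blends depth-sorted splats for a single pixel. The `(depth, rgb, alpha)` tuples are hypothetical inputs, and projection to screen space is omitted.

```python
import math

def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix,
    normalizing it first."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def covariance(scale, quat):
    """Anisotropic covariance Sigma = R S S^T R^T, with S = diag(scale)."""
    R = quat_to_rotation(quat)
    # M = R @ S: column j of R scaled by scale[j]
    M = [[R[i][j] * scale[j] for j in range(3)] for i in range(3)]
    # Sigma = M @ M^T
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

def composite(splats):
    """Front-to-back alpha blending of the splats covering one pixel.
    Each splat is (depth, rgb, alpha); alpha is assumed to already include
    the Gaussian falloff at this pixel."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for _, rgb, alpha in sorted(splats, key=lambda s: s[0]):
        for k in range(3):
            color[k] += transmittance * alpha * rgb[k]
        transmittance *= 1.0 - alpha
    return tuple(color)

# A Gaussian stretched along x, rotated 90 degrees about the z axis,
# ends up stretched along y.
half = math.sqrt(0.5)
sigma = covariance(scale=(2.0, 0.5, 0.5), quat=(half, 0.0, 0.0, half))

pixel = composite([
    (2.0, (0.0, 0.0, 1.0), 1.0),  # opaque blue splat, farther away
    (1.0, (1.0, 0.0, 0.0), 0.5),  # half-transparent red splat, nearer
])
```

Storing scale and rotation separately, instead of a raw covariance matrix, keeps the matrix symmetric and positive semi-definite by construction, which is one reason this parameterization is convenient to optimize.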
Mildenhall, the lead author of the original NeRF paper, is a co-founder of World Labs, alongside Christoph Lassner, another long-time graphics and vision researcher. Their joint presence reflects how directly the spatial intelligence agenda builds on the NeRF and Gaussian splatting lineage [4][15].
The other major foundation is the field of embodied AI, which studies agents that learn by acting in physical or simulated environments. It established the importance of egocentric perception, manipulation, navigation, and the integration of vision with control, and revived interest in classical world models. In 2018, David Ha and Jürgen Schmidhuber popularized the modern term "world model" with a paper that trained a generative model of game environments and demonstrated agents that could be trained inside their own "dreams," that is, inside rollouts imagined by the model. By the early 2020s, world models had become central to model-based reinforcement learning and large-scale video prediction. The current wave of spatial intelligence work sits at the confluence of these two threads: NeRF-style and Gaussian-splatting-style 3D representations on the perception side, and world models on the dynamics side.
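The idea of acting inside a model's imagined rollouts can be made concrete with a toy example. The sketch below is not the Ha and Schmidhuber method: `model_step` is a hand-written, hypothetical stand-in for a learned dynamics model of a 1D point mass, and the planner is a simple random-shooting search, chosen only because it is short. All constants are illustrative assumptions.

```python
import random

def model_step(state, action):
    """Hypothetical stand-in for a learned dynamics model: predicts the next
    (position, velocity) of a 1D point mass given a force-like action."""
    position, velocity = state
    velocity = 0.9 * velocity + 0.1 * action
    position = position + velocity
    return (position, velocity)

def imagined_cost(state, actions, goal):
    """Roll the model forward over an action sequence, entirely inside the
    model, and return the distance to the goal at the end."""
    for action in actions:
        state = model_step(state, action)
    return abs(state[0] - goal)

def plan(state, goal, horizon=10, n_candidates=200, seed=0):
    """Random-shooting planner: sample action sequences, score each one
    inside the model, and return the best sequence with its imagined cost."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = imagined_cost(state, actions, goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

start, goal = (0.0, 0.0), 1.0
best_actions, best_cost = plan(start, goal)
do_nothing_cost = imagined_cost(start, [0.0] * 10, goal)
```

No action is ever executed in a real environment while the candidates are scored. That is the practical appeal of a world model: trial and error happens in prediction, and only the chosen plan is carried out.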
A recurring theme in Li's writing is that language alone is an insufficient foundation for general intelligence. In her 2025 essay "From Words to Worlds," she describes current LLMs as "wordsmiths in the dark, eloquent but inexperienced, knowledgeable but ungrounded," able to manipulate symbols about the world but unable to estimate distances, mentally rotate objects, predict physical interactions, or maintain spatial memory in any reliable way [2]. Spatial intelligence is positioned as the complement that LLMs lack.
The relationship between spatial intelligence and world models is more subtle. World models are one technical implementation of part of spatial intelligence: specifically, the ability to predict future states given actions. Spatial intelligence as a concept also includes perception, geometric reasoning, generation of static scenes, and the broader integration of all of these into agents. Many world-model systems in 2025-2026 are spatially intelligent by design, but some video-generation systems labeled as world models do not actually reason about 3D geometry and would not satisfy a strict definition of spatial intelligence.
The table below contrasts the three concepts as they are typically used.
| Concept | Primary substrate | Core capability | Key examples |
|---|---|---|---|
| Large language model | Text tokens | Predict next token; manipulate symbols and concepts in language | GPT-class models, Claude, Gemini |
| World model | Learned environment dynamics | Predict future states given actions; support planning | Ha and Schmidhuber 2018, Dreamer, GAIA, Genie, Cosmos |
| Spatial intelligence | 3D geometry, physics, embodiment | Perceive, generate, reason about, and act in 3D worlds | World Labs Marble, Genie 3, Gaussian splatting pipelines, embodied robot policies |
Li has been careful to note that spatial intelligence is not in opposition to language models. The intended endpoint is multimodal systems that combine the symbolic fluency of LLMs with the spatial and physical grounding of world models and 3D representations, an architecture sometimes informally called "large world models" or "large multimodal models" [2][16].
Research in spatial intelligence draws on several distinct but converging technical pillars. The table below summarizes the main pillars and representative research directions associated with each.
| Pillar | What it provides | Representative methods and systems |
|---|---|---|
| Scene understanding | Semantic parsing of objects, surfaces, and relations in 2D and 3D | Open-vocabulary segmentation, 3D scene graphs, vision-language models with spatial grounding |
| 3D reconstruction | Geometric models of real or imagined scenes from images or video | Structure-from-motion, multi-view stereo, NeRFs, Gaussian splatting |
| Generative world synthesis | Creation of new 3D scenes from text, images, or video prompts | World Labs Marble, NVIDIA Lyra, video diffusion models distilled into 3D |
| Geometric and physical reasoning | Predictions that respect spatial layout, occlusion, contact, and physical law | Differentiable physics, neural simulators, physics-aware video models |
| Dynamics and prediction | Forecasts of how scenes evolve under time and action | World models, latent dynamics models, video prediction |
| Embodiment and control | Linking perception to action for robots and agents | Vision-language-action models, PaLM-E, policy learning in simulation, sim-to-real transfer |
A recurring engineering insight is that no single representation suffices. Industrial systems typically combine 2D foundation models, explicit 3D structure, and learned dynamics. A 2025 survey on embodied spatial intelligence argued that the most promising path is to integrate high-quality 3D structure with large-scale 2D foundation models rather than treat them as competing options [17].
Spatially intelligent systems need to know not just what objects exist in an image but where they are, how they relate, and what affordances they offer. Research in open-vocabulary 3D segmentation, 3D scene graphs, and vision-language models with spatial grounding aims to fill this gap. Benchmarks measuring distance estimation, object rotation, and spatial memory have repeatedly shown that even state-of-the-art multimodal models struggle with tasks that are trivial for humans [18].
Reconstruction methods convert images and video into explicit or implicit 3D representations. Classical structure-from-motion and multi-view stereo remain workhorses for tasks like mapping. Neural approaches such as NeRFs and Gaussian splatting deliver higher fidelity for novel-view synthesis and have become standard outputs of generative pipelines. By 2025, Gaussian splatting had been adopted as a candidate addition to the glTF standard, signaling its emergence as the "JPEG of 3D" [14].
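As a minimal illustration of how classical multi-view methods recover depth, the sketch below triangulates a single 3D point from two camera rays using the midpoint method. It assumes the camera centers and ray directions are already known; in a real structure-from-motion pipeline they would come from feature matching and pose estimation, which are omitted here. Function names are illustrative.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def triangulate_midpoint(c1, d1, c2, d2):
    """Return the point closest to both rays c1 + s*d1 and c2 + t*d2,
    i.e. the midpoint of their common perpendicular."""
    w = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth is not observable")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = add(c1, scale(d1, s))
    p2 = add(c2, scale(d2, t))
    return scale(add(p1, p2), 0.5)

# Two cameras one unit either side of the origin, both observing a point
# five units in front of them.
true_point = (0.0, 0.0, 5.0)
cam_left, cam_right = (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)
estimate = triangulate_midpoint(
    cam_left, sub(true_point, cam_left),
    cam_right, sub(true_point, cam_right),
)
```

With noisy image measurements the two rays no longer intersect exactly, which is why the midpoint of the shortest connecting segment is used instead of a true intersection.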
The defining new capability of the 2024-2026 era is generative spatial synthesis: producing entire 3D worlds from text, image, or video prompts. World Labs's Marble, launched in late 2025, generates persistent, downloadable 3D environments using Gaussian splatting representations [13][14]. NVIDIA's research lab introduced Lyra, a system that distills video-diffusion knowledge into Gaussian splat scenes at inference time [19]. Earlier work like DreamFusion and Magic3D had shown that diffusion priors could be lifted into 3D, but the new wave aimed at producing entire navigable worlds rather than single objects.
A spatially intelligent system needs to understand not only what a scene looks like but how it behaves. This requires priors about occlusion, contact, friction, gravity, and other physical laws. Researchers have explored physics-aware video models, differentiable simulators, and hybrid approaches that combine learned dynamics with explicit physical equations. Whether large video generation models like Sora possess such priors remains contested; a 2025 analysis argued that Sora and V-JEPA had not yet learned a complete physical world model [20].
Dynamics is the home territory of world models. Methods like Dreamer in reinforcement learning, GAIA in autonomous driving, Genie in interactive environments, and Cosmos for robotics all learn to predict future frames or future latent states conditioned on actions [6][7][8]. These models can be evaluated on rollout fidelity, controllability, and downstream usefulness for planning or policy learning.
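Rollout fidelity can be illustrated with a toy measurement. In the sketch below, both `true_step` and `learned_step` are hypothetical 1D dynamics invented for the example; the "learned" model has a slightly wrong decay coefficient. The code feeds each model its own predictions over a fixed action sequence and records how far the predicted trajectory drifts from the true one at each step.

```python
def true_step(x, action):
    """Ground-truth dynamics of a toy 1D system (an assumption for this demo)."""
    return 0.95 * x + action

def learned_step(x, action):
    """Hypothetical learned model with a slightly wrong decay coefficient."""
    return 0.90 * x + action

def rollout(step_fn, x0, actions):
    """Open-loop rollout: each prediction is fed back in as the next input."""
    states, x = [], x0
    for a in actions:
        x = step_fn(x, a)
        states.append(x)
    return states

def rollout_errors(x0, actions):
    """Absolute error between predicted and true state at each horizon step."""
    truth = rollout(true_step, x0, actions)
    pred = rollout(learned_step, x0, actions)
    return [abs(p - t) for p, t in zip(pred, truth)]

errors = rollout_errors(x0=1.0, actions=[0.1] * 20)
one_step_error = errors[0]
long_horizon_error = errors[-1]
```

In this toy system a small one-step error compounds into a much larger error at longer horizons, which is why world models are usually judged on multi-step rollouts and not only on next-step prediction.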
Finally, spatial intelligence is connected to embodiment. Vision-language-action models, such as Google's RT-2 and PaLM-E, fuse perception, language, and motor control in a single network. World models trained on robotics data, such as NVIDIA Cosmos, are designed to provide a shared simulation substrate for many embodied agents [8]. Wayve's GAIA series applies the same logic to autonomous driving, using generated worlds to expose vehicles to rare and dangerous scenarios at scale [9].
The table below summarizes major organizations working in spatial intelligence and their representative products or research programs.
| Organization | Focus | Flagship system or program |
|---|---|---|
| World Labs | Foundation models for 3D worlds | Marble (text/image/video to 3D scenes with Gaussian splatting) [13] |
| Google DeepMind | Interactive generative world models | Genie 3 (real-time playable 3D environments from prompts) [6] |
| Meta AI | Self-supervised video world models | V-JEPA 2 (video joint-embedding predictive architecture) [7] |
| NVIDIA | Physics-aware world foundation models | NVIDIA Cosmos platform (Predict, Transfer, Reason) [8] |
| Wayve | Driving-specific world models | GAIA-1, GAIA-2, GAIA-3 [9] |
| Niantic Spatial | Geospatial visual positioning | Visual Positioning System and Large Geospatial Model [10] |
| Stanford and academic labs | Foundational research | Embodied spatial intelligence surveys, SpaVLE workshop series [17][21] |
World Labs is the company most strongly identified with the term spatial intelligence. It was co-founded in 2023 by Fei-Fei Li, Justin Johnson, Ben Mildenhall, and Christoph Lassner, and emerged from stealth in 2024 [4][15]. The team combines deep learning research, computer vision, and graphics, and it has explicitly positioned itself as building "frontier models that can perceive, generate, reason and interact with the 3D world" [15].
In September 2024, the company announced approximately $230 million in funding at a valuation of about $1 billion, with investors including Andreessen Horowitz, NVIDIA's venture arm, and Radical Ventures [4]. In early 2026, it raised a further $1 billion in fresh funding from a group that included NVIDIA, AMD, Autodesk, Andreessen Horowitz, and Fidelity, taking its total funding past $1.2 billion and confirming spatial intelligence as a heavily capitalized category [5].
World Labs's first commercial product, Marble, launched in November 2025. Marble accepts text, image, or video prompts and produces persistent, downloadable 3D worlds represented as Gaussian splat scenes, with an in-app hybrid editor that lets users block out spatial structures before the model fills in visual detail [13]. Marble is positioned for applications in gaming, virtual production, design, and immersive media.
Google DeepMind has developed an influential series of generative world models under the Genie name. Genie 1 in early 2024 demonstrated playable 2D environments generated from images. Genie 2 in late 2024 extended this to richer 3D-feeling worlds for short rollouts. Genie 3, announced in August 2025, generates interactive 3D environments from text prompts that run in real time at 24 frames per second and 720p resolution for several minutes while maintaining spatial consistency [6]. In January 2026, Google launched Project Genie as a consumer-facing experiment on Google Labs.
Meta's contribution to spatial intelligence has centered on self-supervised video models built around the Joint Embedding Predictive Architecture (JEPA) advocated by Yann LeCun. V-JEPA 2, released in 2025, was pre-trained on more than one million hours of internet video and fine-tuned on under 62 hours of robot trajectories to support embodied tasks [7]. The JEPA family is designed to learn predictive world representations in a self-supervised, non-generative way, an alternative bet on how spatial intelligence might be built.
NVIDIA has positioned itself as the infrastructure layer for spatial AI through its Cosmos platform, introduced in January 2025. Cosmos is a family of open world foundation models trained on tens of millions of hours of real-world video data and organized into three lines: Predict for future-state simulation, Transfer for sim-to-real bridging, and Reason for physics-aware reasoning [8]. NVIDIA has reported more than two million downloads across the platform, and Cosmos has been adopted as a substrate for robotics and autonomous vehicle development.
Wayve, a London-based autonomous driving company, has built a production-oriented spatial intelligence stack through its GAIA series of world models. GAIA-1 introduced a 9-billion-parameter generative world model for driving in 2023. GAIA-2 improved controllability and geographic diversity. GAIA-3, announced in 2025, is a 15-billion-parameter latent diffusion world model trained on roughly ten times more data than GAIA-2, designed for safety evaluation of autonomous driving policies at scale [9].
The game and AR company Niantic spun out Niantic Spatial in 2025 to focus on visual positioning and geospatial AI. Its Visual Positioning System anchors content to physical locations with centimeter-level accuracy by matching live camera imagery to a Large Geospatial Model built from billions of crowdsourced images [10]. Niantic represents the geospatial wing of spatial intelligence, where the world to be modeled is the actual planet rather than a generated environment.
Spatial intelligence is being applied across several industries.
The most direct application is to embodied systems. Robots and self-driving cars need to understand 3D space, predict how it will evolve, and act within it. World models like NVIDIA Cosmos and Wayve GAIA provide simulated environments in which policies can be trained safely and at scale, including rare and dangerous events that are hard to encounter in the real world [8][9]. Vision-language-action models like PaLM-E and RT-2 integrate spatial perception with language understanding so that robots can follow natural-language instructions in the physical world.
Generative world synthesis is reshaping content creation. Marble, Genie 3, and similar systems let designers prototype environments from prompts in minutes rather than weeks, and the resulting Gaussian splat or volumetric outputs can be imported into existing engines. Visual effects pipelines have begun adopting NeRF and Gaussian splatting for set extension, virtual production, and post-production. Game studios are exploring world models as procedural content generators and as tools for non-player character behavior [13][22].
Spatial intelligence is a precondition for compelling AR and VR. Persistent, shared world maps such as Niantic's Large Geospatial Model anchor AR objects to physical locations across users and sessions [10]. On the device side, vision-based positioning and scene understanding are required for occlusion, lighting, and plausible placement of virtual content. World Labs has named immersive experiences as a target application for Marble [15].
Generative 3D tools are starting to support architectural visualization, product design, and engineering simulation. The 2026 World Labs funding round included Autodesk as a strategic investor, signaling integration plans with computer-aided design workflows [5]. Combined with physical reasoning, spatial intelligence systems could support iterative design where geometry, structural behavior, and aesthetics are co-optimized.
Li has consistently linked spatial intelligence to scientific discovery and medicine. Her own Stanford research has applied AI-driven sensing to clinical environments to monitor patient safety and reduce staff burnout [3]. More broadly, spatial AI is being applied to molecular simulation, microscopy, and 3D medical imaging, where the fundamental data is inherently spatial rather than textual [2].
Spatial intelligence is being explored as a foundation for new educational tools, including interactive simulations that let learners explore complex concepts in physics, biology, and engineering through immersive 3D experiences. Li frames this as restoring the bodily and spatial dimensions of learning that text-based AI tutors miss [2].
Despite rapid progress, spatial intelligence remains far behind human capability in several respects.
A further open question is how spatial intelligence relates to broader debates about AGI. Li, Yann LeCun, and others have argued that spatial grounding is essential and that language-only paths are insufficient [16]. Skeptics counter that sufficiently large multimodal models might implicitly capture spatial regularities without explicit 3D representations. Resolution of this debate will depend on how spatial models scale in the second half of the 2020s.