GAIA-2 (Wayve)

Updated Jun 28, 2026

GAIA-2 (Generative AI for Autonomy 2) is a controllable, multi-camera generative world model for autonomous driving, announced by the British self-driving company Wayve on 26 March 2025.^[1]^[2] It is an 8.4 billion parameter latent video diffusion model that generates high resolution, spatiotemporally consistent video across up to five surrounding cameras, conditioned on structured inputs such as ego vehicle actions, the positions and types of other agents, road layout, weather, and time of day.^[1] GAIA-2 is used primarily as an offboard tool to synthesise diverse and safety critical driving scenarios for training and testing autonomous driving systems, especially the rare events that are difficult or impossible to collect from a real world fleet.^[1]^[2]

GAIA-2 is the successor to GAIA-1, the 9 billion parameter autoregressive world model Wayve released in 2023,^[7] and the predecessor to GAIA-3, a 15 billion parameter model announced on 2 December 2025 that is oriented towards autonomy evaluation rather than open ended scenario generation.^[5]^[6] Wayve describes GAIA-2 as combining "a latent diffusion architecture with extensive domain-specific conditioning to enable precise control over multi-camera video generation."^[2] Within Wayve's product stack, GAIA-2 complements the LINGO-2 vision language action model, which provides natural language driving commentary and instruction following, by supplying the simulated environments in which driving policies are trained and stress tested.

What is GAIA-2?

GAIA-2 is a video generative world model that lets engineers author driving scenes to order. Where a conventional driving log is whatever the fleet happened to record, GAIA-2 can be told what to generate: a given ego trajectory, specified other agents, a chosen road layout, particular weather, and a particular time of day, rendered consistently across a surround view camera rig.^[1] The model produces "high-resolution, spatiotemporally consistent multi-camera videos" across geographically diverse environments spanning the United Kingdom, the United States, and Germany.^[1] Wayve frames the system as a way to push "the boundaries of synthetic data generation with enhanced controllability, expanded geographic diversity, and broader vehicle representation" relative to GAIA-1.^[2]

The single most important property of GAIA-2 is controllability. The model exposes an explicit, structured conditioning interface rather than only a text prompt, which is what makes it usable as a programmable source of edge cases rather than merely a video generator. Wayve states that "GAIA-2 addresses this by enabling precise, controlled generation of high-risk scenarios" including near collisions, sudden cut ins, emergency braking, and other out of distribution behaviours.^[2]

Why did Wayve build GAIA-2?

Wayve was founded in Cambridge in 2017 and has pursued an end to end approach to self driving in which a single neural network maps raw camera input directly to driving actions, an approach the company refers to as AV 2.0. End to end models require very large and very diverse driving datasets to learn robust policies, and they are particularly sensitive to the long tail of rare events, such as sudden cut ins, near misses, debris in the roadway, and emergency manoeuvres. Real world fleet data is heavily biased towards uneventful highway and urban cruising, with crashes in the United States occurring roughly once per 535,000 vehicle miles according to figures cited by Wayve. Capturing enough naturally occurring safety critical events to train and validate a driving policy is therefore impractical, which is the central motivation for generative world models in autonomy.^[1]^[2]

Wayve's first generation model, GAIA-1, was unveiled in June 2023 and described in a technical report later that year.^[7]^[8] GAIA-1 was an autoregressive transformer that predicted the next discrete video token in a single forward camera stream, scaled to nine billion parameters and trained on around 4,700 hours of United Kingdom driving footage.^[8] Although GAIA-1 demonstrated that a generative world model could produce coherent driving video conditioned on text and ego actions, it had three structural limitations: it was restricted to a single camera, it produced occasional frame level discontinuities because of its autoregressive sampling, and its conditioning interface was relatively coarse. GAIA-2 was designed explicitly to remove these limitations.

How does GAIA-2 work?

GAIA-2 follows the two stage design that has become standard for high resolution video diffusion models. A video tokenizer first compresses raw camera footage into a compact continuous latent space, and a latent diffusion world model then generates new latent sequences that the tokenizer decodes back into pixel space.^[1] Both components are space time factorised transformers.

Video tokenizer

The tokenizer is an asymmetric encoder decoder with 85 million parameters in the encoder and 200 million in the decoder.^[1] The encoder downsamples each camera stream by a factor of eight in the temporal dimension and thirty two in each spatial dimension, projecting to a continuous latent with 64 channels. A 24 frame, 448 by 960 pixel video clip is therefore compressed to a latent tensor of three temporal steps by fourteen by thirty spatial positions per camera. Each 64 channel latent vector replaces a block of 8 × 32 × 32 × 3 input values, an overall compression ratio of 384 to 1. The decoder is the larger of the two and uses full spatiotemporal attention so that decoded video is temporally consistent across long horizons. To handle sequences longer than the training window, the decoder is applied with a rolling overlap rather than as a single pass.
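
The shape arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch of the bookkeeping: the class and function names are illustrative and do not come from Wayve's code, and the downsampling factors are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerSpec:
    """Downsampling factors quoted in the article (names are illustrative)."""
    temporal_factor: int = 8    # video frames merged into one latent step
    spatial_factor: int = 32    # pixels merged per latent position, per axis
    latent_channels: int = 64   # channels of the continuous latent
    rgb_channels: int = 3       # channels of the input video


def latent_shape(frames, height, width, spec=TokenizerSpec()):
    """Return (latent steps, latent rows, latent columns) for one camera."""
    if (frames % spec.temporal_factor
            or height % spec.spatial_factor
            or width % spec.spatial_factor):
        raise ValueError("clip dimensions must be divisible by the factors")
    return (frames // spec.temporal_factor,
            height // spec.spatial_factor,
            width // spec.spatial_factor)


def compression_ratio(spec=TokenizerSpec()):
    """Input values replaced by each latent value."""
    block = spec.temporal_factor * spec.spatial_factor ** 2 * spec.rgb_channels
    return block / spec.latent_channels


print(latent_shape(24, 448, 960))  # (3, 14, 30)
print(compression_ratio())         # 384.0
```

The ratio depends only on the downsampling factors and channel counts, not on the clip length or resolution.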

Latent diffusion world model

The world model itself has 8.4 billion parameters. In Wayve's own description, "it is implemented as a space-time factorized transformer with 8.4B parameters and is trained using flow matching."^[1] It is organised into 22 transformer blocks with a hidden dimension of 4,096 and 32 attention heads, and it operates on the continuous latents produced by the tokenizer rather than on the discrete video tokens that GAIA-1 predicted autoregressively. Flow matching learns a velocity field that transports samples from a simple noise distribution to the data distribution, which avoids the token by token sampling that caused frame level discontinuities in GAIA-1.
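
A toy version of the flow matching objective may make the idea concrete. The sketch below assumes a straight line path between noise and data, which is one common formulation; the article does not state which path or time schedule GAIA-2 uses, and all function names are illustrative. The "model" is an oracle that already knows the true velocity, standing in for the trained transformer.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Velocity of the straight path; constant in t for linear interpolation."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, noise, data, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(noise, data, t)
    pred = predict_velocity(x_t, t)
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def euler_sample(predict_velocity, noise, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
data = [0.5, -1.0, 2.0]                      # stand-in for one latent vector
noise = [rng.gauss(0.0, 1.0) for _ in data]  # sample from the noise distribution


def oracle(x_t, t):
    """Pretend model that returns the exact velocity for this (noise, data) pair."""
    return target_velocity(noise, data)


print(flow_matching_loss(oracle, noise, data, t=0.3))  # 0.0
print(euler_sample(oracle, noise))                     # approximately equal to data
```

In training, the oracle is replaced by a network whose parameters are adjusted to reduce this loss over many noise and data pairs; at inference, the same integration loop turns fresh noise into a latent sequence.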

The model can generate up to five temporally and spatially consistent camera streams at a resolution of 448 by 960.^[1] For a typical surround view generation, GAIA-2 jointly models six temporal latent steps across five cameras at fourteen by thirty spatial latent positions, for a total of 12,600 latent tokens per scene. Cross camera attention enforces that an object visible to two adjacent cameras appears in geometrically consistent positions in both views, and cross frame attention enforces temporal coherence.
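
The token count follows directly from the latent dimensions, and the same numbers show why factorised attention is attractive at this size. The grouping below (spatial attention across all cameras within one latent step, temporal attention across latent steps at one position) is an assumption made for illustration; the article does not spell out the exact attention layout.

```python
def attention_pairs(group_size, groups):
    """Query-key pairs when attention runs independently inside each group."""
    return groups * group_size ** 2


steps, cameras, rows, cols = 6, 5, 14, 30
tokens = steps * cameras * rows * cols
print(tokens)  # 12600

# Full attention: every token attends to every other token in the scene.
full = attention_pairs(tokens, groups=1)

# Assumed factorisation: one spatial pass per latent step over all cameras,
# then one temporal pass per (camera, row, column) position.
per_step = cameras * rows * cols
factorised = (attention_pairs(per_step, groups=steps)
              + attention_pairs(steps, groups=per_step))

print(full)        # 158760000
print(factorised)  # 26535600
```

Under these assumptions the factorised layout evaluates roughly one sixth as many query-key pairs as full attention over the same scene.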

What inputs can GAIA-2 be conditioned on?

GAIA-2's controllability is delivered by a structured conditioning interface that mixes several different mechanisms inside the transformer:

Continuous ego vehicle actions, specifically forward speed and steering curvature, are injected through adaptive layer normalisation, the same conditioning channel used for the flow matching time step.
Discrete and structured metadata, including road attributes, weather, time of day, country, and the type of vehicle the cameras are mounted on, is supplied to cross attention layers.
External latent embeddings, including CLIP text embeddings and embeddings from a proprietary driving specific scenario encoder, are also routed through cross attention. The scenario encoder allows a real recorded clip to be summarised into a single vector that GAIA-2 can then re render in different conditions.
Per camera geometry, including intrinsics, extrinsics, and lens distortion parameters, is projected by learnable linear layers and supplied to the model so that the same scene can be rendered for different sensor rigs.
Other agents are conditioned through their three dimensional bounding boxes, which the model uses to place vehicles, cyclists, and pedestrians at specified positions with specified headings.
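
Adaptive layer normalisation, the mechanism named for ego actions and the flow matching time step, can be sketched in a few lines. This follows the form commonly used in diffusion transformers, where the conditioning signal produces a per feature scale and shift; whether GAIA-2 uses exactly this parameterisation is an assumption, and `toy_modulation` is a hypothetical stand-in for the learned network that maps conditions to modulation values.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalise a feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaptive_layer_norm(x, scale, shift):
    """Modulate normalised features with condition-dependent scale and shift."""
    return [(1.0 + s) * v + b for v, s, b in zip(layer_norm(x), scale, shift)]


def toy_modulation(speed, curvature, flow_time, dim):
    """Hypothetical fixed mapping from scalar conditions to (scale, shift)."""
    scale = [0.01 * speed + 0.1 * flow_time] * dim
    shift = [curvature] * dim
    return scale, shift


features = [0.2, -1.3, 0.7, 2.4]

# With all conditions at zero the layer reduces to plain layer normalisation.
scale, shift = toy_modulation(speed=0.0, curvature=0.0, flow_time=0.0, dim=4)
print(adaptive_layer_norm(features, scale, shift) == layer_norm(features))  # True

# Non-zero speed and curvature rescale and offset every feature.
scale, shift = toy_modulation(speed=12.0, curvature=0.05, flow_time=0.5, dim=4)
print(adaptive_layer_norm(features, scale, shift))
```

Because the modulation touches every token in a block, this channel suits global, low dimensional signals such as speed and curvature, while cross attention suits richer, variable length inputs such as agent boxes.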

Classifier free guidance is supported but is not applied by default. For challenging out of distribution scenarios the report notes guidance scales between two and twenty, and describes a spatially selective variant of classifier free guidance that applies stronger guidance inside agent bounding boxes while leaving the surrounding scene loosely guided, which improves agent placement without destabilising the global composition.^[1]
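
The spatially selective variant can be illustrated on a tiny grid of latent positions. The sketch applies the standard classifier free guidance combination with a scale that depends on whether a position lies inside an agent bounding box. The function names, the grid, and the default outside scale of 1.0 (no extra guidance) are assumptions for illustration, not details from the report.

```python
def guided(cond, uncond, scale):
    """Standard guidance: extrapolate from the unconditional prediction."""
    return uncond + scale * (cond - uncond)


def selective_guidance(cond, uncond, box_mask, inside_scale, outside_scale=1.0):
    """Use a stronger guidance scale at positions covered by an agent box."""
    return [
        [
            guided(c, u, inside_scale if inside else outside_scale)
            for c, u, inside in zip(cond_row, uncond_row, mask_row)
        ]
        for cond_row, uncond_row, mask_row in zip(cond, uncond, box_mask)
    ]


# One scalar prediction per latent position, for readability.
cond = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
uncond = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
box_mask = [[False, True, True], [False, False, False]]

print(selective_guidance(cond, uncond, box_mask, inside_scale=5.0))
# [[1.0, 5.0, 5.0], [1.0, 1.0, 1.0]]
```

Positions outside the mask keep the plain conditional prediction, so only the region where the agent must appear is pushed strongly towards the conditioning.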

What data was GAIA-2 trained on?

GAIA-2 was trained on approximately 25 million two second video sequences collected between 2019 and 2024, which at two seconds each amounts to roughly 14,000 hours of footage, drawn from Wayve's fleet operations in the United Kingdom, the United States, and Germany.^[1] The dataset spans three classes of vehicle platform (a sports car, an SUV, and a large van), several different camera rigs, and capture rates of 20, 25, and 30 hertz, with balanced sampling across these axes so the model does not collapse to the most common configuration.

The video tokenizer was trained for 300,000 optimisation steps with a batch size of 128 on 128 NVIDIA H100 GPUs. The world model was trained for 460,000 steps with a batch size of 256 on 256 H100 GPUs.^[1] The technical report does not state total compute in floating point operations, but the configuration is broadly comparable to other 2024 to 2025 generation video diffusion models in the eight to ten billion parameter range.

What can GAIA-2 do?

The capabilities of GAIA-2 fall into several categories that the technical report and Wayve's accompanying blog post organise as follows.

Capability	Description
Unconditional generation	Sampling fresh driving scenes from noise, with no prompt, to test that the model has internalised the distribution of real driving
Action conditioned rollout	Given an initial scene and a target trajectory expressed as speed and curvature, predict how the scene unfolds, used to ask what would happen if the ego vehicle braked, swerved, or accelerated
Scene editing	Given a real recorded clip, re render it under different weather, lighting, time of day, or country while preserving the underlying geometry and agents
Agent insertion and removal	Add or delete vehicles, cyclists, or pedestrians at specified three dimensional positions, which is the primary mechanism for generating cut in and near miss scenarios
Safety critical synthesis	Compose rare events such as sudden cut ins, emergency stops, jaywalking pedestrians, and adverse weather, on demand, in arbitrary geographies
Multi camera surround rendering	Generate temporally consistent video across up to five cameras simultaneously, supporting the surround view sensor suite of modern assisted and automated driving systems
Sensor rig transfer	Re render a scene as it would have been captured by a different vehicle's camera rig, supporting transfer of training data between platforms

The model's synthetic data is used as an offboard augmentation to Wayve's real world fleet recordings, both for training the company's end to end driving models and for systematic regression testing of those models against curated edge cases.^[2]

Conditioning inputs

The full set of inputs that GAIA-2 accepts at inference is summarised below.

Input	Type	Mechanism
Ego speed	Continuous scalar per frame	Adaptive layer norm
Ego steering curvature	Continuous scalar per frame	Adaptive layer norm
Flow matching time	Continuous scalar	Adaptive layer norm
Weather	Categorical (clear, rain, snow, fog, etc.)	Cross attention metadata
Time of day	Categorical (day, night, dusk, dawn)	Cross attention metadata
Country	Categorical (UK, US, Germany)	Cross attention metadata
Road attributes	Structured (lane count, speed limit, bus lane, cycle lane, zebra crossing, intersection, traffic light state)	Cross attention metadata
Vehicle platform	Categorical (sports car, SUV, van)	Cross attention metadata
Text prompt	Free text	CLIP embedding, cross attention
Scenario embedding	Continuous vector from proprietary encoder	Cross attention
Other agents	Three dimensional bounding boxes with class labels	Cross attention with optional spatial classifier free guidance
Camera intrinsics, extrinsics, distortion	Continuous per camera	Learnable linear projection

All inputs are optional, and any subset can be supplied at inference. A prompt may consist of nothing more than a CLIP text string, or it may pin down speed, curvature, agent positions, weather, and sensor rig to fully determine the generation.
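
One way to picture an interface in which every input is optional is a record whose fields all default to "not supplied". The container below is hypothetical: the class names, field names, and units are invented for illustration and cover only part of the table above. GAIA-2 is not publicly exposed through an API of this shape.

```python
from dataclasses import dataclass, fields
from typing import List, Optional, Tuple


@dataclass
class AgentBox:
    """Hypothetical three dimensional bounding box for another road user."""
    label: str                          # e.g. "vehicle", "cyclist", "pedestrian"
    center: Tuple[float, float, float]  # metres, in an assumed ego frame
    size: Tuple[float, float, float]    # length, width, height in metres
    heading: float                      # radians


@dataclass
class ScenarioConditioning:
    """Hypothetical container; None means the input is left unconstrained."""
    ego_speed: Optional[List[float]] = None      # one value per frame
    ego_curvature: Optional[List[float]] = None  # one value per frame
    weather: Optional[str] = None
    time_of_day: Optional[str] = None
    country: Optional[str] = None
    vehicle_platform: Optional[str] = None
    text_prompt: Optional[str] = None
    agents: Optional[List[AgentBox]] = None

    def supplied(self) -> List[str]:
        """Names of the inputs that were actually provided."""
        return [f.name for f in fields(self) if getattr(self, f.name) is not None]


# A prompt that is nothing more than a text string.
loose = ScenarioConditioning(text_prompt="a wet roundabout at dusk")
print(loose.supplied())  # ['text_prompt']

# A prompt that pins down ego motion, weather, and one cut-in vehicle.
tight = ScenarioConditioning(
    ego_speed=[12.0, 12.0, 11.5],
    ego_curvature=[0.0, 0.0, 0.01],
    weather="rain",
    agents=[AgentBox("vehicle", (8.0, -1.5, 0.0), (4.5, 1.8, 1.5), 0.1)],
)
print(tight.supplied())  # ['ego_speed', 'ego_curvature', 'weather', 'agents']
```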

How is GAIA-2 evaluated?

Wayve evaluates GAIA-2 with a small set of metrics chosen to reflect properties that matter for downstream driving rather than generic video quality. Visual fidelity is measured with Fréchet Inception Distance, computed at the model's native 448 by 960 resolution rather than the customary 299 by 299, and with Fréchet DINO Distance, which uses a DINO vision backbone in place of the Inception network and is regarded as a more reliable signal for high resolution natural imagery. Temporal consistency is measured with Fréchet Video Motion Distance (FVMD) rather than the more common Fréchet Video Distance, on the grounds that FVMD is more sensitive to the kind of motion artefacts that matter for driving. Agent fidelity is measured by projecting the input three dimensional bounding boxes into each camera and computing a class wise intersection over union between the projections and segmentation masks extracted from the generated frames.^[1]
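
The agent fidelity metric reduces to a per class intersection over union between two pixel masks. The sketch below skips the projection of three dimensional boxes into the image and starts from masks given as sets of pixel coordinates; the helper names and the convention for empty masks are assumptions made for illustration.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return len(mask_a & mask_b) / len(union)


def classwise_iou(projected, segmented):
    """IoU per class; both arguments map a class name to a set of pixels."""
    classes = sorted(set(projected) | set(segmented))
    return {c: iou(projected.get(c, set()), segmented.get(c, set())) for c in classes}


def box_pixels(top, left, bottom, right):
    """Pixels of an axis-aligned rectangle, with bottom and right exclusive."""
    return {(r, c) for r in range(top, bottom) for c in range(left, right)}


# Conditioning boxes projected into the image (where agents were requested).
projected = {"vehicle": box_pixels(0, 0, 4, 4), "pedestrian": box_pixels(5, 5, 7, 6)}
# Masks segmented from the generated frame (where agents actually appeared).
segmented = {"vehicle": box_pixels(0, 2, 4, 6), "pedestrian": box_pixels(5, 5, 7, 6)}

scores = classwise_iou(projected, segmented)
print(scores)
```

Here the pedestrian lands exactly where requested, while the vehicle is shifted two pixels sideways and overlaps the requested region by only one third.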

The technical report does not publish a head to head leaderboard against other video models on a public benchmark. Instead it presents qualitative example grids across each conditioning axis and reports the trends of its internal metrics across ablations of camera count, latent dimension, and conditioning richness.^[1]

How does GAIA-2 differ from GAIA-1 and GAIA-3?

GAIA-2 sits in the middle of three Wayve generations to date. GAIA-1 established the basic recipe of a video tokenizer feeding a transformer that predicts future visual state; GAIA-2 redesigned that transformer as a latent diffusion model with structured conditioning, added native multi camera support, and expanded the geographic coverage of the training data; and GAIA-3 nearly doubled the parameter count, from 8.4 billion to 15 billion, and shifted the system's centre of gravity from synthetic data generation towards autonomy evaluation.^[5]^[6]

Property	GAIA-1 (2023)	GAIA-2 (March 2025)	GAIA-3 (December 2025)
Parameters	9 billion (world model)	8.4 billion (world model)	15 billion (world model)
Generative model	Autoregressive transformer over discrete video tokens	Latent diffusion transformer trained with flow matching	Scaled latent diffusion transformer
Cameras	Single forward camera	Up to 5 surround cameras	Multi camera, scaled
Training data	~4,700 hours, UK only	~25M two second sequences (~14,000 hours), UK, US, Germany	Approximately 10x more data than GAIA-2
Conditioning	Text, ego action	Text, ego action, agents, weather, time of day, country, road semantics, sensor rig, scenario embedding	Conditioning inherited and extended
Primary purpose	Proof of concept generative world model	Synthetic scenario generation and edge case testing	Repeatable evaluation and validation of driving policies
Notable release	Technical report October 2023	Technical report and arXiv preprint 26 March 2025	Announced 2 December 2025

GAIA-3 is described by Wayve as "transforming world modeling from a tool for visual synthesis into a foundation for autonomy evaluation," and is built as a 15 billion parameter latent diffusion world model trained using five times more compute than GAIA-2 and on roughly ten times more data.^[5]^[6] Early studies cited at launch reported that synthetic GAIA-3 tests mirrored real world driving results closely enough to reduce synthetic test rejection rates roughly fivefold compared to earlier evaluation pipelines.^[5] GAIA-2 remains the foundation on which that progression was built.

How does GAIA-2 compare to other driving world models?

GAIA-2 was released into a rapidly crowding field of generative world models for driving. The most directly comparable systems are summarised below.

System	Developer	Release	Architecture	Cameras	Notes
GAIA-2	Wayve	March 2025	8.4B parameter latent diffusion transformer with flow matching	Up to 5 surround	Rich structured conditioning, focus on safety critical synthesis
GAIA-1	Wayve	2023	9B parameter autoregressive transformer	Single forward	Wayve's first generative world model for driving
Cosmos	NVIDIA	January 2025	Family of diffusion and autoregressive world foundation models	Multi view	Released as open weights, broader physical AI scope rather than driving only
DriveDreamer / DriveDreamer-2	GigaAI and collaborators	2023 to 2024	Diffusion based driving world model	Multi view	Research line focused on controllable driving video
Vista	Shanghai AI Lab and collaborators	2024	Driving video diffusion model	Multi view	Open source academic baseline
Genie / Genie 3 derived Waymo World Model	Google DeepMind and Waymo	2025	Action conditioned world model adapted from Genie	Multi sensor	Tightly coupled to Waymo's robotaxi service
Tesla world model	Tesla	Internal	Latent world model trained on Tesla fleet video	Multi view	Not publicly documented in detail; used for end to end driver training

GAIA-2 is differentiated within this group primarily by the breadth and explicitness of its conditioning interface. Where many contemporaries condition only on a short text prompt or a single trajectory, GAIA-2 exposes structured controls over agents, road semantics, weather, time of day, country, and sensor rig. Among publicly documented industrial systems it is also one of the largest driving specific world models, surpassed in 2025 only by Wayve's own GAIA-3 and by general purpose world foundation models such as Cosmos that are not exclusively oriented to autonomy.

What is GAIA-2 used for?

The practical role of GAIA-2 inside Wayve is to expand and stress the training and validation distributions of the company's end to end driving stack. Specific uses described by Wayve and in the technical report include:

Synthetic training data. Generations are mixed with real fleet recordings during training of Wayve's driving policy, broadening the data distribution beyond what the fleet has naturally collected.
Domain transfer. Real recordings from one country can be re rendered under the visual conventions of another, reducing the amount of fresh real data needed to operate in a new geography.
Weather and lighting augmentation. Clear weather clips can be re rendered in rain, snow, fog, dusk, or night, which is much faster than waiting for the fleet to capture those conditions.
Cut in and near miss synthesis. Adversarial agent trajectories, such as a vehicle aggressively merging from an adjacent lane or a pedestrian stepping out from behind a parked van, can be inserted into otherwise normal clips so the driving model is repeatedly tested against the scenarios that matter most for safety.
Counterfactual evaluation. Given a recorded clip, GAIA-2 can roll the scene forward under alternative ego actions, supporting questions such as what would have happened if the ego vehicle had not braked.
Sensor rig transfer. Data captured on one vehicle can be re rendered as if captured by another, reducing the cost of supporting new platforms.

Wayve presents these capabilities as a complement to, rather than a replacement for, real world fleet data.

How was GAIA-2 received?

GAIA-2 was widely covered in the autonomous driving and AI press in the days following its 26 March 2025 announcement.^[10]^[11] Coverage focused on three themes: the scale and quality of the surround video generations relative to GAIA-1, the practical value of authoring safety critical scenarios on demand, and the broader trend of large generative world models replacing classical rule based simulators in the autonomous vehicle industry. Independent commentators noted that the explicit, structured conditioning interface was a clearer step forward for industrial usability than raw image quality, and contrasted GAIA-2 favourably with research models whose only handle was a text prompt.^[12] The model was frequently discussed alongside NVIDIA's Cosmos world foundation model family, released earlier the same year, as evidence that 2025 had become the year in which large generative world models moved from research curiosities to core infrastructure for autonomy and embodied AI.^[13]

What are GAIA-2's limitations?

The GAIA-2 technical report is explicit about several limitations of the system as released.^[1]

Long horizon and complex generations occasionally exhibit temporal or semantic inconsistencies, such as objects flickering between cameras or unrealistic vehicle behaviour over extended rollouts.
Generation is computationally intensive even with the parallelism afforded by latent diffusion; sampling a single surround view scene requires many denoising steps and substantial GPU memory, which makes very large scale offline synthesis expensive.
The diversity of agent behaviour available to the model is limited by what is present in Wayve's training data and by the relative scarcity of recorded safety critical events, which the model can compose but does not invent from first principles.
The model has no native physical or dynamics simulator and therefore does not guarantee physically plausible vehicle trajectories or contact interactions, which limits its use for closed loop physics fidelity tests.
Evaluation is performed with internal metrics rather than against a community benchmark, which makes head to head comparison with other world models difficult.

GAIA-3 was positioned partly as a response to these limitations, with a larger model, more data, and an explicit focus on producing repeatable evaluation signals.^[5]^[6]

ELI5: what is GAIA-2?

Imagine a video game that can build any driving situation you describe. You tell it the weather, the time of day, the country, where the cars and people are, and what your own car does, and it draws a realistic video from several cameras at once, as if filmed from a real car. Wayve, a British self driving company, built GAIA-2 so its driving software can practise on tricky and dangerous moments (like someone suddenly cutting in front of you) thousands of times safely on a computer, instead of waiting for those rare moments to happen on a real road.

References

1. Russell, L., Hu, A., Bertoni, L., Fedoseev, G., Shotton, J., Arani, E., and Corrado, G. "GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving." arXiv:2503.20523, 26 March 2025. https://arxiv.org/abs/2503.20523
2. Wayve. "GAIA-2: Pushing the Boundaries of Video Generative Models for Safer Assisted and Automated Driving." Wayve Thinking, 26 March 2025. https://wayve.ai/thinking/gaia-2/
3. Wayve. "Wayve Unveils GAIA-2: Cutting-Edge Scalable Video Generation for Assisted and Automated Driving." Wayve press release, 26 March 2025. https://wayve.ai/press/wayve-unveils-gaia2/
4. Wayve. "GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving." Technical report PDF. https://wayve.ai/wp-content/uploads/2025/03/GAIA_2_Technical_Report.pdf
5. Wayve. "Wayve launches GAIA-3, advancing world models from simulation to evaluation." Wayve press release, 2 December 2025. https://wayve.ai/press/wayve-launches-gaia3/
6. Wayve. "GAIA-3: Scaling World Models to Power Safety and Evaluation." Wayve Thinking, 2 December 2025. https://wayve.ai/thinking/gaia-3/
7. Hu, A. et al. "GAIA-1: A Generative World Model for Autonomous Driving." arXiv:2309.17080, September 2023. https://arxiv.org/abs/2309.17080
8. Wayve. "Scaling GAIA-1: 9-billion parameter generative world model for autonomous driving." Wayve Thinking, October 2023. https://wayve.ai/thinking/scaling-gaia-1/
9. Wayve. "Wayve Releases GAIA-1 Technical Report to Advance World Models for Autonomy." Wayve press release, October 2023. https://wayve.ai/press/wayve-releases-gaia-1-technical-report/
10. IoT World Today. "AI Company Releases New Video-Generative Model for Self-Driving." March 2025. https://www.iotworldtoday.com/transportation-logistics/ai-company-releases-new-video-generative-model-for-self-driving
11. Self Drive News. "Wayve Unveils GAIA-2: Synthetic Data for AI Driving." March 2025. https://selfdrivenews.com/wayve-unveils-gaia-2-synthetic-data-for-ai-driving/
12. Matt3r. "Generative Models in Autonomous Driving: GAIA-1 to GAIA-2 and the Realism Gap." 2025. https://matt3r.ai/blogs/our-latest-thoughts/gaia-2-synthetic-data-autonomous-driving
13. Natix Network. "State of the Art: A Review of the World Foundational Model Landscape." 2025. https://www.natix.network/blog/review-of-the-world-foundational-model-landscape
14. Wayve. "Wayve GAIA: Generative AI for video generation and simulation." Wayve Science. https://wayve.ai/science/gaia/
