GAIA-2 (Wayve)
Last reviewed
May 16, 2026
Sources
14 citations
Review status
Source-backed
Revision
v1 · 3,484 words
GAIA-2 (Generative AI for Autonomy 2) is a controllable multi view generative world model for autonomous driving developed by the British artificial intelligence company Wayve. It was announced on 26 March 2025 alongside a technical report on arXiv, and is the successor to GAIA-1, the 9 billion parameter autoregressive world model that Wayve released in 2023. GAIA-2 is a latent video diffusion model with 8.4 billion parameters that produces high resolution, spatiotemporally consistent video across up to five surrounding cameras, conditioned on a rich set of structured inputs including ego vehicle dynamics, the positions and types of other agents, road semantics, weather, time of day, geographical region, and camera geometry. It is designed primarily as an offboard tool for synthesising training and validation data for Wayve's end to end driving software, and in particular for generating rare and safety critical scenarios that are difficult or impossible to collect during real world fleet operation.
GAIA-2 was succeeded in December 2025 by GAIA-3, a 15 billion parameter model trained on roughly ten times more data and oriented towards autonomy evaluation rather than open ended scenario generation. Within Wayve's product stack, GAIA-2 complements the LINGO-2 vision language action model, which provides natural language driving commentary and instruction following, by supplying the simulated environments in which driving policies are trained and stress tested.
Wayve was founded in Cambridge in 2017 and has pursued an end to end approach to self driving in which a single neural network maps raw camera input directly to driving actions, an approach the company refers to as AV 2.0. End to end models require very large and very diverse driving datasets to learn robust policies, and they are particularly sensitive to the long tail of rare events, such as sudden cut ins, near misses, debris in the roadway, and emergency manoeuvres. Real world fleet data is heavily biased towards uneventful highway and urban cruising, with crashes in the United States occurring roughly once per 535,000 vehicle miles according to figures cited by Wayve. Capturing enough naturally occurring safety critical events to train and validate a driving policy is therefore impractical, which is the central motivation for generative world models in autonomy.
Wayve's first generation model, GAIA-1, was unveiled in June 2023 and described in a technical report later that year. GAIA-1 was an autoregressive transformer that predicted the next discrete video token in a single forward camera stream, scaled to nine billion parameters and trained on around 4,700 hours of United Kingdom driving footage. Although GAIA-1 demonstrated that a generative world model could produce coherent driving video conditioned on text and ego actions, it had three structural limitations: it was restricted to a single camera, it produced occasional frame level discontinuities because of its autoregressive sampling, and its conditioning interface was relatively coarse. GAIA-2 was designed explicitly to remove these limitations.
GAIA-2 follows the two stage design that has become standard for high resolution video diffusion models. A video tokenizer first compresses raw camera footage into a compact continuous latent space, and a latent diffusion world model then generates new latent sequences that the tokenizer decodes back into pixel space. Both components are space time factorised transformers.
The tokenizer is an asymmetric encoder decoder. The encoder has roughly 85 million parameters and downsamples each camera stream by a factor of eight in the temporal dimension and thirty two in each spatial dimension, projecting to a continuous latent with 64 channels. A 24 frame, 448 by 960 pixel video clip is therefore compressed to a latent tensor of three temporal steps by fourteen by thirty spatial positions per camera, an overall compression ratio of 384 to 1 (8 × 32 × 32 × 3 input values for every 64 latent values), or roughly 400 to 1. The decoder is larger at around 200 million parameters and uses full spatiotemporal attention so that decoded video is temporally consistent across long horizons. To handle sequences longer than the training window, the decoder is applied with a rolling overlap rather than as a single pass.
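The latent shape and compression ratio follow directly from the downsampling factors. The short sketch below reproduces that arithmetic using only the figures quoted above; the constant and function names are illustrative and do not come from Wayve's code.

```python
from math import prod

# Figures quoted in the text above.
CLIP_SHAPE = (24, 448, 960)   # frames, height, width of one camera clip
RGB_CHANNELS = 3
TEMPORAL_FACTOR = 8
SPATIAL_FACTOR = 32
LATENT_CHANNELS = 64


def latent_shape(frames, height, width):
    """Return (steps, rows, cols) of the latent grid for one camera."""
    return (frames // TEMPORAL_FACTOR,
            height // SPATIAL_FACTOR,
            width // SPATIAL_FACTOR)


def compression_ratio(frames, height, width):
    """Ratio of raw RGB values to latent values for one camera clip."""
    raw_values = frames * height * width * RGB_CHANNELS
    latent_values = prod(latent_shape(frames, height, width)) * LATENT_CHANNELS
    return raw_values / latent_values


print(latent_shape(*CLIP_SHAPE))        # (3, 14, 30)
print(compression_ratio(*CLIP_SHAPE))   # 384.0
```

The exact value of 384 is in line with the rounded figure of about 400 to 1.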
The world model itself has 8.4 billion parameters, organised into 22 transformer blocks with a hidden dimension of 4,096 and 32 attention heads. It operates on the continuous latents produced by the tokenizer and is trained with flow matching rather than the discrete next token objective used by GAIA-1. Flow matching learns a velocity field that transports samples from a simple noise distribution to the data distribution, an objective that is often reported to be simpler and more stable to train than classic denoising diffusion.
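To make the flow matching objective concrete, the following is a minimal NumPy sketch under stated assumptions. It assumes a linear path between noise and data and uniformly sampled times, which is one common formulation and not necessarily the exact schedule used in GAIA-2. The names `velocity_fn`, `flow_matching_loss`, and `sample` are hypothetical.

```python
import numpy as np


def flow_matching_loss(velocity_fn, data, rng):
    """Monte Carlo estimate of the flow matching loss for one batch.

    Assumes the linear path x_t = (1 - t) * noise + t * data, whose
    target velocity is (data - noise). This is an illustrative choice.
    """
    noise = rng.standard_normal(data.shape)
    # One time value per sample, broadcast over the remaining axes.
    t = rng.uniform(size=(data.shape[0],) + (1,) * (data.ndim - 1))
    x_t = (1.0 - t) * noise + t * data
    target = data - noise
    pred = velocity_fn(x_t, t)
    return float(np.mean((pred - target) ** 2))


def sample(velocity_fn, shape, rng, steps=50):
    """Integrate dx/dt = v(x, t) from noise (t=0) towards data (t=1)
    with fixed size Euler steps."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((shape[0],) + (1,) * (len(shape) - 1), i * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

In a real system `velocity_fn` would be the conditioned transformer, and the loss would be minimised by gradient descent on its weights. The sketch only shows how the training target and the sampling loop relate to each other.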
For a typical surround view generation, GAIA-2 jointly models six temporal latent steps across five cameras at fourteen by thirty spatial latent positions, for a total of 12,600 latent tokens per scene. Cross camera attention enforces that an object visible to two adjacent cameras appears in geometrically consistent positions in both views, and cross frame attention enforces temporal coherence.
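The token count, and the reason factorised attention is cheaper than attending over all tokens at once, can be checked with a short NumPy sketch. The grouping below, with spatial attention spanning all cameras within a time step and temporal attention spanning time at each position, is an assumption made for illustration. The channel width `D` is a toy value.

```python
import numpy as np

T, N, H, W = 6, 5, 14, 30   # latent steps, cameras, rows, cols (from the text)
D = 8                        # toy channel width; the real hidden size is 4,096

tokens = np.zeros((T, N, H, W, D))
total_tokens = T * N * H * W

# Spatial (cross camera) attention: each time step attends over all
# cameras and positions, giving T sequences of N*H*W tokens.
spatial_view = tokens.reshape(T, N * H * W, D)

# Temporal (cross frame) attention: each camera position attends over
# time, giving N*H*W sequences of T tokens.
temporal_view = spatial_view.transpose(1, 0, 2)

# Rough count of pairwise attention scores per layer.
factorised_pairs = T * (N * H * W) ** 2 + (N * H * W) * T ** 2
full_pairs = total_tokens ** 2

print(total_tokens)                    # 12600
print(full_pairs / factorised_pairs)   # about 6x fewer score pairs
```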
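GAIA-2's controllability is delivered by a structured conditioning interface that mixes several different mechanisms inside the transformer: adaptive layer norm for per frame ego speed, steering curvature, and the flow matching time; cross attention for categorical metadata, CLIP text embeddings, scenario embeddings, and agent bounding boxes; and a learnable linear projection for camera intrinsics, extrinsics, and distortion.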
Classifier free guidance is supported but is not applied by default. For challenging out of distribution scenarios the report notes guidance scales between two and twenty, and describes a spatially selective variant of classifier free guidance that applies stronger guidance inside agent bounding boxes while leaving the surrounding scene loosely guided, which improves agent placement without destabilising the global composition.
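The spatially selective variant can be sketched as ordinary classifier free guidance with a per position scale instead of a single scalar. The code below is an illustration only. It assumes the three dimensional agent boxes have already been projected to rectangles on the latent grid, and the function names and default scales are hypothetical.

```python
import numpy as np


def guided_velocity(v_cond, v_uncond, scale_map):
    """Classifier free guidance with a per position guidance scale.

    v_cond, v_uncond: arrays of shape (H, W, C) from the conditional and
    unconditional passes. scale_map: array of shape (H, W).
    A scale of 1 reproduces the conditional prediction unchanged.
    """
    return v_uncond + scale_map[..., None] * (v_cond - v_uncond)


def box_scale_map(height, width, boxes, inside=8.0, outside=1.0):
    """Build a guidance map: `inside` within the boxes, `outside` elsewhere.

    boxes: iterable of (row0, row1, col0, col1) in latent coordinates,
    end exclusive. Assumes boxes were projected to 2D beforehand.
    """
    scale = np.full((height, width), outside, dtype=float)
    for r0, r1, c0, c1 in boxes:
        scale[r0:r1, c0:c1] = inside
    return scale


# Example on one 14 by 30 latent frame with a single agent box.
rng = np.random.default_rng(0)
v_c = rng.standard_normal((14, 30, 4))
v_u = rng.standard_normal((14, 30, 4))
scale = box_scale_map(14, 30, [(2, 6, 10, 18)])
guided = guided_velocity(v_c, v_u, scale)
```

Outside the box the output equals the conditional prediction, while inside it the difference between conditional and unconditional predictions is amplified, which is the behaviour the paragraph above describes.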
GAIA-2 was trained on roughly 25 million two second video sequences, which works out to about 13,900 hours of footage (25 million × 2 s ≈ 50 million seconds), drawn from Wayve's fleet operations in the United Kingdom, the United States, and Germany. The dataset spans three classes of vehicle platform (a sports car, an SUV, and a large van), several different camera rigs, and capture rates of 20, 25, and 30 hertz, with balanced sampling across these axes so the model does not collapse to the most common configuration.
The video tokenizer was trained for 300,000 optimisation steps with a batch size of 128 on 128 NVIDIA H100 GPUs. The world model was trained for 460,000 steps with a batch size of 256 on 256 H100 GPUs. The technical report does not state total compute in floating point operations, but the configuration is broadly comparable to other 2024 to 2025 generation video diffusion models in the eight to ten billion parameter range.
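Although total compute is not stated, the number of training samples each run processed follows from the step counts and batch sizes. The sketch below assumes the quoted batch sizes are global, that is, summed across all GPUs, and that one sample is one clip. Neither assumption is confirmed by the figures above.

```python
# Figures quoted in the text; the dictionary layout is illustrative.
RUNS = {
    "tokenizer": {"steps": 300_000, "batch_size": 128, "gpus": 128},
    "world_model": {"steps": 460_000, "batch_size": 256, "gpus": 256},
}


def samples_seen(cfg):
    """Total samples processed, assuming a global batch size."""
    return cfg["steps"] * cfg["batch_size"]


def per_gpu_batch(cfg):
    """Samples per GPU per step under the same assumption."""
    return cfg["batch_size"] // cfg["gpus"]


for name, cfg in RUNS.items():
    print(f"{name}: {samples_seen(cfg):,} samples, "
          f"{per_gpu_batch(cfg)} per GPU per step")
```

Under these assumptions the world model processed about 118 million samples at one sample per GPU per step.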
The capabilities of GAIA-2 fall into several categories that the technical report and Wayve's accompanying blog post organise as follows.
| Capability | Description |
|---|---|
| Unconditional generation | Sampling fresh driving scenes from noise, with no prompt, to test that the model has internalised the distribution of real driving |
| Action conditioned rollout | Given an initial scene and a target trajectory expressed as speed and curvature, predict how the scene unfolds, used to ask what would happen if the ego vehicle braked, swerved, or accelerated |
| Scene editing | Given a real recorded clip, re render it under different weather, lighting, time of day, or country while preserving the underlying geometry and agents |
| Agent insertion and removal | Add or delete vehicles, cyclists, or pedestrians at specified three dimensional positions, which is the primary mechanism for generating cut in and near miss scenarios |
| Safety critical synthesis | Compose rare events such as sudden cut ins, emergency stops, jaywalking pedestrians, and adverse weather, on demand, in arbitrary geographies |
| Multi camera surround rendering | Generate temporally consistent video across up to five cameras simultaneously, supporting the surround view sensor suite of modern assisted and automated driving systems |
| Sensor rig transfer | Re render a scene as it would have been captured by a different vehicle's camera rig, supporting transfer of training data between platforms |
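For the action conditioned rollout row, it can help to see how a speed and curvature sequence determines an ego path. The sketch below integrates a simple unicycle model with Euler steps. The units (metres per second and inverse metres) and the function name are assumptions made for illustration, not details from the report.

```python
import math


def integrate_path(speeds, curvatures, dt):
    """Integrate per frame speed (m/s) and curvature (1/m) into a 2D path.

    Returns a list of (x, y, heading) poses, starting at the origin and
    facing along +x. Uses simple Euler steps of a unicycle model.
    """
    x, y, heading = 0.0, 0.0, 0.0
    poses = [(x, y, heading)]
    for v, k in zip(speeds, curvatures):
        x += v * math.cos(heading) * dt
        y += v * math.sin(heading) * dt
        heading += v * k * dt   # yaw rate = speed * curvature
        poses.append((x, y, heading))
    return poses


# One second at 30 Hz: driving straight versus a steady left turn.
straight = integrate_path([10.0] * 30, [0.0] * 30, dt=1 / 30)
turning = integrate_path([10.0] * 30, [0.1] * 30, dt=1 / 30)
```

Changing only the curvature sequence while holding the initial scene fixed is the kind of counterfactual query the table describes.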
The model's synthetic data is used as an offboard augmentation to Wayve's real world fleet recordings, both for training the company's end to end driving models and for systematic regression testing of those models against curated edge cases.
The full set of inputs that GAIA-2 accepts at inference is summarised below.
| Input | Type | Mechanism |
|---|---|---|
| Ego speed | Continuous scalar per frame | Adaptive layer norm |
| Ego steering curvature | Continuous scalar per frame | Adaptive layer norm |
| Flow matching time | Continuous scalar | Adaptive layer norm |
| Weather | Categorical (clear, rain, snow, fog, etc.) | Cross attention metadata |
| Time of day | Categorical (day, night, dusk, dawn) | Cross attention metadata |
| Country | Categorical (UK, US, Germany) | Cross attention metadata |
| Road attributes | Structured (lane count, speed limit, bus lane, cycle lane, zebra crossing, intersection, traffic light state) | Cross attention metadata |
| Vehicle platform | Categorical (sports car, SUV, van) | Cross attention metadata |
| Text prompt | Free text | CLIP embedding, cross attention |
| Scenario embedding | Continuous vector from proprietary encoder | Cross attention |
| Other agents | Three dimensional bounding boxes with class labels | Cross attention with optional spatial classifier free guidance |
| Camera intrinsics, extrinsics, distortion | Continuous per camera | Learnable linear projection |
All inputs are optional, and any subset can be supplied at inference. A prompt may consist of nothing more than a text string embedded with CLIP, or it may pin down speed, curvature, agent positions, weather, and sensor rig to tightly constrain the generation.
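One way to picture an interface in which every input is optional is a container whose fields all default to empty. The sketch below is purely hypothetical: the class names, field names, and box layout are illustrative and do not reflect Wayve's actual API. It covers only a subset of the inputs in the table.

```python
from dataclasses import dataclass, field, fields
from typing import Optional, Sequence


@dataclass
class AgentBox:
    """One 3D bounding box with a class label (illustrative layout)."""
    label: str
    center_xyz: tuple
    size_lwh: tuple
    yaw: float = 0.0


@dataclass
class Conditioning:
    """Hypothetical container; every field is optional, as in the text."""
    ego_speed: Optional[Sequence[float]] = None       # one value per frame
    ego_curvature: Optional[Sequence[float]] = None   # one value per frame
    weather: Optional[str] = None
    time_of_day: Optional[str] = None
    country: Optional[str] = None
    vehicle_platform: Optional[str] = None
    text_prompt: Optional[str] = None
    agents: list = field(default_factory=list)

    def supplied(self):
        """Names of the inputs that were actually provided."""
        return [f.name for f in fields(self)
                if getattr(self, f.name) not in (None, [])]


text_only = Conditioning(text_prompt="wet roundabout at dusk")
detailed = Conditioning(
    ego_speed=[8.0] * 6,
    ego_curvature=[0.0] * 6,
    weather="rain",
    agents=[AgentBox("cyclist", (12.0, -1.5, 0.0), (1.8, 0.6, 1.7))],
)
print(text_only.supplied())   # ['text_prompt']
print(detailed.supplied())    # ['ego_speed', 'ego_curvature', 'weather', 'agents']
```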
Wayve evaluates GAIA-2 with a small set of metrics chosen to reflect properties that matter for downstream driving rather than generic video quality. Visual fidelity is measured with Fréchet Inception Distance (FID), computed at the model's native 448 by 960 resolution rather than the customary 299 by 299, and with Fréchet DINO Distance (FDD), which uses a DINO vision backbone in place of the Inception network and is regarded as a more reliable signal for high resolution natural imagery. Temporal consistency is measured with Fréchet Video Motion Distance (FVMD) rather than the more common Fréchet Video Distance (FVD), on the grounds that FVMD is more sensitive to the kind of motion artefacts that matter for driving. Agent fidelity is measured by projecting the input three dimensional bounding boxes into each camera and comparing the projections with segmentation masks extracted from the generated frames, using a class wise intersection over union.
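The Fréchet style metrics share one formula and differ only in the feature extractor, while the agent metric is a per class intersection over union. The NumPy sketch below shows both computations on precomputed inputs. It does not include any feature extractor or segmentation model, and the function names are illustrative.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (num_samples, feature_dim) with
    feature_dim >= 2, e.g. features from an Inception or DINO backbone.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr(sqrt(cov_a @ cov_b)) equals the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * trace_sqrt)


def classwise_iou(pred, target, num_classes):
    """Per class IoU between two integer label maps of equal shape.

    Classes absent from both maps are omitted from the result.
    """
    ious = {}
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union:
            ious[c] = float(np.logical_and(p, t).sum() / union)
    return ious
```

For the agent metric, `target` would be the label map rendered from the projected input boxes and `pred` the segmentation of the generated frame.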
The technical report does not publish a head to head leaderboard against other video models on a public benchmark. Instead it presents qualitative example grids across each conditioning axis and reports the trends of its internal metrics across ablations of camera count, latent dimension, and conditioning richness.
GAIA-2 sits in the middle of three Wayve generations to date. GAIA-1 established the basic recipe of a video tokenizer feeding a transformer that predicts future visual state; GAIA-2 redesigned that transformer as a latent diffusion model with structured conditioning, added native multi camera support, and expanded the geographic coverage of the training data, at a slightly smaller parameter count than GAIA-1; and GAIA-3 roughly doubled the parameter count to about 15 billion and shifted the system's centre of gravity from synthetic data generation towards autonomy evaluation.
| Property | GAIA-1 (2023) | GAIA-2 (March 2025) | GAIA-3 (December 2025) |
|---|---|---|---|
| Parameters | 9 billion (world model) | 8.4 billion (world model) | ~15 billion (world model) |
| Generative model | Autoregressive transformer over discrete video tokens | Latent diffusion transformer trained with flow matching | Scaled latent diffusion transformer |
| Cameras | Single forward camera | Up to 5 surround cameras | Multi camera, scaled |
| Training data | ~4,700 hours, UK only | ~25 million two second clips (about 13,900 hours), UK, US, Germany | Approximately 10x more data than GAIA-2 |
| Conditioning | Text, ego action | Text, ego action, agents, weather, time of day, country, road semantics, sensor rig, scenario embedding | Conditioning inherited and extended |
| Primary purpose | Proof of concept generative world model | Synthetic scenario generation and edge case testing | Repeatable evaluation and validation of driving policies |
| Notable release | Technical report October 2023 | Technical report and arXiv preprint 26 March 2025 | Announced 2 December 2025 |
GAIA-3 is described by Wayve as advancing world models from simulation to evaluation, with early studies cited at launch reporting that synthetic GAIA-3 tests reflected real world driving results closely enough to reduce false rejections by a factor of five compared to earlier evaluation pipelines. GAIA-2 remains the foundation on which that progression was built.
GAIA-2 was released into a rapidly crowding field of generative world models for driving. The most directly comparable systems are summarised below.
| System | Developer | Release | Architecture | Cameras | Notes |
|---|---|---|---|---|---|
| GAIA-2 | Wayve | March 2025 | 8.4B parameter latent diffusion transformer with flow matching | Up to 5 surround | Rich structured conditioning, focus on safety critical synthesis |
| GAIA-1 | Wayve | 2023 | 9B parameter autoregressive transformer | Single forward | First generative world model for driving |
| Cosmos | NVIDIA | January 2025 | Family of diffusion and autoregressive world foundation models | Multi view | Released as open weights, broader physical AI scope rather than driving only |
| DriveDreamer / DriveDreamer-2 | GigaAI and collaborators | 2023 to 2024 | Diffusion based driving world model | Multi view | Research line focused on controllable driving video |
| Vista | Shanghai AI Lab and collaborators | 2024 | Driving video diffusion model | Single forward | Open source academic baseline |
| Genie / Genie 3 derived Waymo World Model | Google DeepMind and Waymo | 2025 | Action conditioned world model adapted from Genie | Multi sensor | Tightly coupled to Waymo's robotaxi service |
| Tesla world model | Tesla | Internal | Latent world model trained on Tesla fleet video | Multi view | Not publicly documented in detail; used for end to end driver training |
GAIA-2 is differentiated within this group primarily by the breadth and explicitness of its conditioning interface. Where many contemporaries condition only on a short text prompt or a single trajectory, GAIA-2 exposes structured controls over agents, road semantics, weather, time of day, country, and sensor rig, which is the property that makes it useful as a programmable source of edge cases rather than merely a video generator. Among publicly documented industrial systems it is also one of the largest pure driving specific world models, surpassed in 2025 only by Wayve's own GAIA-3 and by general purpose world foundation models such as Cosmos that are not exclusively oriented to autonomy.
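The practical role of GAIA-2 inside Wayve is to expand and stress the training and validation distributions of the company's end to end driving stack. Uses described by Wayve and in the technical report include augmenting real world fleet recordings with synthetic training data, regression testing driving models against curated edge cases, re rendering recorded scenes under different weather, lighting, or geography, and transferring data between vehicle camera rigs.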
Wayve presents these capabilities as a complement to, rather than a replacement for, real world fleet data.
GAIA-2 was widely covered in the autonomous driving and AI press in the days following its 26 March 2025 announcement. Coverage focused on three themes: the scale and quality of the surround video generations relative to GAIA-1, the practical value of authoring safety critical scenarios on demand, and the broader trend of large generative world models replacing classical rule based simulators in the autonomous vehicle industry. Independent commentators noted that the explicit, structured conditioning interface was a clearer step forward for industrial usability than raw image quality, and contrasted GAIA-2 favourably with research models whose only handle was a text prompt. The model was frequently discussed alongside NVIDIA's Cosmos world foundation model family, released earlier the same year, as evidence that 2025 had become the year in which large generative world models moved from research curiosities to core infrastructure for autonomy and embodied AI.
The GAIA-2 technical report is explicit about several limitations of the system as released.
GAIA-3 was positioned partly as a response to these limitations, with a larger model, more data, and an explicit focus on producing repeatable evaluation signals.