LINGO-2 (Wayve)
Last reviewed
May 16, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,262 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 16, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,262 words
Add missing citations, update stale details, or suggest a clearer explanation.
LINGO-2 is a closed-loop vision-language-action model for autonomous driving developed by the British self-driving company Wayve. Unveiled on 17 April 2024 as the successor to LINGO-1, LINGO-2 was described by Wayve as the first driving model trained on natural language to be tested on public roads. In a single end-to-end network it consumes camera video, optional route and speed conditioning, and free-form text prompts, and it produces both a planned driving trajectory and a continuous stream of natural language commentary that explains, in real time, what the model is doing and why.
LINGO-2 sits at the intersection of three research strands that converged in 2023 and 2024: end-to-end learned driving stacks of the kind Wayve has pursued since 2017, large multimodal language models that ground text in images and video, and the vision-language-action paradigm that emerged in robotics through systems such as RT-2 and later OpenVLA and π0. LINGO-2 is the most prominent attempt to date to apply that paradigm to passenger-vehicle driving rather than tabletop manipulation, and it remains tightly integrated with the rest of Wayve's research program around generative world models such as GAIA-2 and GAIA-3 and its photoreal neural simulator Ghost Gym.
Wayve was founded in Cambridge, United Kingdom in 2017 by Alex Kendall and Amar Shah to pursue an end-to-end learned approach to self-driving. Rather than composing a stack of hand-engineered perception, prediction, planning and control modules tied to high-definition maps, Wayve treats driving as a single deep-learning problem in which a neural network maps sensor input directly to driving action. By 2023 the company had moved from research demonstrations to extensive on-road testing in central London and other UK cities, and it was building a research portfolio around generative video world models and large foundation models for driving.
A recurring criticism of fully end-to-end driving models is that they are opaque: a network that maps pixels to steering and acceleration does not naturally explain why it chose a given action. Regulators, fleet operators and passengers all benefit from being able to interrogate a driving system after the fact, and engineers benefit from being able to debug failure cases by asking pointed questions about a scene. In September 2023 Wayve introduced LINGO-1, which it called an open-loop driving commentator: a vision-language model trained on synchronized video, expert driving actions and natural language commentary recorded by professional drivers in the UK. LINGO-1 could narrate a recorded drive, answer questions about what was happening in a scene, and explain why a particular action had been taken, but it did not itself drive the car.
LINGO-2 was conceived as the closed-loop continuation of that line of work. The motivation was twofold. First, by training a single model to jointly produce driving actions and language, Wayve hoped to bind explanations to the actual decision being made rather than to a separate retrospective narration. Second, by using language as both an input and an output, the same model could be instructed in natural language to change its behaviour, opening up a new interface for in-cabin interaction, fleet management and training-time supervision.
LINGO-2 is built from two principal components: the Wayve vision model and an auto-regressive language model. The vision model encodes a temporal sequence of camera frames into visual tokens. Those tokens are concatenated with text tokens drawn from a free-form natural language prompt and with conditioning variables that include the intended route and the current speed of the vehicle. The combined sequence is fed to the auto-regressive language model, which predicts two streams of output in a unified token vocabulary: a sequence of action tokens that describes the planned driving trajectory and a sequence of language tokens that forms the model's commentary on what it is doing.
The key architectural commitment is that action and language are produced by the same network, sharing the same intermediate representations. Wayve frames this as a way of tightening the alignment between what the model says and what the model does. A separate captioning head bolted on to a driving network would only describe its inputs, not its decisions; in LINGO-2 the language stream is generated from the same hidden state that yields the trajectory, so the model is in principle explaining its own behaviour rather than narrating someone else's.
The trajectory output is consumed by a downstream low-level controller that translates planned waypoints into throttle, brake and steering signals. The text output is exposed to operators, passengers and developers as a live driving commentary, and the same text channel is used to receive instructions and questions from a human. Because the language head is auto-regressive, the system can produce open-ended explanations rather than choosing from a fixed set of canned phrases, and it can answer questions about objects, hazards and counterfactuals that were not seen at training time.
Wayve has not publicly disclosed the parameter count of LINGO-2, the exact configuration of the vision backbone, or the corpus mix used to train the language model. The company has indicated that the model was trained on a combination of multimodal driving data collected by its fleet in the UK and broader vision-language data, and that the driving commentary protocol used by its expert drivers in the LINGO-1 dataset was extended for LINGO-2.
Wayve highlighted three primary capabilities for LINGO-2 at launch.
The first is driving commentary. As the car moves, the model produces a running natural language description of what it is doing and what it is paying attention to. Example utterances released by Wayve include statements such as "I'm slowing down because of the pedestrian crossing the road," "I'm overtaking a parked vehicle on the right," and "I'm stopping at the give-way line." Because the commentary is produced jointly with the trajectory, it tracks the model's evolving plan rather than narrating events after the fact.
The second is visual question answering during operation. Operators or passengers can pose ad-hoc questions to the model, such as queries about the colour of a traffic light, the presence of a particular type of road user or the reason for a manoeuvre. The model answers in natural language using the same vision tokens that informed its driving plan, providing a way to interrogate the system about the scene it is currently perceiving.
The third, and most novel relative to LINGO-1, is linguistic control of behaviour. LINGO-2 can be given short natural language instructions that influence what it does next. Wayve has demonstrated commands such as "pull over," "turn left," "turn right," "change lane" and "stop at the give-way line," and has shown clips in which the same junction is approached with different prompts and the car takes correspondingly different actions. In Wayve's framing, this opens a new training-time interface in which a supervisor can shape model behaviour by talking to it rather than by editing reward functions or labelling more data, and it suggests a future passenger interface in which a rider might ask the car to make a non-safety-critical change of plan.
The table below summarises how LINGO-1 and LINGO-2 differ in capability surface.
| Capability | LINGO-1 (Sep 2023) | LINGO-2 (Apr 2024) |
|---|---|---|
| Vision input | Yes, multi-frame camera | Yes, multi-frame camera |
| Language output (commentary) | Yes, open-loop narration of recorded clips | Yes, real-time commentary tied to planned actions |
| Visual question answering | Yes, on recorded clips | Yes, on live driving scene |
| Action output (driving) | No, commentary only | Yes, planned trajectory in same model |
| Language input as instruction | Limited to prompting questions | Yes, including behavioural commands such as "pull over" |
| Tested in closed-loop on public roads | No | Yes, central London |
| Role in stack | Diagnostic and explainability tool | Driving model with built-in commentary |
LINGO-2 was first validated in Wayve's neural simulator, Ghost Gym, before being deployed on a research vehicle for closed-loop testing on public roads in central London. Ghost Gym is a learned, photoreal 4D simulator that Wayve uses to rehearse driving policies in reconstructed real-world scenes, and it provides a controlled environment in which to probe how a model reacts to linguistic prompts and to corner cases.
The public-road demonstrations released by Wayve in April 2024 showed the same vehicle being driven by LINGO-2 along urban routes. The model was shown changing lanes, slowing down to follow traffic, passing parked buses, stopping at red traffic lights and pulling over in response to natural language commands, while simultaneously producing a stream of commentary. In one demonstration the car approached a junction multiple times with different prompts and took different actions on each pass, illustrating the linguistic conditioning of behaviour. Wayve described the result as the first time a closed-loop vision-language-action driving model had been tested on public roads.
Wayve was careful to characterise LINGO-2 as a research step rather than a deployable product. The company noted that quantitatively measuring how well the language commentary actually reflects the internal state of the driving policy is an open research problem, that the model can hallucinate in ways familiar from other large language models, and that language-conditioned control raises additional safety questions that have to be resolved before such an interface could be exposed to the public. The first deployment was confined to research vehicles operating under safety driver supervision.
LINGO-2 is best understood as both a research artefact and a template for future product capabilities. The use cases Wayve and external commentators have highlighted fall into several categories.
| Use case | Audience | What LINGO-2 contributes |
|---|---|---|
| Explainability and trust | Passengers, operators, regulators | Real-time natural language reasons for each manoeuvre |
| Engineering debugging | AV developers and safety teams | Ability to interrogate a model about scenes that produced unusual behaviour |
| Training-time supervision | Researchers and data scientists | Language prompts to shape behaviour without rewriting reward functions or relabelling data |
| Cabin interaction | Riders in future robotaxis | Ability to ask the car questions and issue non-safety-critical instructions |
| Fleet teleoperation support | Remote operators | Richer status descriptions when a vehicle requests guidance |
| Curriculum and corner-case generation | Research teams | Pairing a closed-loop VLA driver with generative world models such as GAIA-2 and GAIA-3 to rehearse rare scenarios |
The shared model design also matters for training efficiency. Because language is a flexible signal, supervisors can label intent or context with sentences rather than with bespoke ontologies, and the same network that drives can absorb that supervision. Wayve has argued that this is one of the ways an AI-first approach can scale faster than a stack that depends on hand-designed labels and high-definition maps.
LINGO-2 is one strand of a larger research portfolio at Wayve that pairs end-to-end driving foundation models with generative world models and a learned simulator.
The GAIA line of generative video world models is the most visible counterpart. GAIA-1, announced in 2023, demonstrated that a large generative model could synthesise realistic driving video conditioned on text and action prompts. GAIA-2, released in 2025, extended that work into a multi-camera, controllable world model that could be used to generate consistent driving scenes for training and evaluation. GAIA-3, announced later, pushed scale and controllability further and tightened the integration with Wayve's driving stack. LINGO-2 supplies the side of the picture that GAIA does not: where GAIA generates worlds, LINGO-2 generates actions and language inside a world.
Ghost Gym sits between the two. As a photoreal neural simulator, it allows a closed-loop driving policy such as LINGO-2 to be tested against reconstructed and synthesised scenes before going on the road. The combination of a controllable world generator, a controllable driving model with a language interface and a neural simulator is, in Wayve's framing, the substrate for a new kind of AV development loop in which both the environment and the driver can be specified in text.
The vision-language-action paradigm was popularised in robotics in 2023, when Google DeepMind published RT-2, a model that fine-tuned a large vision-language model on robot demonstration data and represented robot actions as discrete tokens that could be emitted by the same auto-regressive decoder as text. RT-2 was followed in 2024 by OpenVLA, a seven-billion-parameter open-source VLA from Stanford and collaborators trained on the Open X-Embodiment dataset, and by π0 from Physical Intelligence, which produced continuous robot actions at high frequency using a flow-matching action head on top of a vision-language backbone. LINGO-2 fits naturally into this lineage but is targeted at a different embodiment and a different problem domain.
The table below sketches the comparison.
| Model | Developer | Year | Embodiment | Action representation | Language role | Notes |
|---|---|---|---|---|---|---|
| RT-2 | Google DeepMind | 2023 | Tabletop manipulator | Discrete action tokens | Instruction following, web-knowledge transfer | First widely cited VLA; built on PaLI-X and PaLM-E |
| OpenVLA | Stanford and collaborators | 2024 | Multiple manipulators | Discrete action tokens | Instruction following | Open-source 7B model trained on Open X-Embodiment |
| π0 | Physical Intelligence | 2024 | Bi-manual manipulators and others | Continuous actions via flow matching | Instruction following | High-frequency control, generalist manipulation policy |
| LINGO-2 | Wayve | 2024 | Passenger car | Planned driving trajectory plus commentary tokens | Instruction following plus continuous commentary | Closed-loop driving on public roads in central London |
Several things stand out from this comparison. First, LINGO-2 was the earliest publicly demonstrated VLA driving a real road vehicle on public streets, rather than a manipulator in a lab or warehouse. Second, LINGO-2 places unusually heavy emphasis on the language output of the model, treating commentary as a first-class capability alongside action, while many robotic VLAs treat language primarily as an input and emit only actions. Third, LINGO-2 produces a planned trajectory at relatively low frequency that is then refined by a downstream controller, in contrast to high-frequency continuous-action policies such as π0, reflecting the longer horizons typical of driving compared to in-hand manipulation. Fourth, the embodiment matters: a manipulator failing a task generally drops an object, whereas a driving policy failing a task can injure someone, which is why LINGO-2 has been kept inside research vehicles with safety drivers.
LINGO-2 was widely covered in trade press and AI media when it was unveiled in April 2024, and it has since been cited as one of the canonical demonstrations of the vision-language-action approach outside robotic manipulation. Coverage in outlets including Automotive World, ADAS and Autonomous Vehicle International, Analytics India Magazine and Analytics Vidhya emphasised both the explainability angle and the unusual move of letting natural language steer driving behaviour. Several commentators positioned LINGO-2 as a step towards an AV stack in which a single foundation model handles perception, planning and dialogue, in contrast to the modular pipelines that have dominated the industry.
The research community responded by exploring related designs. Subsequent academic work, including the SimLingo paper presented in 2025, took inspiration from LINGO-2's closed-loop language-action alignment but pursued vision-only variants in simulation. Other groups have published vision-language driving models that focus on instruction following, scene captioning or driving question answering, and LINGO-2 is often cited as a reference point for the closed-loop language-action combination.
For Wayve, LINGO-2 was also a strategic signal. It coincided with the company's effort to position itself as a foundation-model lab for embodied AI, and it preceded a Series C funding round in May 2024 that valued Wayve in the multi-billion-dollar range with investment from SoftBank, Nvidia and Microsoft. The model was repeatedly cited in coverage of that round as evidence of the company's distinctive AI-first approach.
Wayve has been explicit about the limitations of LINGO-2 as a research system. The most fundamental is the alignment problem: although the model is trained to produce commentary that is consistent with its own driving actions, there is no guarantee that the natural language sentences faithfully reflect the internal computations that produced the trajectory. A model that has learned to say plausible things about driving could in principle issue a confident-sounding explanation that is not the real reason for a manoeuvre. Quantitatively measuring this alignment, and detecting cases where it fails, remains an open research problem.
A second limitation is hallucination. As an auto-regressive language model, LINGO-2 is subject to the same failure modes as other large language models, including the production of fluent but incorrect statements. In a driving context, a hallucinated commentary could mislead operators or passengers about what the car has perceived.
A third limitation concerns safety of language-conditioned control. Allowing a vehicle's behaviour to be modified by free-form text is powerful, but it raises questions about input validation, prompt injection and the safe boundaries of admissible instructions. Wayve restricted the language control demonstrated in LINGO-2 to constrained navigation prompts such as "pull over" or "turn left" and confined the experiments to research vehicles with safety drivers. Productising such an interface would require additional layers of policy enforcement, intent recognition and fallback behaviour.
Finally, generalisation remains a concern. LINGO-1 was trained largely on data from central London, and although Wayve has expanded its data collection, public demonstrations of LINGO-2 also focused on UK urban driving. How a language-conditioned closed-loop driving model behaves in markedly different driving cultures, road infrastructures and weather regimes is something the company has flagged for future work.
For most of the 2010s the dominant story about autonomous driving was about increasingly sophisticated modular stacks built on high-definition maps, lidar, and hand-engineered planners. By the early 2020s a counter-narrative emerged in which end-to-end learned policies, large pre-trained backbones and generative world models would replace much of that machinery. LINGO-2 is a concrete demonstration of how that counter-narrative might look when fused with the vision-language-action paradigm that emerged in robotics. It shows that a single neural network can both drive a car on a real street and explain itself in fluent English, and that the same model can be talked to as well as observed.
Whether LINGO-2 itself becomes a production technology or remains a research milestone is, as of mid-2026, an open question. What is clear is that it altered the terms of the conversation about explainable, instructable autonomous driving, and that subsequent driving foundation models from Wayve and others have been compared to it as a reference design.