LINGO-2 (Wayve)

AI Models Autonomous Vehicles Multimodal AI

20 min read

Updated Jun 28, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 28, 2026

Fact-checked

In review queue

Sources

12 citations

Revision

v2 · 3,925 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

LINGO-2 is a closed-loop vision-language-action model for autonomous driving developed by the British self-driving company Wayve. Announced on 17 April 2024, LINGO-2 is the first driving model trained on natural language to drive a car on public roads while simultaneously explaining its decisions in real time. In a single end-to-end network it consumes camera video, optional route and speed conditioning, and free-form text prompts, and it produces both a planned driving trajectory and a continuous stream of natural language commentary that describes what the model is doing and why. ^[1]^[2]

Wayve describes LINGO-2 as "the first closed-loop vision-language-action driving model (VLAM) tested on public roads," and stresses the unified design: "The same deep learning model generates the driving behavior and textual predictions in real-time." ^[2] LINGO-2 is the closed-loop successor to LINGO-1, an "open-loop driving commentator" that Wayve released in September 2023, which could narrate and answer questions about a drive but could not itself control the vehicle. ^[3]^[4]

LINGO-2 sits at the intersection of three research strands that converged in 2023 and 2024: end-to-end learned driving stacks of the kind Wayve has pursued since 2017, large multimodal language models that ground text in images and video, and the vision-language-action paradigm that emerged in robotics through systems such as RT-2 and later OpenVLA and π0. ^[10] LINGO-2 is the most prominent attempt to date to apply that paradigm to passenger-vehicle driving rather than tabletop manipulation, and it remains tightly integrated with the rest of Wayve's research program around generative world models such as GAIA-2 and GAIA-3 and its photoreal neural simulator Ghost Gym. ^[2]

What is LINGO-2?

LINGO-2 is a single neural network that both drives a car and explains its driving in plain English. ^[2] At each moment it ingests a sequence of camera frames plus optional conditioning (intended route, current speed, speed limit) and free-form text, and it outputs two things in a shared token vocabulary: a planned driving trajectory that a downstream controller turns into steering, throttle and brake, and a running natural language commentary on its own behaviour. ^[1]^[2] Because both outputs are produced by the same model from the same internal representation, Wayve positions the commentary as an explanation of the model's actual decision rather than a separate after-the-fact narration. ^[2]

The headline claim at launch was that LINGO-2 was "the first closed-loop vision-language-action driving model (VLAM) tested on public roads." ^[2] Closed-loop here means the model's outputs feed back into the world (the car actually moves according to the trajectory it predicts), in contrast to the open-loop LINGO-1, which only commented on drives it did not control. ^[3]^[4]

Background and motivation

Wayve was founded in Cambridge, United Kingdom in 2017 by Alex Kendall and Amar Shah to pursue an end-to-end learned approach to self-driving. Rather than composing a stack of hand-engineered perception, prediction, planning and control modules tied to high-definition maps, Wayve treats driving as a single deep-learning problem in which a neural network maps sensor input directly to driving action. By 2023 the company had moved from research demonstrations to extensive on-road testing in central London and other UK cities, and it was building a research portfolio around generative video world models and large foundation models for driving. ^[2]

A recurring criticism of fully end-to-end driving models is that they are opaque: a network that maps pixels to steering and acceleration does not naturally explain why it chose a given action. Regulators, fleet operators and passengers all benefit from being able to interrogate a driving system after the fact, and engineers benefit from being able to debug failure cases by asking pointed questions about a scene. In September 2023 Wayve introduced LINGO-1, which it called an open-loop driving commentator: a vision-language model trained on synchronized video, expert driving actions and natural language commentary recorded by professional drivers in the UK. ^[3]^[4] LINGO-1 could narrate a recorded drive, answer questions about what was happening in a scene, and explain why a particular action had been taken, but it did not itself drive the car; on Wayve's own evaluation it reached "around 60% accurate compared to human-level performance" on its question-answering benchmark. ^[4]

LINGO-2 was conceived as the closed-loop continuation of that line of work. The motivation was twofold. First, by training a single model to jointly produce driving actions and language, Wayve hoped to bind explanations to the actual decision being made rather than to a separate retrospective narration. Second, by using language as both an input and an output, the same model could be instructed in natural language to change its behaviour, opening up a new interface for in-cabin interaction, fleet management and training-time supervision. ^[2]

How does LINGO-2 work?

LINGO-2 is built from two principal components: the Wayve vision model and an auto-regressive language model. ^[1]^[2] The vision model encodes a temporal sequence of camera frames into visual tokens. Those tokens are concatenated with text tokens drawn from a free-form natural language prompt and with conditioning variables that include the intended route, the current speed and the speed limit. ^[2] The combined sequence is fed to the auto-regressive language model, which predicts two streams of output in a unified token vocabulary: a sequence of action tokens that describes the planned driving trajectory and a sequence of language tokens that forms the model's commentary on what it is doing. ^[1]^[2]

The key architectural commitment is that action and language are produced by the same network, sharing the same intermediate representations: in Wayve's words, "The same deep learning model generates the driving behavior and textual predictions in real-time." ^[2] Wayve frames this as a way of tightening the alignment between what the model says and what the model does. A separate captioning head bolted on to a driving network would only describe its inputs, not its decisions; in LINGO-2 the language stream is generated from the same hidden state that yields the trajectory, so the model is in principle explaining its own behaviour rather than narrating someone else's. ^[2]

The trajectory output is consumed by a downstream low-level controller that translates planned waypoints into throttle, brake and steering signals. The text output is exposed to operators, passengers and developers as a live driving commentary, and the same text channel is used to receive instructions and questions from a human. Because the language head is auto-regressive, the system can produce open-ended explanations rather than choosing from a fixed set of canned phrases, and it can answer questions about objects, hazards and counterfactuals that were not seen at training time. ^[2]

Wayve has not publicly disclosed the parameter count of LINGO-2, the exact configuration of the vision backbone, or the corpus mix used to train the language model. The company has indicated that the model was trained on a combination of multimodal driving data collected by its fleet in the UK and broader vision-language data, and that the driving commentary protocol used by its expert drivers in the LINGO-1 dataset was extended for LINGO-2. ^[2]^[3]

What can LINGO-2 do?

Wayve highlighted three primary capabilities for LINGO-2 at launch. ^[2]

The first is driving commentary. As the car moves, the model produces a running natural language description of what it is doing and what it is paying attention to. Example utterances released by Wayve include statements such as "I'm slowing down because of the pedestrian crossing the road," "I'm overtaking a parked vehicle on the right," and "I'm stopping at the give-way line." ^[2] Because the commentary is produced jointly with the trajectory, it tracks the model's evolving plan rather than narrating events after the fact.

The second is visual question answering during operation. Operators or passengers can pose ad-hoc questions to the model, such as queries about the colour of a traffic light, the presence of a particular type of road user or the reason for a manoeuvre. The model answers in natural language using the same vision tokens that informed its driving plan, providing a way to interrogate the system about the scene it is currently perceiving. ^[2]

The third, and most novel relative to LINGO-1, is linguistic control of behaviour. LINGO-2 can be given short natural language instructions that influence what it does next. Wayve has demonstrated commands such as "pull over," "turn left," "turn right," "change lane" and "stop at the give-way line," and has shown clips in which the same junction is approached with different prompts and the car takes correspondingly different actions. ^[2] In Wayve's framing, this opens a new training-time interface in which a supervisor can shape model behaviour by talking to it rather than by editing reward functions or labelling more data, and it suggests a future passenger interface in which a rider might ask the car to make a non-safety-critical change of plan. ^[2]

The table below summarises how LINGO-1 and LINGO-2 differ in capability surface.

Capability	LINGO-1 (Sep 2023)	LINGO-2 (Apr 2024)
Vision input	Yes, multi-frame camera	Yes, multi-frame camera
Language output (commentary)	Yes, open-loop narration of recorded clips	Yes, real-time commentary tied to planned actions
Visual question answering	Yes, on recorded clips	Yes, on live driving scene
Action output (driving)	No, commentary only	Yes, planned trajectory in same model
Language input as instruction	Limited to prompting questions	Yes, including behavioural commands such as "pull over"
Tested in closed-loop on public roads	No	Yes, central London
Role in stack	Diagnostic and explainability tool	Driving model with built-in commentary

How is LINGO-2 different from LINGO-1?

The central difference is the loop. LINGO-1, released in September 2023, was an "open-loop driving commentator": it watched recorded driving video and produced commentary and answers, but it never controlled a car. ^[3]^[4] LINGO-2, announced in April 2024, is closed-loop: the same network that explains the drive also produces the trajectory the car follows, so language and action are generated jointly in real time. ^[2]

Three practical consequences follow. First, LINGO-2 adds an action output that LINGO-1 lacked entirely, turning a diagnostic tool into a driving model. ^[2] Second, LINGO-2 turns language into a control input as well as an output: behavioural commands such as "pull over" or "turn left" can change what the car does, which was not a capability of LINGO-1. ^[2] Third, because LINGO-2 generates commentary from the same hidden state that yields the trajectory, its explanations are tied to its own decision rather than to a third party's drive, which Wayve argues makes the explanation more faithful in principle. ^[2] LINGO-1's reported ceiling of "around 60% accurate compared to human-level performance" on its question-answering benchmark also set the baseline that the LINGO-2 line of work aimed to improve on. ^[4]

When and where was LINGO-2 tested?

LINGO-2 was first validated in Wayve's neural simulator, Ghost Gym, before being deployed on a research vehicle for closed-loop testing on public roads in central London. ^[2] Ghost Gym is a learned, photoreal 4D simulator that Wayve uses to rehearse driving policies in reconstructed real-world scenes, and it provides a controlled environment in which to probe how a model reacts to linguistic prompts and to corner cases. ^[2]

The public-road demonstrations released by Wayve in April 2024 showed the same vehicle being driven by LINGO-2 along urban routes. ^[2] The model was shown changing lanes, slowing down to follow traffic, passing parked buses, stopping at red traffic lights and pulling over in response to natural language commands, while simultaneously producing a stream of commentary. In one demonstration the car approached a junction multiple times with different prompts and took different actions on each pass, illustrating the linguistic conditioning of behaviour. ^[2] Wayve described the result as the first time a closed-loop vision-language-action driving model had been tested on public roads. ^[2]^[6]

Wayve was careful to characterise LINGO-2 as a research step rather than a deployable product. The company noted that quantitatively measuring how well the language commentary actually reflects the internal state of the driving policy is an open research problem, that the model can hallucinate in ways familiar from other large language models, and that language-conditioned control raises additional safety questions that have to be resolved before such an interface could be exposed to the public. ^[2] The first deployment was confined to research vehicles operating under safety driver supervision.

What is LINGO-2 used for?

LINGO-2 is best understood as both a research artefact and a template for future product capabilities. The use cases Wayve and external commentators have highlighted fall into several categories. ^[2]^[6]

Use case	Audience	What LINGO-2 contributes
Explainability and trust	Passengers, operators, regulators	Real-time natural language reasons for each manoeuvre
Engineering debugging	AV developers and safety teams	Ability to interrogate a model about scenes that produced unusual behaviour
Training-time supervision	Researchers and data scientists	Language prompts to shape behaviour without rewriting reward functions or relabelling data
Cabin interaction	Riders in future robotaxis	Ability to ask the car questions and issue non-safety-critical instructions
Fleet teleoperation support	Remote operators	Richer status descriptions when a vehicle requests guidance
Curriculum and corner-case generation	Research teams	Pairing a closed-loop VLA driver with generative world models such as GAIA-2 and GAIA-3 to rehearse rare scenarios

The shared model design also matters for training efficiency. Because language is a flexible signal, supervisors can label intent or context with sentences rather than with bespoke ontologies, and the same network that drives can absorb that supervision. ^[2] Wayve has argued that this is one of the ways an AI-first approach can scale faster than a stack that depends on hand-designed labels and high-definition maps. ^[2]

How does LINGO-2 fit into Wayve's broader research program?

LINGO-2 is one strand of a larger research portfolio at Wayve that pairs end-to-end driving foundation models with generative world models and a learned simulator. ^[2]

The GAIA line of generative video world models is the most visible counterpart. GAIA-1, announced in 2023, demonstrated that a large generative model could synthesise realistic driving video conditioned on text and action prompts. GAIA-2, released in 2025, extended that work into a multi-camera, controllable world model that could be used to generate consistent driving scenes for training and evaluation. GAIA-3, announced later, pushed scale and controllability further and tightened the integration with Wayve's driving stack. LINGO-2 supplies the side of the picture that GAIA does not: where GAIA generates worlds, LINGO-2 generates actions and language inside a world.

Ghost Gym sits between the two. As a photoreal neural simulator, it allows a closed-loop driving policy such as LINGO-2 to be tested against reconstructed and synthesised scenes before going on the road. ^[2] The combination of a controllable world generator, a controllable driving model with a language interface and a neural simulator is, in Wayve's framing, the substrate for a new kind of AV development loop in which both the environment and the driver can be specified in text.

How does LINGO-2 compare to vision-language-action models in robotics?

The vision-language-action paradigm was popularised in robotics in 2023, when Google DeepMind published RT-2, a model that fine-tuned a large vision-language model on robot demonstration data and represented robot actions as discrete tokens that could be emitted by the same auto-regressive decoder as text. ^[10] RT-2 was followed in 2024 by OpenVLA, a seven-billion-parameter open-source VLA from Stanford and collaborators trained on the Open X-Embodiment dataset, and by π0 from Physical Intelligence, which produced continuous robot actions at high frequency using a flow-matching action head on top of a vision-language backbone. LINGO-2 fits naturally into this lineage but is targeted at a different embodiment and a different problem domain.

The table below sketches the comparison.

Model	Developer	Year	Embodiment	Action representation	Language role	Notes
RT-2	Google DeepMind	2023	Tabletop manipulator	Discrete action tokens	Instruction following, web-knowledge transfer	First widely cited VLA; built on PaLI-X and PaLM-E
OpenVLA	Stanford and collaborators	2024	Multiple manipulators	Discrete action tokens	Instruction following	Open-source 7B model trained on Open X-Embodiment
π0	Physical Intelligence	2024	Bi-manual manipulators and others	Continuous actions via flow matching	Instruction following	High-frequency control, generalist manipulation policy
LINGO-2	Wayve	2024	Passenger car	Planned driving trajectory plus commentary tokens	Instruction following plus continuous commentary	Closed-loop driving on public roads in central London

Several things stand out from this comparison. First, LINGO-2 was the earliest publicly demonstrated VLA driving a real road vehicle on public streets, rather than a manipulator in a lab or warehouse. ^[2]^[6] Second, LINGO-2 places unusually heavy emphasis on the language output of the model, treating commentary as a first-class capability alongside action, while many robotic VLAs treat language primarily as an input and emit only actions. Third, LINGO-2 produces a planned trajectory at relatively low frequency that is then refined by a downstream controller, in contrast to high-frequency continuous-action policies such as π0, reflecting the longer horizons typical of driving compared to in-hand manipulation. Fourth, the embodiment matters: a manipulator failing a task generally drops an object, whereas a driving policy failing a task can injure someone, which is why LINGO-2 has been kept inside research vehicles with safety drivers. ^[2]

Reception and significance

LINGO-2 was widely covered in trade press and AI media when it was unveiled in April 2024, and it has since been cited as one of the canonical demonstrations of the vision-language-action approach outside robotic manipulation. Coverage in outlets including Automotive World, ADAS and Autonomous Vehicle International, Analytics India Magazine and Analytics Vidhya emphasised both the explainability angle and the unusual move of letting natural language steer driving behaviour. ^[6]^[7]^[8]^[9] Several commentators positioned LINGO-2 as a step towards an AV stack in which a single foundation model handles perception, planning and dialogue, in contrast to the modular pipelines that have dominated the industry.

The research community responded by exploring related designs. Subsequent academic work, including the SimLingo paper presented in 2025, took inspiration from LINGO-2's closed-loop language-action alignment but pursued vision-only variants in simulation. ^[12] Other groups have published vision-language driving models that focus on instruction following, scene captioning or driving question answering, and LINGO-2 is often cited as a reference point for the closed-loop language-action combination.

For Wayve, LINGO-2 was also a strategic signal. It coincided with the company's effort to position itself as a foundation-model lab for embodied AI, and it preceded a Series C funding round in May 2024 that valued Wayve in the multi-billion-dollar range with investment from SoftBank, Nvidia and Microsoft. The model was repeatedly cited in coverage of that round as evidence of the company's distinctive AI-first approach.

What are the limitations of LINGO-2?

Wayve has been explicit about the limitations of LINGO-2 as a research system. The most fundamental is the alignment problem: although the model is trained to produce commentary that is consistent with its own driving actions, there is no guarantee that the natural language sentences faithfully reflect the internal computations that produced the trajectory. ^[2] A model that has learned to say plausible things about driving could in principle issue a confident-sounding explanation that is not the real reason for a manoeuvre. Quantitatively measuring this alignment, and detecting cases where it fails, remains an open research problem.

A second limitation is hallucination. As an auto-regressive language model, LINGO-2 is subject to the same failure modes as other large language models, including the production of fluent but incorrect statements. ^[2] In a driving context, a hallucinated commentary could mislead operators or passengers about what the car has perceived.

A third limitation concerns safety of language-conditioned control. Allowing a vehicle's behaviour to be modified by free-form text is powerful, but it raises questions about input validation, prompt injection and the safe boundaries of admissible instructions. Wayve restricted the language control demonstrated in LINGO-2 to constrained navigation prompts such as "pull over" or "turn left" and confined the experiments to research vehicles with safety drivers. ^[2] Productising such an interface would require additional layers of policy enforcement, intent recognition and fallback behaviour.

Finally, generalisation remains a concern. LINGO-1 was trained largely on data from central London, and although Wayve has expanded its data collection, public demonstrations of LINGO-2 also focused on UK urban driving. ^[3] How a language-conditioned closed-loop driving model behaves in markedly different driving cultures, road infrastructures and weather regimes is something the company has flagged for future work.

Significance

For most of the 2010s the dominant story about autonomous driving was about increasingly sophisticated modular stacks built on high-definition maps, lidar, and hand-engineered planners. By the early 2020s a counter-narrative emerged in which end-to-end learned policies, large pre-trained backbones and generative world models would replace much of that machinery. LINGO-2 is a concrete demonstration of how that counter-narrative might look when fused with the vision-language-action paradigm that emerged in robotics. It shows that a single neural network can both drive a car on a real street and explain itself in fluent English, and that the same model can be talked to as well as observed. ^[2]

Whether LINGO-2 itself becomes a production technology or remains a research milestone is, as of mid-2026, an open question. What is clear is that it altered the terms of the conversation about explainable, instructable autonomous driving, and that subsequent driving foundation models from Wayve and others have been compared to it as a reference design.

ELI5: LINGO-2 explained simply

Imagine a self-driving car that does two things at once. It steers itself down a real street, and at the same time it talks out loud about what it is doing, like a friendly driving instructor: "I'm slowing down because someone is crossing the road." Older versions of this idea (LINGO-1) could only talk about driving videos; they could not actually drive. LINGO-2 is the first version from Wayve that both drives the car on a public road and explains itself in normal sentences, and you can even tell it things like "pull over" and it will try to do them. ^[2]^[4]

References

Wayve. "Driving with Language: Introducing Wayve's Multimodal Driving Model LINGO-2." Press release, 17 April 2024. https://wayve.ai/press/introducing-lingo-2/ ↩
Wayve. "LINGO-2: Driving with Natural Language." Research blog, 17 April 2024. https://wayve.ai/thinking/lingo-2-driving-with-language/ ↩
Wayve. "Robot car talk: Introducing Wayve's new AI model LINGO-1." Press release, 14 September 2023. https://wayve.ai/press/introducing_lingo1/ ↩
Wayve. "LINGO-1: Exploring Natural Language for Autonomous Driving." Research blog, 14 September 2023. https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/ ↩
Wayve. "LINGO: Advancements in AI Explainability for Self-Driving Vehicles." Science overview. https://wayve.ai/science/lingo/
MarkLines. "Wayve, UK unveils LINGO-2, first vision-language-action model tested on public roads." April 2024. https://www.marklines.com/en/news/306594 ↩
ADAS and Autonomous Vehicle International. "Wayve launches multimodal driving model Lingo-2." April 2024. https://www.autonomousvehicleinternational.com/news/ai-sensor-fusion/wayve-launches-multimodal-driving-model-lingo-2.html ↩
Automotive World. "Driving with Language: Introducing Wayve's Multimodal Driving Model LINGO-2." April 2024. https://www.automotiveworld.com/news-releases/driving-with-language-introducing-wayves-multimodal-driving-model-lingo-2/ ↩
Analytics Vidhya. "Wayve Lingo-2: Closed-loop Vision-Language-Action Driving Model." April 2024. https://www.analyticsvidhya.com/blog/2024/04/wayves-lingo-redefines-autonomous-vehicles-with-the-power-of-speech/ ↩
Anthony Brohan et al. "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control." arXiv:2307.15818, 2023. https://arxiv.org/abs/2307.15818 ↩
Wikipedia. "Vision-language-action model." https://en.wikipedia.org/wiki/Vision-language-action_model
Renz et al. "SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment." arXiv:2503.09594, 2025. https://arxiv.org/html/2503.09594v1 ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

1 revision by 1 contributors · full history

Suggest edit

What links here

Cruise (self-driving)GAIA-2 (Wayve)GAIA-3 (Wayve)Wayve

What is LINGO-2?

Background and motivation

How does LINGO-2 work?

What can LINGO-2 do?

How is LINGO-2 different from LINGO-1?

When and where was LINGO-2 tested?

What is LINGO-2 used for?

How does LINGO-2 fit into Wayve's broader research program?

How does LINGO-2 compare to vision-language-action models in robotics?

Reception and significance

What are the limitations of LINGO-2?

Significance

ELI5: LINGO-2 explained simply

See also

References

Improve this article

Related Articles

GAIA-3 (Wayve)

GAIA-2 (Wayve)

NVIDIA Alpamayo 2 Super

Document Question Answering Models

Feature Extraction Models

SmolVLA

What links here

Related Articles

GAIA-3 (Wayve)

GAIA-2 (Wayve)

NVIDIA Alpamayo 2 Super

Document Question Answering Models

Feature Extraction Models

SmolVLA

What links here