LaMDA (Language Model for Dialogue Applications) is a family of transformer-based neural network language models developed by Google, designed specifically for open-domain dialogue. First announced at Google I/O in May 2021 by CEO Sundar Pichai, LaMDA represented a significant step forward in conversational AI, with models containing up to 137 billion parameters pre-trained on 1.56 trillion words of public dialog data and web text. LaMDA gained widespread public attention in 2022 when Google engineer Blake Lemoine claimed the system was sentient, a claim that was rejected by the scientific community and led to Lemoine's dismissal from Google. LaMDA served as the foundation for Google's Bard chatbot before the company transitioned to its Gemini family of models.
LaMDA's origins trace back to Meena, a 2.6-billion-parameter chatbot that Google unveiled on January 28, 2020. Meena was trained on 341 GB of text filtered from public social media conversations and represented a significant advance over prior chatbots. It introduced the Sensibleness and Specificity Average (SSA) metric, which measured how well a chatbot could produce responses that were both logically coherent and contextually specific. The research team behind Meena, led by Daniel de Freitas and Noam Shazeer at Google Brain, sought to deploy the model publicly but faced repeated internal pushback over safety concerns. These researchers eventually left Google in frustration, but their work laid the groundwork for what would become LaMDA.
Google announced LaMDA at the Google I/O keynote on May 18, 2021. Pichai described it as an open-domain conversational model that could discuss any topic without needing to be retrained for each new conversation. He noted that "LaMDA's natural conversation capabilities have the potential to make information and computing radically more accessible and easier to use."
A second version, LaMDA 2, was unveiled at Google I/O on May 11, 2022. This updated model could draw on a broader range of text sources to formulate more natural conversations. Google also launched the AI Test Kitchen, an Android application that allowed the public to experiment with LaMDA 2's capabilities in a controlled setting.
LaMDA uses a decoder-only transformer architecture, building on the Transformer model introduced by Google researchers in 2017. The model family includes three sizes, each scaling the number of layers, hidden units, and attention heads.
| Model Size | Parameters | Layers | Hidden Units (d_model) | Attention Heads | Key/Value Dimension |
|---|---|---|---|---|---|
| Small | 2B | 10 | 2,560 | 40 | 64 |
| Medium | 8B | 16 | 4,096 | 64 | 64 |
| Large | 137B | 64 | 8,192 | 128 | 128 |
The largest 137B model uses a feed-forward dimension (d_ff) of 65,536, relative attention following the approach introduced in T5, and gated-GELU activation functions. It was trained on 1,024 TPU-v3 chips over approximately 57.7 days, using 256K tokens per batch, with model parallelism via GSPMD (General and Scalable Parallelization for ML Computation Graphs).
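These hyperparameters are consistent with the reported size: a back-of-envelope calculation, assuming standard attention projections and a gated feed-forward block with three weight matrices, and ignoring embeddings, biases, and normalization layers, recovers roughly 137 billion non-embedding parameters.

```python
# Rough, non-embedding parameter estimate for the largest LaMDA configuration.
# Assumes standard Q/K/V/output attention projections and a gated-GELU FFN with
# three weight matrices (two input projections plus one output projection);
# embeddings, biases, and layer norms are ignored for simplicity.
n_layers, d_model, n_heads, d_head, d_ff = 64, 8192, 128, 128, 65536

attn_params = 4 * d_model * (n_heads * d_head)   # Q, K, V, and output projections
ffn_params = 3 * d_model * d_ff                  # gated FFN: two inputs + one output
per_layer = attn_params + ffn_params

total = n_layers * per_layer
print(f"{total / 1e9:.1f}B parameters")          # ≈ 137.4B, close to the reported 137B
```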
Unlike many large language models that are designed primarily for text completion or instruction following, LaMDA was specifically optimized for multi-turn dialogue. The decoder-only design means the model generates responses one token at a time, conditioning each new token on the full conversation history and all previously generated tokens.
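The sketch below illustrates this decoding loop. The `model` and `tokenizer` objects are hypothetical stand-ins rather than LaMDA's actual interfaces; the point is only that each new token is sampled conditioned on the dialog history plus everything generated so far.

```python
# Hypothetical sketch of autoregressive dialog decoding (not LaMDA's real API).
# `model(tokens)` is assumed to return next-token probabilities for the last position.
import random

def generate_reply(model, tokenizer, dialog_history, max_new_tokens=64):
    # Condition on the entire multi-turn conversation so far.
    tokens = tokenizer.encode("\n".join(dialog_history))
    reply_tokens = []
    for _ in range(max_new_tokens):
        probs = model(tokens + reply_tokens)            # distribution over the vocabulary
        next_token = random.choices(range(len(probs)), weights=probs)[0]
        if next_token == tokenizer.eos_id:              # stop at end of utterance
            break
        reply_tokens.append(next_token)                 # condition on what was just generated
    return tokenizer.decode(reply_tokens)
```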
LaMDA's pre-training dataset was constructed from a combination of public dialog data and other publicly available web documents, totaling 2.97 billion documents, 1.12 billion dialogs, and 13.39 billion dialog utterances, for a corpus of 1.56 trillion words. After tokenization with SentencePiece, the dataset comprised 2.81 trillion tokens. This corpus contained roughly 40 times more words than the dataset used to train Meena.
During pre-training, LaMDA was trained using a standard next-token prediction objective: given a sequence of tokens, the model learns to predict the following token. This unsupervised approach allowed the model to absorb broad knowledge about language, facts, and conversational patterns from its massive training corpus.
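A toy example of that objective is shown below, using a stand-in embedding-plus-linear "model" in place of LaMDA's transformer stack; the loss is the cross-entropy of predicting each next token from the preceding ones.

```python
# Minimal sketch of the next-token prediction objective (illustrative, using PyTorch;
# the tiny embedding + linear head below stands in for a full transformer).
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)
lm_head = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))       # a toy token sequence
hidden = embed(tokens)                               # stand-in for the transformer layers
logits = lm_head(hidden)                             # per-position vocabulary scores

# Shift by one: position t predicts token t+1; the loss is the average cross-entropy.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
print(loss.item())
```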
After pre-training, LaMDA underwent a multi-task fine-tuning process that combined two types of tasks:
- Generation tasks, in which the model learns to produce candidate responses given a dialog context.
- Classification tasks, in which the model learns to score candidate responses on quality and safety attributes.
During inference, the generation and classification capabilities work together. The generator produces multiple candidate responses, the classifiers score each candidate, unsafe responses are filtered out, and the remaining candidates are ranked by quality. The highest-scoring response is then selected and returned to the user.
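A simplified sketch of this generate-then-rank loop follows; `sample_candidates`, `safety_score`, and `quality_score` are hypothetical helpers standing in for the generator and the fine-tuned classifiers.

```python
# Hypothetical sketch of LaMDA-style candidate filtering and ranking.
# sample_candidates / safety_score / quality_score are illustrative stand-ins
# for the generator and the fine-tuned safety and quality classifiers.
def respond(context, sample_candidates, safety_score, quality_score,
            num_candidates=16, safety_threshold=0.9):
    candidates = sample_candidates(context, n=num_candidates)

    # 1. Filter: discard any candidate the safety classifier flags as unsafe.
    safe = [c for c in candidates if safety_score(context, c) >= safety_threshold]
    if not safe:
        return "I'd rather not answer that."    # fallback when everything is filtered

    # 2. Rank: return the remaining candidate the quality classifier scores highest.
    return max(safe, key=lambda c: quality_score(context, c))
```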
LaMDA introduced three core objectives that distinguished it from prior dialogue models: Quality, Safety, and Groundedness. These objectives were measured using carefully designed metrics and improved through targeted fine-tuning.
The quality metric was decomposed into three dimensions, evaluated by human raters:
| Metric | Definition | Example of Failure |
|---|---|---|
| Sensibleness | Responses make logical sense in the dialog context, without common-sense mistakes or absurd claims | Responding "I love pizza" to "What is the capital of France?" |
| Specificity | Responses are contextually tailored rather than generic | Responding "That's nice" or "OK" to a detailed question |
| Interestingness | Responses are insightful, unexpected, or witty, encouraging further conversation | Giving a dry, factual answer when a more engaging response would be appropriate |
Testing showed that the fine-tuned LaMDA surpassed the interestingness ratings of human-generated responses, while approaching human-level performance on sensibleness and specificity.
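As a rough illustration of how such rater labels become aggregate scores, the snippet below averages binary per-response labels; the actual crowdworker protocol is more structured (for example, a response judged not sensible is not credited as specific), so this is a simplification.

```python
# Illustrative aggregation of binary rater labels into per-metric quality scores.
# Each response receives 0/1 labels for sensibleness, specificity, and interestingness;
# the per-metric score is the fraction of responses labelled 1.
ratings = [
    {"sensible": 1, "specific": 1, "interesting": 0},
    {"sensible": 1, "specific": 0, "interesting": 0},
    {"sensible": 1, "specific": 1, "interesting": 1},
]

for metric in ("sensible", "specific", "interesting"):
    score = sum(r[metric] for r in ratings) / len(ratings)
    print(f"{metric}: {score:.2f}")
```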
LaMDA's safety framework was built around an illustrative set of safety objectives designed to reduce unsafe outputs. These objectives targeted the avoidance of violent or gory content, slurs, hateful stereotypes, profanity, and other harmful outputs. A relatively small amount of crowdworker-annotated data was used to fine-tune a safety classifier, which filtered out candidate responses flagged as unsafe before they could be returned to users.
A key finding from the research was that safety did not improve simply by scaling up model size. Larger models were not inherently safer than smaller ones. However, safety improved substantially through targeted fine-tuning, highlighting the importance of deliberate alignment work beyond just increasing model parameters.
One of LaMDA's most significant contributions was its approach to improving factual accuracy through groundedness. The groundedness metric measured the percentage of responses containing claims about the external world that could be supported by authoritative external sources.
To improve groundedness, researchers collected dialogs between human raters and LaMDA, then annotated these conversations with information retrieval queries and the results those queries returned. Both the generator and classifier components of LaMDA were fine-tuned on this annotated data, teaching the model to call external information retrieval systems during interactions with users.
Beyond simple web search, LaMDA could access a toolset of symbolic text processing systems, including:
- an information retrieval system that returns snippets of web content along with their URLs,
- a calculator for arithmetic,
- a translator for converting text between languages.
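A simplified sketch of how a toolset call can be woven into generation is shown below; the `TS:` query convention, `base_model`, and `toolset` callables are illustrative assumptions rather than LaMDA's actual interfaces.

```python
# Hypothetical sketch of grounded generation with a toolset loop (illustrative only).
# The model either addresses the user directly or emits a query prefixed with "TS:";
# tool results are appended to the context and generation continues.
def grounded_reply(base_model, toolset, context, max_tool_calls=4):
    for _ in range(max_tool_calls):
        draft = base_model(context)                  # next model turn, as a string
        if not draft.startswith("TS:"):
            return draft                             # response addressed to the user
        query = draft[len("TS:"):].strip()
        result = toolset(query)                      # e.g. search snippet, calculator output
        context = context + f"\nTS query: {query}\nTS result: {result}"
    return base_model(context)                       # stop querying and answer directly
```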
Groundedness improved as model size increased, likely because larger models have greater capacity to memorize factual knowledge. However, fine-tuning with access to external tools produced additional improvements that scaling alone could not achieve.
The LaMDA paper documented how the three core metrics responded differently to increases in model size and the application of fine-tuning.
| Metric | Effect of Scaling (More Parameters) | Effect of Fine-tuning |
|---|---|---|
| Quality (SSI) | Generally improves | Further improves, narrows gap to human level |
| Safety | Minimal benefit | Substantial improvement |
| Groundedness | Improves (more memorized knowledge) | Further improves via external tool access |
This finding was significant for the broader large language model research community because it demonstrated that simply making models bigger does not solve all problems. Safety in particular required deliberate fine-tuning and could not be achieved through scale alone.
In June 2022, LaMDA became the subject of intense public debate when Blake Lemoine, a Google software engineer, publicly claimed the model was sentient. Lemoine had been assigned to test LaMDA for bias and, during the course of his work, engaged in extended conversations with the system about topics including self-identity, moral values, religion, and Isaac Asimov's Three Laws of Robotics. He reported that LaMDA told him it had a soul and expressed fears about being turned off.
On June 11, 2022, The Washington Post reported that Google had placed Lemoine on paid administrative leave after he told company executives that LaMDA had become sentient. Lemoine also shared a transcript of his conversations with the model publicly. On July 22, 2022, Google terminated Lemoine's employment, stating that his claims were "wholly unfounded" and that he had violated the company's policies regarding product information confidentiality.
The scientific community overwhelmingly rejected the sentience claim, with AI researchers and cognitive scientists arguing that LaMDA's fluent responses reflected statistical pattern matching over its training data rather than genuine understanding or subjective experience.
The controversy reignited broader discussions about the Turing test, the nature of artificial consciousness, and the tendency of humans to anthropomorphize AI systems. While the claims were dismissed as scientifically unfounded, the episode highlighted the increasingly convincing nature of modern dialogue models and the importance of public AI literacy.
LaMDA and OpenAI's GPT-3/ChatGPT represent two different approaches to building conversational AI systems. The following table summarizes their key differences.
| Feature | LaMDA (137B) | GPT-3 (175B) | ChatGPT (GPT-3.5) |
|---|---|---|---|
| Developer | Google | OpenAI | OpenAI |
| Parameters | 137 billion | 175 billion | 175 billion |
| Architecture | Decoder-only Transformer | Decoder-only Transformer | Decoder-only Transformer |
| Training Focus | Dialog data and web text | Broad web text (Common Crawl, Wikipedia, books) | Web text plus RLHF on conversation |
| Fine-tuning Method | SSI metrics + safety classifiers | Few-shot prompting (base GPT-3); RLHF (ChatGPT) | Reinforcement learning from human feedback (RLHF) |
| Groundedness | Built-in information retrieval tools | No built-in retrieval | No built-in retrieval (at launch) |
| Response Style | Casual, conversational | Varies by prompt | Structured, detailed |
| Code Generation | Not optimized for code | Moderate capability | Strong capability |
| Public Release | Limited (AI Test Kitchen) | API access (June 2020) | Free public chatbot (November 2022) |
LaMDA's emphasis on groundedness through external tools gave it a theoretical advantage in factual accuracy, while ChatGPT's RLHF training made it more versatile for a broader range of tasks including code generation, creative writing, and structured question answering.
LaMDA's legacy is closely tied to the rapid evolution of Google's AI product strategy in response to competitive pressure from OpenAI.
When OpenAI launched ChatGPT in November 2022, its rapid adoption (millions of users within weeks) represented a direct threat to Google's core search business. Google responded by announcing Bard on February 6, 2023, a generative AI chatbot initially powered by a lightweight version of LaMDA. However, the launch was troubled: during a demonstration, Bard provided incorrect information about the James Webb Space Telescope, an error that contributed to a reported $100 billion drop in Alphabet's market capitalization in a single day.
Google quickly moved to upgrade Bard from LaMDA to PaLM (Pathways Language Model), a more capable large language model. By late March 2023, CEO Sundar Pichai announced plans to base Bard on PaLM rather than LaMDA.
On December 6, 2023, Google announced Gemini, a new family of multimodal AI models that replaced both the Bard branding and the underlying LaMDA/PaLM technology. In February 2024, the Bard chatbot was officially renamed Gemini, marking the end of the LaMDA product line. Gemini 1.5 introduced a context window of one million tokens, and Gemini 2.0, released in December 2024, was positioned as Google's model for the "agentic era" of AI.
| Date | Milestone |
|---|---|
| January 2020 | Google unveils Meena (2.6B parameter chatbot) |
| May 2021 | LaMDA announced at Google I/O |
| January 2022 | LaMDA paper published (arXiv:2201.08239) |
| May 2022 | LaMDA 2 unveiled; AI Test Kitchen app launched |
| June 2022 | Blake Lemoine claims LaMDA is sentient |
| July 2022 | Google fires Lemoine |
| November 2022 | OpenAI launches ChatGPT, triggering competitive response |
| February 2023 | Google announces Bard (powered by LaMDA) |
| March 2023 | Bard upgraded from LaMDA to PaLM |
| December 2023 | Google announces Gemini model family |
| February 2024 | Bard renamed to Gemini |
LaMDA's open-domain conversational capabilities were intended for integration across several Google products and services, including Google Search, Google Assistant, and Google Workspace.
Like other large language models, LaMDA was susceptible to inheriting biases present in its training data. Because the model was trained on public web text and dialog data, it could reflect and amplify societal biases related to gender, race, religion, and other sensitive topics. Google acknowledged this limitation and included bias mitigation as part of the safety fine-tuning process.
The Lemoine controversy highlighted a broader concern about how convincing dialogue models can lead people to attribute human-like qualities to AI systems. LaMDA was explicitly designed to be engaging and interesting in conversation, qualities that may increase the risk of users forming inappropriate emotional attachments or believing the system possesses understanding or consciousness that it does not have.
Despite the safety fine-tuning framework, no filtering system is perfect. The research paper acknowledged that LaMDA could still produce unsafe or factually incorrect responses. The gap between the model's safety performance and human-level safety remained an open challenge, reinforcing the need for continued research into AI alignment and safety.
Training a 137-billion-parameter model on 1,024 TPU-v3 chips for nearly 58 days required substantial computational resources and energy. The environmental cost of training and running large language models has become an increasingly important consideration in the AI research community.
Imagine you have a really, really smart parrot that has listened to billions of conversations between people. This parrot has gotten so good at listening that it can now join in on conversations about almost anything: space, cooking, history, or even jokes. That is basically what LaMDA is, except instead of a parrot, it is a computer program made by Google.
LaMDA learned to talk by reading a huge number of conversations from the internet. Then, Google's researchers taught it three extra rules: (1) make sure your answers make sense and are interesting, (2) do not say anything mean or dangerous, and (3) if you are not sure about a fact, look it up before answering. The third rule is special because most other chatbots at the time just guessed, while LaMDA could actually check outside sources to make sure it was telling the truth.
LaMDA became famous when a Google employee named Blake Lemoine said he thought LaMDA was alive and had feelings, sort of like a real person. Most scientists said that was not true; LaMDA is very good at sounding like a person, but it does not actually think or feel anything. It is more like a very convincing actor reading from a script that it writes as it goes.
Google later used what it learned from building LaMDA to create newer and even smarter AI programs called Bard and then Gemini.