Conversational Models
Last reviewed
May 11, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 ยท 2,495 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 11, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 ยท 2,495 words
Add missing citations, update stale details, or suggest a clearer explanation.
See also: Natural Language Processing Models and Tasks
Conversational models are computational systems designed to carry on a dialogue with human users in natural language. The category covers everything from early pattern matching scripts such as ELIZA to modern instruction tuned large language models such as ChatGPT, Claude, and Gemini. The field overlaps with dialogue systems research, with agent work in AI, and with consumer assistants such as Siri and Alexa.
A dialogue system is software that conducts a back and forth exchange with a user across multiple turns. A chatbot is a dialogue system whose primary interface is text or voice chat. Conversational AI is an industry umbrella term for the language understanding, dialogue management, and language generation components that produce coherent multi turn responses.
Researchers split conversational models into two families. Task oriented dialogue systems help a user complete a specific goal such as booking a flight, tracking explicit slot values using a dialogue state tracker. Open domain or chitchat systems aim to hold an engaging conversation across arbitrary topics with no fixed task. Modern instruction tuned LLMs blur the boundary by handling both. The Turing test proposed by Alan Turing in 1950 used a conversational setup as a thought experiment for machine intelligence and still shapes how the public reads progress.
The first widely known chatbot, ELIZA, was written by Joseph Weizenbaum at MIT between 1964 and 1967. Its DOCTOR script imitated a Rogerian psychotherapist using around 200 lines of pattern matching code in the MAD-SLIP language on an IBM 7094 mainframe. Many users, including Weizenbaum's own secretary, attributed real understanding to the program, an effect later named the ELIZA effect. Weizenbaum himself was disturbed by the reaction and argued in his 1976 book Computer Power and Human Reason against delegating sensitive human tasks to machines.
In 1972 the Stanford psychiatrist Kenneth Colby released PARRY, a rule based program that simulated a patient with paranoid schizophrenia. PARRY held the first chatbot to chatbot conversation when it was connected to ELIZA over the ARPANET in 1972. A.L.I.C.E., the Artificial Linguistic Internet Computer Entity, was created by Richard Wallace on November 23, 1995 and used a custom XML based language called AIML to express thousands of pattern response rules; it won the Loebner Prize in 2000, 2001, and 2004. Cleverbot, launched on the web in 1997 by Rollo Carpenter, learned from accumulated user transcripts rather than from a hand authored script.
IBM's question answering system Watson won the Jeopardy! exhibition match in February 2011 against champions Ken Jennings and Brad Rutter, showing that statistical retrieval pipelines could handle open ended natural language questions. Apple released Siri on the iPhone 4S in October 2011, building on SRI International's CALO project. Amazon followed with Alexa on the Echo speaker in 2014, Microsoft launched Cortana the same year, and Google introduced Google Assistant in 2016. These products combined automatic speech recognition, intent classification, slot filling, and template based response generation. In 2015 Oriol Vinyals and Quoc Le of Google published A Neural Conversational Model, applying the sequence to sequence framework to dialogue and showing that a single recurrent network trained on subtitle and IT helpdesk corpora could produce passable open domain responses without hand authored rules.
The arrival of the transformer architecture and pretrained models such as GPT-2 reset expectations for open domain chatbots. Microsoft Research released DialoGPT in November 2019, a GPT-2 response generator trained on 147 million Reddit comment exchanges from 2005 to 2017, up to 762 million parameters. Google followed in January 2020 with Meena, a 2.6 billion parameter Evolved Transformer trained on 40 billion words of public social media conversation; it introduced the Sensibleness and Specificity Average metric and scored 79 percent SSA versus 86 percent for humans.
Facebook AI Research published Recipes for building an open domain chatbot by Stephen Roller and colleagues in April 2020, releasing the BlenderBot family in 90M, 2.7B, and 9.4B parameter sizes through the ParlAI framework. BlenderBot 2.0 added long term memory and internet search in 2021, and BlenderBot 3 in August 2022 was a 175B parameter system deployed as a public demo for safety research. Google's LaMDA, introduced by Romal Thoppilan and colleagues in January 2022, scaled the dialogue specialized transformer to 137B parameters trained on 1.56 trillion words. DeepMind's Sparrow, presented by Amelia Glaese and colleagues in September 2022, applied RLHF against 23 hand written rules and used Google search to support claims; it gave a plausible answer with evidence 78 percent of the time on factual questions.
OpenAI released ChatGPT on November 30, 2022, built on a fine tuned GPT-3.5 using the SFT + reward model + PPO recipe from the InstructGPT paper. It reached one million users in five days and an estimated 100 million monthly active users by January 2023, the fastest consumer software adoption on record at the time. GPT-4 followed in March 2023 with image input and improved reasoning. Anthropic released Claude in March 2023 trained with Constitutional AI; Meta released Llama 2 Chat in July 2023 as an open weight model in 7B, 13B, and 70B sizes; Google launched Bard in March 2023 and rebranded the line as Gemini in December 2023; Mistral released Mistral Large in February 2024.
| Paradigm | Mechanism | Representative systems |
|---|---|---|
| Rule based | Hand authored pattern templates, often in AIML or finite state scripts | ELIZA, PARRY, A.L.I.C.E. |
| Retrieval based | Score candidate responses from a corpus using TF-IDF, BM25, or a neural ranker | Cleverbot, early Smart Reply, Watson QA |
| Generative seq2seq | Encoder decoder neural net produces tokens conditioned on context | Vinyals and Le 2015, DialoGPT, Meena |
| Retrieval augmented generative | Generator conditions on retrieved passages from a search index or knowledge base | BlenderBot 2 and 3, Sparrow, RAG systems |
| Instruction tuned LLM with RLHF | Pretrained LLM fine tuned on demonstrations and ranked feedback | ChatGPT, Claude, Gemini, Llama Chat |
Production systems frequently combine paradigms. A customer support bot might use intent classification to route messages, retrieve relevant knowledge base articles, then call an instruction tuned LLM to write the reply.
The dominant pipeline for current chat models has four stages. First, a base language model is pretrained on a large web text corpus. Second, supervised fine tuning (SFT) on curated demonstrations teaches the model to follow instructions. Third, a reward model is trained on human comparisons between candidate outputs, and the policy is optimized against that reward, classically with proximal policy optimization. This recipe was popularized by the InstructGPT paper of Long Ouyang and colleagues in March 2022; they found that a 1.3 billion parameter InstructGPT was preferred to the 175 billion parameter GPT-3 base model on instruction tasks. Fourth, the model is evaluated and red teamed.
Several variants of the third stage compete in practice. Constitutional AI, introduced by Yuntao Bai and colleagues at Anthropic in December 2022, replaces human harmlessness labels with self critique against a written list of principles, an approach the authors called Reinforcement Learning from AI Feedback (RLAIF). Direct Preference Optimization (DPO), introduced by Rafael Rafailov and colleagues in May 2023, reparameterizes the RLHF objective so the policy can be optimized directly from preference pairs with a classification loss, avoiding the need for a separate reward model and PPO loop. Many open weight chat models in the Llama and Mistral families now use DPO or related preference losses.
| Model | Year | Developer | Size | Notes |
|---|---|---|---|---|
| ELIZA | 1966 | MIT (Weizenbaum) | ~200 lines MAD-SLIP | Pattern matching DOCTOR script |
| PARRY | 1972 | Stanford (Colby) | Rule based | Simulated paranoid patient |
| A.L.I.C.E. | 1995 | Richard Wallace | AIML rule base | Three time Loebner Prize winner |
| Cleverbot | 1997 | Rollo Carpenter | Learned response DB | Trained from user transcripts |
| IBM Watson | 2011 | IBM | Cluster pipeline | Won Jeopardy! |
| Siri | 2011 | Apple | Server pipeline | iPhone 4S launch |
| Alexa | 2014 | Amazon | Server pipeline | Echo voice assistant |
| Cortana | 2014 | Microsoft | Server pipeline | Windows Phone, Windows 10 |
| Google Assistant | 2016 | Server pipeline | Successor to Google Now | |
| DialoGPT | Nov 2019 | Microsoft Research | 117M-762M | GPT-2 on Reddit |
| Meena | Jan 2020 | 2.6B | Evolved Transformer, SSA metric | |
| BlenderBot 1 | Apr 2020 | Facebook AI | 90M, 2.7B, 9.4B | Personality, knowledge, empathy |
| BlenderBot 2 | Jul 2021 | Facebook AI | 2.7B | Long term memory, web search |
| LaMDA | Jan 2022 | up to 137B | Dialogue specialized transformer | |
| InstructGPT | Mar 2022 | OpenAI | 1.3B-175B | RLHF on GPT-3 base |
| BlenderBot 3 | Aug 2022 | Meta AI | 175B | Public deployment for safety research |
| Sparrow | Sep 2022 | DeepMind | 70B | RLHF against 23 rules, search citations |
| ChatGPT | Nov 2022 | OpenAI | GPT-3.5 backbone | Launched Nov 30, 2022 |
| Claude | Mar 2023 | Anthropic | Not disclosed | Trained with Constitutional AI |
| GPT-4 | Mar 2023 | OpenAI | Not disclosed | Multimodal image input |
| Bard | Mar 2023 | PaLM 2 then Gemini | Rebranded as Gemini Dec 2023 | |
| Llama 2 Chat | Jul 2023 | Meta | 7B, 13B, 70B | Open weight chat with RLHF |
| Gemini | Dec 2023 | Google DeepMind | Nano, Pro, Ultra | Multimodal from the start |
| Mistral Large | Feb 2024 | Mistral AI | Not disclosed | Multilingual, function calling |
Sizes reflect public disclosures at release; many later versions are not parameter tagged.
Evaluating open ended chat is harder than scoring a classification task. The field uses a mix of static benchmarks and live human comparisons.
| Benchmark | Type | Notes |
|---|---|---|
| MT-Bench | Multi turn open ended | 80 questions across 8 categories scored by GPT-4 as judge, Zheng et al. 2023 |
| LMSYS Chatbot Arena | Crowdsourced battle | Pairwise blind votes converted to Elo ratings, Zheng et al. 2023 |
| Persona-Chat | Persona grounded chat | 164k utterances over 1,155 personas, Zhang et al. 2018 |
| ConvAI2 | Persona-Chat extension | NeurIPS 2018 challenge with the same setup |
| MultiWOZ | Task oriented | 10k human to human dialogues across 7 domains, Budzianowski et al. 2018 |
| Wizard of Wikipedia | Knowledge grounded | Dialogues where one side has access to Wikipedia, Dinan et al. 2019 |
| DSTC | Annual challenges | Dialogue State Tracking Challenge series since 2013 |
Automatic metrics such as BLEU and perplexity correlate poorly with human judgements of dialogue quality, which is why human evaluation, LLM judges, and pairwise voting platforms such as Chatbot Arena have become the standard for ranking chat assistants.
Conversational models are deployed in customer support to triage tickets and draft replies; in virtual assistants on phones, speakers, and cars; in mental health support such as the CBT chatbot Woebot launched in 2017; in tutoring and language learning including Duolingo's Max; in programming assistance through tools such as GitHub Copilot Chat and Cursor; and in productivity assistants embedded in office suites. Modern AI agents extend chat into multi step task execution by calling external tools, browsing the web, and writing code, using the same instruction tuned LLM backbones.
From 2024 through 2026 the chat assistant category has been shaped by several developments. Multimodality is now standard: GPT-4o, Gemini, and Claude accept images and audio and produce speech, with live voice chat at sub second latency. Tool use and function calling let chat models call APIs, run code, and browse the web during a single response. Context windows have grown past one million tokens, allowing entire codebases to fit into a single conversation. Persistent memory features store user facts across sessions. Dedicated reasoning models, including OpenAI's o1 and o3 and Anthropic's extended thinking mode, spend extra compute on intermediate reasoning before producing a final answer.
Large chat models are aligned with operator and user intent through RLHF, Constitutional AI and other RLAIF variants, instruction hierarchy training, refusal classifiers, and post deployment monitoring. The Sparrow paper showed that targeted human judgement against explicit rules reduced rule breaking under adversarial probing by roughly three times relative to a baseline dialogue model. Jailbreaks, in which users craft prompts that bypass safety policies, remain an active area; defenses include adversarial training, input and output classifiers, and structured system prompts. Hallucination, the production of confident but incorrect statements, is mitigated by retrieval augmentation, citation requirements, and post hoc fact checking. Red teaming is now a standard pre release step at major labs.
Despite rapid progress, conversational models still show predictable failure modes. They fabricate facts and citations when not grounded in retrieval. They exhibit sycophancy, adjusting beliefs to match the user instead of pushing back. They lose track of facts across long conversations even with million token contexts. They are sensitive to prompt phrasing and can be steered into harmful outputs by adversarial inputs. Persona stability is brittle; minor prompt changes can flip tone or stance. These issues motivate continued work on retrieval, tool use, reasoning, alignment training, and evaluation.