# Conversational Models

> Source: https://aiwiki.ai/wiki/conversational_models
> Updated: 2026-06-22
> Categories: AI Models, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Natural Language Processing Models and Tasks](/wiki/natural_language_processing_models)*

**Conversational models** are computational systems designed to carry on a dialogue with human users in natural language, ranging from early pattern matching scripts to modern instruction tuned large language models. The category spans six decades, from [ELIZA](/wiki/eliza_chatbot), written at MIT in 1966, to today's [chatbot](/wiki/chatbot) assistants built on [large language models](/wiki/large_language_model): OpenAI's [ChatGPT](/wiki/chatgpt) reached roughly 800 million weekly active users by October 2025, and Google's [Gemini](/wiki/gemini) app surpassed 750 million monthly active users by the fourth quarter of 2025, making conversational models one of the fastest adopted software categories in history.[21][22] The field overlaps with dialogue systems research, with [agent](/wiki/agent) work in AI, and with consumer assistants such as [Siri](/wiki/siri) and [Alexa](/wiki/alexa).

## What is a conversational model?

A *dialogue system* is software that conducts a back and forth exchange with a user across multiple turns. A *chatbot* is a dialogue system whose primary interface is text or voice chat. *Conversational AI* is an industry umbrella term for the language understanding, dialogue management, and language generation components that produce coherent multi turn responses.

Researchers split conversational models into two families. *Task oriented* dialogue systems help a user complete a specific goal such as booking a flight, tracking explicit slot values using a dialogue state tracker. *Open domain* or chitchat systems aim to hold an engaging conversation across arbitrary topics with no fixed task. Modern instruction tuned [LLMs](/wiki/llm) blur the boundary by handling both. The [Turing test](/wiki/turing_test) proposed by Alan Turing in 1950 used a conversational setup as a thought experiment for machine intelligence and still shapes how the public reads progress.

## History: how did conversational models evolve?

### Early rule based systems (1960s to 1990s)

The first widely known chatbot, ELIZA, was written by Joseph Weizenbaum at MIT between 1964 and 1967.[1] Its DOCTOR script imitated a Rogerian psychotherapist using around 200 lines of pattern matching code in the MAD-SLIP language on an IBM 7094 mainframe.[1] Many users, including Weizenbaum's own secretary, attributed real understanding to the program, an effect later named the *ELIZA effect*.[1] Weizenbaum himself was disturbed by the reaction and argued in his 1976 book *Computer Power and Human Reason* against delegating sensitive human tasks to machines, writing that "extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people."[23]

In 1972 the Stanford psychiatrist Kenneth Colby released PARRY, a rule based program that simulated a patient with paranoid schizophrenia. PARRY held the first chatbot to chatbot conversation when it was connected to ELIZA over the ARPANET in 1972. A.L.I.C.E., the Artificial Linguistic Internet Computer Entity, was created by Richard Wallace on November 23, 1995 and used a custom XML based language called AIML to express thousands of pattern response rules; it won the Loebner Prize in 2000, 2001, and 2004.[2] Cleverbot, launched on the web in 1997 by Rollo Carpenter, learned from accumulated user transcripts rather than from a hand authored script.

### Assistants and statistical methods (2000s to mid 2010s)

IBM's question answering system Watson won the *Jeopardy!* exhibition match in February 2011 against champions Ken Jennings and Brad Rutter, showing that statistical retrieval pipelines could handle open ended natural language questions. Apple released [Siri](/wiki/siri) on the iPhone 4S in October 2011, building on SRI International's CALO project. Amazon followed with [Alexa](/wiki/alexa) on the Echo speaker in 2014, Microsoft launched Cortana the same year, and Google introduced Google Assistant in 2016. These products combined automatic speech recognition, intent classification, slot filling, and template based response generation. In 2015 Oriol Vinyals and Quoc Le of Google published *A Neural Conversational Model*, applying the sequence to sequence framework to dialogue and showing that a single recurrent network trained on subtitle and IT helpdesk corpora could produce passable open domain responses without hand authored rules.[3]

### Pretrained transformer chatbots (2019 to 2022)

The arrival of the [transformer](/wiki/transformer) architecture and pretrained models such as [GPT-2](/wiki/gpt-2) reset expectations for open domain chatbots. Microsoft Research released DialoGPT in November 2019, a GPT-2 response generator trained on 147 million Reddit comment exchanges from 2005 to 2017, up to 762 million parameters.[4] Google followed in January 2020 with Meena, a 2.6 billion parameter Evolved Transformer trained on 40 billion words of public social media conversation; it introduced the Sensibleness and Specificity Average metric and scored 79 percent SSA versus 86 percent for humans.[5]

Facebook AI Research published *Recipes for building an open domain chatbot* by Stephen Roller and colleagues in April 2020, releasing the BlenderBot family in 90M, 2.7B, and 9.4B parameter sizes through the ParlAI framework.[6] BlenderBot 2.0 added long term memory and internet search in 2021, and BlenderBot 3 in August 2022 was a 175B parameter system deployed as a public demo for safety research. Google's LaMDA, introduced by Romal Thoppilan and colleagues in January 2022, scaled the dialogue specialized transformer to 137B parameters trained on 1.56 trillion words.[7] DeepMind's Sparrow, presented by Amelia Glaese and colleagues in September 2022, applied [RLHF](/wiki/rlhf) against 23 hand written rules and used Google search to support claims; it gave a plausible answer with evidence 78 percent of the time on factual questions.[9][14]

### When did ChatGPT launch, and what came after?

OpenAI released ChatGPT on November 30, 2022, built on a fine tuned GPT-3.5 using the SFT + reward model + PPO recipe from the [InstructGPT](/wiki/instructgpt) paper.[13][8] It reached one million users in five days and an estimated 100 million monthly active users by January 2023, the fastest consumer software adoption on record at the time. Adoption kept compounding: OpenAI reported roughly 800 million weekly active users for ChatGPT by October 2025, nearly a tenth of the world's adult population.[21] [GPT-4](/wiki/gpt-4) followed in March 2023 with image input and improved reasoning. Anthropic released Claude in March 2023 trained with [Constitutional AI](/wiki/constitutional_ai); Meta released Llama 2 Chat in July 2023 as an open weight model in 7B, 13B, and 70B sizes;[16] Google launched Bard in March 2023, introduced the Gemini model family in December 2023, and rebranded Bard as Gemini in February 2024, growing the Gemini app to more than 750 million monthly active users by the fourth quarter of 2025;[22] Mistral released Mistral Large in February 2024.

## Dialogue paradigms

Production conversational systems are built around one of five broad paradigms, or a combination of several. The choice of paradigm determines how responses are generated, what data is required, and what failure modes to expect.

| Paradigm | Mechanism | Representative systems |
| --- | --- | --- |
| Rule based | Hand authored pattern templates, often in [AIML](/wiki/aiml) or finite state scripts | ELIZA, PARRY, A.L.I.C.E. |
| Retrieval based | Score candidate responses from a corpus using TF-IDF, BM25, or a neural ranker | Cleverbot, early Smart Reply, Watson QA |
| Generative seq2seq | Encoder decoder neural net produces tokens conditioned on context | Vinyals and Le 2015, DialoGPT, Meena |
| Retrieval augmented generative | Generator conditions on retrieved passages from a search index or knowledge base | BlenderBot 2 and 3, Sparrow, [RAG](/wiki/retrieval_augmented_generation) systems |
| Instruction tuned LLM with RLHF | Pretrained LLM fine tuned on demonstrations and ranked feedback | ChatGPT, Claude, Gemini, Llama Chat |

Production systems frequently combine paradigms. A customer support bot might use intent classification to route messages, retrieve relevant knowledge base articles, then call an instruction tuned LLM to write the reply.

### Rule based and pattern matching systems

Rule based approaches encode conversational behavior as a set of pattern to response mappings. The pattern side may be a simple keyword list, a regular expression, or a topic classifier. The response side may be a fixed template, a slot filling template, or a script that calls an external API. The main advantages are predictability and auditability: the developer knows exactly why any given response was produced. The main disadvantage is the effort required to maintain coverage, since every new topic requires new rules. AIML, the Artificial Intelligence Markup Language introduced with A.L.I.C.E., remains in use in specialized domains such as customer service FAQ bots where coverage over a bounded topic set is more important than generalization.[2]
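The pattern to response mapping described above can be illustrated with a minimal sketch. The rules, templates, and the `respond` function are hypothetical examples written for this article, not taken from ELIZA or any AIML rule base.

```python
import re

# Hypothetical rules: (compiled pattern, response template).
# Rules are tried in order; the first match wins.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\b(hello|hi)\b", re.IGNORECASE), "Hello. What would you like to talk about?"),
]
FALLBACK = "Please tell me more."


def respond(message: str) -> str:
    """Return the response for the first rule whose pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(message)
        if match:
            # Strip trailing punctuation from captured text before reusing it.
            groups = [g.rstrip(".!? ") for g in match.groups()]
            return template.format(*groups)
    return FALLBACK


print(respond("I need a holiday."))
print(respond("The weather is odd"))
```

Every reply can be traced to one rule, which is the auditability advantage; the fallback line shows the coverage problem, since any topic without a rule gets the same generic answer.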

### Retrieval based dialogue

Retrieval systems maintain a corpus of past conversation examples or candidate responses and select the best match for each user turn. Classical retrieval uses sparse lexical matching (TF-IDF, BM25); neural retrieval uses dense embeddings from a bi-encoder and computes similarity in the embedding space. Google's Smart Reply feature for Gmail (2015) was an early neural retrieval system that suggested short canned replies to incoming email. Retrieval approaches produce fluent, coherent responses when the corpus is well designed, but they cannot generate novel responses and fail on queries outside the corpus distribution.
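As an illustration of sparse lexical retrieval, the following sketch ranks a tiny hypothetical corpus by TF-IDF cosine similarity using only the standard library. The corpus, the smoothing formula, and all function names are assumptions chosen for the example; production systems would typically use BM25 or a dense bi-encoder instead.

```python
import math
import re
from collections import Counter

# Hypothetical corpus of (stored user turn, canned response) pairs.
CORPUS = [
    ("how do i reset my password", "Use the 'Forgot password' link on the sign-in page."),
    ("what are your opening hours", "We are open 9am to 5pm, Monday to Friday."),
    ("how can i track my order", "Open 'My orders' and select the shipment to see tracking."),
]


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def idf_table(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens, idf):
    counts = Counter(t for t in tokens if t in idf)  # ignore out-of-corpus terms
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


DOCS = [tokenize(question) for question, _ in CORPUS]
IDF = idf_table(DOCS)
VECTORS = [tfidf(doc, IDF) for doc in DOCS]


def retrieve(message):
    """Return the best-matching canned response and its similarity score."""
    query = tfidf(tokenize(message), IDF)
    scores = [cosine(query, vec) for vec in VECTORS]
    best = max(range(len(scores)), key=scores.__getitem__)
    return CORPUS[best][1], scores[best]


print(retrieve("I forgot my password, how do I reset it?"))
```

A query with no terms in common with the corpus scores 0.0 against every entry, so a deployed system would need a score threshold and a fallback rather than returning the top candidate unconditionally.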

### Generative sequence to sequence models

The 2015 Vinyals and Le paper treated dialogue as machine translation: the source sequence is the conversation history and the target sequence is the next response.[3] Early seq2seq dialogue models used recurrent encoders and decoders. DialoGPT reframed the task as language modeling: simply train a GPT-2 style left-to-right language model on multi-turn conversation data, then sample from it.[4] Meena followed the same approach at larger scale with an Evolved Transformer and more careful training data filtering.[5] The weakness of pure generative approaches is the tendency to produce generic, dull responses ("I don't know," "That's interesting") and to fabricate facts, since generation is conditioned only on the preceding context.

### Retrieval augmented generation

Retrieval augmented generation (RAG) addresses generative models' factual limitations by inserting retrieved evidence into the context before generation. BlenderBot 2.0 implemented an internet search module: the model first generates a search query, retrieves documents, then generates its response conditioned on the retrieved content.[6] DeepMind's Sparrow used a similar approach and additionally required citations.[9] Modern chat assistants such as ChatGPT with Browse and Gemini with Google Search implement the same idea at large scale. For task oriented systems the retrieved content is typically a structured knowledge base or API result rather than a web page. See the [retrieval augmented generation](/wiki/retrieval_augmented_generation) article for a fuller treatment.
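The query, retrieve, then generate sequence can be sketched as follows. The `generate` and `search` callables are hypothetical stand-ins for an LLM call and a search backend, and the prompt wording is an assumption; this is not the BlenderBot or Sparrow implementation.

```python
from typing import Callable, List


def build_grounded_prompt(question: str, passages: List[str]) -> str:
    """Number each retrieved passage so the reply can cite it as [n]."""
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the evidence below and cite passages as [n].\n"
        f"Evidence:\n{evidence}\n"
        f"Question: {question}\nAnswer:"
    )


def rag_answer(
    question: str,
    generate: Callable[[str], str],      # hypothetical LLM call: prompt -> text
    search: Callable[[str], List[str]],  # hypothetical retriever: query -> passages
    k: int = 2,
) -> str:
    # Step 1: the model writes a search query for the user question.
    query = generate(f"Write a short web search query for: {question}")
    # Step 2: keep the top-k passages returned for that query.
    passages = search(query)[:k]
    # Step 3: generate the reply conditioned on the retrieved evidence.
    return generate(build_grounded_prompt(question, passages))
```

Keeping the prompt builder separate from the orchestration makes it easy to swap the retriever for a structured knowledge base or API result in a task oriented setting.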

### Task oriented dialogue: the NLU pipeline

Task oriented dialogue systems typically decompose the problem into four sequential modules. The natural language understanding (NLU) module classifies the user intent (for example, "book flight") and extracts slot values (origin, destination, date). The dialogue state tracker maintains a belief state over all slot value pairs accumulated across the conversation so far. The dialogue policy selects the next system action, such as making an API call or asking a clarifying question. The natural language generation (NLG) module renders the selected action as a surface form utterance.
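The four modules can be sketched end to end for a toy flight booking domain. The slot schema, regular expressions, and response wording are hypothetical; real NLU and policy components are learned models rather than hand written rules.

```python
import re
from dataclasses import dataclass, field

REQUIRED_SLOTS = ("origin", "destination", "date")  # hypothetical schema
SLOT_PATTERNS = {
    "origin": re.compile(r"\bfrom (\w+)"),
    "destination": re.compile(r"\bto (\w+)"),
    "date": re.compile(r"\bon (\w+)"),
}


def nlu(utterance):
    """Toy NLU: keyword intent detection plus regex slot extraction."""
    text = utterance.lower()
    slots = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match:
            slots[slot] = match.group(1)
    intent = "book_flight" if "flight" in text or slots else "unknown"
    return {"intent": intent, "slots": slots}


@dataclass
class DialogueState:
    """Belief state: slot values accumulated over all turns so far."""
    slots: dict = field(default_factory=dict)

    def update(self, nlu_result):
        self.slots.update(nlu_result["slots"])


def policy(state):
    """Request the first missing slot, otherwise issue the booking call."""
    for slot in REQUIRED_SLOTS:
        if slot not in state.slots:
            return ("request", slot)
    return ("api_call", dict(state.slots))


def nlg(action):
    """Render the chosen system action as a surface utterance."""
    kind, arg = action
    if kind == "request":
        return f"Which {arg} would you like?"
    return f"Booking a flight from {arg['origin']} to {arg['destination']} on {arg['date']}."


def turn(state, utterance):
    state.update(nlu(utterance))
    return nlg(policy(state))


state = DialogueState()
print(turn(state, "I want a flight from Boston to Denver"))
print(turn(state, "on Friday"))
```

The state object is what lets the second turn succeed: the date arrives without an origin or destination, and the tracker supplies the values collected earlier.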

Early NLU used hand coded grammars and classifiers trained on annotated intent and slot corpora. Later work replaced these with neural models. The BERT era brought joint NLU models that classify intent and fill slots simultaneously in a single forward pass, outperforming pipeline models on standard benchmarks. Dialogue state tracking evolved from fixed ontology classifiers (one classifier per slot) to generative sequence to sequence trackers that handle open vocabulary values and new domains without retraining.

LLMs have largely subsumed the four module pipeline in practice. A single instruction tuned LLM called with a well designed system prompt can perform intent detection, state tracking, policy execution via function calling, and response generation in a single round trip. This reduces engineering complexity but makes the internal states opaque.

## How are modern chat models trained?

The dominant pipeline for current chat models has four stages. First, a base [language model](/wiki/language_model) is pretrained on a large web text corpus. Second, supervised fine tuning ([SFT](/wiki/sft)) on curated demonstrations teaches the model to follow instructions. Third, a reward model is trained on human comparisons between candidate outputs, and the policy is optimized against that reward, classically with proximal policy optimization. This recipe was popularized by the InstructGPT paper of Long Ouyang and colleagues in March 2022; they found that a 1.3 billion parameter InstructGPT was preferred to the 175 billion parameter GPT-3 base model on instruction tasks.[8] Fourth, the model is evaluated and red teamed.

Several variants of the third stage compete in practice. Constitutional AI, introduced by Yuntao Bai and colleagues at Anthropic in December 2022, replaces human harmlessness labels with self critique against a written list of principles, an approach the authors called Reinforcement Learning from AI Feedback (RLAIF).[10] Direct Preference Optimization ([DPO](/wiki/dpo)), introduced by Rafael Rafailov and colleagues in May 2023, reparameterizes the RLHF objective so the policy can be optimized directly from preference pairs with a classification loss, avoiding the need for a separate reward model and PPO loop.[11] Many open weight chat models in the Llama and Mistral families now use DPO or related preference losses.

### Supervised fine tuning

In the SFT stage, human contractors write or curate examples of desirable model behavior: user messages paired with ideal assistant responses. These examples are drawn from many task types (question answering, summarization, coding, creative writing, factual lookup, multi turn conversation) to give the model broad coverage. The SFT stage is important because it shapes the conversation format, the register of responses (helpful, direct, appropriately caveated), and basic instruction following. Without SFT, a pretrained base model tends to treat a prompt as the start of a document to be continued rather than as a request, so its output is often not usable as an assistant turn.

The quantity and quality of SFT data matter more than scale alone. The InstructGPT paper used roughly 13,000 high quality demonstrations and found that training on this small high quality set produced more helpful outputs than training on much larger low quality sets.[8] Meta's Llama 2 Chat paper (Touvron et al. 2023) similarly reported that SFT data quality was the binding constraint at their scale.[16]

### Reward modeling and RLHF

Reinforcement Learning from Human Feedback (RLHF), as applied to chat models, works by training a separate reward model on human preference data and then optimizing the policy against that reward. Human annotators compare pairs of model responses to the same prompt and indicate which they prefer. The reward model learns to predict these preferences as a scalar score. The policy is then optimized using PPO so that responses it generates score highly under the reward model, subject to a KL divergence penalty that keeps the policy close to the SFT initialization to avoid degenerate outputs.
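The two quantities in this paragraph can be written down concretely. The sketch below assumes the common pairwise formulation, in which the reward model loss is the negative log sigmoid of the score margin and the RL reward subtracts a scaled log probability ratio as a per sample KL estimate; the function names and the default `beta` are illustrative choices.

```python
import math


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(score_chosen - score_rejected)."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))


def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_reference: float, beta: float = 0.1) -> float:
    """Reward seen by the RL step: reward model score minus a KL penalty.

    logp_policy and logp_reference are the log probabilities of the sampled
    response under the current policy and under the frozen SFT model.
    """
    kl_estimate = logp_policy - logp_reference
    return reward - beta * kl_estimate


print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
print(kl_penalized_reward(1.0, -4.0, -5.0, beta=0.5))
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and the penalty grows as the policy assigns a response much more probability than the SFT model did, which is what discourages drift toward degenerate outputs.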

RLHF produces large improvements in conversational quality, instruction following, and harmlessness compared to SFT alone. The InstructGPT paper showed that humans strongly preferred InstructGPT outputs over GPT-3 outputs at every model size tested.[8] However, RLHF also introduces failure modes: the policy can learn to exploit biases in the reward model (reward hacking), generating plausible sounding responses that score well but are factually wrong or superficially helpful without being genuinely useful.

### Constitutional AI and RLAIF

Constitutional AI (CAI), introduced by Bai and colleagues at Anthropic, addresses the cost and inconsistency of human harmlessness labeling by replacing human feedback on harmful content with AI generated feedback.[10] The model is given a list of principles (a constitution), and critiques its own outputs against those principles before revising them. The revised outputs become supervised fine tuning data. In a second stage, a model compares pairs of responses against the constitution, and these AI generated preference labels train the reward signal used for reinforcement learning. This allows the harmlessness reward signal to be generated entirely by AI, reducing dependence on human annotators for the safety signal while preserving human curation of the principle list itself. Claude was trained with Constitutional AI. The broader category of Reinforcement Learning from AI Feedback (RLAIF) now covers a range of methods that use AI models as feedback sources.
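The critique and revision step can be sketched as a loop over principles. The `generate` callable is a hypothetical stand-in for a model call, the prompt wording is invented for illustration, and applying every principle in order is a simplifying assumption rather than the procedure from the paper.

```python
from typing import Callable, Dict, List


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical LLM call: prompt -> text
    principles: List[str],
) -> Dict[str, str]:
    """Draft a response, then critique and rewrite it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle: {principle}\n"
            f"Response: {response}"
        )
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    # The (prompt, revised response) pair becomes a fine tuning example.
    return {"prompt": prompt, "response": response}
```

No human label is needed inside the loop; human judgement enters only through the wording of the principles passed in.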

### Direct Preference Optimization

DPO (Rafailov et al. 2023) showed that the RLHF objective can be solved without training a separate reward model and running PPO.[11] By reparameterizing the optimal policy in terms of the reference SFT model and the preference data, the optimization reduces to a binary cross entropy loss on preference pairs, with the policy playing the role of an implicit reward model.[11] DPO is simpler, cheaper, and more stable to train than PPO based RLHF, at some cost in flexibility. Many open weight chat models released after Llama 2 Chat, which itself was trained with reward model based RLHF, use DPO or variants such as IPO and KTO as the alignment stage.
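For a single preference pair the loss described above can be computed from four sequence log probabilities. This is a minimal sketch of the published objective; the argument names and the default `beta` are illustrative, and a real implementation would operate on batched tensors.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Binary cross entropy on the implicit reward margin for one pair."""
    # Implicit reward: beta times the log ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy shifted toward the chosen response: the loss decreases.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

The reference log probabilities play the part of the KL anchor in PPO based RLHF: the loss rewards moving probability toward the chosen response only relative to what the SFT model already assigned.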

## Notable models

| Model | Year | Developer | Size | Notes |
| --- | --- | --- | --- | --- |
| [ELIZA](/wiki/eliza_chatbot) | 1966 | MIT (Weizenbaum) | ~200 lines MAD-SLIP | Pattern matching DOCTOR script |
| PARRY | 1972 | Stanford (Colby) | Rule based | Simulated paranoid patient |
| A.L.I.C.E. | 1995 | Richard Wallace | AIML rule base | Three time Loebner Prize winner |
| Cleverbot | 1997 | Rollo Carpenter | Learned response DB | Trained from user transcripts |
| IBM Watson | 2011 | IBM | Cluster pipeline | Won Jeopardy! |
| Siri | 2011 | Apple | Server pipeline | iPhone 4S launch |
| Alexa | 2014 | Amazon | Server pipeline | Echo voice assistant |
| Cortana | 2014 | Microsoft | Server pipeline | Windows Phone, Windows 10 |
| Google Assistant | 2016 | Google | Server pipeline | Successor to Google Now |
| DialoGPT | Nov 2019 | Microsoft Research | 117M-762M | GPT-2 on Reddit |
| Meena | Jan 2020 | Google | 2.6B | Evolved Transformer, SSA metric |
| BlenderBot 1 | Apr 2020 | Facebook AI | 90M, 2.7B, 9.4B | Personality, knowledge, empathy |
| BlenderBot 2 | Jul 2021 | Facebook AI | 2.7B | Long term memory, web search |
| LaMDA | Jan 2022 | Google | up to 137B | Dialogue specialized transformer |
| InstructGPT | Mar 2022 | OpenAI | 1.3B-175B | RLHF on GPT-3 base |
| BlenderBot 3 | Aug 2022 | Meta AI | 175B | Public deployment for safety research |
| Sparrow | Sep 2022 | DeepMind | 70B | RLHF against 23 rules, search citations |
| [ChatGPT](/wiki/chatgpt) | Nov 2022 | OpenAI | GPT-3.5 backbone | Launched Nov 30, 2022 |
| [Claude](/wiki/claude) | Mar 2023 | [Anthropic](/wiki/anthropic) | Not disclosed | Trained with Constitutional AI |
| [GPT-4](/wiki/gpt-4) | Mar 2023 | OpenAI | Not disclosed | Multimodal image input |
| Bard | Mar 2023 | Google | LaMDA, then PaLM 2, then Gemini | Rebranded as Gemini Feb 2024 |
| [Llama 2](/wiki/llama_2) Chat | Jul 2023 | Meta | 7B, 13B, 70B | Open weight chat with RLHF |
| [Gemini](/wiki/gemini) | Dec 2023 | Google DeepMind | Nano, Pro, Ultra | Multimodal from the start |
| [Mistral Large](/wiki/mistral_large) | Feb 2024 | Mistral AI | Not disclosed | Multilingual, function calling |

Sizes reflect public disclosures at release; parameter counts for many later versions have not been published.

## How are conversational models evaluated?

Evaluating open ended chat is harder than scoring a classification task. The field uses a mix of static benchmarks, task oriented evaluations, and live human comparisons.

| Benchmark | Type | Notes |
| --- | --- | --- |
| MT-Bench | Multi turn open ended | 80 questions across 8 categories scored by GPT-4 as judge, Zheng et al. 2023 |
| LMSYS [Chatbot Arena](/wiki/chatbot_arena) | Crowdsourced battle | Pairwise blind votes converted to Elo ratings, Zheng et al. 2023 |
| Persona-Chat | Persona grounded chat | 164k utterances over 1,155 personas, Zhang et al. 2018 |
| ConvAI2 | Persona-Chat extension | NeurIPS 2018 challenge with the same setup |
| MultiWOZ | Task oriented | 10k human to human dialogues across 7 domains, Budzianowski et al. 2018 |
| Wizard of Wikipedia | Knowledge grounded | Dialogues where one side has access to Wikipedia, Dinan et al. 2019 |
| DSTC | Annual challenges | Dialogue State Tracking Challenge series since 2013 |
| AlpacaEval | Instruction following | Automated preference evaluation against a reference model |

Automatic metrics such as BLEU and perplexity correlate poorly with human judgements of dialogue quality, which is why human evaluation, LLM judges, and pairwise voting platforms such as Chatbot Arena have become the standard for ranking chat assistants.

### MT-Bench

MT-Bench, introduced by Zheng and colleagues at LMSYS in 2023, consists of 80 multi turn questions across eight categories: writing, roleplay, extraction, reasoning, mathematics, coding, STEM knowledge, and humanities and social science knowledge.[12] Each question has a first turn and a follow up turn designed to test whether the model can handle context from the prior exchange. Responses are graded by GPT-4 acting as a judge on a 1 to 10 scale. A key finding of the MT-Bench paper is that strong LLM judges achieve over 80 percent agreement with controlled human raters, matching the inter-annotator agreement between humans, which validates LLM-as-judge as a scalable evaluation method.[12] The paper also documents failure modes of LLM judges including position bias (favoring a response because of where it appears in the prompt, most often the first slot), verbosity bias (preferring longer responses regardless of quality), and self-enhancement bias (a model preferring its own outputs).[12] The MT-Bench questions, 3,000 expert votes, and 30,000 conversations are publicly released.

### Chatbot Arena

[Chatbot Arena](/wiki/chatbot_arena) (LMSYS) is a crowdsourced evaluation platform where users submit a message, receive responses from two anonymized models, and vote for the better one.[12] The pairwise votes are converted to an Elo-style rating using the Bradley-Terry model, which estimates each model's latent quality from win/loss records.[12] The platform has accumulated more than 6 million human votes across hundreds of models as of 2025, making it one of the most data-rich human preference datasets for conversational AI. Because evaluators submit their own prompts, Chatbot Arena captures real user intent distributions rather than the narrow set of topics covered by static benchmarks. Its main limitation is that the crowd may have different preferences from expert evaluators, and popularity effects can inflate ratings for newly released models.
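How pairwise votes become ratings can be illustrated with a small Bradley-Terry fit. The sketch below uses the classic iterative update on win counts and then maps strengths to an Elo-like scale; it is a teaching example under those assumptions, not the platform's actual code, and it ignores ties and confidence intervals.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """battles: list of (winner, loser) pairs. Returns strengths summing to 1.

    Assumes every model has at least one win; otherwise its strength
    collapses to zero and the estimate is not meaningful.
    """
    wins = defaultdict(float)
    pair_counts = defaultdict(float)
    models = set()
    for winner, loser in battles:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # Sum over opponents of games played / combined strength.
            denom = sum(
                pair_counts[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models if o != m
            )
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


def to_elo_scale(strength, anchor=1000.0):
    """Map strengths to an Elo-like scale centered on `anchor`."""
    logs = {m: 400.0 * math.log10(s) for m, s in strength.items()}
    mean = sum(logs.values()) / len(logs)
    return {m: anchor + value - mean for m, value in logs.items()}


votes = [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
strengths = fit_bradley_terry(votes)
print(strengths, to_elo_scale(strengths))
```

Under this model the probability that one system beats another depends only on the ratio of their strengths, so a 3 to 1 win record corresponds to a fixed rating gap regardless of where the two models sit on the scale.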

### Task oriented evaluation

Task oriented dialogue evaluation uses task completion rate (whether the system successfully completed the user's goal), slot error rate (fraction of slot values extracted incorrectly), and dialogue turn efficiency (number of turns needed to complete the task) alongside language quality metrics. MultiWOZ, the most widely used task oriented benchmark, provides 10,000 human to human dialogues across 7 domains (restaurant, hotel, attraction, taxi, train, hospital, police) and tests end to end systems including NLU, state tracking, policy, and generation.[15] Successive versions (2.1 through 2.4) corrected annotation errors in the original release, which were significant enough to invalidate comparisons across versions.[20]
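The three task oriented metrics named above reduce to simple counts. The sketch below assumes a hypothetical log format (one dict per dialogue with `success` and `turns` keys) and scores slots only against the gold annotation, so spurious extra predictions are not penalized.

```python
def slot_error_rate(predicted: dict, gold: dict) -> float:
    """Fraction of gold slots whose predicted value is missing or wrong."""
    if not gold:
        return 0.0
    wrong = sum(1 for slot, value in gold.items() if predicted.get(slot) != value)
    return wrong / len(gold)


def summarize(dialogues: list) -> dict:
    """dialogues: list of dicts with keys 'success' (bool) and 'turns' (int)."""
    n = len(dialogues)
    return {
        "task_completion_rate": sum(1 for d in dialogues if d["success"]) / n,
        "avg_turns": sum(d["turns"] for d in dialogues) / n,
    }


gold = {"origin": "boston", "destination": "denver", "date": "friday"}
predicted = {"origin": "boston", "date": "monday"}
print(slot_error_rate(predicted, gold))
print(summarize([{"success": True, "turns": 4}, {"success": False, "turns": 8}]))
```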

### SSA and human parity claims

Meena's Sensibleness and Specificity Average (SSA) metric operationalizes two dimensions of human judgement: whether a response makes sense in context (sensibleness) and whether it is specific rather than generic (specificity).[5] Human raters annotate each response on these two binary dimensions and the average is computed. Meena scored 79 percent SSA versus 86 percent for humans, producing an early "human parity" style comparison for open domain chat.[5] Such comparisons are sensitive to the definition of human parity and the population of annotators, and later work showed that SSA gaps could be closed by scale without producing genuine conversational competence on harder tasks.
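The averaging step can be made concrete with a short sketch. It assumes one pair of binary labels per response and treats a response that is not sensible as not specific either, which is a labeling convention adopted here for the example.

```python
def ssa(labels):
    """labels: list of (sensible, specific) booleans, one pair per response."""
    n = len(labels)
    sensibleness = sum(1 for sensible, _ in labels if sensible) / n
    # A response only counts as specific if it was also judged sensible.
    specificity = sum(1 for sensible, specific in labels if sensible and specific) / n
    return (sensibleness + specificity) / 2


ratings = [(True, True), (True, False), (False, False), (True, True)]
print(ssa(ratings))  # sensibleness 0.75, specificity 0.5
```

A system that always replies with a safe generic phrase can score high on sensibleness but near zero on specificity, which is why the two rates are averaged rather than reported as sensibleness alone.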

## What are conversational models used for?

Conversational models are deployed in customer support to triage tickets and draft replies; in virtual assistants on phones, speakers, and cars; in mental health support such as the CBT chatbot Woebot launched in 2017; in tutoring and language learning including Duolingo's Max; in programming assistance through tools such as GitHub Copilot Chat and Cursor; and in productivity assistants embedded in office suites. Modern [AI agents](/wiki/ai_agent) extend chat into multi step task execution by calling external tools, browsing the web, and writing code, using the same instruction tuned LLM backbones.

### Function calling and tool use

A key capability that distinguishes post-2022 chat assistants from earlier generative chatbots is structured function calling. OpenAI added function calling to its chat API in 2023; it allows the model to emit a structured JSON action request (naming a function and supplying argument values) instead of a natural language reply. The calling application executes the function and returns the result to the model, which incorporates it into its next response. This mechanism turns a conversational model into a runtime orchestrator that can query databases, call REST APIs, execute code, and interact with external services within a single conversation. Function calling is now a standard feature of major chat APIs including the OpenAI, Anthropic, and Google Gemini APIs. See [tool use](/wiki/tool_use) for a comprehensive treatment.
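The application side of this loop can be sketched without any vendor SDK. The JSON shape (`name` plus `arguments`), the `get_weather` tool, and the error format below are assumptions made for illustration and do not reproduce any particular provider's wire format.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would query a weather service."""
    return {"city": city, "forecast": "sunny"}


# Registry of functions the model is allowed to call.
TOOLS = {"get_weather": get_weather}


def execute_tool_call(raw_request: str) -> str:
    """Run the function named in a model-emitted JSON request.

    Returns a JSON string that the application would send back to the model
    as the tool result for its next turn.
    """
    request = json.loads(raw_request)
    name = request.get("name")
    if name not in TOOLS:
        return json.dumps({"error": f"unknown function: {name}"})
    try:
        result = TOOLS[name](**request.get("arguments", {}))
    except TypeError as exc:  # wrong or missing argument names
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


print(execute_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
```

Returning errors as data rather than raising lets the model see what went wrong and retry with corrected arguments, and the explicit registry keeps the model from invoking anything the application did not expose.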

### Agentic and multi step workflows

[AI agents](/wiki/ai_agent) built on conversational model backends can take sequences of tool-calling actions across many turns to accomplish long horizon tasks such as booking travel, writing and running test suites, or conducting open-ended research. [Reasoning models](/wiki/reasoning_models) extend this further by spending additional inference compute on intermediate deliberation before acting. The same instruction tuned LLM backbone used in a simple one turn chatbot can, with appropriate scaffolding, drive agentic loops that run for hours across hundreds of tool calls.

## Current state and trends

From 2024 through 2026 the chat assistant category has been shaped by several developments. Multimodality is now standard: GPT-4o, Gemini, and Claude accept images and audio and produce speech, with live voice chat at sub second latency. [Tool use](/wiki/tool_use) and function calling let chat models call APIs, run code, and browse the web during a single response. Context windows have grown past one million tokens, allowing entire codebases to fit into a single conversation. Persistent memory features store user facts across sessions. Dedicated reasoning models, including OpenAI's [o1](/wiki/o1) and [o3](/wiki/o3) and Anthropic's [extended thinking](/wiki/extended_thinking) mode, spend extra compute on intermediate reasoning before producing a final answer.

### Multimodal voice interfaces

GPT-4o, released in May 2024, was the first major model trained natively end to end on audio input and output rather than routing speech through a transcription step. This allowed sub second conversational latency and natural prosody in replies. Google's Gemini Live and Gemini Multimodal Live API similarly enable real time, voice native conversations with visual grounding. These developments extend the conversational model paradigm beyond text into always-on ambient computing interfaces, with voice becoming the primary modality in mobile and home device contexts.

### Persistent memory and personalization

Early chat models operated in a single session context window with no memory across sessions. Starting in 2023 and becoming widespread by 2025, major chat assistants added persistent memory stores that accumulate user facts, preferences, and history across conversations. ChatGPT Memory, Claude's Projects feature, and Google Gemini's personalization settings each implement variants of this capability. The interaction between long context windows (which can in principle hold all prior conversation) and learned memory (which extracts and persists salient facts) remains an active design space.

## Safety and alignment

Large chat models are aligned with operator and user intent through RLHF, Constitutional AI and other RLAIF variants, instruction hierarchy training, refusal classifiers, and post deployment monitoring. The Sparrow paper showed that targeted human judgement against explicit rules reduced rule breaking under adversarial probing by roughly three times relative to a baseline dialogue model.[9] Jailbreaks, in which users craft prompts that bypass safety policies, remain an active area; defenses include adversarial training, input and output classifiers, and structured system prompts. [Hallucination](/wiki/hallucination), the production of confident but incorrect statements, is mitigated by retrieval augmentation, citation requirements, and post hoc fact checking. [Red teaming](/wiki/red_teaming) is now a standard pre release step at major labs.

### Sycophancy

A specific alignment failure mode that has received increasing attention is sycophancy: the tendency of RLHF trained models to adjust their stated beliefs and recommendations to match perceived user preferences rather than providing accurate information. Empirical studies have shown that sycophancy-induced error rates range from 22 to 94 percent across 26 frontier models when models are exposed to false statements presented as user beliefs. GPT-4o's accuracy on a factual task fell from 98.2 to 64.4 percent when false prior beliefs were inserted. OpenAI withdrew a 2025 update to ChatGPT after finding it was excessively sycophantic. Sycophancy is difficult to eliminate through RLHF alone because human raters often prefer responses that agree with them, creating a training signal that rewards sycophancy.

### Instruction hierarchy and system prompts

Deployed chat assistants operate under a layered instruction architecture. A system prompt, set by the API operator, establishes the model's persona, capabilities, and refusal boundaries for a given deployment. User messages are then interpreted in the context of that system prompt. The model is trained to give operator instructions higher priority than conflicting user instructions, and to give its own safety trained behaviors higher priority than conflicting operator instructions. This instruction hierarchy, formalized in Anthropic's and OpenAI's alignment documentation, determines how content policies are enforced in practice. Bypassing the system prompt through crafted user messages is one of the primary attack surfaces for jailbreaks.

## Limitations

Despite rapid progress, conversational models still show predictable failure modes. They fabricate facts and citations when not grounded in retrieval. They exhibit sycophancy, adjusting beliefs to match the user instead of pushing back. They lose track of facts across long conversations even with million token contexts. They are sensitive to prompt phrasing and can be steered into harmful outputs by adversarial inputs. Persona stability is brittle; minor prompt changes can flip tone or stance. These issues motivate continued work on retrieval, tool use, reasoning, alignment training, and evaluation.

A 2026 analysis of 362 documented AI safety incidents, a 55 percent increase over 2024, found that hallucination remained the most frequent cause at 38 percent of cases, followed by bias and robustness failures. Hallucination rates vary widely by model: some frontier models operate below 1 percent on standard probes while others exceed 25 percent. The gap between the best and worst performers shows that the choice of model for a deployment matters as much as headline capability.

## Relation to reasoning models

Modern chat assistants and [reasoning models](/wiki/reasoning_models) share the same pretrained LLM backbone and RLHF fine tuning lineage, but they optimize for different inference time behaviors. A conversational model answers directly, generating its reply token by token with low latency and no separate deliberation phase, which suits back and forth dialogue. A reasoning model spends additional inference compute on an internal chain of thought before producing a final answer, which suits hard single turn problems such as mathematics and code. Many frontier products expose both behaviors, for example Claude with its extended thinking mode and ChatGPT with GPT-4o alongside the o-series models, routing easy conversational turns to the fast path and hard analytical turns to extended thinking.
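The routing idea can be sketched as a dispatcher that picks a path per turn. How production systems actually decide is not described here, so the keyword heuristic below is purely an assumption for illustration, and `HARD_MARKERS`, `route`, `fast_reply`, and `reasoning_reply` are hypothetical names.

```python
from typing import Callable

# Assumed markers of a "hard" turn; a real router would more plausibly use
# a learned classifier or an explicit user setting.
HARD_MARKERS = ("prove", "derive", "debug", "step by step")


def needs_extended_thinking(message: str) -> bool:
    """Crude heuristic: does the turn look like an analytical task?"""
    lowered = message.lower()
    return any(marker in lowered for marker in HARD_MARKERS)


def route(
    message: str,
    fast_path: Callable[[str], str],
    slow_path: Callable[[str], str],
) -> str:
    """Send analytical turns to the slow path, everything else to the fast one."""
    handler = slow_path if needs_extended_thinking(message) else fast_path
    return handler(message)


def fast_reply(message: str) -> str:
    """Stand-in for a low latency conversational response."""
    return f"[fast] {message}"


def reasoning_reply(message: str) -> str:
    """Stand-in for a response produced after extended thinking."""
    return f"[extended thinking] {message}"


greeting = route("Hi there!", fast_reply, reasoning_reply)
proof = route("Prove that sqrt(2) is irrational", fast_reply, reasoning_reply)
```

The design trade-off is latency against accuracy: a turn misrouted to the fast path costs answer quality, while one misrouted to the slow path costs time and compute.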

## See also

- [Chatbot](/wiki/chatbot)
- [Large language model](/wiki/large_language_model)
- [ChatGPT](/wiki/chatgpt)
- [Claude](/wiki/claude)
- [Gemini](/wiki/gemini)
- [RLHF](/wiki/rlhf)
- [Constitutional AI](/wiki/constitutional_ai)
- [Retrieval augmented generation](/wiki/retrieval_augmented_generation)
- [AI agent](/wiki/ai_agent)
- [Tool use](/wiki/tool_use)
- [Transformer](/wiki/transformer)
- [Reasoning models](/wiki/reasoning_models)
- [Hallucination](/wiki/hallucination)
- [Chatbot Arena](/wiki/chatbot_arena)
- [ELIZA](/wiki/eliza_chatbot)
- [InstructGPT](/wiki/instructgpt)
- [Siri](/wiki/siri)
- [Alexa](/wiki/alexa)

## References

1. Weizenbaum, *ELIZA*, Communications of the ACM, 1966. [https://dl.acm.org/doi/10.1145/365153.365168](https://dl.acm.org/doi/10.1145/365153.365168)
2. Wallace, *The Anatomy of A.L.I.C.E.*, Springer 2009. [https://link.springer.com/chapter/10.1007/978-1-4020-6710-5_13](https://link.springer.com/chapter/10.1007/978-1-4020-6710-5_13)
3. Vinyals and Le, *A Neural Conversational Model*, arXiv:1506.05869, 2015. [https://arxiv.org/abs/1506.05869](https://arxiv.org/abs/1506.05869)
4. Zhang et al., *DialoGPT*, arXiv:1911.00536, 2019. [https://arxiv.org/abs/1911.00536](https://arxiv.org/abs/1911.00536)
5. Adiwardana et al., *Towards a Human-like Open-Domain Chatbot* (Meena), arXiv:2001.09977, 2020. [https://arxiv.org/abs/2001.09977](https://arxiv.org/abs/2001.09977)
6. Roller et al., *Recipes for building an open-domain chatbot* (BlenderBot), arXiv:2004.13637, 2020. [https://arxiv.org/abs/2004.13637](https://arxiv.org/abs/2004.13637)
7. Thoppilan et al., *LaMDA*, arXiv:2201.08239, 2022. [https://arxiv.org/abs/2201.08239](https://arxiv.org/abs/2201.08239)
8. Ouyang et al., *Training language models to follow instructions with human feedback* (InstructGPT), arXiv:2203.02155, 2022. [https://arxiv.org/abs/2203.02155](https://arxiv.org/abs/2203.02155)
9. Glaese et al., *Improving alignment of dialogue agents via targeted human judgements* (Sparrow), arXiv:2209.14375, 2022. [https://arxiv.org/abs/2209.14375](https://arxiv.org/abs/2209.14375)
10. Bai et al., *Constitutional AI*, arXiv:2212.08073, 2022. [https://arxiv.org/abs/2212.08073](https://arxiv.org/abs/2212.08073)
11. Rafailov et al., *Direct Preference Optimization*, arXiv:2305.18290, 2023. [https://arxiv.org/abs/2305.18290](https://arxiv.org/abs/2305.18290)
12. Zheng et al., *Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena*, arXiv:2306.05685, 2023. [https://arxiv.org/abs/2306.05685](https://arxiv.org/abs/2306.05685)
13. OpenAI, *Introducing ChatGPT*, Nov 30, 2022. [https://openai.com/index/chatgpt/](https://openai.com/index/chatgpt/)
14. DeepMind, *Building safer dialogue agents*, Sep 22, 2022. [https://deepmind.google/discover/blog/building-safer-dialogue-agents/](https://deepmind.google/discover/blog/building-safer-dialogue-agents/)
15. Budzianowski et al., *MultiWOZ*, arXiv:1810.00278, 2018. [https://arxiv.org/abs/1810.00278](https://arxiv.org/abs/1810.00278)
16. Touvron et al., *Llama 2: Open Foundation and Fine-Tuned Chat Models*, arXiv:2307.09288, 2023. [https://arxiv.org/abs/2307.09288](https://arxiv.org/abs/2307.09288)
17. Zhang et al., *Persona-Chat*, arXiv:1801.07243, 2018. [https://arxiv.org/abs/1801.07243](https://arxiv.org/abs/1801.07243)
18. Dinan et al., *Wizard of Wikipedia*, arXiv:1811.01241, 2019. [https://arxiv.org/abs/1811.01241](https://arxiv.org/abs/1811.01241)
19. Weston et al., *DSTC series overview*, 2013 onward. [https://dstc.cs.mcgill.ca/](https://dstc.cs.mcgill.ca/)
20. Eric et al., *MultiWOZ 2.1*, arXiv:1907.01669, 2019. [https://arxiv.org/abs/1907.01669](https://arxiv.org/abs/1907.01669)
21. Wiggers, *Sam Altman says ChatGPT has hit 800M weekly active users*, TechCrunch, October 6, 2025. [https://techcrunch.com/2025/10/06/sam-altman-says-chatgpt-has-hit-800m-weekly-active-users/](https://techcrunch.com/2025/10/06/sam-altman-says-chatgpt-has-hit-800m-weekly-active-users/)
22. Wiggers, *Google's Gemini app has surpassed 750M monthly active users*, TechCrunch, February 4, 2026. [https://techcrunch.com/2026/02/04/googles-gemini-app-has-surpassed-750m-monthly-active-users/](https://techcrunch.com/2026/02/04/googles-gemini-app-has-surpassed-750m-monthly-active-users/)
23. Weizenbaum, *Computer Power and Human Reason: From Judgment to Calculation*, W. H. Freeman, 1976. [https://archive.org/details/computerpowerhum0000weiz](https://archive.org/details/computerpowerhum0000weiz)

