Conversational Models

Updated Jun 22, 2026

See also: Natural Language Processing Models and Tasks

Conversational models are computational systems designed to carry on a dialogue with human users in natural language, ranging from early pattern matching scripts to modern instruction tuned large language models. The category spans six decades, from ELIZA, published at MIT in 1966, to today's chatbot assistants built on large language models: OpenAI's ChatGPT reached roughly 800 million weekly active users by October 2025, and Google's Gemini app surpassed 750 million monthly active users by the fourth quarter of 2025, making conversational models one of the fastest adopted software categories in history.^[21]^[22] The field overlaps with dialogue systems research, with research on AI agents, and with consumer assistants such as Siri and Alexa.

What is a conversational model?

A dialogue system is software that conducts a back and forth exchange with a user across multiple turns. A chatbot is a dialogue system whose primary interface is text or voice chat. Conversational AI is an industry umbrella term for the language understanding, dialogue management, and language generation components that produce coherent multi turn responses.

Researchers split conversational models into two families. Task oriented dialogue systems help a user complete a specific goal such as booking a flight, and they track explicit slot values with a dialogue state tracker. Open domain or chitchat systems aim to hold an engaging conversation across arbitrary topics with no fixed task. Modern instruction tuned LLMs blur the boundary by handling both. The Turing test, proposed by Alan Turing in 1950, used a conversational setup as a thought experiment for machine intelligence and still shapes how the public reads progress.

History: how did conversational models evolve?

Early rule based systems (1960s to 1990s)

The first widely known chatbot, ELIZA, was written by Joseph Weizenbaum at MIT between 1964 and 1967.^[1] Its DOCTOR script imitated a Rogerian psychotherapist using around 200 lines of pattern matching code in the MAD-SLIP language on an IBM 7094 mainframe.^[1] Many users, including Weizenbaum's own secretary, attributed real understanding to the program, an effect later named the ELIZA effect.^[1] Weizenbaum himself was disturbed by the reaction and argued in his 1976 book Computer Power and Human Reason against delegating sensitive human tasks to machines, writing that "extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people."^[23]

In 1972 the Stanford psychiatrist Kenneth Colby released PARRY, a rule based program that simulated a patient with paranoid schizophrenia. PARRY held the first chatbot to chatbot conversation when it was connected to ELIZA over the ARPANET in 1972. A.L.I.C.E., the Artificial Linguistic Internet Computer Entity, was created by Richard Wallace on November 23, 1995 and used a custom XML based language called AIML to express thousands of pattern response rules; it won the Loebner Prize in 2000, 2001, and 2004.^[2] Cleverbot, launched on the web in 1997 by Rollo Carpenter, learned from accumulated user transcripts rather than from a hand authored script.

Assistants and statistical methods (2000s to mid 2010s)

IBM's question answering system Watson won the Jeopardy! exhibition match in February 2011 against champions Ken Jennings and Brad Rutter, showing that statistical retrieval pipelines could handle open ended natural language questions. Apple released Siri on the iPhone 4S in October 2011, building on SRI International's CALO project. Amazon followed with Alexa on the Echo speaker in 2014, Microsoft launched Cortana the same year, and Google introduced Google Assistant in 2016. These products combined automatic speech recognition, intent classification, slot filling, and template based response generation. In 2015 Oriol Vinyals and Quoc Le of Google published A Neural Conversational Model, applying the sequence to sequence framework to dialogue and showing that a single recurrent network trained on subtitle and IT helpdesk corpora could produce passable open domain responses without hand authored rules.^[3]

Pretrained transformer chatbots (2019 to 2022)

The arrival of the transformer architecture and pretrained models such as GPT-2 reset expectations for open domain chatbots. Microsoft Research released DialoGPT in November 2019, a GPT-2 response generator trained on 147 million Reddit comment exchanges from 2005 to 2017, up to 762 million parameters.^[4] Google followed in January 2020 with Meena, a 2.6 billion parameter Evolved Transformer trained on 40 billion words of public social media conversation; it introduced the Sensibleness and Specificity Average metric and scored 79 percent SSA versus 86 percent for humans.^[5]

Facebook AI Research published Recipes for building an open domain chatbot by Stephen Roller and colleagues in April 2020, releasing the BlenderBot family in 90M, 2.7B, and 9.4B parameter sizes through the ParlAI framework.^[6] BlenderBot 2.0 added long term memory and internet search in 2021, and BlenderBot 3 in August 2022 was a 175B parameter system deployed as a public demo for safety research. Google's LaMDA, introduced by Romal Thoppilan and colleagues in January 2022, scaled the dialogue specialized transformer to 137B parameters trained on 1.56 trillion words.^[7] DeepMind's Sparrow, presented by Amelia Glaese and colleagues in September 2022, applied RLHF against 23 hand written rules and used Google search to support claims; it gave a plausible answer with evidence 78 percent of the time on factual questions.^[9]^[14]

When did ChatGPT launch, and what came after?

OpenAI released ChatGPT on November 30, 2022, built on a fine tuned GPT-3.5 using the SFT + reward model + PPO recipe from the InstructGPT paper.^[13]^[8] It reached one million users in five days and an estimated 100 million monthly active users by January 2023, the fastest consumer software adoption on record at the time. Adoption kept compounding: OpenAI reported roughly 800 million weekly active users for ChatGPT by October 2025, nearly a tenth of the world's adult population.^[21] GPT-4 followed in March 2023 with image input and improved reasoning. Anthropic released Claude in March 2023, trained with Constitutional AI. Meta released Llama 2 Chat in July 2023 as an open weight model in 7B, 13B, and 70B sizes.^[16] Google launched Bard in March 2023, introduced the Gemini model family in December 2023, and rebranded Bard as Gemini in February 2024; the Gemini app grew to more than 750 million monthly active users by the fourth quarter of 2025.^[22] Mistral released Mistral Large in February 2024.

Dialogue paradigms

Production conversational systems are built around one of five broad paradigms, or a combination of several. The choice of paradigm determines how responses are generated, what data is required, and what failure modes to expect.

Paradigm	Mechanism	Representative systems
Rule based	Hand authored pattern templates, often in AIML or finite state scripts	ELIZA, PARRY, A.L.I.C.E.
Retrieval based	Score candidate responses from a corpus using TF-IDF, BM25, or a neural ranker	Cleverbot, early Smart Reply, Watson QA
Generative seq2seq	Encoder decoder neural net produces tokens conditioned on context	Vinyals and Le 2015, DialoGPT, Meena
Retrieval augmented generative	Generator conditions on retrieved passages from a search index or knowledge base	BlenderBot 2 and 3, Sparrow, RAG systems
Instruction tuned LLM with RLHF	Pretrained LLM fine tuned on demonstrations and ranked feedback	ChatGPT, Claude, Gemini, Llama Chat

Production systems frequently combine paradigms. A customer support bot might use intent classification to route messages, retrieve relevant knowledge base articles, then call an instruction tuned LLM to write the reply.

Rule based and pattern matching systems

Rule based approaches encode conversational behavior as a set of pattern to response mappings. The pattern side may be a simple keyword list, a regular expression, or a topic classifier. The response side may be a fixed template, a slot filling template, or a script that calls an external API. The main advantages are predictability and auditability: the developer knows exactly why any given response was produced. The main disadvantage is the effort required to maintain coverage, since every new topic requires new rules. AIML, the Artificial Intelligence Markup Language introduced with A.L.I.C.E., remains in use in specialized domains such as customer service FAQ bots, where coverage over a bounded topic set is more important than generalization.^[2]
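
The pattern to response mapping described above can be sketched in a few lines of Python. This is a minimal illustration in the spirit of ELIZA, not a reconstruction of the DOCTOR script; the three rules, the fallback line, and the `respond` function are all hypothetical.

```python
import re

# Hypothetical rules: (compiled pattern, response template).
# Captured groups from the pattern are substituted into the template.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\b(mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]
FALLBACK = "Please go on."


def respond(user_turn: str) -> str:
    """Return the response for the first rule whose pattern matches."""
    # Drop trailing punctuation so it does not end up inside a capture.
    text = user_turn.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*match.groups())
    # No rule fired: fall back to a content-free prompt.
    return FALLBACK
```

Because rules are tried in order, the response to any input can be traced to a single rule, which is the auditability property noted above. The coverage cost is also visible: any topic outside the three patterns falls through to the fallback.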

Retrieval based dialogue

Retrieval systems maintain a corpus of past conversation examples or candidate responses and select the best match for each user turn. Classical retrieval uses sparse lexical matching (TF-IDF, BM25); neural retrieval uses dense embeddings from a bi-encoder and computes similarity in the embedding space. Google's Smart Reply, launched in Inbox by Gmail in 2015, was an early neural system that suggested short canned replies to incoming email. Retrieval approaches produce fluent, coherent responses when the corpus is well designed, but they cannot generate novel responses and fail on queries outside the corpus distribution.
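
A sparse lexical retriever of the kind described above can be sketched with the standard library alone. The corpus, the smoothed IDF formula, and the function names below are assumptions chosen for illustration; a production system would more likely use BM25 or a dense bi-encoder.

```python
import math
from collections import Counter

# Hypothetical corpus of (past user turn, stored response) pairs.
CORPUS = [
    ("how do i reset my password", "Use the 'Forgot password' link on the sign-in page."),
    ("what are your opening hours", "We are open 9am to 5pm, Monday to Friday."),
    ("how can i track my order", "Open 'My orders' and select the tracking link."),
]


def tokenize(text):
    return text.lower().split()


def build_idf(docs):
    """Smoothed inverse document frequency, so every term keeps a positive weight."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def vectorize(text, idf):
    """Sparse TF-IDF vector as a dict; tokens unseen in the corpus are ignored."""
    counts = Counter(tokenize(text))
    return {tok: tf * idf[tok] for tok, tf in counts.items() if tok in idf}


def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


IDF = build_idf([question for question, _ in CORPUS])
INDEX = [(vectorize(question, IDF), response) for question, response in CORPUS]


def retrieve(user_turn):
    """Return (best stored response, similarity score) for a user turn."""
    query = vectorize(user_turn, IDF)
    scored = ((cosine(query, vec), response) for vec, response in INDEX)
    score, response = max(scored, key=lambda pair: pair[0])
    return response, score
```

A query with no vocabulary overlap scores 0.0 against every entry, which is the out-of-distribution failure mentioned above; callers would normally apply a score threshold before trusting the match.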

Generative sequence to sequence models

The 2015 Vinyals and Le paper treated dialogue as machine translation: the source sequence is the conversation history and the target sequence is the next response.^[3] Early seq2seq dialogue models used recurrent encoders and decoders. DialoGPT reframed the task as language modeling: train a GPT-2 style left-to-right language model on multi-turn conversation data, then sample from it.^[4] Meena followed the same approach at larger scale with an Evolved Transformer and more careful training data filtering.^[5] The weakness of pure generative approaches is the tendency to produce generic, dull responses ("I don't know," "That's interesting") and to fabricate facts, since generation is conditioned only on the preceding context.

Retrieval augmented generation

Retrieval augmented generation (RAG) addresses generative models' factual limitations by inserting retrieved evidence into the context before generation. BlenderBot 2.0 implemented an internet search module: the model first generates a search query, retrieves documents, then generates its response conditioned on the retrieved content.^[6] DeepMind's Sparrow used a similar approach and additionally required citations.^[9] Modern chat assistants such as ChatGPT with Browse and Gemini with Google Search implement the same idea at large scale. For task oriented systems the retrieved content is typically a structured knowledge base or API result rather than a web page. See the retrieval augmented generation article for a fuller treatment.
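
The three steps described above (write a query, retrieve, generate on the evidence) can be sketched as a single function that takes the model and search engine as parameters. Everything here is hypothetical: `DOCS` and `keyword_search` stand in for a real index, and `write_query` and `generate` stand in for model calls. It is not the BlenderBot or Sparrow implementation.

```python
import re
from typing import Callable, List

# Hypothetical two-document index standing in for a search engine.
DOCS = [
    "ELIZA was written by Joseph Weizenbaum at MIT.",
    "PARRY was released by Kenneth Colby in 1972.",
]


def words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def keyword_search(query: str) -> List[str]:
    """Toy retriever: rank documents by word overlap with the query."""
    terms = words(query)
    return sorted(DOCS, key=lambda doc: -len(terms & words(doc)))


def answer_with_retrieval(
    question: str,
    write_query: Callable[[str], str],
    search: Callable[[str], List[str]],
    generate: Callable[[str], str],
    top_k: int = 2,
) -> str:
    query = write_query(question)       # step 1: the model writes a search query
    passages = search(query)[:top_k]    # step 2: retrieve evidence
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using only the passages below and cite them by number.\n"
        f"{numbered}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)             # step 3: generate conditioned on the evidence
```

Passing identity functions for `write_query` and `generate` returns the assembled prompt, which is a convenient way to inspect what evidence the generator would see.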

Task oriented dialogue: the NLU pipeline

Task oriented dialogue systems typically decompose the problem into four sequential modules. The natural language understanding (NLU) module classifies the user intent (for example, "book flight") and extracts slot values (origin, destination, date). The dialogue state tracker maintains a belief state over all slot value pairs accumulated across the conversation so far. The dialogue policy selects the next system action, such as making an API call or asking a clarifying question. The natural language generation (NLG) module renders the selected action as a surface form utterance.
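
The four modules can be illustrated with a toy flight-booking pipeline. The regular expressions, slot names, prompts, and the `search_flights` action below are all assumptions made for the sketch; real systems use trained classifiers and taggers for the NLU step.

```python
import re

REQUIRED_SLOTS = ("origin", "destination", "date")
SLOT_PATTERNS = (
    ("origin", r"\bfrom (\w+)"),
    ("destination", r"\bto (\w+)"),
    ("date", r"\bon (\w+)"),
)
PROMPTS = {
    "origin": "Where are you flying from?",
    "destination": "Where would you like to go?",
    "date": "What date would you like to travel?",
}


def nlu(utterance):
    """Toy NLU: keyword intent detection plus regex slot extraction (values lowercased)."""
    text = utterance.lower()
    intent = "book_flight" if "flight" in text or "fly" in text else "unknown"
    slots = {}
    for name, pattern in SLOT_PATTERNS:
        match = re.search(pattern, text)
        if match:
            slots[name] = match.group(1)
    return intent, slots


def update_state(state, intent, slots):
    """Dialogue state tracker: merge this turn's intent and slots into the running state."""
    new_state = dict(state)
    if intent != "unknown":
        new_state["intent"] = intent
    new_state.update(slots)
    return new_state


def policy(state):
    """Dialogue policy: request the first missing slot, otherwise call the API."""
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return ("request", slot)
    return ("call_api", "search_flights")


def nlg(action):
    """Template-based generation for the selected action."""
    kind, argument = action
    if kind == "request":
        return PROMPTS[argument]
    return "Searching for flights now."
```

Feeding "Book a flight from Boston to Denver" and then "on Friday" through `nlu`, `update_state`, `policy`, and `nlg` in turn yields a date request followed by the API call, with the state dictionary carrying the earlier slots across turns.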

Early NLU used hand coded grammars and classifiers trained on annotated intent and slot corpora. Later work replaced these with neural models. The BERT era brought joint NLU models that classify intent and fill slots simultaneously in a single forward pass, outperforming pipeline models on standard benchmarks. Dialogue state tracking evolved from fixed ontology classifiers (one classifier per slot) to generative sequence to sequence trackers that handle open vocabulary values and new domains without retraining.

LLMs have largely subsumed the four module pipeline in practice. A single instruction tuned LLM called with a well designed system prompt can perform intent detection, state tracking, policy execution via function calling, and response generation in a single round trip. This reduces engineering complexity but makes the internal states opaque.

How are modern chat models trained?

The dominant pipeline for current chat models has four stages. First, a base language model is pretrained on a large web text corpus. Second, supervised fine tuning (SFT) on curated demonstrations teaches the model to follow instructions. Third, a reward model is trained on human comparisons between candidate outputs, and the policy is optimized against that reward, classically with proximal policy optimization. This recipe was popularized by the InstructGPT paper of Long Ouyang and colleagues in March 2022; they found that a 1.3 billion parameter InstructGPT was preferred to the 175 billion parameter GPT-3 base model on instruction tasks.^[8] Fourth, the model is evaluated and red teamed.

Several variants of the third stage compete in practice. Constitutional AI, introduced by Yuntao Bai and colleagues at Anthropic in December 2022, replaces human harmlessness labels with self critique against a written list of principles, an approach the authors called Reinforcement Learning from AI Feedback (RLAIF).^[10] Direct Preference Optimization (DPO), introduced by Rafael Rafailov and colleagues in May 2023, reparameterizes the RLHF objective so the policy can be optimized directly from preference pairs with a classification loss, avoiding the need for a separate reward model and PPO loop.^[11] Many open weight chat models in the Llama and Mistral families now use DPO or related preference losses.

Supervised fine tuning

In the SFT stage, human contractors write or curate examples of desirable model behavior: user messages paired with ideal assistant responses. These examples are drawn from many task types (question answering, summarization, coding, creative writing, factual lookup, multi turn conversation) to give the model broad coverage. The SFT stage is important because it shapes the conversation format, the register of responses (helpful, direct, appropriately caveated), and basic instruction following. Without SFT, a pretrained base model will continue to complete prompts in a statistical sense but will not produce responses that are useful as assistant turns.

The quality of SFT data matters more than its sheer volume. The InstructGPT paper used roughly 13,000 high quality demonstrations and found that training on this small high quality set produced more helpful outputs than training on much larger low quality sets.^[8] Meta's Llama 2 Chat paper (Touvron et al. 2023) similarly reported that SFT data quality was the binding constraint at their scale.^[16]

Reward modeling and RLHF

Reinforcement Learning from Human Feedback (RLHF), as applied to chat models, works by training a separate reward model on human preference data and then optimizing the policy against that reward. Human annotators compare pairs of model responses to the same prompt and indicate which they prefer. The reward model learns to predict these preferences as a scalar score. The policy is then optimized using PPO so that responses it generates score highly under the reward model, subject to a KL divergence penalty that keeps the policy close to the SFT initialization to avoid degenerate outputs.
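
The two quantities in this description can be written down directly. The sketch below assumes the common Bradley-Terry formulation of the reward model loss, the negative log sigmoid of the score margin, and a per-sample estimate of the KL penalty from log-probabilities; the function names and the default `beta` are illustrative, and no training loop or PPO update is shown.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Reward model loss for one comparison: -log sigmoid(r_chosen - r_rejected)."""
    return softplus(-(reward_chosen - reward_rejected))


def shaped_reward(reward: float, logp_policy: float, logp_sft: float, beta: float = 0.1) -> float:
    """Reward seen by the RL step: reward model score minus a KL penalty estimate.

    logp_policy and logp_sft are the log-probabilities of the same sampled
    response under the current policy and the frozen SFT model.
    """
    return reward - beta * (logp_policy - logp_sft)
```

The loss shrinks as the reward model scores the preferred response further above the rejected one, and the shaped reward falls as the policy assigns its samples much higher probability than the SFT model does, which is what keeps the policy near its initialization.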

RLHF produces large improvements in conversational quality, instruction following, and harmlessness compared to SFT alone. The InstructGPT paper showed that humans strongly preferred InstructGPT outputs over GPT-3 outputs at every model size tested.^[8] However, RLHF also introduces failure modes: the policy can learn to exploit biases in the reward model (reward hacking), generating plausible sounding responses that score well but are factually wrong or superficially helpful without being genuinely useful.

Constitutional AI and RLAIF

Constitutional AI (CAI), introduced by Bai and colleagues at Anthropic, addresses the cost and inconsistency of human harmlessness labeling by replacing human feedback on harmful content with AI generated feedback.^[10] The model is given a list of principles (a constitution), and critiques its own outputs against those principles before revising them. The revised outputs become training data for a second stage of reinforcement learning. This allows the harmlessness reward signal to be generated entirely by AI, reducing dependence on human annotators for the safety signal while preserving human curation of the principle list itself. Claude was trained with Constitutional AI. The broader category of Reinforcement Learning from AI Feedback (RLAIF) now covers a range of methods that use AI models as feedback sources.

Direct Preference Optimization

DPO (Rafailov et al. 2023) showed that the RLHF objective can be solved without training a separate reward model and running PPO.^[11] By reparameterizing the optimal policy in terms of the reference SFT model and the preference data, the optimization reduces to a binary cross entropy loss on preference pairs, with the policy playing the role of an implicit reward model.^[11] DPO is simpler, cheaper, and more stable to train than PPO based RLHF, at some cost in flexibility. Llama 2 Chat itself was aligned with rejection sampling and PPO,^[16] but many open weight chat models released since then use DPO or variants such as IPO and KTO as the alignment stage.
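
For a single preference pair the DPO objective is short enough to state in code. The sketch below takes summed sequence log-probabilities as plain floats; the argument names and the default `beta` are assumptions, and a real implementation would compute these values in batches with an autodiff framework.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair.

    The implicit reward of a response is beta times the log-ratio between
    the policy and the frozen reference (SFT) model; the loss is the
    binary cross entropy on the difference of implicit rewards.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    return softplus(-logits)  # equals -log(sigmoid(logits))
```

When the policy equals the reference model the loss is log 2, and it decreases as the policy raises the probability of the chosen response relative to the rejected one.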

Notable models

Model	Year	Developer	Size	Notes
ELIZA	1966	MIT (Weizenbaum)	~200 lines MAD-SLIP	Pattern matching DOCTOR script
PARRY	1972	Stanford (Colby)	Rule based	Simulated paranoid patient
A.L.I.C.E.	1995	Richard Wallace	AIML rule base	Three time Loebner Prize winner
Cleverbot	1997	Rollo Carpenter	Learned response DB	Trained from user transcripts
IBM Watson	2011	IBM	Cluster pipeline	Won Jeopardy!
Siri	2011	Apple	Server pipeline	iPhone 4S launch
Alexa	2014	Amazon	Server pipeline	Echo voice assistant
Cortana	2014	Microsoft	Server pipeline	Windows Phone, Windows 10
Google Assistant	2016	Google	Server pipeline	Successor to Google Now
DialoGPT	Nov 2019	Microsoft Research	117M-762M	GPT-2 on Reddit
Meena	Jan 2020	Google	2.6B	Evolved Transformer, SSA metric
BlenderBot 1	Apr 2020	Facebook AI	90M, 2.7B, 9.4B	Personality, knowledge, empathy
BlenderBot 2	Jul 2021	Facebook AI	2.7B	Long term memory, web search
LaMDA	Jan 2022	Google	up to 137B	Dialogue specialized transformer
InstructGPT	Mar 2022	OpenAI	1.3B-175B	RLHF on GPT-3 base
BlenderBot 3	Aug 2022	Meta AI	175B	Public deployment for safety research
Sparrow	Sep 2022	DeepMind	70B	RLHF against 23 rules, search citations
ChatGPT	Nov 2022	OpenAI	GPT-3.5 backbone	Launched Nov 30, 2022
Claude	Mar 2023	Anthropic	Not disclosed	Trained with Constitutional AI
GPT-4	Mar 2023	OpenAI	Not disclosed	Multimodal image input
Bard	Mar 2023	Google	LaMDA, then PaLM 2, then Gemini	Rebranded as Gemini Feb 2024
Llama 2 Chat	Jul 2023	Meta	7B, 13B, 70B	Open weight chat with RLHF
Gemini	Dec 2023	Google DeepMind	Nano, Pro, Ultra	Multimodal from the start
Mistral Large	Feb 2024	Mistral AI	Not disclosed	Multilingual, function calling

Sizes reflect public disclosures at release; many later versions are not parameter tagged.

How are conversational models evaluated?

Evaluating open ended chat is harder than scoring a classification task. The field uses a mix of static benchmarks, task oriented evaluations, and live human comparisons.

Benchmark	Type	Notes
MT-Bench	Multi turn open ended	80 questions across 8 categories scored by GPT-4 as judge, Zheng et al. 2023
LMSYS Chatbot Arena	Crowdsourced battle	Pairwise blind votes converted to Elo ratings, Zheng et al. 2023
Persona-Chat	Persona grounded chat	164k utterances over 1,155 personas, Zhang et al. 2018
ConvAI2	Persona-Chat extension	NeurIPS 2018 challenge with the same setup
MultiWOZ	Task oriented	10k human to human dialogues across 7 domains, Budzianowski et al. 2018
Wizard of Wikipedia	Knowledge grounded	Dialogues where one side has access to Wikipedia, Dinan et al. 2019
DSTC	Annual challenges	Dialogue State Tracking Challenge series since 2013
AlpacaEval	Instruction following	Automated preference evaluation against a reference model

Automatic metrics such as BLEU and perplexity correlate poorly with human judgements of dialogue quality, which is why human evaluation, LLM judges, and pairwise voting platforms such as Chatbot Arena have become the standard for ranking chat assistants.

MT-Bench

MT-Bench, introduced by Zheng and colleagues at LMSYS in 2023, consists of 80 multi turn questions across eight categories: writing, roleplay, extraction, reasoning, mathematics, coding, STEM knowledge, and humanities and social science knowledge.^[12] Each question has a first turn and a follow up turn designed to test whether the model can handle context from the prior exchange. Responses are graded by GPT-4 acting as a judge on a 1 to 10 scale. A key finding of the MT-Bench paper is that strong LLM judges achieve over 80 percent agreement with controlled human raters, matching the inter-annotator agreement between humans, which validates LLM-as-judge as a scalable evaluation method.^[12] The paper also documents failure modes of LLM judges including position bias (preferring whichever response is listed first), verbosity bias (preferring longer responses regardless of quality), and self-enhancement bias (a model preferring its own outputs).^[12] The MT-Bench questions, 3,000 expert votes, and 30,000 conversations are publicly released.
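
One simple guard against position bias in pairwise judging is to query the judge twice with the answers swapped and accept a winner only when both verdicts agree. The sketch below assumes a hypothetical `judge` callable that returns "first", "second", or "tie" for the answers in the order presented; it is an illustration of the idea, not the MT-Bench code.

```python
from typing import Callable

Judge = Callable[[str, str, str], str]  # (question, first answer, second answer) -> verdict


def judge_pair(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'tie', counting a win only if it survives a position swap."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    # Inconsistent or tied verdicts are treated conservatively as a tie.
    return "tie"


def always_first(question: str, first: str, second: str) -> str:
    """A maximally position-biased toy judge."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """A toy judge with pure verbosity bias, but no position bias."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

The swap neutralizes `always_first`, whose verdicts collapse to a tie, but it does nothing about `prefers_longer`: verbosity bias is consistent across positions and needs a separate control.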

Chatbot Arena

Chatbot Arena (LMSYS) is a crowdsourced evaluation platform where users submit a message, receive responses from two anonymized models, and vote for the better one.^[12] The pairwise votes are converted to an Elo-style rating using the Bradley-Terry model, which estimates each model's latent quality from win/loss records.^[12] The platform has accumulated more than 6 million human votes across hundreds of models as of 2025, making it one of the most data-rich human preference datasets for conversational AI. Because evaluators submit their own prompts, Chatbot Arena captures real user intent distributions rather than the narrow set of topics covered by static benchmarks. Its main limitation is that the crowd may have different preferences from expert evaluators, and popularity effects can inflate ratings for newly released models.
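
The conversion from pairwise votes to ratings can be illustrated with the classic online Elo update. This is a simplified sketch: the K-factor of 32, the initial rating of 1000, and the model names are assumptions, and a leaderboard based on the Bradley-Terry model would fit all votes jointly rather than updating one battle at a time.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model (base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(ratings: dict, model_a: str, model_b: str, winner: str,
               k: float = 32.0, initial: float = 1000.0) -> None:
    """Update ratings in place after one battle; any other winner value counts as a tie."""
    rating_a = ratings.get(model_a, initial)
    rating_b = ratings.get(model_b, initial)
    expected_a = expected_score(rating_a, rating_b)
    if winner == model_a:
        score_a = 1.0
    elif winner == model_b:
        score_a = 0.0
    else:
        score_a = 0.5
    # The two updates are equal and opposite, so total rating is conserved.
    ratings[model_a] = rating_a + k * (score_a - expected_a)
    ratings[model_b] = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
```

Two unrated models start with an expected score of 0.5 each, so the first win moves the winner up by half the K-factor and the loser down by the same amount. An upset against a higher-rated model moves both ratings further than an expected result does.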

Task oriented evaluation

Task oriented dialogue evaluation uses task completion rate (whether the system successfully completed the user's goal), slot error rate (fraction of slot values extracted incorrectly), and dialogue turn efficiency (number of turns needed to complete the task) alongside language quality metrics. MultiWOZ, the most widely used task oriented benchmark, provides 10,000 human to human dialogues across 7 domains (restaurant, hotel, attraction, taxi, train, hospital, police) and tests end to end systems including NLU, state tracking, policy, and generation.^[15] Successive versions (2.1 through 2.4) corrected annotation errors in the original release, which were significant enough to invalidate comparisons across versions.^[20]
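
The three task level metrics are straightforward to compute once each dialogue has been logged. The record layout below (`completed`, `turns`, `predicted`, `reference`) is a hypothetical format, and the slot error rate here counts a reference slot as an error when the predicted value is missing or different; it is not the official MultiWOZ scoring script.

```python
def slot_error_rate(predicted: dict, reference: dict) -> float:
    """Fraction of reference slots whose predicted value is missing or wrong.

    Extra predicted slots that are absent from the reference are ignored
    in this simplified definition.
    """
    if not reference:
        return 0.0
    errors = sum(1 for slot, value in reference.items() if predicted.get(slot) != value)
    return errors / len(reference)


def summarize(dialogues: list) -> dict:
    """Aggregate task completion rate, mean slot error rate, and mean turn count."""
    n = len(dialogues)
    return {
        "task_completion_rate": sum(1 for d in dialogues if d["completed"]) / n,
        "mean_slot_error_rate": sum(
            slot_error_rate(d["predicted"], d["reference"]) for d in dialogues
        ) / n,
        "mean_turns": sum(d["turns"] for d in dialogues) / n,
    }
```

Reporting the three numbers together matters because they trade off: a system can raise completion by asking more clarifying questions, at the cost of turn efficiency.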

SSA and human parity claims

Meena's Sensibleness and Specificity Average (SSA) metric operationalizes two dimensions of human judgement: whether a response makes sense in context (sensibleness) and whether it is specific rather than generic (specificity).^[5] Human raters annotate each response on these two binary dimensions and the average is computed. Meena scored 79 percent SSA versus 86 percent for humans, producing an early "human parity" style comparison for open domain chat.^[5] Such comparisons are sensitive to the definition of human parity and the population of annotators, and later work showed that SSA gaps could be closed by scale without producing genuine conversational competence on harder tasks.
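
As a worked example, SSA can be computed from per-response rater labels as the mean of the two rates. The sketch assumes each label is a `(sensible, specific)` pair of booleans and follows the convention that a response judged not sensible is also counted as not specific; the function name is illustrative.

```python
def ssa(labels: list) -> float:
    """Sensibleness and Specificity Average over (sensible, specific) boolean pairs."""
    n = len(labels)
    sensibleness = sum(1 for sensible, _ in labels if sensible) / n
    # A response only counts as specific if it was also judged sensible.
    specificity = sum(1 for sensible, specific in labels if sensible and specific) / n
    return (sensibleness + specificity) / 2
```

For four responses labeled sensible and specific, sensible only, neither, and sensible and specific, sensibleness is 0.75 and specificity is 0.5, giving an SSA of 0.625.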

What are conversational models used for?

Conversational models are deployed in customer support to triage tickets and draft replies; in virtual assistants on phones, speakers, and cars; in mental health support such as the CBT chatbot Woebot launched in 2017; in tutoring and language learning including Duolingo's Max; in programming assistance through tools such as GitHub Copilot Chat and Cursor; and in productivity assistants embedded in office suites. Modern AI agents extend chat into multi step task execution by calling external tools, browsing the web, and writing code, using the same instruction tuned LLM backbones.

Function calling and tool use

A key capability that distinguishes post-2022 chat assistants from earlier generative chatbots is structured function calling. Introduced by OpenAI in its chat API in 2023, function calling allows the model to emit a structured JSON request that names a function and supplies argument values, instead of a natural language reply. The calling application executes the function and returns the result to the model, which incorporates it into its next response. This mechanism turns a conversational model into a runtime orchestrator that can query databases, call REST APIs, execute code, and interact with external services within a single conversation. Function calling is now a standard feature of major chat APIs including the OpenAI, Anthropic, and Google Gemini APIs. See tool use for a comprehensive treatment.
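
The application side of this exchange can be sketched as a small dispatcher. The JSON shape used here (a `name` field and an `arguments` object), the `get_weather` tool, and its canned data are all hypothetical and do not reproduce any vendor's wire format.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool with canned data; a real tool would call an external API."""
    return {"city": city, "temp_c": 18}


# Registry of functions the model is allowed to call.
TOOLS = {"get_weather": get_weather}


def execute_tool_call(raw_call: str) -> str:
    """Parse a model-emitted call, run the named function, and return a JSON result string."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        # Report the problem back to the model instead of raising.
        return json.dumps({"error": f"unknown function: {name}"})
    try:
        result = TOOLS[name](**call.get("arguments", {}))
    except TypeError as exc:
        # Wrong or missing arguments are also returned as data.
        return json.dumps({"error": str(exc)})
    return json.dumps(result)
```

The returned string would be appended to the conversation as the tool result, and the model would then write its reply. Returning errors as data rather than raising lets the model see the failure and retry with corrected arguments.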

Agentic and multi step workflows

AI agents built on conversational model backends can take sequences of tool-calling actions across many turns to accomplish long horizon tasks such as booking travel, writing and running test suites, or conducting open-ended research. Reasoning models extend this further by spending additional inference compute on intermediate deliberation before acting. The same instruction tuned LLM backbone used in a simple one turn chatbot can, with appropriate scaffolding, drive agentic loops that run for hours across hundreds of tool calls.

Current state and trends

From 2024 through 2026 the chat assistant category has been shaped by several developments. Multimodality is now standard: GPT-4o, Gemini, and Claude accept images and audio and produce speech, with live voice chat at sub second latency. Tool use and function calling let chat models call APIs, run code, and browse the web during a single response. Context windows have grown past one million tokens, allowing entire codebases to fit into a single conversation. Persistent memory features store user facts across sessions. Dedicated reasoning models, including OpenAI's o1 and o3 and Anthropic's extended thinking mode, spend extra compute on intermediate reasoning before producing a final answer.

Multimodal voice interfaces

GPT-4o, released in May 2024, was the first major model trained natively end to end on audio input and output rather than routing speech through a transcription step. This allowed subsecond conversational latency and natural prosody in replies. Google's Gemini Live and Gemini Multimodal Live API similarly enable real time voice native conversations with visual grounding. These developments extend the conversational model paradigm beyond text into always-on ambient computing interfaces, with voice becoming the primary modality in mobile and home device contexts.

Persistent memory and personalization

Early chat models operated in a single session context window with no memory across sessions. Starting in 2023 and becoming widespread by 2025, major chat assistants added persistent memory stores that accumulate user facts, preferences, and history across conversations. ChatGPT Memory, Claude's Projects feature, and Google Gemini's personalization settings each implement variants of this capability. The interaction between long context windows (which can in principle hold all prior conversation) and learned memory (which extracts and persists salient facts) remains an active design space.

Safety and alignment

Large chat models are aligned with operator and user intent through RLHF, Constitutional AI and other RLAIF variants, instruction hierarchy training, refusal classifiers, and post deployment monitoring. The Sparrow paper showed that targeted human judgement against explicit rules reduced rule breaking under adversarial probing roughly threefold relative to a baseline dialogue model.^[9] Jailbreaks, in which users craft prompts that bypass safety policies, remain an active research area; defenses include adversarial training, input and output classifiers, and structured system prompts. Hallucination, the production of confident but incorrect statements, is mitigated by retrieval augmentation, citation requirements, and post hoc fact checking. Red teaming is now a standard pre release step at major labs.

Sycophancy

A specific alignment failure mode that has received increasing attention is sycophancy: the tendency of RLHF trained models to adjust their stated beliefs and recommendations to match perceived user preferences rather than providing accurate information. Empirical studies have shown that sycophancy-induced error rates range from 22 to 94 percent across 26 frontier models when models are exposed to false statements presented as user beliefs. GPT-4o's accuracy on a factual task fell from 98.2 to 64.4 percent when false prior beliefs were inserted. OpenAI withdrew a 2025 update to ChatGPT after finding it was excessively sycophantic. Sycophancy is difficult to eliminate through RLHF alone because human raters often prefer responses that agree with them, creating a training signal that rewards sycophancy.

Instruction hierarchy and system prompts

Deployed chat assistants operate under a layered instruction architecture. A system prompt, set by the API operator, establishes the model's persona, capabilities, and refusal boundaries for a given deployment. User messages are then interpreted in the context of that system prompt. The model is trained to give operator instructions higher priority than conflicting user instructions, and to give its own safety trained behaviors higher priority than conflicting operator instructions. This instruction hierarchy, formalized in Anthropic's and OpenAI's alignment documentation, determines how content policies are enforced in practice. Bypassing the system prompt through crafted user messages is one of the primary attack surfaces for jailbreaks.

Limitations

Despite rapid progress, conversational models still show predictable failure modes. They fabricate facts and citations when not grounded in retrieval. They exhibit sycophancy, adjusting beliefs to match the user instead of pushing back. They lose track of facts across long conversations even with million token contexts. They are sensitive to prompt phrasing and can be steered into harmful outputs by adversarial inputs. Persona stability is brittle; minor prompt changes can flip tone or stance. These issues motivate continued work on retrieval, tool use, reasoning, alignment training, and evaluation.

A 2026 analysis of 362 documented AI safety incidents, a 55 percent increase over 2024, found that hallucination remained the most frequent cause at 38 percent of cases, followed by bias and robustness failures. Hallucination rates vary widely by model: some frontier models operate below 1 percent on standard probes while others exceed 25 percent. The gap between best and worst performers underlines how much deployment choice matters relative to model capability alone.

Relation to reasoning models

Modern chat assistants and reasoning models share the same pretrained LLM backbone and RLHF fine tuning lineage, but they optimize for different inference time behaviors. A conversational model responds in a single autoregressive pass with low latency, which suits back and forth dialogue. A reasoning model spends additional inference compute on an internal chain of thought before producing a final answer, which suits hard single turn problems such as mathematics and code. Many frontier products expose both behaviors, for example Claude with its extended thinking mode and ChatGPT with GPT-4o alongside the o-series models, routing easy conversational turns to the fast path and hard analytical turns to extended thinking.

References

Weizenbaum, *ELIZA*, Communications of the ACM, 1966. https://dl.acm.org/doi/10.1145/365153.365168
Wallace, *The Anatomy of A.L.I.C.E.*, Springer 2009. https://link.springer.com/chapter/10.1007/978-1-4020-6710-5_13
Vinyals and Le, *A Neural Conversational Model*, arXiv:1506.05869, 2015. https://arxiv.org/abs/1506.05869
Zhang et al., *DialoGPT*, arXiv:1911.00536, 2019. https://arxiv.org/abs/1911.00536
Adiwardana et al., *Towards a Human-like Open-Domain Chatbot* (Meena), arXiv:2001.09977, 2020. https://arxiv.org/abs/2001.09977
Roller et al., *Recipes for building an open-domain chatbot* (BlenderBot), arXiv:2004.13637, 2020. https://arxiv.org/abs/2004.13637
Thoppilan et al., *LaMDA*, arXiv:2201.08239, 2022. https://arxiv.org/abs/2201.08239
Ouyang et al., *Training language models to follow instructions with human feedback* (InstructGPT), arXiv:2203.02155, 2022. https://arxiv.org/abs/2203.02155
Glaese et al., *Improving alignment of dialogue agents via targeted human judgements* (Sparrow), arXiv:2209.14375, 2022. https://arxiv.org/abs/2209.14375
Bai et al., *Constitutional AI*, arXiv:2212.08073, 2022. https://arxiv.org/abs/2212.08073
Rafailov et al., *Direct Preference Optimization*, arXiv:2305.18290, 2023. https://arxiv.org/abs/2305.18290
Zheng et al., *Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena*, arXiv:2306.05685, 2023. https://arxiv.org/abs/2306.05685
OpenAI, *Introducing ChatGPT*, Nov 30, 2022. https://openai.com/index/chatgpt/
DeepMind, *Building safer dialogue agents*, Sep 22, 2022. https://deepmind.google/discover/blog/building-safer-dialogue-agents/
Budzianowski et al., *MultiWOZ*, arXiv:1810.00278, 2018. https://arxiv.org/abs/1810.00278
Touvron et al., *Llama 2: Open Foundation and Fine-Tuned Chat Models*, arXiv:2307.09288, 2023. https://arxiv.org/abs/2307.09288
Zhang et al., *Persona-Chat*, arXiv:1801.07243, 2018. https://arxiv.org/abs/1801.07243
Dinan et al., *Wizard of Wikipedia*, arXiv:1811.01241, 2019. https://arxiv.org/abs/1811.01241
Weston et al., *DSTC series overview*, 2013 onward. https://dstc.cs.mcgill.ca/
Eric et al., *MultiWOZ 2.1*, arXiv:1907.01669, 2019. https://arxiv.org/abs/1907.01669
Wiggers, *Sam Altman says ChatGPT has hit 800M weekly active users*, TechCrunch, October 6, 2025. https://techcrunch.com/2025/10/06/sam-altman-says-chatgpt-has-hit-800m-weekly-active-users/
Wiggers, *Google's Gemini app has surpassed 750M monthly active users*, TechCrunch, February 4, 2026. https://techcrunch.com/2026/02/04/googles-gemini-app-has-surpassed-750m-monthly-active-users/
Weizenbaum, *Computer Power and Human Reason: From Judgment to Calculation*, W. H. Freeman, 1976. https://archive.org/details/computerpowerhum0000weiz

