NLU

Updated Jun 23, 2026

Natural Language Understanding (NLU) is the subfield of artificial intelligence and natural language processing that turns unstructured human language into structured representations a computer can act on, covering tasks such as intent classification, named entity recognition, semantic parsing, question answering, and natural language inference. The modern NLU era was defined by BERT, the Google encoder that on its 2018 release obtained "new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement)."^[14] The field has moved through three distinct eras of method: hand-written symbolic rules in the 1960s and 1970s, statistical models trained on labeled corpora during the 1990s and 2000s, and deep neural transformer architectures and large language models since the late 2010s. (For the longer-form treatment of the field, see also natural language understanding.)

NLU is sometimes treated as a synonym for NLP, but the two are not identical. NLP is the broader umbrella that includes NLU plus tasks that produce or transform language, such as machine translation, speech synthesis, summarization, and natural language generation (NLG). NLU concerns the comprehension half of that pipeline: parsing meaning from input text or speech. NLG, by contrast, takes structured input and renders it as fluent language. Modern systems, especially large language models, tend to fold both directions into a single model, but the distinction still organizes how researchers think about evaluation, datasets, and product features.

NLU is the technology that lets a search engine understand what a user really meant by a sloppy query, lets a voice assistant turn the words "set a timer for ten minutes" into a function call with a duration argument, lets a customer-support bot route an angry email to a human agent, and lets a medical record system identify drug names and dosages buried in clinical notes. Most production NLU in 2026 still runs on fine-tuned BERT-family encoders for classification and tagging, with prompted LLMs layered on top for harder reasoning and zero-shot tasks.

How is NLU different from NLP and NLG?

The relationship between these three terms is a frequent source of confusion. The simplest framing is hierarchical: NLP is the parent field, and NLU and NLG are two of its main child areas.

Term	Direction	Typical input	Typical output	Example tasks
NLP	Both	Text or speech	Text, speech, or structured data	Translation, summarization, classification, generation
NLU	Comprehension	Text or speech	Structured meaning representation	Intent detection, entity extraction, parsing, question answering, sentiment, NLI
NLG	Production	Structured data or representation	Fluent text or speech	Report generation, dialogue responses, captioning, summarization

In the symbolic era these were sometimes implemented as completely separate modules with a clean interface. A pipeline might use an NLU front end to convert a user utterance into a logical form, run inference over a knowledge base, then hand the result to an NLG back end that produced an English answer. Modern transformer-based systems usually blur the boundary because the same network learns to do both, but the conceptual split still shows up in product organizations, vendor categories, and benchmark design. Industry analysts and cloud platforms continue to label products as "NLU services" when they expose intent and entity APIs, and as "NLG services" when they expose templated or generative text output.

History

Early symbolic systems (1960s and 1970s)

The roots of NLU lie in the 1950s and 1960s, when researchers tried to teach computers to handle human language using hand-written grammars and dictionaries. The earliest widely cited program was ELIZA, a pattern-matching chatbot written by Joseph Weizenbaum at MIT and described in a January 1966 paper in Communications of the ACM.^[1] ELIZA did not understand language in any deep sense. It used decomposition rules triggered by keywords in user input, then assembled responses from canned reassembly templates. The most famous script, DOCTOR, simulated a Rogerian psychotherapist by reflecting the user's words back as questions. Weizenbaum was startled when users, including his own secretary, attributed feelings and understanding to the program despite being told how it worked, an effect now called the ELIZA effect.
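The decomposition-and-reassembly mechanism described above can be illustrated with a minimal sketch. The two rules, the pronoun table, and the default reply below are hypothetical and are not taken from Weizenbaum's DOCTOR script; they only show the shape of keyword-triggered pattern matching.

```python
import re

# Hypothetical mini-script: (decomposition pattern, reassembly template).
RULES = [
    (re.compile(r"\bi am (.*)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bmy (.*)", re.IGNORECASE), "Tell me more about your {0}."),
]
# Pronoun swaps applied to the captured fragment before reassembly.
REFLECTIONS = {"my": "your", "me": "you", "i": "you", "am": "are"}
DEFAULT_REPLY = "Please go on."


def reflect(fragment: str) -> str:
    """Swap first-person words for second-person ones, word by word."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(utterance: str) -> str:
    """Apply the first matching rule; fall back to a canned reply."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(reflect(match.group(1)))
    return DEFAULT_REPLY


print(respond("I am worried about my job."))
```

Nothing in this loop models meaning: the reply is produced entirely by string matching and substitution, which is the point the paragraph makes about ELIZA.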

The next major milestone was SHRDLU, a system built by Terry Winograd at MIT between 1968 and 1970 and described in his 1972 book Understanding Natural Language.^[2] SHRDLU lived in a simulated micro-world of colored blocks, pyramids, and a robot arm. A user could type English commands such as "Pick up a big red block" or ask questions such as "Is there anything which is bigger than every pyramid but is not as wide as the thing that supports it?" The program would parse the input, plan an action in the blocks world, and either execute it or reply in English. SHRDLU was implemented in Micro-Planner, a procedural knowledge representation language embedded in LISP, and ran on a DEC PDP-6 with a graphics terminal. Its tight coupling of grammar, semantics, planning, and discourse memory was an unusually integrated approach for its time, and the published demonstrations created a wave of optimism about general NLU that did not survive contact with broader, messier text.

Several other systems pursued ambitious symbolic approaches during the 1970s and 1980s. SRI International's TACITUS, developed by Jerry Hobbs and colleagues in the 1980s, used abductive reasoning to interpret expository text and was applied to extraction tasks for the U.S. Defense Advanced Research Projects Agency (DARPA) Message Understanding Conferences (MUC) that ran from 1987 through 1997.^[3] The MIT Wheels project and Roger Schank's group at Yale explored conceptual dependency theory and scripts as ways to model story understanding. The Cyc project, started by Doug Lenat in 1984 at the Microelectronics and Computer Technology Corporation (MCC) in Austin, Texas, took the most extreme stance: that human-level NLU required a hand-encoded common-sense knowledge base, and Lenat's team spent decades writing axioms to fill it.^[4] By 2017 the Cyc knowledge base contained roughly 24.5 million assertions over 1.5 million terms, but the project never produced the general reading machine that Lenat had originally projected.

The symbolic period showed both the appeal and the limits of rule-based NLU. Hand-written grammars and ontologies could give very crisp behavior in narrow domains but degraded sharply on novel input, scaled poorly to new topics, and required specialized expertise to maintain. By the late 1980s the field began shifting toward statistical methods that could be trained automatically from text.

The statistical era (1990s and 2000s)

The statistical era of NLU was driven by two factors: the public release of large annotated corpora, and the success of probabilistic models in adjacent fields like speech recognition. The Penn Treebank, released by the University of Pennsylvania in 1992, gave researchers a million-word corpus of Wall Street Journal text annotated with part-of-speech tags and syntactic parse trees.^[6] Other resources followed: PropBank for predicate-argument structure, FrameNet for semantic frames, OntoNotes for multi-layer linguistic annotation, and the CoNLL shared-task datasets for named entity recognition, chunking, and dependency parsing.^[30]

With data in hand, researchers turned to probabilistic graphical models. Hidden Markov Models (HMMs) became standard for sequence labeling tasks like part-of-speech tagging and named entity recognition. Conditional Random Fields (CRFs), introduced by Lafferty, McCallum, and Pereira in 2001, generalized HMMs to allow arbitrary overlapping features and quickly displaced HMMs for many sequence problems.^[5] Maximum-entropy classifiers handled sentence-level tasks like sentiment analysis. Statistical parsers, including the Collins parser (1999) and the Charniak parser, brought probabilistic parsing to broad-coverage English.

The statistical era also produced the first real progress on open-domain semantic tasks. The Penn Discourse Treebank tackled discourse relations. The FrameNet-based semantic role labeling work of Gildea and Jurafsky in 2002 turned predicate-argument structure into a learnable problem. The DARPA Message Understanding Conferences and later Automatic Content Extraction (ACE) programs made information extraction a measurable engineering discipline. Statistical NLU dominated through the 2000s and into the early 2010s, with support vector machines, logistic regression, and CRFs powering most production text classifiers and taggers.

Neural and deep learning (2013 to 2017)

The deep learning era of NLU started with word embeddings. Word2Vec, released by Tomas Mikolov and colleagues at Google in 2013, showed that simple neural networks could learn dense vector representations for words from raw text such that semantic relationships were preserved as linear directions in the embedding space.^[7] GloVe, from Stanford's Pennington, Socher, and Manning in 2014, achieved similar results with a different objective based on word co-occurrence counts.^[8] These embeddings became the standard input layer for almost every NLU model.
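The claim that semantic relationships appear as linear directions can be made concrete with a toy analogy lookup. The three-dimensional vectors below are hand-picked for illustration and are not real Word2Vec or GloVe embeddings, which have hundreds of learned dimensions; the function names are likewise hypothetical.

```python
import math

# Hypothetical hand-picked vectors; real embeddings are learned from text.
VECTORS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.5, 0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = (word for word in VECTORS if word not in {a, b, c})
    return max(candidates, key=lambda word: cosine(VECTORS[word], target))


# "man is to king as woman is to ...?"
print(analogy("man", "king", "woman"))
```

With these toy values the offset from "man" to "king" added to "woman" lands nearest "queen"; learned embeddings show the same effect only approximately.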

On top of word embeddings, researchers built recurrent neural networks (RNNs) for sequence modeling. The Long Short-Term Memory (LSTM), introduced by Hochreiter and Schmidhuber in 1997 and rediscovered by the deep learning community in the early 2010s, became dominant for sentence classification, sequence tagging, and machine translation.^[9] Bidirectional LSTMs (BiLSTMs), often combined with a CRF output layer, set state-of-the-art results on named entity recognition, semantic role labeling, and dependency parsing through 2017. The Stanford Natural Language Inference (SNLI) corpus, released by Bowman, Angeli, Potts, and Manning in 2015 with 570,000 human-written sentence pairs labeled entailment, contradiction, or neutral, became a popular benchmark for these BiLSTM models.^[10]

ELMo (Embeddings from Language Models), released by the Allen Institute for AI in 2018, was the bridge between fixed word embeddings and the modern era. ELMo trained a deep BiLSTM language model on a large corpus, then used the LSTM's internal states as contextualized word representations for downstream tasks. Adding ELMo to existing systems gave consistent gains on six benchmarks, and the paper's framing of "deep contextualized word representations" is still in use.^[12]

Transformer era (2017 to present)

The transformer architecture, introduced by Vaswani et al. at Google in their 2017 paper Attention Is All You Need, reorganized the field.^[13] Transformers replaced recurrence with self-attention, which let the model read the whole input in parallel and capture long-range dependencies more cleanly than RNNs.
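As a rough illustration of how self-attention lets every position read every other position in one step, here is a minimal single-head scaled dot-product attention in plain Python. It is a sketch under simplifying assumptions: it omits the learned query, key, and value projections, multiple heads, and masking that a real transformer layer uses, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    width = len(values[0])
    outputs = []
    for q in queries:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        outputs.append(mixed)
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0]]
print(self_attention(tokens, tokens, tokens))
```

Each query is processed independently of the others, which is why the computation parallelizes across positions in a way recurrence does not.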

BERT (Bidirectional Encoder Representations from Transformers), released by Google in October 2018, applied the transformer encoder to NLU and set new state-of-the-art results on 11 NLP tasks at release. BERT used masked language modeling and next sentence prediction to pre-train a deep bidirectional encoder on 3.3 billion words of text, then fine-tuned the same network on small downstream datasets with a single linear classifier head per task.^[14] The pre-train-then-fine-tune paradigm displaced almost every prior approach to NLU within a year.
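The masked language modeling objective can be sketched as a data-preparation step. The version below follows the masking scheme described in the BERT paper (select about 15% of tokens; of those, replace 80% with a mask symbol, 10% with a random token, and leave 10% unchanged), but it works on whole words rather than WordPiece tokens, and the function name and tiny vocabulary are illustrative assumptions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (inputs, labels); labels are None at positions that are not predicted."""
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)  # 80%: replace with the mask symbol
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(token)  # 10%: keep the token unchanged
        else:
            inputs.append(token)
            labels.append(None)
    return inputs, labels


rng = random.Random(0)  # seeded only so repeated runs give the same masking
sentence = "the cat sat on the mat".split()
print(mask_tokens(sentence, vocab=["dog", "ran", "hat"], rng=rng))
```

Because the encoder sees both the left and right context of each masked position, this objective is what allows BERT's representations to be bidirectional.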

A flood of BERT-family encoders followed. RoBERTa from Facebook AI in 2019 improved BERT by training longer with more data and dropping next sentence prediction.^[15] ALBERT from Google and the Toyota Technological Institute reduced parameters with cross-layer sharing. DistilBERT from Hugging Face used knowledge distillation to cut the model size by about 40% while keeping 97% of BERT's language-understanding performance. ELECTRA from Stanford and Google replaced masked language modeling with a more sample-efficient replaced-token detection objective. DeBERTa from Microsoft Research added disentangled attention^[16] and was the first model to surpass the human baseline on the SuperGLUE benchmark in January 2021.^[24] ModernBERT, released by Answer.AI and LightOn in December 2024, applied six years of LLM-era architecture improvements to the encoder paradigm, including rotary positional embeddings, FlashAttention, and an 8,192-token native context.^[18]

The transformer era also produced the first generative models that could perform NLU as a side effect. GPT-2 and GPT-3 showed that decoder-only models trained on next-token prediction could perform reading comprehension, sentiment classification, and entailment without any task-specific fine-tuning, simply by reading a few examples in their context window.^[28] By 2023, GPT-4, Claude, Gemini, and other frontier LLMs could match or exceed fine-tuned encoders on many NLU benchmarks via prompting alone. The encoder-only paradigm did not disappear, but the relationship between NLU and NLG has become much tighter than it was a decade ago.

What are the core NLU tasks?

NLU is not a single task. It is a family of related problems, each with its own datasets, evaluation metrics, and dominant architectures. The most important categories:

Task	Goal	Typical evaluation	Example dataset
Intent classification	Pick which of a fixed set of user intents an utterance expresses	Accuracy or F1	SNIPS, ATIS, MASSIVE
Named entity recognition (NER)	Tag spans of text as person, organization, location, etc.	Span F1	CoNLL-2003, OntoNotes 5.0
Semantic parsing	Convert a sentence into a logical form, SQL query, or executable program	Exact match, execution accuracy	GeoQuery, Spider, ATIS
Question answering (QA)	Return an answer to a natural-language question	Exact match, F1	SQuAD, Natural Questions, HotpotQA
Natural language inference (NLI)	Decide whether a hypothesis is entailed by, contradicts, or is neutral to a premise	Accuracy	SNLI, MNLI, ANLI
Sentiment analysis	Classify text by polarity or emotion	Accuracy or macro F1	SST-2, IMDb, Yelp Review
Coreference resolution	Cluster mentions in a document that refer to the same entity	CoNLL F1 (MUC + B-cubed + CEAF)	OntoNotes, GAP
Semantic role labeling (SRL)	Identify predicate-argument structure (who did what to whom, where, when)	Span F1	PropBank, FrameNet
Word sense disambiguation	Pick the correct sense of an ambiguous word in context	Accuracy	SemCor, WiC
Relation extraction	Identify relationships between entities in text	Precision, recall, F1	TACRED, FewRel
Slot filling	Extract structured argument values for a known intent	Slot F1	ATIS, SNIPS, MultiWOZ
Discourse parsing	Identify rhetorical or discourse relations between text segments	F1	Penn Discourse Treebank
Textual entailment	Same as NLI; often used in search and verification pipelines	Accuracy	RTE, SciTail
Machine reading comprehension	Answer questions over a passage by extracting or generating spans	EM, F1	SQuAD 1.1/2.0, RACE, DROP
Topic classification	Assign documents to broad topic labels	Accuracy	AG News, DBpedia

Intent classification and slot filling are the bread and butter of voice assistants and chatbots, where the system needs to decide what the user wants and what arguments they supplied. For the utterance "book a flight from Boston to Denver tomorrow morning," intent classification labels the request BookFlight while slot filling extracts the structured arguments origin=Boston, destination=Denver, and date=tomorrow morning. NER and relation extraction power most information-extraction pipelines in healthcare, legal, and finance. NLI has become the workhorse benchmark for testing whether a model captures sentence-level semantics. Question answering and machine reading comprehension overlap heavily and are the closest things to a general-purpose NLU evaluation.
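The flight-booking example above can be written out as a structured result. The sketch below uses a single hand-written regular expression in place of a trained classifier and tagger, purely to show what the output of intent classification and slot filling looks like; the `NLUResult` type and the pattern are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class NLUResult:
    intent: str
    slots: dict = field(default_factory=dict)


# Hypothetical pattern for one intent; a production system would use a
# trained intent classifier and slot tagger rather than a regex.
BOOK_FLIGHT = re.compile(
    r"book a flight from (?P<origin>\w+) to (?P<destination>\w+)(?: (?P<date>.+))?",
    re.IGNORECASE,
)


def parse(utterance: str) -> NLUResult:
    """Map an utterance to an intent label plus named slot values."""
    match = BOOK_FLIGHT.search(utterance)
    if match is None:
        return NLUResult(intent="Unknown")
    slots = {name: value for name, value in match.groupdict().items() if value}
    return NLUResult(intent="BookFlight", slots=slots)


print(parse("book a flight from Boston to Denver tomorrow morning"))
```

The downstream application only ever sees the intent label and the slot dictionary, which is what makes the NLU component replaceable by a learned model with the same output shape.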

Modern approaches

BERT-family encoders

For most production NLU work in 2026, the default architecture is still a fine-tuned BERT-family encoder. The recipe is well understood: take a pre-trained checkpoint such as bert-base-uncased, roberta-base, deberta-v3-base, or modernbert-base; add one task-specific output head (a linear layer for classification, a token-level head for tagging, span pointers for extractive QA); and fine-tune all parameters on the task dataset for one to four epochs.

Why do encoders still dominate production NLU?

They are small enough (typically 100M to 400M parameters) to run cheaply on CPUs or modest GPUs at high throughput.
They produce structured outputs (class labels, BIO tags, spans) without parsing tricks.
They are deterministic at inference time, so engineering teams can validate and monitor them with traditional ML ops practices.
They often match or beat prompted LLMs on standard classification and tagging benchmarks at a small fraction of the per-call cost.
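The BIO tags mentioned in the list above are per-token labels that mark the Beginning, Inside, and Outside of entity spans. A minimal sketch of how a tagger's output might be decoded into spans follows; the function name and the lenient handling of a stray `I-` tag are assumptions of this example, not a standard.

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token and BIO tag lists into (label, text) spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # A span starts at B-X, or at an I-X that does not continue the open span.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if tag == "O" or starts_new:
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
        if tag != "O":
            if current_label is None:
                current_label = tag[2:]
            current_tokens.append(token)
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


tokens = ["Ada", "Lovelace", "visited", "London"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
```

Because the decoding is a deterministic pass over the tags, the structured output needs no free-text parsing, which is the advantage the list item refers to.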

DeBERTa-v3, released by Microsoft in 2021 and accepted at ICLR 2023, was the strongest fine-tuning baseline for several years.^[17] ModernBERT replaced absolute position embeddings with rotary positional embeddings (RoPE), pushed the native context to 8,192 tokens, and trained on 2 trillion tokens, and has displaced DeBERTa-v3 in many new projects since its December 2024 release.^[18] Multilingual variants such as XLM-R and mDeBERTa-v3 fill the same role for cross-lingual NLU.

Sequence-to-sequence models

A second family treats every NLU task as text-to-text: the input is the natural-language question or instruction plus the source text, and the output is a string. T5 (Text-To-Text Transfer Transformer), released by Google in 2019, was the most influential example, and frames NER, classification, QA, and translation all as conditional generation.^[29] FLAN-T5 added instruction tuning on top. Sequence-to-sequence models are slower than encoders for fixed-format outputs but more flexible because the same model can produce any string.

Large language models and zero-shot NLU

Decoder-only large language models such as GPT-4, Claude, Gemini, and Llama can perform NLU through prompting alone, with no task-specific training: the model reads an instruction, optionally with a few labeled examples, in its context window. With an instruction only, this is called zero-shot NLU; with examples in the prompt, it is few-shot NLU.

Zero-shot NLU has reshaped how product teams think about new tasks. Before LLMs, adding a new intent to a chatbot required collecting hundreds of examples, training a model, and validating it. With a strong LLM, a product manager can write a one-paragraph description of the intent and have a working classifier in minutes. Few-shot prompting, where the prompt includes 5 to 50 labeled examples, often gets within a few points of fine-tuned baselines on standard benchmarks.^[28]
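A few-shot classification prompt is just a string assembled from an instruction, the allowed labels, and the labeled examples. The sketch below shows one plausible layout; the `Text:` / `Label:` format and the function name are assumptions of this example, and the call to an actual LLM is left out.

```python
def build_few_shot_prompt(task, labels, examples, query):
    """Assemble an instruction, labeled examples, and the unlabeled query."""
    lines = [task, "Allowed labels: " + ", ".join(labels), ""]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The prompt ends on an open "Label:" so the model completes it.
    lines.append(f"Text: {query}")
    lines.append("Label:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    task="Classify the intent of each text.",
    labels=["PlayMusic", "SetTimer"],
    examples=[
        ("play some jazz", "PlayMusic"),
        ("set a timer for ten minutes", "SetTimer"),
    ],
    query="wake me in twenty minutes",
)
print(prompt)
```

Passing an empty `examples` list turns the same function into a zero-shot prompt, which makes it easy to compare the two settings on the same evaluation data.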

The cost picture matters. A fine-tuned encoder might run for fractions of a cent per call at sustained throughput on a single GPU. A frontier LLM API call might cost 10 to 100 times more, with higher latency and less predictable behavior. The 2026 production pattern that has settled in most teams looks roughly like this: use prompted LLMs for prototyping, low-volume tasks, and complex reasoning; distill the LLM's behavior into a fine-tuned encoder once volume is high enough to justify the engineering investment.

Retrieval-augmented systems

NLU systems that need to ground their answers in external knowledge often combine an encoder for retrieval with an LLM for synthesis. The encoder produces dense embeddings for documents and queries, a vector index returns the top candidates, and the LLM reads the retrieved passages to produce a final answer. This pattern, called retrieval-augmented generation, powers most enterprise search, document QA, and chatbot-over-documentation systems built since 2023. The retrieval encoder is almost always a BERT-family model fine-tuned with contrastive objectives.
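The retrieval step of this pattern reduces to a nearest-neighbor search over embeddings. The sketch below uses brute-force cosine similarity over three hand-written vectors; the document IDs and vectors are hypothetical, and a real system would obtain embeddings from an encoder and use an approximate vector index instead of a full sort.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, doc_vecs, k=2):
    """Return the IDs of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical document embeddings; real ones come from a trained encoder.
DOCS = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.0],
    "api_reference": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]  # stands in for the embedded user question
print(top_k(query, DOCS))
```

The passages behind the returned IDs would then be placed in the LLM's prompt for the synthesis step.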

How is NLU measured?

NLU progress has been measured against a series of public benchmarks, each one designed to stay ahead of the previous best system.

What is GLUE? (2018)

The General Language Understanding Evaluation benchmark, introduced by Wang, Singh, Michael, Hill, Levy, and Bowman in April 2018 (arXiv:1804.07461), bundled nine sentence-level and sentence-pair English tasks behind a single average score.^[19] The authors described it as "a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models."^[19] GLUE was the first widely adopted multi-task NLU leaderboard and made it easy to compare models across the field. (See GLUE benchmark for the dedicated article.)

Task	Type	Metric
CoLA	Linguistic acceptability	Matthews correlation
SST-2	Sentiment classification	Accuracy
MRPC	Paraphrase detection	F1 / accuracy
STS-B	Semantic textual similarity	Pearson / Spearman
QQP	Quora question paraphrase	F1 / accuracy
MNLI	Multi-genre NLI	Accuracy (matched + mismatched)
QNLI	Question NLI from SQuAD	Accuracy
RTE	Recognizing Textual Entailment	Accuracy
WNLI	Winograd NLI	Accuracy

BERT-Large hit a GLUE average of 80.5 in late 2018, jumping 7.7 points over the previous best.^[14] RoBERTa pushed it to 88.5 in 2019, already past the average human baseline of 87.1. By mid-2019 GLUE had little headroom left, and the field moved on to harder benchmarks.
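Most GLUE metrics in the table are plain accuracy or F1, but CoLA uses the Matthews correlation coefficient, which stays informative when the two classes are imbalanced. A minimal binary implementation is sketched below; returning 0.0 when the denominator is zero is a common convention adopted here as an assumption.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when any row or column of the confusion matrix is empty
    return (tp * tn - fp * fn) / denom


gold = [1, 1, 0, 0]
print(matthews_corrcoef(gold, [1, 0, 0, 0]))
print(matthews_corrcoef(gold, [1, 1, 1, 1]))
```

A model that predicts a single class for every input scores 0 here even if that class is the majority, which accuracy would not reveal.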

What is SuperGLUE? (2019)

SuperGLUE, introduced by Wang et al. in May 2019 (arXiv:1905.00537), was designed to be "stickier" than GLUE.^[20] The authors noted that performance on GLUE "has recently surpassed the level of non-expert humans, suggesting limited headroom for further research," and responded with "a new set of more difficult language understanding tasks."^[20] SuperGLUE dropped the easier GLUE tasks, kept the two hardest (RTE, and the Winograd task behind WNLI, recast as WSC), and added six new tasks that required more reasoning, coreference, or commonsense knowledge, for eight tasks in total. (See SuperGLUE for the dedicated article.)

Task	Type	Metric
BoolQ	Yes/no question answering	Accuracy
CB	CommitmentBank NLI	F1 / accuracy
COPA	Choice of Plausible Alternatives	Accuracy
MultiRC	Multi-sentence reading comprehension	F1 / EM
ReCoRD	Reading comprehension with commonsense	F1 / EM
RTE	Recognizing Textual Entailment	Accuracy
WiC	Word in Context (sense disambiguation)	Accuracy
WSC	Winograd Schema Challenge	Accuracy

The human baseline on SuperGLUE was 89.8. On January 6, 2021, Microsoft's DeBERTa with 1.5 billion parameters became the first single model to surpass it, scoring 89.9 on the test set.^[24] An ensemble DeBERTa configuration reached 90.3 shortly after, and Google's T5-based system reached 90.2. By 2022, multiple models had cleared 90, and the benchmark was effectively saturated.

What is SQuAD?

The Stanford Question Answering Dataset (SQuAD), released by Rajpurkar, Zhang, Lopyrev, and Liang at EMNLP 2016, was the dominant reading-comprehension benchmark of the late 2010s.^[21] SQuAD 1.1 contained over 100,000 question-answer pairs over 500 Wikipedia articles, with each answer being a contiguous span of the source passage. Two metrics were reported: Exact Match (EM), which checks whether the normalized prediction equals a reference answer, and F1, which measures token overlap between them. The original logistic regression baseline scored 51 F1 and the human baseline in the dataset paper was around 86.8 F1; a BERT-Large ensemble reached 93.2 F1 in 2018, well above that human number.^[14]
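The two SQuAD metrics are simple to compute. The sketch below follows the normalization used by the official evaluation script (lowercase, strip punctuation, drop English articles, collapse whitespace) but compares against a single reference answer, whereas the official script takes the maximum over all references; treat it as a simplified illustration.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower."))
print(f1_score("in Paris France", "Paris"))
```

F1 gives partial credit when a predicted span contains the answer plus extra words, which is why it is always at least as high as EM for the same prediction.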

SQuAD 2.0, released in 2018, added 50,000 unanswerable questions written adversarially to look answerable, forcing models to decide when to abstain.^[22] SQuAD 2.0 also fell to BERT and its successors within months, but it remains a useful diagnostic for extractive QA.

NLI corpora

Natural language inference is one of the most heavily used NLU benchmarks because it captures sentence-level entailment in a clean three-class format. Three corpora dominate:

Corpus	Year	Size	Genres
SNLI (Stanford NLI)	2015 (Bowman et al., EMNLP)	570,000 sentence pairs	Image captions
MultiNLI (MNLI)	2018 (Williams, Nangia, Bowman, NAACL)	433,000 sentence pairs	Ten genres of spoken and written English
ANLI (Adversarial NLI)	2020 (Nie et al.)	169,000 sentence pairs	Adversarially collected against fine-tuned models

MNLI is the workhorse: it is large enough to fine-tune big encoders, covers diverse domains, and is included in GLUE.^[11] ANLI was created in three rounds by humans actively trying to fool BERT and RoBERTa models, and remains harder than MNLI for current models.^[23]

Other benchmarks

Several other suites have become important for specific subareas of NLU:

Benchmark	Year	Focus
CoNLL-2003	2003	English and German NER, four entity types
OntoNotes 5.0	2013	Multi-layer annotation, NER, coreference, SRL
Natural Questions	2019	Open-domain QA from real Google queries
HotpotQA	2018	Multi-hop QA over Wikipedia
DROP	2019	Discrete reasoning over paragraphs
RACE	2017	Reading comprehension from English exams for Chinese middle and high school students
TriviaQA	2017	Open-domain QA from trivia
MASSIVE	2022	51-language intent classification and slot filling
MMLU	2020	57-subject multiple-choice exam, often used to measure LLM NLU
BIG-Bench	2022	200+ task collaborative benchmark for LLMs

By 2024 most academic interest had shifted toward LLM-era evaluations like MMLU, BIG-Bench Hard, HELM, MMMU, and the various agent benchmarks, all of which include NLU subtasks among broader reasoning tests.

Where is NLU used in production?

NLU technology has moved from research labs into production at almost every consumer software company. The dominant deployments are voice assistants and chatbot platforms, where NLU sits between the user's words and the action the application takes.

System	Owner	Released	NLU role	Notes
Alexa	Amazon	2014	Wake-word, ASR, intent classification, slot filling	Powers Echo devices; its NLU technology also underlies Amazon Lex
Google Assistant	Google	2016	Query understanding, dialog management	Uses BERT-family encoders for query rewriting and intent detection
Siri	Apple	2011	On-device intent and entity recognition	Heavily on-device since iOS 15 (2021)
Cortana	Microsoft	2014	Productivity-focused intent and entity NLU	Repositioned around Microsoft 365 in 2020; standalone Windows app retired in 2023
Dialogflow	Google Cloud	2017 (rebrand of API.ai, acquired 2016)	Hosted NLU for intents, entities, and contexts	Dialogflow CX adds state machines
Amazon Lex	AWS	2017	Hosted ASR + NLU using Alexa technology	Lex V2 launched 2020, generative AI features added in 2023
Wit.ai	Meta	2013 (acquired by Facebook 2015)	Hosted NLU for intents and entities	Free for developers, used inside Facebook Messenger
Microsoft LUIS	Microsoft	2016	Language Understanding Intelligent Service	Superseded by Azure AI Language Conversational Language Understanding (CLU) in 2023
IBM Watson Assistant	IBM	2017	Enterprise hosted NLU and dialog	Built on Watson NLU APIs
Rasa	Rasa Technologies	2016 (Rasa NLU open source)	Self-hosted intent + entity pipeline (DIET classifier)	Popular for privacy-sensitive deployments
Snips	Snips (acquired by Sonos 2019)	2016	On-device NLU with the SNIPS NLU library	NLU library open-sourced in 2018, before the acquisition
Botpress	Botpress	2017	Open-source conversational NLU	Now LLM-augmented

Voice assistants

Alexa, Google Assistant, and Siri share a common high-level pipeline. A wake-word detector decides whether the user is talking to the device. An automatic speech recognition (ASR) module converts audio to text. An NLU module classifies the intent (Play music, Set timer, Get weather), extracts the slot values (artist name, duration, location), and routes the request to a domain-specific skill or handler. A natural language generation step formats the textual response, and a text-to-speech engine speaks it back.

The NLU components in modern voice assistants are typically BERT-family encoders or smaller transformer models fine-tuned on millions of labeled utterances. They run with strict latency budgets, often under 100 milliseconds, which rules out direct LLM inference for the hot path. Amazon Lex exposes the same underlying NLU technology that powers Alexa to AWS developers, who use it to build their own conversational interfaces over the same intent-and-slot model.^[27]

Chatbot platforms

Dialogflow, Lex, LUIS (now Azure AI CLU), Wit.ai, IBM Watson Assistant, and Rasa all expose NLU as a hosted or self-hosted service. The interface is similar across vendors:

The developer defines a set of intents (Order pizza, Check order status) with example utterances for each.
The developer defines entities (PizzaSize, ToppingName) and either hand-lists their values or attaches a regex or system entity type.
The platform trains an internal classifier and tagger on the supplied data.
At runtime, the developer's app sends user text to the platform, which returns a structured payload containing the predicted intent, confidence score, and slot values.
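The four steps above can be mimicked end to end with a toy stand-in for a hosted platform. The sketch below "trains" nothing: it scores the input against the developer's example utterances by word overlap and looks up entity values from hand-listed sets. The intent names, entity lists, and payload keys are hypothetical and do not reproduce any vendor's API.

```python
# Hypothetical developer-supplied intents with example utterances.
INTENT_EXAMPLES = {
    "OrderPizza": ["i want to order a pizza", "can i get a large pizza"],
    "CheckOrderStatus": ["where is my order", "check my order status"],
}
# Hypothetical entities with hand-listed values.
ENTITY_VALUES = {
    "PizzaSize": ["small", "medium", "large"],
    "ToppingName": ["mushroom", "pepperoni", "olive"],
}


def jaccard(a, b):
    """Word-set overlap between two token lists, from 0.0 to 1.0."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def understand(text):
    """Return an intent, a confidence score, and any matched slot values."""
    tokens = text.lower().split()
    best_intent, best_score = None, 0.0
    for intent, examples in INTENT_EXAMPLES.items():
        for example in examples:
            score = jaccard(tokens, example.split())
            if score > best_score:
                best_intent, best_score = intent, score
    slots = {
        entity: value
        for entity, values in ENTITY_VALUES.items()
        for value in values
        if value in tokens
    }
    return {"intent": best_intent, "confidence": round(best_score, 2), "slots": slots}


print(understand("can i get a large pepperoni pizza"))
```

Real platforms replace the overlap score with a trained classifier, but the shape of the runtime payload (intent, confidence, slots) is the part application code depends on.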

Dialogflow's CX product adds explicit state machines on top, so that the platform can drive multi-turn conversations with prompts and validation.^[26] Rasa's open-source DIET (Dual Intent and Entity Transformer) classifier is the most popular self-hosted alternative; it lets companies keep training data on premises rather than sending it to a third-party cloud.^[25] Since 2023, every major platform has added LLM-backed features: Lex's generative slot fulfillment, Dialogflow's generative agents, Rasa Pro's CALM (Conversational AI with Language Models) framework, and Watson Assistant's generative AI add-ons all let teams mix classic intent-based NLU with LLM-driven open conversation.

Industry-specific NLU

Domain-specific NLU systems are common in healthcare, finance, legal, and customer support. Healthcare deployments use BioBERT, ClinicalBERT, or commercial systems like Amazon Comprehend Medical to extract drugs, dosages, ICD codes, and adverse events from clinical notes. Financial services use FinBERT-family models to classify earnings call sentiment, extract entities from filings, and detect risk language. Legal deployments use LegalBERT-family models for contract clause classification and discovery. All of these continue to rely on fine-tuned encoders rather than prompted LLMs because the data is sensitive, the volumes are high, and the structured output requirements are strict.

How have LLMs changed NLU?

The biggest shift in NLU since 2023 has been the rise of large language models as general-purpose NLU engines. Decoder-only transformers like GPT-4, Claude 3.5 Sonnet, and Gemini 2.5 can perform almost any NLU task by reading an instruction and a few examples. This capability has changed how teams build, evaluate, and ship NLU products.

The practical effects:

Cold start. New tasks no longer need a labeled dataset to get a working baseline. A product team can write a prompt, get a 70-90% accurate classifier in an afternoon, and decide whether the task is worth more investment.
Long-tail coverage. LLMs handle rare intents, novel phrasings, and code-switched input that would have required custom data collection in the encoder era.
Reasoning and multi-hop QA. Prompted LLMs are usually better than encoders on tasks that require chained inference, world knowledge, or numerical reasoning.
Multilingual transfer. Frontier LLMs handle dozens of languages without language-specific fine-tuning, although accuracy still varies.
Cost and latency. LLM inference is slow and expensive compared to encoders. Production systems often use LLMs for low-volume or complex queries and route high-volume traffic to a smaller distilled model.

LLM-based NLU has its own failure modes. Models hallucinate plausible-sounding but wrong entities, misclassify because of subtle wording in the prompt, and behave inconsistently across versions. They are also harder to audit, since the same input can produce different outputs at different temperatures or after a model update. As a result, regulated industries continue to prefer fine-tuned encoders or hybrid systems where an LLM proposes an answer and a verifier checks it.
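The hybrid pattern, where an LLM proposes and a verifier checks, can be as simple as validating the proposal against a fixed schema and against the input text. The sketch below assumes the LLM's output has already been parsed into a dictionary; the intent schema and the substring check for slot values are illustrative choices, not an established standard.

```python
# Hypothetical schema: the intents the backend supports and their allowed slots.
ALLOWED_INTENTS = {
    "BookFlight": {"origin", "destination", "date"},
    "CancelBooking": {"booking_id"},
}


def verify(proposal, source_text):
    """Return a list of problems; an empty list means the proposal passes."""
    intent = proposal.get("intent")
    if intent not in ALLOWED_INTENTS:
        return [f"unknown intent: {intent}"]
    problems = []
    for slot, value in proposal.get("slots", {}).items():
        if slot not in ALLOWED_INTENTS[intent]:
            problems.append(f"unexpected slot: {slot}")
        elif str(value).lower() not in source_text.lower():
            # Guards against entities that never appeared in the user's text.
            problems.append(f"slot value not found in input: {value}")
    return problems


text = "book a flight from Boston to Denver"
print(verify({"intent": "BookFlight", "slots": {"origin": "Boston"}}, text))
print(verify({"intent": "BookFlight", "slots": {"origin": "Chicago"}}, text))
print(verify({"intent": "UpgradeSeat", "slots": {}}, text))
```

A proposal that fails verification can be rejected, retried, or escalated to a human, which gives auditors a deterministic checkpoint even though the proposing model is not deterministic.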

A growing pattern in 2025 and 2026 is to use LLMs as data labelers for fine-tuning smaller encoders. The LLM produces synthetic intents, slot annotations, or paraphrases, and a BERT or DeBERTa model is fine-tuned on the resulting dataset. The final encoder is faster, cheaper, and more predictable than the LLM, but inherits much of its NLU quality. This LLM-as-teacher distillation has become a standard playbook for production teams that need both quality and throughput.
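
The shape of this teacher-student loop can be shown with a toy example. To stay self-contained, the sketch swaps in a small multinomial Naive Bayes model for the student instead of fine-tuning BERT or DeBERTa, and `teacher_label` is a hypothetical keyword rule standing in for a prompted LLM. Only the data flow (unlabeled text, teacher labels, student training) reflects the pattern described above.

```python
import math
from collections import Counter, defaultdict

def teacher_label(text: str) -> str:
    """Hypothetical stand-in for a prompted LLM that labels one utterance."""
    weather_words = ("rain", "sunny", "forecast")
    return "weather" if any(w in text.lower() for w in weather_words) else "music"

def train_student(pairs):
    """Fit a multinomial Naive Bayes student on (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in pairs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest add-one-smoothed log probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            if w in vocab:  # ignore words the student never saw
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

unlabeled = [
    "will it rain tomorrow", "is it sunny in paris", "show me the forecast",
    "play some jazz", "skip this song", "play the next track",
]
silver = [(t, teacher_label(t)) for t in unlabeled]  # teacher-generated labels
student = train_student(silver)
```

Once trained, the student answers with `predict(student, "play a song")` and never calls the teacher again, which is where the speed and cost savings come from. Any labeling mistakes the teacher makes are inherited by the student, so a held-out set with human labels is still needed for evaluation.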

What are the open problems in NLU?

Several core NLU problems remain open even in the LLM era.

Ambiguity. Human language is genuinely ambiguous, and resolving meaning often requires context the model does not have. The same query ("Book me a table") can mean a restaurant reservation, a software booking system, or a literal request to construct a table. Production systems handle this with conversational context and user profiles, but residual errors remain.

Long-tail intents. Real-world traffic has a long tail of rare phrasings and edge cases. Encoder-based intent classifiers can miss these because they were absent from the training data, while LLM-based systems can hallucinate intents that do not exist in the underlying API. Both failure modes require careful evaluation against production data, not just public benchmarks.

Multilingual coverage. English NLU is much better than NLU in lower-resource languages. Multilingual encoders like XLM-R and mDeBERTa-v3 narrow the gap, and frontier LLMs are surprisingly capable in many languages, but accuracy still drops on languages with limited training data and on dialects that are underrepresented online.

Domain shift. A model trained on news text often degrades on tweets, on legal text, on clinical notes, or on chat logs. Domain adaptation, continued pre-training, or in-domain fine-tuning is usually needed for production accuracy.

World knowledge and reasoning. Pure NLU cannot answer factual questions on its own; the model needs either a knowledge base, retrieval over documents, or training data that contains the fact. Even LLMs make confident factual errors, which is why retrieval-augmented setups have become the default for knowledge-intensive QA.

Bias and fairness. Pre-trained encoders and LLMs encode biases from their training data: gender, race, religion, and disability associations have all been documented in BERT-family models and modern LLMs. NLU systems deployed in high-stakes settings (hiring, healthcare, criminal justice) need explicit bias evaluation and mitigation.

Evaluation drift. Standard NLU benchmarks like GLUE, SuperGLUE, and SQuAD are saturated, and newer evaluations like MMLU, BIG-Bench Hard, and HELM are losing headroom as frontier LLMs improve. The field is still working out what the next round of meaningful evaluations should look like, especially for generative and agentic systems where there is no single correct answer.

Grounding and embodiment. Symbolic NLU systems like SHRDLU were grounded in a simulated world, so the meaning of "red block" was anchored to a specific object. Modern LLMs lack this grounding by default, which is one reason multimodal models that process images, video, audio, and text together are an active area of NLU research in 2026.

Robustness. Adversarial inputs (typos, paraphrases, noise) can flip predictions, especially for narrow encoder models trained on a single style. Adversarial datasets like ANLI and behavioral testing frameworks like CheckList were designed to expose this fragility.^[23]
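
A simple invariance check in the spirit of behavioral testing can be sketched as follows. It is not the CheckList library's API: `swap_typo`, `toy_classifier`, and `invariance_failures` are hypothetical helpers, and the classifier is a deliberately brittle keyword matcher chosen to make the failure visible.

```python
from typing import Callable, List

def swap_typo(text: str, i: int) -> str:
    """Swap the characters at positions i and i+1 to simulate a typo."""
    if i < 0 or i + 1 >= len(text):
        return text
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def toy_classifier(text: str) -> str:
    """Hypothetical brittle model: exact keyword match only."""
    return "refund" if "refund" in text.lower() else "other"

def invariance_failures(predict: Callable[[str], str], text: str) -> List[str]:
    """Return perturbed inputs whose prediction differs from the original's."""
    expected = predict(text)
    variants = {swap_typo(text, i) for i in range(len(text) - 1)}
    variants.discard(text)  # a swap of identical characters changes nothing
    return sorted(v for v in variants if predict(v) != expected)
```

Calling `invariance_failures(toy_classifier, "refund please")` returns every single-swap typo that changes the predicted label. The same harness could wrap a real model's prediction function to estimate how often small surface changes flip its output.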

References

Weizenbaum, J. (1966). "ELIZA: A Computer Program for the Study of Natural Language Communication Between Man and Machine." Communications of the ACM, 9(1), 36-45. https://dl.acm.org/doi/10.1145/365153.365168
Winograd, T. (1972). Understanding Natural Language. Academic Press, New York. https://archive.org/details/understandingnat0000wino
Hobbs, J.R., Stickel, M.E., Appelt, D.E., & Martin, P. (1993). "Interpretation as Abduction." Artificial Intelligence, 63(1-2), 69-142.
Lenat, D.B. (1995). "CYC: A Large-Scale Investment in Knowledge Infrastructure." Communications of the ACM, 38(11), 33-38. https://dl.acm.org/doi/10.1145/219717.219745
Lafferty, J., McCallum, A., & Pereira, F. (2001). "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data." Proceedings of ICML 2001, pp. 282-289.
Marcus, M.P., Marcinkiewicz, M.A., & Santorini, B. (1993). "Building a Large Annotated Corpus of English: The Penn Treebank." Computational Linguistics, 19(2), 313-330.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781. https://arxiv.org/abs/1301.3781
Pennington, J., Socher, R., & Manning, C.D. (2014). "GloVe: Global Vectors for Word Representation." Proceedings of EMNLP 2014, pp. 1532-1543.
Hochreiter, S., & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735-1780.
Bowman, S.R., Angeli, G., Potts, C., & Manning, C.D. (2015). "A Large Annotated Corpus for Learning Natural Language Inference." Proceedings of EMNLP 2015. https://nlp.stanford.edu/projects/snli/
Williams, A., Nangia, N., & Bowman, S.R. (2018). "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference." Proceedings of NAACL-HLT 2018. https://arxiv.org/abs/1704.05426
Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). "Deep Contextualized Word Representations." Proceedings of NAACL-HLT 2018 (ELMo). https://arxiv.org/abs/1802.05365
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." Advances in NeurIPS 30. https://arxiv.org/abs/1706.03762
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT 2019. https://arxiv.org/abs/1810.04805
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv:1907.11692. https://arxiv.org/abs/1907.11692
He, P., Liu, X., Gao, J., & Chen, W. (2021). "DeBERTa: Decoding-enhanced BERT with Disentangled Attention." Proceedings of ICLR 2021. https://arxiv.org/abs/2006.03654
He, P., Gao, J., & Chen, W. (2023). "DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing." Proceedings of ICLR 2023. https://arxiv.org/abs/2111.09543
Warner, B., Chaffin, A., Clavié, B., Cooper, O., Adams, G., Roman, R., et al. (2024). "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." arXiv:2412.13663. https://arxiv.org/abs/2412.13663
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S.R. (2019). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding." Proceedings of ICLR 2019. https://arxiv.org/abs/1804.07461
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S.R. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." Advances in NeurIPS 32. https://arxiv.org/abs/1905.00537
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). "SQuAD: 100,000+ Questions for Machine Comprehension of Text." Proceedings of EMNLP 2016. https://arxiv.org/abs/1606.05250
Rajpurkar, P., Jia, R., & Liang, P. (2018). "Know What You Don't Know: Unanswerable Questions for SQuAD." Proceedings of ACL 2018 (SQuAD 2.0). https://arxiv.org/abs/1806.03822
Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., & Kiela, D. (2020). "Adversarial NLI: A New Benchmark for Natural Language Understanding." Proceedings of ACL 2020. https://arxiv.org/abs/1910.14599
Microsoft Research. (2021). "Microsoft DeBERTa Surpasses Human Performance on the SuperGLUE Benchmark." https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/
Bunk, T., Varshneya, D., Vlasov, V., & Nichol, A. (2020). "DIET: Lightweight Language Understanding for Dialogue Systems." arXiv:2004.09936 (Rasa DIET classifier). https://arxiv.org/abs/2004.09936
Google Cloud. (2024). "Dialogflow CX Documentation." https://cloud.google.com/dialogflow/docs
Amazon Web Services. (2024). "Amazon Lex Developer Guide." https://docs.aws.amazon.com/lex/latest/dg/what-is.html
Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. (2020). "Language Models are Few-Shot Learners." Advances in NeurIPS 33 (GPT-3). https://arxiv.org/abs/2005.14165
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P.J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research, 21, 1-67 (T5). https://arxiv.org/abs/1910.10683
Tjong Kim Sang, E.F., & De Meulder, F. (2003). "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition." Proceedings of CoNLL-2003.
