NLU
Last reviewed
Apr 30, 2026
Sources
30 citations
Review status
Source-backed
Revision
v5 · 6,081 words
Natural Language Understanding (NLU) is the subfield of artificial intelligence and natural language processing that focuses on machine reading comprehension: turning unstructured human language into structured representations that a computer can act on. NLU spans tasks such as intent classification, named entity recognition, semantic parsing, question answering, natural language inference, sentiment analysis, coreference resolution, and semantic role labeling. The field has moved through three distinct eras of method, from hand-written symbolic rules in the 1960s and 1970s, to statistical models trained on labeled corpora during the 1990s and 2000s, to deep neural transformer architectures and large language models since the late 2010s.
NLU is sometimes treated as a synonym for NLP, but the two are not identical. NLP is the broader umbrella that includes NLU plus tasks that produce or transform language, such as machine translation, speech synthesis, summarization, and natural language generation (NLG). NLU concerns the comprehension half of that pipeline: parsing meaning from input text or speech. NLG, by contrast, takes structured input and renders it as fluent language. Modern systems, especially large language models, tend to fold both directions into a single model, but the distinction still organizes how researchers think about evaluation, datasets, and product features.
NLU is the technology that lets a search engine understand what a user really meant by a sloppy query, lets a voice assistant turn the words "set a timer for ten minutes" into a function call with a duration argument, lets a customer-support bot route an angry email to a human agent, and lets a medical record system identify drug names and dosages buried in clinical notes. Most production NLU in 2026 still runs on fine-tuned BERT-family encoders for classification and tagging, with prompted LLMs layered on top for harder reasoning and zero-shot tasks.
The relationship between NLP, NLU, and NLG is a frequent source of confusion. The simplest framing is hierarchical: NLP is the parent field, and NLU and NLG are two of its main child areas.
| Term | Direction | Typical input | Typical output | Example tasks |
|---|---|---|---|---|
| NLP | Both | Text or speech | Text, speech, or structured data | Translation, summarization, classification, generation |
| NLU | Comprehension | Text or speech | Structured meaning representation | Intent detection, entity extraction, parsing, question answering, sentiment, NLI |
| NLG | Production | Structured data or representation | Fluent text or speech | Report generation, dialogue responses, captioning, summarization |
In the symbolic era these were sometimes implemented as completely separate modules with a clean interface. A pipeline might use an NLU front end to convert a user utterance into a logical form, run inference over a knowledge base, then hand the result to an NLG back end that produced an English answer. Modern transformer-based systems usually blur the boundary because the same network learns to do both, but the conceptual split still shows up in product organizations, vendor categories, and benchmark design. Industry analysts and cloud platforms continue to label products as "NLU services" when they expose intent and entity APIs, and as "NLG services" when they expose templated or generative text output.
The roots of NLU lie in the 1950s and 1960s, when researchers tried to teach computers to handle human language using hand-written grammars and dictionaries. The earliest widely cited program was ELIZA, a pattern-matching chatbot written by Joseph Weizenbaum at MIT and described in a January 1966 paper in Communications of the ACM. ELIZA did not understand language in any deep sense. It used decomposition rules triggered by keywords in user input, then assembled responses from canned reassembly templates. The most famous script, DOCTOR, simulated a Rogerian psychotherapist by reflecting the user's words back as questions. Weizenbaum was startled when users, including his own secretary, attributed feelings and understanding to the program despite being told how it worked, an effect now called the ELIZA effect.
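The decomposition-and-reassembly mechanism can be illustrated with a toy sketch. The rules, templates, and the `respond` function below are invented for illustration and are not taken from Weizenbaum's DOCTOR script; they only show the general shape of keyword-triggered pattern matching.

```python
import re

# Hypothetical rules: (decomposition pattern, reassembly template).
RULES = [
    (re.compile(r".*\bi am (.*)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r".*\bmy (.*)", re.IGNORECASE), "Your {0}?"),
]

# Pronoun swaps so the reflected fragment reads naturally.
REFLECTIONS = {"my": "your", "me": "you", "i": "you", "am": "are"}


def reflect(fragment: str) -> str:
    """Swap first-person words for second-person ones."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(w, w) for w in words)


def respond(utterance: str) -> str:
    """Apply the first matching rule, or fall back to a canned prompt."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."


print(respond("I am unhappy with my job."))
# Why do you say you are unhappy with your job?
```

Nothing in this sketch models meaning: the reply is assembled entirely from the surface string, which is the point the ELIZA effect makes.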
The next major milestone was SHRDLU, a system built by Terry Winograd at MIT between 1968 and 1970 and described in his 1972 book Understanding Natural Language. SHRDLU lived in a simulated micro-world of colored blocks, pyramids, and a robot arm. A user could type English commands such as "Pick up a big red block" or ask questions such as "Is there anything which is bigger than every pyramid but is not as wide as the thing that supports it?" The program would parse the input, plan an action in the blocks world, and either execute it or reply in English. SHRDLU was implemented in Micro-Planner, a procedural knowledge representation language embedded in LISP, and ran on a DEC PDP-6 with a graphics terminal. Its tight coupling of grammar, semantics, planning, and discourse memory was an unusually integrated approach for its time, and the published demonstrations created a wave of optimism about general NLU that did not survive contact with broader, messier text.
Several other systems explored ambitious symbolic approaches in the same period. SRI International's TACITUS, developed by Jerry Hobbs and colleagues in the 1980s, used abductive reasoning to interpret expository text and was applied to extraction tasks for the U.S. Defense Advanced Research Projects Agency (DARPA) Message Understanding Conferences (MUC) that ran from 1987 through 1997. The MIT Wheels project and Roger Schank's group at Yale explored conceptual dependency theory and scripts as ways to model story understanding. The Cyc project, started by Doug Lenat in 1984 at the Microelectronics and Computer Technology Corporation (MCC) in Austin, Texas, took the most extreme stance: that human-level NLU required a hand-encoded common-sense knowledge base, and Lenat's team spent decades writing axioms to fill it. By 2017 the Cyc knowledge base contained roughly 24.5 million assertions over 1.5 million terms, but the project never produced the general reading machine that Lenat had originally projected.
The symbolic period showed both the appeal and the limits of rule-based NLU. Hand-written grammars and ontologies could give very crisp behavior in narrow domains but degraded sharply on novel input, scaled poorly to new topics, and required specialized expertise to maintain. By the late 1980s the field began shifting toward statistical methods that could be trained automatically from text.
The statistical era of NLU was driven by two factors: the public release of large annotated corpora, and the success of probabilistic models in adjacent fields like speech recognition. The Penn Treebank, released by the University of Pennsylvania in 1992, gave researchers a million-word corpus of Wall Street Journal text annotated with part-of-speech tags and syntactic parse trees. Other resources followed: PropBank for predicate-argument structure, FrameNet for semantic frames, OntoNotes for multi-layer linguistic annotation, and the CoNLL shared-task datasets for named entity recognition, chunking, and dependency parsing.
With data in hand, researchers turned to probabilistic graphical models. Hidden Markov Models (HMMs) became standard for sequence labeling tasks like part-of-speech tagging and named entity recognition. Conditional Random Fields (CRFs), introduced by Lafferty, McCallum, and Pereira in 2001, generalized HMMs to allow arbitrary overlapping features and quickly displaced HMMs for many sequence problems. Maximum-entropy classifiers handled sentence-level tasks like sentiment analysis. Statistical parsers, including the Collins parser (1999) and the Charniak parser, brought probabilistic parsing to broad-coverage English.
The statistical era also produced the first real progress on open-domain semantic tasks. The Penn Discourse Treebank tackled discourse relations. The FrameNet-based semantic role labeling work of Gildea and Jurafsky in 2002 turned predicate-argument structure into a learnable problem, and PropBank later supplied a larger training resource for the same task. The DARPA Message Understanding Conferences and later Automatic Content Extraction (ACE) programs made information extraction a measurable engineering discipline. Statistical NLU dominated through the 2000s and into the early 2010s, with support vector machines, logistic regression, and CRFs powering most production text classifiers and taggers.
The deep learning era of NLU started with word embeddings. Word2Vec, released by Tomas Mikolov and colleagues at Google in 2013, showed that simple neural networks could learn dense vector representations for words from raw text such that semantic relationships were preserved as linear directions in the embedding space. GloVe, from Stanford's Pennington, Socher, and Manning in 2014, achieved similar results with a different objective based on word co-occurrence counts. These embeddings became the standard input layer for almost every NLU model.
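The "linear directions" claim is easiest to see with vector arithmetic. The sketch below uses hand-made two-dimensional vectors, not learned embeddings; the values and the `analogy` helper are assumptions chosen so the geometry is visible.

```python
import math

# Hand-made 2-D vectors (NOT learned): axis 0 ~ "royalty", axis 1 ~ "gender".
VECTORS = {
    "king": (0.9, 0.8),
    "queen": (0.9, -0.8),
    "man": (0.1, 0.8),
    "woman": (0.1, -0.8),
    "apple": (-0.7, 0.0),
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    )
    candidates = (w for w in VECTORS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


print(analogy("man", "king", "woman"))  # queen
```

Real embeddings have hundreds of dimensions and the relationship holds only approximately, but the nearest-neighbor lookup works the same way.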
On top of word embeddings, researchers built recurrent neural networks (RNNs) for sequence modeling. The Long Short-Term Memory (LSTM), introduced by Hochreiter and Schmidhuber in 1997 and rediscovered by the deep learning community in the early 2010s, became dominant for sentence classification, sequence tagging, and machine translation. Bidirectional LSTMs (BiLSTMs), often combined with a CRF output layer, set state-of-the-art results on named entity recognition, semantic role labeling, and dependency parsing through 2017. The Stanford Natural Language Inference (SNLI) corpus, released by Bowman, Angeli, Potts, and Manning in 2015 with 570,000 human-written sentence pairs labeled entailment, contradiction, or neutral, became a popular benchmark for these BiLSTM models.
ELMo (Embeddings from Language Models), released by the Allen Institute for AI in 2018, was the bridge between fixed word embeddings and the modern era. ELMo trained a deep BiLSTM language model on a large corpus, then used the LSTM's internal states as contextualized word representations for downstream tasks. Adding ELMo to existing systems gave consistent gains on six benchmarks, and the paper popularized the term "contextualized word representations" that the field still uses.
The transformer architecture, introduced by Vaswani et al. at Google in their 2017 paper Attention Is All You Need, reorganized the field. Transformers replaced recurrence with self-attention, which let the model read the whole input in parallel and capture long-range dependencies more cleanly than RNNs.
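A minimal sketch of scaled dot-product self-attention may help make "read the whole input in parallel" concrete. To stay short it assumes a single head and uses the input vectors directly as queries, keys, and values; a real transformer first applies learned projection matrices, and the function names here are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections.

    x: list of token vectors (lists of floats of equal length).
    Returns (outputs, attention_weights).
    """
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Every token scores every other token, regardless of distance.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, x)) for i in range(d)])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` sums to 1, and no step depends on the previous token's result, which is why the computation parallelizes where an RNN cannot.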
BERT (Bidirectional Encoder Representations from Transformers), released by Google in October 2018, applied the transformer encoder to NLU and immediately set new state-of-the-art results on 11 NLP tasks. BERT used masked language modeling and next sentence prediction to pre-train a deep bidirectional encoder on 3.3 billion words of text, then fine-tuned the same network on small downstream datasets with a single linear classifier head per task. The pre-train-then-fine-tune paradigm displaced almost every prior approach to NLU within a year.
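The masked language modeling objective can be sketched as a data-corruption step. The version below follows the scheme described in the BERT paper (select about 15% of tokens; of those, replace 80% with a mask symbol, 10% with a random token, and leave 10% unchanged). The function name, the whitespace tokenization, and the tiny vocabulary are assumptions for illustration.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels).

    labels[i] holds the original token where the model must predict it,
    and None where no prediction is required.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # prediction target
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)  # usual case: hide the token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # random replacement
            else:
                corrupted.append(tok)  # keep, but still predict
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=["dog", "ran", "tree"], mask_prob=0.5)
```

Because the model sees the uncorrupted words on both sides of each masked position, it learns bidirectional context, which a left-to-right language model cannot do.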
A flood of BERT-family encoders followed. RoBERTa from Facebook AI in 2019 improved BERT by training longer with more data and dropping next sentence prediction. ALBERT from Google and the Toyota Technological Institute reduced parameters with cross-layer sharing. DistilBERT from Hugging Face used knowledge distillation to cut the model size by about 40% while retaining roughly 97% of BERT's language-understanding performance. ELECTRA from Stanford and Google replaced masked language modeling with a more sample-efficient replaced-token detection objective. DeBERTa from Microsoft Research added disentangled attention and was the first model to surpass the human baseline on the SuperGLUE benchmark in January 2021. ModernBERT, released by Answer.AI and LightOn in December 2024, applied six years of LLM-era architecture improvements to the encoder paradigm, including rotary positional embeddings, FlashAttention, and an 8,192-token native context.
The transformer era also produced the first generative models that could perform NLU as a side effect. GPT-2 and GPT-3 showed that decoder-only models trained on next-token prediction could perform reading comprehension, sentiment classification, and entailment without any task-specific fine-tuning, simply by reading a few examples in their context window. By 2023, GPT-4, Claude, Gemini, and other frontier LLMs could match or exceed fine-tuned encoders on many NLU benchmarks via prompting alone. The encoder-only paradigm did not disappear, but the relationship between NLU and NLG has become much tighter than it was a decade ago.
NLU is not a single task. It is a family of related problems, each with its own datasets, evaluation metrics, and dominant architectures. The most important categories:
| Task | Goal | Typical evaluation | Example dataset |
|---|---|---|---|
| Intent classification | Pick which of a fixed set of user intents an utterance expresses | Accuracy or F1 | SNIPS, ATIS, MASSIVE |
| Named entity recognition (NER) | Tag spans of text as person, organization, location, etc. | Span F1 | CoNLL-2003, OntoNotes 5.0 |
| Semantic parsing | Convert a sentence into a logical form, SQL query, or executable program | Exact match, execution accuracy | GeoQuery, Spider, ATIS |
| Question answering (QA) | Return an answer to a natural-language question | Exact match, F1 | SQuAD, Natural Questions, HotpotQA |
| Natural language inference (NLI) | Decide whether a hypothesis is entailed by, contradicts, or is neutral to a premise | Accuracy | SNLI, MNLI, ANLI |
| Sentiment analysis | Classify text by polarity or emotion | Accuracy or macro F1 | SST-2, IMDb, Yelp Review |
| Coreference resolution | Cluster mentions in a document that refer to the same entity | CoNLL F1 (MUC + B-cubed + CEAF) | OntoNotes, GAP |
| Semantic role labeling (SRL) | Identify predicate-argument structure (who did what to whom, where, when) | Span F1 | PropBank, FrameNet |
| Word sense disambiguation | Pick the correct sense of an ambiguous word in context | Accuracy | SemCor, WiC |
| Relation extraction | Identify relationships between entities in text | Precision, recall, F1 | TACRED, FewRel |
| Slot filling | Extract structured argument values for a known intent | Slot F1 | ATIS, SNIPS, MultiWOZ |
| Discourse parsing | Identify rhetorical or discourse relations between text segments | F1 | Penn Discourse Treebank |
| Textual entailment | Same as NLI; often used in search and verification pipelines | Accuracy | RTE, SciTail |
| Machine reading comprehension | Answer questions over a passage by extracting or generating spans | EM, F1 | SQuAD 1.1/2.0, RACE, DROP |
| Topic classification | Assign documents to broad topic labels | Accuracy | AG News, DBpedia |
Intent classification and slot filling are the bread and butter of voice assistants and chatbots, where the system needs to decide what the user wants and what arguments they supplied. NER and relation extraction power most information-extraction pipelines in healthcare, legal, and finance. NLI has become the workhorse benchmark for testing whether a model captures sentence-level semantics. Question answering and machine reading comprehension overlap heavily and are the closest things to a general-purpose NLU evaluation.
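Several rows in the table above are scored with span F1. A minimal sketch of that metric, assuming exact-match scoring where a predicted span counts only if its boundaries and label both agree with a gold span (the tuple layout and function name are illustrative):

```python
def span_f1(gold, predicted):
    """Micro-averaged span-level precision, recall, and F1.

    gold, predicted: iterables of (start, end, label) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {(0, 2, "PER"), (5, 6, "ORG"), (9, 10, "LOC")}
pred = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "LOC")}
precision, recall, f1 = span_f1(gold, pred)
# The mislabeled span (5, 6) counts as both a false positive and a false negative.
```

Exact-match scoring is strict: a span with the right boundaries but the wrong label earns no partial credit.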
For most production NLU work in 2026, the default architecture is still a fine-tuned BERT-family encoder. The recipe is well understood: take a pre-trained checkpoint such as bert-base-uncased, roberta-base, deberta-v3-base, or modernbert-base; add one task-specific output head (a linear layer for classification, a token-level head for tagging, span pointers for extractive QA); and fine-tune all parameters on the task dataset for one to four epochs.
Why encoders dominate production:
DeBERTa-v3, released by Microsoft in 2021 and accepted at ICLR 2023, was the strongest fine-tuning baseline for several years. ModernBERT replaced absolute position embeddings with rotary positional embeddings (RoPE), pushed the native context to 8,192 tokens, and trained on 2 trillion tokens, and has displaced DeBERTa-v3 in many new projects since its December 2024 release. Multilingual variants such as XLM-R and mDeBERTa-v3 fill the same role for cross-lingual NLU.
A second family treats every NLU task as text-to-text: the input is the natural-language question or instruction plus the source text, and the output is a string. T5 (Text-To-Text Transfer Transformer), released by Google in 2019, was the most influential example, and frames NER, classification, QA, and translation all as conditional generation. FLAN-T5 added instruction tuning on top. Sequence-to-sequence models are slower than encoders for fixed-format outputs but more flexible because the same model can produce any string.
Decoder-only large language models such as GPT-4, Claude, Gemini, and Llama can perform NLU through prompting, with no training data, by reading a few examples or just an instruction in their context window. This is called zero-shot or few-shot NLU.
Zero-shot NLU has reshaped how product teams think about new tasks. Before LLMs, adding a new intent to a chatbot required collecting hundreds of examples, training a model, and validating it. With a strong LLM, a product manager can write a one-paragraph description of the intent and have a working classifier in minutes. Few-shot prompting, where the prompt includes 5 to 50 labeled examples, often gets within a few points of fine-tuned baselines on standard benchmarks.
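A few-shot intent classifier of this kind is mostly prompt construction. The sketch below only assembles the prompt string; the intent names, descriptions, and layout are hypothetical, and sending the prompt to a model is left out because that depends on the provider's API.

```python
def build_few_shot_prompt(intents, examples, utterance):
    """Assemble an instruction, labeled examples, and the new utterance.

    intents: dict mapping intent name -> one-line description.
    examples: list of (utterance, intent) pairs.
    """
    lines = ["Classify the user utterance into exactly one intent.", "Intents:"]
    for name, description in intents.items():
        lines.append(f"- {name}: {description}")
    lines.append("")
    for text, label in examples:
        lines.append(f"Utterance: {text}")
        lines.append(f"Intent: {label}")
        lines.append("")
    # The prompt ends where the model is expected to continue.
    lines.append(f"Utterance: {utterance}")
    lines.append("Intent:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    intents={
        "set_timer": "start a countdown",
        "get_weather": "ask about the forecast",
    },
    examples=[
        ("remind me in ten minutes", "set_timer"),
        ("will it rain tomorrow", "get_weather"),
    ],
    utterance="count down five minutes",
)
```

Dropping the `examples` list turns the same template into a zero-shot prompt that relies on the intent descriptions alone.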
The cost picture matters. A fine-tuned encoder might run for fractions of a cent per call at sustained throughput on a single GPU. A frontier LLM API call might cost 10 to 100 times more, with higher latency and less predictable behavior. The 2026 production pattern that has settled in most teams looks roughly like this: use prompted LLMs for prototyping, low-volume tasks, and complex reasoning; distill the LLM's behavior into a fine-tuned encoder once volume is high enough to justify the engineering investment.
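The "once volume is high enough" condition is a break-even calculation. The sketch below assumes a one-time distillation cost and constant per-call prices; every dollar figure in the usage example is invented for illustration, not a quoted price.

```python
def break_even_calls(distillation_cost, llm_cost_per_call, encoder_cost_per_call):
    """Number of calls after which the distilled encoder has paid for itself."""
    saving_per_call = llm_cost_per_call - encoder_cost_per_call
    if saving_per_call <= 0:
        raise ValueError("the encoder must be cheaper per call than the LLM")
    return distillation_cost / saving_per_call


# Hypothetical figures: $20,000 of engineering and training,
# $0.01 per LLM call, $0.0002 per encoder call (a 50x gap).
calls = break_even_calls(20_000, 0.01, 0.0002)
print(f"break-even after about {calls:,.0f} calls")
```

With these assumed numbers the break-even point is roughly two million calls, so a task serving a few thousand requests a month would stay on the prompted LLM.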
NLU systems that need to ground their answers in external knowledge often combine an encoder for retrieval with an LLM for synthesis. The encoder produces dense embeddings for documents and queries, a vector index returns the top candidates, and the LLM reads the retrieved passages to produce a final answer. This pattern, called retrieval-augmented generation, powers most enterprise search, document QA, and chatbot-over-documentation systems built since 2023. The retrieval encoder is almost always a BERT-family model fine-tuned with contrastive objectives.
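The retrieval half of this pattern reduces to nearest-neighbor search over embeddings. The sketch below uses hand-written three-dimensional vectors in place of encoder outputs and a brute-force scan in place of a vector index; the document IDs and the `top_k` helper are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / norm


def top_k(query_vec, doc_vecs, k=2):
    """Return the IDs of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Toy stand-ins for encoder embeddings.
docs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "careers": [0.0, 0.0, 1.0],
}
query = [0.8, 0.3, 0.0]  # e.g. "how do I get my money back"
print(top_k(query, docs))  # ['refund-policy', 'shipping-times']
```

In a full system the returned passages would be pasted into the LLM's prompt; production deployments replace the linear scan with an approximate nearest-neighbor index.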
NLU progress has been measured against a series of public benchmarks, each one designed to stay ahead of the previous best system.
The General Language Understanding Evaluation benchmark, introduced by Wang, Singh, Michael, Hill, Levy, and Bowman in April 2018 (arXiv:1804.07461), bundled nine sentence-level and sentence-pair English tasks behind a single average score. GLUE was the first widely adopted multi-task NLU leaderboard and made it easy to compare models across the field.
| Task | Type | Metric |
|---|---|---|
| CoLA | Linguistic acceptability | Matthews correlation |
| SST-2 | Sentiment classification | Accuracy |
| MRPC | Paraphrase detection | F1 / accuracy |
| STS-B | Semantic textual similarity | Pearson / Spearman |
| QQP | Quora question paraphrase | F1 / accuracy |
| MNLI | Multi-genre NLI | Accuracy (matched + mismatched) |
| QNLI | Question NLI from SQuAD | Accuracy |
| RTE | Recognizing Textual Entailment | Accuracy |
| WNLI | Winograd NLI | Accuracy |
BERT-Large hit a GLUE average of 80.5 in late 2018, jumping 7.7 points over the previous best. RoBERTa pushed it to 88.5 in 2019 and was already past the average human baseline of 87.1. By mid-2019 the field had to retire GLUE because the headroom was gone.
SuperGLUE, introduced by Wang et al. in May 2019 (arXiv:1905.00537), was designed to be "stickier" than GLUE. It dropped the easier GLUE tasks, kept the two hardest (RTE and WNLI in modified form), and added six new tasks that required more reasoning, coreference, or commonsense knowledge.
| Task | Type | Metric |
|---|---|---|
| BoolQ | Yes/no question answering | Accuracy |
| CB | CommitmentBank NLI | F1 / accuracy |
| COPA | Choice of Plausible Alternatives | Accuracy |
| MultiRC | Multi-sentence reading comprehension | F1 / EM |
| ReCoRD | Reading comprehension with commonsense | F1 / EM |
| RTE | Recognizing Textual Entailment | Accuracy |
| WiC | Word in Context (sense disambiguation) | Accuracy |
| WSC | Winograd Schema Challenge | Accuracy |
The human baseline on SuperGLUE was 89.8. In January 2021, Microsoft's DeBERTa with 1.5 billion parameters became the first single model to surpass it, scoring 89.9 on the test set. An ensemble DeBERTa configuration reached 90.3 shortly after, and Google's T5-based system reached 90.2. By 2022, multiple models had cleared 90, and the benchmark was effectively saturated.
The Stanford Question Answering Dataset (SQuAD), released by Rajpurkar, Zhang, Lopyrev, and Liang at EMNLP 2016, was the dominant reading-comprehension benchmark of the late 2010s. SQuAD 1.1 contained over 100,000 question-answer pairs over 500 Wikipedia articles, with each answer being a contiguous span of the source passage. Systems were scored on two metrics: Exact Match (EM), which requires the predicted span to equal a reference answer after normalization, and a token-overlap F1. The original logistic regression baseline scored 51 F1; the human baseline was around 86.8 F1; BERT-Large reached 93.2 F1 in 2018, the first model to clear the human number.
SQuAD 2.0, released in 2018, added 50,000 unanswerable questions written adversarially to look answerable, forcing models to decide when to abstain. SQuAD 2.0 also fell to BERT and its successors within months, but it remains a useful diagnostic for extractive QA.
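The EM and F1 scores used for SQuAD can be sketched in a few lines. The normalization below (lowercasing, stripping punctuation and English articles, collapsing whitespace) mirrors the approach of the official evaluation script, but this is a simplified re-implementation: it scores against a single reference answer, whereas the official script takes the maximum over several.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match (the SQuAD 2.0 "no answer" case).
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower."))  # 1.0
print(token_f1("in Paris, France", "Paris"))  # 0.5
```

F1 gives partial credit for a span that overlaps the reference, which is why reported F1 is always at least as high as EM.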
Natural language inference is one of the most heavily used NLU benchmarks because it captures sentence-level entailment in a clean three-class format. Three corpora dominate:
| Corpus | Year | Size | Genres |
|---|---|---|---|
| SNLI (Stanford NLI) | 2015 (Bowman et al., EMNLP) | 570,000 sentence pairs | Image captions |
| MultiNLI (MNLI) | 2018 (Williams, Nangia, Bowman, NAACL) | 433,000 sentence pairs | Ten genres of spoken and written English |
| ANLI (Adversarial NLI) | 2020 (Nie et al.) | 169,000 sentence pairs | Adversarially collected against fine-tuned models |
MNLI is the workhorse: it is large enough to fine-tune big encoders, covers diverse domains, and is included in GLUE. ANLI was created in three rounds by human annotators actively trying to fool BERT and RoBERTa models, and remains harder than MNLI for current models.
Several other suites have become important for specific subareas of NLU:
| Benchmark | Year | Focus |
|---|---|---|
| CoNLL-2003 | 2003 | English and German NER, four entity types |
| OntoNotes 5.0 | 2013 | Multi-layer annotation, NER, coreference, SRL |
| Natural Questions | 2019 | Open-domain QA from real Google queries |
| HotpotQA | 2018 | Multi-hop QA over Wikipedia |
| DROP | 2019 | Discrete reasoning over paragraphs |
| RACE | 2017 | Reading comprehension from English exams for Chinese middle and high school students |
| TriviaQA | 2017 | Open-domain QA from trivia |
| MASSIVE | 2022 | 51-language intent classification and slot filling |
| MMLU | 2020 | 57-subject multiple-choice exam, often used to measure LLM NLU |
| BIG-Bench | 2022 | 200+ task collaborative benchmark for LLMs |
By 2024 most academic interest had shifted toward LLM-era evaluations like MMLU, BIG-Bench Hard, HELM, MMMU, and the various agent benchmarks, all of which include NLU subtasks among broader reasoning tests.
NLU technology has moved from research labs into production at almost every consumer software company. The dominant deployments are voice assistants and chatbot platforms, where NLU sits between the user's words and the action the application takes.
| System | Owner | Released | NLU role | Notes |
|---|---|---|---|---|
| Alexa | Amazon | 2014 | Wake-word, ASR, intent classification, slot filling | Powers Echo devices; the same technology underlies Amazon Lex |
| Google Assistant | Google | 2016 | Query understanding, dialog management | Uses BERT-family encoders for query rewriting and intent detection |
| Siri | Apple | 2011 | On-device intent and entity recognition | Heavily on-device since iOS 15 (2021) |
| Cortana | Microsoft | 2014 | Productivity-focused intent and entity NLU | Refocused on Microsoft 365 productivity from 2020; standalone Windows app retired in 2023 |
| Dialogflow | Google Cloud | 2016 (acquired API.ai 2016) | Hosted NLU for intents, entities, and contexts | Dialogflow CX adds state machines |
| Amazon Lex | AWS | 2017 | Hosted ASR + NLU using Alexa technology | Lex V2 launched 2020, generative AI features added in 2023 |
| Wit.ai | Meta | 2013 (acquired by Facebook 2015) | Hosted NLU for intents and entities | Free for developers, used inside Facebook Messenger |
| Microsoft LUIS | Microsoft | 2016 | Language Understanding Intelligent Service | Superseded by Azure AI Language Conversational Language Understanding (CLU) in 2023 |
| IBM Watson Assistant | IBM | 2017 | Enterprise hosted NLU and dialog | Built on Watson NLU APIs |
| Rasa | Rasa Technologies | 2016 (Rasa NLU open source) | Self-hosted intent + entity pipeline (DIET classifier) | Popular for privacy-sensitive deployments |
| Snips | Snips (acquired by Sonos 2019) | 2016 | On-device NLU with the Snips NLU library | Snips NLU was open-sourced in 2018, before the acquisition |
| Botpress | Botpress | 2017 | Open-source conversational NLU | Now LLM-augmented |
Alexa, Google Assistant, and Siri share a common high-level pipeline. A wake-word detector decides whether the user is talking to the device. An automatic speech recognition (ASR) module converts audio to text. An NLU module classifies the intent (Play music, Set timer, Get weather), extracts the slot values (artist name, duration, location), and routes the request to a domain-specific skill or handler. A natural language generation step formats the textual response, and a text-to-speech engine speaks it back.
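The NLU step of this pipeline takes ASR text in and produces an intent plus slot values. The sketch below shows only that input/output contract, using a hand-written regular expression as a stand-in for the trained classifier and tagger a real assistant would use; the intent name, slot name, and word lists are hypothetical.

```python
import re

# Small illustrative lookups; a real system would cover far more.
NUMBER_WORDS = {"one": 1, "two": 2, "five": 5, "ten": 10, "thirty": 30}
UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}

TIMER_PATTERN = re.compile(
    r"set (?:a )?timer for (\w+) (second|minute|hour)s?", re.IGNORECASE
)


def parse_utterance(text):
    """Map ASR text to {'intent': ..., 'slots': {...}}."""
    match = TIMER_PATTERN.search(text)
    if not match:
        return {"intent": "Unknown", "slots": {}}
    amount_text, unit = match.group(1).lower(), match.group(2).lower()
    if amount_text.isdigit():
        amount = int(amount_text)
    else:
        amount = NUMBER_WORDS.get(amount_text)
    if amount is None:
        # Intent recognized, but the duration slot could not be filled.
        return {"intent": "SetTimer", "slots": {}}
    return {
        "intent": "SetTimer",
        "slots": {"duration_seconds": amount * UNIT_SECONDS[unit]},
    }


print(parse_utterance("set a timer for ten minutes"))
# {'intent': 'SetTimer', 'slots': {'duration_seconds': 600}}
```

The downstream handler only sees the structured result, so the regular expression can later be swapped for a fine-tuned encoder without changing the rest of the pipeline.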
The NLU components in modern voice assistants are typically BERT-family encoders or smaller transformer models fine-tuned on millions of labeled utterances. They run with strict latency budgets, often under 100 milliseconds, which rules out direct LLM inference for the hot path. Amazon Lex exposes the same underlying NLU technology that powers Alexa to AWS developers, who use it to build their own conversational interfaces over the same intent-and-slot model.
Dialogflow, Lex, LUIS (now Azure AI CLU), Wit.ai, IBM Watson Assistant, and Rasa all expose NLU as a hosted or self-hosted service. The interface is similar across vendors:
Dialogflow's CX product adds explicit state machines on top, so that the platform can drive multi-turn conversations with prompts and validation. Rasa's open-source DIET (Dual Intent and Entity Transformer) classifier is the most popular self-hosted alternative; it lets companies keep training data on premises rather than sending it to a third-party cloud. Since 2023, every major platform has added LLM-backed features: Lex's generative slot fulfillment, Dialogflow's generative agents, Rasa Pro's CALM (Conversational AI with Language Models) framework, and Watson Assistant's generative AI add-ons all let teams mix classic intent-based NLU with LLM-driven open conversation.
Domain-specific NLU systems are common in healthcare, finance, legal, and customer support. Healthcare deployments use BioBERT, ClinicalBERT, or commercial systems like Amazon Comprehend Medical to extract drugs, dosages, ICD codes, and adverse events from clinical notes. Financial services use FinBERT-family models to classify earnings call sentiment, extract entities from filings, and detect risk language. Legal deployments use LegalBERT-family models for contract clause classification and discovery. All of these continue to rely on fine-tuned encoders rather than prompted LLMs because the data is sensitive, the volumes are high, and the structured output requirements are strict.
The biggest shift in NLU since 2023 has been the rise of large language models as general-purpose NLU engines. Decoder-only transformers like GPT-4, Claude 3.5 Sonnet, and Gemini 2.5 can perform almost any NLU task by reading an instruction and a few examples. This capability has changed how teams build, evaluate, and ship NLU products.
The practical effects:
LLM-based NLU has its own failure modes. Models hallucinate plausible-sounding but wrong entities, misclassify because of subtle wording in the prompt, and behave inconsistently across versions. They are also harder to audit, since the same input can produce different outputs at different temperatures or after a model update. As a result, regulated industries continue to prefer fine-tuned encoders or hybrid systems where an LLM proposes an answer and a verifier checks it.
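One simple form of the propose-then-verify hybrid is a deterministic check on the LLM's structured output. The sketch below assumes the LLM returns a dictionary with an `intent` and an `entities` mapping (a hypothetical format), and rejects proposals that name an intent outside the allowed set or an entity value that does not appear in the source text.

```python
def verify_proposal(proposal, allowed_intents, source_text):
    """Return (accepted, problems) for an LLM-proposed parse."""
    problems = []
    if proposal.get("intent") not in allowed_intents:
        problems.append("unknown intent")
    lowered = source_text.lower()
    for name, value in proposal.get("entities", {}).items():
        # Extracted values must be literally present in the input.
        if str(value).lower() not in lowered:
            problems.append(f"entity {name!r} not found in source text")
    return (not problems), problems


allowed = {"refund_request", "track_order"}
text = "Please refund order A1234"

ok, problems = verify_proposal(
    {"intent": "refund_request", "entities": {"order_id": "A1234"}}, allowed, text
)
# ok is True, problems is []

ok, problems = verify_proposal(
    {"intent": "cancel_subscription", "entities": {"order_id": "B9999"}}, allowed, text
)
# ok is False; both the intent and the entity are rejected
```

A substring check only catches entities that were invented outright; it would not flag a real span assigned to the wrong slot, so stricter verifiers add type and format checks per field.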
A growing pattern in 2025 and 2026 is to use LLMs as data labelers for fine-tuning smaller encoders. The LLM produces synthetic intents, slot annotations, or paraphrases, and a BERT or DeBERTa model is fine-tuned on the resulting dataset. The final encoder is faster, cheaper, and more predictable than the LLM, but inherits much of its NLU quality. This LLM-as-teacher distillation has become a standard playbook for production teams that need both quality and throughput.
NLU has open problems even in the LLM era.
Ambiguity. Human language is genuinely ambiguous, and resolving meaning often requires context the model does not have. The same query ("Book me a table") can mean a restaurant reservation, a software booking system, or a literal request to construct a table. Production systems handle this with conversational context and user profiles, but residual errors remain.
Long-tail intents. Real-world traffic has a long tail of rare phrasings and edge cases. Encoder-based intent classifiers can miss these because they were not in training; LLM-based systems hallucinate intents that do not exist in the underlying API. Both failure modes require careful evaluation against production data, not just public benchmarks.
Multilingual coverage. English NLU is much better than NLU in lower-resource languages. Multilingual encoders like XLM-R and mDeBERTa-v3 narrow the gap, and frontier LLMs are surprisingly capable in many languages, but accuracy still drops on languages with limited training data and on dialects that are underrepresented online.
Domain shift. A model trained on news text often degrades on tweets, on legal text, on clinical notes, or on chat logs. Domain adaptation, continued pre-training, or in-domain fine-tuning is usually needed for production accuracy.
World knowledge and reasoning. Pure NLU cannot answer factual questions on its own; the model needs either a knowledge base, retrieval over documents, or training data that contains the fact. Even LLMs make confident factual errors, which is why retrieval-augmented setups have become the default for knowledge-intensive QA.
Bias and fairness. Pre-trained encoders and LLMs encode biases from their training data: gender, race, religion, and disability associations have all been documented in BERT-family models and modern LLMs. NLU systems deployed in high-stakes settings (hiring, healthcare, criminal justice) need explicit bias evaluation and mitigation.
Evaluation drift. Standard NLU benchmarks like GLUE, SuperGLUE, and SQuAD are saturated. New benchmarks like MMLU, BIG-Bench Hard, and HELM are also being saturated by frontier LLMs. The field is in the middle of figuring out what the next round of meaningful evaluations should look like, especially for generative and agentic systems where there is no single correct answer.
Grounding and embodiment. Symbolic NLU systems like SHRDLU were grounded in a simulated world, so the meaning of "red block" was anchored to a specific object. Modern LLMs lack this grounding by default, which is one reason multimodal models that process images, video, audio, and text together are an active area of NLU research in 2026.
Robustness. Adversarial inputs (typos, paraphrases, noise) can flip predictions, especially for narrow encoder models trained on a single style. Adversarial datasets like ANLI and behavioral test suites like CheckList were designed to expose this fragility.