Natural language processing

Updated Jun 20, 2026

Natural language processing (NLP) is the branch of artificial intelligence that enables computers to read, understand, generate, and respond to human language in both text and speech. It combines computer science, computational linguistics, and machine learning so that machines can process language in ways that are useful and meaningful. The field dates to the 1950s but was transformed in 2017 by the Transformer architecture, which underlies today's large language models such as GPT, Claude, and Gemini.

NLP spans tasks as varied as machine translation, sentiment analysis, question answering, text summarization, and dialogue systems. Over the past decade the field has consolidated dramatically: where researchers once built a separate task-specific model for each problem, a single pre-trained large language model can now handle dozens of tasks through prompting and fine-tuning.

What is NLP used for?

NLP is used to power search engines, machine translation services, voice assistants, chatbots, and the large language models behind tools like ChatGPT. Its core jobs fall into two groups: low-level building blocks (tokenization, part-of-speech tagging, named entity recognition, and parsing) and application-level tasks that serve end users (translation, summarization, question answering, sentiment analysis, and conversational dialogue). In production, NLP underpins Google Search query understanding, real-time translation in over 130 languages, clinical documentation tools, contract and legal-document review, content moderation at the scale of billions of posts per day, and financial analysis of news and earnings calls. The same Transformer-based models that read and classify text now also generate it, which is why modern chatbots can write, summarize, translate, and answer questions within a single system.

History

The history of NLP stretches back to the earliest days of computing. Its development can be roughly divided into several eras, each defined by the dominant paradigm of the time.

Early foundations (1950s-1960s)

Alan Turing's 1950 paper "Computing Machinery and Intelligence" posed the question "Can machines think?" and proposed what became known as the Turing test, a measure of a machine's ability to exhibit intelligent behavior indistinguishable from that of a human ^[1]. This paper laid conceptual groundwork for NLP, even though practical systems were still years away.

The Georgetown-IBM experiment of January 7, 1954 was one of the first public demonstrations of machine translation. Researchers at Georgetown University and IBM used an IBM 701 computer to automatically translate more than 60 Russian sentences into English ^[2]. The system relied on just six grammar rules and a vocabulary of 250 lexical items covering fields like politics, law, chemistry, and military affairs ^[2]. Despite its limited scope, the demonstration generated enormous public interest and raised expectations that fully automatic, high-quality translation would be achievable within a few years.

Those expectations proved premature. In 1966, the ALPAC report (from the Automatic Language Processing Advisory Committee, a panel of seven scientists led by John R. Pierce) concluded that machine translation was slower, less accurate, and more expensive than human translation ^[3]. The report's findings led to a dramatic reduction in MT research funding in the United States for roughly two decades, contributing to what some historians describe as the beginning of the first AI winter.

Also in the mid-1960s, Joseph Weizenbaum at MIT created ELIZA (1964-1966), one of the earliest programs to attempt natural language conversation ^[4]. ELIZA's most famous script, DOCTOR, simulated a Rogerian psychotherapist by using pattern matching and substitution rules to reflect a user's statements back as questions. Weizenbaum chose the psychotherapy framing specifically to sidestep the need for real-world knowledge. Despite its simplicity, ELIZA convinced some users it genuinely understood them. Weizenbaum reported in his 1966 paper that "some subjects have been very hard to convince that ELIZA (with its present script) is not human" ^[4], an early observation of what is now called the ELIZA effect.
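The pattern-matching-and-substitution idea can be illustrated with a very small sketch. The rules and the reflection table below are hypothetical examples written for this illustration; they are not taken from Weizenbaum's DOCTOR script, which was considerably richer.

```python
import re

# Hypothetical rules for illustration only; not the original DOCTOR script.
RULES = [
    (re.compile(r"\bi am (.*)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bi feel (.*)", re.IGNORECASE), "How long have you felt {0}?"),
    (re.compile(r"\bmy (.*)", re.IGNORECASE), "Tell me more about your {0}."),
]

# Swap first-person words for second-person ones so the echo reads naturally.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are"}


def reflect(fragment: str) -> str:
    """Rewrite a captured fragment from the user's point of view to the program's."""
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())


def respond(statement: str) -> str:
    """Return the response of the first matching rule, or a generic prompt."""
    text = statement.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(reflect(match.group(1)))
    return "Please go on."  # fallback when no rule matches


print(respond("I am worried about my exams."))
```

The program holds no model of exams or worry; it only rearranges the user's own words, which is why the psychotherapy framing suited the technique.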

Rule-based era (1960s-1980s)

From the 1960s through the 1980s, most NLP systems relied on hand-crafted rules. Linguists and computer scientists wrote detailed grammars and lexicons that specified how language should be parsed and interpreted. Systems like SHRDLU (1970), developed by Terry Winograd at MIT, could understand and respond to English commands within a constrained "blocks world" environment. SHRDLU could answer questions about the objects in its world, follow instructions to move them, and even explain its reasoning.

Other notable systems from this period include LUNAR (1972), which answered questions about lunar soil samples in natural English, and various expert systems that incorporated NLP components. The rule-based approach worked reasonably well for narrow, well-defined domains, but it scaled poorly. Human language is enormously complex and varied; writing rules to cover every possible construction, idiom, and ambiguity proved impractical for open-domain applications.

Statistical revolution (1990s-2000s)

The late 1980s and 1990s brought a paradigm shift. Increased computational power and the growing availability of digitized text corpora enabled researchers to move away from hand-written rules and toward statistical methods. Rather than encoding linguistic knowledge explicitly, statistical NLP systems learned patterns from data.

Hidden Markov Models (HMMs) became widely used for tasks like part-of-speech tagging and speech recognition. Probabilistic context-free grammars enabled statistical parsing. The IBM Models for machine translation (developed by Peter Brown and colleagues at IBM Research in the early 1990s) showed that translation could be treated as a statistical problem, estimating the probability that a sentence in one language corresponded to a sentence in another ^[5].

This era also saw the emergence of practical NLP applications at scale. Search engines began using statistical NLP techniques for query understanding and document ranking. In 2006, Google launched its Statistical Machine Translation system, which leveraged vast amounts of parallel text data to improve translation quality significantly over earlier rule-based approaches.

Period	Dominant Approach	Key Characteristics	Example Systems
1950s-1960s	Early experiments	First MT demonstrations, simple pattern matching	Georgetown-IBM, ELIZA
1960s-1980s	Rule-based	Hand-crafted grammars, expert systems	SHRDLU, LUNAR
1990s-2000s	Statistical	Probabilistic models, corpus-based learning	IBM MT Models, HMM taggers
2010s	Neural / deep learning	Word embeddings, sequence models	Word2Vec, seq2seq, attention
2017-present	Transformer-based	Pre-trained models, transfer learning	BERT, GPT, T5

Neural NLP and the deep learning era (2010s)

The 2010s saw deep learning methods overtake traditional statistical approaches across nearly every NLP task. Recurrent neural networks (RNNs), and in particular Long Short-Term Memory (LSTM) networks, proved effective at modeling sequential data like text. The sequence-to-sequence (seq2seq) architecture, introduced by Sutskever, Vinyals, and Le in 2014, enabled neural machine translation systems that could map an input sequence in one language to an output sequence in another ^[6].

A key development was the introduction of the attention mechanism by Bahdanau, Cho, and Bengio in 2014 ^[7]. Attention allowed models to focus on relevant parts of the input when generating each element of the output, rather than compressing the entire input into a single fixed-length vector. This dramatically improved performance on tasks like translation, especially for longer sentences.

How does modern NLP differ from rule-based methods?

Modern NLP learns language patterns automatically from large amounts of text, whereas rule-based NLP depended on grammars and dictionaries written by hand. Rule-based systems were precise inside their designed scope but brittle the moment they met language outside their rules, and extending them was labor-intensive. Statistical and neural systems replaced that hand-coding with models trained on data, and the current pre-trained paradigm goes a step further: one large model is trained once on vast unlabeled text, then adapted to many tasks through fine-tuning or prompting. The practical consequence is that practitioners now spend less time designing task-specific architectures and more time selecting a pre-trained model, crafting prompts, and fine-tuning on task-specific data.

The transformer revolution (2017-present)

The publication of "Attention Is All You Need" by Vaswani et al. in June 2017 introduced the Transformer architecture, described in the paper as "based solely on attention mechanisms, dispensing with recurrence and convolutions entirely" ^[8]. The paper, authored by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin (then at Google Brain, Google Research, and the University of Toronto), showed that Transformer models were not only more effective than RNN-based models on translation tasks but also far more parallelizable, reducing training times substantially. On the WMT 2014 English-to-German translation task, the Transformer reached 28.4 BLEU, more than 2 BLEU above the previous best results ^[8].
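The core operation of the Transformer, scaled dot-product attention, computes softmax(QK^T / sqrt(d_k))V. The following is a minimal pure-Python sketch of that formula for small lists of vectors. It leaves out multiple heads, masking, and learned projections, and the function names are chosen for this illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# A query aligned with the first key draws mostly on the first value.
result = attention([[5.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[10.0, 0.0], [0.0, 10.0]])
print(result)
```

Because every query is scored against every key independently, the loop over queries can be run in parallel, which is the property the paper exploits to shorten training.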

The Transformer architecture became the foundation for virtually all subsequent breakthroughs in NLP. It enabled the era of large-scale pre-trained language models that could be fine-tuned for specific tasks, fundamentally changing how NLP research and applications are conducted.

Core tasks

NLP encompasses a wide range of tasks. Some are low-level building blocks used within larger systems; others are application-level tasks that directly serve end users.

Tokenization

Tokenization is the process of breaking raw text into smaller units called tokens. These tokens might be words, subwords, or characters, depending on the approach. Modern NLP systems often use subword tokenization methods like Byte Pair Encoding (BPE) or WordPiece, which handle rare and out-of-vocabulary words by splitting them into known subword units. For example, the word "unhappiness" might be tokenized as "un", "happi", and "ness". Tokenization is typically the first step in any NLP pipeline.
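As a rough illustration of how BPE arrives at subword units, the sketch below learns merges from a toy word-frequency table by repeatedly joining the most frequent adjacent pair of symbols. It is a simplified teaching version: the corpus and function names are assumptions for this example, and production tokenizers add details such as end-of-word markers and byte-level fallbacks.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: frequency} dict."""
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges


def segment(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (hypothetical frequencies).
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=4)
print(segment("lowest", toy_merges))
```

With this toy corpus, the unseen word "lowest" is segmented into "low" and "est", both of which were learned from other words.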

Part-of-speech tagging

Part-of-speech (POS) tagging assigns grammatical labels (noun, verb, adjective, etc.) to each word in a sentence. For instance, in "The cat sat on the mat," "cat" is tagged as a noun and "sat" as a verb. POS tagging was historically performed using HMMs and conditional random fields (CRFs). Modern systems use neural models that achieve accuracy above 97% on standard English benchmarks.

Named entity recognition

Named entity recognition (NER) identifies and classifies proper names and other specific entities in text into categories such as person, organization, location, date, and monetary value. For example, in the sentence "Apple was founded by Steve Jobs in Cupertino in 1976," a NER system should identify "Apple" as an organization, "Steve Jobs" as a person, "Cupertino" as a location, and "1976" as a date.

Syntactic parsing

Syntactic parsing (or simply parsing) analyzes the grammatical structure of a sentence, producing a tree that represents how words relate to one another. Constituency parsing breaks a sentence into nested phrases (noun phrase, verb phrase, etc.), while dependency parsing identifies directed relationships between words (e.g., "cat" is the subject of "sat"). Parsing is useful for information extraction, relation detection, and deeper language understanding.

Sentiment analysis

Sentiment analysis determines the emotional tone or opinion expressed in a piece of text. At its simplest, this means classifying text as positive, negative, or neutral. More fine-grained approaches identify specific aspects being discussed and the sentiment toward each. Sentiment analysis is widely used in business intelligence, social media monitoring, and product review analysis.

Machine translation

Machine translation (MT) is the task of automatically translating text from one natural language to another. It has been a driving force in NLP research since the field's inception. Modern neural machine translation systems, such as those powering Google Translate and DeepL, use Transformer-based models trained on billions of sentence pairs. While MT quality for high-resource language pairs (e.g., English-French, English-German) has improved enormously, translation for low-resource languages remains a significant challenge.

Question answering

Question answering (QA) systems take a natural language question and return an answer, often by extracting it from a given passage or knowledge base. Extractive QA models identify the span of text within a document that answers the question. Generative QA models, by contrast, produce an answer in free text. The Stanford Question Answering Dataset (SQuAD), introduced in 2016, became one of the most influential benchmarks in NLP for evaluating reading comprehension.

Text summarization

Text summarization condenses a longer document into a shorter version while preserving key information. Extractive summarization selects and concatenates the most important sentences from the source. Abstractive summarization generates new text that paraphrases and compresses the original content. LLMs have made abstractive summarization far more practical and fluent than earlier approaches.

Text classification

Text classification assigns predefined labels to documents or passages. Spam detection, topic categorization, and language identification are all forms of text classification. With pre-trained language models, it has become possible to build high-quality text classifiers with remarkably little labeled training data, sometimes just a few dozen examples per class.
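To make the task concrete, here is a minimal sketch of a text classifier. It uses multinomial Naive Bayes with add-one smoothing, a classic statistical baseline rather than the pre-trained-model approach mentioned above; the training examples and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns the counts Naive Bayes needs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for `text`."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data for a spam filter.
model = train([
    ("win cash prize now", "spam"),
    ("cheap cash offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch meeting tomorrow", "ham"),
])
print(classify("cash prize", model))
```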

Approaches over time

Rule-based systems

The earliest NLP systems relied on manually coded rules. Linguists specified grammars, dictionaries, and transformation rules. These systems were precise within their designed scope but brittle when confronted with language outside their rules. Maintaining and extending them was labor-intensive.

Statistical methods

Statistical NLP introduced probabilistic models that learned from data. Key techniques included:

N-gram models: Predicted the next word based on the previous n-1 words. Bigram and trigram models were widely used in speech recognition and language modeling.
Hidden Markov Models: Used for sequence labeling tasks like POS tagging and named entity recognition.
Conditional Random Fields (CRFs): Offered advantages over HMMs by modeling the conditional probability of the entire label sequence given the observation sequence.
Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA): Used for topic modeling and dimensionality reduction of text representations.
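The n-gram idea from the list above can be sketched with a bigram model (n = 2), which estimates the probability of a word from the single word before it. This is a minimal maximum-likelihood version without smoothing; the sentence markers and function names are assumptions for this example.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    follows = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark sentence boundaries.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows


def next_word_prob(follows, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    total = sum(follows[prev].values())
    return follows[prev][word] / total if total else 0.0


counts = train_bigrams(["the cat sat", "the cat ran", "the dog sat"])
print(next_word_prob(counts, "the", "cat"))
```

In practice such models need smoothing, since any word pair absent from the training data would otherwise receive probability zero.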

Neural approaches

Neural NLP began gaining traction in the early 2010s. Key developments included:

Word embeddings: Dense vector representations that captured semantic relationships between words.
RNNs and LSTMs: Sequence models that could process variable-length text and maintain internal state.
Convolutional Neural Networks (CNNs) for text: Applied to text classification and other tasks where local patterns were informative.
Attention mechanisms: Allowed models to selectively focus on relevant parts of the input.

Pre-trained language models

The current dominant paradigm involves pre-training large neural models on vast amounts of unlabeled text, then fine-tuning or prompting them for specific tasks. This transfer learning approach, pioneered by models like ELMo, BERT, and GPT, has proven extraordinarily effective. A single pre-trained model can be adapted to dozens of different NLP tasks, often outperforming task-specific models trained from scratch.

The shift to pre-trained models has also changed the workflow for NLP practitioners. Rather than designing task-specific architectures, much of the work now involves selecting an appropriate pre-trained model, crafting prompts, and fine-tuning on task-specific data.

Key models and milestones

The following table summarizes major models that have shaped the trajectory of NLP.

Model	Year	Developer	Key Innovation	Parameters
Word2Vec	2013	Google (Mikolov et al.)	Efficient word embeddings via skip-gram and CBOW	N/A (embedding method)
GloVe	2014	Stanford (Pennington et al.)	Global co-occurrence statistics for word vectors	N/A (embedding method)
ELMo	2018	Allen AI (Peters et al.)	Deep contextualized word representations from biLMs	~94M
GPT-1	2018	OpenAI	Generative pre-training with Transformer decoder	117M
BERT	2018	Google (Devlin et al.)	Bidirectional pre-training with masked language modeling	110M (Base), 340M (Large)
GPT-2	2019	OpenAI	Scaled generative pre-training; demonstrated strong zero-shot performance	1.5B
T5	2019	Google (Raffel et al.)	Unified text-to-text framework for all NLP tasks	Up to 11B
GPT-3	2020	OpenAI	Massive scale; in-context learning; few-shot capabilities	175B
PaLM	2022	Google	Scaling with Pathways system; strong reasoning	540B
ChatGPT	2022	OpenAI	RLHF-tuned conversational model based on GPT-3.5	Undisclosed
GPT-4	2023	OpenAI	Multimodal (text + image); improved reasoning and safety	~1-1.8T (estimated)
Llama 2	2023	Meta	Open-weight large language model	7B to 70B
GPT-5	2025	OpenAI	Adaptive reasoning; million-token context	Undisclosed

Word2Vec (2013)

Word2Vec, developed by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean at Google, introduced efficient methods for learning dense vector representations of words from large corpora ^[9]. The paper proposed two architectures: Continuous Bag of Words (CBOW), which predicts a target word from its context, and Skip-gram, which predicts context words given a target word. Trained on the Google News corpus (approximately 6 billion tokens), Word2Vec demonstrated that the resulting vectors captured meaningful semantic relationships. The famous example: the vector for "king" minus "man" plus "woman" yielded a vector close to "queen." Word2Vec made word embeddings practical and accessible, and its influence on subsequent NLP research was enormous.
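The analogy arithmetic can be demonstrated with a toy example. The two-dimensional vectors below are invented by hand so that the arithmetic is easy to follow; they are not real Word2Vec embeddings, which are learned from a corpus and typically have hundreds of dimensions.

```python
import math

# Hypothetical hand-picked vectors: the first axis loosely encodes "royalty",
# the second loosely encodes "maleness".
VECTORS = {
    "king": [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man": [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, 0.5],
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c])]
    candidates = (w for w in VECTORS if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("king", "man", "woman"))
```

Excluding the three input words from the candidates mirrors common practice in analogy evaluation, since the nearest vector to the result is often one of the inputs.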

GloVe (2014)

GloVe (Global Vectors for Word Representation), developed by Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford, took a different approach to word embeddings ^[10]. Rather than learning from local context windows like Word2Vec, GloVe trained on global word-word co-occurrence statistics from a corpus. The model combined advantages of matrix factorization methods (like LSA) with the local context window approach. GloVe achieved 75% accuracy on word analogy tasks and came with several pre-trained vector sets, including ones trained on Wikipedia and Common Crawl data with up to 840 billion tokens.

ELMo (2018)

ELMo (Embeddings from Language Models), developed by Matthew Peters and colleagues at the Allen Institute for AI, represented a major conceptual shift ^[11]. Unlike Word2Vec and GloVe, which assigned a single static vector to each word regardless of context, ELMo generated context-dependent representations. The model used a deep bidirectional LSTM language model and combined representations from all its internal layers using task-specific learned weights. Adding ELMo representations to existing models improved state-of-the-art results across six NLP tasks, with relative error reductions of up to 20%.

BERT (2018)

BERT (Bidirectional Encoder Representations from Transformers), developed by Jacob Devlin and colleagues at Google, applied the Transformer architecture to bidirectional pre-training ^[12]. The paper states that BERT is "designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers" ^[12]. BERT introduced two pre-training objectives: the masked language model (MLM) task, where 15% of input tokens are randomly masked and the model learns to predict them, and next sentence prediction (NSP). BERT-Base had 110 million parameters (12 layers, 768 hidden units, 12 attention heads), while BERT-Large had 340 million parameters (24 layers, 1024 hidden units, 16 attention heads). Pre-trained on the BooksCorpus and English Wikipedia, BERT achieved new state-of-the-art results on 11 NLP benchmarks, pushing the GLUE benchmark score to 80.5% (a 7.7 percentage point absolute improvement over the previous best) ^[12].
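The masked language model objective can be sketched as a data-corruption step. In the BERT paper, a selected token is replaced with [MASK] 80% of the time, with a random token 10% of the time, and left unchanged 10% of the time. The sketch below follows that rule but selects each position independently with probability 0.15, a simplification of the exact sampling used in the original code; the function name and arguments are assumptions.

```python
import random


def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # random replacement
        # otherwise keep the original token unchanged
    return corrupted, labels


sentence = "the cat sat on the mat because it was tired".split()
corrupted, labels = mask_tokens(sentence, vocab=sentence)
print(corrupted)
print(labels)
```

The loss is computed only at positions where the label is not None, so the model cannot tell from the input alone which unmasked tokens it will be asked to predict.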

The GPT series

OpenAI's Generative Pre-trained Transformer series took a different architectural approach from BERT, using a causal language model (left-to-right) rather than a bidirectional one.

GPT-1 (June 2018) had 117 million parameters and demonstrated that unsupervised pre-training followed by supervised fine-tuning could yield strong results across multiple NLP tasks ^[13].

GPT-2 (February 2019) scaled up to 1.5 billion parameters and surprised researchers with its text generation fluency. OpenAI initially withheld the full model over concerns about misuse and instead released progressively larger versions in stages, publishing the full 1.5-billion-parameter model in November 2019 ^[14].

GPT-3 (June 2020) reached 175 billion parameters, described by its authors as "10x more than any previous non-sparse language model," and demonstrated remarkable few-shot and zero-shot capabilities ^[15]. Crucially, the model was "applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction" ^[15]. Given just a few examples in the prompt, GPT-3 could perform tasks it had never been explicitly trained for, from translation to code generation. This "in-context learning" ability was a qualitative shift in how NLP models could be used.
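In-context learning amounts to placing a few worked examples in the prompt text itself. The sketch below assembles such a prompt as a plain string. The "Input:" and "Output:" layout and the function name are assumptions made for illustration; they are not a format prescribed by the GPT-3 paper or any particular API.

```python
def build_few_shot_prompt(instruction, demonstrations, query):
    """Assemble a plain-text prompt: instruction, worked examples, then the new input."""
    lines = [instruction, ""]
    for source, target in demonstrations:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The final block is left open so the model continues with the answer.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("good morning", "bonjour")],
    "thank you",
)
print(prompt)
```

No model weights change in this setup: switching to a different task only requires changing the instruction and the demonstrations.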

GPT-4 (March 2023) added multimodal capabilities, accepting both text and image inputs. It showed substantial improvements in reasoning, factual accuracy, and safety over GPT-3 ^[16].

GPT-5 (August 2025) introduced adaptive reasoning, automatically selecting the appropriate reasoning depth for a given task, and expanded the context window to support over one million tokens.

T5 (2019)

T5 (Text-to-Text Transfer Transformer), developed by Colin Raffel and colleagues at Google, proposed a unified framework in which every NLP task is cast as a text-to-text problem ^[17]. Classification, translation, summarization, and question answering were all framed as taking a text input and producing a text output. This simplified the architecture and training pipeline. The largest T5 model had 11 billion parameters and achieved state-of-the-art results on GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail benchmarks. T5 was trained on the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl data.

Evaluation metrics and benchmarks

Measuring progress in NLP requires both automatic metrics and carefully designed benchmarks.

Automatic metrics

Metric	Full Name	Primary Use	How It Works
BLEU	Bilingual Evaluation Understudy	Machine translation	Measures n-gram overlap between system output and reference translations; scores range from 0 to 1 and are often reported on a 0-100 scale
ROUGE	Recall-Oriented Understudy for Gisting Evaluation	Text summarization	Measures overlap of n-grams, word sequences, and word pairs between system and reference summaries
Perplexity	N/A	Language modeling	Measures how well a probability model predicts a sample; lower perplexity indicates a better model
F1 Score	N/A	Classification, NER, QA	Harmonic mean of precision and recall
METEOR	Metric for Evaluation of Translation with Explicit Ordering	Machine translation	Aligns unigrams using exact, stem, and synonym matches, then combines precision and recall with a penalty for word-order differences

BLEU, introduced by Kishore Papineni and colleagues at IBM in 2002, became the standard metric for machine translation evaluation ^[18]. It computes precision of n-gram matches between a candidate translation and one or more reference translations, with a brevity penalty for overly short outputs. Despite known limitations (it correlates imperfectly with human judgment, especially at the sentence level), BLEU remains widely used. A survey found that 82% of machine translation papers published between 2019 and 2020 evaluated using BLEU, even though over 108 alternative MT metrics had been proposed.
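The two ingredients described above, clipped n-gram precision and the brevity penalty, can be sketched as follows. This is a simplified sentence-level version with a single reference, whitespace tokenization, and no smoothing, so it should not be expected to reproduce the scores of standard evaluation tools; the function names are chosen for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Single-reference, unsmoothed BLEU on whitespace tokens (0.0 to 1.0)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


print(bleu("the cat sat on the mat", "the cat sat on the mat"))
print(bleu("the cat", "the cat sat on the mat", max_n=2))
```

In the second call every n-gram in the candidate is correct, yet the score is low because the candidate is much shorter than the reference, which is exactly what the brevity penalty is for.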

ROUGE, developed by Chin-Yew Lin in 2004, is the standard metric for evaluating text summarization systems ^[19]. ROUGE-N measures n-gram recall, ROUGE-L measures the longest common subsequence, and ROUGE-W applies weighted longest common subsequence matching.
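A minimal sketch of two of these variants follows: ROUGE-N recall and a recall-only form of ROUGE-L based on the longest common subsequence. The full ROUGE-L definition combines LCS precision and recall into an F-measure, which is omitted here for brevity; tokenization is plain whitespace splitting and the function names are assumptions.

```python
from collections import Counter


def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: share of reference n-grams that also appear in the candidate."""
    def grams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = grams(candidate), grams(reference)
    total = sum(ref.values())
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length (the recall part of ROUGE-L)."""
    ref = reference.split()
    return lcs_length(candidate.split(), ref) / len(ref) if ref else 0.0


reference = "the cat sat on the mat"
print(rouge_n("the cat sat", reference))
print(rouge_l_recall("the cat was on the mat", reference))
```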

Perplexity is a standard metric for evaluating language models. It is the exponential of the average negative log-likelihood per token that the model assigns to a sequence. A language model with lower perplexity assigns higher probability to held-out test data, indicating better modeling of the language.
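As a small worked example, the function below computes perplexity from the probabilities a model assigned to each token of a held-out sequence. The probabilities are made up for illustration; in practice they come from the model being evaluated.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability the model gave each token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among four options, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

This reading of perplexity as an effective number of equally likely choices per token is why lower values indicate a more confident and better-fitting model.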

Benchmarks

GLUE (General Language Understanding Evaluation), introduced by Wang et al. in 2018, is a collection of nine tasks designed to test a model's language understanding capabilities ^[20]. Tasks include sentiment analysis (SST-2), textual entailment (MNLI, RTE), semantic similarity (STS-B, MRPC, QQP), and linguistic acceptability (CoLA). When GLUE was introduced, the best models scored around 69%. BERT pushed this to 80.5%, and by 2020, models had surpassed average human performance on the benchmark.

SuperGLUE, introduced by Wang et al. in 2019, was created because models had begun to saturate the original GLUE benchmark ^[21]. SuperGLUE includes more challenging tasks such as multi-sentence reading comprehension (MultiRC), word sense disambiguation (WiC), and causal reasoning (COPA). Like GLUE before it, SuperGLUE has also been largely saturated by modern LLMs.

More recent benchmarks have emerged to evaluate the broader capabilities of LLMs. MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects. BIG-Bench provides a diverse set of over 200 tasks. HELM (Holistic Evaluation of Language Models) evaluates models across multiple dimensions including accuracy, calibration, robustness, fairness, and efficiency.

Applications

NLP technology has become embedded in a wide range of products and services.

Search engines

Modern search engines rely heavily on NLP to understand user queries and match them with relevant documents. In October 2019, Google announced that BERT was being used to improve understanding of search queries, affecting roughly 10% of English-language searches in the United States; Google vice president of search Pandu Nayak called it "one of the biggest leaps forward in the history of Search" ^[25]. By 2025, search systems combine retrieval-augmented generation (RAG) with semantic similarity matching to provide more direct, conversational answers to user queries.

Conversational AI and chatbots

Conversational systems have progressed from simple rule-based chatbots to LLM-powered assistants capable of sustained, contextual dialogue. Customer service chatbots handle millions of interactions daily, resolving routine issues and escalating complex ones to human agents. General-purpose assistants like ChatGPT, Claude, and Google Gemini can write, analyze, code, and reason across a broad range of topics.

Machine translation services

Neural machine translation powers services used by hundreds of millions of people. Google Translate supports over 130 languages. DeepL, launched in 2017, has gained recognition for high-quality translations, particularly among European languages. Real-time translation features are now integrated into video conferencing platforms, messaging apps, and web browsers.

Healthcare

NLP is increasingly used in clinical settings to extract information from unstructured medical records, support clinical decision-making, and assist with medical documentation. Major electronic health record vendors like Epic have released AI documentation tools that generate structured clinical notes from physician-patient conversations. NLP-powered systems analyze medical literature to support research and drug discovery. LLMs are also being used to provide health information to patients, though their use in clinical contexts requires careful validation.

Legal

Law firms and legal departments use NLP for contract analysis, legal research, document review during discovery, and due diligence. NLP systems can identify relevant clauses in thousands of contracts, flag potential risks, and extract key terms and conditions. In large litigation matters, such tools are reported to cut document-review time substantially, in some cases from weeks to hours.

Content moderation

Social media platforms and online communities use NLP to detect hate speech, misinformation, spam, and other policy-violating content. These systems must operate at massive scale (billions of posts per day) and across many languages. The challenge of content moderation has grown alongside the scale of online platforms, and NLP-based systems remain imperfect, particularly for content that requires cultural context or understanding of sarcasm and irony.

Finance

Financial institutions use NLP for sentiment analysis of news and social media (to inform trading strategies), analysis of earnings calls and financial reports, fraud detection in communications, and regulatory compliance monitoring. Named entity recognition helps extract company names, financial figures, and dates from unstructured text.

Challenges

Ambiguity

Human language is inherently ambiguous at multiple levels. Lexical ambiguity arises when a word has multiple meanings: "bank" can refer to a financial institution or the edge of a river. Syntactic ambiguity occurs when a sentence can be parsed in multiple ways: "I saw the man with the telescope" could mean you used a telescope to see the man, or you saw a man who had a telescope. Pragmatic ambiguity involves interpreting the intended meaning behind an utterance, which often depends on context, shared knowledge, and social conventions. The phenomenon of crash blossoms (ambiguous newspaper headlines) illustrates how challenging syntactic ambiguity can be, even for humans.

Context and common sense

Understanding language often requires world knowledge and common-sense reasoning that goes beyond what is explicitly stated. Consider the sentence "The trophy wouldn't fit in the suitcase because it was too big." Humans easily resolve "it" as referring to the trophy, but this requires knowledge about the relative sizes of trophies and suitcases. This specific example comes from the Winograd Schema Challenge, a common-sense reasoning benchmark proposed by Hector Levesque in 2011 as a more robust alternative to the Turing test ^[26]. Despite progress, LLMs still struggle with certain types of common-sense reasoning, particularly those involving physical intuition, causal reasoning, and temporal understanding.

Multilingual and low-resource languages

Of the roughly 7,000 languages spoken worldwide, NLP tools and resources are concentrated on a small fraction, primarily English, Chinese, and a handful of European languages. Low-resource languages lack the digitized text corpora, annotated datasets, and pre-trained models that high-resource languages benefit from. Creating linguistic resources for these languages is time-consuming and expensive. Research has shown that LLMs are more prone to generating harmful or factually incorrect content in low-resource languages compared to high-resource ones ^[22]. Cross-lingual transfer learning and multilingual models like mBERT and XLM-R have made some progress, but significant performance gaps remain.

Bias and fairness

NLP models learn from data that reflects the biases of the societies that produced it. Word embeddings trained on large web corpora have been shown to encode gender stereotypes (for example, associating "nurse" more strongly with women and "engineer" with men) ^[23]. These biases can propagate into downstream applications, leading to unfair or discriminatory outcomes in areas like hiring, lending, content moderation, and criminal justice. Mitigating bias is an active area of research, but it remains a fundamental challenge because bias is deeply woven into language itself.
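
One common way to probe this kind of association is to project word vectors onto a "gender direction" such as the difference between the vectors for "he" and "she". The sketch below is a simplified illustration of that idea, not the procedure from the cited paper, and the three-dimensional vectors are made up for the example; real embeddings have hundreds of dimensions and are learned from text.

```python
import math

# Hypothetical 3-dimensional vectors, invented for illustration only.
TOY_VECTORS = {
    "he": [1.0, 0.0, 0.2],
    "she": [-1.0, 0.0, 0.2],
    "engineer": [0.6, 0.7, 0.1],
    "nurse": [-0.7, 0.6, 0.1],
    "table": [0.0, 0.1, 0.9],
}


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def gender_projection(word, vectors=TOY_VECTORS):
    """Cosine similarity between a word vector and the he-minus-she direction.

    Positive values lean toward "he", negative values toward "she",
    and values near zero indicate no association along this axis.
    """
    direction = subtract(vectors["he"], vectors["she"])
    vector = vectors[word]
    return dot(vector, direction) / (norm(vector) * norm(direction))


if __name__ == "__main__":
    for word in ("engineer", "nurse", "table"):
        print(word, round(gender_projection(word), 3))
```

With these toy vectors, "engineer" scores positive, "nurse" scores negative, and "table" scores zero. On real embeddings the same measurement is what reveals that occupation words carry a gender signal inherited from the training text.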

Hallucination and factual accuracy

LLMs sometimes generate text that is fluent and confident but factually incorrect, a problem commonly called "hallucination." Because these models learn statistical patterns in language rather than maintaining a verified knowledge base, they can produce plausible-sounding statements that are entirely fabricated. This is especially problematic in high-stakes domains like healthcare, law, and finance. Techniques like retrieval-augmented generation (RAG), which grounds model outputs in retrieved documents, and chain-of-thought prompting have helped reduce but not eliminate this problem.
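
The retrieval step in RAG can be sketched in a few lines. This is a minimal illustration under stated assumptions: the document store, the word-overlap scoring, and the prompt wording are all hypothetical, and no language model is called. Production systems typically use dense vector search rather than word overlap.

```python
import re

# Hypothetical document store for illustration.
DOCUMENTS = [
    "The Transformer architecture was introduced in 2017.",
    "BLEU is an automatic metric for evaluating machine translation.",
    "ELIZA was an early chatbot described in 1966.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, documents, k=1):
    """Return the k documents sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Assemble a prompt that asks the model to answer from retrieved passages."""
    passages = retrieve(question, documents, k)
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer using only the passages below. "
        "If they do not contain the answer, say so.\n"
        f"Passages:\n{context}\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    print(build_prompt("When was the Transformer architecture introduced?", DOCUMENTS))
```

The resulting prompt would then be sent to a language model. Grounding the answer in retrieved text narrows the room for fabrication, but the model can still misread or ignore the passages, which is why the technique reduces hallucination without eliminating it.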

Computational cost and environmental impact

Training large language models requires enormous computational resources. GPT-3's training is estimated to have consumed about 3,640 petaflop/s-days of compute, meaning 3,640 days of sustained computation at one petaflop per second ^[15]. The energy consumption associated with training and running these models has raised environmental concerns. Researchers have called for greater transparency about the carbon footprint of NLP research and have explored techniques like model distillation, pruning, and efficient architectures to reduce computational costs.
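
To give a sense of scale, the short calculation below converts a petaflop/s-day figure into total floating-point operations and compares it with a widely used rule of thumb of roughly 6 operations per parameter per training token. The inputs (about 3,640 petaflop/s-days, 175 billion parameters, and about 300 billion training tokens) are the figures reported in the GPT-3 paper; the rule of thumb is an approximation that does not come from this article, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def petaflops_days_to_flops(pf_days):
    """Convert petaflop/s-days into total floating-point operations."""
    return pf_days * 1e15 * SECONDS_PER_DAY


def approx_training_flops(parameters, tokens):
    """Rule-of-thumb estimate: about 6 operations per parameter per token."""
    return 6 * parameters * tokens


if __name__ == "__main__":
    reported = petaflops_days_to_flops(3640)         # figure from the GPT-3 paper
    estimated = approx_training_flops(175e9, 300e9)  # parameters, training tokens
    print(f"reported:  {reported:.3e} operations")
    print(f"estimated: {estimated:.3e} operations")
```

Both routes land near 3 × 10^23 operations, which is why "several thousand petaflop/s-days" and "hundreds of sextillions of operations" describe the same training run.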

Current state (2025-2026)

As of early 2026, NLP is dominated by large language models built on the Transformer architecture. The field has undergone a remarkable consolidation: where a decade ago researchers designed task-specific models for each NLP problem, today a single pre-trained LLM can handle dozens of tasks through prompting and fine-tuning.

Context windows have expanded dramatically. Where early Transformer models were limited to 512 or 1,024 tokens, current frontier models from OpenAI, Anthropic, and Google support context windows of one million tokens or more. This expansion has enabled new use cases like analyzing entire codebases, processing lengthy legal documents, and maintaining extended multi-turn conversations.

Multimodality has become standard at the frontier. Models like GPT-5, Gemini 3, and Claude Opus process not just text but also images, audio, and video. The boundary between NLP and computer vision has blurred substantially.

Open-weight models have also advanced significantly. Meta's Llama series, Mistral's models, and others have made capable language models available for local deployment and customization, enabling organizations with strict data privacy requirements to benefit from LLM capabilities.

The NLP market has grown rapidly in commercial terms. Industry estimates placed the global NLP market at approximately $30 billion in 2025, expanding to approximately $34.8 billion in 2026, reflecting roughly 16% year-over-year growth ^[24]. Multimodal NLP models that process text alongside images, audio, and video are projected to post significantly higher growth rates than text-only models through the end of the decade.

Despite these advances, fundamental challenges persist. Bias, hallucination, multilingual equity, and the environmental cost of training remain active areas of research and concern. The field continues to evolve at a pace that makes even recent breakthroughs seem like distant history within a few years.

References

Turing, A.M. (1950). "Computing Machinery and Intelligence." *Mind*, 59(236), 433-460.
Hutchins, J. (2004). "The first public demonstration of machine translation: the Georgetown-IBM system, 7th January 1954." *Proceedings of AMTA 2004*.
ALPAC (1966). "Language and Machines: Computers in Translation and Linguistics." National Academy of Sciences, National Research Council, Publication 1416.
Weizenbaum, J. (1966). "ELIZA: A computer program for the study of natural language communication between man and machine." *Communications of the ACM*, 9(1), 36-45.
Brown, P.F. et al. (1993). "The Mathematics of Statistical Machine Translation: Parameter Estimation." *Computational Linguistics*, 19(2), 263-311.
Sutskever, I., Vinyals, O., & Le, Q.V. (2014). "Sequence to Sequence Learning with Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate." *arXiv:1409.0473*.
Vaswani, A. et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:1706.03762.
Mikolov, T. et al. (2013). "Efficient Estimation of Word Representations in Vector Space." *arXiv:1301.3781*.
Pennington, J., Socher, R., & Manning, C.D. (2014). "GloVe: Global Vectors for Word Representation." *Proceedings of EMNLP 2014*, 1532-1543.
Peters, M.E. et al. (2018). "Deep Contextualized Word Representations." *Proceedings of NAACL-HLT 2018*.
Devlin, J. et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *arXiv:1810.04805*.
Radford, A. et al. (2018). "Improving Language Understanding by Generative Pre-Training." OpenAI.
Radford, A. et al. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI.
Brown, T.B. et al. (2020). "Language Models are Few-Shot Learners." *Advances in Neural Information Processing Systems (NeurIPS)*. arXiv:2005.14165.
OpenAI (2023). "GPT-4 Technical Report." *arXiv:2303.08774*.
Raffel, C. et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." *Journal of Machine Learning Research*, 21(140), 1-67.
Papineni, K. et al. (2002). "BLEU: A Method for Automatic Evaluation of Machine Translation." *Proceedings of ACL 2002*.
Lin, C.-Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries." *Text Summarization Branches Out: Proceedings of the ACL-04 Workshop*.
Wang, A. et al. (2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding." *arXiv:1804.07461*.
Wang, A. et al. (2019). "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." *Advances in Neural Information Processing Systems (NeurIPS)*.
Yong, Z.X. et al. (2023). "Low-Resource Languages Jailbreak GPT-4." *arXiv:2310.02446*.
Bolukbasi, T. et al. (2016). "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings." *Advances in Neural Information Processing Systems (NeurIPS)*.
The Business Research Company (2025). "Natural Language Processing Market Report 2025."
Nayak, P. (2019). "Understanding searches better than ever before." Google: The Keyword (official blog), October 25, 2019.
Levesque, H., Davis, E., & Morgenstern, L. (2012). "The Winograd Schema Challenge." *Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning (KR 2012)*.
