Natural language processing (NLP) is a subfield of artificial intelligence and linguistics concerned with the interaction between computers and human language. It encompasses a broad range of computational techniques for analyzing, understanding, generating, and manipulating natural language in both text and speech form. NLP draws on disciplines including computer science, computational linguistics, and machine learning, with the goal of enabling machines to process language in ways that are useful and meaningful.
The field spans tasks as varied as machine translation, sentiment analysis, question answering, text summarization, and dialogue systems. Over the past decade, NLP has been transformed by the rise of large language models (LLMs), which have pushed the boundaries of what machines can do with human language.
The history of NLP stretches back to the earliest days of computing. Its development can be roughly divided into several eras, each defined by the dominant paradigm of the time.
Alan Turing's 1950 paper "Computing Machinery and Intelligence" posed the question "Can machines think?" and proposed what became known as the Turing test, a measure of a machine's ability to exhibit intelligent behavior indistinguishable from that of a human [1]. This paper laid conceptual groundwork for NLP, even though practical systems were still years away.
The Georgetown-IBM experiment of January 7, 1954 was one of the first public demonstrations of machine translation. Researchers at Georgetown University and IBM used an IBM 701 computer to automatically translate over 60 Russian sentences into English [2]. The system relied on just six grammar rules and a vocabulary of 250 lexical items covering fields like politics, law, chemistry, and military affairs. Despite its limited scope, the demonstration generated enormous public interest and raised expectations that fully automatic, high-quality translation would be achievable within a few years.
Those expectations proved premature. In 1966, the ALPAC report (from the Automatic Language Processing Advisory Committee, a panel of seven scientists led by John R. Pierce) concluded that machine translation was slower, less accurate, and more expensive than human translation [3]. The report's findings led to a dramatic reduction in MT research funding in the United States for roughly two decades, contributing to what some historians describe as the beginning of the first AI winter.
Also in the mid-1960s, Joseph Weizenbaum at MIT created ELIZA (1964-1966), one of the earliest programs to attempt natural language conversation [4]. ELIZA's most famous script, DOCTOR, simulated a Rogerian psychotherapist by using pattern matching and substitution rules to reflect a user's statements back as questions. Weizenbaum chose the psychotherapy framing specifically to sidestep the need for real-world knowledge. Despite its simplicity, ELIZA convinced some users it genuinely understood them. Weizenbaum's own secretary reportedly asked him to leave the room so she could have a private conversation with the program.
From the 1960s through the 1980s, most NLP systems relied on hand-crafted rules. Linguists and computer scientists wrote detailed grammars and lexicons that specified how language should be parsed and interpreted. Systems like SHRDLU (1970), developed by Terry Winograd at MIT, could understand and respond to English commands within a constrained "blocks world" environment. SHRDLU could answer questions about the objects in its world, follow instructions to move them, and even explain its reasoning.
Other notable systems from this period include LUNAR (1972), which answered questions about lunar soil samples in natural English, and various expert systems that incorporated NLP components. The rule-based approach worked reasonably well for narrow, well-defined domains, but it scaled poorly. Human language is enormously complex and varied; writing rules to cover every possible construction, idiom, and ambiguity proved impractical for open-domain applications.
The late 1980s and 1990s brought a paradigm shift. Increased computational power and the growing availability of digitized text corpora enabled researchers to move away from hand-written rules and toward statistical methods. Rather than encoding linguistic knowledge explicitly, statistical NLP systems learned patterns from data.
Hidden Markov Models (HMMs) became widely used for tasks like part-of-speech tagging and speech recognition. Probabilistic context-free grammars enabled statistical parsing. The IBM Models for machine translation (developed by Peter Brown and colleagues at IBM Research in the early 1990s) showed that translation could be treated as a statistical problem, estimating the probability that a sentence in one language corresponded to a sentence in another [5].
This era also saw the emergence of practical NLP applications at scale. Search engines began using statistical NLP techniques for query understanding and document ranking. In 2006, Google launched its statistical machine translation system, which leveraged vast amounts of parallel text data to improve translation quality significantly over earlier rule-based approaches.
| Period | Dominant Approach | Key Characteristics | Example Systems |
|---|---|---|---|
| 1950s-1960s | Early experiments | First MT demonstrations, simple pattern matching | Georgetown-IBM, ELIZA |
| 1960s-1980s | Rule-based | Hand-crafted grammars, expert systems | SHRDLU, LUNAR |
| 1990s-2000s | Statistical | Probabilistic models, corpus-based learning | IBM MT Models, HMM taggers |
| 2010s | Neural / deep learning | Word embeddings, sequence models | Word2Vec, seq2seq, attention |
| 2017-present | Transformer-based | Pre-trained models, transfer learning | BERT, GPT, T5 |
The 2010s saw deep learning methods overtake traditional statistical approaches across nearly every NLP task. Recurrent neural networks (RNNs), and later Long Short-Term Memory (LSTM) networks, proved effective at modeling sequential data like text. The sequence-to-sequence (seq2seq) architecture, introduced by Sutskever, Vinyals, and Le in 2014, enabled neural machine translation systems that could map an input sequence in one language to an output sequence in another [6].
A key development was the introduction of the attention mechanism by Bahdanau, Cho, and Bengio in 2014 [7]. Attention allowed models to focus on relevant parts of the input when generating each element of the output, rather than compressing the entire input into a single fixed-length vector. This dramatically improved performance on tasks like translation, especially for longer sentences.
The publication of "Attention Is All You Need" by Vaswani et al. in June 2017 introduced the Transformer architecture, which dispensed with recurrence and convolutions entirely in favor of self-attention mechanisms [8]. The paper, authored by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin at Google Brain, showed that Transformer models were not only more effective than RNN-based models on translation tasks but also far more parallelizable, reducing training times substantially.
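The core self-attention operation is compact enough to sketch directly. The following NumPy snippet is an illustrative single-head, unbatched implementation of scaled dot-product attention as defined in the paper; the input sizes are arbitrary toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the chosen axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1: where to "attend"
    return weights @ V, weights

# Toy example: 3 query positions attending over 4 key/value positions, d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))  # attention distribution for each query position
```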
The Transformer architecture became the foundation for virtually all subsequent breakthroughs in NLP. It enabled the era of large-scale pre-trained language models that could be fine-tuned for specific tasks, fundamentally changing how NLP research and applications are conducted.
NLP encompasses a wide range of tasks. Some are low-level building blocks used within larger systems; others are application-level tasks that directly serve end users.
Tokenization is the process of breaking raw text into smaller units called tokens. These tokens might be words, subwords, or characters, depending on the approach. Modern NLP systems often use subword tokenization methods like Byte Pair Encoding (BPE) or WordPiece, which handle rare and out-of-vocabulary words by splitting them into known subword units. For example, the word "unhappiness" might be tokenized as "un", "happi", and "ness". Tokenization is typically the first step in any NLP pipeline.
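The following toy sketch illustrates greedy longest-match-first subword segmentation in the spirit of WordPiece. The vocabulary here is hypothetical; real tokenizers learn their subword inventories from data (for example via BPE merges) and mark word-internal pieces with a "##" prefix, which is omitted for simplicity.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first segmentation (toy WordPiece-style sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:            # no match at all: mark the word as unknown
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens

# Hypothetical subword vocabulary, for illustration only.
vocab = {"un", "happi", "ness", "happy", "cat", "s"}
print(subword_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
print(subword_tokenize("cats", vocab))         # ['cat', 's']
```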
Part-of-speech (POS) tagging assigns grammatical labels (noun, verb, adjective, etc.) to each word in a sentence. For instance, in "The cat sat on the mat," "cat" is tagged as a noun and "sat" as a verb. POS tagging was historically performed using HMMs and conditional random fields (CRFs). Modern systems use neural models that achieve accuracy above 97% on standard English benchmarks.
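A tagger can be tried out with NLTK's off-the-shelf perceptron tagger, as in the short example below; the downloadable resource names may differ slightly across NLTK versions, and the tags follow the Penn Treebank convention.

```python
import nltk

# One-time downloads of the tokenizer and tagger data.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The cat sat on the mat.")
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'),
#       ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
```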
Named entity recognition (NER) identifies and classifies proper names and other specific entities in text into categories such as person, organization, location, date, and monetary value. For example, in the sentence "Apple was founded by Steve Jobs in Cupertino in 1976," a NER system should identify "Apple" as an organization, "Steve Jobs" as a person, "Cupertino" as a location, and "1976" as a date.
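A minimal NER example using spaCy, assuming the small English model has been installed separately (`python -m spacy download en_core_web_sm`):

```python
import spacy

# Assumes the en_core_web_sm model has already been downloaded.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple was founded by Steve Jobs in Cupertino in 1976.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected (approximately): Apple ORG, Steve Jobs PERSON, Cupertino GPE, 1976 DATE
```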
Syntactic parsing (or simply parsing) analyzes the grammatical structure of a sentence, producing a tree that represents how words relate to one another. Constituency parsing breaks a sentence into nested phrases (noun phrase, verb phrase, etc.), while dependency parsing identifies directed relationships between words (e.g., "cat" is the subject of "sat"). Parsing is useful for information extraction, relation detection, and deeper language understanding.
Sentiment analysis determines the emotional tone or opinion expressed in a piece of text. At its simplest, this means classifying text as positive, negative, or neutral. More fine-grained approaches identify specific aspects being discussed and the sentiment toward each. Sentiment analysis is widely used in business intelligence, social media monitoring, and product review analysis.
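A minimal example using the Hugging Face `transformers` pipeline API; the call below downloads a default English sentiment model on first use, and the exact label set and scores depend on that model.

```python
from transformers import pipeline

# Downloads a default English sentiment model on first use.
classifier = pipeline("sentiment-analysis")
print(classifier("I absolutely loved this film."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]  -- label and score depend on the model
```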
Machine translation (MT) is the task of automatically translating text from one natural language to another. It has been a driving force in NLP research since the field's inception. Modern neural machine translation systems, such as those powering Google Translate and DeepL, use Transformer-based models trained on billions of sentence pairs. While MT quality for high-resource language pairs (e.g., English-French, English-German) has improved enormously, translation for low-resource languages remains a significant challenge.
Question answering (QA) systems take a natural language question and return an answer, often by extracting it from a given passage or knowledge base. Extractive QA models identify the span of text within a document that answers the question. Generative QA models, by contrast, produce an answer in free text. The Stanford Question Answering Dataset (SQuAD), introduced in 2016, became one of the most influential benchmarks in NLP for evaluating reading comprehension.
Text summarization condenses a longer document into a shorter version while preserving key information. Extractive summarization selects and concatenates the most important sentences from the source. Abstractive summarization generates new text that paraphrases and compresses the original content. LLMs have made abstractive summarization far more practical and fluent than earlier approaches.
Text classification assigns predefined labels to documents or passages. Spam detection, topic categorization, and language identification are all forms of text classification. With pre-trained language models, it has become possible to build high-quality text classifiers with remarkably little labeled training data, sometimes just a few dozen examples per class.
The earliest NLP systems relied on manually coded rules. Linguists specified grammars, dictionaries, and transformation rules. These systems were precise within their designed scope but brittle when confronted with language outside their rules. Maintaining and extending them was labor-intensive.
Statistical NLP introduced probabilistic models that learned from data. Key techniques included hidden Markov models for tagging and speech recognition, n-gram language models, probabilistic context-free grammars for parsing, conditional random fields for sequence labeling, and the IBM alignment models for machine translation.
Neural NLP began gaining traction in the early 2010s. Key developments included dense word embeddings such as Word2Vec and GloVe, recurrent and LSTM networks for modeling sequences, the sequence-to-sequence architecture for translation and generation, and the attention mechanism.
The current dominant paradigm involves pre-training large neural models on vast amounts of unlabeled text, then fine-tuning or prompting them for specific tasks. This transfer learning approach, pioneered by models like ELMo, BERT, and GPT, has proven extraordinarily effective. A single pre-trained model can be adapted to dozens of different NLP tasks, often outperforming task-specific models trained from scratch.
The shift to pre-trained models has also changed the workflow for NLP practitioners. Rather than designing task-specific architectures, much of the work now involves selecting an appropriate pre-trained model, crafting prompts, and fine-tuning on task-specific data.
The following table summarizes major models that have shaped the trajectory of NLP.
| Model | Year | Developer | Key Innovation | Parameters |
|---|---|---|---|---|
| Word2Vec | 2013 | Google (Mikolov et al.) | Efficient word embeddings via skip-gram and CBOW | N/A (embedding method) |
| GloVe | 2014 | Stanford (Pennington et al.) | Global co-occurrence statistics for word vectors | N/A (embedding method) |
| ELMo | 2018 | Allen AI (Peters et al.) | Deep contextualized word representations from biLMs | ~94M |
| GPT-1 | 2018 | OpenAI | Generative pre-training with Transformer decoder | 117M |
| BERT | 2018 | Google (Devlin et al.) | Bidirectional pre-training with masked language modeling | 110M (Base), 340M (Large) |
| GPT-2 | 2019 | OpenAI | Scaled generative pre-training; demonstrated strong zero-shot performance | 1.5B |
| T5 | 2019 | Google (Raffel et al.) | Unified text-to-text framework for all NLP tasks | Up to 11B |
| GPT-3 | 2020 | OpenAI | Massive scale; in-context learning; few-shot capabilities | 175B |
| PaLM | 2022 | Google | Scaling with Pathways system; strong reasoning | 540B |
| ChatGPT | 2022 | OpenAI | RLHF-tuned conversational model based on GPT-3.5 | Undisclosed |
| GPT-4 | 2023 | OpenAI | Multimodal (text + image); improved reasoning and safety | ~1-1.8T (estimated) |
| Llama 2 | 2023 | Meta | Open-weight large language model | 7B to 70B |
| GPT-5 | 2025 | OpenAI | Adaptive reasoning; million-token context | Undisclosed |
Word2Vec, developed by Tomas Mikolov, Kai Chen, Greg Corrado, and Jeff Dean at Google, introduced efficient methods for learning dense vector representations of words from large corpora [9]. The paper proposed two architectures: Continuous Bag of Words (CBOW), which predicts a target word from its context, and Skip-gram, which predicts context words given a target word. Trained on the Google News corpus (approximately 6 billion tokens), Word2Vec demonstrated that the resulting vectors captured meaningful semantic relationships. The famous example: the vector for "king" minus "man" plus "woman" yielded a vector close to "queen." Word2Vec made word embeddings practical and accessible, and its influence on subsequent NLP research was enormous.
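The analogy behavior can be reproduced with gensim's pretrained Google News vectors, as sketched below; the download is large (roughly 1.6 GB), and the exact nearest neighbors vary slightly with the vector set used.

```python
import gensim.downloader as api

# Downloads the pretrained Google News word2vec vectors (~1.6 GB) on first use.
wv = api.load("word2vec-google-news-300")
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is typically the top result
```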
GloVe (Global Vectors for Word Representation), developed by Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford, took a different approach to word embeddings [10]. Rather than learning from local context windows like Word2Vec, GloVe trained on global word-word co-occurrence statistics from a corpus. The model combined advantages of matrix factorization methods (like LSA) with the local context window approach. GloVe achieved 75% accuracy on word analogy tasks and came with several pre-trained vector sets, including ones trained on Wikipedia and Common Crawl data with up to 840 billion tokens.
ELMo (Embeddings from Language Models), developed by Matthew Peters and colleagues at the Allen Institute for AI, represented a major conceptual shift [11]. Unlike Word2Vec and GloVe, which assigned a single static vector to each word regardless of context, ELMo generated context-dependent representations. The model used a deep bidirectional LSTM language model and combined representations from all its internal layers using task-specific learned weights. Adding ELMo representations to existing models improved state-of-the-art results across six NLP tasks, with relative error reductions of up to 20%.
BERT (Bidirectional Encoder Representations from Transformers), developed by Jacob Devlin and colleagues at Google, applied the Transformer architecture to bidirectional pre-training [12]. BERT introduced two pre-training objectives: the masked language model (MLM) task, where 15% of input tokens are randomly masked and the model learns to predict them, and next sentence prediction (NSP). BERT-Base had 110 million parameters (12 layers, 768 hidden units, 12 attention heads), while BERT-Large had 340 million parameters (24 layers, 1024 hidden units, 16 attention heads). Pre-trained on the BooksCorpus and English Wikipedia, BERT achieved new state-of-the-art results on 11 NLP benchmarks, pushing the GLUE benchmark score to 80.5% (a 7.7 percentage point improvement over the previous best).
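The masked-language-model objective can be probed directly with the `transformers` fill-mask pipeline, as in the sketch below; the example sentence and the predicted scores are illustrative.

```python
from transformers import pipeline

# bert-base-uncased predicts the token hidden behind [MASK].
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The capital of France is [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
# 'paris' is typically the top prediction
```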
OpenAI's Generative Pre-trained Transformer series took a different architectural approach from BERT, using a causal language model (left-to-right) rather than a bidirectional one.
GPT-1 (June 2018) had 117 million parameters and demonstrated that unsupervised pre-training followed by supervised fine-tuning could yield strong results across multiple NLP tasks [13].
GPT-2 (February 2019) scaled up to 1.5 billion parameters and surprised researchers with its text generation fluency. OpenAI initially withheld the full model over concerns about misuse, though the staged release proceeded over the following months [14].
GPT-3 (June 2020) reached 175 billion parameters and demonstrated remarkable few-shot and zero-shot capabilities [15]. Given just a few examples in the prompt, GPT-3 could perform tasks it had never been explicitly trained for, from translation to code generation. This "in-context learning" ability was a qualitative shift in how NLP models could be used.
GPT-4 (March 2023) added multimodal capabilities, accepting both text and image inputs. It showed substantial improvements in reasoning, factual accuracy, and safety over GPT-3 [16].
GPT-5 (August 2025) introduced adaptive reasoning, automatically selecting the appropriate reasoning depth for a given task, and expanded the context window to support over one million tokens.
T5 (Text-to-Text Transfer Transformer), developed by Colin Raffel and colleagues at Google, proposed a unified framework in which every NLP task is cast as a text-to-text problem [17]. Classification, translation, summarization, and question answering were all framed as taking a text input and producing a text output. This simplified the architecture and training pipeline. The largest T5 model had 11 billion parameters and achieved state-of-the-art results on GLUE, SuperGLUE, SQuAD, and CNN/Daily Mail benchmarks. T5 was trained on the Colossal Clean Crawled Corpus (C4), a cleaned version of Common Crawl data.
Measuring progress in NLP requires both automatic metrics and carefully designed benchmarks.
| Metric | Full Name | Primary Use | How It Works |
|---|---|---|---|
| BLEU | Bilingual Evaluation Understudy | Machine translation | Measures n-gram overlap between system output and reference translations; scores range from 0 to 1 |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation | Text summarization | Measures overlap of n-grams, word sequences, and word pairs between system and reference summaries |
| Perplexity | N/A | Language modeling | Measures how well a probability model predicts a sample; lower perplexity indicates a better model |
| F1 Score | N/A | Classification, NER, QA | Harmonic mean of precision and recall |
| METEOR | Metric for Evaluation of Translation with Explicit Ordering | Machine translation | Extends BLEU with stemming, synonymy, and word order |
BLEU, introduced by Kishore Papineni and colleagues at IBM in 2002, became the standard metric for machine translation evaluation [18]. It computes precision of n-gram matches between a candidate translation and one or more reference translations, with a brevity penalty for overly short outputs. Despite known limitations (it correlates imperfectly with human judgment, especially at the sentence level), BLEU remains widely used. A survey found that 82% of machine translation papers published between 2019 and 2020 evaluated using BLEU, even though over 108 alternative MT metrics had been proposed.
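A sentence-level BLEU score can be computed with NLTK as sketched below; because BLEU was designed as a corpus-level metric, smoothing is typically applied when scoring individual short sentences.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # one reference translation
candidate = ["the", "cat", "sat", "on", "the", "mat"]    # system output

# Smoothing avoids zero scores when some higher-order n-grams have no matches.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```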
ROUGE, developed by Chin-Yew Lin in 2004, is the standard metric for evaluating text summarization systems [19]. ROUGE-N measures n-gram recall, ROUGE-L measures the longest common subsequence, and ROUGE-W applies weighted longest common subsequence matching.
Perplexity is a standard metric for evaluating language models. It measures the exponentiated average negative log-likelihood of a sequence under the model. A language model with lower perplexity assigns higher probability to held-out test data, indicating better modeling of the language.
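As a minimal sketch, perplexity can be computed from the per-token probabilities a model assigns to a held-out text; the probabilities below are hypothetical.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Hypothetical per-token probabilities from two language models on the same text.
print(round(perplexity([0.25, 0.10, 0.50, 0.20]), 2))  # weaker model -> higher perplexity
print(round(perplexity([0.60, 0.40, 0.70, 0.55]), 2))  # stronger model -> lower perplexity
```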
GLUE (General Language Understanding Evaluation), introduced by Wang et al. in 2018, is a collection of nine tasks designed to test a model's language understanding capabilities [20]. Tasks include sentiment analysis (SST-2), textual entailment (MNLI, RTE), semantic similarity (STS-B, MRPC, QQP), and linguistic acceptability (CoLA). When GLUE was introduced, the best models scored around 69%. BERT pushed this to 80.5%, and by 2020, models had surpassed average human performance on the benchmark.
SuperGLUE, introduced by Wang et al. in 2019, was created because models had begun to saturate the original GLUE benchmark [21]. SuperGLUE includes more challenging tasks such as multi-sentence reading comprehension (MultiRC), word sense disambiguation (WiC), and causal reasoning (COPA). Like GLUE before it, SuperGLUE has also been largely saturated by modern LLMs.
More recent benchmarks have emerged to evaluate the broader capabilities of LLMs. MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects. BIG-Bench provides a diverse set of over 200 tasks. HELM (Holistic Evaluation of Language Models) evaluates models across multiple dimensions including accuracy, calibration, robustness, fairness, and efficiency.
NLP technology has become embedded in a wide range of products and services.
Modern search engines rely heavily on NLP to understand user queries and match them with relevant documents. In 2019, Google announced that BERT was being used to improve understanding of search queries, affecting roughly 10% of English-language searches. By 2025, search systems combine retrieval-augmented generation (RAG) with semantic similarity matching to provide more direct, conversational answers to user queries.
Conversational systems have progressed from simple rule-based chatbots to LLM-powered assistants capable of sustained, contextual dialogue. Customer service chatbots handle millions of interactions daily, resolving routine issues and escalating complex ones to human agents. General-purpose assistants like ChatGPT, Claude, and Google Gemini can write, analyze, code, and reason across a broad range of topics.
Neural machine translation powers services used by hundreds of millions of people. Google Translate supports over 130 languages. DeepL, launched in 2017, has gained recognition for high-quality translations, particularly among European languages. Real-time translation features are now integrated into video conferencing platforms, messaging apps, and web browsers.
NLP is increasingly used in clinical settings to extract information from unstructured medical records, support clinical decision-making, and assist with medical documentation. Major electronic health record vendors like Epic have released AI documentation tools that generate structured clinical notes from physician-patient conversations. NLP-powered systems analyze medical literature to support research and drug discovery. LLMs are also being used to provide health information to patients, though their use in clinical contexts requires careful validation.
Law firms and legal departments use NLP for contract analysis, legal research, document review during discovery, and due diligence. NLP systems can identify relevant clauses in thousands of contracts, flag potential risks, and extract key terms and conditions. Legal AI tools have reduced the time required for document review in large litigation matters from weeks to hours.
Social media platforms and online communities use NLP to detect hate speech, misinformation, spam, and other policy-violating content. These systems must operate at massive scale (billions of posts per day) and across many languages. The challenge of content moderation has grown alongside the scale of online platforms, and NLP-based systems remain imperfect, particularly for content that requires cultural context or understanding of sarcasm and irony.
Financial institutions use NLP for sentiment analysis of news and social media (to inform trading strategies), analysis of earnings calls and financial reports, fraud detection in communications, and regulatory compliance monitoring. Named entity recognition helps extract company names, financial figures, and dates from unstructured text.
Human language is inherently ambiguous at multiple levels. Lexical ambiguity arises when a word has multiple meanings: "bank" can refer to a financial institution or the edge of a river. Syntactic ambiguity occurs when a sentence can be parsed in multiple ways: "I saw the man with the telescope" could mean you used a telescope to see the man, or you saw a man who had a telescope. Pragmatic ambiguity involves interpreting the intended meaning behind an utterance, which often depends on context, shared knowledge, and social conventions. The phenomenon of crash blossoms (ambiguous newspaper headlines) illustrates how challenging syntactic ambiguity can be, even for humans.
Understanding language often requires world knowledge and common-sense reasoning that goes beyond what is explicitly stated. Consider the sentence "The trophy wouldn't fit in the suitcase because it was too big." Humans easily resolve "it" as referring to the trophy, but this requires knowledge about the relative sizes of trophies and suitcases. Despite progress, LLMs still struggle with certain types of common-sense reasoning, particularly those involving physical intuition, causal reasoning, and temporal understanding.
Of the roughly 7,000 languages spoken worldwide, NLP tools and resources are concentrated on a small fraction, primarily English, Chinese, and a handful of European languages. Low-resource languages lack the digitized text corpora, annotated datasets, and pre-trained models that high-resource languages benefit from. Creating linguistic resources for these languages is time-consuming and expensive. Research has shown that LLMs are more prone to generating harmful or factually incorrect content in low-resource languages compared to high-resource ones [22]. Cross-lingual transfer learning and multilingual models like mBERT and XLM-R have made some progress, but significant performance gaps remain.
NLP models learn from data that reflects the biases of the societies that produced it. Word embeddings trained on large web corpora have been shown to encode gender stereotypes (for example, associating "nurse" more strongly with women and "engineer" with men) [23]. These biases can propagate into downstream applications, leading to unfair or discriminatory outcomes in areas like hiring, lending, content moderation, and criminal justice. Mitigating bias is an active area of research, but it remains a fundamental challenge because bias is deeply woven into language itself.
LLMs sometimes generate text that is fluent and confident but factually incorrect, a problem commonly called "hallucination." Because these models learn statistical patterns in language rather than maintaining a verified knowledge base, they can produce plausible-sounding statements that are entirely fabricated. This is especially problematic in high-stakes domains like healthcare, law, and finance. Techniques like retrieval-augmented generation (RAG), which grounds model outputs in retrieved documents, and chain-of-thought prompting have helped reduce but not eliminate this problem.
Training large language models requires enormous computational resources. GPT-3's training is estimated to have consumed several thousand petaflop/s-days of compute. The energy consumption associated with training and running these models has raised environmental concerns. Researchers have called for greater transparency about the carbon footprint of NLP research and have explored techniques like model distillation, pruning, and efficient architectures to reduce computational costs.
As of early 2026, NLP is dominated by large language models built on the Transformer architecture. The field has undergone a remarkable consolidation: where a decade ago researchers designed task-specific models for each NLP problem, today a single pre-trained LLM can handle dozens of tasks through prompting and fine-tuning.
Context windows have expanded dramatically. Where early Transformer models were limited to 512 or 1,024 tokens, current frontier models from OpenAI, Anthropic, and Google support context windows of one million tokens or more. This expansion has enabled new use cases like analyzing entire codebases, processing lengthy legal documents, and maintaining extended multi-turn conversations.
Multimodal models have become standard at the frontier. Models like GPT-5, Gemini 3, and Claude Opus process not just text but also images, audio, and video. The boundary between NLP and computer vision has blurred substantially.
Open-weight models have also advanced significantly. Meta's Llama series, Mistral's models, and others have made capable language models available for local deployment and customization, enabling organizations with strict data privacy requirements to benefit from LLM capabilities.
The NLP market has grown rapidly in commercial terms. Industry estimates placed the global NLP market at approximately $30 billion in 2025, with projections pointing toward continued strong growth through the end of the decade [24].
Despite these advances, fundamental challenges persist. Bias, hallucination, multilingual equity, and the environmental cost of training remain active areas of research and concern. The field continues to evolve at a pace that makes even recent breakthroughs seem like distant history within a few years.
See also: Natural Language Processing terms
See also: Natural Language Processing Models