Machine learning terms/Natural Language Processing
Last reviewed
Sources
No citations yet
Review status
Needs citations
Revision
v6 · 4,243 words
See also: Machine learning terms
Natural Language Processing (NLP) is the subfield of artificial intelligence and machine learning concerned with enabling computers to read, interpret, generate, and reason about human language in text and speech form.[18] The core machine-learning terms used across modern NLP are: tokens and tokenization (how text is split into model-readable units), embeddings (dense vectors that encode meaning), attention and the Transformer (the architecture behind nearly every state-of-the-art system since 2017),[1] language models (which assign probabilities to token sequences), and the evaluation vocabulary of perplexity, BLEU, and benchmark suites such as GLUE and MMLU. NLP sits at the intersection of computer science, linguistics, and statistics,[16] and it powers nearly every consumer-facing product that uses written or spoken language, from web search and machine translation to chatbots like ChatGPT, Claude, and Gemini.
This page is the gateway hub for NLP-related entries on the AI Wiki. It introduces the central ideas, surveys the modern landscape dominated by large language models, and provides a curated index of every NLP concept and model with its own dedicated wiki page.
The history of NLP spans roughly seven decades and is conventionally divided into four eras:
| Era | Years | Dominant approach | Representative milestones |
|---|---|---|---|
| Symbolic / rule-based | 1950s to late 1980s | Hand-written grammars and logic | Georgetown-IBM translation demo (1954), ELIZA (1966), SHRDLU (1970) |
| Statistical | late 1980s to 2010 | Probability theory, HMMs, n-gram models, CRFs | IBM Candide SMT, Penn Treebank (1993), maximum-entropy taggers |
| Neural / distributional | 2010 to 2017 | Word embeddings, RNNs, LSTMs, seq2seq | Word2Vec (2013), GloVe (2014), Bahdanau attention (2014), GNMT (2016) |
| Transformer / foundation-model | 2017 to present | Self-attention, Transformer, large-scale pre-training, instruction tuning | "Attention Is All You Need" (2017), BERT (2018), GPT-3 (2020), ChatGPT (2022), GPT-4 (2023) |
The field shifted decisively in 2017 when Google researchers published "Attention Is All You Need", introducing the Transformer architecture.[1] The paper proposed "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely", and reported 28.4 BLEU on WMT 2014 English-to-German and 41.8 BLEU on English-to-French, a new state of the art at the time.[1] Almost every state-of-the-art NLP system since then is a Transformer variant.
A handful of building blocks underlie nearly every modern NLP system.
A token is the atomic unit a language model reads or writes. Tokens may be characters, whole words, or, most commonly, subword pieces. Modern models almost always use subword tokenization algorithms such as byte pair encoding (BPE), WordPiece, and SentencePiece / Unigram.[9] Counting is done in tokens, not words. As a rough rule of thumb for English, OpenAI estimates 1 token is about 4 characters or roughly 0.75 words, so 100 tokens correspond to about 75 words.[19]
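The rule of thumb above can be turned into a quick back-of-the-envelope estimator. This is only a sketch: the function names are hypothetical, the ratios are the approximate English figures quoted above, and a real tokenizer will give different counts, especially for code or non-English text.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text from its character count."""
    return round(len(text) / chars_per_token)


def estimate_words(num_tokens: int, words_per_token: float = 0.75) -> int:
    """Rough word estimate for a given token budget."""
    return round(num_tokens * words_per_token)


sample = "Tokenization splits text into model-readable units."
print("estimated tokens:", estimate_tokens(sample))
print("words in a 100-token budget:", estimate_words(100))
```

An estimate like this is useful for budgeting prompts, but billing and context limits are always computed from the model's actual tokenizer.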
An embedding is a dense vector of real numbers representing meaning in a continuous space. Geometric proximity encodes semantic similarity: vectors for king and queen land near each other.[6] Embeddings are produced by an embedding layer and live in an embedding space whose dimensionality is typically 256 to 12,288. See embedding vector and vector embeddings.
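Geometric proximity is most often measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; the numbers are invented and do not come from any trained model, whose embeddings would have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (values invented for this example).
toy_embeddings = {
    "king": [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "banana": [0.10, 0.05, 0.95],
}

sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])
print(f"king~queen: {sim_royal:.3f}  king~banana: {sim_fruit:.3f}")
```

With these toy values, the king and queen vectors score much closer to 1 than the king and banana vectors, which is the behavior the paragraph above describes.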
A language model assigns probabilities to sequences of tokens. Given a context, it can either score the likelihood of a continuation or sample new text. Three architectural families dominate:
| Family | Direction | Training objective | Canonical example |
|---|---|---|---|
| Causal language model (autoregressive, left-to-right) | Unidirectional | Predict the next token | GPT, LLaMA, Claude |
| Masked language model | Bidirectional | Reconstruct masked tokens | BERT, RoBERTa, DeBERTa |
| Encoder-decoder (seq2seq) | Bidirectional encoder, unidirectional decoder | Map input sequence to output sequence | T5, BART, original Transformer |
A bidirectional language model reads context from both sides at once and is well suited to understanding tasks; a unidirectional language model is well suited to generation.
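To make "assigns probabilities to sequences of tokens" concrete, here is a minimal sketch of a count-based bigram model with add-one smoothing, together with perplexity, the evaluation metric mentioned in the introduction. It is a classical stand-in rather than a neural model: the corpus, function names, and whitespace tokenization are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count contexts and bigrams over whitespace-split sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # (previous, next) pairs
        vocab.update(tokens[1:])                 # tokens the model can predict
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def perplexity(sentence, contexts, bigrams, vocab):
    """exp of the average negative log-probability per predicted token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = sum(
        math.log(bigram_prob(p, w, contexts, bigrams, vocab))
        for p, w in zip(tokens, tokens[1:])
    )
    return math.exp(-log_prob / (len(tokens) - 1))


corpus = ["the cat sat", "the dog sat", "the cat ran"]
contexts, bigrams, vocab = train_bigram(corpus)
print(perplexity("the cat sat", contexts, bigrams, vocab))
print(perplexity("sat cat the", contexts, bigrams, vocab))
```

The word order seen in training receives the lower perplexity. A causal Transformer plays the same role but conditions on the whole preceding context rather than a single previous token.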
Attention is a mechanism that lets a model weight different parts of its input when computing a representation. Self-attention lets every token in a sequence attend to every other token; multi-head self-attention runs many such operations in parallel so the model can capture different relations simultaneously. Bahdanau attention (2014) was the first widely cited attention mechanism in NLP;[8] the modern scaled-dot-product version was popularized by the 2017 Transformer paper.[1] Because attention has no inherent notion of word order, models add a positional encoding so position is preserved.
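The scaled dot-product form can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: there is a single head, no masking, and the queries, keys, and values are the raw token vectors themselves, whereas a real Transformer first applies learned projection matrices. The input vectors are invented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)  # how strongly this query attends to each key
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Self-attention: queries, keys, and values all come from the same sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to 1, and shuffling the input tokens only shuffles the rows, which is why a positional encoding is needed to preserve word order.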
The quality of a tokenizer has measurable effects on downstream performance, training cost, and multilingual coverage. The dominant approaches are summarized below.
| Algorithm | Idea | Used by |
|---|---|---|
| Whitespace / word | Split on spaces and punctuation | Classical NLP, early Word2Vec |
| Byte pair encoding (BPE) | Iteratively merge the most frequent adjacent symbol pair | GPT-2, GPT-3, GPT-4, LLaMA, most open-weights LLMs |
| WordPiece | BPE-like merging based on likelihood gain | BERT, DistilBERT |
| SentencePiece (Unigram) | Probabilistic model that prunes a candidate subword vocabulary | T5, ALBERT, XLNet, mBART |
| Byte-level BPE | Operates on raw UTF-8 bytes for full Unicode coverage | GPT-2 onward, Claude |
| Character / byte | One token per character or byte | ByT5, CANINE |
Classical text processing also relies on bigrams, trigrams, and the broader n-gram family, which powered statistical language models from the 1990s through the early 2010s. A famous illustration of why raw text is hard to segment and parse is the crash blossom: an ambiguous newspaper headline that human readers can usually resolve from context but that confuses naive parsers.
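The "iteratively merge the most frequent adjacent symbol pair" idea behind byte pair encoding in the table above can be sketched as a toy trainer. This is a simplified illustration, not the implementation used by any particular model: it starts from characters rather than bytes, omits end-of-word markers, and uses an invented word-frequency table.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs over a {symbol_tuple: frequency} table."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(pair, words):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the learned merge list and the final segmentation of each word."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        words = merge_pair(best, words)
    return merges, words


# Hypothetical word frequencies for illustration.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_corpus, num_merges=2)
print(merges)
print(segmented)
```

On this toy table the frequent suffix "est" is fused into a single symbol after two merges, which shows how subword vocabularies pick up common morphemes without any linguistic rules.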
Embedding methods evolved from static lookup tables to deeply contextual representations.[7]
| Generation | Method | Year | Notes |
|---|---|---|---|
| Static word vectors | Word2Vec (CBOW, skip-gram) | 2013 | Mikolov et al., Google. Trained on raw text with a shallow network |
| Static word vectors | GloVe | 2014 | Pennington, Socher, Manning at Stanford. Factorizes the global co-occurrence matrix |
| Static word vectors | fastText | 2016 | Facebook AI Research. Adds character n-grams for morphology |
| Contextual | ELMo | 2018 | Bidirectional LSTM language model from AI2 |
| Contextual | BERT embeddings | 2018 | Each token vector depends on full sentence context |
| Sentence-level | Sentence-BERT (SBERT) | 2019 | Siamese BERT producing fixed-size sentence vectors |
| Production | OpenAI text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large | 2022 to 2024 | High-dimensional API embeddings for retrieval |
| Production | Cohere Embed, Voyage AI, Google Gemini Embeddings, BGE, E5, GTE | 2023 to 2026 | Open and commercial dense retrievers |
Beyond dense embeddings, NLP also uses sparse representations such as bag of words, TF-IDF, and BM25. These are the basis of classical information retrieval, and they remain competitive when combined with dense vectors in hybrid retrieval.[17] A sparse feature vector is one whose entries are mostly zero, which is the common case for one-hot or bag-of-words encodings.
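A minimal TF-IDF sketch shows what such a sparse representation looks like in practice. It assumes whitespace tokenization and the plain log(N / df) weighting; production systems typically use smoothed variants or BM25, and the documents here are invented.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document; absent terms are implicit zeros."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {term: round(weight, 3) for term, weight in vec.items()})
```

Storing only the terms that occur in a document is what makes the representation sparse. A word that appears in every document, such as "the" here, receives zero weight, while rarer words receive more.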
The Transformer introduced in "Attention Is All You Need" replaced recurrence with self-attention.[1] It scales effectively on GPU and TPU hardware and has become the universal backbone of modern NLP. Major model families follow.
Encoder-only and BERT-style models:
| Model | Year | Organization | Highlights |
|---|---|---|---|
| BERT | 2018 | Google | Set new state of the art on 11 NLU tasks and pushed the GLUE score to 80.5, a 7.7-point absolute gain; introduced masked language modeling at scale[2] |
| RoBERTa | 2019 | Meta AI | BERT trained longer with more data and dynamic masking |
| ALBERT | 2019 | Google | Parameter sharing for compact BERT |
| DistilBERT | 2019 | Hugging Face | Knowledge-distilled BERT, 40% smaller |
| XLNet | 2019 | CMU and Google | Permutation language modeling |
| ELECTRA | 2020 | Google | Replaced-token detection objective |
| DeBERTa | 2020 | Microsoft | Disentangled attention and enhanced masked decoding |
Decoder-only (causal) models:
| Model | Year | Organization | Notes |
|---|---|---|---|
| GPT-1 | 2018 | OpenAI | First Generative Pre-trained Transformer |
| GPT-2 | 2019 | OpenAI | 1.5 billion parameters; demonstrated zero-shot transfer |
| GPT-3 | 2020 | OpenAI | 175 billion parameters (10x larger than any prior dense LM), trained on about 570 GB of filtered text; popularized few-shot in-context learning[4] |
| LaMDA | 2021 | Google | Dialogue-tuned 137 billion parameter model |
| PaLM | 2022 | Google | 540 billion parameters |
| LLaMA | 2023 | Meta | Open-weights foundation models |
| GPT-4 | 2023 | OpenAI | Multimodal; behind much of ChatGPT |
| Claude | 2023 to present | Anthropic | Constitutional AI alignment; Claude Opus 4.7 is the current frontier |
| Gemini | 2023 to present | Google DeepMind | Native multimodality; Gemini 3 is the latest line |
| Mistral, Mixtral | 2023 to present | Mistral AI | Efficient open-weights and mixture-of-experts |
| Falcon, Qwen, DeepSeek, GLM-4.5, Kimi, Phi, Gemma | 2023 to present | Various | Open-weights ecosystem |
| GPT-5 and successors | 2025 to present | OpenAI | Frontier reasoning models; GPT-5.5 |
Encoder-decoder models:
| Model | Year | Organization | Notes |
|---|---|---|---|
| Original Transformer | 2017 | Google | Built for machine translation |
| T5 | 2019 | Google | "Text-to-text" unification of tasks |
| BART | 2019 | Meta | Denoising autoencoder for generation |
| mBART, mT5 | 2020 | Various | Multilingual encoder-decoders |
| Flan-T5, UL2 | 2022 | Google | Instruction-tuned T5 (Flan-T5); mixture-of-denoisers pre-training (UL2) |
Classical and modern NLP share a backbone of canonical tasks. The same Transformer model can usually be fine-tuned or prompted to handle each.[5]
| Task | Description | Example wiki entry |
|---|---|---|
| Text classification | Assign a label to a document, e.g. spam vs ham | Sentiment analysis |
| Named entity recognition (NER) | Identify spans referring to people, places, organizations, dates | CoNLL-2003, OntoNotes |
| Part-of-speech (POS) tagging | Label each token with its grammatical category | Penn Treebank tags |
| Syntactic parsing | Build a constituency or dependency tree | Universal Dependencies |
| Coreference resolution | Cluster mentions referring to the same entity | OntoNotes coreference |
| Word sense disambiguation | Pick the correct sense of an ambiguous word | WordNet senses |
| Sentiment analysis | Classify polarity, opinion, or emotion | SST, IMDB |
| Text summarization | Produce a shorter version preserving meaning | CNN/DailyMail, XSum |
| Machine translation | Translate text between languages | WMT |
| Question answering | Answer questions from context or open domain | SQuAD, NaturalQuestions |
| Information retrieval | Retrieve relevant documents for a query | MS MARCO, BEIR |
| Sequence-to-sequence generation | Map any input sequence to any output sequence | Translation, summarization, paraphrasing |
| Dialogue and chat | Multi-turn conversational generation | ChatGPT, Claude |
| Speech recognition | Convert audio to text | Whisper, Wav2Vec |
| Topic modeling | Discover latent themes in a corpus | Latent Dirichlet allocation |
Beyond these, many recent benchmarks evaluate higher-level skills such as code generation (HumanEval, MBPP), math word problems (GSM8K, MATH), and instruction following (IFEval).
NLP is often split into two halves: natural language understanding (NLU) covers reading and reasoning, served by encoder models like BERT[2] and benchmarks like GLUE and SuperGLUE; natural language generation (NLG) covers producing fluent text, served by decoder models like GPT and benchmarks like MT-Bench. Modern frontier systems blend both.
The rise of large language models introduced a set of techniques that did not exist in the pre-2018 NLP toolkit.
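Most modern NLP systems follow a two- or three-stage recipe: large-scale pre-training on unlabeled text, then fine-tuning on task or instruction data, and, for many chat models, a final alignment stage such as RLHF.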
Parameter-efficient fine-tuning saves compute and memory by training only a small subset of weights. The most common approaches are LoRA (Low-Rank Adaptation), QLoRA, and adapter modules.[13] Multi-task pre-training is sometimes loosely described as meta-learning, although meta-learning strictly refers to learning how to learn across tasks. Some pipelines use staged training to introduce data or capabilities in phases. Model merging lets practitioners combine separately trained checkpoints into a single model without further training.
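The savings from LoRA follow from simple arithmetic. LoRA freezes a weight matrix of shape d_out x d_in and learns a low-rank update B A, where B is d_out x r and A is r x d_in. The sketch below counts trainable parameters for one layer; the hidden size of 4096 and rank of 8 are assumed values chosen for illustration, not figures from any specific model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable weights when the whole matrix W (d_out x d_in) is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights for the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


hidden = 4096  # assumed layer width
rank = 8       # assumed LoRA rank

full = full_finetune_params(hidden, hidden)
lora = lora_params(hidden, hidden, rank)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the adapter trains well under one percent of the layer's weights, which is where the memory savings come from.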
Prompt engineering is the practice of designing input text so that an LLM produces a desired output. Important techniques include zero-shot prompting, few-shot in-context learning, and chain-of-thought prompting.
Retrieval-augmented generation (RAG) combines an LLM with an external knowledge base.[11] At query time, a retriever finds relevant documents using dense or sparse embeddings, and the LLM conditions on them to produce a grounded answer. Frameworks like LangChain and LlamaIndex provide building blocks for RAG pipelines, and vector databases such as Pinecone, Weaviate, Qdrant, and Milvus store embeddings at scale. RAG is now standard practice for chatbots that need fresh, factual, or proprietary information.
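The retrieve-then-generate flow can be sketched without any framework. In this toy version the retriever ranks passages by word overlap with the query, a stand-in for the dense or sparse embedding search described above, and the final step only builds the prompt string instead of calling a model. The knowledge base, function names, and prompt wording are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"\w+", text.lower())


def overlap_score(query, document):
    """Number of query tokens that also occur in the document."""
    return sum((Counter(tokenize(query)) & Counter(tokenize(document))).values())


def retrieve(query, documents, top_k=2):
    """Return the top_k documents ranked by overlap with the query."""
    ranked = sorted(documents, key=lambda d: overlap_score(query, d), reverse=True)
    return ranked[:top_k]


def build_prompt(query, passages):
    """Assemble a grounded prompt; a real system would send this to an LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


knowledge_base = [
    "A context window is the maximum number of tokens a model can attend to.",
    "Byte pair encoding merges frequent symbol pairs.",
    "Beam search keeps the top hypotheses at each step.",
]
question = "What is a context window?"
prompt = build_prompt(question, retrieve(question, knowledge_base, top_k=1))
print(prompt)
```

Swapping `overlap_score` for an embedding similarity and the list for a vector database gives the general shape of a production pipeline.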
When an LLM can call external functions or APIs, it becomes capable of tool use. Higher-level AI agents plan, decompose tasks, and orchestrate multi-step actions. Subtypes include computer-use agents, AI browser agents, and agentic workflows. Persistent state is handled through agent memory and knowledge editing techniques.
A model's context window is the maximum number of tokens it can attend to at once. GPT-3 launched with 2,048 tokens; Claude 2 introduced 100,000-token windows in 2023; modern frontier systems including Claude Opus 4.7, Gemini 3 Pro, and GPT-5.5 handle context windows in the 1 million token range. Benchmarks such as LongBench and RULER measure long-context performance.
Scaling laws, formalized by Kaplan et al.[14] and refined by Chinchilla scaling,[15] describe how loss falls predictably with more compute, data, and parameters. The Chinchilla result showed that for a fixed compute budget, model size and training tokens should be scaled in equal proportion: a 70 billion parameter Chinchilla trained on 4x more data outperformed the 280 billion parameter Gopher on nearly every task.[15] See also the Scaling Laws for Neural Language Models paper. On the inference side, optimization techniques include speculative decoding, tensor parallelism, model parallelism, pipelining, test-time compute scaling for reasoning models, quantization formats like GGUF, and runtimes like Ollama and LM Studio.
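A rough calculation illustrates why Chinchilla and Gopher are considered comparable in training cost. The sketch uses the common approximation that training compute is about 6 x parameters x tokens. The token counts (about 300 billion for Gopher and about 1.4 trillion for Chinchilla) are assumptions taken from the published reports, not from the text above.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute: C ~ 6 * N * D floating-point operations."""
    return 6 * params * tokens


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


# Assumed figures: (parameters, training tokens).
gopher = (280e9, 300e9)
chinchilla = (70e9, 1.4e12)

gopher_flops = training_flops(*gopher)
chinchilla_flops = training_flops(*chinchilla)
print(f"Gopher:     {gopher_flops:.2e} FLOPs, {tokens_per_param(*gopher):.1f} tokens/param")
print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs, {tokens_per_param(*chinchilla):.1f} tokens/param")
```

Under these assumptions the two budgets are within about 20 percent of each other, yet Chinchilla sees roughly 20 tokens per parameter against roughly 1 for Gopher, which is the reallocation the Chinchilla result argues for.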
Given a probability distribution over the next token, a decoding strategy decides which token to actually emit. The choice strongly affects fluency, diversity, and factuality.
| Strategy | How it works | Typical use |
|---|---|---|
| Greedy decoding | Always pick the highest-probability token | Deterministic short answers |
| Beam search | Maintain the top k hypotheses at each step | Translation, summarization |
| Sampling | Draw a token according to the distribution | Creative generation |
| Temperature | Sharpen (low) or flatten (high) the distribution before sampling | Controls randomness |
| Top-k sampling | Sample only from the k most probable tokens | Reduces tail noise |
| Top-p (nucleus) sampling | Sample from the smallest set whose cumulative probability exceeds p | Default for many chatbots |
| Typical sampling | Sample tokens whose information content is close to the conditional entropy | Coherent open-ended generation |
| Mirostat | Adaptively keep perplexity near a target | Avoids local repetition |
| Contrastive search | Combine likelihood with degeneration penalty | Long-form generation |
| Speculative decoding | Use a small draft model and verify with a large model | Faster inference, same distribution |
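Several of the strategies in the table can be sketched over a toy next-token distribution. The code below is an illustration only: the logits and probabilities are invented, token identities are just list indices, and the top-p cutoff stops as soon as the cumulative probability reaches or exceeds p.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax of logits / temperature; low values sharpen, high values flatten."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def greedy(probs):
    """Index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample(probs, rng):
    """Draw one token index according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]  # invented next-token distribution
rng = random.Random(0)
print("greedy:", greedy(probs))
print("top-k (k=2):", top_k_filter(probs, 2))
print("top-p (p=0.75):", top_p_filter(probs, 0.75))
print("sampled:", sample(top_p_filter(probs, 0.75), rng))
```

These filters compose: a typical pipeline applies temperature first, then top-k or top-p, and only then samples.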
NLP evaluation is hard because language tasks are open-ended. Common metrics and benchmarks include:
| Benchmark | Year | What it measures |
|---|---|---|
| GLUE | 2018 | General NLU |
| SuperGLUE | 2019 | Harder NLU |
| MMLU and MMLU-Pro | 2020 to 2024 | 57-subject multiple-choice knowledge |
| TruthfulQA | 2021 | Truthfulness on misleading questions |
| HellaSwag | 2019 | Commonsense sentence completion |
| BIG-Bench Hard | 2022 | 23 hard reasoning tasks |
| HumanEval | 2021 | Python code generation |
| MBPP | 2021 | Mostly basic Python problems |
| GSM8K | 2021 | Grade-school math word problems |
| MATH | 2021 | Competition mathematics |
| BoolQ, DROP, TriviaQA, SimpleQA | various | Reading comprehension and QA |
| GPQA Diamond | 2023 | Graduate-level science questions |
| MGSM | 2022 | Multilingual grade-school math |
| PubMedQA, LegalBench | various | Domain-specific QA |
| LongBench, RULER | 2023 to 2024 | Long-context retrieval and reasoning |
| LiveBench | 2024 | Continuously refreshed contamination-resistant tasks |
| MT-Bench | 2023 | LLM-as-judge multi-turn dialogue |
| JailbreakBench, AdvBench | 2023 to 2024 | Robustness to adversarial prompts |
A growing literature studies LLM-as-judge evaluation, where one model rates the outputs of another, with MT-Bench and Chatbot Arena leading the way.
A modality is a channel of input or output. A multimodal model handles more than one. Frontier systems including GPT-4, Gemini, Claude Opus 4.7, and Llama 3.2 accept images, and in some cases audio, alongside text. Speech systems include Whisper and Wav2Vec.
NLP underlies a wide range of products. Selected categories with representative wiki entries:
| Domain | Example systems |
|---|---|
| Chatbots and assistants | ChatGPT, Claude, Gemini, Grok, Kimi, Doubao, Microsoft 365 Copilot |
| Search and retrieval | Google Search, Perplexity, Bing Chat, You.com |
| Code generation | GitHub Copilot, Codex, Code Llama, StarCoder, Codestral |
| Translation | Google Translate, DeepL, Microsoft Translator, Meta NLLB |
| Writing tools | QuillBot, Grammarly, Notion AI |
| Speech | Whisper, Apple Dictation, Amazon Transcribe |
| Enterprise | Customer-support automation, contract review, LegalBench, PubMedQA-style domain assistants |
| Routing and aggregation | OpenRouter, LangChain, LlamaIndex, CrewAI |
The modern NLP ecosystem is shaped by a handful of frontier labs and a much larger open-weights community.
| Organization | Selected NLP work |
|---|---|
| OpenAI | GPT family, ChatGPT, Codex, Whisper, o1 and o3 reasoning models |
| Anthropic | Claude family, Constitutional AI, RLHF research |
| Google and Google DeepMind | BERT, T5, LaMDA, PaLM, Gemini, Gemma |
| Meta AI | LLaMA, Llama 3, Llama 4 Scout and Maverick, RoBERTa, BART |
| Mistral AI | Mistral, Mixtral, Codestral, Mistral Medium 3 |
| xAI | Grok, Grok 3, Grok 4 |
| DeepSeek, Moonshot AI, Baidu AI, Huawei AI, MiniMax, Doubao | Chinese frontier and open-weights labs |
| AI2, EleutherAI | Open research and open models |
| Inflection AI, Reka AI | Specialized assistants and multimodal models |
| Apple | Apple Foundation Models |
The original gateway list is preserved below. Each link points to a dedicated AI Wiki entry.
In addition to the original list, the following AI Wiki entries cover related NLP concepts, models, benchmarks, and tools.