See also: Machine learning terms
Natural Language Processing (NLP) is the subfield of artificial intelligence and machine learning concerned with enabling computers to read, interpret, generate, and reason about human language in text and speech form. NLP sits at the intersection of computer science, linguistics, and statistics. It powers nearly every consumer-facing product that uses written or spoken language, from web search and machine translation to chatbots like ChatGPT, Claude, and Gemini.
This page is the gateway hub for NLP-related entries on the AI Wiki. It introduces the central ideas, surveys the modern landscape dominated by large language models, and provides a curated index of every NLP concept and model with its own dedicated wiki page.
brief history of natural language processing
The history of NLP spans roughly seven decades and is conventionally divided into four eras:
| Era | Years | Dominant approach | Representative milestones |
|---|---|---|---|
| Symbolic / rule-based | 1950s to late 1980s | Hand-written grammars and logic | Georgetown-IBM translation demo (1954), ELIZA (1966), SHRDLU (1970) |
| Statistical | late 1980s to 2010 | Probability theory, HMMs, n-gram models, CRFs | IBM Candide SMT, Penn Treebank (1993), maximum-entropy taggers |
| Neural / distributional | 2010 to 2017 | Word embeddings, RNNs, LSTMs, seq2seq | Word2Vec (2013), GloVe (2014), Bahdanau attention (2014), GNMT (2016) |
| Transformer / foundation-model | 2017 to present | Self-attention, Transformer, large-scale pre-training, instruction tuning | “Attention Is All You Need” (2017), BERT (2018), GPT-3 (2020), ChatGPT (2022), GPT-4 (2023) |
The field shifted decisively in 2017 when Google researchers published “Attention Is All You Need”, introducing the Transformer architecture. Almost every state-of-the-art NLP system since then is a Transformer variant.
foundational concepts
A handful of building blocks underlie nearly every modern NLP system.
tokens and tokenization
A token is the atomic unit a language model reads or writes. Tokens may be characters, whole words, or, most commonly, subword pieces. Modern models almost always use subword tokenization algorithms such as byte pair encoding (BPE), WordPiece, and SentencePiece / Unigram. Context limits and usage are counted in tokens, not words; English text averages roughly 0.75 words per token, or about four tokens for every three words.
embeddings
An embedding is a dense vector of real numbers representing meaning in a continuous space. Geometric proximity encodes semantic similarity: vectors for king and queen land near each other. Embeddings are produced by an embedding layer and live in an embedding space whose dimensionality is typically 256 to 12,288. See embedding vector and vector embeddings.
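The idea that geometric proximity encodes similarity can be made concrete with cosine similarity. The sketch below is illustrative only: the 4-dimensional vectors in `toy_embeddings` are made up for the example and do not come from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical hand-written vectors (assumption: real embeddings are learned
# and have hundreds or thousands of dimensions).
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.7, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9, 0.8],
}

sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king vs queen: {sim_royal:.3f}, king vs apple: {sim_fruit:.3f}")
```

With these toy values the king and queen vectors score much higher than king and apple, which is the behavior a trained embedding space is expected to show for related words.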
language models
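A language model assigns probabilities to sequences of tokens. Given a context, it can either score the likelihood of a continuation or sample new text. Three architectural families dominate: encoder-only (masked) models, decoder-only (causal or autoregressive) models, and encoder-decoder models.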
A bidirectional language model reads context from both sides at once and is well suited to understanding tasks; a unidirectional language model is well suited to generation.
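To illustrate what "assigning probabilities to sequences" means, here is a minimal sketch of a unidirectional bigram model estimated by counting. It is a classical stand-in, not a neural model; the function names and the `<s>` / `</s>` boundary markers are assumptions chosen for this example.

```python
import math
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Maximum-likelihood distribution over the next token given `prev`."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def sequence_log_prob(counts, sentence):
    """Sum of log-probabilities of each token given the one before it."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        probs = next_token_probs(counts, prev)
        if nxt not in probs:
            return float("-inf")  # unseen transition, no smoothing in this sketch
        log_prob += math.log(probs[nxt])
    return log_prob

model = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
print(next_token_probs(model, "the"))
print(math.exp(sequence_log_prob(model, "the cat sat")))
```

The same two operations, scoring a sequence and producing a next-token distribution to sample from, are what a Transformer language model provides, only with a far richer notion of context.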
Attention is a mechanism that lets a model weight different parts of its input when computing a representation. Self-attention lets every token in a sequence attend to every other token; multi-head self-attention runs many such operations in parallel so the model can capture different relations simultaneously. Bahdanau attention (2014) was the first widely cited attention mechanism in NLP; the modern scaled-dot-product version was popularized by the 2017 Transformer paper. Because attention has no inherent notion of word order, models add a positional encoding so position is preserved.
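A minimal sketch of scaled dot-product self-attention follows, assuming identity projections (queries, keys, and values are all the raw input vectors) and a single head. Real Transformers learn separate projection matrices; that simplification, and all function names here, are assumptions for illustration. The sinusoidal formula is the one described in the 2017 Transformer paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with Q = K = V = X (no learned weights)."""
    d = len(X[0])
    out = []
    for q in X:
        # Every token scores itself against every token in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

def sinusoidal_position(pos, d_model):
    """Sinusoidal positional encoding: sin on even indices, cos on odd ones."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(X)
out_reversed_input = self_attention(X[::-1])
```

Reversing the input order simply reverses the output rows, which shows why attention alone cannot tell positions apart and why a positional signal such as `sinusoidal_position` is added to the token vectors.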
tokenization in depth
The quality of a tokenizer has measurable effects on downstream performance, training cost, and multilingual coverage. The dominant approaches are summarized below.
| Algorithm | Idea | Used by |
|---|---|---|
| Whitespace / word | Split on spaces and punctuation | Classical NLP, early Word2Vec |
| Byte pair encoding (BPE) | Iteratively merge the most frequent adjacent symbol pair | GPT-2, GPT-3, GPT-4, LLaMA, most open-weights LLMs |
| WordPiece | BPE-like merging based on likelihood gain | BERT, DistilBERT |
| SentencePiece (Unigram) | Probabilistic model that prunes a candidate subword vocabulary | T5, ALBERT, XLNet, mBART |
| Byte-level BPE | Operates on raw UTF-8 bytes for full Unicode coverage | GPT-2 onward, Claude |
| Character / byte | One token per character or byte | ByT5, CANINE |
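The BPE row in the table above describes an iterative merge procedure. The following is a minimal sketch of how merge rules could be learned from a tiny corpus. It assumes whitespace-separated words, omits end-of-word markers, and breaks ties arbitrarily but deterministically; production tokenizers differ in these details, and the function names are hypothetical.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Return the learned merge rules and the final segmented vocabulary."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Most frequent pair; ties resolved by comparing the pairs themselves.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, vocab = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(vocab)
```

After two merges on this toy corpus the frequent word "low" has become a single symbol, while the rarer "lower" and "lowest" are still split into "low" plus trailing characters.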
Classical pipelines group word-level tokens into bigrams, trigrams, and the broader n-gram family that powered statistical language models from the 1990s through the early 2010s. A famous illustration of why raw text is hard to process is the crash blossom: an ambiguous newspaper headline that humans usually resolve from context but that confuses naive parsers.
word and sentence embeddings
Embedding methods evolved from static lookup tables to deeply contextual representations.
| Generation | Method | Year | Notes |
|---|---|---|---|
| Static word vectors | Word2Vec (CBOW, skip-gram) | 2013 | Mikolov et al., Google. Trained on raw text with a shallow network |
| Static word vectors | GloVe | 2014 | Pennington, Socher, Manning at Stanford. Factorizes the global co-occurrence matrix |
| Static word vectors | fastText | 2016 | Facebook AI Research. Adds character n-grams for morphology |
| Contextual | ELMo | 2018 | Bidirectional LSTM language model from AI2 |
| Contextual | BERT embeddings | 2018 | Each token vector depends on full sentence context |
| Sentence-level | Sentence-BERT (SBERT) | 2019 | Siamese BERT producing fixed-size sentence vectors |
| Production | OpenAI text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large | 2022 to 2024 | High-dimensional API embeddings for retrieval |
| Production | Cohere Embed, Voyage AI, Google Gemini Embeddings, BGE, E5, GTE | 2023 to 2026 | Open and commercial dense retrievers |
Beyond dense embeddings, NLP also uses sparse representations such as bag of words, TF-IDF, and BM25. These are the basis of classical information retrieval, and they remain competitive when combined with dense vectors in hybrid retrieval. A sparse feature is one whose values are mostly zero, which is the common case for one-hot or bag-of-words encodings.
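As a rough illustration of a sparse representation, the sketch below computes TF-IDF weights and stores each document as a dictionary holding only its non-zero terms. It assumes one common variant, term frequency times `log(N / df)`; libraries often use smoothed or normalized variants, and the function name is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (absent terms are zero)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors

docs = ["the cat sat", "the dog barked", "the cat purred"]
vectors = tfidf_vectors(docs)
print(vectors[0])
```

A term that appears in every document, such as "the" here, receives zero weight, while rarer terms receive more. Each dictionary holds only a few entries out of the full vocabulary, which is what makes the representation sparse.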
The Transformer introduced in “Attention Is All You Need” replaced recurrence with self-attention. It scales effectively on GPU and TPU hardware and has become the universal backbone of modern NLP. Major model families follow.
encoder-only (masked) models
| Model | Year | Organization | Highlights |
|---|---|---|---|
| BERT | 2018 | Google | Set state of the art on 11 NLU tasks; introduced masked language modeling at scale |
| RoBERTa | 2019 | Meta AI | BERT trained longer with more data and dynamic masking |
| ALBERT | 2019 | Google | Parameter sharing for compact BERT |
| DistilBERT | 2019 | Hugging Face | Knowledge-distilled BERT, 40% smaller |
| XLNet | 2019 | CMU and Google | Permutation language modeling |
| ELECTRA | 2020 | Google | Replaced-token detection objective |
| DeBERTa | 2020 | Microsoft | Disentangled attention and an enhanced mask decoder |
decoder-only (causal / autoregressive) models
| Model | Year | Organization | Notes |
|---|---|---|---|
| GPT-1 | 2018 | OpenAI | First Generative Pre-trained Transformer |
| GPT-2 | 2019 | OpenAI | 1.5 billion parameters; demonstrated zero-shot transfer |
| GPT-3 | 2020 | OpenAI | 175 billion parameters; popularized few-shot in-context learning |
| LaMDA | 2021 | Google | Dialogue-tuned 137 billion parameter model |
| PaLM | 2022 | Google | 540 billion parameters |
| LLaMA | 2023 | Meta | Open-weights foundation models |
| GPT-4 | 2023 | OpenAI | Multimodal; behind much of ChatGPT |
| Claude | 2023 to present | Anthropic | Constitutional AI alignment; Claude Opus 4.7 is the current frontier |
| Gemini | 2023 to present | Google DeepMind | Native multimodality; Gemini 3 is the latest line |
| Mistral, Mixtral | 2023 to present | Mistral AI | Efficient open-weights and mixture-of-experts |
| Falcon, Qwen, DeepSeek, GLM-4.5, Kimi, Phi, Gemma | 2023 to present | Various | Open-weights ecosystem |
| GPT-5 and successors | 2025 to present | OpenAI | Frontier reasoning models, including GPT-5.5 |
encoder-decoder models
| Model | Year | Organization | Notes |
|---|---|---|---|
| Original Transformer | 2017 | Google | Built for machine translation |
| T5 | 2019 | Google | “Text-to-text” unification of tasks |
| BART | 2019 | Meta | Denoising autoencoder for generation |
| mBART, mT5 | 2020 | Various | Multilingual encoder-decoders |
| Flan-T5, UL2 | 2022 | Google | Instruction-tuned T5 (Flan-T5) and mixture-of-denoisers pre-training (UL2) |
core NLP tasks
Classical and modern NLP share a backbone of canonical tasks. The same Transformer model can usually be fine-tuned or prompted to handle each.
| Task | Description | Example wiki entry |
|---|---|---|
| Text classification | Assign a label to a document, e.g. spam vs ham | Sentiment analysis |
| Named entity recognition (NER) | Identify spans referring to people, places, organizations, dates | CoNLL-2003, OntoNotes |
| Part-of-speech (POS) tagging | Label each token with its grammatical category | Penn Treebank tags |
| Syntactic parsing | Build a constituency or dependency tree | Universal Dependencies |
| Coreference resolution | Cluster mentions referring to the same entity | OntoNotes coreference |
| Word sense disambiguation | Pick the correct sense of an ambiguous word | WordNet senses |
| Sentiment analysis | Classify polarity, opinion, or emotion | SST, IMDB |
| Text summarization | Produce a shorter version preserving meaning | CNN/DailyMail, XSum |
| Machine translation | Translate text between languages | WMT |
| Question answering | Answer questions from context or open domain | SQuAD, NaturalQuestions |
| Information retrieval | Retrieve relevant documents for a query | MS MARCO, BEIR |
| Sequence-to-sequence generation | Map any input sequence to any output sequence | Translation, summarization, paraphrasing |
| Dialogue and chat | Multi-turn conversational generation | ChatGPT, Claude |
| Speech recognition | Convert audio to text | Whisper, Wav2Vec |
| Topic modeling | Discover latent themes in a corpus | Latent Dirichlet allocation |
Beyond these, many recent benchmarks evaluate higher-level skills such as code generation (HumanEval, MBPP), math word problems (GSM8K, MATH), and instruction following (IFEval).
natural language understanding vs natural language generation
NLP is often split into two halves: natural language understanding (NLU) covers reading and reasoning, served by encoder models like BERT and benchmarks like GLUE and SuperGLUE; natural language generation (NLG) covers producing fluent text, served by decoder models like GPT and benchmarks like MT-Bench. Modern frontier systems blend both.
modern LLM techniques
The rise of large language models introduced a set of techniques that did not exist in the pre-2018 NLP toolkit.
pre-training, fine-tuning, and alignment
Most modern NLP systems follow a two- or three-stage recipe:
- Pre-training on trillions of tokens of web text, code, and books, often from sources like Common Crawl, FineWeb, C4, The Pile, and proprietary corpora. The standard objectives are next-token prediction (causal) or masked-token reconstruction (encoder).
- Supervised fine-tuning (SFT) on a smaller curated dataset of high-quality instructions and demonstrations.
- Preference optimization or RLHF to align the model with human preferences. Variants include Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), and Constitutional AI.
Parameter-efficient fine-tuning saves compute and memory by training only a small subset of weights. The most common approaches are LoRA (Low-Rank Adaptation), QLoRA, and adapter modules. Multi-task pre-training is sometimes loosely described as meta-learning, and some pipelines use staged training to introduce data or capabilities in phases. Model merging lets practitioners combine separately trained checkpoints into a single model without further training.
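To show why LoRA trains so few weights, here is a minimal sketch of a low-rank update applied to a frozen weight matrix, using plain Python lists. The row-vector convention, the function names, and the example sizes are assumptions for illustration; the `alpha / r` scaling and the zero initialization of one factor follow the convention described in Hu et al.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_forward(x, W, A, B, alpha, r):
    """Compute x W + (alpha / r) * x A B, where W is frozen and A, B are trainable.

    Shapes: x is (n, d_in), W is (d_in, d_out), A is (d_in, r), B is (r, d_out).
    """
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    scale = alpha / r
    return [
        [b_val + scale * d_val for b_val, d_val in zip(base_row, delta_row)]
        for base_row, delta_row in zip(base, delta)
    ]

def trainable_params(d_in, d_out, r):
    """Compare full fine-tuning against a rank-r adapter for one weight matrix."""
    return {"full": d_in * d_out, "lora": r * (d_in + d_out)}

x = [[1.0, 2.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5], [0.5]]
B_zero = [[0.0, 0.0]]  # zero-initialized factor: the adapter starts as a no-op
print(lora_forward(x, W, A, B_zero, alpha=16, r=1))
print(trainable_params(4096, 4096, 8))
```

For a hypothetical 4096 by 4096 projection at rank 8, the adapter has 65,536 trainable values against roughly 16.8 million in the full matrix, which is where the memory savings come from.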
prompting
Prompt engineering is the practice of designing input text so that an LLM produces a desired output. Important techniques include:
- Zero-shot prompting: ask the model directly without examples.
- Few-shot or in-context learning: include several worked examples in the prompt.
- Chain-of-thought prompting: ask the model to reason step by step. Introduced by Wei et al. (2022).
- Meta prompting: use the model itself to design prompts.
- System prompt: a fixed instruction prepended to every conversation that sets persona, tone, and constraints.
- Structured output: force the model to emit JSON, XML, or another schema.
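Several of the techniques listed above come down to string assembly and output checking. The sketch below builds a few-shot prompt and validates a JSON reply. The `Q:` / `A:` layout, the function names, and the example strings are assumptions for illustration; no model is called, and real chat APIs typically pass the system prompt as a separate message.

```python
import json

def build_few_shot_prompt(system, examples, query):
    """Assemble a system instruction, worked examples, and the new query."""
    lines = [system, ""]
    for question, answer in examples:
        lines += [f"Q: {question}", f"A: {answer}", ""]
    lines += [f"Q: {query}", "A:"]
    return "\n".join(lines)

def parse_structured_output(raw, required_keys):
    """Return the parsed JSON object if it has all required keys, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not set(required_keys).issubset(obj):
        return None
    return obj

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("great film", "positive"), ("waste of time", "negative")],
    "dull plot",
)
print(prompt)
print(parse_structured_output('{"label": "negative"}', {"label"}))
```

Passing an empty `examples` list yields a zero-shot prompt, and a `None` result from `parse_structured_output` is a natural signal to retry the request.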
retrieval-augmented generation
Retrieval-augmented generation (RAG) combines an LLM with an external knowledge base. At query time, a retriever finds relevant documents using dense or sparse embeddings, and the LLM conditions on them to produce a grounded answer. Frameworks like LangChain and LlamaIndex provide building blocks for RAG pipelines, and vector databases such as Pinecone, Weaviate, Qdrant, and Milvus store embeddings at scale. RAG is now standard practice for chatbots that need fresh, factual, or proprietary information.
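The retrieve-then-generate flow can be sketched without any external service. The example below uses a toy bag-of-words retriever and stops at prompt assembly. Everything in it is an assumption for illustration: the documents, the prompt wording, and the function names are made up, and a production system would use learned embeddings, a vector database, and an actual LLM call.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercased word counts; a stand-in for a learned embedding."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_rag_prompt(query, documents, k=2):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

docs = [
    "Refunds are accepted within 30 days of purchase",
    "Shipping takes 5 business days",
    "Support is available by email",
]
question = "How many days do I have to request refunds?"
print(build_rag_prompt(question, docs))
```

The LLM then conditions on the retrieved passages rather than relying only on what it memorized during training, which is what grounds the answer.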
When an LLM can call external functions or APIs, it becomes capable of tool use. Higher-level AI agents plan, decompose tasks, and orchestrate multi-step actions. Subtypes include computer-use agents, AI browser agents, and agentic workflows. Persistent state is handled through agent memory and knowledge editing techniques.
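On the application side, tool use usually amounts to parsing a structured request from the model and dispatching it to ordinary code. The sketch below assumes a hypothetical JSON call format with `name` and `arguments` fields and a hand-written `TOOLS` registry; actual providers define their own schemas.

```python
import json

# Hypothetical tool registry: names the model is allowed to call.
TOOLS = {
    "add": lambda a, b: a + b,
    "word_count": lambda text: len(text.split()),
}

def run_tool_call(raw_call):
    """Execute one model-emitted tool call and return a result for the model."""
    call = json.loads(raw_call)
    name = call["name"]
    args = call.get("arguments", {})
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name](**args)}

print(run_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}'))
print(run_tool_call('{"name": "word_count", "arguments": {"text": "to be or not"}}'))
```

An agent loop repeats this step: the result is appended to the conversation, and the model decides whether to call another tool or produce a final answer.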
context windows and long-context techniques
A model’s context window is the maximum number of tokens it can attend to at once. GPT-3 launched with 2,048 tokens; Anthropic introduced 100,000-token windows for Claude in 2023; modern frontier systems including Claude Opus 4.7, Gemini 3 Pro, and GPT-5.5 handle context windows in the 1 million token range. Benchmarks such as LongBench and RULER measure long-context performance.
scaling and inference efficiency
Scaling laws, formalized by Kaplan et al. and refined by Chinchilla scaling, describe how loss falls predictably with more compute, data, and parameters. See also the Scaling Laws for Neural Language Models paper. On the inference side, optimization techniques include speculative decoding, tensor parallelism, model parallelism, pipelining, test-time compute scaling for reasoning models, quantization formats like GGUF, and runtimes like Ollama and LM Studio.
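Two rules of thumb from this literature make for a quick back-of-the-envelope calculation: training compute is commonly approximated as about 6 FLOPs per parameter per token, and the Chinchilla analysis suggests roughly 20 training tokens per parameter for compute-optimal training. Both are approximations rather than exact laws, and the function names below are hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute as C = 6 * N * D (a common rule of thumb)."""
    return 6 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal token budget; the ratio of 20 is an approximation."""
    return tokens_per_param * n_params

n_params = 70_000_000_000  # a 70-billion-parameter model
n_tokens = chinchilla_optimal_tokens(n_params)
print(f"tokens: {n_tokens:.2e}, FLOPs: {training_flops(n_params, n_tokens):.2e}")
```

For a 70-billion-parameter model this gives about 1.4 trillion tokens, which matches the scale at which Chinchilla itself was trained.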
decoding strategies
Given a probability distribution over the next token, a decoding strategy decides which token to actually emit. The choice strongly affects fluency, diversity, and factuality.
| Strategy | How it works | Typical use |
|---|---|---|
| Greedy decoding | Always pick the highest-probability token | Deterministic short answers |
| Beam search | Maintain the top k hypotheses at each step | Translation, summarization |
| Sampling | Draw a token according to the distribution | Creative generation |
| Temperature | Sharpen (low) or flatten (high) the distribution before sampling | Controls randomness |
| Top-k sampling | Sample only from the k most probable tokens | Reduces tail noise |
| Top-p (nucleus) sampling | Sample from the smallest set whose cumulative probability exceeds p | Default for many chatbots |
| Typical sampling | Sample tokens whose information content is close to the conditional entropy of the distribution | Coherent open-ended generation |
| Mirostat | Adaptively keep perplexity near a target | Avoids local repetition |
| Contrastive search | Combine likelihood with degeneration penalty | Long-form generation |
| Speculative decoding | Use a small draft model and verify with a large model | Faster inference, same distribution |
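Several rows of the table above can be sketched in a few lines each. The example below operates on a hand-written list of logits or probabilities rather than a real model, and the function names are assumptions. The nucleus filter stops once the cumulative probability reaches `p`.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; low temperature sharpens, high flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Index of the highest-probability token."""
    return max(range(len(probs)), key=probs.__getitem__)

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest prefix of tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample(dist, rng):
    """Draw one token index from a {token_index: probability} mapping."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)  # seeded so repeated runs draw the same token
print(greedy(probs), top_k_filter(probs, 2), top_p_filter(probs, 0.9))
print(sample(top_p_filter(probs, 0.9), rng))
```

In practice these filters are often chained: temperature is applied to the logits first, then top-k or top-p trims the tail, and a token is sampled from what remains.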
evaluation
NLP evaluation is hard because language tasks are open-ended. Common metrics and benchmarks include:
intrinsic metrics
- Perplexity: exponentiated cross-entropy of a language model on held-out text. Lower is better.
- Cross-entropy and bits-per-character / bits-per-byte.
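The intrinsic metrics above are simple transformations of the probabilities a model assigns to the tokens that actually occurred. The sketch below assumes those per-token probabilities are already available as a list; the function names and the byte count in the example are hypothetical.

```python
import math

def cross_entropy(token_probs):
    """Average negative log-likelihood per token, in nats."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Exponentiated cross-entropy; lower is better."""
    return math.exp(cross_entropy(token_probs))

def bits_per_token(token_probs):
    """Cross-entropy converted from nats to bits."""
    return cross_entropy(token_probs) / math.log(2)

def bits_per_byte(token_probs, n_bytes):
    """Total bits spent on the text divided by its length in bytes."""
    return bits_per_token(token_probs) * len(token_probs) / n_bytes

# A model that assigns probability 0.25 to each of four observed tokens.
uniform_guess = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_guess), bits_per_token(uniform_guess))
```

A perplexity of 4 here means the model is as uncertain as if it were choosing uniformly among four options at every step. Normalizing by bytes instead of tokens makes models with different tokenizers comparable.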
task metrics
- BLEU (Bilingual Evaluation Understudy): n-gram precision against reference translations.
- ROUGE: n-gram recall, used for summarization (ROUGE-1, ROUGE-2, ROUGE-L).
- METEOR: alignment-based metric using stemming and synonyms.
- chrF and chrF++: character-level F-score, robust across languages.
- BERTScore and BLEURT: model-based semantic similarity.
- F1, exact match, and accuracy for classification and QA.
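The n-gram overlap at the core of BLEU and ROUGE can be sketched as follows. This is deliberately partial: full BLEU combines clipped precisions for several n-gram orders with a brevity penalty, and ROUGE has further variants such as ROUGE-L. The function names and whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n=1):
    """BLEU-style precision: each candidate n-gram counts at most as often
    as it appears in the reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams found in the candidate."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

reference = "the cat sat on the mat"
print(clipped_precision("the cat", reference), rouge_n_recall("the cat", reference))
print(clipped_precision("the the the", "the cat sat"))
```

The short candidate "the cat" gets perfect precision but low recall, which is why BLEU adds a brevity penalty and why summarization work tends to report recall-oriented ROUGE.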
benchmark suites
A growing literature studies LLM-as-judge evaluation, where one model rates the outputs of another. MT-Bench is a widely used example, and Chatbot Arena complements it with crowdsourced human preference votes.
modalities and multimodal NLP
A modality is a channel of input or output. A multimodal model handles more than one. Frontier systems including GPT-4, Gemini, Claude Opus 4.7, and Llama 3.2 accept images, and in some cases audio, alongside text. Speech systems include Whisper and Wav2Vec.
applications
NLP underlies a wide range of products. Selected categories with representative wiki entries:
| Domain | Example systems |
|---|---|
| Chatbots and assistants | ChatGPT, Claude, Gemini, Grok, Kimi, Doubao, Microsoft 365 Copilot |
| Search and retrieval | Google Search, Perplexity, Bing Chat, You.com |
| Code generation | GitHub Copilot, Codex, Code Llama, StarCoder, Codestral |
| Translation | Google Translate, DeepL, Microsoft Translator, Meta NLLB |
| Writing tools | QuillBot, Grammarly, Notion AI |
| Speech | Whisper, Apple Dictation, Amazon Transcribe |
| Enterprise | Customer-support automation, contract review, LegalBench, PubMedQA-style domain assistants |
| Routing and aggregation | OpenRouter, LangChain, LlamaIndex, CrewAI |
organizations and ecosystems
The modern NLP ecosystem is shaped by a handful of frontier labs and a much larger open-weights community.
| Organization | Selected NLP work |
|---|---|
| OpenAI | GPT family, ChatGPT, Codex, Whisper, o1 and o3 reasoning models |
| Anthropic | Claude family, Constitutional AI, RLHF research |
| Google and Google DeepMind | BERT, T5, LaMDA, PaLM, Gemini, Gemma |
| Meta AI | LLaMA, Llama 3, Llama 4 Scout and Maverick, RoBERTa, BART |
| Mistral AI | Mistral, Mixtral, Codestral, Mistral Medium 3 |
| xAI | Grok, Grok 3, Grok 4 |
| DeepSeek, Moonshot AI, Baidu AI, Huawei AI, MiniMax, Doubao | Chinese frontier and open-weights labs |
| AI2, EleutherAI | Open research and open models |
| Inflection AI, Reka AI | Specialized assistants and multimodal models |
| Apple | Apple Foundation Models |
index of NLP term wiki pages
The original gateway list is preserved below. Each link points to a dedicated AI Wiki entry.
extended index of related NLP wiki pages
In addition to the original list, the following AI Wiki entries cover related NLP concepts, models, benchmarks, and tools.
| Topic area | Wiki pages |
|---|---|
| Foundational concepts | Embeddings, Vector embeddings, Positional encoding, Byte pair encoding, TF-IDF, Bahdanau attention, Sequence model, Similarity measure, Latent Dirichlet allocation, Information retrieval |
| Frontier and open-weights LLMs | GPT, GPT-1, GPT-2, GPT-3, GPT-4, GPT-5, GPT-5.5, Claude, Claude Opus 4.7, Gemini, Gemini 3 Pro, LLaMA, Llama 4 Scout and Maverick, Mistral AI, Mixtral, Falcon, Qwen3, Phi-4, Gemma 2, DeepSeek V4, Kimi K2, GLM-4.5, PaLM, LaMDA, Vicuna, BERT, RoBERTa, ALBERT, DeBERTa, XLNet, ELECTRA, ELMo, Grok 4, gpt-oss, OpenAI o3, OpenAI o-series |
| Reasoning, alignment, and training | Chain-of-thought prompting, In-context learning, Prompt engineering, Meta prompting, System prompt, RLHF, Supervised fine-tuning, LoRA, QLoRA, Knowledge editing, Test-time compute, Speculative decoding, Tensor parallelism, Model merging, Foundation models, Frontier models, Scaling laws, Chinchilla scaling |
| NLP tasks | Sentiment analysis, Named entity recognition, Text summarization, Machine translation, Question answering, Speech recognition, Whisper, Wav2Vec |
| Tools and infrastructure | LangChain, LlamaIndex, OpenRouter, Ollama, LM Studio, GGUF, CrewAI, Tool use, AI agents, AI browser agent, Computer-use agent, Agentic workflow, Agent memory, Structured output, Context window, RAG |
| Benchmarks | GLUE, SuperGLUE, MMLU-Pro, HellaSwag, BIG-Bench Hard, GSM8K, MATH, MBPP, TruthfulQA, TriviaQA, BoolQ, DROP, SimpleQA, GPQA Diamond, MGSM, LongBench, RULER, LiveBench, IFEval, PubMedQA, LegalBench, JailbreakBench, AdvBench, MT-Bench, BLEU, Perplexity |
| Datasets, organizations, applications | Common Crawl, FineWeb, OpenAI, Anthropic, Meta AI, xAI, DeepSeek, Moonshot AI, Baidu AI, Huawei AI, MiniMax, Reka AI, Inflection AI, AI2, EleutherAI, Apple Foundation Models, ChatGPT, Microsoft 365 Copilot, GitHub Copilot, Codex, Code Llama, Codestral, StarCoder, QuillBot, Perplexity |
further reading and external references
- Jurafsky, D., and Martin, J. H. Speech and Language Processing, 3rd ed. draft. https://web.stanford.edu/~jurafsky/slp3/.
- Stanford CS224n, https://web.stanford.edu/class/cs224n/.
- Eisenstein, J. Introduction to Natural Language Processing. MIT Press, 2019.
- Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, 2008.
- Wikipedia, “Natural language processing”, https://en.wikipedia.org/wiki/Natural_language_processing.
- Wikipedia, “Transformer (deep learning architecture)”.
- Wikipedia, “Large language model”.
- Hugging Face NLP Course, https://huggingface.co/learn/nlp-course.
- Google AI, Machine Learning Glossary: Language.
references
- Vaswani, A., et al. “Attention Is All You Need”. NeurIPS 30, 2017. arXiv:1706.03762.
- Devlin, J., et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. NAACL-HLT 2019. arXiv:1810.04805.
- Radford, A., et al. “Improving Language Understanding by Generative Pre-Training”. OpenAI, 2018.
- Brown, T., et al. “Language Models are Few-Shot Learners”. NeurIPS 2020. arXiv:2005.14165.
- Raffel, C., et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. JMLR 21, 2020.
- Mikolov, T., et al. “Efficient Estimation of Word Representations in Vector Space”. ICLR 2013.
- Pennington, J., Socher, R., and Manning, C. D. “GloVe: Global Vectors for Word Representation”. EMNLP 2014.
- Bahdanau, D., Cho, K., and Bengio, Y. “Neural Machine Translation by Jointly Learning to Align and Translate”. ICLR 2015.
- Sennrich, R., Haddow, B., and Birch, A. “Neural Machine Translation of Rare Words with Subword Units”. ACL 2016.
- Wei, J., et al. “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”. NeurIPS 2022. arXiv:2201.11903.
- Lewis, P., et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. NeurIPS 2020. arXiv:2005.11401.
- Ouyang, L., et al. “Training Language Models to Follow Instructions with Human Feedback”. NeurIPS 2022.
- Hu, E. J., et al. “LoRA: Low-Rank Adaptation of Large Language Models”. ICLR 2022. arXiv:2106.09685.
- Kaplan, J., et al. “Scaling Laws for Neural Language Models”. arXiv:2001.08361, 2020.
- Hoffmann, J., et al. “Training Compute-Optimal Large Language Models”. NeurIPS 2022. arXiv:2203.15556.
- Jurafsky, D., and Martin, J. H. Speech and Language Processing. 3rd ed. draft, Stanford.
- Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, 2008.
- Wikipedia contributors. “Natural language processing”. Wikipedia, accessed 2026-05-09.