# Machine learning terms/Natural Language Processing

> Source: https://aiwiki.ai/wiki/machine_learning_terms_natural_language_processing
> Updated: 2026-06-25
> Categories: Large Language Models, Machine Learning, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Natural Language Processing** (NLP) is the subfield of [artificial intelligence](/wiki/artificial_intelligence) and [machine learning](/wiki/machine_learning) concerned with enabling computers to read, interpret, generate, and reason about human language in text and speech form.[18] The core machine-learning terms used across modern NLP are: [tokens](/wiki/token) and [tokenization](/wiki/tokenization) (how text is split into model-readable units), [embeddings](/wiki/embeddings) (dense vectors that encode meaning), [attention](/wiki/attention) and the [Transformer](/wiki/transformer) (the architecture behind nearly every state-of-the-art system since 2017),[1] [language models](/wiki/language_model) (which assign probabilities to token sequences), and the evaluation vocabulary of [perplexity](/wiki/perplexity), [BLEU](/wiki/bleu), and benchmark suites such as [GLUE](/wiki/glue_benchmark) and [MMLU](/wiki/mmlu). NLP sits at the intersection of [computer science](/wiki/computer_science), [linguistics](/wiki/linguistics), and statistics,[16] and it powers nearly every consumer-facing product that uses written or spoken language, from web search and machine translation to chatbots like [ChatGPT](/wiki/chatgpt), [Claude](/wiki/claude), and [Gemini](/wiki/gemini).

This page is the gateway hub for NLP-related entries on the AI Wiki. It introduces the central ideas, surveys the modern landscape dominated by [large language models](/wiki/large_language_model), and provides a curated index of every NLP concept and model with its own dedicated wiki page.

## How did natural language processing evolve?

The history of NLP spans roughly seven decades and is conventionally divided into four eras:

| Era | Years | Dominant approach | Representative milestones |
|---|---|---|---|
| Symbolic / rule-based | 1950s to late 1980s | Hand-written grammars and logic | Georgetown-IBM translation demo (1954), [ELIZA](/wiki/eliza) (1966), SHRDLU (1970) |
| Statistical | late 1980s to 2010 | Probability theory, [HMMs](/wiki/hidden_markov_model), [n-gram](/wiki/n-gram) models, [CRFs](/wiki/conditional_random_field) | IBM Candide SMT, Penn Treebank (1993), maximum-entropy taggers |
| Neural / distributional | 2010 to 2017 | [Word embeddings](/wiki/word_embedding), [RNNs](/wiki/recurrent_neural_network), [LSTMs](/wiki/long_short-term_memory), [seq2seq](/wiki/sequence-to-sequence_task) | Word2Vec (2013), GloVe (2014), Bahdanau attention (2014), GNMT (2016) |
| Transformer / foundation-model | 2017 to present | [Self-attention](/wiki/self-attention), [Transformer](/wiki/transformer), large-scale pre-training, instruction tuning | "Attention Is All You Need" (2017), [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) (2018), [GPT-3](/wiki/gpt-3) (2020), [ChatGPT](/wiki/chatgpt) (2022), [GPT-4](/wiki/gpt-4) (2023) |

The field shifted decisively in 2017 when Google researchers published "[Attention Is All You Need](/wiki/attention_is_all_you_need_transformer)", introducing the [Transformer](/wiki/transformer) architecture.[1] The paper proposed "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely", and reported 28.4 BLEU on WMT 2014 English-to-German and 41.8 BLEU on English-to-French, a new state of the art at the time.[1] Almost every state-of-the-art NLP system since then is a Transformer variant.

## What are the core NLP concepts and building blocks?

A handful of building blocks underlie nearly every modern NLP system.

### tokens and tokenization

A [token](/wiki/token) is the atomic unit a language model reads or writes. Tokens may be characters, whole words, or, most commonly, subword pieces. Modern models almost always use subword [tokenization](/wiki/tokenization) algorithms such as [byte pair encoding](/wiki/byte_pair_encoding) (BPE), WordPiece, and SentencePiece / Unigram.[9] Model limits such as context length are measured in tokens, not words. As a rough rule of thumb for English, OpenAI estimates 1 token is about 4 characters or roughly 0.75 words, so 100 tokens correspond to about 75 words.[19]
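The rule of thumb above can be turned into a quick estimator. This is only a rough sketch for English text: the function name `estimate_tokens` is hypothetical, and a real tokenizer will give different counts.

```python
def estimate_tokens(text: str) -> dict:
    """Estimate a token count two ways from the English rule of thumb."""
    chars = len(text)
    words = len(text.split())
    return {
        "from_chars": round(chars / 4),     # about 4 characters per token
        "from_words": round(words / 0.75),  # about 0.75 words per token
    }


sample = "Natural language processing turns text into tokens."
print(estimate_tokens(sample))
```

The two estimates usually disagree slightly, which is a reminder that they are heuristics rather than a substitute for running the model's own tokenizer.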

### embeddings

An [embedding](/wiki/embeddings) is a dense vector of real numbers representing meaning in a continuous space. Geometric proximity encodes semantic similarity: vectors for *king* and *queen* land near each other.[6] Embeddings are produced by an [embedding layer](/wiki/embedding_layer) and live in an [embedding space](/wiki/embedding_space) whose dimensionality is typically 256 to 12,288. See [embedding vector](/wiki/embedding_vector) and [vector embeddings](/wiki/vector_embeddings).
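Geometric proximity is most often measured with cosine similarity. The sketch below uses hand-picked 4-dimensional vectors that are purely hypothetical (real embeddings are learned and have hundreds or thousands of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand for illustration only.
toy_embeddings = {
    "king": [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.7, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9, 0.8],
}

for word in ("queen", "apple"):
    score = cosine_similarity(toy_embeddings["king"], toy_embeddings[word])
    print(f"king vs {word}: {score:.3f}")
```

With these toy values, *king* scores much higher against *queen* than against *apple*, which is the behavior the paragraph above describes.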

### language models

A [language model](/wiki/language_model) assigns probabilities to sequences of tokens. Given a context, it can either score the likelihood of a continuation or sample new text. Three architectural families dominate:

| Family | Direction | Training objective | Canonical example |
|---|---|---|---|
| [Causal language model](/wiki/causal_language_model) (autoregressive, left-to-right) | [Unidirectional](/wiki/unidirectional) | Predict the next token | [GPT](/wiki/gpt_generative_pre-trained_transformer), [LLaMA](/wiki/llama), [Claude](/wiki/claude) |
| [Masked language model](/wiki/masked_language_model) | [Bidirectional](/wiki/bidirectional) | Reconstruct masked tokens | [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [RoBERTa](/wiki/roberta), [DeBERTa](/wiki/deberta) |
| Encoder-decoder (seq2seq) | Bidirectional encoder, unidirectional decoder | Map input sequence to output sequence | [T5](/wiki/t5), BART, original [Transformer](/wiki/transformer) |

A [bidirectional language model](/wiki/bidirectional_language_model) reads context from both sides at once and is well suited to understanding tasks; a [unidirectional language model](/wiki/unidirectional_language_model) is well suited to generation.

### attention and transformers

[Attention](/wiki/attention) is a mechanism that lets a model weight different parts of its input when computing a representation. [Self-attention](/wiki/self-attention) lets every token in a sequence attend to every other token; [multi-head self-attention](/wiki/multi-head_self-attention) runs many such operations in parallel so the model can capture different relations simultaneously. [Bahdanau attention](/wiki/bahdanau_attention) (2014) was the first widely cited attention mechanism in NLP;[8] the modern scaled-dot-product version was popularized by the 2017 Transformer paper.[1] Because attention has no inherent notion of word order, models add a [positional encoding](/wiki/positional_encoding) so position is preserved.
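As a minimal sketch of scaled dot-product attention, the code below computes softmax(QK^T / sqrt(d_k))V with plain Python lists. It assumes a single head and, for simplicity, feeds the same toy vectors in as queries, keys, and values; a real Transformer layer derives these from learned projection matrices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Each query attends to every key; the output mixes the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)  # one weight per position, summing to 1
        mixed = [
            sum(w * v[col] for w, v in zip(weights, values))
            for col in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs


# Hypothetical 2-dimensional token vectors (self-attention: Q = K = V).
x = [[1.0, 0.0], [0.0, 1.0]]
print(scaled_dot_product_attention(x, x, x))
```

Nothing in this computation depends on the order of the rows in `x`, which is why position information has to be added separately.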

## How does tokenization work in detail?

The quality of a tokenizer has measurable effects on downstream performance, training cost, and multilingual coverage. The dominant approaches are summarized below.

| Algorithm | Idea | Used by |
|---|---|---|
| Whitespace / word | Split on spaces and punctuation | Classical NLP, early Word2Vec |
| [Byte pair encoding](/wiki/byte_pair_encoding) (BPE) | Iteratively merge the most frequent adjacent symbol pair | [GPT-2](/wiki/gpt-2), [GPT-3](/wiki/gpt-3), [GPT-4](/wiki/gpt-4), [LLaMA](/wiki/llama), most open-weights LLMs |
| WordPiece | BPE-like merging based on likelihood gain | [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [DistilBERT](/wiki/distilbert) |
| SentencePiece (Unigram) | Probabilistic model that prunes a candidate subword vocabulary | [T5](/wiki/t5), [ALBERT](/wiki/albert), XLNet, mBART |
| Byte-level BPE | Operates on raw UTF-8 bytes for full Unicode coverage | [GPT-2](/wiki/gpt-2) onward, [Claude](/wiki/claude) |
| Character / byte | One token per character or byte | ByT5, CANINE |
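The BPE row above ("iteratively merge the most frequent adjacent symbol pair") can be sketched in a few lines. This is a toy trainer under simplifying assumptions: the word counts are hypothetical, ties are broken by first occurrence, and there is no end-of-word marker or byte-level handling as in production tokenizers.

```python
from collections import Counter


def count_pairs(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from word frequencies."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(corpus)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        corpus = merge_pair(corpus, best)
    return merges, corpus


# Hypothetical word counts for illustration.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

Frequent endings such as `est` become single symbols after a couple of merges, which is how subword vocabularies come to capture common morphemes.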

Before subword methods, the statistical language models that dominated from the 1990s through the early 2010s were built on word-level [n-gram](/wiki/n-gram) sequences, most often [bigrams](/wiki/bigram) (two consecutive tokens) and [trigrams](/wiki/trigram) (three). Segmenting text correctly does not resolve ambiguity, however. A famous illustration is the [crash blossom](/wiki/crash_blossom): an ambiguous newspaper headline that humans usually resolve from context but that confuses naive parsers.
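To make the n-gram idea concrete, here is a small sketch that extracts n-grams from a whitespace-tokenized sentence and computes an unsmoothed maximum-likelihood bigram probability. The helper names and the toy sentence are hypothetical, and practical n-gram models add smoothing for unseen pairs.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_probability(tokens, prev, word):
    """Unsmoothed estimate of P(word | prev) from bigram counts."""
    bigram_counts = Counter(ngrams(tokens, 2))
    context_total = sum(
        count for (first, _), count in bigram_counts.items() if first == prev
    )
    if context_total == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_total


tokens = "the cat sat on the mat the cat ran".split()
print(ngrams(tokens, 3)[:2])
print(bigram_probability(tokens, "the", "cat"))
```

In this toy corpus "the" is followed by "cat" twice and "mat" once, so the estimate for "cat" after "the" is two thirds.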

## What are word and sentence embeddings?

Embedding methods evolved from static lookup tables to deeply contextual representations.[7]

| Generation | Method | Year | Notes |
|---|---|---|---|
| Static word vectors | Word2Vec (CBOW, skip-gram) | 2013 | Mikolov et al., Google. Trained on raw text with a shallow network |
| Static word vectors | GloVe | 2014 | Pennington, Socher, Manning at Stanford. Factorizes the global co-occurrence matrix |
| Static word vectors | fastText | 2016 | Facebook AI Research. Adds character n-grams for morphology |
| Contextual | [ELMo](/wiki/elmo) | 2018 | Bidirectional LSTM language model from AI2 |
| Contextual | [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) embeddings | 2018 | Each token vector depends on full sentence context |
| Sentence-level | Sentence-BERT (SBERT) | 2019 | Siamese BERT producing fixed-size sentence vectors |
| Production | OpenAI `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large` | 2022 to 2024 | High-dimensional API embeddings for retrieval |
| Production | Cohere Embed, Voyage AI, Google Gemini Embeddings, BGE, E5, GTE | 2023 to 2026 | Open and commercial dense retrievers |

Beyond dense embeddings, NLP also uses [sparse representations](/wiki/sparse_representation) such as [bag of words](/wiki/bag_of_words), [TF-IDF](/wiki/tf_idf), and [BM25](/wiki/bm25). These are the basis of classical [information retrieval](/wiki/information_retrieval), and they remain competitive when combined with dense vectors in hybrid retrieval.[17] A [sparse feature](/wiki/sparse_feature) is one whose values are mostly zero or empty, which is the common case for one-hot or bag-of-words encodings.
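A sparse representation can be stored as a dictionary that holds only the non-zero entries. The sketch below computes TF-IDF weights using one common variant (term frequency times `log(N / df)`); other weighting schemes exist, and the example documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["the cat sat", "the dog barked", "the cat purred"]
for vector in tfidf_vectors(docs):
    print(vector)
```

Terms that occur in every document (here "the") receive a weight of zero, while rarer terms are weighted more heavily.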

## What are the main NLP model families?

The [Transformer](/wiki/transformer) introduced in [Attention Is All You Need](/wiki/attention_is_all_you_need_transformer) replaced recurrence with self-attention.[1] It scales effectively on GPU and TPU hardware and has become the universal backbone of modern NLP. Major model families follow.

### encoder-only (masked) models

| Model | Year | Organization | Highlights |
|---|---|---|---|
| [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) | 2018 | Google | Set new state of the art on 11 NLU tasks and pushed the [GLUE](/wiki/glue_benchmark) score to 80.5, a 7.7-point absolute gain; introduced masked language modeling at scale[2] |
| [RoBERTa](/wiki/roberta) | 2019 | Meta AI | BERT trained longer with more data and dynamic masking |
| [ALBERT](/wiki/albert) | 2019 | Google | Parameter sharing for compact BERT |
| [DistilBERT](/wiki/distilbert) | 2019 | Hugging Face | Knowledge-distilled BERT, 40% smaller |
| [XLNet](/wiki/xlnet) | 2019 | CMU and Google | Permutation language modeling |
| [ELECTRA](/wiki/electra) | 2020 | Google | Replaced-token detection objective |
| [DeBERTa](/wiki/deberta) | 2020 | Microsoft | Disentangled attention and enhanced masked decoding |

### decoder-only (causal / autoregressive) models

| Model | Year | Organization | Notes |
|---|---|---|---|
| [GPT-1](/wiki/gpt-1) | 2018 | OpenAI | First Generative Pre-trained Transformer |
| [GPT-2](/wiki/gpt-2) | 2019 | OpenAI | 1.5 billion parameters; demonstrated zero-shot transfer |
| [GPT-3](/wiki/gpt-3) | 2020 | OpenAI | 175 billion parameters (10x larger than any prior dense LM), trained on about 570 GB of filtered text; popularized few-shot [in-context learning](/wiki/in_context_learning)[4] |
| [LaMDA](/wiki/lamda) | 2021 | Google | Dialogue-tuned 137 billion parameter model |
| [PaLM](/wiki/palm) | 2022 | Google | 540 billion parameters |
| [LLaMA](/wiki/llama) | 2023 | Meta | Open-weights foundation models |
| [GPT-4](/wiki/gpt-4) | 2023 | OpenAI | Multimodal; behind much of [ChatGPT](/wiki/chatgpt) |
| [Claude](/wiki/claude) | 2023 to present | [Anthropic](/wiki/anthropic) | Constitutional AI alignment; [Claude Opus 4.7](/wiki/claude_opus_4_7) is the current frontier |
| [Gemini](/wiki/gemini) | 2023 to present | Google DeepMind | Native multimodality; [Gemini 3](/wiki/gemini_3) is the latest line |
| [Mistral](/wiki/mistral_ai), [Mixtral](/wiki/mixtral) | 2023 to present | Mistral AI | Efficient open-weights and mixture-of-experts |
| [Falcon](/wiki/falcon), [Qwen](/wiki/qwen), [DeepSeek](/wiki/deepseek), [GLM-4.5](/wiki/glm_4_5), [Kimi](/wiki/kimi), [Phi](/wiki/phi), [Gemma](/wiki/gemma) | 2023 to present | Various | Open-weights ecosystem |
| [GPT-5](/wiki/gpt-5) and successors | 2025 to present | OpenAI | Frontier reasoning models, including [GPT-5.5](/wiki/gpt-5.5) |

### encoder-decoder models

| Model | Year | Organization | Notes |
|---|---|---|---|
| Original Transformer | 2017 | Google | Built for [machine translation](/wiki/machine_translation) |
| [T5](/wiki/t5) | 2019 | Google | "Text-to-text" unification of tasks |
| BART | 2019 | Meta | [Denoising](/wiki/denoising) autoencoder for generation |
| mBART, mT5 | 2020 | Various | Multilingual encoder-decoders |
| Flan-T5, UL2 | 2022 | Google | Instruction-tuned T5 (Flan-T5) and a mixture-of-denoisers pre-training objective (UL2) |

## What are the core NLP tasks?

Classical and modern NLP share a backbone of canonical tasks. The same Transformer model can usually be fine-tuned or prompted to handle each.[5]

| Task | Description | Example datasets, resources, or wiki entries |
|---|---|---|
| Text classification | Assign a label to a document, e.g. spam vs ham | [Sentiment analysis](/wiki/sentiment_analysis) |
| [Named entity recognition](/wiki/named_entity_recognition) (NER) | Identify spans referring to people, places, organizations, dates | CoNLL-2003, OntoNotes |
| Part-of-speech (POS) tagging | Label each token with its grammatical category | Penn Treebank tags |
| Syntactic parsing | Build a constituency or dependency tree | Universal Dependencies |
| Coreference resolution | Cluster mentions referring to the same entity | OntoNotes coreference |
| Word sense disambiguation | Pick the correct sense of an ambiguous word | WordNet senses |
| [Sentiment analysis](/wiki/sentiment_analysis) | Classify polarity, opinion, or emotion | SST, IMDB |
| [Text summarization](/wiki/text_summarization) | Produce a shorter version preserving meaning | CNN/DailyMail, XSum |
| [Machine translation](/wiki/machine_translation) | Translate text between languages | WMT |
| [Question answering](/wiki/question_answering) | Answer questions from context or open domain | SQuAD, NaturalQuestions |
| [Information retrieval](/wiki/information_retrieval) | Retrieve relevant documents for a query | MS MARCO, BEIR |
| [Sequence-to-sequence](/wiki/sequence-to-sequence_task) generation | Map any input sequence to any output sequence | Translation, summarization, paraphrasing |
| Dialogue and chat | Multi-turn conversational generation | [ChatGPT](/wiki/chatgpt), [Claude](/wiki/claude) |
| [Speech recognition](/wiki/speech_recognition) | Convert audio to text | [Whisper](/wiki/whisper), [Wav2Vec](/wiki/wav2vec) |
| Topic modeling | Discover latent themes in a corpus | [Latent Dirichlet allocation](/wiki/latent_dirichlet_allocation) |

Beyond these, many recent benchmarks evaluate higher-level skills such as code generation ([HumanEval](/wiki/humaneval), [MBPP](/wiki/mbpp)), math word problems ([GSM8K](/wiki/gsm8k), [MATH](/wiki/math_benchmark)), and instruction following ([IFEval](/wiki/ifeval)).

## How do natural language understanding and natural language generation differ?

NLP is often split into two halves: [natural language understanding](/wiki/natural_language_understanding) ([NLU](/wiki/nlu)) covers reading and reasoning, served by encoder models like [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers)[2] and benchmarks like [GLUE](/wiki/glue_benchmark) and [SuperGLUE](/wiki/superglue); natural language generation (NLG) covers producing fluent text, served by decoder models like [GPT](/wiki/gpt) and benchmarks like [MT-Bench](/wiki/mt_bench). Modern frontier systems blend both.

## What techniques are specific to large language models?

The rise of [large language models](/wiki/large_language_model) introduced a set of techniques that did not exist in the pre-2018 NLP toolkit.

### pre-training, fine-tuning, and alignment

Most modern NLP systems follow a two- or three-stage recipe:

1. **Pre-training** on trillions of tokens of web text, code, and books, often from sources like [Common Crawl](/wiki/common_crawl), [FineWeb](/wiki/fineweb), C4, The Pile, and proprietary corpora. The standard objectives are next-token prediction (causal)[3] or masked-token reconstruction (encoder).[2]
2. **[Supervised fine-tuning](/wiki/supervised_fine-tuning)** (SFT) on a smaller curated dataset of high-quality instructions and demonstrations.
3. **Preference optimization or [RLHF](/wiki/rlhf)** to align the model with human preferences.[12] Variants include Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), and Constitutional AI.

Parameter-efficient fine-tuning saves compute and memory by updating only a small number of parameters while the pre-trained weights stay frozen. The most common approaches are [LoRA](/wiki/lora) (Low-Rank Adaptation), [QLoRA](/wiki/qlora), and adapter modules.[13] [Meta-learning](/wiki/meta-learning), or learning to learn, trains a model so that it can adapt quickly to new tasks; it is related to, but not the same as, multi-task pre-training. Some pipelines use [staged training](/wiki/staged_training) to introduce data or capabilities in phases. [Model merging](/wiki/model_merging) lets practitioners combine separately trained checkpoints into a single model without further training.
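A quick calculation shows where the savings of low-rank adaptation come from. LoRA represents the update to a `d_out x d_in` weight matrix as the product of a `d_out x r` matrix and an `r x d_in` matrix. The layer size and rank below are hypothetical values chosen for illustration.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical example: one 4096 x 4096 projection with rank 8.
full = full_finetune_params(4096, 4096)     # 16,777,216
lora = lora_trainable_params(4096, 4096, 8)  # 65,536
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

For this single layer the low-rank factors hold well under one percent of the parameters of the full matrix, which is why adapter checkpoints are small.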

### prompting

[Prompt engineering](/wiki/prompt_engineering) is the practice of designing input text so that an LLM produces a desired output. Important techniques include:

- Zero-shot prompting: ask the model directly without examples.
- Few-shot or [in-context learning](/wiki/in_context_learning): include several worked examples in the prompt.[4]
- [Chain-of-thought](/wiki/chain_of_thought) prompting: ask the model to reason step by step. Introduced by Wei et al. (2022); prompting a 540B-parameter model with eight worked examples raised the [GSM8K](/wiki/gsm8k) math solve rate from 18% to 57%.[10]
- [Meta prompting](/wiki/meta_prompting): use the model itself to design prompts.
- [System prompt](/wiki/system_prompt): a fixed instruction prepended to every conversation that sets persona, tone, and constraints.
- [Structured output](/wiki/structured_output): force the model to emit JSON, XML, or another schema.
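The zero-shot, few-shot, and system-prompt techniques above differ mainly in how the input text is assembled. The sketch below uses a hypothetical `Q:` / `A:` layout; actual chat APIs use their own message formats, so treat this as an illustration rather than a real interface.

```python
def build_prompt(question, examples=(), system_prompt=""):
    """Assemble a plain-text prompt.

    With no examples this is zero-shot; with worked examples it is few-shot.
    Putting step-by-step reasoning inside the example answers gives a
    chain-of-thought style prompt.
    """
    parts = []
    if system_prompt:
        parts.append(system_prompt)
    for example_question, example_answer in examples:
        parts.append(f"Q: {example_question}\nA: {example_answer}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


few_shot = build_prompt(
    "What is 12 + 30?",
    examples=[("What is 2 + 3?", "2 + 3 = 5. The answer is 5.")],
    system_prompt="You are a careful math tutor.",
)
print(few_shot)
```

The model is expected to continue the text after the final `A:`, imitating the pattern set by the examples.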

### retrieval-augmented generation

[Retrieval-augmented generation](/wiki/retrieval_augmented_generation_rag) (RAG) combines an LLM with an external knowledge base.[11] At query time, a retriever finds relevant documents using dense or sparse embeddings, and the LLM conditions on them to produce a grounded answer. Frameworks like [LangChain](/wiki/langchain) and [LlamaIndex](/wiki/llamaindex) provide building blocks for RAG pipelines, and [vector databases](/wiki/vector_database) such as Pinecone, Weaviate, Qdrant, and Milvus store embeddings at scale. RAG is now standard practice for chatbots that need fresh, factual, or proprietary information.
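The retrieve-then-generate flow can be sketched without any external services. In this toy version the retriever scores documents by word overlap (Jaccard similarity) as a stand-in for dense or sparse embeddings, and the final step only builds the grounded prompt instead of calling an LLM. The function names, documents, and prompt wording are all hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def retrieve(query, documents, top_k=2):
    """Rank documents by Jaccard overlap with the query."""
    query_terms = tokenize(query)
    scored = []
    for doc in documents:
        doc_terms = tokenize(doc)
        union = query_terms | doc_terms
        score = len(query_terms & doc_terms) / len(union) if union else 0.0
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_grounded_prompt(query, documents, top_k=2):
    """Place the retrieved passages in front of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


knowledge_base = [
    "BERT is a masked language model",
    "Paris is the capital of France",
    "GPT-3 has 175 billion parameters",
]
print(build_grounded_prompt("what is the capital of France", knowledge_base, top_k=1))
```

In a production pipeline the overlap score would be replaced by a vector-database lookup, and the prompt would be sent to the language model.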

### tool use and agents

When an LLM can call external functions or APIs, it becomes capable of [tool use](/wiki/tool_use). Higher-level [AI agents](/wiki/ai_agents) plan, decompose tasks, and orchestrate multi-step actions. Subtypes include [computer-use agents](/wiki/computer-use_agent), [AI browser agents](/wiki/ai_browser_agent), and [agentic workflows](/wiki/agentic_workflow). Persistent state across steps and sessions is handled through [agent memory](/wiki/agent_memory), while [knowledge editing](/wiki/knowledge_editing) techniques update facts stored in the model's weights.

### context windows and long-context techniques

A model's [context window](/wiki/context_window) is the maximum number of tokens it can attend to at once. [GPT-3](/wiki/gpt-3) launched with 2,048 tokens; [Claude 2](/wiki/claude) introduced 100,000-token windows in 2023; modern frontier systems including [Claude Opus 4.7](/wiki/claude_opus_4_7), [Gemini 3 Pro](/wiki/gemini_3_pro), and [GPT-5.5](/wiki/gpt-5.5) handle context windows in the 1 million token range. Benchmarks such as [LongBench](/wiki/longbench) and [RULER](/wiki/ruler_benchmark) measure long-context performance.

### scaling and inference efficiency

[Scaling laws](/wiki/scaling_laws), formalized by Kaplan et al.[14] and refined by [Chinchilla scaling](/wiki/chinchilla_scaling),[15] describe how loss falls predictably with more compute, data, and parameters. The Chinchilla result showed that for a fixed compute budget, model size and training tokens should be scaled in equal proportion: a 70 billion parameter Chinchilla trained on 4x more data outperformed the 280 billion parameter Gopher on nearly every task.[15] See also the [Scaling Laws for Neural Language Models paper](/wiki/scaling_laws_paper). On the inference side, optimization techniques include [speculative decoding](/wiki/speculative_decoding), [tensor parallelism](/wiki/tensor_parallelism), [model parallelism](/wiki/model_parallelism), [pipelining](/wiki/pipelining), and [test-time compute](/wiki/test_time_compute) scaling for reasoning models. Local deployment commonly relies on quantized model files in formats like [GGUF](/wiki/gguf) and on runtimes like [Ollama](/wiki/ollama) and [LM Studio](/wiki/lmstudio).
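The Chinchilla versus Gopher comparison can be checked with back-of-the-envelope arithmetic. The sketch assumes the widely used approximation that training compute is about `6 * parameters * tokens` FLOPs, and it uses training-set sizes of roughly 300 billion tokens for Gopher and 1.4 trillion for Chinchilla; these token counts are not stated in the text above and are taken as assumptions from the published descriptions of the two models.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute, assuming C ~= 6 * N * D."""
    return 6.0 * params * tokens


# Assumed training setups (see the note above the code).
gopher = {"params": 280e9, "tokens": 300e9}
chinchilla = {"params": 70e9, "tokens": 1.4e12}

for name, cfg in (("Gopher", gopher), ("Chinchilla", chinchilla)):
    flops = training_flops(cfg["params"], cfg["tokens"])
    ratio = cfg["tokens"] / cfg["params"]
    print(f"{name}: {flops:.2e} FLOPs, {ratio:.1f} tokens per parameter")
```

Under these assumptions the two models use a similar amount of compute, but Chinchilla spends it on about 20 tokens per parameter instead of about 1, which is the trade-off the paragraph describes.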

## What are the main decoding strategies?

Given a probability distribution over the next token, a decoding strategy decides which token to actually emit. The choice strongly affects fluency, diversity, and factuality.

| Strategy | How it works | Typical use |
|---|---|---|
| Greedy decoding | Always pick the highest-probability token | Deterministic short answers |
| Beam search | Maintain the top *k* hypotheses at each step | Translation, summarization |
| Sampling | Draw a token according to the distribution | Creative generation |
| Temperature | Sharpen (low) or flatten (high) the distribution before sampling | Controls randomness |
| Top-k sampling | Sample only from the *k* most probable tokens | Reduces tail noise |
| Top-p (nucleus) sampling | Sample from the smallest set whose cumulative probability exceeds *p* | Default for many chatbots |
| Typical sampling | Sample tokens whose information content (negative log-probability) is close to the conditional entropy of the distribution | Coherent open-ended generation |
| Mirostat | Adaptively keep perplexity near a target | Avoids local repetition |
| Contrastive search | Combine likelihood with degeneration penalty | Long-form generation |
| Speculative decoding | Use a small draft model and verify with a large model | Faster inference, same distribution |
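Several rows of the table above can be illustrated on a toy next-token distribution. This is a minimal sketch, assuming strictly positive probabilities and a hypothetical four-token vocabulary; the top-p filter keeps tokens until the cumulative probability reaches `p`.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution: T < 1 sharpens it, T > 1 flattens it."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(probs):
    """Index of the highest-probability token."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def sample(probs, rng):
    """Draw one token index according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


vocab = ["the", "a", "cat", "zebra"]  # hypothetical vocabulary
probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)                # seeded for repeatability

print("greedy:", vocab[greedy(probs)])
print("T=0.5:", [round(p, 3) for p in apply_temperature(probs, 0.5)])
print("top-k=2:", top_k_filter(probs, 2))
print("top-p=0.7:", top_p_filter(probs, 0.7))
print("sampled:", vocab[sample(top_p_filter(probs, 0.7), rng)])
```

In practice these filters are composed: a chatbot might apply a temperature, then a top-p cutoff, and only then sample.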

## How is NLP evaluated, and what are the key metrics?

NLP evaluation is hard because language tasks are open-ended. Common metrics and benchmarks include:

### intrinsic metrics

- [Perplexity](/wiki/perplexity): exponentiated cross-entropy of a language model on held-out text. Lower is better.
- Cross-entropy and bits-per-character / bits-per-byte.
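Perplexity follows directly from the per-token probabilities a model assigns to held-out text. The sketch below assumes those probabilities are already available as a list; the function names are hypothetical.

```python
import math


def cross_entropy(token_probs):
    """Average negative log-likelihood per token, in nats."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Exponentiated cross-entropy; lower is better."""
    return math.exp(cross_entropy(token_probs))


def bits_per_token(token_probs):
    """The same cross-entropy expressed in bits instead of nats."""
    return cross_entropy(token_probs) / math.log(2)


# A model that assigns probability 0.25 to every observed token.
uniform_guesses = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_guesses))      # approximately 4
print(bits_per_token(uniform_guesses))  # approximately 2
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among four tokens.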

### task metrics

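- [BLEU](/wiki/bleu) (Bilingual Evaluation Understudy): clipped n-gram precision against reference translations, combined with a brevity penalty.
- ROUGE: recall-oriented overlap with reference summaries. ROUGE-1 and ROUGE-2 count unigrams and bigrams; ROUGE-L uses the longest common subsequence.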
- METEOR: alignment-based metric using stemming and synonyms.
- [chrF](/wiki/chrf) and chrF++: character-level F-score, robust across languages.
- BERTScore and BLEURT: model-based semantic similarity.
- F1, exact match, and accuracy for classification and QA.
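The precision and recall views behind BLEU and ROUGE can be shown with one shared overlap count. This sketch covers only a single n-gram order against a single reference; full BLEU also combines several orders and applies a brevity penalty, and the function names here are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of contiguous n-grams."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_stats(candidate, reference, n):
    """Clipped matches plus the n-gram totals of both sides."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    matched = sum((cand & ref).values())  # each match capped by reference count
    return matched, sum(cand.values()), sum(ref.values())


def ngram_precision(candidate, reference, n=1):
    """BLEU-style clipped precision: matches / candidate n-grams."""
    matched, cand_total, _ = overlap_stats(candidate, reference, n)
    return matched / cand_total if cand_total else 0.0


def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style recall: matches / reference n-grams."""
    matched, _, ref_total = overlap_stats(candidate, reference, n)
    return matched / ref_total if ref_total else 0.0


candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
print(ngram_precision(candidate, reference))  # 2 of 7 candidate unigrams
print(ngram_recall(candidate, reference))     # 2 of 6 reference unigrams
```

Clipping is what stops the degenerate candidate from scoring perfect precision simply by repeating a common word.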

### benchmark suites

| Benchmark | Year | What it measures |
|---|---|---|
| [GLUE](/wiki/glue_benchmark) | 2018 | General NLU |
| [SuperGLUE](/wiki/superglue) | 2019 | Harder NLU |
| [MMLU](/wiki/mmlu) and [MMLU-Pro](/wiki/mmlu-pro) | 2020 to 2024 | 57-subject multiple-choice knowledge |
| [TruthfulQA](/wiki/truthfulqa) | 2021 | Truthfulness on misleading questions |
| [HellaSwag](/wiki/hellaswag) | 2019 | Commonsense sentence completion |
| [BIG-Bench Hard](/wiki/big-bench-hard) | 2022 | 23 hard reasoning tasks |
| [HumanEval](/wiki/humaneval) | 2021 | Python code generation |
| [MBPP](/wiki/mbpp) | 2021 | Mostly basic Python problems |
| [GSM8K](/wiki/gsm8k) | 2021 | Grade-school math word problems |
| [MATH](/wiki/math_benchmark) | 2021 | Competition mathematics |
| [BoolQ](/wiki/boolq), [DROP](/wiki/drop), [TriviaQA](/wiki/triviaqa), [SimpleQA](/wiki/simpleqa) | various | Reading comprehension and QA |
| [GPQA Diamond](/wiki/gpqa_diamond) | 2023 | Graduate-level science questions |
| [MGSM](/wiki/mgsm) | 2022 | Multilingual grade-school math |
| [PubMedQA](/wiki/pubmedqa), [LegalBench](/wiki/legalbench) | various | Domain-specific QA |
| [LongBench](/wiki/longbench), [RULER](/wiki/ruler_benchmark) | 2023 to 2024 | Long-context retrieval and reasoning |
| [LiveBench](/wiki/livebench) | 2024 | Continuously refreshed contamination-resistant tasks |
| [MT-Bench](/wiki/mt_bench) | 2023 | LLM-as-judge multi-turn dialogue |
| [JailbreakBench](/wiki/jailbreakbench), [AdvBench](/wiki/advbench) | 2023 to 2024 | Robustness to adversarial prompts |

A growing literature studies LLM-as-judge evaluation, where one model rates the outputs of another, with [MT-Bench](/wiki/mt_bench) and Chatbot Arena leading the way.

## What are modalities and multimodal NLP?

A [modality](/wiki/modality) is a channel of input or output, such as text, images, or audio. A [multimodal model](/wiki/multimodal_model) handles more than one. Frontier systems including [GPT-4](/wiki/gpt-4), [Gemini](/wiki/gemini), [Claude Opus 4.7](/wiki/claude_opus_4_7), and [Llama 3.2](/wiki/llama_3_2) accept images, and in some cases audio, alongside text. Speech systems include [Whisper](/wiki/whisper) and [Wav2Vec](/wiki/wav2vec).

## What is NLP used for? (applications)

NLP underlies a wide range of products. Selected categories with representative wiki entries:

| Domain | Example systems |
|---|---|
| Chatbots and assistants | [ChatGPT](/wiki/chatgpt), [Claude](/wiki/claude), [Gemini](/wiki/gemini), [Grok](/wiki/grok), [Kimi](/wiki/kimi), [Doubao](/wiki/doubao), [Microsoft 365 Copilot](/wiki/microsoft_365_copilot) |
| Search and retrieval | Google Search, [Perplexity](/wiki/perplexity), Bing Chat, You.com |
| Code generation | [GitHub Copilot](/wiki/github_copilot), [Codex](/wiki/codex), [Code Llama](/wiki/code_llama), [StarCoder](/wiki/starcoder), [Codestral](/wiki/codestral) |
| Translation | Google Translate, DeepL, Microsoft Translator, Meta NLLB |
| Writing tools | [QuillBot](/wiki/quillbot), Grammarly, Notion AI |
| Speech | [Whisper](/wiki/whisper), Apple Dictation, Amazon Transcribe |
| Enterprise | Customer-support automation, contract review, [LegalBench](/wiki/legalbench), [PubMedQA](/wiki/pubmedqa)-style domain assistants |
| Routing and aggregation | [OpenRouter](/wiki/openrouter), [LangChain](/wiki/langchain), [LlamaIndex](/wiki/llamaindex), [CrewAI](/wiki/crewai) |

## Organizations and ecosystems

The modern NLP ecosystem is shaped by a handful of frontier labs and a much larger open-weights community.

| Organization | Selected NLP work |
|---|---|
| [OpenAI](/wiki/openai) | [GPT](/wiki/gpt) family, [ChatGPT](/wiki/chatgpt), [Codex](/wiki/codex), [Whisper](/wiki/whisper), [o1](/wiki/o1) and [o3](/wiki/o3) reasoning models |
| [Anthropic](/wiki/anthropic) | [Claude](/wiki/claude) family, Constitutional AI, RLHF research |
| Google and Google DeepMind | [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [T5](/wiki/t5), [LaMDA](/wiki/lamda), [PaLM](/wiki/palm), [Gemini](/wiki/gemini), [Gemma](/wiki/gemma) |
| [Meta AI](/wiki/meta_ai) | [LLaMA](/wiki/llama), [Llama 3](/wiki/llama_3), [Llama 4 Scout and Maverick](/wiki/llama_4_scout_maverick), RoBERTa, BART |
| [Mistral AI](/wiki/mistral_ai) | [Mistral](/wiki/mistral_ai), [Mixtral](/wiki/mixtral), [Codestral](/wiki/codestral), [Mistral Medium 3](/wiki/mistral_medium_3) |
| [xAI](/wiki/xai) | [Grok](/wiki/grok), [Grok 3](/wiki/grok_3), [Grok 4](/wiki/grok_4) |
| [DeepSeek](/wiki/deepseek), [Moonshot AI](/wiki/moonshot_ai), [Baidu AI](/wiki/baidu_ai), [Huawei AI](/wiki/huawei_ai), [MiniMax](/wiki/minimax), [Doubao](/wiki/doubao) | Chinese frontier and open-weights labs |
| [AI2](/wiki/ai2), [EleutherAI](/wiki/eleutherai) | Open research and open models |
| [Inflection AI](/wiki/inflection_ai), [Reka AI](/wiki/reka_ai) | Specialized assistants and multimodal models |
| Apple | [Apple Foundation Models](/wiki/apple_foundation_models) |

## Index of NLP term wiki pages

The original gateway list is preserved below. Each link points to a dedicated AI Wiki entry.

- [attention](/wiki/attention)
- [bag of words](/wiki/bag_of_words)
- [BERT (Bidirectional Encoder Representations from Transformers)](/wiki/bert_bidirectional_encoder_representations_from_transformers)
- [bigram](/wiki/bigram)
- [bidirectional](/wiki/bidirectional)
- [bidirectional language model](/wiki/bidirectional_language_model)
- [BLEU (Bilingual Evaluation Understudy)](/wiki/bleu_bilingual_evaluation_understudy)
- [causal language model](/wiki/causal_language_model)
- [crash blossom](/wiki/crash_blossom)
- [decoder](/wiki/decoder)
- [denoising](/wiki/denoising)
- [embedding layer](/wiki/embedding_layer)
- [embedding space](/wiki/embedding_space)
- [embedding vector](/wiki/embedding_vector)
- [encoder](/wiki/encoder)
- [GPT (Generative Pre-trained Transformer)](/wiki/gpt_generative_pre-trained_transformer)
- [LaMDA (Language Model for Dialogue Applications)](/wiki/lamda_language_model_for_dialogue_applications)
- [language model](/wiki/language_model)
- [large language model](/wiki/large_language_model)
- [masked language model](/wiki/masked_language_model)
- [meta-learning](/wiki/meta-learning)
- [modality](/wiki/modality)
- [model parallelism](/wiki/model_parallelism)
- [multi-head self-attention](/wiki/multi-head_self-attention)
- [multimodal model](/wiki/multimodal_model)
- [natural language understanding](/wiki/natural_language_understanding)
- [N-gram](/wiki/n-gram)
- [NLU](/wiki/nlu)
- [pipelining](/wiki/pipelining)
- [self-attention (also called self-attention layer)](/wiki/self-attention_also_called_self-attention_layer)
- [sentiment analysis](/wiki/sentiment_analysis)
- [sequence-to-sequence task](/wiki/sequence-to-sequence_task)
- [sparse feature](/wiki/sparse_feature)
- [sparse representation](/wiki/sparse_representation)
- [staged training](/wiki/staged_training)
- [token](/wiki/token)
- [Transformer](/wiki/transformer)
- [trigram](/wiki/trigram)
- [unidirectional](/wiki/unidirectional)
- [unidirectional language model](/wiki/unidirectional_language_model)
- [word embedding](/wiki/word_embedding)

### extended index of related NLP wiki pages

In addition to the original list, the following AI Wiki entries cover related NLP concepts, models, benchmarks, and tools.

| Topic area | Wiki pages |
|---|---|
| Foundational concepts | [Embeddings](/wiki/embeddings), [Vector embeddings](/wiki/vector_embeddings), [Positional encoding](/wiki/positional_encoding), [Byte pair encoding](/wiki/byte_pair_encoding), [TF-IDF](/wiki/tf_idf), [Bahdanau attention](/wiki/bahdanau_attention), [Sequence model](/wiki/sequence_model), [Similarity measure](/wiki/similarity_measure), [Latent Dirichlet allocation](/wiki/latent_dirichlet_allocation), [Information retrieval](/wiki/information_retrieval) |
| Frontier and open-weights LLMs | [GPT](/wiki/gpt), [GPT-1](/wiki/gpt-1), [GPT-2](/wiki/gpt-2), [GPT-3](/wiki/gpt-3), [GPT-4](/wiki/gpt-4), [GPT-5](/wiki/gpt-5), [GPT-5.5](/wiki/gpt-5.5), [Claude](/wiki/claude), [Claude Opus 4.7](/wiki/claude_opus_4_7), [Gemini](/wiki/gemini), [Gemini 3 Pro](/wiki/gemini_3_pro), [LLaMA](/wiki/llama), [Llama 4 Scout and Maverick](/wiki/llama_4_scout_maverick), [Mistral AI](/wiki/mistral_ai), [Mixtral](/wiki/mixtral), [Falcon](/wiki/falcon), [Qwen3](/wiki/qwen_3), [Phi-4](/wiki/phi_4), [Gemma 2](/wiki/gemma_2), [DeepSeek V4](/wiki/deepseek_v4), [Kimi K2](/wiki/kimi_k2), [GLM-4.5](/wiki/glm_4_5), [PaLM](/wiki/palm), [LaMDA](/wiki/lamda), [Vicuna](/wiki/vicuna), [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [RoBERTa](/wiki/roberta), [ALBERT](/wiki/albert), [DeBERTa](/wiki/deberta), [XLNet](/wiki/xlnet), [ELECTRA](/wiki/electra), [ELMo](/wiki/elmo), [Grok 4](/wiki/grok_4), [gpt-oss](/wiki/gpt_oss), [OpenAI o3](/wiki/o3), [OpenAI o-series](/wiki/openai_o-series) |
| Reasoning, alignment, and training | [Chain-of-thought prompting](/wiki/chain_of_thought), [In-context learning](/wiki/in_context_learning), [Prompt engineering](/wiki/prompt_engineering), [Meta prompting](/wiki/meta_prompting), [System prompt](/wiki/system_prompt), [RLHF](/wiki/rlhf), [Supervised fine-tuning](/wiki/supervised_fine-tuning), [LoRA](/wiki/lora), [QLoRA](/wiki/qlora), [Knowledge editing](/wiki/knowledge_editing), [Test-time compute](/wiki/test_time_compute), [Speculative decoding](/wiki/speculative_decoding), [Tensor parallelism](/wiki/tensor_parallelism), [Model merging](/wiki/model_merging), [Foundation models](/wiki/foundation_models), [Frontier models](/wiki/frontier_models), [Scaling laws](/wiki/scaling_laws), [Chinchilla scaling](/wiki/chinchilla_scaling) |
| NLP tasks | [Sentiment analysis](/wiki/sentiment_analysis), [Named entity recognition](/wiki/named_entity_recognition), [Text summarization](/wiki/text_summarization), [Machine translation](/wiki/machine_translation), [Question answering](/wiki/question_answering), [Speech recognition](/wiki/speech_recognition), [Whisper](/wiki/whisper), [Wav2Vec](/wiki/wav2vec) |
| Tools and infrastructure | [LangChain](/wiki/langchain), [LlamaIndex](/wiki/llamaindex), [OpenRouter](/wiki/openrouter), [Ollama](/wiki/ollama), [LM Studio](/wiki/lmstudio), [GGUF](/wiki/gguf), [CrewAI](/wiki/crewai), [Tool use](/wiki/tool_use), [AI agents](/wiki/ai_agents), [AI browser agent](/wiki/ai_browser_agent), [Computer-use agent](/wiki/computer-use_agent), [Agentic workflow](/wiki/agentic_workflow), [Agent memory](/wiki/agent_memory), [Structured output](/wiki/structured_output), [Context window](/wiki/context_window), [RAG](/wiki/retrieval_augmented_generation_rag) |
| Benchmarks | [GLUE](/wiki/glue_benchmark), [SuperGLUE](/wiki/superglue), [MMLU-Pro](/wiki/mmlu-pro), [HellaSwag](/wiki/hellaswag), [BIG-Bench Hard](/wiki/big-bench-hard), [GSM8K](/wiki/gsm8k), [MATH](/wiki/math_benchmark), [MBPP](/wiki/mbpp), [TruthfulQA](/wiki/truthfulqa), [TriviaQA](/wiki/triviaqa), [BoolQ](/wiki/boolq), [DROP](/wiki/drop), [SimpleQA](/wiki/simpleqa), [GPQA Diamond](/wiki/gpqa_diamond), [MGSM](/wiki/mgsm), [LongBench](/wiki/longbench), [RULER](/wiki/ruler_benchmark), [LiveBench](/wiki/livebench), [IFEval](/wiki/ifeval), [PubMedQA](/wiki/pubmedqa), [LegalBench](/wiki/legalbench), [JailbreakBench](/wiki/jailbreakbench), [AdvBench](/wiki/advbench), [MT-Bench](/wiki/mt_bench), [BLEU](/wiki/bleu), [Perplexity](/wiki/perplexity) |
| Datasets, organizations, applications | [Common Crawl](/wiki/common_crawl), [FineWeb](/wiki/fineweb), [OpenAI](/wiki/openai), [Anthropic](/wiki/anthropic), [Meta AI](/wiki/meta_ai), [xAI](/wiki/xai), [DeepSeek](/wiki/deepseek), [Moonshot AI](/wiki/moonshot_ai), [Baidu AI](/wiki/baidu_ai), [Huawei AI](/wiki/huawei_ai), [MiniMax](/wiki/minimax), [Reka AI](/wiki/reka_ai), [Inflection AI](/wiki/inflection_ai), [AI2](/wiki/ai2), [EleutherAI](/wiki/eleutherai), [Apple Foundation Models](/wiki/apple_foundation_models), [ChatGPT](/wiki/chatgpt), [Microsoft 365 Copilot](/wiki/microsoft_365_copilot), [GitHub Copilot](/wiki/github_copilot), [Codex](/wiki/codex), [Code Llama](/wiki/code_llama), [Codestral](/wiki/codestral), [StarCoder](/wiki/starcoder), [QuillBot](/wiki/quillbot), [Perplexity](/wiki/perplexity) |

## Further reading and external references

- Jurafsky, D., and Martin, J. H. *Speech and Language Processing*, 3rd ed. draft. https://web.stanford.edu/~jurafsky/slp3/.
- Stanford CS224n, https://web.stanford.edu/class/cs224n/.
- Eisenstein, J. *Introduction to Natural Language Processing*. MIT Press, 2019.
- Manning, C. D., Raghavan, P., and Schütze, H. *Introduction to Information Retrieval*. Cambridge University Press, 2008.
- Wikipedia, "Natural language processing", https://en.wikipedia.org/wiki/Natural_language_processing.
- Wikipedia, "Transformer (deep learning architecture)".
- Wikipedia, "Large language model".
- Hugging Face *NLP Course*, https://huggingface.co/learn/nlp-course.
- Google AI, *Machine Learning Glossary: Language*.

## References

1. Vaswani, A., et al. "Attention Is All You Need". NeurIPS 2017. arXiv:1706.03762.
2. Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL-HLT 2019. arXiv:1810.04805.
3. Radford, A., et al. "Improving Language Understanding by Generative Pre-Training". OpenAI, 2018.
4. Brown, T., et al. "Language Models are Few-Shot Learners". NeurIPS 2020. arXiv:2005.14165.
5. Raffel, C., et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". JMLR 21, 2020.
6. Mikolov, T., et al. "Efficient Estimation of Word Representations in Vector Space". ICLR 2013.
7. Pennington, J., Socher, R., and Manning, C. D. "GloVe: Global Vectors for Word Representation". EMNLP 2014.
8. Bahdanau, D., Cho, K., and Bengio, Y. "Neural Machine Translation by Jointly Learning to Align and Translate". ICLR 2015.
9. Sennrich, R., Haddow, B., and Birch, A. "Neural Machine Translation of Rare Words with Subword Units". ACL 2016.
10. Wei, J., et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". NeurIPS 2022. arXiv:2201.11903.
11. Lewis, P., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". NeurIPS 2020. arXiv:2005.11401.
12. Ouyang, L., et al. "Training Language Models to Follow Instructions with Human Feedback". NeurIPS 2022.
13. Hu, E. J., et al. "LoRA: Low-Rank Adaptation of Large Language Models". ICLR 2022. arXiv:2106.09685.
14. Kaplan, J., et al. "Scaling Laws for Neural Language Models". arXiv:2001.08361, 2020.
15. Hoffmann, J., et al. "Training Compute-Optimal Large Language Models". NeurIPS 2022. arXiv:2203.15556.
16. Jurafsky, D., and Martin, J. H. *Speech and Language Processing*. 3rd ed. draft, Stanford.
17. Manning, C. D., Raghavan, P., and Schütze, H. *Introduction to Information Retrieval*. Cambridge University Press, 2008.
18. Wikipedia contributors. "Natural language processing". *Wikipedia*, accessed 2026-05-09.
19. OpenAI. "What are tokens and how to count them?". OpenAI Help Center, accessed 2026-06-25. https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them.

