Machine learning terms/Natural Language Processing

Updated Jun 25, 2026

See also: Machine learning terms

Natural Language Processing (NLP) is the subfield of artificial intelligence and machine learning concerned with enabling computers to read, interpret, generate, and reason about human language in text and speech form.^[18] The core machine-learning terms used across modern NLP are: tokens and tokenization (how text is split into model-readable units), embeddings (dense vectors that encode meaning), attention and the Transformer (the architecture behind nearly every state-of-the-art system since 2017),^[1] language models (which assign probabilities to token sequences), and the evaluation vocabulary of perplexity, BLEU, and benchmark suites such as GLUE and MMLU. NLP sits at the intersection of computer science, linguistics, and statistics,^[16] and it powers nearly every consumer-facing product that uses written or spoken language, from web search and machine translation to chatbots like ChatGPT, Claude, and Gemini.

This page is the gateway hub for NLP-related entries on the AI Wiki. It introduces the central ideas, surveys the modern landscape dominated by large language models, and provides a curated index of every NLP concept and model with its own dedicated wiki page.

How did natural language processing evolve?

The history of NLP spans roughly seven decades and is conventionally divided into four eras:

Era	Years	Dominant approach	Representative milestones
Symbolic / rule-based	1950s to late 1980s	Hand-written grammars and logic	Georgetown-IBM translation demo (1954), ELIZA (1966), SHRDLU (1970)
Statistical	late 1980s to 2010	Probability theory, HMMs, n-gram models, CRFs	IBM Candide SMT, Penn Treebank (1993), maximum-entropy taggers
Neural / distributional	2010 to 2017	Word embeddings, RNNs, LSTMs, seq2seq	Word2Vec (2013), GloVe (2014), Bahdanau attention (2014), GNMT (2016)
Transformer / foundation-model	2017 to present	Self-attention, Transformer, large-scale pre-training, instruction tuning	"Attention Is All You Need" (2017), BERT (2018), GPT-3 (2020), ChatGPT (2022), GPT-4 (2023)

The field shifted decisively in 2017 when Google researchers published "Attention Is All You Need", introducing the Transformer architecture.^[1] The paper proposed "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely", and reported 28.4 BLEU on WMT 2014 English-to-German and 41.8 BLEU on English-to-French, a new state of the art at the time.^[1] Almost every state-of-the-art NLP system since then is a Transformer variant.

What are the core NLP concepts and building blocks?

A handful of building blocks underlie nearly every modern NLP system.

tokens and tokenization

A token is the atomic unit a language model reads or writes. Tokens may be characters, whole words, or, most commonly, subword pieces. Modern models almost always use subword tokenization algorithms such as byte pair encoding (BPE), WordPiece, and SentencePiece / Unigram.^[9] Counting is done in tokens, not words. As a rough rule of thumb for English, OpenAI estimates 1 token is about 4 characters or roughly 0.75 words, so 100 tokens correspond to about 75 words.^[19]
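
The rule of thumb above can be turned into a quick budget estimate. The sketch below is a rough approximation for English text only; the function names are illustrative, and a real tokenizer will give different counts.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough English-only estimate: about 4 characters per token."""
    return round(len(text) / chars_per_token)


def estimate_words(num_tokens: int, words_per_token: float = 0.75) -> int:
    """Rough English-only estimate: about 0.75 words per token."""
    return round(num_tokens * words_per_token)


print(estimate_words(100))         # 75
print(estimate_tokens("a" * 400))  # 100
```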

embeddings

An embedding is a dense vector of real numbers representing meaning in a continuous space. Geometric proximity encodes semantic similarity: vectors for king and queen land near each other.^[6] Embeddings are produced by an embedding layer and live in an embedding space whose dimensionality is typically 256 to 12,288. See embedding vector and vector embeddings.
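
Geometric proximity is usually measured with cosine similarity. The sketch below uses hand-picked three-dimensional vectors that are purely hypothetical; real embeddings are learned and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand for illustration only.
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

sim_royal = cosine_similarity(toy["king"], toy["queen"])
sim_fruit = cosine_similarity(toy["king"], toy["banana"])
print(sim_royal > sim_fruit)  # True
```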

language models

A language model assigns probabilities to sequences of tokens. Given a context, it can either score the likelihood of a continuation or sample new text. Three architectural families dominate:

Family	Direction	Training objective	Canonical example
Causal language model (autoregressive, left-to-right)	Unidirectional	Predict the next token	GPT, LLaMA, Claude
Masked language model	Bidirectional	Reconstruct masked tokens	BERT, RoBERTa, DeBERTa
Encoder-decoder (seq2seq)	Bidirectional encoder, unidirectional decoder	Map input sequence to output sequence	T5, BART, original Transformer

A bidirectional language model reads context from both sides at once and is well suited to understanding tasks; a unidirectional language model is well suited to generation.
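
For the causal family, "assigning a probability to a sequence" can be written with the chain rule of probability. The notation here is an assumption for illustration: x_t is the token at position t and T is the sequence length.

```latex
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
```

Each factor is one next-token prediction, which is why a left-to-right model can both score an existing sequence and sample a new one token by token.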

attention and transformers

Attention is a mechanism that lets a model weight different parts of its input when computing a representation. Self-attention lets every token in a sequence attend to every other token; multi-head self-attention runs many such operations in parallel so the model can capture different relations simultaneously. Bahdanau attention (2014) was the first widely cited attention mechanism in NLP;^[8] the modern scaled-dot-product version was popularized by the 2017 Transformer paper.^[1] Because attention has no inherent notion of word order, models add a positional encoding so position is preserved.
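
A minimal sketch of single-head scaled dot-product attention in plain Python follows. It is a simplification: a real Transformer layer first applies learned projections to obtain queries, keys, and values, adds masking where needed, and runs several heads in parallel. The function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """One attention head: softmax(Q K^T / sqrt(d_k)) V, with no masking."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Toy input: three tokens with two features each, used as Q, K, and V.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(x, x, x)
```

Because the weights for each query sum to 1, every output row is a weighted average of the value vectors. Nothing in the computation depends on token order, which is the reason positional encodings are added.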

How does tokenization work in detail?

The quality of a tokenizer has measurable effects on downstream performance, training cost, and multilingual coverage. The dominant approaches are summarized below.

Algorithm	Idea	Used by
Whitespace / word	Split on spaces and punctuation	Classical NLP, early Word2Vec
Byte pair encoding (BPE)	Iteratively merge the most frequent adjacent symbol pair	GPT-2, GPT-3, GPT-4, LLaMA, most open-weights LLMs
WordPiece	BPE-like merging based on likelihood gain	BERT, DistilBERT
SentencePiece (Unigram)	Probabilistic model that prunes a candidate subword vocabulary	T5, ALBERT, XLNet, mBART
Byte-level BPE	Operates on raw UTF-8 bytes for full Unicode coverage	GPT-2 onward, Claude
Character / byte	One token per character or byte	ByT5, CANINE
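
The BPE row above ("iteratively merge the most frequent adjacent symbol pair") can be sketched as a short training loop. This is a toy version under simplifying assumptions: words start as character sequences, there is no end-of-word marker or byte-level fallback, and ties are broken by first occurrence. The word frequencies are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Return the learned merge rules and the corpus segmented with them."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, segmented = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

After three merges the frequent suffix "est" has become a single symbol, which is how BPE ends up with subword units that are shared across many words.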

Classical statistical pipelines went a step further and grouped word tokens into bigrams, trigrams, and the broader n-gram family that powered statistical language models from the 1990s through the early 2010s. A famous illustration of why raw text is hard to process is the crash blossom: an ambiguous newspaper headline whose intended reading humans usually recover from context but that trips up naive parsers.

What are word and sentence embeddings?

Embedding methods evolved from static lookup tables to deeply contextual representations.^[7]

Generation	Method	Year	Notes
Static word vectors	Word2Vec (CBOW, skip-gram)	2013	Mikolov et al., Google. Trained on raw text with a shallow network
Static word vectors	GloVe	2014	Pennington, Socher, Manning at Stanford. Factorizes the global co-occurrence matrix
Static word vectors	fastText	2016	Facebook AI Research. Adds character n-grams for morphology
Contextual	ELMo	2018	Bidirectional LSTM language model from AI2
Contextual	BERT embeddings	2018	Each token vector depends on full sentence context
Sentence-level	Sentence-BERT (SBERT)	2019	Siamese BERT producing fixed-size sentence vectors
Production	OpenAI `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large`	2022 to 2024	High-dimensional API embeddings for retrieval
Production	Cohere Embed, Voyage AI, Google Gemini Embeddings, BGE, E5, GTE	2023 to 2026	Open and commercial dense retrievers

Beyond dense embeddings, NLP also uses sparse representations such as bag of words, TF-IDF, and BM25. These are the basis of classical information retrieval, and they remain competitive when combined with dense vectors in hybrid retrieval.^[17] A sparse feature vector is one whose entries are mostly zero, which is the common case for one-hot or bag-of-words encodings.
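
As a sketch of a sparse representation, the function below computes one common TF-IDF variant (term frequency times the log of inverse document frequency). Other weighting and smoothing schemes exist, and the three-document corpus is made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
vectors = tf_idf(docs)
print(vectors[0]["cat"] > vectors[0]["the"])  # True
```

Each vector stores only the terms that occur in its document, so every other vocabulary entry is implicitly zero. The rarer word "cat" outweighs the more frequent "the" because it appears in fewer documents.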

What are the main NLP model families?

The Transformer introduced in "Attention Is All You Need" replaced recurrence with self-attention.^[1] It scales effectively on GPU and TPU hardware and has become the universal backbone of modern NLP. Major model families follow.

encoder-only (masked) models

Model	Year	Organization	Highlights
BERT	2018	Google	Set new state of the art on 11 NLU tasks and pushed the GLUE score to 80.5, a 7.7-point absolute gain; introduced masked language modeling at scale^[2]
RoBERTa	2019	Meta AI	BERT trained longer with more data and dynamic masking
ALBERT	2019	Google	Parameter sharing for compact BERT
DistilBERT	2019	Hugging Face	Knowledge-distilled BERT, 40% smaller
XLNet	2019	CMU and Google	Permutation language modeling
ELECTRA	2020	Google	Replaced-token detection objective
DeBERTa	2020	Microsoft	Disentangled attention and enhanced masked decoding

decoder-only (causal / autoregressive) models

Model	Year	Organization	Notes
GPT-1	2018	OpenAI	First Generative Pre-trained Transformer
GPT-2	2019	OpenAI	1.5 billion parameters; demonstrated zero-shot transfer
GPT-3	2020	OpenAI	175 billion parameters (10x larger than any prior dense LM), trained on about 570 GB of filtered text; popularized few-shot in-context learning^[4]
LaMDA	2021	Google	Dialogue-tuned 137 billion parameter model
PaLM	2022	Google	540 billion parameters
LLaMA	2023	Meta	Open-weights foundation models
GPT-4	2023	OpenAI	Multimodal; behind much of ChatGPT
Claude	2023 to present	Anthropic	Constitutional AI alignment; Claude Opus 4.7 is the current frontier
Gemini	2023 to present	Google DeepMind	Native multimodality; Gemini 3 is the latest line
Mistral, Mixtral	2023 to present	Mistral AI	Efficient open-weights and mixture-of-experts
Falcon, Qwen, DeepSeek, GLM-4.5, Kimi, Phi, Gemma	2023 to present	Various	Open-weights ecosystem
GPT-5 and successors	2025 to present	OpenAI	Frontier reasoning models, including GPT-5.5

encoder-decoder models

Model	Year	Organization	Notes
Original Transformer	2017	Google	Built for machine translation
T5	2019	Google	"Text-to-text" unification of tasks
BART	2019	Meta	Denoising autoencoder for generation
mBART, mT5	2020	Various	Multilingual encoder-decoders
Flan-T5, UL2	2022	Google	Instruction-tuned T5 variants

What are the core NLP tasks?

Classical and modern NLP share a backbone of canonical tasks. The same Transformer model can usually be fine-tuned or prompted to handle each.^[5]

Task	Description	Example wiki entry
Text classification	Assign a label to a document, e.g. spam vs ham	Sentiment analysis
Named entity recognition (NER)	Identify spans referring to people, places, organizations, dates	CoNLL-2003, OntoNotes
Part-of-speech (POS) tagging	Label each token with its grammatical category	Penn Treebank tags
Syntactic parsing	Build a constituency or dependency tree	Universal Dependencies
Coreference resolution	Cluster mentions referring to the same entity	OntoNotes coreference
Word sense disambiguation	Pick the correct sense of an ambiguous word	WordNet senses
Sentiment analysis	Classify polarity, opinion, or emotion	SST, IMDB
Text summarization	Produce a shorter version preserving meaning	CNN/DailyMail, XSum
Machine translation	Translate text between languages	WMT
Question answering	Answer questions from context or open domain	SQuAD, NaturalQuestions
Information retrieval	Retrieve relevant documents for a query	MS MARCO, BEIR
Sequence-to-sequence generation	Map any input sequence to any output sequence	Translation, summarization, paraphrasing
Dialogue and chat	Multi-turn conversational generation	ChatGPT, Claude
Speech recognition	Convert audio to text	Whisper, Wav2Vec
Topic modeling	Discover latent themes in a corpus	Latent Dirichlet allocation

Beyond these, many recent benchmarks evaluate higher-level skills such as code generation (HumanEval, MBPP), math word problems (GSM8K, MATH), and instruction following (IFEval).

How do natural language understanding and natural language generation differ?

NLP is often split into two halves: natural language understanding (NLU) covers reading and reasoning, served by encoder models like BERT^[2] and benchmarks like GLUE and SuperGLUE; natural language generation (NLG) covers producing fluent text, served by decoder models like GPT and benchmarks like MT-Bench. Modern frontier systems blend both.

What techniques are specific to large language models?

The rise of large language models introduced a set of techniques that did not exist in the pre-2018 NLP toolkit.

pre-training, fine-tuning, and alignment

Most modern NLP systems follow a two- or three-stage recipe:

Pre-training on trillions of tokens of web text, code, and books, often from sources like Common Crawl, FineWeb, C4, The Pile, and proprietary corpora. The standard objectives are next-token prediction (causal)^[3] or masked-token reconstruction (encoder).^[2]
Supervised fine-tuning (SFT) on a smaller curated dataset of high-quality instructions and demonstrations.
Preference optimization or RLHF to align the model with human preferences.^[12] Variants include Direct Preference Optimization (DPO), Identity Preference Optimization (IPO), and Constitutional AI.

Parameter-efficient fine-tuning saves compute and memory by training only a small number of parameters while the pre-trained weights stay frozen. The most common approaches are LoRA (Low-Rank Adaptation), QLoRA, and adapter modules.^[13] Multi-task pre-training is sometimes loosely described as a form of meta-learning, and some pipelines use staged training to introduce data or capabilities in phases. Model merging lets practitioners combine separately trained checkpoints into a single model without further training.
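
To make the savings concrete, the sketch below counts trainable parameters for one weight matrix. It assumes the standard LoRA formulation, in which a frozen matrix of shape d_out x d_in receives an update B A with B of shape d_out x r and A of shape r x d_in. The layer size and rank are illustrative choices, not values from a specific model.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


d = 4096   # illustrative hidden size
rank = 8   # illustrative LoRA rank

full = full_finetune_params(d, d)
lora = lora_trainable_params(d, d, rank)
print(full, lora, full // lora)  # 16777216 65536 256
```

For this one matrix the low-rank update trains 256 times fewer parameters than full fine-tuning.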

prompting

Prompt engineering is the practice of designing input text so that an LLM produces a desired output. Important techniques include:

Zero-shot prompting: ask the model directly without examples.
Few-shot or in-context learning: include several worked examples in the prompt.^[4]
Chain-of-thought prompting: ask the model to reason step by step. Introduced by Wei et al. (2022); prompting a 540B-parameter model with eight worked examples raised the GSM8K math solve rate from 18% to 57%.^[10]
Meta prompting: use the model itself to design prompts.
System prompt: a fixed instruction prepended to every conversation that sets persona, tone, and constraints.
Structured output: force the model to emit JSON, XML, or another schema.
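
A few-shot prompt is ordinary text assembled before the model is called. The sketch below shows one possible layout; the "Q:" and "A:" labels and the function name are illustrative conventions, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    lines = [instruction, ""]
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
        lines.append("")
    lines.append(f"Q: {query}")
    lines.append("A:")  # left open for the model to complete
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I loved this film.", "positive"), ("The plot was dull.", "negative")],
    "A wonderful surprise.",
)
print(prompt)
```

Passing an empty example list turns the same helper into a zero-shot prompt.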

retrieval-augmented generation

Retrieval-augmented generation (RAG) combines an LLM with an external knowledge base.^[11] At query time, a retriever finds relevant documents using dense or sparse embeddings, and the LLM conditions on them to produce a grounded answer. Frameworks like LangChain and LlamaIndex provide building blocks for RAG pipelines, and vector databases such as Pinecone, Weaviate, Qdrant, and Milvus store embeddings at scale. RAG is now standard practice for chatbots that need fresh, factual, or proprietary information.
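
The retrieve-then-generate flow can be sketched without external libraries. In the toy version below, Jaccard word overlap stands in for a real dense or sparse retriever, the knowledge base is three hard-coded sentences, and the final call to an LLM is left out; all names are hypothetical.

```python
def tokenize(text):
    """Lowercased whitespace tokens; punctuation stays attached in this toy version."""
    return set(text.lower().split())


def retrieve(query, documents, k=2):
    """Rank documents by Jaccard word overlap with the query."""
    q = tokenize(query)

    def score(doc):
        d = tokenize(doc)
        return len(q & d) / len(q | d)

    return sorted(documents, key=score, reverse=True)[:k]


def build_rag_prompt(query, documents, k=2):
    """Put the retrieved passages in front of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


knowledge_base = [
    "The Transformer was introduced in 2017.",
    "BERT is a masked language model.",
    "Chinchilla has 70 billion parameters.",
]
rag_prompt = build_rag_prompt("When was the Transformer introduced?", knowledge_base, k=1)
print(rag_prompt)
```

In a production pipeline the scoring function would be replaced by embedding similarity or BM25 over a vector database or search index, and the prompt would be sent to the model.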

tool use and agents

Tool use is the ability of an LLM to call external functions or APIs and to condition its next output on the result. Higher-level AI agents plan, decompose tasks, and orchestrate multi-step actions. Subtypes include computer-use agents, AI browser agents, and agentic workflows. Persistent state is handled through agent memory and knowledge editing techniques.
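
On the application side, tool use usually comes down to parsing a structured request from the model and dispatching it to ordinary code. The sketch below assumes a hypothetical JSON shape with "name" and "arguments" keys; it does not reproduce any particular vendor's API, and the tool itself is a stand-in.

```python
import json


def get_word_count(text: str) -> int:
    """Example tool: count whitespace-separated words."""
    return len(text.split())


# Registry of functions the model is allowed to call (hypothetical).
TOOLS = {"get_word_count": get_word_count}


def dispatch_tool_call(raw_call: str):
    """Parse a JSON tool call emitted by a model and run the matching function."""
    call = json.loads(raw_call)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


model_output = '{"name": "get_word_count", "arguments": {"text": "attention is all you need"}}'
result = dispatch_tool_call(model_output)
print(result)  # 5
```

An agent loop would append the result to the conversation and call the model again, repeating until the model stops requesting tools.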

context windows and long-context techniques

A model's context window is the maximum number of tokens it can attend to at once. GPT-3 launched with 2,048 tokens; Claude 2 introduced 100,000-token windows in 2023; modern frontier systems including Claude Opus 4.7, Gemini 3 Pro, and GPT-5.5 handle context windows in the 1 million token range. Benchmarks such as LongBench and RULER measure long-context performance.

scaling and inference efficiency

Scaling laws, formalized by Kaplan et al.^[14] and refined by Chinchilla scaling,^[15] describe how loss falls predictably with more compute, data, and parameters. The Chinchilla result showed that for a fixed compute budget, model size and training tokens should be scaled in equal proportion: a 70 billion parameter Chinchilla trained on 4x more data outperformed the 280 billion parameter Gopher on nearly every task.^[15] See also the Scaling Laws for Neural Language Models paper. On the inference side, optimization techniques include speculative decoding, tensor parallelism, model parallelism, pipelining, test-time compute scaling for reasoning models, quantization formats like GGUF, and runtimes like Ollama and LM Studio.
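
The Chinchilla comparison can be checked with back-of-the-envelope arithmetic. The sketch assumes the common approximation of about 6 FLOPs per parameter per training token, and uses the token counts reported by Hoffmann et al. (about 1.4 trillion tokens for Chinchilla and 300 billion for Gopher). Treat the outputs as rough estimates.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


chinchilla_flops = training_flops(70e9, 1.4e12)  # about 5.9e23
gopher_flops = training_flops(280e9, 300e9)      # about 5.0e23

print(f"{chinchilla_flops:.2e} vs {gopher_flops:.2e}")
print(f"Chinchilla tokens per parameter: {1.4e12 / 70e9:.0f}")
```

Under this approximation the two models used a similar compute budget, but Chinchilla spent it on a model a quarter of the size trained on over four times as many tokens, roughly 20 tokens per parameter.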

What are the main decoding strategies?

Given a probability distribution over the next token, a decoding strategy decides which token to actually emit. The choice strongly affects fluency, diversity, and factuality.

Strategy	How it works	Typical use
Greedy decoding	Always pick the highest-probability token	Deterministic short answers
Beam search	Maintain the top k hypotheses at each step	Translation, summarization
Sampling	Draw a token according to the distribution	Creative generation
Temperature	Sharpen (low) or flatten (high) the distribution before sampling	Controls randomness
Top-k sampling	Sample only from the k most probable tokens	Reduces tail noise
Top-p (nucleus) sampling	Sample from the smallest set whose cumulative probability exceeds p	Default for many chatbots
Typical sampling	Sample tokens whose information content is close to the conditional entropy	Coherent open-ended generation
Mirostat	Adaptively keep perplexity near a target	Avoids local repetition
Contrastive search	Combine likelihood with degeneration penalty	Long-form generation
Speculative decoding	Use a small draft model and verify with a large model	Faster inference, same distribution
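
Several rows of the table are small transformations of the next-token distribution. The sketch below covers greedy decoding, temperature, top-k, and top-p in plain Python. It is a simplified illustration: inputs are plain lists, the helper names are invented here, and the top-p cutoff uses "at least p" where some implementations differ at the boundary.

```python
import math
import random


def apply_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature: low values sharpen, high values flatten."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def greedy(probs):
    """Index of the highest-probability token."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample(probs, rng=random):
    """Draw one token index according to the distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = [0.5, 0.3, 0.2]
print(greedy(probs))             # 0
print(top_k_filter(probs, 1))    # [1.0, 0.0, 0.0]
print(top_p_filter(probs, 0.7))  # third token removed, first two rescaled
```

In practice these filters are often chained: temperature first, then top-k or top-p, then a single sampling step.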

How is NLP evaluated, and what are the key metrics?

NLP evaluation is hard because language tasks are open-ended. Common metrics and benchmarks include:

intrinsic metrics

Perplexity: exponentiated cross-entropy of a language model on held-out text. Lower is better.
Cross-entropy and bits-per-character / bits-per-byte.
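
Both intrinsic metrics can be computed from the probabilities a model assigned to the tokens that actually occurred in the held-out text. The sketch below assumes those probabilities are already available as a list; the function names are illustrative.

```python
import math


def cross_entropy_nats(token_probs):
    """Average negative log-probability per token, in nats."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Exponentiated cross-entropy; lower is better."""
    return math.exp(cross_entropy_nats(token_probs))


def bits_per_token(token_probs):
    """The same cross-entropy expressed in bits instead of nats."""
    return cross_entropy_nats(token_probs) / math.log(2)


uniform_over_four = [0.25, 0.25, 0.25, 0.25]
print(perplexity(uniform_over_four))      # about 4
print(bits_per_token(uniform_over_four))  # about 2
```

A perplexity of 4 means the model was, on average, as uncertain as if it were choosing uniformly among four tokens.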

task metrics

BLEU (Bilingual Evaluation Understudy): modified n-gram precision against reference translations, combined with a brevity penalty for short outputs.
ROUGE: n-gram recall, used for summarization (ROUGE-1, ROUGE-2, ROUGE-L).
METEOR: alignment-based metric using stemming and synonyms.
chrF and chrF++: character-level F-score, robust across languages.
BERTScore and BLEURT: model-based semantic similarity.
F1, exact match, and accuracy for classification and QA.
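
The n-gram overlap at the heart of BLEU and ROUGE can be sketched in a few lines. The code below computes clipped n-gram precision (the building block of BLEU) and n-gram recall (the building block of ROUGE-N) against a single reference. It is not a full implementation of either metric: complete BLEU combines several n-gram orders and applies a brevity penalty.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_overlap(candidate, reference, n):
    """Matching n-grams, counting each no more often than it appears in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    return sum(min(count, ref[gram]) for gram, count in cand.items())


def ngram_precision(candidate, reference, n=1):
    """Share of candidate n-grams that also occur in the reference (BLEU-style)."""
    total = max(len(candidate) - n + 1, 0)
    return clipped_overlap(candidate, reference, n) / total if total else 0.0


def ngram_recall(candidate, reference, n=1):
    """Share of reference n-grams recovered by the candidate (ROUGE-N-style)."""
    total = max(len(reference) - n + 1, 0)
    return clipped_overlap(candidate, reference, n) / total if total else 0.0


candidate = "the the the the the the the".split()
reference = "the cat is on the mat".split()
print(ngram_precision(candidate, reference))  # 2/7, about 0.286
print(ngram_recall(candidate, reference))     # 2/6, about 0.333
```

Clipping keeps a degenerate candidate that repeats one common word from scoring perfect precision: only two of its seven copies of "the" are credited.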

benchmark suites

Benchmark	Year	What it measures
GLUE	2018	General NLU
SuperGLUE	2019	Harder NLU
MMLU and MMLU-Pro	2020 to 2024	57-subject multiple-choice knowledge
TruthfulQA	2021	Truthfulness on misleading questions
HellaSwag	2019	Commonsense sentence completion
BIG-Bench Hard	2022	23 hard reasoning tasks
HumanEval	2021	Python code generation
MBPP	2021	Mostly basic Python problems
GSM8K	2021	Grade-school math word problems
MATH	2021	Competition mathematics
BoolQ, DROP, TriviaQA, SimpleQA	various	Reading comprehension and QA
GPQA Diamond	2023	Graduate-level science questions
MGSM	2022	Multilingual grade-school math
PubMedQA, LegalBench	various	Domain-specific QA
LongBench, RULER	2023 to 2024	Long-context retrieval and reasoning
LiveBench	2024	Continuously refreshed contamination-resistant tasks
MT-Bench	2023	LLM-as-judge multi-turn dialogue
JailbreakBench, AdvBench	2024	Robustness to adversarial prompts

A growing literature studies LLM-as-judge evaluation, where one model rates the outputs of another. MT-Bench popularized the approach, and Chatbot Arena supplies crowdsourced human preference votes that LLM judges are often compared against.

What are modalities and multimodal NLP?

A modality is a channel of input or output. A multimodal model handles more than one. Frontier systems including GPT-4, Gemini, Claude Opus 4.7, and Llama 3.2 accept images alongside text, and some also accept audio. Speech systems include Whisper and Wav2Vec.

What is NLP used for? (applications)

NLP underlies a wide range of products. Selected categories with representative wiki entries:

Domain	Example systems
Chatbots and assistants	ChatGPT, Claude, Gemini, Grok, Kimi, Doubao, Microsoft 365 Copilot
Search and retrieval	Google Search, Perplexity, Bing Chat, You.com
Code generation	GitHub Copilot, Codex, Code Llama, StarCoder, Codestral
Translation	Google Translate, DeepL, Microsoft Translator, Meta NLLB
Writing tools	QuillBot, Grammarly, Notion AI
Speech	Whisper, Apple Dictation, Amazon Transcribe
Enterprise	Customer-support automation, contract review, LegalBench, PubMedQA-style domain assistants
Routing and aggregation	OpenRouter, LangChain, LlamaIndex, CrewAI

organizations and ecosystems

The modern NLP ecosystem is shaped by a handful of frontier labs and a much larger open-weights community.

Organization	Selected NLP work
OpenAI	GPT family, ChatGPT, Codex, Whisper, o1 and o3 reasoning models
Anthropic	Claude family, Constitutional AI, RLHF research
Google and Google DeepMind	BERT, T5, LaMDA, PaLM, Gemini, Gemma
Meta AI	LLaMA, Llama 3, Llama 4 Scout and Maverick, RoBERTa, BART
Mistral AI	Mistral, Mixtral, Codestral, Mistral Medium 3
xAI	Grok, Grok 3, Grok 4
DeepSeek, Moonshot AI, Baidu AI, Huawei AI, MiniMax, Doubao	Chinese frontier and open-weights labs
AI2, EleutherAI	Open research and open models
Inflection AI, Reka AI	Specialized assistants and multimodal models
Apple	Apple Foundation Models

index of NLP term wiki pages

The original gateway list is preserved below. Each link points to a dedicated AI Wiki entry.

extended index of related NLP wiki pages

In addition to the original list, the following AI Wiki entries cover related NLP concepts, models, benchmarks, and tools.

Topic area	Wiki pages
Foundational concepts	Embeddings, Vector embeddings, Positional encoding, Byte pair encoding, TF-IDF, Bahdanau attention, Sequence model, Similarity measure, Latent Dirichlet allocation, Information retrieval
Frontier and open-weights LLMs	GPT, GPT-1, GPT-2, GPT-3, GPT-4, GPT-5, GPT-5.5, Claude, Claude Opus 4.7, Gemini, Gemini 3 Pro, LLaMA, Llama 4 Scout and Maverick, Mistral AI, Mixtral, Falcon, Qwen3, Phi-4, Gemma 2, DeepSeek V4, Kimi K2, GLM-4.5, PaLM, LaMDA, Vicuna, BERT, RoBERTa, ALBERT, DeBERTa, XLNet, ELECTRA, ELMo, Grok 4, gpt-oss, OpenAI o3, OpenAI o-series
Reasoning, alignment, and training	Chain-of-thought prompting, In-context learning, Prompt engineering, Meta prompting, System prompt, RLHF, Supervised fine-tuning, LoRA, QLoRA, Knowledge editing, Test-time compute, Speculative decoding, Tensor parallelism, Model merging, Foundation models, Frontier models, Scaling laws, Chinchilla scaling
NLP tasks	Sentiment analysis, Named entity recognition, Text summarization, Machine translation, Question answering, Speech recognition, Whisper, Wav2Vec
Tools and infrastructure	LangChain, LlamaIndex, OpenRouter, Ollama, LM Studio, GGUF, CrewAI, Tool use, AI agents, AI browser agent, Computer-use agent, Agentic workflow, Agent memory, Structured output, Context window, RAG
Benchmarks	GLUE, SuperGLUE, MMLU-Pro, HellaSwag, BIG-Bench Hard, GSM8K, MATH, MBPP, TruthfulQA, TriviaQA, BoolQ, DROP, SimpleQA, GPQA Diamond, MGSM, LongBench, RULER, LiveBench, IFEval, PubMedQA, LegalBench, JailbreakBench, AdvBench, MT-Bench, BLEU, Perplexity
Datasets, organizations, applications	Common Crawl, FineWeb, OpenAI, Anthropic, Meta AI, xAI, DeepSeek, Moonshot AI, Baidu AI, Huawei AI, MiniMax, Reka AI, Inflection AI, AI2, EleutherAI, Apple Foundation Models, ChatGPT, Microsoft 365 Copilot, GitHub Copilot, Codex, Code Llama, Codestral, StarCoder, QuillBot, Perplexity

references

1. Vaswani, A., et al. "Attention Is All You Need". NeurIPS 30, 2017. arXiv:1706.03762.
2. Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". NAACL-HLT 2019. arXiv:1810.04805.
3. Radford, A., et al. "Improving Language Understanding by Generative Pre-Training". OpenAI, 2018.
4. Brown, T., et al. "Language Models are Few-Shot Learners". NeurIPS 2020. arXiv:2005.14165.
5. Raffel, C., et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". JMLR 21, 2020.
6. Mikolov, T., et al. "Efficient Estimation of Word Representations in Vector Space". ICLR 2013.
7. Pennington, J., Socher, R., and Manning, C. D. "GloVe: Global Vectors for Word Representation". EMNLP 2014.
8. Bahdanau, D., Cho, K., and Bengio, Y. "Neural Machine Translation by Jointly Learning to Align and Translate". ICLR 2015.
9. Sennrich, R., Haddow, B., and Birch, A. "Neural Machine Translation of Rare Words with Subword Units". ACL 2016.
10. Wei, J., et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". NeurIPS 2022. arXiv:2201.11903.
11. Lewis, P., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". NeurIPS 2020. arXiv:2005.11401.
12. Ouyang, L., et al. "Training Language Models to Follow Instructions with Human Feedback". NeurIPS 2022.
13. Hu, E. J., et al. "LoRA: Low-Rank Adaptation of Large Language Models". ICLR 2022. arXiv:2106.09685.
14. Kaplan, J., et al. "Scaling Laws for Neural Language Models". arXiv:2001.08361, 2020.
15. Hoffmann, J., et al. "Training Compute-Optimal Large Language Models". NeurIPS 2022. arXiv:2203.15556.
16. Jurafsky, D., and Martin, J. H. *Speech and Language Processing*. 3rd ed. draft, Stanford.
17. Manning, C. D., Raghavan, P., and Schütze, H. *Introduction to Information Retrieval*. Cambridge University Press, 2008.
18. Wikipedia contributors. "Natural language processing". *Wikipedia*, accessed 2026-05-09.
19. OpenAI. "What are tokens and how to count them?". OpenAI Help Center, accessed 2026-06-25. https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them.
