Natural Language Processing Models
Last reviewed
May 11, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 · 2,500 words
See also: Natural Language Processing and Models
Natural language processing (NLP) models are computational systems that read, interpret, and produce human language. They cover a broad family of architectures, training methods, and task formulations, ranging from rule-based parsers built in the 1950s to multi-trillion-parameter foundation models of the 2020s. This page is an index for the subcategories of NLP models on AI Wiki and gives a tour of the field: its history, task taxonomy, dominant model families, training objectives, and evaluation benchmarks.
NLP sits at the intersection of computational linguistics, statistics, and machine learning. Its two basic goals are natural language understanding (mapping text into structured meaning) and natural language generation (producing fluent text from input). Practical systems often combine both, as in a chatbot that reads a query, reasons over a knowledge base, and emits a multi-sentence answer.
The earliest NLP systems, beginning with the Georgetown-IBM machine translation experiment in 1954 and continuing through ELIZA (Weizenbaum 1966) and SHRDLU (Winograd 1971), were rule-based: linguists hand-wrote the grammars and transformation rules. These systems were brittle outside narrow domains, but they established core concepts such as parsing, semantic frames, and dialogue management.
From the late 1980s through the 2000s, statistical NLP took over. IBM Model 1 through Model 5 (Brown et al. 1993) used word-alignment probabilities to train translation systems from parallel corpora. Hidden Markov models powered part-of-speech tagging and speech recognition, while conditional random fields (Lafferty, McCallum, and Pereira 2001) became the standard tool for sequence labeling tasks such as named entity recognition. N-gram language models, smoothed with techniques such as Kneser-Ney, set the state of the art in language modeling for two decades.
A second shift came with distributional semantics and word embeddings. The skip-gram and CBOW models of word2vec (Mikolov et al. 2013) showed that a shallow neural network trained on raw text could produce dense word vectors that captured analogical structure. GloVe (Pennington, Socher, and Manning 2014) recast the problem as factorizing a co-occurrence matrix, and fastText (Bojanowski et al. 2017) added subword character n-grams so that out-of-vocabulary words could be represented.
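The subword idea behind fastText can be illustrated with a minimal sketch. It assumes the scheme described in the fastText paper: the word is wrapped in `<` and `>` boundary markers and split into character n-grams, 3 to 6 characters long by default, whose vectors are then summed to represent the word. The function name `char_ngrams` is hypothetical and not part of the fastText library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Trigrams only, to keep the output short.
print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares n-grams with known words, a vector can be assembled for it from the n-gram vectors learned during training.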
The sequence-to-sequence paradigm (Sutskever, Vinyals, and Le 2014) and the attention mechanism (Bahdanau, Cho, and Bengio 2014) made it possible to learn translation end to end with recurrent neural networks. The transformer (Vaswani et al. 2017) replaced recurrence with self-attention and unlocked the parallel training that defines modern NLP. BERT (Devlin et al. 2018) showed that bidirectional encoder pretraining transfers well to downstream classification and span extraction; the GPT line (Radford et al. 2018 and 2019, Brown et al. 2020) demonstrated that left-to-right decoder pretraining at scale yields strong few-shot generation. From 2022 onward, instruction tuning and reinforcement learning from human feedback (RLHF) produced chat-aligned large language models, and the foundation model framing (Bommasani et al. 2021) became the umbrella term for broadly pretrained systems adaptable to many tasks.
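To make the self-attention step concrete, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(QKᵀ / √d) V. It is a simplification under stated assumptions: a single head, no masking, and no learned query, key, or value projections, so the same toy vectors serve as Q, K, and V. The function names are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy "token embeddings" (made up for illustration); self-attention uses
# the same sequence as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0]]
print(attention(x, x, x))
```

Every position is computed independently of the others inside the loop, which is why the operation parallelizes across the sequence in a way that a recurrent network cannot.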
The table below collects the main task families. Each row links to a subcategory page that lists representative models.
| Task family | What it does | Typical inputs and outputs | Subcategory page |
|---|---|---|---|
| Text classification | Assigns one or more labels to a span of text. Includes sentiment, topic, intent, and toxicity detection. | text in, label out | Text classification models |
| Token classification | Labels each token in a sequence. Covers named entity recognition, part-of-speech tagging, and chunking. | text in, per-token labels out | Token classification models |
| Question answering | Answers a question, either by extracting a span, generating free text, or retrieving over a knowledge store. | question (+ context) in, answer out | Question answering models |
| Table question answering | Answers questions grounded in tabular data, often with cell selection or SQL generation. | question + table in, answer out | Table question answering models |
| Text generation | Produces open-ended continuations, completions, or instruction-following responses. | prompt in, text out | Text generation models |
| Text-to-text generation | Frames any NLP task as a string-to-string mapping, the T5 paradigm. | text in, text out | Text2text generation models |
| Summarization | Compresses a long document into a shorter passage, extractively or abstractively. | document in, summary out | Summarization models |
| Translation | Converts text from one language to another, high-resource or low-resource. | source text in, target text out | Translation models |
| Conversational | Holds multi-turn dialogue with a user, often grounded in tools or retrieved context. | dialogue history in, reply out | Conversational models |
| Fill-mask | Predicts masked tokens inside a sentence, the BERT pretraining task and a useful probe. | text with mask in, filled text out | Fill-mask models |
| Zero-shot classification | Classifies text against arbitrary candidate labels supplied at inference time. | text + label set in, label out | Zero-shot classification models |
| Sentence similarity | Produces dense vector representations whose cosine similarity reflects semantic closeness. | sentence pair in, score out | Sentence similarity models |
| Feature extraction | Returns hidden-state embeddings that downstream systems consume. | text in, vector out | Feature extraction models |
Other task formulations sit alongside these categories: information extraction, coreference resolution, natural language inference and entailment, semantic parsing and text-to-SQL, paraphrase detection, and grammatical error correction. Most reduce to a classification, tagging, or seq2seq problem and use the same model families.
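As a small illustration of the sentence similarity row, the sketch below scores two embedding vectors by cosine similarity. The vectors are made-up toy values, not the output of any real model, and `cosine_similarity` is a hypothetical helper name; in practice the vectors would come from a sentence encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for three sentences (values invented for the example).
cat_sat = [0.9, 0.1, 0.3]
kitten_sat = [0.8, 0.2, 0.4]
stock_fell = [0.1, 0.9, 0.0]

print(cosine_similarity(cat_sat, kitten_sat))  # close to 1: similar direction
print(cosine_similarity(cat_sat, stock_fell))  # much lower: different direction
```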
| Family | Representative models | One-line description |
|---|---|---|
| Word-level embeddings | word2vec, GloVe, fastText | Static dense vectors learned from co-occurrence or skip-gram objectives. |
| Contextual embeddings | ELMo, Flair | Per-token vectors that depend on the surrounding sentence, produced by bidirectional LSTMs. |
| Encoder-only Transformers | BERT, RoBERTa, ALBERT, DeBERTa, ELECTRA, ModernBERT | Bidirectional encoders pretrained with masked language modeling for classification, tagging, and retrieval. |
| Encoder-decoder seq2seq | T5, BART, mT5, FLAN-T5, UL2, Pegasus | Bidirectional encoder plus autoregressive decoder for translation, summarization, and any text-to-text task. |
| Decoder-only LLMs | GPT family, Llama, Mistral, Claude, Gemini, DeepSeek | Causal language models trained at scale, adapted for chat and reasoning through instruction tuning. |
| Multilingual | mBERT, XLM-R, mT5, NLLB-200 | Single models that share parameters across 100 or more languages, often trained on Common Crawl mixtures. |
| Specialized domain | BioBERT, SciBERT, ClinicalBERT, FinBERT, LegalBERT, CodeBERT, CodeT5 | Encoders or seq2seq models continued-pretrained on biomedical, legal, financial, or programming corpora. |
| Mixture-of-experts | Switch Transformer, GLaM, Mixtral, DBRX, DeepSeek-V3 | Sparsely activated networks that route each token to a small subset of experts. |
Families are not mutually exclusive. Llama is a decoder-only LLM and also a foundation model with countless domain-specialized fine-tunes. CodeT5 is both an encoder-decoder seq2seq and a specialized domain model.
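The routing idea in the mixture-of-experts row can be sketched as follows. This is one common variant, assumed here for illustration: pick the top-k experts by router score, apply a softmax over just those k scores, and sum the gated expert outputs. Real systems compute the router scores with a learned linear layer and use feed-forward networks as experts; in this sketch the scores are passed in directly and the experts are toy functions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    gates = softmax([router_logits[i] for i in top])  # renormalize over the chosen experts
    return list(zip(top, gates))


def moe_layer(token, experts, router_logits, k=2):
    """Combine the outputs of the selected experts, weighted by their gates."""
    output = [0.0] * len(token)
    for expert_idx, gate in route_top_k(router_logits, k):
        expert_out = experts[expert_idx](token)  # only k experts ever run
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output


# Four toy experts that just scale the input (hypothetical stand-ins for FFN blocks).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
print(route_top_k([0.1, 2.0, -1.0, 1.0], k=2))  # experts 1 and 3 are selected
print(moe_layer(token, experts, [0.1, 2.0, -1.0, 1.0], k=2))
```

Only k of the experts execute per token, so per-token compute tracks k while the total parameter count grows with the number of experts.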
The pretraining objective is the loss the model optimizes on unlabeled text before any fine-tuning. It defines what the model is good at by default.
| Objective | Example model | What the model predicts |
|---|---|---|
| Masked language modeling | BERT, RoBERTa | Randomly masked tokens, conditioned on the entire surrounding context. |
| Causal language modeling | GPT, Llama, Mistral | Each next token given the previous tokens only, strictly left to right. |
| Span corruption | T5, mT5 | Contiguous spans replaced by a sentinel, with the decoder reconstructing them. |
| Denoising autoencoding | BART | The original document, given a corrupted version with token masking, deletion, infilling, sentence permutation, or document rotation. |
| Permutation language modeling | XLNet | Tokens in a random factorization order, blending bidirectional context with autoregressive generation. |
| Replaced token detection | ELECTRA | A binary label per token: was this token in the original text or was it swapped by a small generator? |
| Prefix language modeling | UL2, T5 variants | Causal prediction on a suffix, conditioned bidirectionally on a prefix. |
| Fill-in-the-middle | StarCoder, Code Llama | The middle of a document, conditioned on both a prefix and a suffix, useful for code completion. |
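Two of the objectives in the table can be sketched as data-preparation functions. This is an illustration under assumptions, not any library's implementation: tokens are whitespace-split words rather than subword IDs, the corrupted spans are supplied by hand instead of sampled at random, and the sentinel names follow the `<extra_id_N>` convention used by the Hugging Face T5 tokenizer. The function names are hypothetical.

```python
def causal_lm_pairs(tokens):
    """Causal LM: the target sequence is the input shifted left by one token."""
    return tokens[:-1], tokens[1:]


def span_corrupt(tokens, spans):
    """T5-style span corruption.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    Each span is replaced by a sentinel in the input; the target lists each
    sentinel followed by the tokens it hides, then a closing sentinel.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # hide the span behind a sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # the decoder must reproduce the span
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # final sentinel ends the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()

x, y = causal_lm_pairs(tokens)
print(x[:3], "->", y[:3])
# ['Thank', 'you', 'for'] -> ['you', 'for', 'inviting']

corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(corrupted))
# Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(target))
# <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

Masked language modeling differs from span corruption mainly in that each hidden token is predicted in place by the encoder rather than regenerated by a decoder.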
After pretraining, an NLP model is adapted to a target task through one or more adaptation methods, such as fine-tuning, instruction tuning, and RLHF.
| Benchmark | Year | Scope |
|---|---|---|
| GLUE | 2018 | Nine English natural language understanding tasks, including sentiment, paraphrase, entailment, and similarity. |
| SuperGLUE | 2019 | Eight harder English tasks targeting reasoning, coreference, and reading comprehension. |
| BIG-Bench | 2022 | More than 200 collaboratively contributed tasks probing diverse capabilities of LLMs. |
| MMLU | 2020 | 57 subject areas of multiple-choice questions covering high school through professional-level knowledge. |
| HELM | 2022 | Holistic evaluation across accuracy, robustness, fairness, calibration, and efficiency on many scenarios. |
| Open LLM Leaderboard | 2023 | Aggregated Hugging Face leaderboard combining several reasoning, knowledge, and truthfulness benchmarks. |
| LMSYS Chatbot Arena | 2023 | Crowd-sourced pairwise human preference judgments turned into Elo ratings for chat models. |
| MTEB | 2022 | Massive Text Embedding Benchmark covering retrieval, classification, clustering, and reranking. |
| XTREME | 2020 | Cross-lingual transfer across 40 languages and nine tasks. |
| HumanEval | 2021 | 164 hand-written Python coding problems with unit tests, the standard small-scale code benchmark. |
| GSM8K | 2021 | 8,500 grade-school math word problems used to study chain-of-thought reasoning. |
No single benchmark is sufficient. Modern model releases report a dashboard across reasoning, code, math, multilingual, factuality, and chat-style preference scores.
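The Elo ratings mentioned in the Chatbot Arena row rest on a simple update rule, sketched below. This is the textbook online Elo formula, shown only to illustrate how pairwise preference votes become ratings; it is not claimed to be the leaderboard's exact procedure, and the K-factor of 32 and starting rating of 1000 are assumptions chosen for the example.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)  # a surprising result moves ratings more
    return rating_a + delta, rating_b - delta


# Two chat models start level; a voter prefers model A.
print(elo_update(1000, 1000, score_a=1.0))
# (1016.0, 984.0)
```

The update is zero-sum: whatever one model gains, the other loses, so ratings are meaningful only relative to the rest of the pool.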
The Hugging Face Hub hosts hundreds of thousands of model checkpoints, and the Transformers library standardizes loading and tokenization across architectures; see Transformers library. spaCy provides production pipelines for tokenization, tagging, parsing, and entity recognition. NLTK is a teaching workhorse with classic algorithms and corpora. Stanford CoreNLP offers Java-based linguistic tools. Fairseq and OpenNMT focus on sequence-to-sequence training, especially machine translation. vLLM and TensorRT-LLM accelerate decoder-only inference, FlashAttention reduces attention memory and latency, and LangChain and LlamaIndex connect LLMs to tools and retrieval.
NLP models power web search, machine translation, chat assistants, automated content generation, sentiment analysis, biomedical literature mining, legal document review, financial monitoring, code completion (Copilot, Cursor), educational tutoring, and accessibility tools such as captioning. They are also the language backbone of multimodal systems that combine vision, audio, and text.
Context windows have grown by orders of magnitude: Claude reached 200,000 tokens in 2023 and 1 million tokens for select tiers in 2025, and Gemini 1.5 Pro reports a 1-million-token window with 2 million in research previews. Mixture-of-experts architectures such as Mixtral 8x7B and DeepSeek-V3 decouple total parameter count from per-token compute. Multimodal training is standard: GPT-4o, Gemini, and Claude process images and audio alongside text. Reasoning models such as OpenAI o1 and o3, DeepSeek-R1, and Claude with extended thinking spend additional inference compute to generate chains of thought. At the small end, ModernBERT (Warner et al. 2024) refreshes the encoder family with 8,192-token context, while Llama 3.2 1B and 3B and Microsoft Phi-4 push capable open weights to laptop-scale footprints.
Hallucination, in which a fluent model confidently asserts false facts, is the most cited reliability issue. Social bias and fairness gaps appear in classification outputs, generated text, and evaluation data. Factual freshness is bounded by the pretraining cutoff unless the model is paired with retrieval or web tools. Reasoning is still uneven, with strong arithmetic and code performance coexisting with brittle planning. Low-resource languages are underrepresented in pretraining corpora, leaving quality gaps even for multilingual models. Energy and compute costs grow with parameter counts, raising sustainability concerns. Benchmark contamination, the leakage of test sets into training data, threatens headline scores. Alignment, the problem of ensuring that powerful generative models follow human intent without harmful side effects, is the focus of intense ongoing research.
Natural Language Processing, Large Language Model, Foundation Model.