Text Generation Models
Last reviewed: May 11, 2026
Sources: 30 citations
Review status: Source-backed
Revision: v2 · 2,416 words
Text generation models are language models trained to produce coherent natural-language text, typically by predicting tokens one at a time conditioned on preceding context. They form the dominant class of modern generative AI systems and power applications such as conversational assistants, code editors, search summaries, and autonomous agents. Most current systems are decoder-only autoregressive transformer models; encoder-decoder architectures play a smaller role and are covered separately on the text2text generation models page.
This page surveys the field: its history, architectural variants, training pipelines, leading models, evaluation benchmarks, applications, and limitations. Individual model families are documented in their own articles, linked throughout.
Text generation has progressed from statistical n-gram models to deep neural language models within roughly two decades.
Statistical era (pre-2010). Early generators relied on n-gram counts that estimated the probability of the next word from a short window of preceding words. These models powered speech recognition decoders and statistical machine translation systems, but because they could not see context outside that window, the text they generated lost grammaticality and coherence beyond a few words.
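The counting approach described above can be illustrated with a minimal bigram sketch. The toy corpus and the function names `train_bigram` and `next_word_probs` are assumptions made for illustration; production n-gram systems used longer windows and smoothing, which are omitted here.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev) from raw counts."""
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # unseen context: an unsmoothed model has no estimate
    return {word: c / total for word, c in counts[prev].items()}

# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)
probs = next_word_probs(model, "the")  # "cat" follows "the" twice, "mat" once
```

Because the estimate depends only on the single preceding word, the model has no way to keep a sentence consistent over longer spans, which is the limitation the paragraph above describes.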
Recurrent era (2010 to 2017). Tomas Mikolov and colleagues introduced recurrent neural network language models in 2010, replacing fixed-context counts with learned hidden states. LSTM variants from Sepp Hochreiter and Jürgen Schmidhuber, originally proposed in 1997, were widely adopted for language modeling once GPU training matured. In 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio added soft attention to sequence-to-sequence translation (arXiv:1409.0473), allowing the decoder to align with arbitrary positions in the source. Their work seeded the line of attention research that produced the transformer.
Transformer revolution (2017 to 2020). In June 2017, Ashish Vaswani and seven Google co-authors published "Attention Is All You Need" (arXiv:1706.03762), proposing the transformer architecture built entirely on self-attention. OpenAI applied a decoder-only transformer to language modeling in GPT-1 (June 2018, 117 million parameters) and scaled it to GPT-2 in February 2019 (largest variant 1.5 billion parameters). GPT-3, released in May 2020 with 175 billion parameters, demonstrated in-context few-shot learning across translation, question answering, and arithmetic tasks (Brown et al., arXiv:2005.14165). Google released T5 in October 2019, framing every NLP task as text-to-text and reaching up to 11 billion parameters.
Alignment and chat era (2022 to 2023). Long Ouyang and colleagues at OpenAI published InstructGPT in March 2022 (arXiv:2203.02155), introducing reinforcement learning from human feedback (RLHF) for instruction following. Anthropic followed with Constitutional AI in December 2022 (Bai et al., arXiv:2212.08073), training models against written principles using AI feedback. OpenAI launched ChatGPT on November 30, 2022, reaching one million users in five days. GPT-4, released March 14, 2023, accepted images alongside text and improved markedly on professional exams.
Open-source wave (2023 to 2024). Meta released Llama on February 24, 2023, with weights available to researchers under a non-commercial license. Llama 2 (July 18, 2023) shifted to a more permissive license and added chat-tuned variants. Mistral 7B (October 10, 2023) and Mixtral 8x7B (December 11, 2023) from Mistral AI demonstrated that a 7 billion parameter dense model, and a mixture-of-experts model with 47 billion total parameters of which about 13 billion are active per token, could match much larger predecessors. Meta released Llama 3 on April 18, 2024 (8B and 70B), followed by Llama 3.1 with a 405 billion parameter flagship on July 23, 2024.
State of the field (2024 to 2025). Anthropic released the Claude 3 family (Haiku, Sonnet, Opus) on March 4, 2024, with a 200,000 token context window. Google announced Gemini 1.0 in December 2023 and Gemini 1.5 Pro in February 2024, the latter with a one million token context window. OpenAI released GPT-4o on May 13, 2024, with native audio and image processing. DeepSeek released DeepSeek-V3 on December 26, 2024, a 671 billion parameter mixture-of-experts model with 37 billion active parameters per token (arXiv:2412.19437). Alibaba released Qwen 2.5 in late 2024, trained on 18 trillion tokens.
Decoder-only causal language models are the dominant paradigm. A stack of transformer blocks processes tokens left to right, with each block computing masked self-attention so that position t attends only to positions 1 through t. The final hidden state at each position is projected to a probability distribution over the vocabulary, and the model is trained with a next-token cross-entropy loss. GPT, Llama, Mistral, Claude, and Qwen are all decoder-only.
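The causal mask and the next-token loss can be sketched in a few lines of plain Python. This is a simplified illustration, not any particular model's implementation: it uses a single attention head, omits the learned query, key, value, and output projections, and indexes positions from 0 rather than 1. The function names and the toy inputs are assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head attention where position t sees only positions 0..t."""
    d = len(queries[0])
    outputs, weights_per_pos = [], []
    for t, q in enumerate(queries):
        # Causal mask: score only the current and earlier positions.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        weights_per_pos.append(weights)
    return outputs, weights_per_pos

def next_token_cross_entropy(logits_per_pos, targets):
    """Mean negative log-probability assigned to each target token id."""
    losses = []
    for logits, target in zip(logits_per_pos, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy inputs: three positions, two-dimensional vectors (illustrative only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)

# Uniform logits over a four-token vocabulary give a loss of log(4).
loss = next_token_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
```

The first position can attend only to itself, so its output equals its own value vector; later positions mix in progressively more context. Real implementations compute the same thing in batched matrix form, typically by adding a large negative value to the masked scores before the softmax.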
Encoder-decoder generators split the work: an encoder reads the input bidirectionally and a decoder generates output autoregressively while attending to encoder states. T5 and BART (Lewis et al., arXiv:1910.13461) follow this design. Encoder-decoder models remain common in machine translation and summarization but have been overtaken by decoder-only models for general text generation.
Several architectural refinements are now standard in production models.
A typical pipeline has three stages.
Pretraining. The model learns next-token prediction over trillions of tokens scraped and filtered from the web, books, code repositories, and curated text. The Chinchilla scaling study (Hoffmann et al., arXiv:2203.15556) found that for fixed compute, model size and training tokens should scale roughly in proportion, popularizing the view that earlier large models were undertrained. Recent flagship models train on 14 to 18 trillion tokens.
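The proportional-scaling result can be made concrete with a short calculation. The sketch below relies on two assumptions that are not stated in this article: the common approximation that training compute is about 6 FLOPs per parameter per token, and a ratio of about 20 training tokens per parameter, which matches the 70 billion parameter Chinchilla model trained on roughly 1.4 trillion tokens. Both constants are rules of thumb, and the function name is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0       # assumed rule of thumb (tokens per parameter)
FLOPS_PER_PARAM_TOKEN = 6.0   # assumed approximation: C ~ 6 * N * D

def compute_optimal_split(compute_flops):
    """Return (params, tokens) satisfying C = 6*N*D with D = 20*N."""
    params = math.sqrt(
        compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM)
    )
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

# A budget of 5.88e23 FLOPs corresponds to about 70B parameters
# and about 1.4T tokens under these assumptions.
params, tokens = compute_optimal_split(5.88e23)

# Quadrupling compute doubles both the parameter and the token count.
params_4x, tokens_4x = compute_optimal_split(4 * 5.88e23)
```

Under these assumptions both quantities grow with the square root of compute, which is what scaling "roughly in proportion" means in practice. A model with far more parameters than this split suggests for its token count is the kind of model the study characterized as undertrained.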
Supervised fine-tuning (SFT). The pretrained base is adapted on curated instruction-response pairs, often crowdsourced or synthesized. Early examples include FLAN from Google and Alpaca from Stanford. Modern instruction sets reach over one million examples.
Preference optimization. A reward signal aligns model outputs with human preferences.
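One common way to turn human preferences into a training signal is a pairwise loss on a reward model: given two responses to the same prompt, the loss is small when the preferred response receives the higher score. The sketch below shows this Bradley-Terry style formulation, which is the kind of objective used for reward models in RLHF pipelines such as InstructGPT. The function name and the scalar-reward interface are assumptions for illustration, and other preference-optimization methods use different objectives.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(r_chosen - r_rejected)) in a stable way."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Equal scores: the model is indifferent, so the loss is log(2).
tie_loss = pairwise_preference_loss(0.0, 0.0)

# Ranking the human-preferred response higher yields a smaller loss.
good_loss = pairwise_preference_loss(2.0, 0.0)
bad_loss = pairwise_preference_loss(0.0, 2.0)
```

Minimizing this loss over many labeled comparisons pushes the reward model to score preferred responses above rejected ones; the resulting scalar reward can then guide a reinforcement learning step on the language model.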
| Model | Year | Organization | Parameters | Key innovation |
|---|---|---|---|---|
| GPT-2 | 2019 | OpenAI | 1.5B (largest) | Zero-shot transfer at scale |
| T5 | 2019 | Google | up to 11B | Text-to-text unification |
| GPT-3 | 2020 | OpenAI | 175B | In-context few-shot learning |
| GPT-Neo / GPT-J | 2021 | EleutherAI | 6B (J) | Early open replication |
| PaLM | 2022 | Google | 540B | Pathways training, dense scale |
| Chinchilla | 2022 | DeepMind | 70B | Compute-optimal scaling |
| OPT 175B | 2022 | Meta | 175B | Open replication of GPT-3 |
| BLOOM | 2022 | BigScience | 176B | Open multilingual training (46 languages) |
| InstructGPT | 2022 | OpenAI | 175B | RLHF for instruction following |
| ChatGPT | 2022 | OpenAI | undisclosed | Mass-market chat product |
| GPT-4 | 2023 | OpenAI | undisclosed | Multimodal, professional-exam level |
| Claude | 2023 | Anthropic | undisclosed | Constitutional AI training |
| Llama | 2023 | Meta | 7B, 13B, 33B, 65B | Open-weight foundation models |
| Llama 2 | 2023 | Meta | 7B, 13B, 70B | Commercial-friendly license, chat tuning |
| Falcon | 2023 | TII | 7B, 40B, 180B | Open large model on RefinedWeb |
| Mistral 7B | 2023 | Mistral AI | 7B | Sliding window attention, GQA |
| Mixtral 8x7B | 2023 | Mistral AI | 47B total, 13B active | Sparse mixture of experts |
| Gemini | 2023 | Google DeepMind | undisclosed | Natively multimodal |
| Claude 3 | 2024 | Anthropic | undisclosed | 200K context, Haiku/Sonnet/Opus tiers |
| Gemini 1.5 Pro | 2024 | Google DeepMind | undisclosed | 1M token context window |
| GPT-4o | 2024 | OpenAI | undisclosed | Native voice and vision |
| Llama 3 / 3.1 | 2024 | Meta | 8B, 70B, 405B (3.1) | Open frontier-scale model |
| Qwen 2.5 | 2024 | Alibaba | 0.5B to 72B | 18T-token pretraining |
| DeepSeek-V3 | 2024 | DeepSeek | 671B total, 37B active | MoE with multi-token prediction |
Parameter counts are taken from official papers and model cards. "Undisclosed" indicates the developer has not published an official figure.
Text generation models are evaluated on a mix of academic, code, math, and human-preference benchmarks.
| Benchmark | Year | What it measures | Notes |
|---|---|---|---|
| MMLU | 2020 | 57 subjects, multiple choice | Hendrycks et al., arXiv:2009.03300 |
| BIG-Bench | 2022 | 200+ diverse tasks | Crowd-sourced by 450+ authors |
| HellaSwag | 2019 | Commonsense sentence completion | Zellers et al., arXiv:1905.07830 |
| ARC | 2018 | Grade-school science questions | AI2 Reasoning Challenge |
| TruthfulQA | 2021 | Resistance to common falsehoods | Lin et al., arXiv:2109.07958 |
| HumanEval | 2021 | Python function synthesis | Chen et al., arXiv:2107.03374 |
| GSM8K | 2021 | Grade-school math word problems | Cobbe et al., arXiv:2110.14168 |
| MT-Bench | 2023 | Multi-turn open-ended chat | LLM-as-judge with GPT-4 |
| LMSYS Chatbot Arena | 2023 | Crowd-sourced pairwise voting | Live Elo leaderboard |
| HELM | 2022 | Holistic, multi-metric evaluation | Stanford CRFM |
Researchers track contamination and saturation closely; several benchmarks (HellaSwag, ARC) are now near ceiling for frontier models, which has driven the introduction of harder successors such as MMLU-Pro and GPQA.
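Pairwise-voting leaderboards such as Chatbot Arena turn individual votes into ratings. A minimal sketch of the textbook Elo update is shown here; the scale factor of 400 and the step size `k=32` are conventional chess defaults chosen as assumptions, and the function names are hypothetical. The leaderboard's actual rating computation is not described in this article and may differ.

```python
def expected_score(rating_a, rating_b):
    """Predicted probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one vote.

    score_a is 1.0 if A was preferred, 0.0 if B was, and 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Two equally rated models; a single vote for A moves each rating by k/2.
new_a, new_b = elo_update(1000.0, 1000.0, 1.0)
```

An upset win against a much higher-rated model moves the ratings further than an expected win does, so ratings converge toward values that predict the observed vote shares.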
Text generation models underpin a broad range of products and workflows.
Despite rapid progress, text generation models share several well-documented weaknesses.