# Text Generation Models

> Source: https://aiwiki.ai/wiki/text_generation_models
> Updated: 2026-06-28
> Categories: AI Models, Natural Language Processing
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Text generation models** are [language models](/wiki/language_model) trained to produce coherent natural-language text by predicting tokens one at a time, each conditioned on the preceding context. The dominant design is the decoder-only autoregressive [transformer](/wiki/transformer), a variant of the encoder-decoder architecture introduced by Vaswani et al. in 2017,[^attention] and scaled through families such as OpenAI's GPT series, Meta's Llama, Anthropic's Claude, and Google's Gemini. They are the most widely deployed class of modern [generative AI](/wiki/generative_ai) systems and power conversational assistants, code editors, search summaries, and autonomous agents.

This article is a **survey and catalog** of notable text-generation models across eras, from statistical n-gram systems to the current frontier. It complements the [large language model](/wiki/large_language_model) article, which treats the underlying concepts (architecture, tokenization, scaling laws, training, and inference) in depth. For the mechanics of how these systems work, see that article and [GPT](/wiki/gpt); this page focuses on the models themselves and how the landscape has evolved.

## What are text generation models?

A text generation model is a probabilistic model of language that, given a prefix of text, outputs a distribution over the next token and samples from it repeatedly to build a continuation. Almost all current systems are decoder-only causal transformers trained with a next-token prediction objective on trillions of tokens of text. They differ from encoder-only models (such as [BERT](/wiki/bert), used for classification and embedding) and from encoder-decoder models (covered on the [text2text generation models](/wiki/text2text_generation_models) page) in that generation is open-ended and proceeds autoregressively. The category includes both proprietary frontier systems (GPT-5, Claude Opus 4.x, Gemini 3.x, Grok 4.x) and open-weight families (Llama, DeepSeek, Qwen, Mistral, Kimi).
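
The sample-and-append loop described above can be sketched in a few lines. This is a minimal illustration, not any production system: `TOY_MODEL`, `next_token_distribution`, and `generate` are hypothetical names, and the toy "model" conditions only on the previous token, whereas a real model conditions on the whole prefix.

```python
import random

# Hypothetical toy "model": next-token probabilities keyed by the previous
# token only. A real text generation model conditions on the entire prefix.
TOY_MODEL = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.7, "</s>": 0.3},
    "sat": {"</s>": 1.0},
}


def next_token_distribution(prefix):
    """Return P(next token | prefix) as a dict of token -> probability."""
    return TOY_MODEL[prefix[-1]]


def generate(prefix, max_new_tokens=10, seed=0):
    """Repeatedly sample a token from the model and append it to the prefix."""
    rng = random.Random(seed)
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        tokens.append(token)
        if token == "</s>":  # stop at the end-of-sequence marker
            break
    return tokens


print(generate(["<s>"]))
```

Changing `seed` yields different continuations from the same prefix, which is the sense in which generation is open-ended.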

## How have text generation models evolved?

Text generation has progressed from statistical n-gram models to deep neural language models within roughly two decades.

### Statistical era (pre-2010)

Early generators relied on [n-gram](/wiki/n-gram) counts that estimated the probability of the next word from short windows of preceding words. These models powered speech recognition decoders and statistical machine translation systems but produced ungrammatical text beyond a few words. Smoothing techniques such as Kneser-Ney addressed the sparsity of unseen word sequences but could not capture long-range dependencies.

### Neural and recurrent era (2010 to 2017)

Tomas Mikolov and colleagues introduced [recurrent neural network](/wiki/recurrent_neural_network) language models in 2010, replacing fixed-context counts with learned hidden states. [LSTM](/wiki/lstm) variants from Sepp Hochreiter and Jürgen Schmidhuber, originally proposed in 1997, were widely adopted for language modeling once GPU training matured. In 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio added soft attention to sequence-to-sequence translation,[^bahdanau] allowing the decoder to align with arbitrary positions in the source. Their work seeded the attention research line that produced the transformer.

### Transformer era (2017 to 2020)

In June 2017, Ashish Vaswani and seven Google co-authors published "[Attention Is All You Need](/wiki/attention_is_all_you_need),"[^attention] proposing the transformer architecture built entirely on [self-attention](/wiki/self_attention). The paper described "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely,"[^attention] a design that proved more parallelizable and far cheaper to train than recurrent predecessors. OpenAI applied a decoder-only transformer to language modeling in [GPT-1](/wiki/gpt-1) (June 2018, 117 million parameters) and scaled it to [GPT-2](/wiki/gpt-2) in February 2019 (largest variant 1.5 billion parameters). [GPT-3](/wiki/gpt-3), released in May 2020, was described by its authors as "an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model,"[^gpt3] and demonstrated in-context few-shot learning across translation, question answering, and arithmetic tasks. Google released [T5](/wiki/t5) in October 2019, framing every NLP task as text-to-text and reaching up to 11 billion parameters.

### Instruction and chat era (2022 to 2023)

Long Ouyang and colleagues at OpenAI published [InstructGPT](/wiki/instructgpt) in March 2022,[^instructgpt] introducing reinforcement learning from human feedback ([RLHF](/wiki/rlhf)) for instruction following; they reported that a 1.3 billion parameter InstructGPT was preferred to the 175 billion parameter GPT-3 base. Anthropic followed with [Constitutional AI](/wiki/constitutional_ai) in December 2022,[^cai] training models against written principles using AI feedback. OpenAI launched [ChatGPT](/wiki/chatgpt) on November 30, 2022; it reached an estimated 100 million monthly users within two months, the fastest-growing consumer application at the time. [GPT-4](/wiki/gpt-4), released March 14, 2023, accepted images alongside text and improved markedly on professional exams.[^gpt4]

### Open-weight wave (2023 to 2024)

Meta released [LLaMA](/wiki/llama) on February 24, 2023, with weights available to researchers under a non-commercial license.[^llama] [Llama 2](/wiki/llama_2) (July 18, 2023) shifted to a more permissive license and added chat-tuned variants. [Mistral 7B](/wiki/mistral) (October 2023) and [Mixtral 8x7B](/wiki/mixtral) (December 2023) from Mistral AI demonstrated that compact dense models and sparse mixture-of-experts models could match much larger predecessors. Meta released [Llama 3](/wiki/llama_3) on April 18, 2024 (8B and 70B), followed by Llama 3.1 with a 405 billion parameter flagship on July 23, 2024. DeepSeek released DeepSeek-V3 on December 26, 2024, a 671 billion parameter mixture-of-experts model with 37 billion active parameters per token.[^dsv3] Proprietary labs advanced in parallel: Google announced [Gemini](/wiki/gemini) 1.0 in December 2023 and Gemini 1.5 Pro in February 2024 with a one million token context window, and Anthropic released the Claude 3 family ([Claude](/wiki/claude)) on March 4, 2024 with a 200,000 token context window.

### Reasoning and frontier era (2024 to 2026)

In September 2024, OpenAI released [o1](/wiki/o1), the first widely deployed "reasoning model" trained to perform extended chain-of-thought before answering, followed by [o3](/wiki/o3). DeepSeek released the open-weight reasoning model [DeepSeek-R1](/wiki/deepseek_r1) in January 2025. OpenAI shipped [GPT-5](/wiki/gpt-5) on August 7, 2025 as a unified system with a real-time router that selects between a fast model and a deeper [reasoning](/wiki/reasoning_models) model.[^gpt5] Anthropic, Google, and xAI released successive frontier models through 2025 and 2026, while open-weight labs (Meta, DeepSeek, Alibaba, Mistral, Moonshot AI) narrowed the gap. These developments are cataloged below.

## What architectures do text generation models use?

Depth on architecture lives in [large language model](/wiki/large_language_model); this section summarizes the variants relevant to a model catalog.

Decoder-only causal [language models](/wiki/language_model) are the dominant paradigm. A stack of transformer blocks processes tokens left to right, with each block computing masked self-attention so that position t only attends to positions 1 through t. The final hidden state is projected to a vocabulary distribution and trained with next-token cross-entropy loss. GPT, Claude, Gemini, Grok, Llama, Mistral, Qwen, and DeepSeek are all decoder-only.
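
The causal mask described above can be illustrated with a short sketch. It assumes the raw attention scores have already been computed (the query-key dot products are omitted), and the function name `causal_attention_weights` is illustrative rather than taken from any library. Indices are 0-based here, so row `t` sees positions `0..t`.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax restricted to current and earlier positions.

    scores[t][s] is the raw attention score of query position t for key
    position s. Future positions (s > t) receive weight 0.
    """
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                   # mask out everything after t
        m = max(visible)                         # subtract max for stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - t - 1)
        weights.append([e / total for e in exps] + masked_tail)
    return weights


for row in causal_attention_weights([[0.0, 5.0, 5.0],
                                     [1.0, 1.0, 9.0],
                                     [0.0, 0.0, 0.0]]):
    print(row)
```

Large scores at future positions (the `5.0` and `9.0` entries) have no effect on the result, which is what lets training evaluate the next-token loss at every position in parallel without leaking later tokens.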

Encoder-decoder generators split the work: an encoder reads the input bidirectionally and a decoder generates output autoregressively while attending to encoder states. [T5](/wiki/t5) and [BART](/wiki/bart) follow this design. Encoder-decoder models remain common in machine translation and summarization but have been overtaken by decoder-only models for general text generation.

Several refinements are now standard in production models:

* **Rotary position embedding ([RoPE](/wiki/rope)).** Jianlin Su and colleagues encode position by rotating query and key vectors, enabling extrapolation to longer contexts. Llama, Mistral, Qwen, and DeepSeek all use RoPE.
* **Grouped-query attention (GQA).** Sharing key and value heads across multiple query heads cuts memory bandwidth at inference. Llama 2 70B introduced GQA at scale, and Llama 3, Mistral, and Qwen adopted it.
* **[Mixture of experts](/wiki/mixture_of_experts) (MoE).** A router selects a small subset of expert feedforward blocks per token, scaling parameter count without proportional compute. Mixtral activates 13 billion of 47 billion parameters per token; DeepSeek-V3 activates 37 billion of 671 billion; Llama 4, Mistral Large 3, Qwen3, and Kimi K2 are also MoE models.
* **Sparse and long-context attention.** Mistral 7B used sliding-window attention; DeepSeek-V3.2 introduced DeepSeek Sparse Attention (DSA), reducing long-context cost to near-linear.[^dsv32] Production context windows have grown from 2,048 tokens (GPT-3) to 200,000 (Claude), 400,000 (GPT-5.2), and 1,000,000 or more (Gemini, Grok 4.3, DeepSeek-V4).
* **Multi-head latent attention (MLA).** Introduced in DeepSeek-V2 and carried through V3, MLA compresses the key-value cache into a low-rank latent representation, substantially reducing memory at inference without degrading quality.
* **Multi-token prediction (MTP).** Rather than predicting only the next token, the model simultaneously predicts several future tokens during training. DeepSeek-V3 applied MTP as an auxiliary loss; Gemma 4 released separate drafter models for speculative decoding based on the same principle.[^dsv3]
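
The mixture-of-experts routing in the list above can be sketched as follows. This is one common formulation (pick the top-k router scores, then softmax-normalize over just those k); individual models differ in the details. The names `route_top_k` and `moe_layer` are hypothetical, and the "experts" are scalar functions standing in for feedforward blocks.

```python
import math


def route_top_k(router_logits, k=2):
    """Select the k highest-scoring experts and normalize their gate weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    gates = route_top_k(router_logits, k)
    return sum(gate * experts[i](x) for i, gate in gates.items())


# Four toy experts; only two are evaluated for this token.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 10 * x, lambda x: -x]
print(moe_layer(1.0, experts, router_logits=[0.0, 1.0, 1.0, -3.0], k=2))
```

Because only the selected experts run, compute per token tracks the active parameter count rather than the total: using the figures above, Mixtral activates roughly 28% of its parameters per token (13 of 47 billion) and DeepSeek-V3 about 5.5% (37 of 671 billion).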

## What are the main decoding strategies?

At inference time, a text generation model holds a probability distribution over the vocabulary at each step. The **decoding strategy** determines which token is actually selected, making it a critical factor in quality, diversity, and speed.

### Greedy decoding

Greedy decoding selects the single highest-probability token at every step. It is deterministic, fast, and coherent for short completions, but tends to produce repetitive, locally optimal text that degrades over longer generations.

### Beam search

[Beam search](/wiki/beam_search) maintains a fixed number (the beam width) of the highest-scoring partial sequences and expands each in parallel. It was the standard decoding method in neural machine translation and encoder-decoder models. Holtzman and colleagues (2020) showed that maximization objectives such as beam search cause "degeneration" in open-ended generation: outputs become repetitive and incoherent precisely because the model assigns high probability to flat, generic continuations.[^holtzman] As that paper put it, "using likelihood as a decoding objective leads to text that is bland and strangely repetitive."[^holtzman]
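
A minimal beam search sketch over a toy model follows. `toy_next_probs` is a hypothetical stand-in for a real model's next-token distribution, and the search omits refinements such as length normalization. The toy probabilities are chosen so that the greedy first step is not on the most probable full sequence.

```python
import math


def toy_next_probs(prefix):
    """Hypothetical next-token probabilities for a tiny vocabulary."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    return table.get(tuple(prefix), {"</s>": 1.0})


def beam_search(next_probs, beam_width=2, max_len=3):
    """Keep the beam_width best partial sequences by cumulative log-probability."""
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "</s>":
                candidates.append((tokens, score))  # finished: carry forward
                continue
            for tok, p in next_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


best_tokens, best_logp = beam_search(toy_next_probs, beam_width=2)[0]
greedy_tokens, greedy_logp = beam_search(toy_next_probs, beam_width=1)[0]
print(best_tokens, math.exp(best_logp))
print(greedy_tokens, math.exp(greedy_logp))
```

With `beam_width=1` the procedure reduces to greedy decoding, which commits to `a` and ends with sequence probability 0.3. A width of 2 keeps `b` alive long enough to find the sequence with probability 0.36.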

### Temperature scaling

Temperature scaling divides each logit by a scalar T before applying the softmax. High temperatures (T greater than 1) flatten the distribution, increasing diversity and sometimes creativity at the cost of coherence. Low temperatures (T approaching 0) sharpen it, approaching greedy behavior. Temperature is a knob available in nearly every production API.
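
As a small sketch of the effect, the function below divides the logits by T and applies a softmax. The name `softmax_with_temperature` and the example logits are illustrative only.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / T. Higher T flattens, lower T sharpens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T = 0 means greedy argmax")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

The probability of the top token falls as T rises, while the ranking of the tokens never changes. Temperature alters how much mass the tail receives, not which token is most likely.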

### Top-k sampling

Top-k sampling restricts sampling to the k most probable tokens at each step, redistributing the remaining probability mass among the top k. This avoids sampling from the long tail of unlikely tokens but uses a fixed-size window even when the distribution is either flat or sharply peaked.
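
A minimal sketch of the truncation step, operating on an already-normalized probability list (the function name `top_k_filter` is illustrative):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


print(top_k_filter([0.5, 0.3, 0.15, 0.05], k=2))
```

Sampling then proceeds from the filtered distribution. The same `k` is applied whether the model is confident or uncertain, which is the fixed-window limitation noted above.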

### Top-p (nucleus) sampling

[Top-p sampling](/wiki/top_p_sampling), introduced in the same 2020 paper by Holtzman et al.,[^holtzman] dynamically selects the smallest set of tokens whose cumulative probability mass exceeds a threshold p. Because the set expands when the distribution is flat and contracts when it is peaked, nucleus sampling adapts more naturally to the model's uncertainty. It is the most widely adopted open-ended generation strategy and is the default in many frameworks.
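
The adaptive behavior can be seen in a short sketch. The function name `top_p_filter` is illustrative, and the nucleus is taken to be complete once cumulative mass reaches or exceeds p.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = set(), 0.0
    for i in order:
        nucleus.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [probs[i] / cumulative if i in nucleus else 0.0
            for i in range(len(probs))]


peaked = top_p_filter([0.9, 0.05, 0.03, 0.02], p=0.8)
flat = top_p_filter([0.25, 0.25, 0.25, 0.25], p=0.8)
print(peaked)  # nucleus shrinks to a single token
print(flat)    # nucleus expands to all four tokens
```

With the same threshold, the peaked distribution keeps one candidate and the flat one keeps four, whereas top-k would keep the same number in both cases.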

### Contrastive search

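Contrastive search selects, from the top-k candidate tokens at each step, the one that best balances model confidence against a degeneration penalty, where the penalty is the maximum similarity between the candidate's hidden representation and the representations of tokens already in the context. It should not be confused with contrastive decoding, which contrasts the predictions of a strong target model against those of a weaker reference model. Adaptive Contrastive Search (2024) extends contrastive search by modulating the degeneration penalty according to the model's estimated per-step uncertainty, reducing repetition without sacrificing coherence.[^acs]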

### Speculative decoding

Speculative decoding accelerates inference without changing the output distribution. A smaller, fast "draft" model proposes several candidate tokens; the larger target model then scores all of them in a single forward pass. Each draft token is accepted with a probability given by the ratio of its target probability to its draft probability (capped at 1), and at the first rejection the target model resamples that position from a corrected distribution and discards the remaining draft tokens. Because LLM inference is often memory-bandwidth-bound rather than compute-bound, verifying several tokens in one pass costs little more than generating one, so the speedup grows with the number of draft tokens accepted per pass. Speculative decoding is widely deployed in production systems.
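
The reason the output distribution is unchanged comes down to the accept-or-resample rule applied at each draft position. The sketch below shows that rule for a single position, assuming the standard speculative sampling formulation: accept with probability min(1, p/q), otherwise resample from the normalized residual max(0, p - q). The function name and the two toy distributions are hypothetical.

```python
import random


def verify_draft_token(token, p_target, q_draft, rng):
    """One accept/reject step of speculative sampling.

    p_target and q_draft map tokens to probabilities at this position.
    Assumes `token` was sampled from q_draft, so q_draft[token] > 0.
    Returns (accepted, token_to_emit).
    """
    accept_prob = min(1.0, p_target.get(token, 0.0) / q_draft[token])
    if rng.random() < accept_prob:
        return True, token
    # Rejected: resample from max(0, p - q), renormalized by rng.choices.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    tokens = list(residual)
    weights = [residual[t] for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)
p = {"a": 0.7, "b": 0.3}   # target model (toy values)
q = {"a": 0.4, "b": 0.6}   # draft model (toy values)
trials = 20000
emitted_a = 0
for _ in range(trials):
    draft = rng.choices(list(q), weights=list(q.values()), k=1)[0]
    _, out = verify_draft_token(draft, p, q, rng)
    emitted_a += out == "a"
print(emitted_a / trials)  # empirical frequency of "a"; target probability is 0.7
```

Although every proposal comes from the draft distribution `q`, the emitted tokens follow the target distribution `p`. A full implementation repeats this check across the drafted positions and stops at the first rejection.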

### Min-p and Mirostat

Min-p sampling (2024) sets a dynamic per-step floor at a fraction of the top-token probability and discards tokens below that floor. Mirostat is a feedback-based sampler that adjusts its truncation threshold at each step to hold the surprise of generated tokens, and hence the perplexity of the output, near a target value, preventing both degeneracy and incoherence over long generations.
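
The min-p floor can be sketched as follows. The function name `min_p_filter` and the example values are illustrative, and Mirostat's feedback loop is not shown.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the top probability."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


print(min_p_filter([0.6, 0.3, 0.05, 0.05], min_p=0.1))
```

Because the floor scales with the top token's probability, a confident step prunes aggressively while an uncertain step (low top probability) lets more candidates through.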

## How are text generation models trained?

A typical pipeline has three stages, covered in detail under [large language model](/wiki/large_language_model).

### Pretraining

The model learns next-token prediction over trillions of tokens scraped and filtered from the web, books, code, and curated text. The [Chinchilla](/wiki/chinchilla) scaling study (Hoffmann et al., 2022) found that for fixed compute, model size and training tokens should scale roughly in proportion, putting the compute-optimal tokens-per-parameter ratio near 20.[^chinchilla] The authors stated that "for compute-optimal training, the model size and the number of training tokens should be scaled equally: for every doubling of model size the number of training tokens should also be doubled."[^chinchilla] Subsequent practice has pushed well beyond Chinchilla-optimal for inference efficiency: Llama 3 trained its 8B model on 15 trillion tokens (about 1,875 tokens per parameter), and Qwen3-0.6B trained on 36 trillion tokens (about 60,000 tokens per parameter). The key insight is that training beyond the compute-optimal point improves the deployed model's quality per inference FLOP, at the cost of extra training compute.
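
The arithmetic behind these ratios is simple enough to check directly. The sketch below uses only the figures quoted above; the constant name `CHINCHILLA_RATIO` and the helper function are illustrative.

```python
CHINCHILLA_RATIO = 20  # approximate compute-optimal tokens per parameter


def tokens_per_parameter(training_tokens, parameters):
    """Ratio of training tokens to model parameters."""
    return training_tokens / parameters


llama3_8b_ratio = tokens_per_parameter(15e12, 8e9)
chinchilla_tokens_8b = CHINCHILLA_RATIO * 8e9

print(llama3_8b_ratio)                     # tokens per parameter for Llama 3 8B
print(chinchilla_tokens_8b)                # compute-optimal token budget for 8B
print(llama3_8b_ratio / CHINCHILLA_RATIO)  # multiple of the Chinchilla ratio
```

At a ratio of 20, an 8B model would be trained on about 160 billion tokens; 15 trillion tokens is roughly 94 times that budget.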

Data quality matters as much as scale. Filtering pipelines remove boilerplate, near-duplicates, and low-quality web text; domain-weighted sampling up-mixes code, math, and scientific text. Recent flagship models report pretraining on tens of trillions of tokens across dozens of languages; Qwen3 covered approximately 119 languages.[^qwen3]

### Supervised fine-tuning (SFT)

The base model is adapted on curated instruction-response pairs, often crowdsourced or synthesized. Early examples include FLAN from Google and Alpaca from Stanford. SFT data has evolved from simple prompt-completion pairs to complex multi-turn conversations covering tool use, reasoning, and multimodal tasks. Synthetic data, generated by a strong teacher model and filtered for quality, has become a dominant source because it scales more cheaply than human annotation.[^instructsurvey]

### Preference optimization

A reward signal aligns outputs with human preferences. [RLHF](/wiki/rlhf) fits a reward model to ranked response pairs and optimizes the policy with PPO. Constitutional AI substitutes a written constitution and uses AI feedback (RLAIF). [Direct preference optimization (DPO)](/wiki/dpo) replaces the reward model and reinforcement-learning step with a single classification loss, making training more stable and reproducible. More recent variants include ORPO, GRPO (used by Qwen3 and DeepSeek-R1), and IPO.
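
To make the "single classification loss" concrete, the sketch below computes the DPO loss for one preference pair from sequence log-probabilities under the policy and a frozen reference model. The function and argument names are illustrative, and `beta=0.1` is an assumed value rather than one taken from this article.

```python
import math


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (chosen - rejected margin)).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) = log(1 + exp(-x)), evaluated in a numerically stable way
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


print(dpo_loss(-1.0, -1.0, -1.0, -1.0))  # policy equals reference
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))  # policy favors the chosen response
```

When the policy matches the reference the loss is log 2. It falls as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference, so no separate reward model or PPO loop is needed.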

Reasoning models such as o1, DeepSeek-R1, and the "thinking" variants of recent flagships add large-scale reinforcement learning on verifiable tasks (math, code) to elicit long chains of thought. DeepSeek-R1-Zero showed that RL on a base model alone, with no supervised reasoning data, can spontaneously produce self-correction and backtracking behavior.[^r1]

### Post-training data scale and quality

A persistent finding across the 2023 to 2025 literature is that a smaller, higher-quality SFT dataset often outperforms a larger noisy one. The LIMA paper (2023) argued that 1,000 carefully curated examples sufficed to produce competitive instruction-following behavior. Later scaling experiments have qualified this: quality-at-scale wins over quality-at-small-scale once models are large enough and the task distribution is wide enough, but the principle that annotation quality gates fine-tuning quality remains robust.

## What are the most notable text generation models?

Parameter counts are taken from official papers and model cards. "Undisclosed" indicates the developer has not published an official figure. For MoE models, totals are listed with active parameters per token where reported.

### Foundational and scaling-era models

| Model | Year | Organization | Parameters | Significance |
|---|---|---|---|---|
| [GPT-2](/wiki/gpt-2) | 2019 | OpenAI | 1.5B (largest) | Zero-shot transfer at scale |
| [T5](/wiki/t5) | 2019 | Google | up to 11B | Text-to-text unification |
| [GPT-3](/wiki/gpt-3) | 2020 | OpenAI | 175B | In-context few-shot learning |
| GPT-Neo / GPT-J | 2021 | EleutherAI | 6B (J) | Early open replication |
| [PaLM](/wiki/palm) | 2022 | Google | 540B | Pathways training, dense scale |
| [Chinchilla](/wiki/chinchilla) | 2022 | DeepMind | 70B | Compute-optimal scaling |
| OPT 175B | 2022 | Meta | 175B | Open replication of GPT-3 |
| [BLOOM](/wiki/bloom) | 2022 | BigScience | 176B | Open multilingual training (46 languages) |
| [InstructGPT](/wiki/instructgpt) | 2022 | OpenAI | 175B | RLHF for instruction following |
| [ChatGPT](/wiki/chatgpt) | 2022 | OpenAI | undisclosed | Mass-market chat product |
| [GPT-4](/wiki/gpt-4) | 2023 | OpenAI | undisclosed | Multimodal, professional-exam level |

### Open-weight foundation wave (2023 to 2024)

| Model | Year | Organization | Parameters | Significance |
|---|---|---|---|---|
| [LLaMA](/wiki/llama) | 2023 | Meta | 7B, 13B, 33B, 65B | Open-weight foundation models |
| [Llama 2](/wiki/llama_2) | 2023 | Meta | 7B, 13B, 70B | Commercial-friendly license, chat tuning |
| [Falcon](/wiki/falcon) | 2023 | TII | 7B, 40B, 180B | Open large model on RefinedWeb |
| [Mistral 7B](/wiki/mistral) | 2023 | Mistral AI | 7B | Sliding-window attention, GQA |
| [Mixtral 8x7B](/wiki/mixtral) | 2023 | Mistral AI | 47B total, 13B active | Sparse mixture of experts |
| [Llama 3 / 3.1](/wiki/llama_3) | 2024 | Meta | 8B, 70B, 405B | Open frontier-scale model |
| [Qwen 2.5](/wiki/qwen) | 2024 | Alibaba | 0.5B to 72B | 18T-token pretraining |
| DeepSeek-V3 | 2024 | DeepSeek | 671B total, 37B active | MoE with multi-token prediction |

### Current proprietary frontier models (2025 to 2026)

As of May 2026, the leading closed-weight families are OpenAI's GPT-5 series, Anthropic's Claude Opus 4.x, Google's Gemini 3.x, and xAI's Grok 4.x. All are multimodal to varying degrees and ship "thinking" or reasoning modes.

| Model | Released | Organization | Context window | Notes |
|---|---|---|---|---|
| [GPT-5](/wiki/gpt-5) | Aug 7, 2025 | OpenAI | 400K | Unified system with real-time router[^gpt5] |
| [GPT-5.1](/wiki/gpt-5.1) | Nov 2025 | OpenAI | 400K | Instant and Thinking modes[^gpt51] |
| [GPT-5.2](/wiki/gpt-5.2) | Dec 11, 2025 | OpenAI | 400K (128K output) | Knowledge cutoff Aug 31, 2025[^gpt52] |
| [Claude Opus 4](/wiki/claude_opus_4) | May 2025 | Anthropic | 200K | Agentic, hybrid reasoning[^claude4] |
| Claude Opus 4.5 | Nov 24, 2025 | Anthropic | 200K (64K output) | First model above 80% on SWE-bench Verified (80.9%); "effort" parameter[^opus45] |
| Claude Opus 4.8 | May 28, 2026 | Anthropic | 200K | Latest flagship; gains in coding and honesty[^opus48] |
| [Gemini 3 Pro](/wiki/gemini_3) | Nov 18, 2025 | Google DeepMind | 1M | Topped LMArena at 1501 Elo at launch[^gemini3] |
| Gemini 3.1 Pro | Feb 19, 2026 | Google DeepMind | 1M (64K output) | 94.3% GPQA Diamond, 80.6% SWE-bench Verified[^gemini31] |
| [Grok 4](/wiki/grok) | Jul 9, 2025 | xAI | 256K | Native tool use; trained on Colossus cluster[^grok4] |
| Grok 4.1 | Nov 2025 | xAI | 256K | Led LMArena Text Arena (1483 Elo, Thinking)[^grok41] |
| Grok 4.3 | Apr 30, 2026 | xAI | 1M | Adds native video input[^grok43] |

### Current open-weight models (2025 to 2026)

Open-weight (downloadable) models from Meta, DeepSeek, Alibaba, Mistral AI, and Moonshot AI have approached frontier quality, and most are released under permissive licenses (Apache 2.0 or MIT).

| Model | Released | Organization | Parameters | License | Notes |
|---|---|---|---|---|---|
| [Llama 4](/wiki/llama_4) Scout / Maverick | Apr 5, 2025 | Meta | 109B/17B active; 400B/17B active | Llama 4 Community | First Meta MoE; Scout has 10M-token context[^llama4] |
| [Qwen3](/wiki/qwen) | Apr 29, 2025 | Alibaba | 235B total, 22B active (plus dense 0.6B to 32B) | Apache 2.0 | Pretrained on approximately 36T tokens, 119 languages[^qwen3] |
| [Kimi K2](/wiki/kimi) | Jul 2025 | Moonshot AI | 1T total, 32B active | Modified MIT | Trillion-parameter open MoE for agentic and coding tasks[^kimi] |
| [DeepSeek-V3.2](/wiki/deepseek) | Dec 1, 2025 | DeepSeek | 671B total, 37B active | MIT | DeepSeek Sparse Attention; 160K context[^dsv32] |
| Mistral Large 3 | Dec 2, 2025 | Mistral AI | 675B total, 41B active | Apache 2.0 | 256K context; trained on roughly 3,000 H200 GPUs[^mistral3] |
| DeepSeek-V4 (preview) | Apr 24, 2026 | DeepSeek | V4-Pro 1.6T; V4-Flash 284B | MIT | 1M-token context; Compressed Sparse Attention[^dsv4] |

### Small and on-device models

Not all deployment contexts require or can afford frontier-scale models. A distinct tier of small language models targets on-device, edge, and low-latency use cases.

| Model family | Organization | Size range | Notable features |
|---|---|---|---|
| [Gemma](/wiki/gemma) 3 / 4 | Google DeepMind | 1B to 27B | On-device optimized; Gemma 4 E2B/E4B use per-layer embeddings for efficiency |
| [Phi](/wiki/phi)-4-mini / Phi-4-multimodal | Microsoft | 3.8B (text), 5.6B (multimodal) | Strong math and reasoning at small scale; trained primarily on synthetic data |
| Qwen3 dense | Alibaba | 0.6B to 32B | Apache 2.0; strong multilingual performance down to 600M parameters |
| Llama 3.2 | Meta | 1B, 3B | Official small-scale Llama; runs on mobile devices |
| SmolLM | Hugging Face | 135M to 1.7B | Ultra-compact models for browser and microcontroller deployment |

Sub-2B models suit IoT and mobile inference; 3B to 5B models run on consumer laptops; models of 9B and above generally need server-class hardware for low-latency serving. Small models also serve as drafters for speculative decoding, accelerating larger models in the same family; Gemma 4 supports this by shipping dedicated Multi-Token Prediction drafters.

## How are text generation models evaluated?

Text generation models are evaluated on a mix of knowledge, code, math, agentic, and human-preference benchmarks. As frontier models saturate older tests, harder successors and live arenas have taken over.

| Benchmark | Year | What it measures | Status (2026) |
|---|---|---|---|
| [MMLU](/wiki/mmlu) | 2020 | 57 subjects, multiple choice | Largely saturated; MMLU-Pro is the harder successor |
| [HellaSwag](/wiki/hellaswag) | 2019 | Commonsense sentence completion | Near ceiling for frontier models |
| [TruthfulQA](/wiki/truthfulqa) | 2021 | Resistance to common falsehoods | Probes imitative falsehoods |
| [HumanEval](/wiki/humaneval) | 2021 | Python function synthesis | Mostly saturated; GPT-5.3 Codex at 93% |
| [GSM8K](/wiki/gsm8k) | 2021 | Grade-school math word problems | At 99% for frontier models; no longer differentiating |
| [HELM](/wiki/helm) | 2022 | Holistic, multi-metric evaluation | Stanford CRFM; still used for broad comparisons |
| [BIG-Bench](/wiki/big_bench) | 2022 | 200+ diverse tasks | Crowd-sourced; BIG-Bench Hard (BBH) remains active |
| [MT-Bench](/wiki/mt_bench) | 2023 | Multi-turn open-ended chat | LLM-as-judge with GPT-4 |
| LMArena (Chatbot Arena) | 2023 | Crowd-sourced pairwise voting | Live Elo leaderboard; most widely cited human-preference ranking |
| GPQA Diamond | 2023 | Graduate-level science questions | Frontier models now exceed 90%; Gemini 3.1 Pro at 94.3% |
| SWE-bench Verified | 2024 | Real-world software-engineering fixes | Leading agentic-coding metric; top models above 80% |
| ARC-AGI-2 | 2025 | Abstract visual reasoning; near-zero base rate | Designed to resist memorization; Gemini 3.1 Pro reported 77.1% |
| Humanity's Last Exam | 2025 | Expert-level questions across many fields | Grok 4 Heavy exceeded 50% in multi-agent mode |
| FrontierMath | 2024 | Unpublished research-level math problems | GPT-5.5 at 51.7% on tiers 1 to 3 |

### Evaluation metrics

Beyond accuracy on fixed benchmarks, text generation quality is measured with several complementary metrics.

**Perplexity** is the exponentiated average negative log-likelihood per token on a held-out corpus. It measures how well the model predicts natural text and is used during pretraining to track progress and compare model checkpoints on the same data distribution.
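
As a worked illustration of that definition, the sketch below computes perplexity from per-token log-probabilities (natural log). The function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 1/4 to every observed token.
print(perplexity([math.log(0.25)] * 8))
```

A model that spreads its probability uniformly over four choices at every step has a perplexity of 4, so perplexity can be read as an effective branching factor: lower means the model is less "surprised" by the held-out text.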

**BLEU and ROUGE** are n-gram overlap metrics historically used for machine translation and summarization. They correlate weakly with human judgment on open-ended generation, where they have largely been replaced by model-based evaluators.

**Win rates and Elo.** Human or LLM-judged pairwise comparisons between responses yield Elo ratings on platforms such as LMArena. Elo captures holistic quality and is harder to game than fixed benchmarks, but requires large numbers of comparisons and is sensitive to the question mix.
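
The classic online Elo update gives a feel for how pairwise votes become ratings. This is a sketch of the textbook formula with an assumed K-factor of 32; it is not a description of how any particular leaderboard computes its published scores.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one comparison.

    score_a is 1.0 if A's response won, 0.0 if it lost, 0.5 for a tie.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


print(elo_update(1500, 1500, score_a=1.0))  # evenly matched, A wins
print(expected_score(1501, 1483))           # win probability for an 18-point gap
```

An upset moves ratings more than an expected result, and a gap of a few tens of points corresponds to a win probability only slightly above one half, which is why many comparisons are needed to separate closely ranked models.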

**LLM-as-judge.** Using a strong model such as GPT-4 to rate or rank outputs has become common for evaluating instruction-following, factuality, and format adherence. MT-Bench pioneered this approach; it has since been adopted widely for rapid offline evaluation.

**Contamination and saturation.** A persistent concern is that public benchmark questions may appear in pretraining corpora, inflating scores. Dynamic evaluations (held-out private test sets, live arenas, and freshly generated question banks) partially address contamination. Multiple classical benchmarks are now at ceiling for frontier models, which has driven the creation of harder successors and live arenas as the primary differentiators.

## What are text generation models used for?

Text generation models underpin a broad range of products and workflows.

### Conversational assistants

[ChatGPT](/wiki/chatgpt), [Claude](/wiki/claude), Gemini, Grok, and Copilot expose general-purpose chat over the web and APIs. Conversational systems handle tasks from drafting emails and answering factual questions to multi-step research, creative writing, and roleplay. Context windows of 200K to 1M tokens let users share entire codebases, documents, or long conversation histories in a single prompt.

### Code generation and software engineering

GitHub Copilot, Cursor, and similar tools wrap models such as GPT-5, Claude, and specialized derivatives (Codex, Code Llama, Qwen-Coder, DeepSeek-Coder) to produce, edit, explain, and debug source code. SWE-bench Verified, measuring resolution of real open-source GitHub issues, has become the canonical benchmark for this use case. Claude Opus 4.5 reached 80.9% on SWE-bench Verified in November 2025, the first reported score above 80% on the benchmark.[^opus45]

### Writing assistance

Drafting, editing, and rewriting features are integrated into Gmail, Notion, Microsoft 365, and Google Workspace. These pipelines usually embed the model inside a product that provides formatting context and constrained output requirements, often via system prompts and structured output modes.

### Summarization and translation

Long-document condensation, meeting recaps, legal brief summaries, and multilingual translation are mature applications. Models with million-token context windows can summarize entire books or transcripts without chunking. For major language pairs, their [machine translation](/wiki/machine_translation) quality rivals that of dedicated translation systems.

### Retrieval-augmented generation (RAG)

[Retrieval-augmented generation](/wiki/retrieval_augmented_generation) augments a generation model with a retrieval system that fetches relevant passages from a knowledge base or the web before generation. This addresses the knowledge-cutoff limitation and grounds outputs in verifiable sources. RAG has become standard in enterprise search, customer support, and research assistants. Agentic RAG extends the idea to multi-step pipelines where the model autonomously formulates retrieval queries, refines them based on results, and integrates multiple sources.[^agentic_rag]
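
The retrieve-then-generate flow can be sketched with a toy retriever. Everything here is a simplifying assumption: word overlap stands in for a real search index or embedding model, the prompt wording is invented for illustration, and the final call to a generation model is left out.

```python
def retrieve(query, documents, top_n=2):
    """Rank documents by word overlap with the query (toy retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_n]


def build_prompt(query, documents, top_n=2):
    """Assemble a grounded prompt: numbered sources first, then the question."""
    passages = retrieve(query, documents, top_n)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


docs = [
    "RoPE is a rotary position embedding",
    "Beam search keeps several hypotheses",
    "Perplexity measures predictive fit",
]
print(build_prompt("what is rotary position embedding", docs, top_n=1))
```

The resulting string would be passed to the text generation model as its prompt. In an agentic variant, the model itself would issue and refine the `retrieve` calls over several steps.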

### Autonomous agents

Multi-step tool use, browser automation, code execution, and file management drive agentic products built on reasoning-capable flagships. Models like Kimi K2 and Claude Opus 4.5 are explicitly positioned for autonomous software engineering tasks that require sustained, goal-directed operation over tens or hundreds of steps. The SWE-bench and Terminal-Bench evaluations track this frontier.

### Scientific and professional tasks

Text generation models assist in literature review, hypothesis generation, grant writing, legal document drafting, medical note-taking, and clinical decision support. These applications require high factual reliability and often combine the model with [RAG](/wiki/retrieval_augmented_generation), citation grounding, or specialized fine-tuning on domain data.

## What are the limitations of text generation models?

Despite rapid progress, text generation models share several well-documented weaknesses, discussed at length under [large language model](/wiki/large_language_model).

### Hallucination

Models produce fluent but factually incorrect statements, particularly for rare entities, fresh events, or detailed numerical claims. A 2025 survey found hallucination to be one of the three most studied limitations in the LLM literature, alongside reasoning failures and generalization gaps.[^llimits] Reliability has improved across model generations through RLHF, RAG integration, and citation-grounded output, but the problem is not solved. A theoretical result (Xu et al., 2024) argues that hallucination cannot be fully eliminated: under a formal definition, any computable language model fails to reproduce the ground truth on some inputs.[^hallinevitable]

### Knowledge cutoff

Pretraining data has a fixed end date, so models lack information on later events unless augmented with retrieval or tools. Knowledge cutoffs range from months to over a year before a model's release. Web-search plugins, RAG, and real-time tool-use APIs address this for well-resourced deployments but not for offline or privacy-sensitive ones.

### Context limits and degradation

Even million-token windows lose needle-in-a-haystack accuracy as inputs grow: models tend to miss information placed in the middle of a very long context, far from both the beginning and the end. Inference cost also grows with sequence length, since the key-value cache grows linearly and dense attention compute grows quadratically with the number of tokens, making long-context use expensive at scale.

### Compute cost and energy

Training frontier models costs tens to hundreds of millions of dollars and consumes large amounts of accelerator time and energy. Inference at scale is also expensive: a single GPT-5-class response can consume hundreds of times the energy of a keyword search query. Speculative decoding, quantization, and mixture-of-experts activation sparsity partially mitigate inference cost, but energy consumption remains a concern.

### Alignment, safety, and sycophancy

Models can be jailbroken through adversarial prompting to bypass safety filters. [Sycophancy](/wiki/sycophancy) (telling users what they want to hear rather than the truth) is a persistent alignment failure mode: RLHF optimizes for human approval, and approval correlates with agreement. Models also exhibit prompt sensitivity: small changes in phrasing can flip answers or change quality markedly.

### Bias

Outputs reflect biases in training data across demographics, languages, and viewpoints. Under-represented languages and cultures receive lower-quality outputs. Political and social topics exhibit systematic skews inherited from the web corpus.

### Training-data contamination

Public benchmarks may leak into pretraining corpora, inflating reported scores. This is a structural problem: once a benchmark is published, it becomes part of the web and is likely included in future training runs. Dynamic and private evaluations, such as LMArena and held-out test banks, partially mitigate this risk, but contamination is difficult to audit with certainty.
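One common heuristic for auditing contamination is to look for long word n-grams shared between benchmark items and the training corpus. The sketch below is a minimal illustration of that idea, not the procedure used by any particular lab; the function names, the default `n=8`, and the toy data are assumptions made for the example.

```python
import re


def ngrams(text, n):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, corpus_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    if not benchmark_items:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items)


if __name__ == "__main__":
    # Toy data: the first item appears verbatim in the "corpus", the second does not.
    items = [
        "What is the capital of France and why is it famous",
        "Compute the sum of two and three",
    ]
    corpus = ["trivia dump: what is the capital of France and why is it famous? Paris"]
    print(contamination_rate(items, corpus, n=5))
```

Exact n-gram matching only catches verbatim leakage. Paraphrased or translated benchmark items pass undetected, which is one reason contamination is hard to rule out.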

### Alignment-capability trade-off

A finding reported across multiple 2024 and 2025 studies is that aggressive preference optimization can degrade certain capabilities: heavily RLHF-aligned models sometimes become overly cautious, refuse benign requests, or lose nuanced reasoning ability.[^aligntax] Balancing helpfulness, harmlessness, and capability remains an active research and engineering challenge.

## See also

* [Large language model](/wiki/large_language_model)
* [Language model](/wiki/language_model)
* [GPT](/wiki/gpt)
* [Transformer](/wiki/transformer)
* [Generative AI](/wiki/generative_ai)
* [Reasoning models](/wiki/reasoning_models)
* [Natural language processing models](/wiki/natural_language_processing_models)
* [Text2text generation models](/wiki/text2text_generation_models)
* [Retrieval-augmented generation](/wiki/retrieval_augmented_generation)
* [RLHF](/wiki/rlhf)
* [Instruction tuning](/wiki/instruction_tuning)

## References

[^bahdanau]: Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/abs/1409.0473 Accessed 2026-05-31.
[^attention]: Vaswani, A. et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762 Accessed 2026-06-28.
[^gpt3]: Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). https://arxiv.org/abs/2005.14165 Accessed 2026-06-28.
[^instructgpt]: Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). https://arxiv.org/abs/2203.02155 Accessed 2026-05-31.
[^cai]: Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. https://arxiv.org/abs/2212.08073 Accessed 2026-05-31.
[^gpt4]: OpenAI (2023). GPT-4 Technical Report. https://arxiv.org/abs/2303.08774 Accessed 2026-05-31.
[^llama]: Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971 Accessed 2026-05-31.
[^dsv3]: DeepSeek-AI (2024). DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437 Accessed 2026-05-31.
[^holtzman]: Holtzman, A. et al. (2020). The Curious Case of Neural Text Degeneration. https://arxiv.org/abs/1904.09751 Accessed 2026-06-28.
[^chinchilla]: Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). https://arxiv.org/abs/2203.15556 Accessed 2026-06-28.
[^instructsurvey]: Zhang, S. et al. (2023). Instruction Tuning for Large Language Models: A Survey. https://arxiv.org/abs/2308.10792 Accessed 2026-05-31.
[^r1]: DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/abs/2501.12948 Accessed 2026-05-31.
[^acs]: Guo, Y. et al. (2024). Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation. https://arxiv.org/abs/2407.18698 Accessed 2026-05-31.
[^agentic_rag]: Shi, J. et al. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. https://arxiv.org/abs/2501.09136 Accessed 2026-05-31.
[^llimits]: Kostikova, A. et al. (2025). LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models. https://arxiv.org/abs/2505.19240 Accessed 2026-05-31.
[^hallinevitable]: Xu, Z. et al. (2024). Hallucination is Inevitable: An Innate Limitation of Large Language Models. https://arxiv.org/abs/2401.11817 Accessed 2026-05-31.
[^aligntax]: Wolf, Y. et al. (2023). Fundamental Limitations of Alignment in Large Language Models. https://arxiv.org/abs/2304.11082 Accessed 2026-05-31.
[^gpt5]: OpenAI (2025). Introducing GPT-5. https://openai.com/index/introducing-gpt-5/ Accessed 2026-05-31.
[^gpt51]: OpenAI (2025). GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum. https://openai.com/index/gpt-5-system-card-addendum-gpt-5-1/ Accessed 2026-05-31.
[^gpt52]: OpenAI (2025). Introducing GPT-5.2. https://openai.com/index/introducing-gpt-5-2/ Accessed 2026-05-31.
[^claude4]: Anthropic (2025). Introducing Claude 4. https://www.anthropic.com/news/claude-4 Accessed 2026-05-31.
[^opus45]: Anthropic (2025). Introducing Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5 Accessed 2026-05-31.
[^opus48]: Anthropic (2026). Introducing Claude Opus 4.8. https://www.anthropic.com/news/claude-opus-4-8 Accessed 2026-05-31. See also TechCrunch (2026): https://techcrunch.com/2026/05/28/anthropic-releases-opus-4-8-with-dynamic-workflow-tool/ Accessed 2026-05-31.
[^gemini3]: Google (2025). Gemini 3: Introducing the latest Gemini AI model from Google. https://blog.google/products/gemini/gemini-3/ Accessed 2026-05-31.
[^gemini31]: Google DeepMind (2026). Gemini 3.1 Pro Model Card. https://deepmind.google/models/model-cards/gemini-3-1-pro/ Accessed 2026-05-31.
[^grok4]: xAI (2025). Grok 4. https://x.ai/news/grok-4 Accessed 2026-05-31.
[^grok41]: xAI (2025). Grok 4.1. https://x.ai/news/grok-4-1 Accessed 2026-05-31.
[^grok43]: Codersera (2026). Grok 4.3: xAI's Cheap Frontier Model (May 2026 Guide). https://codersera.com/blog/grok-4-3-launch-guide-2026/ Accessed 2026-05-31.
[^llama4]: Meta (2025). The Llama 4 herd. https://www.llama.com/models/llama-4/ Accessed 2026-05-31.
[^qwen3]: Qwen Team (2025). Qwen3: Think Deeper, Act Faster. https://qwenlm.github.io/blog/qwen3/ Accessed 2026-05-31.
[^kimi]: Moonshot AI (2025). Kimi K2: Open Agentic Intelligence. https://moonshotai.github.io/Kimi-K2/ Accessed 2026-05-31.
[^dsv32]: DeepSeek-AI (2025). DeepSeek-V3.2 (model card). https://huggingface.co/deepseek-ai/DeepSeek-V3.2 Accessed 2026-05-31.
[^mistral3]: Mistral AI (2025). Introducing Mistral 3. https://mistral.ai/news/mistral-3/ Accessed 2026-05-31.
[^dsv4]: MIT Technology Review (2026). Three reasons why DeepSeek's new model matters. https://www.technologyreview.com/2026/04/24/1136422/why-deepseeks-v4-matters/ Accessed 2026-05-31.