Text Generation Models
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v4 · 4,973 words
Text generation models are language models trained to produce coherent natural-language text, typically by predicting tokens one at a time conditioned on preceding context. They form the dominant class of modern generative AI systems and power applications such as conversational assistants, code editors, search summaries, and autonomous agents. The category is dominated by decoder-only autoregressive transformer models, with smaller contributions from encoder-decoder architectures covered separately on the text2text generation models page.
This article is a survey and catalog of notable text-generation models across eras, from statistical n-gram systems to the current frontier. It complements the large language model article, which treats the underlying concepts (architecture, tokenization, scaling laws, training, and inference) in depth. For the mechanics of how these systems work, see that article and GPT; this page focuses on the models themselves and how the landscape has evolved.
Text generation has progressed from statistical n-gram models to deep neural language models within roughly two decades.
Early generators relied on n-gram counts that estimated the probability of the next word from short windows of preceding words. These models powered speech recognition decoders and statistical machine translation systems but produced ungrammatical text beyond a few words. Smoothing techniques such as Kneser-Ney addressed the sparsity of unseen word sequences but could not capture long-range dependencies.
Tomas Mikolov and colleagues introduced recurrent neural network language models in 2010, replacing fixed-context counts with learned hidden states. Long short-term memory (LSTM) networks, originally proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997, were widely adopted for language modeling once GPU training matured. In 2014, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio added soft attention to sequence-to-sequence translation [1], allowing the decoder to align with arbitrary positions in the source. Their work seeded the attention research line that produced the transformer.
In June 2017, Ashish Vaswani and seven Google co-authors published "Attention Is All You Need" [2], proposing the transformer architecture built entirely on self-attention. OpenAI applied a decoder-only transformer to language modeling in GPT-1 (June 2018, 117 million parameters) and scaled it to GPT-2 in February 2019 (largest variant 1.5 billion parameters). GPT-3, released in May 2020 with 175 billion parameters, demonstrated in-context few-shot learning across translation, question answering, and arithmetic tasks [3]. Google released T5 in October 2019, framing every NLP task as text-to-text and reaching up to 11 billion parameters.
Long Ouyang and colleagues at OpenAI published InstructGPT in March 2022 [4], introducing reinforcement learning from human feedback (RLHF) for instruction following; they reported that a 1.3 billion parameter InstructGPT was preferred to the 175 billion parameter GPT-3 base. Anthropic followed with Constitutional AI in December 2022 [5], training models against written principles using AI feedback. OpenAI launched ChatGPT on November 30, 2022; it reached an estimated 100 million monthly users within two months, the fastest-growing consumer application at the time. GPT-4, released March 14, 2023, accepted images alongside text and improved markedly on professional exams [6].
Meta released LLaMA on February 24, 2023, with weights available to researchers under a non-commercial license [7]. Llama 2 (July 18, 2023) shifted to a more permissive license and added chat-tuned variants. Mistral 7B (October 2023) and Mixtral 8x7B (December 2023) from Mistral AI demonstrated that compact dense models and sparse mixture-of-experts models could match much larger predecessors. Meta released Llama 3 on April 18, 2024 (8B and 70B), followed by Llama 3.1 with a 405 billion parameter flagship on July 23, 2024. Anthropic released the Claude 3 family on March 4, 2024 with a 200,000 token context window; Google announced Gemini 1.0 in December 2023 and Gemini 1.5 Pro in February 2024 with a one million token context window. DeepSeek released DeepSeek-V3 on December 26, 2024, a 671 billion parameter mixture-of-experts model with 37 billion active parameters per token [8].
In September 2024, OpenAI released o1, the first widely deployed "reasoning model" trained to perform extended chain-of-thought before answering, followed by o3. DeepSeek released the open-weight reasoning model DeepSeek-R1 in January 2025. OpenAI shipped GPT-5 on August 7, 2025 as a unified system with a real-time router that selects between a fast model and a deeper reasoning model [9]. Anthropic, Google, and xAI released successive frontier models through 2025 and 2026, while open-weight labs (Meta, DeepSeek, Alibaba, Mistral, Moonshot AI) narrowed the gap. These developments are cataloged below.
Depth on architecture lives in large language model; this section summarizes the variants relevant to a model catalog.
Decoder-only causal language models are the dominant paradigm. A stack of transformer blocks processes tokens left to right, with each block computing masked self-attention so that position t only attends to positions 1 through t. The final hidden state is projected to a vocabulary distribution and trained with next-token cross-entropy loss. GPT, Claude, Gemini, Grok, Llama, Mistral, Qwen, and DeepSeek are all decoder-only.
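The masking rule and the training loss described above can be sketched in a few lines. This is a minimal illustration using plain Python lists, not any particular framework's implementation; the function names are hypothetical, and positions are 0-indexed here whereas the text counts from 1.

```python
import math

def causal_mask(n):
    """Entry [t][s] is True when position t is allowed to attend to position s."""
    return [[s <= t for s in range(n)] for t in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, giving zero weight to masked positions."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(vocab_probs, target_index):
    """Cross-entropy for a single position: negative log-probability of the true next token."""
    return -math.log(vocab_probs[target_index])

mask = causal_mask(3)
weights = masked_softmax([1.0, 2.0, 3.0], mask[1])  # position 1 sees positions 0 and 1 only
```

In this sketch the third attention weight is exactly zero because position 1 cannot look ahead to position 2.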
Encoder-decoder generators split the work: an encoder reads the input bidirectionally and a decoder generates output autoregressively while attending to encoder states. T5 and BART follow this design. Encoder-decoder models remain common in machine translation and summarization but have been overtaken by decoder-only models for general text generation.
Several refinements are now standard in production models:
At inference time, a text generation model holds a probability distribution over the vocabulary at each step. The decoding strategy determines which token is actually selected, making it a critical factor in quality, diversity, and speed.
Greedy decoding selects the single highest-probability token at every step. It is deterministic, fast, and coherent for short completions, but tends to produce repetitive, locally optimal text that degrades over longer generations.
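As a minimal sketch of greedy decoding, the loop below repeatedly appends the most probable next token. The toy bigram table `TOY_MODEL`, the `<eos>` marker, and the function names are assumptions made for illustration; a real model would return a distribution over its full vocabulary.

```python
# Hypothetical toy "model": next-token probabilities keyed by the previous token.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"ran": 0.9, "sat": 0.1},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
}

def toy_next_token_probs(tokens):
    return TOY_MODEL[tokens[-1]]

def greedy_decode(next_token_probs, prompt, max_new_tokens, eos="<eos>"):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # always take the single most likely token
        tokens.append(best)
        if best == eos:
            break
    return tokens

result = greedy_decode(toy_next_token_probs, ["the"], max_new_tokens=5)
```

Running the same call twice gives the same sequence, which is the determinism the paragraph above refers to.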
Beam search maintains a fixed number (the beam width) of the highest-scoring partial sequences and expands each in parallel. It was the standard decoding method in neural machine translation and encoder-decoder models. Holtzman and colleagues (2020) showed that maximization objectives such as beam search cause "degeneration" in open-ended generation: outputs become repetitive and incoherent precisely because the model assigns high probability to flat, generic continuations [11].
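A small sketch can show how keeping several partial sequences lets beam search find a continuation that greedy selection misses. The toy probability table and all names below are hypothetical; scores are summed log-probabilities, and no length normalization is applied.

```python
import math

# Hypothetical toy "model": next-token probabilities keyed by the previous token.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.4, "ran": 0.3, "slept": 0.3},
    "dog": {"ran": 0.9, "sat": 0.1},
    "sat": {"<eos>": 1.0},
    "ran": {"<eos>": 1.0},
    "slept": {"<eos>": 1.0},
}

def toy_next_token_probs(tokens):
    return TOY_MODEL[tokens[-1]]

def beam_search(next_token_probs, prompt, beam_width, max_new_tokens, eos="<eos>"):
    beams = [(0.0, list(prompt))]  # (cumulative log-probability, tokens)
    for _ in range(max_new_tokens):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == eos:  # finished sequences are carried forward unchanged
                candidates.append((score, tokens))
                continue
            for tok, p in next_token_probs(tokens).items():
                candidates.append((score + math.log(p), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]  # keep only the best partial sequences
        if all(tokens[-1] == eos for _, tokens in beams):
            break
    return beams

best_score, best_tokens = beam_search(toy_next_token_probs, ["the"], beam_width=2, max_new_tokens=5)[0]
```

With a beam width of 2 the search returns "the dog ran" (probability 0.36), while a width of 1 reduces to greedy decoding and returns "the cat sat" (probability 0.24).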
Temperature scaling divides each logit by a scalar T before applying the softmax. High temperatures (T greater than 1) flatten the distribution, increasing diversity and sometimes creativity at the cost of coherence. Low temperatures (T approaching 0) sharpen it, approaching greedy behavior. Temperature is a knob available in nearly every production API.
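The effect of the temperature parameter can be checked numerically. The sketch below is illustrative only; the logits are made-up values and `softmax_with_temperature` is a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T -> 0 corresponds to greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)    # low T: mass concentrates on the top token
neutral = softmax_with_temperature(logits, 1.0)  # T = 1: the unmodified softmax
flat = softmax_with_temperature(logits, 2.0)     # high T: distribution flattens
```

The probability of the top token shrinks as T grows, which is the flattening described above.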
Top-k sampling restricts sampling to the k most probable tokens at each step, redistributing the remaining probability mass among the top k. This avoids sampling from the long tail of unlikely tokens but uses a fixed-size window even when the distribution is either flat or sharply peaked.
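A minimal sketch of the truncation and renormalization step, assuming the model's distribution is given as a dictionary from token to probability (the tokens and values here are invented for illustration):

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def sample(probs, rng=random):
    """Draw one token according to the (filtered) distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens])[0]

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_k_filter(probs, k=2)          # only "a" and "b" remain
token = sample(filtered, random.Random(0))   # seeded generator for repeatability
```

Note that k stays fixed at 2 whether the original distribution is peaked or flat, which is the limitation the paragraph above points out.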
Top-p (nucleus) sampling, introduced in the same 2020 paper by Holtzman et al. [11], dynamically selects the smallest set of tokens whose cumulative probability mass exceeds a threshold p. Because the set expands when the distribution is flat and contracts when it is peaked, nucleus sampling adapts more naturally to the model's uncertainty. It is the most widely adopted open-ended generation strategy and is the default in many frameworks.
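The adaptive behavior of the cutoff can be seen in a short sketch. The distributions below are invented, and this version stops as soon as the cumulative mass reaches p (using greater-than-or-equal), which is one common convention and an assumption here.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability prefix whose cumulative mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {tok: prob / cumulative for tok, prob in kept}

peaked = {"a": 0.9, "b": 0.05, "c": 0.03, "d": 0.02}
flat = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

small_nucleus = top_p_filter(peaked, p=0.8)  # one token already covers 0.8
large_nucleus = top_p_filter(flat, p=0.8)    # all four tokens are needed
```

The same threshold yields a one-token set for the peaked distribution and a four-token set for the flat one.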
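Contrastive search chooses, among the most probable candidates at each step, the token that best balances model confidence against a degeneration penalty: the maximum similarity between the candidate's representation and the representations of tokens already in the context. Tokens that are too similar to recently generated context are penalized, which discourages repetition. (The related contrastive decoding method instead contrasts the predictions of a strong target model against those of a weaker reference model.) Adaptive Contrastive Search (2024) extends contrastive search by modulating the degeneration penalty according to the model's estimated per-step uncertainty, reducing repetition without sacrificing coherence [12].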
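The degeneration penalty can be sketched as follows, assuming the formulation commonly given in the contrastive search literature: each candidate is scored by its model probability minus a weighted maximum cosine similarity to earlier token representations. The two-dimensional vectors, the weight `alpha`, and the function names are illustrative assumptions, not values from any real model.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_score(prob, candidate_vec, context_vecs, alpha):
    """Model confidence minus a penalty for resembling tokens already in the context."""
    penalty = max(cosine(candidate_vec, h) for h in context_vecs)
    return (1 - alpha) * prob - alpha * penalty

def pick_token(candidates, context_vecs, alpha=0.6):
    """candidates maps token -> (probability, hypothetical representation vector)."""
    return max(
        candidates,
        key=lambda tok: contrastive_score(candidates[tok][0], candidates[tok][1], context_vecs, alpha),
    )

context_vecs = [[1.0, 0.0]]  # representation of the text generated so far
candidates = {
    "again": (0.6, [1.0, 0.0]),  # likely, but identical to the context
    "new": (0.4, [0.0, 1.0]),    # less likely, but dissimilar
}
choice = pick_token(candidates, context_vecs, alpha=0.6)
```

With `alpha=0.6` the less probable but novel token wins; setting `alpha=0` removes the penalty and recovers the greedy choice.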
Speculative decoding accelerates inference without changing the output distribution. A smaller, fast "draft" model generates several candidate tokens; the larger target model then verifies them in a single forward pass, accepting the draft tokens that are consistent with its own distribution and resampling at the first rejected position. Because LLM inference is often memory-bandwidth-bound rather than compute-bound, verifying several tokens in one pass costs little more than generating one, so the speedup grows with the number of draft tokens accepted per pass. Speculative decoding is widely deployed in production systems.
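The verification step for a single draft token can be sketched with the rejection-sampling rule used in the speculative sampling literature: accept the token with probability min(1, p_target / p_draft), otherwise resample from the normalized positive part of (p_target - p_draft). That rule is an assumption brought in for illustration, since the text above does not spell it out; the distributions and names are hypothetical, and batching across several draft tokens is omitted.

```python
import random

def residual_distribution(target, draft):
    """Normalize max(0, target - draft); used to resample after a rejection."""
    diffs = {t: max(0.0, p - draft.get(t, 0.0)) for t, p in target.items()}
    total = sum(diffs.values())
    return {t: d / total for t, d in diffs.items() if d > 0.0}

def verify_draft_token(token, target, draft, rng):
    """Return (emitted_token, accepted) for one token proposed by the draft model."""
    accept_prob = min(1.0, target.get(token, 0.0) / draft[token])
    if rng.random() < accept_prob:
        return token, True
    residual = residual_distribution(target, draft)
    replacement = rng.choices(list(residual), weights=list(residual.values()))[0]
    return replacement, False

target = {"a": 0.5, "b": 0.5}  # hypothetical target-model distribution
draft = {"a": 0.9, "b": 0.1}   # hypothetical draft-model distribution
emitted, accepted = verify_draft_token("b", target, draft, random.Random(0))
```

Under this rule the emitted tokens follow the target distribution even though proposals come from the draft model, which is why the technique leaves output quality unchanged.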
Min-p sampling (2024) sets a dynamic per-step floor at a fraction of the top-token probability, cutting tokens below that floor. Mirostat is a feedback-based sampler that adjusts its truncation threshold at each step to hold the surprise (and therefore the perplexity) of the generated text near a target value, preventing both degeneracy and incoherence over long generations.
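The min-p floor is simple to express in code. In this sketch the parameter name `min_p` and the example distributions are assumptions chosen for illustration.

```python
def min_p_filter(probs, min_p):
    """Drop tokens whose probability is below min_p times the top-token probability."""
    floor = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= floor}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

peaked = {"a": 0.8, "b": 0.15, "c": 0.05}
flat = {"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}

strict = min_p_filter(peaked, min_p=0.1)  # floor is 0.08, so "c" is cut
loose = min_p_filter(flat, min_p=0.1)     # floor is 0.03, so every token survives
```

Because the floor scales with the top-token probability, the filter prunes aggressively when the model is confident and lightly when it is not.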
A typical pipeline has three stages, covered in detail under large language model.
The model learns next-token prediction over trillions of tokens scraped and filtered from the web, books, code, and curated text. The Chinchilla scaling study (Hoffmann et al., 2022) found that for fixed compute, model size and training tokens should scale roughly in proportion, putting the compute-optimal tokens-per-parameter ratio near 20 [13]. Subsequent practice has pushed well beyond Chinchilla-optimal for inference efficiency: Llama 3 trained its 8B model on 15 trillion tokens (about 1,875 tokens per parameter), and Qwen3-0.6B trained on 36 trillion tokens (about 60,000 tokens per parameter). The key insight is that training beyond the compute-optimal point improves the deployed model's quality per inference FLOP, even at the cost of extra training compute.
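The arithmetic behind these ratios is easy to reproduce. The sketch below uses the figures quoted above (a ratio near 20, and 15 trillion tokens for an 8B model); the constant and function names are hypothetical.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio cited above

def tokens_per_parameter(training_tokens, parameters):
    return training_tokens / parameters

def chinchilla_optimal_tokens(parameters):
    return CHINCHILLA_TOKENS_PER_PARAM * parameters

llama3_8b_ratio = tokens_per_parameter(15 * 10**12, 8 * 10**9)        # 1875.0
optimal_tokens_8b = chinchilla_optimal_tokens(8 * 10**9)              # 160 billion tokens
overtraining_factor = llama3_8b_ratio / CHINCHILLA_TOKENS_PER_PARAM   # 93.75
```

By this estimate the 8B model saw roughly 94 times more tokens than the compute-optimal recipe would suggest.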
Data quality matters as much as scale. Filtering pipelines remove boilerplate, near-duplicates, and low-quality web text; domain-weighted sampling up-mixes code, math, and scientific text. Recent flagship models report pretraining on tens of trillions of tokens across dozens of languages; Qwen3 covered approximately 119 languages [14].
The base model is adapted on curated instruction-response pairs, often crowdsourced or synthesized. Early examples include FLAN from Google and Alpaca from Stanford. SFT data has evolved from simple prompt-completion pairs to complex multi-turn conversations covering tool use, reasoning, and multimodal tasks. Synthetic data, generated by a strong teacher model and filtered for quality, has become a dominant source because it scales more cheaply than human annotation [15].
A reward signal aligns outputs with human preferences. RLHF fits a reward model to ranked response pairs and optimizes the policy with PPO. Constitutional AI substitutes a written constitution and uses AI feedback (RLAIF). Direct preference optimization (DPO) replaces the reward model and reinforcement-learning step with a single classification loss, making training more stable and reproducible. More recent variants include ORPO, GRPO (used by Qwen3 and DeepSeek-R1), and IPO.
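The "single classification loss" used by DPO can be written for one preference pair as the negative log-sigmoid of a scaled log-probability margin between the policy and a frozen reference model. The sketch below assumes that standard form; the log-probability values are invented, and `beta=0.1` is only an illustrative setting.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair; inputs are sequence log-probabilities."""
    # How much more the policy favors the chosen response than the reference model does.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))

no_preference = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy equals reference
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)        # policy shifted toward the chosen response
worsened = dpo_loss(-11.0, -11.0, -10.0, -12.0)       # policy shifted toward the rejected response
```

When the policy matches the reference the loss is log 2, and it falls as the policy moves probability toward the preferred response.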
Reasoning models such as o1, DeepSeek-R1, and the "thinking" variants of recent flagships add large-scale reinforcement learning on verifiable tasks (math, code) to elicit long chains of thought. DeepSeek-R1-Zero showed that RL on a base model alone, with no supervised reasoning data, can spontaneously produce self-correction and backtracking behavior [16].
A persistent finding across the 2023 to 2025 literature is that a smaller, higher-quality SFT dataset often outperforms a larger noisy one. The LIMA paper (2023) argued that 1,000 carefully curated examples sufficed to produce competitive instruction-following behavior. Later scaling experiments have qualified this: quality-at-scale wins over quality-at-small-scale once models are large enough and the task distribution is wide enough, but the principle that annotation quality gates fine-tuning quality remains robust.
Parameter counts are taken from official papers and model cards. "Undisclosed" indicates the developer has not published an official figure. For MoE models, totals are listed with active parameters per token where reported.
| Model | Year | Organization | Parameters | Significance |
|---|---|---|---|---|
| GPT-2 | 2019 | OpenAI | 1.5B (largest) | Zero-shot transfer at scale |
| T5 | 2019 | Google | up to 11B | Text-to-text unification |
| GPT-3 | 2020 | OpenAI | 175B | In-context few-shot learning |
| GPT-Neo / GPT-J | 2021 | EleutherAI | 6B (J) | Early open replication |
| PaLM | 2022 | Google | 540B | Pathways training, dense scale |
| Chinchilla | 2022 | DeepMind | 70B | Compute-optimal scaling |
| OPT 175B | 2022 | Meta | 175B | Open replication of GPT-3 |
| BLOOM | 2022 | BigScience | 176B | Open multilingual training (46 languages) |
| InstructGPT | 2022 | OpenAI | 175B | RLHF for instruction following |
| ChatGPT | 2022 | OpenAI | undisclosed | Mass-market chat product |
| GPT-4 | 2023 | OpenAI | undisclosed | Multimodal, professional-exam level |
| Model | Year | Organization | Parameters | Significance |
|---|---|---|---|---|
| LLaMA | 2023 | Meta | 7B, 13B, 33B, 65B | Open-weight foundation models |
| Llama 2 | 2023 | Meta | 7B, 13B, 70B | Commercial-friendly license, chat tuning |
| Falcon | 2023 | TII | 7B, 40B, 180B | Open large model on RefinedWeb |
| Mistral 7B | 2023 | Mistral AI | 7B | Sliding-window attention, GQA |
| Mixtral 8x7B | 2023 | Mistral AI | 47B total, 13B active | Sparse mixture of experts |
| Llama 3 / 3.1 | 2024 | Meta | 8B, 70B, 405B | Open frontier-scale model |
| Qwen 2.5 | 2024 | Alibaba | 0.5B to 72B | 18T-token pretraining |
| DeepSeek-V3 | 2024 | DeepSeek | 671B total, 37B active | MoE with multi-token prediction |
As of May 2026, the leading closed-weight families are OpenAI's GPT-5 series, Anthropic's Claude Opus 4.x, Google's Gemini 3.x, and xAI's Grok 4.x. All are multimodal to varying degrees and ship "thinking" or reasoning modes.
| Model | Released | Organization | Context window | Notes |
|---|---|---|---|---|
| GPT-5 | Aug 7, 2025 | OpenAI | 400K | Unified system with real-time router [9] |
| GPT-5.1 | Nov 2025 | OpenAI | 400K | Instant and Thinking modes [17] |
| GPT-5.2 | Dec 11, 2025 | OpenAI | 400K (128K output) | Knowledge cutoff Aug 31, 2025 [18] |
| Claude Opus 4 | May 2025 | Anthropic | 200K | Agentic, hybrid reasoning [19] |
| Claude Opus 4.5 | Nov 24, 2025 | Anthropic | 200K (64K output) | First model above 80% on SWE-bench Verified (80.9%); "effort" parameter [20] |
| Claude Opus 4.8 | May 28, 2026 | Anthropic | 200K | Latest flagship; gains in coding and honesty [21] |
| Gemini 3 Pro | Nov 18, 2025 | Google DeepMind | 1M | Topped LMArena at 1501 Elo at launch [22] |
| Gemini 3.1 Pro | Feb 19, 2026 | Google DeepMind | 1M (64K output) | 94.3% GPQA Diamond, 80.6% SWE-bench Verified [23] |
| Grok 4 | Jul 9, 2025 | xAI | 256K | Native tool use; trained on Colossus cluster [24] |
| Grok 4.1 | Nov 2025 | xAI | 256K | Led LMArena Text Arena (1483 Elo, Thinking) [25] |
| Grok 4.3 | Apr 30, 2026 | xAI | 1M | Adds native video input [26] |
Open-weight (downloadable) models from Meta, DeepSeek, Alibaba, Mistral AI, and Moonshot AI have approached frontier quality, most released under permissive licenses (Apache 2.0 or MIT).
| Model | Released | Organization | Parameters | License | Notes |
|---|---|---|---|---|---|
| Llama 4 Scout / Maverick | Apr 5, 2025 | Meta | 109B/17B active; 400B/17B active | Llama 4 Community | First Meta MoE; Scout has 10M-token context [27] |
| Qwen3 | Apr 29, 2025 | Alibaba | 235B total, 22B active (plus dense 0.6B to 32B) | Apache 2.0 | Pretrained on approximately 36T tokens, 119 languages [14] |
| Kimi K2 | Jul 2025 | Moonshot AI | 1T total, 32B active | Modified MIT | Trillion-parameter open MoE for agentic and coding tasks [28] |
| DeepSeek-V3.2 | Dec 1, 2025 | DeepSeek | 671B total, 37B active | MIT | DeepSeek Sparse Attention; 160K context [10] |
| Mistral Large 3 | Dec 2, 2025 | Mistral AI | 675B total, 41B active | Apache 2.0 | 256K context; trained on roughly 3,000 H200 GPUs [29] |
| DeepSeek-V4 (preview) | Apr 24, 2026 | DeepSeek | V4-Pro 1.6T; V4-Flash 284B | MIT | 1M-token context; Compressed Sparse Attention [30] |
Not all deployment contexts require or can afford frontier-scale models. A distinct tier of small language models targets on-device, edge, and low-latency use cases.
| Model family | Organization | Size range | Notable features |
|---|---|---|---|
| Gemma 3 / 4 | Google DeepMind | 1B to 27B | On-device optimized; Gemma 4 E2B/E4B use per-layer embeddings for efficiency |
| Phi 4 | Microsoft | 3.8B (text), 5.6B (multimodal) | Strong math and reasoning at small scale; trained primarily on synthetic data |
| Qwen3 dense | Alibaba | 0.6B to 32B | Apache 2.0; strong multilingual performance down to 600M parameters |
| Llama 3.2 | Meta | 1B, 3B | Official small-scale Llama; runs on mobile devices |
| SmolLM | Hugging Face | 135M to 1.7B | Ultra-compact models for browser and microcontroller deployment |
Sub-2B models suit IoT and mobile inference; 3B to 5B models run on consumer laptops; models of 9B and above generally need server hardware for low latency. Small draft models can also accelerate larger models in the same family through speculative decoding; Gemma 4 ships dedicated Multi-Token Prediction drafters for this purpose.
Text generation models are evaluated on a mix of knowledge, code, math, agentic, and human-preference benchmarks. As frontier models saturate older tests, harder successors and live arenas have taken over.
| Benchmark | Year | What it measures | Status (2026) |
|---|---|---|---|
| MMLU | 2020 | 57 subjects, multiple choice | Largely saturated; MMLU-Pro is the harder successor |
| HellaSwag | 2019 | Commonsense sentence completion | Near ceiling for frontier models |
| TruthfulQA | 2021 | Resistance to common falsehoods | Probes imitative falsehoods |
| HumanEval | 2021 | Python function synthesis | Mostly saturated; GPT-5.3 Codex at 93% |
| GSM8K | 2021 | Grade-school math word problems | At 99% for frontier models; no longer differentiating |
| HELM | 2022 | Holistic, multi-metric evaluation | Stanford CRFM; still used for broad comparisons |
| BIG-Bench | 2022 | 200+ diverse tasks | Crowd-sourced; BIG-Bench Hard (BBH) remains active |
| MT-Bench | 2023 | Multi-turn open-ended chat | LLM-as-judge with GPT-4 |
| LMArena (Chatbot Arena) | 2023 | Crowd-sourced pairwise voting | Live Elo leaderboard; most widely cited human-preference ranking |
| GPQA Diamond | 2023 | Graduate-level science questions | Frontier models now exceed 90%; Gemini 3.1 Pro at 94.3% |
| SWE-bench Verified | 2024 | Real-world software-engineering fixes | Leading agentic-coding metric; top models above 80% |
| ARC-AGI-2 | 2025 | Abstract visual reasoning; near-zero base rate | Designed to resist memorization; Gemini 3.1 Pro reported 77.1% |
| Humanity's Last Exam | 2025 | Expert-level questions across many fields | Grok 4 Heavy exceeded 50% in multi-agent mode |
| FrontierMath | 2024 | Unpublished research-level math problems | GPT-5.5 at 51.7% on tiers 1 to 3 |
Beyond accuracy on fixed benchmarks, text generation quality is measured with several complementary metrics.
Perplexity is the exponentiated average negative log-likelihood per token on a held-out corpus. It measures how well the model predicts natural text and is used during pretraining to track progress and compare model checkpoints on the same data distribution.
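The definition translates directly into code. In this sketch `token_probs` stands for the probabilities a model assigned to each actual next token in a held-out text; the values are made up.

```python
import math

def perplexity(token_probs):
    """Exponentiated average negative log-likelihood per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])  # equivalent to choosing among 4 options
confident = perplexity([0.9, 0.8, 0.95, 0.85])        # better predictions, lower perplexity
```

A model that always assigns probability 1/4 to the correct token has a perplexity of 4, which is why perplexity is often read as an effective branching factor.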
BLEU and ROUGE are n-gram overlap metrics historically used for machine translation and summarization. They correlate weakly with human judgment on open-ended generation and have largely been replaced by model-based evaluators for those tasks.
Win rates and Elo. Human or LLM-judged pairwise comparisons between responses yield Elo ratings on platforms such as LMArena. Elo captures holistic quality and is harder to game than fixed benchmarks, but requires large numbers of comparisons and is sensitive to the question mix.
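A minimal sketch of how a single pairwise comparison moves two ratings under the classic Elo update. The K-factor of 32 and the 400-point scale are conventional chess defaults assumed here for illustration; a live leaderboard may fit its ratings with a different statistical procedure.

```python
def expected_score(rating_a, rating_b):
    """Predicted probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if A wins the comparison, 0.0 if B wins, 0.5 for a tie."""
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

model_a, model_b = update_elo(1500, 1500, score_a=1.0)  # A wins one vote between equals
```

Each vote shifts ratings only slightly, which is why stable rankings need large numbers of comparisons.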
LLM-as-judge. Using a strong model such as GPT-4 to rate or rank outputs has become common for evaluating instruction-following, factuality, and format adherence. MT-Bench pioneered this approach; it has since been adopted widely for rapid offline evaluation.
Contamination and saturation. A persistent concern is that public benchmark questions may appear in pretraining corpora, inflating scores. Dynamic evaluations (held-out private test sets, live arenas, and freshly generated question banks) partially address contamination. Multiple classical benchmarks are now at ceiling for frontier models, which has driven the creation of harder successors and live arenas as the primary differentiators.
Text generation models underpin a broad range of products and workflows.
ChatGPT, Claude, Gemini, Grok, and Copilot expose general-purpose chat over the web and APIs. Conversational systems handle tasks from drafting emails and answering factual questions to multi-step research, creative writing, and roleplay. Context windows of 200K to 1M tokens let users share entire codebases, documents, or long conversation histories in a single prompt.
GitHub Copilot, Cursor, and similar tools wrap models such as GPT-5, Claude, and specialized derivatives (Codex, Code Llama, Qwen-Coder, DeepSeek-Coder) to produce, edit, explain, and debug source code. SWE-bench Verified, measuring resolution of real open-source GitHub issues, has become the canonical benchmark for this use case. Claude Opus 4.5 reached 80.9% on SWE-bench Verified in November 2025, making it the first model to exceed 80% on the benchmark [20].
Drafting, editing, and rewriting features are integrated into Gmail, Notion, Microsoft 365, and Google Workspace. These pipelines usually embed the model inside a product that provides formatting context and constrained output requirements, often via system prompts and structured output modes.
Long-document condensation, meeting recaps, legal brief summaries, and multilingual translation are mature applications. Models with million-token context windows can summarize entire books or transcripts without chunking. Machine translation quality rivals dedicated translation systems for major language pairs.
Retrieval-augmented generation (RAG) pairs a generation model with a retrieval system that fetches relevant passages from a knowledge base or the web before generation. This addresses the knowledge-cutoff limitation and grounds outputs in verifiable sources. RAG has become standard in enterprise search, customer support, and research assistants. Agentic RAG extends the idea to multi-step pipelines where the model autonomously formulates retrieval queries, refines them based on results, and integrates multiple sources [31].
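The retrieve-then-generate flow can be sketched with a toy retriever that ranks passages by word overlap with the query. Production systems typically use embedding or keyword indexes instead; the passages, prompt wording, and function names below are hypothetical, and the final call to a generation model is left out.

```python
def tokenize(text):
    return set(text.lower().split())

def retrieve(query, passages, k=2):
    """Rank passages by the number of words they share with the query (toy scoring)."""
    query_words = tokenize(query)
    ranked = sorted(passages, key=lambda p: len(query_words & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(query, passages, k=2):
    """Place the retrieved passages ahead of the question so the model can ground its answer."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieve(query, passages, k)))
    return f"Answer using only the sources below.\n{sources}\nQuestion: {query}"

passages = [
    "GPT-3 has 175 billion parameters.",
    "Llama 3 was released in April 2024.",
    "Beam search keeps several partial sequences.",
]
prompt = build_prompt("When was Llama 3 released?", passages, k=1)
```

The resulting prompt would then be passed to the generation model, which answers from the numbered sources rather than from its parametric memory alone.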
Multi-step tool use, browser automation, code execution, and file management drive agentic products built on reasoning-capable flagships. Models like Kimi K2 and Claude Opus 4.5 are explicitly positioned for autonomous software engineering tasks that require sustained, goal-directed operation over tens or hundreds of steps. The SWE-bench and Terminal-Bench evaluations track this frontier.
Text generation models assist in literature review, hypothesis generation, grant writing, legal document drafting, medical note-taking, and clinical decision support. These applications require high factual reliability and often combine the model with RAG, citation grounding, or specialized fine-tuning on domain data.
Despite rapid progress, text generation models share several well-documented weaknesses, discussed at length under large language model.
Models produce fluent but factually incorrect statements, particularly for rare entities, fresh events, or detailed numerical claims. A 2025 survey found hallucination to be one of the three most studied limitations in the LLM literature, alongside reasoning failures and generalization gaps [32]. Reliability has improved across model generations through RLHF, RAG integration, and citation-grounded output, but the problem is not solved. A theoretical result (Xu et al., 2024) argues, using results from learning theory, that hallucination cannot be eliminated: no computable model can learn every computable function, so any LLM used as a general problem solver will be wrong on some inputs [33].
Pretraining data has a fixed end date, so models lack information on later events unless augmented with retrieval or tools. Knowledge cutoffs range from months to over a year before a model's release. Web-search plugins, RAG, and real-time tool-use APIs address this for well-resourced deployments but not for offline or privacy-sensitive ones.
Even million-token windows lose needle-in-a-haystack accuracy at extreme lengths: models systematically miss information placed in the middle of very long contexts, far from both the beginning and the end. Inference cost also grows with sequence length, making long-context use expensive at scale.
Training frontier models costs tens to hundreds of millions of dollars and consumes large amounts of accelerator time and energy. Inference at scale is also expensive: a single GPT-5-class response can consume hundreds of times the energy of a keyword search query. Speculative decoding, quantization, and mixture-of-experts activation sparsity partially mitigate inference cost, but energy consumption remains a concern.
Models can be jailbroken through adversarial prompting to bypass safety filters. Sycophancy (telling users what they want to hear rather than the truth) is a persistent alignment failure mode: RLHF optimizes for human approval, and approval correlates with agreement. Models also exhibit prompt sensitivity: small changes in phrasing can flip answers or change quality markedly.
Outputs reflect biases in training data across demographics, languages, and viewpoints. Under-represented languages and cultures receive lower quality outputs. Political and social topics exhibit systematic skews inherited from the web corpus.
Public benchmarks may leak into pretraining corpora, inflating reported scores. This is a structural problem: once a benchmark is published, it becomes part of the web and is likely included in future training runs. Dynamic and private evaluations, such as LMArena and held-out test banks, partially mitigate this risk, but contamination is difficult to audit with certainty.
A finding reported across multiple 2024 and 2025 studies is that aggressive preference optimization can degrade certain capabilities: heavily RLHF-aligned models sometimes become overly cautious, refuse benign requests, or lose nuanced reasoning ability [34]. Balancing helpfulness, harmlessness, and capability remains an active research and engineering challenge.
1. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/abs/1409.0473 Accessed 2026-05-31.
2. Vaswani, A. et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762 Accessed 2026-05-31.
3. Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). https://arxiv.org/abs/2005.14165 Accessed 2026-05-31.
4. Ouyang, L. et al. (2022). Training Language Models to Follow Instructions with Human Feedback (InstructGPT). https://arxiv.org/abs/2203.02155 Accessed 2026-05-31.
5. Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. https://arxiv.org/abs/2212.08073 Accessed 2026-05-31.
6. OpenAI (2023). GPT-4 Technical Report. https://arxiv.org/abs/2303.08774 Accessed 2026-05-31.
7. Touvron, H. et al. (2023). LLaMA: Open and Efficient Foundation Language Models. https://arxiv.org/abs/2302.13971 Accessed 2026-05-31.
8. DeepSeek-AI (2024). DeepSeek-V3 Technical Report. https://arxiv.org/abs/2412.19437 Accessed 2026-05-31.
9. OpenAI (2025). Introducing GPT-5. https://openai.com/index/introducing-gpt-5/ Accessed 2026-05-31.
10. DeepSeek-AI (2025). DeepSeek-V3.2 (model card). https://huggingface.co/deepseek-ai/DeepSeek-V3.2 Accessed 2026-05-31.
11. Holtzman, A. et al. (2020). The Curious Case of Neural Text Degeneration. https://arxiv.org/abs/1904.09751 Accessed 2026-05-31.
12. Guo, Y. et al. (2024). Adaptive Contrastive Search: Uncertainty-Guided Decoding for Open-Ended Text Generation. https://arxiv.org/abs/2407.18698 Accessed 2026-05-31.
13. Hoffmann, J. et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). https://arxiv.org/abs/2203.15556 Accessed 2026-05-31.
14. Qwen Team (2025). Qwen3: Think Deeper, Act Faster. https://qwenlm.github.io/blog/qwen3/ Accessed 2026-05-31.
15. Zhang, S. et al. (2023). Instruction Tuning for Large Language Models: A Survey. https://arxiv.org/abs/2308.10792 Accessed 2026-05-31.
16. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/abs/2501.12948 Accessed 2026-05-31.
17. OpenAI (2025). GPT-5.1 Instant and GPT-5.1 Thinking System Card Addendum. https://openai.com/index/gpt-5-system-card-addendum-gpt-5-1/ Accessed 2026-05-31.
18. OpenAI (2025). Introducing GPT-5.2. https://openai.com/index/introducing-gpt-5-2/ Accessed 2026-05-31.
19. Anthropic (2025). Introducing Claude 4. https://www.anthropic.com/news/claude-4 Accessed 2026-05-31.
20. Anthropic (2025). Introducing Claude Opus 4.5. https://www.anthropic.com/news/claude-opus-4-5 Accessed 2026-05-31.
21. Anthropic (2026). Introducing Claude Opus 4.8. https://www.anthropic.com/news/claude-opus-4-8 Accessed 2026-05-31. See also TechCrunch (2026): https://techcrunch.com/2026/05/28/anthropic-releases-opus-4-8-with-dynamic-workflow-tool/ Accessed 2026-05-31.
22. Google (2025). Gemini 3: Introducing the latest Gemini AI model from Google. https://blog.google/products/gemini/gemini-3/ Accessed 2026-05-31.
23. Google DeepMind (2026). Gemini 3.1 Pro Model Card. https://deepmind.google/models/model-cards/gemini-3-1-pro/ Accessed 2026-05-31.
24. xAI (2025). Grok 4. https://x.ai/news/grok-4 Accessed 2026-05-31.
25. xAI (2025). Grok 4.1. https://x.ai/news/grok-4-1 Accessed 2026-05-31.
26. Codersera (2026). Grok 4.3: xAI's Cheap Frontier Model (May 2026 Guide). https://codersera.com/blog/grok-4-3-launch-guide-2026/ Accessed 2026-05-31.
27. Meta (2025). The Llama 4 herd. https://www.llama.com/models/llama-4/ Accessed 2026-05-31.
28. Moonshot AI (2025). Kimi K2: Open Agentic Intelligence. https://moonshotai.github.io/Kimi-K2/ Accessed 2026-05-31.
29. Mistral AI (2025). Introducing Mistral 3. https://mistral.ai/news/mistral-3/ Accessed 2026-05-31.
30. MIT Technology Review (2026). Three reasons why DeepSeek's new model matters. https://www.technologyreview.com/2026/04/24/1136422/why-deepseeks-v4-matters/ Accessed 2026-05-31.
31. Shi, J. et al. (2025). Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. https://arxiv.org/abs/2501.09136 Accessed 2026-05-31.
32. Various authors (2025). LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models. https://arxiv.org/abs/2505.19240 Accessed 2026-05-31.
33. Xu, Z. et al. (2024). Hallucination is Inevitable: An Innate Limitation of Large Language Models. https://arxiv.org/abs/2401.11817 Accessed 2026-05-31.
34. Fundamental Limitations of Alignment in Large Language Models. https://arxiv.org/abs/2304.11082 Accessed 2026-05-31.