Large Language Model
Last reviewed
Jun 9, 2026
Sources
73 citations
Review status
Source-backed
Revision
v9 · 11,575 words
See also: Machine learning terms, Natural language processing, Transformer
A large language model (LLM) is a type of artificial intelligence system built on neural networks with billions (or trillions) of parameters, trained on massive text corpora to understand and generate human language. These models can perform a wide range of tasks including translation, summarization, question answering, code generation, and open-ended conversation. Since the release of GPT-1 in 2018, LLMs have become one of the most consequential developments in the history of computing, powering products used by hundreds of millions of people worldwide.
LLMs work by learning statistical patterns in text. During training, a model reads vast quantities of text from books, websites, academic papers, and code repositories, building an internal representation of language structure, factual knowledge, and reasoning patterns. The resulting model can then generate text one token at a time, predicting the most likely next token given everything that came before it. Despite this relatively simple mechanism, LLMs exhibit surprisingly complex behavior, including the ability to follow instructions, write software, solve math problems, and engage in multi-step reasoning.
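The generate-one-token-at-a-time loop can be illustrated with a deliberately tiny stand-in. The sketch below is not how an LLM is built: it assumes a bigram count table in place of a neural network and conditions only on the previous token rather than on the full context. The names `train_bigram` and `generate` are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, start, max_new_tokens):
    """Greedy decoding: repeatedly append the most frequent next token."""
    out = [start]
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # no continuation was ever observed for this token
        out.append(followers.most_common(1)[0][0])
    return out

corpus = "the cat sat on the mat and the cat sat on the rug".split()
model = train_bigram(corpus)
print(generate(model, "the", 4))  # ['the', 'cat', 'sat', 'on', 'the']
```

A real LLM replaces the count table with a Transformer that outputs a probability distribution over its whole vocabulary, and usually samples from that distribution instead of always taking the top choice.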
The term "large" in LLM is relative and has shifted over time. In 2018, GPT-1's 117 million parameters qualified as large. By 2025, models with fewer than a billion parameters are generally considered small, and frontier LLMs contain hundreds of billions to trillions of parameters. The "language model" part refers to the core training objective: predicting the probability distribution over the next token in a sequence, a form of self-supervised learning that requires no manually labeled data.
There is no formal parameter threshold that makes a model "large" [1]; in practice, three properties are usually present:
Modern LLMs sit at the center of generative AI products such as ChatGPT, Claude, Gemini, and Microsoft Copilot, are the substrate for the open-weight ecosystem around Llama, Mistral, Qwen, DeepSeek, and Gemma, and form one face of the broader category of foundation models, which also includes vision-language, code, and protein models.
The development of LLMs can be traced through several distinct phases, each marked by significant increases in model size, training data, and capability.
Language modeling predates deep learning. Statistical n-gram models from the 1990s and 2000s estimated the probability of the next word from counts of short sequences in a fixed corpus, and were the workhorse of speech recognition and machine translation for decades. By 2001, smoothed n-gram models trained on roughly 300 million words held the state of the art in perplexity [1].
The shift to learned distributed representations began with neural probabilistic language models (Bengio et al., 2003) and accelerated with word embeddings. Word2vec, published by Tomas Mikolov and colleagues at Google in 2013, made dense word vectors cheap to train and showed that arithmetic on those vectors captured surprising semantic structure, including the famous king minus man plus woman example [2]. GloVe followed in 2014 with a co-occurrence-based formulation [3], and ELMo (2018) extended the idea to contextual embeddings using bidirectional LSTMs.
The Transformer architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," laid the groundwork for virtually all modern LLMs [4]. It replaced recurrent neural networks with a self-attention mechanism that could process entire sequences in parallel, dramatically improving both training speed and the model's ability to capture long-range dependencies in text.
OpenAI released GPT-1 in June 2018 with 117 million parameters [5]. It was trained on the BookCorpus dataset and demonstrated that generative pre-training followed by discriminative fine-tuning could achieve strong results on a variety of natural language processing benchmarks. In February 2019, GPT-2 followed with 1.5 billion parameters, trained on WebText, a 40-gigabyte dataset of 8 million web pages [6]. OpenAI initially withheld the full model citing concerns about potential misuse for generating disinformation, releasing it in stages over several months; the full 1.5-billion-parameter weights were published in November 2019 [7].
Google introduced BERT (Bidirectional Encoder Representations from Transformers) in October 2018 with 340 million parameters. Unlike GPT, BERT used a bidirectional training approach (masked language modeling), making it particularly effective for understanding tasks like classification and question answering rather than text generation [8]. Encoder-only models of this family went on to dominate discriminative NLP benchmarks such as GLUE and SuperGLUE.
The release of GPT-3 in May 2020 marked a turning point. With 175 billion parameters and a 2,048-token context window, GPT-3 showed that scaling up model size could unlock qualitatively new capabilities [9]. The model demonstrated impressive "few-shot learning" abilities, performing tasks it had never been explicitly trained for simply by being given a few examples in the prompt. GPT-3's training cost was estimated at $4.6 million in cloud compute, and required approximately 350 GB of storage for its weights alone.
Google responded with PaLM (Pathways Language Model) in April 2022, scaling to 540 billion parameters. PaLM demonstrated strong performance across NLP benchmarks and showed particular strength in reasoning tasks when combined with chain-of-thought prompting [10]. Google also developed LaMDA, a model focused specifically on natural conversational abilities.
This period also saw the emergence of several important open-source efforts. EleutherAI released GPT-Neo and GPT-J, providing the research community with openly accessible alternatives to proprietary models. BigScience, an international collaboration, released BLOOM, a 176-billion-parameter multilingual model, in July 2022. Google released T5 (Text-to-Text Transfer Transformer; Raffel et al., 2019), which framed all NLP tasks as text-to-text problems using an encoder-decoder architecture, with checkpoints up to 11 billion parameters [11], and later Flan-T5, an instruction-tuned variant that demonstrated the power of multi-task fine-tuning [12].
The transition from raw language model to chat assistant began with InstructGPT (Ouyang et al., March 2022), which combined supervised fine-tuning with reinforcement learning from human feedback (RLHF). Human labelers preferred outputs from a 1.3-billion-parameter InstructGPT model over the 175-billion-parameter GPT-3 base model, despite a 100x parameter gap [13].
The launch of ChatGPT on November 30, 2022 (built on GPT-3.5) brought LLMs into mainstream public awareness. Within two months, it had over 100 million users, making it the fastest-growing consumer application in history at the time.
OpenAI released GPT-4 on March 14, 2023, a multimodal model capable of processing both text and images. GPT-4 was widely praised for its increased accuracy and reasoning capabilities [14]. Anthropic launched the Claude model family in March 2023, followed by Claude 2 in July 2023, emphasizing its Constitutional AI approach to safety. Google announced the Gemini model family in December 2023 with Nano, Pro, and Ultra variants, and renamed its Bard chatbot to Gemini in February 2024.
Meta released LLaMA (Large Language Model Meta AI) in February 2023, a family of models ranging from 7 billion to 65 billion parameters. The 13B-parameter LLaMA model outperformed GPT-3 (175B) on most NLP benchmarks, demonstrating that smaller, well-trained models could match or exceed much larger ones [15]. LLaMA's open release catalyzed a wave of community fine-tuning projects including Alpaca, Vicuna, and Koala. Meta followed with LLaMA 2 in July 2023 and LLaMA 3 (with models up to 405 billion parameters) in 2024.
GPT-4o launched on May 13, 2024 with native text, image, and audio input and output, achieving audio response times around 320 milliseconds [16]. Meta's Llama 3.1, including a 405-billion-parameter version trained on more than 15 trillion tokens with a 128,000-token context window, shipped on July 23, 2024 [17].
In September 2024, OpenAI released o1-preview, the first in a new series of "reasoning models" trained specifically for extended chain-of-thought problem solving, representing a new paradigm in LLM capability [18].
By 2025, LLMs entered a new phase characterized by massive context windows, native multimodality, mixture-of-experts architectures, and agentic capabilities.
OpenAI released GPT-4.1 on April 14, 2025, an API-only family with a 1-million-token context window and large coding-benchmark gains over GPT-4o [19]. GPT-5 followed on August 7, 2025, featuring a 400,000-token context window and significantly improved reliability, with hallucination rates reduced to approximately 6.2%. It scored 94.6% on the AIME 2025 math benchmark without tools and 74.9% on SWE-bench Verified for agentic coding [20]. GPT-5.2 followed in December 2025 with improved tool use and long-context processing [20].
Anthropic released Claude Opus 4 and Claude Sonnet 4 on May 22, 2025 [21], designed explicitly for agentic use cases including tool invocation, file access, and long-horizon reasoning. Claude Sonnet 4 gained a 1-million-token context window by August 2025. Claude Opus 4.5 arrived in November 2025, and the Claude 4.6 family launched in February 2026 with 1M-token context and up to 128K output tokens [22].
Google's Gemini 2.5 Pro, released March 25, 2025, shipped a 1-million-token context window and a "thinking" reasoning mode, with a Deep Think variant rolled out in August 2025 using parallel thinking techniques [23]. Google then released Gemini 3 Pro in November 2025, followed by Gemini 3.1 Pro in February 2026, which led on 12 of 18 tracked benchmarks and offered a 1-million-token context window [24].
Meta released the LLaMA 4 family on April 5, 2025, marking an architectural shift to mixture-of-experts (MoE) design with native multimodality, trained on more than 30 trillion tokens. LLaMA 4 Scout featured a 10-million-token context window, capable of processing approximately 7,500 pages of text [25].
In December 2024, the Chinese AI lab DeepSeek released DeepSeek-V3, a 671-billion-parameter mixture-of-experts model trained on 14.8 trillion tokens, and followed in January 2025 with DeepSeek R1, an open-weight reasoning model that performed comparably to OpenAI's o1 at a fraction of the cost per token [26]. Alibaba's Qwen3 family, released April 28, 2025, was trained on 36 trillion tokens and introduced hybrid thinking/non-thinking modes across dense and MoE models [27]. Mistral Large 3 launched in December 2025 with 675 billion total parameters (41 billion active) under the Apache 2.0 open-source license [28].
Virtually all modern LLMs are based on the Transformer architecture. Transformers rely on attention mechanisms (specifically self-attention) that allow the model to weigh the importance of different tokens in a sequence relative to each other. In each self-attention layer, every token is projected to a query, key, and value vector; attention weights are computed by a softmax over query-key dot products, and the output is a weighted sum of value vectors. Stacking dozens to hundreds of these layers, interleaved with feed-forward networks and normalization, gives the model the capacity to mix information across long token spans [4]. This enables the model to learn complex linguistic patterns and generate coherent, context-aware text across long sequences.
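The query-key-value computation described above can be sketched in a few lines. This is a minimal single-head version in plain Python, assuming the inputs have already been projected into query, key, and value vectors. The division by the square root of the key dimension follows Vaswani et al. [4], and the `causal` flag is an illustrative parameter showing the left-to-right masking used by decoder-only models.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # With a causal mask, position i may only attend to positions <= i.
        visible = range(i + 1) if causal else range(len(keys))
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in visible
        ]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the visible value vectors.
        out = [
            sum(w * values[j][c] for w, j in zip(weights, visible))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

q = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
causal_out = attention(q, q, v, causal=True)
print(causal_out[0])  # first token can only see itself: [1.0, 2.0]
```

Production implementations batch this as matrix multiplications across many heads and add learned projection matrices, but the arithmetic per head is the same.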
The original Transformer had both an encoder (for understanding input) and a decoder (for generating output). Modern LLMs have diverged into distinct architectural families:
| Architecture type | How it works | Training objective | Strengths | Example models |
|---|---|---|---|---|
| Decoder-only | Generates text left-to-right using causal (unidirectional) attention | Next-token prediction | Text generation, conversation, code | GPT series, Claude, LLaMA, Mistral |
| Encoder-only | Processes input bidirectionally using full (unmasked) self-attention | Masked language modeling | Classification, NER, sentence embeddings | BERT, RoBERTa, DeBERTa |
| Encoder-decoder | Maps input to output via cross-attention between encoder and decoder | Span corruption or text-to-text | Translation, summarization, question answering | T5, BART, Flan-T5 |
The decoder-only architecture has become dominant for large-scale language models because it naturally supports autoregressive text generation, scales efficiently with increasing parameter counts, and uses the same network for both prompt encoding and generation. A 2024 study found that at small scales, encoder-decoder models can outperform decoder-only models by several points on complex tasks, but this advantage diminishes at larger scales where decoder-only models match or exceed them [29].
A significant architectural trend in 2024-2025 has been the adoption of Mixture-of-Experts designs. In an MoE model, only a fraction of the total parameters are activated for any given input token. A routing mechanism selects which "expert" subnetworks to use, allowing models to have very large total parameter counts while keeping inference costs manageable.
The first widely deployed open example was Mistral AI's Mixtral 8x7B, released December 11, 2023, with 46.7 billion total parameters but only about 12.9 billion used per token, giving it the inference cost of a much smaller dense model while matching or beating Llama 2 70B on many benchmarks [30]. DeepSeek-V3 pushed the approach further: 671 billion total parameters, 37 billion active per token, and 256 routed experts plus a shared expert per layer, with auxiliary-loss-free load balancing [26]. Frontier models followed the same pattern: Mistral Large 3 activates 41 of its 675 billion parameters per token, LLaMA 4 Scout uses 16 experts (17 billion active of 109 billion total), and LLaMA 4 Maverick uses 128 experts (17 billion active of 400 billion total) [25][28].
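The routing idea can be sketched as follows. This is a simplified illustration, not any specific model's router: it assumes the router logits are given (in a real model they come from a learned linear layer), selects the top-k experts, and renormalizes their gate weights with a softmax over just the selected logits, which is one common choice. The experts here are hypothetical stand-in functions. The last lines compute the active-parameter fraction from the figures quoted above.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda e: router_logits[e], reverse=True)[:k]
    # Softmax restricted to the selected experts (shifted by the max for stability).
    exps = [math.exp(router_logits[e] - router_logits[top[0]]) for e in top]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(top, exps)]

def moe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    out = [0.0] * len(x)
    for e, gate in top_k_route(router_logits, k):
        y = experts[e](x)  # experts that were not selected are never evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

# Hypothetical experts: each one just scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_layer([1.0, 1.0], experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)

# (active, total) parameters in billions, as quoted in the text.
models = {
    "Mixtral 8x7B": (12.9, 46.7),
    "DeepSeek-V3": (37, 671),
    "Mistral Large 3": (41, 675),
    "LLaMA 4 Scout": (17, 109),
    "LLaMA 4 Maverick": (17, 400),
}
active_fraction = {name: a / t for name, (a, t) in models.items()}
```

With these figures, the active fraction ranges from roughly 4% (LLaMA 4 Maverick) to roughly 28% (Mixtral 8x7B) of total parameters per token.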
Mamba, introduced by Gu and Dao in December 2023, uses selective state space models rather than attention as the core sequence-mixing operation [31]. Mamba scales linearly with sequence length in both computation and memory, compared with the quadratic cost of standard attention, making it attractive for very long sequences. Hybrid architectures that interleave Mamba layers with attention layers have shown that combining the two can outperform either alone: AI21 Labs' Jamba family achieved production deployment, with Jamba 1.5 scaling to 398 billion total parameters (94 billion active) using 16 MoE experts [32]. As of 2025, pure Mamba models have not displaced Transformers in frontier chat products, but hybrid designs remain an active research direction.
LLMs process text as tokens rather than individual characters or whole words. A token is typically a subword unit: common words like "the" are single tokens, while less frequent words may be split into multiple tokens. On average, one token corresponds to roughly 3/4 of a word in English. Tokenization is a foundational preprocessing step that bridges the gap between raw text and the model's numerical representations.
| Algorithm | How it works | Used by | Key characteristic |
|---|---|---|---|
| Byte Pair Encoding (BPE) | Starts with individual characters and iteratively merges the most frequent adjacent pair until reaching target vocabulary size | GPT-2, GPT-3, GPT-4, LLaMA | Most popular; byte-level variant treats every possible byte as a basic unit |
| WordPiece | Similar to BPE but merges based on which pair maximizes the likelihood of the training data, not just frequency | BERT, DistilBERT, Electra | Tends to keep frequent words intact while splitting rare words |
| SentencePiece | Language-agnostic; treats input as a raw character stream (whitespace included) and learns subword units using BPE or Unigram algorithms | T5, LLaMA, many multilingual models | Works directly on raw text without language-specific preprocessing; uses special marker for word boundaries |
| Unigram | Starts with a large vocabulary and iteratively removes tokens that least reduce the training data likelihood | SentencePiece-based models, XLNet | Probabilistic approach; can assign multiple tokenizations to the same text |
Byte-level BPE, used by models like GPT-2 and later, operates at the byte level rather than the character level. This ensures that any text (including emojis, non-Latin scripts, and special characters) can be tokenized without unknown tokens, since every input can be decomposed into its constituent bytes [34].
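The BPE merge loop from the table can be sketched on a toy word list. This is a minimal character-level illustration with no word-frequency table, end-of-word marker, or pre-tokenization, all of which production tokenizers add. The function names are illustrative.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = [list(w) for w in words]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break  # every word is already a single symbol
        best = pair_counts.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(best)
        corpus = [merge_pair(s, best) for s in corpus]
    return merges, corpus

merges, segmented = train_bpe(["hug", "hug", "hug", "pug", "hugs"], num_merges=2)
print(merges)     # [('u', 'g'), ('h', 'ug')]
print(segmented)  # [['hug'], ['hug'], ['hug'], ['p', 'ug'], ['hug', 's']]
```

A byte-level variant would start from the UTF-8 bytes of each word rather than its characters, so the base vocabulary is at most 256 symbols and no input can fall outside it.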
Vocabulary sizes for modern LLMs typically range from 32,000 to 256,000 tokens. Larger vocabularies reduce the average number of tokens needed to represent text (improving efficiency) but increase the size of the embedding layer. GPT-4 uses a vocabulary of approximately 100,000 tokens, while LLaMA 3 expanded to 128,000 tokens to improve multilingual performance.
The development of a modern LLM typically follows a multi-stage pipeline: pre-training, supervised fine-tuning (SFT), and alignment.
During pre-training, the model is exposed to enormous quantities of text, learning to predict the next token given the preceding context. This self-supervised phase is by far the most computationally expensive step. GPT-3, for example, was trained on roughly 300 billion tokens. More recent models use far more data: LLaMA 3 was trained on over 15 trillion tokens, which for its 8-billion-parameter variant is a ratio of roughly 1,875 tokens per parameter [17].
Pre-training data typically includes web crawls (Common Crawl), books, Wikipedia, academic papers, code repositories (GitHub), and increasingly synthetic data. Common Crawl, a non-profit web archive that has been crawling the web since 2007, releases monthly snapshots of 200 to 400 TiB and is the standard public source [35]. Derivative datasets clean and deduplicate it: RefinedWeb (2023) produced 5 trillion English tokens and was used to train Falcon, and FineWeb (2024) distilled 15 trillion tokens from 96 Common Crawl snapshots [36]. Data quality matters enormously; deduplication, filtering, and careful curation of training data have been shown to significantly improve model performance relative to simply adding more data. Token budgets keep climbing: Qwen 2.5 was pretrained on 18 trillion tokens and Qwen3 on 36 trillion [37][27].
Pre-training requires massive compute infrastructure. LLaMA 4 was trained on a cluster of thousands of NVIDIA GPUs, Mistral Large 3 used approximately 3,000 H200 GPUs [28], and Llama 3.1 405B used more than 16,000 H100 GPUs [17]. Training runs for frontier models cost tens to hundreds of millions of dollars, though efficiency outliers exist: DeepSeek-V3's technical report gave a much-discussed figure of around $5.6 million in GPU-hour cost for its final pre-training run, a number that excluded prior research, failed experiments, and post-training [26]. Epoch AI estimates that training costs for frontier models have grown by a factor of 2 to 3 per year over the past eight years, with projections suggesting the largest models may cost over a billion dollars by 2027 [38].
| Model | Year | Estimated training cost |
|---|---|---|
| GPT-3 | 2020 | $4.6 million |
| PaLM (540B) | 2022 | ~$8-12 million (estimates vary) |
| GPT-4 | 2023 | $78-100+ million |
| Gemini Ultra 1.0 | 2023 | ~$192 million |
| DeepSeek-V3 | 2024 | ~$5.6 million (final run GPU-hours only) |
| GPT-5 | 2025 | Undisclosed (est. $200M+) |
After pre-training, the model is fine-tuned on a smaller, curated dataset of high-quality instruction-response pairs. Human annotators or AI systems write examples of ideal responses to various prompts, and the model is trained to mimic these responses. This stage, also called instruction tuning when the demonstrations follow an instruction-response format, transforms the base model from a raw text predictor into an assistant that can follow instructions and engage in conversation.
Alignment techniques adjust the model's behavior to be helpful, harmless, and honest. The two dominant approaches are:
RLHF (Reinforcement Learning from Human Feedback): Introduced by OpenAI and refined by Anthropic, RLHF involves three sub-steps: (1) collecting human preference data by having annotators rank model outputs, (2) training a reward model to predict human preferences, and (3) using reinforcement learning (specifically PPO, Proximal Policy Optimization) to fine-tune the LLM to maximize the reward model's score [13]. Notable RLHF-trained models include ChatGPT, Claude, and Gemini.
DPO (Direct Preference Optimization): Introduced by Rafailov et al. in 2023, DPO simplifies alignment by eliminating the separate reward model and RL loop. Instead, it directly optimizes the LLM on preference pairs using a classification-style loss function, exploiting the fact that the optimal RLHF policy can be written in closed form as a function of the reward. DPO is simpler to implement, cuts compute costs by approximately 40% compared to RLHF, and has been shown to produce comparable results in many settings [39]. By 2024, Hugging Face reported a 210% year-over-year increase in DPO usage.
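The classification-style DPO loss for a single preference pair can be sketched directly from the formulation in Rafailov et al. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model. The default `beta=0.1` is an assumed, commonly used value, not one taken from this article.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(margin))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Raising the chosen response's log-probability relative to the reference lowers the loss.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

In training, this loss is averaged over a batch of preference pairs and backpropagated through the policy only; the reference model's log-probabilities are constants.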
Constitutional AI, published by Anthropic (Bai et al., December 2022), replaces most of the human harm-labeling step with model-generated critiques and revisions guided by a written constitution, and uses Reinforcement Learning from AI Feedback (RLAIF) to update the model [40].
Several newer methods extend this toolkit. Group Relative Policy Optimization (GRPO), introduced in DeepSeek's DeepSeekMath work and later used to train DeepSeek-R1, dispenses with the separate critic model used in PPO: the model generates a group of candidate responses to a prompt, scores them with a reward function, and estimates the advantage from the relative scores within the group, significantly reducing memory requirements [26]. Reinforcement Learning with Verifiable Rewards (RLVR) uses rule-based or programmatic reward signals instead of a learned reward model: for math problems the reward is 1 if the final answer matches the ground truth and 0 otherwise, and for code it is whether the output passes test cases. Because verifiable rewards are less prone to reward hacking, larger-scale RL training can be performed with less risk of collapse; RLVR was central to DeepSeek-R1's training, improving AIME 2024 pass@1 from 15.6% to 71.0% in the DeepSeek-R1-Zero run [26]. Meta's LLaMA 4 uses a multi-round alignment process combining SFT, rejection sampling, PPO, and DPO [25]. Recent post-training rounds also add tool-use traces (function calling, code execution, web search), agentic behavior, and teacher-generated reasoning chains.
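The two ideas combine naturally: a verifiable reward scores each sampled response, and GRPO turns the scores within a group into advantages without a critic. The sketch below assumes the commonly described normalization (reward minus group mean, divided by group standard deviation). The small `eps` term and the use of the population standard deviation are choices made here for illustration.

```python
from statistics import mean, pstdev

def verifiable_reward(model_answer, ground_truth):
    """RLVR-style binary reward: 1 if the final answer matches, else 0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt whose correct answer is "42".
answers = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
```

Here the two correct answers receive an advantage near +1 and the two incorrect ones near -1. If all answers in a group are right, or all are wrong, every advantage is 0 and that prompt contributes no learning signal.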
Full fine-tuning (updating all model parameters) is prohibitively expensive for most practitioners. Parameter-efficient fine-tuning (PEFT) methods enable adaptation of LLMs by modifying only a small fraction of the model's weights.
LoRA (Low-Rank Adaptation), introduced by Hu et al. in 2021, injects trainable low-rank decomposition matrices into specific layers of the frozen pre-trained model [41]. Instead of updating a full weight matrix W of dimension d x d, LoRA learns two smaller matrices B (d x r) and A (r x d), where r is much smaller than d (typically 8 to 64). The effective weight is W + BA, which adds only 2dr trainable parameters per adapted matrix while capturing task-specific adaptations. LoRA typically trains 0.1% to 1% of the original parameters.
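The low-rank update can be written out in plain Python. The sketch assumes the shape convention of the LoRA paper, with B of shape d x r and A of shape r x d, so that the product BA has the same d x d shape as W. The `scale` argument stands in for the paper's scaling factor and defaults to 1.0 here purely for illustration.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]

def lora_effective_weight(W, B, A, scale=1.0):
    """Return W + scale * (B @ A); W stays frozen, only B and A are trained."""
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]

def lora_trainable_fraction(d, r):
    """Trainable parameters (2*d*r) relative to one full d x d weight matrix."""
    return (2 * d * r) / (d * d)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight
B = [[1.0], [2.0]]             # d x r with r = 1
A = [[3.0, 4.0]]               # r x d
W_eff = lora_effective_weight(W, B, A)
fraction = lora_trainable_fraction(d=4096, r=8)  # 0.00390625, about 0.4%
```

For a hypothetical 4096 x 4096 matrix at rank 8, the adapter holds about 0.4% as many parameters as the matrix it modifies, consistent with the 0.1% to 1% range quoted above.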
QLoRA (Quantized LoRA), introduced by Dettmers et al. in 2023, combines LoRA with aggressive quantization of the base model [42]. The pre-trained model weights are quantized to 4-bit precision using a new data type called NormalFloat4 (NF4), which is information-theoretically optimal for normally distributed weights. LoRA adapters are then trained in 16-bit precision on top of the frozen quantized base. Key innovations include double quantization (quantizing the quantization constants themselves) and paged optimizers to handle memory spikes. QLoRA makes it possible to fine-tune a 65-billion-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning performance.
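A back-of-the-envelope calculation shows why 4-bit quantization is what makes the 48GB figure plausible. The sketch counts weight storage only, an assumption that ignores quantization constants, LoRA adapters, activations, and optimizer state, so real memory use is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_65b = weight_memory_gb(65e9, 16)   # 130.0 GB: far beyond a single 48GB GPU
nf4_65b = weight_memory_gb(65e9, 4)     # 32.5 GB: fits, with room left for adapters
fp16_175b = weight_memory_gb(175e9, 16) # 350.0 GB
```

The same arithmetic reproduces the roughly 350 GB quoted earlier for GPT-3's 175 billion parameters stored at 16 bits each.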
| Method | Approach | Typical parameters trained |
|---|---|---|
| Full fine-tuning | Updates all parameters | 100% |
| LoRA | Low-rank adapter matrices | 0.1-1% |
| QLoRA | LoRA on 4-bit quantized base | 0.1-1% |
| DoRA | Decomposed weight-norm LoRA | ~0.5% |
| Prefix tuning | Learnable prefix tokens prepended to each layer | <0.1% |
| Adapter layers | Small bottleneck modules inserted between layers | 1-5% |
With LoRA and QLoRA, practitioners can adapt a 7-billion-parameter model on a single consumer GPU in a few hours for roughly $10. Frameworks like LLaMA-Factory and Hugging Face's PEFT library integrate these methods into streamlined training pipelines.
Retrieval-Augmented Generation (RAG) is a technique that enhances LLM outputs by retrieving relevant documents from an external knowledge base before generating a response [43]. RAG addresses several core LLM limitations: it provides access to up-to-date information beyond the training cutoff, reduces hallucination by grounding responses in retrieved evidence, and enables source attribution so users can verify claims.
A typical RAG pipeline involves three steps:
RAG saw explosive research growth in 2024, with over 1,200 RAG-related papers published on arXiv compared to fewer than 100 the previous year [43]. Advanced variants include GraphRAG (Microsoft, 2024), which builds knowledge graphs from documents for more structured retrieval, and Agentic RAG, where an LLM-powered agent plans multi-step retrieval strategies before generating. For enterprise applications, RAG offers a cost-effective alternative to full fine-tuning: rather than retraining the model on proprietary data, organizations can simply index their documents and retrieve relevant passages at query time.
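A retrieve-then-generate flow can be sketched end to end with toy components. A common breakdown, assumed here, is indexing, retrieval, and generation. The "embedding" below is a bag-of-words count vector standing in for a learned encoder, the documents are invented examples, and the final call to a language model is left out: the sketch stops at building the grounded prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': bag-of-words counts (stand-in for a learned encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query, passages):
    """Assemble the grounded prompt that would be sent to the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# Indexing: embed every document once, ahead of query time.
documents = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping takes three to five business days.",
]
index = [(doc, embed(doc)) for doc in documents]

query = "how many days is the refund window"
passages = retrieve(query, index, k=1)
prompt = build_prompt(query, passages)
```

Production systems typically replace `embed` with a neural embedding model and `index` with a vector database, but the control flow is the same.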
Scaling laws describe the predictable relationship between a model's performance (measured by loss on held-out data) and the resources used to train it: model size (parameters), dataset size (tokens), and compute (FLOPs).
In January 2020, researchers at OpenAI (Kaplan et al.) published one of the first systematic studies of neural language model scaling, finding that loss follows a power-law relationship with each of these three factors, with the trend holding over more than seven orders of magnitude in compute. Their work suggested that, given a fixed compute budget, model size should be prioritized over dataset size when scaling up [44].
In 2022, DeepMind's "Chinchilla" paper (Hoffmann et al.) challenged this view. The researchers trained more than 400 models ranging from 70 million to 16 billion parameters on between 5 and 500 billion tokens, and found that for compute-optimal training, model size and training data should be scaled roughly equally. They proposed a ratio of approximately 20 tokens per parameter as optimal [45]. The 70-billion-parameter Chinchilla model, trained on 1.4 trillion tokens with the same compute budget as the much larger 280-billion-parameter Gopher, outperformed Gopher as well as GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a wide range of downstream tasks.
Subsequent research has pushed well beyond the Chinchilla-optimal ratio. Practitioners discovered that models intended for wide deployment benefit from being trained on far more tokens than Chinchilla recommends, because the marginal cost of additional training is small compared to the ongoing cost of serving a larger model to millions of users. The practical effect was that post-2022 models got smaller and trained on more data: Llama 2 70B was trained on 2 trillion tokens, while the 8-billion-parameter LLaMA 3 model was trained at a ratio of roughly 1,875 tokens per parameter, nearly 100 times the Chinchilla-optimal ratio [17]. Research from Tsinghua University suggested a ratio of 192:1 may be more practical for many settings [46]. Loss continues to decrease well beyond the Chinchilla-optimal point, though with diminishing returns.
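The token-to-parameter ratios quoted in this section can be reproduced with simple division. The sketch assumes the 1,875 figure refers to the 8-billion-parameter LLaMA 3 model, since 15 trillion tokens divided by 8 billion parameters gives exactly that value.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio from Hoffmann et al.

def tokens_per_param(tokens, params):
    """Training tokens seen per model parameter."""
    return tokens / params

# (training tokens, parameters) as quoted in the text.
runs = {
    "Chinchilla 70B": (1.4e12, 70e9),
    "Llama 2 70B": (2e12, 70e9),
    "LLaMA 3 8B": (15e12, 8e9),
}
ratios = {name: tokens_per_param(t, p) for name, (t, p) in runs.items()}
overtraining = {name: r / CHINCHILLA_TOKENS_PER_PARAM for name, r in ratios.items()}

for name in runs:
    print(f"{name}: {ratios[name]:.1f} tokens/param, "
          f"{overtraining[name]:.1f}x Chinchilla-optimal")
```

The results are about 20 tokens per parameter for Chinchilla, about 29 for Llama 2 70B, and 1,875 for LLaMA 3 8B, which is roughly 94 times the Chinchilla-optimal ratio.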
A 2024 paper by Sardana et al. ("Beyond Chinchilla-Optimal") formalized this intuition, showing that when inference costs are factored in, the optimal strategy is to train smaller models for longer than the original Chinchilla prescription [47]. The frontier later shifted again toward investing more in inference compute, a regime sometimes called test-time scaling.
The context window (or context length) is the maximum number of tokens the model can process in a single forward pass, including both the input prompt and the generated output. Larger context windows allow the model to work with longer documents, maintain coherence over extended conversations, and perform tasks like whole-codebase analysis or book-length summarization.
Context windows have grown by a factor of approximately 20,000 since 2018, from 512 tokens to 10 million tokens in LLaMA 4 Scout.
| Model | Year | Context window |
|---|---|---|
| GPT-1 | 2018 | 512 tokens |
| GPT-2 | 2019 | 1,024 tokens |
| GPT-3 | 2020 | 2,048 tokens |
| GPT-3.5-Turbo | 2023 | 16,384 tokens |
| GPT-4 Turbo | 2023 | 128,000 tokens |
| Claude 3 Opus | 2024 | 200,000 tokens |
| Gemini 1.5 Pro | 2024 | 1,000,000 tokens |
| GPT-5 | 2025 | 400,000 tokens |
| LLaMA 4 Scout | 2025 | 10,000,000 tokens |
| Claude Opus 4.6 | 2026 | 1,000,000 tokens |
| Gemini 3.1 Pro | 2026 | 1,000,000 tokens |
This expansion has been driven by algorithmic improvements (RoPE and its extensions like LongRoPE and YaRN), more efficient attention mechanisms (FlashAttention, sparse attention), and hardware advances in memory capacity. However, longer context windows introduce new challenges. Performance can degrade when relevant information is buried in the middle of a long document (the "lost in the middle" problem), and processing long contexts increases both latency and cost. KV-cache memory usage grows linearly with sequence length, making million-token contexts expensive to serve at scale. The practical bottleneck has accordingly shifted from advertised window size to the model's actual ability to use information deep inside the context reliably.
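The linear growth of the KV cache is easy to quantify. The sketch uses the standard accounting of one key and one value vector per layer, per key-value head, per token. The model configuration (32 layers, 8 key-value heads, head dimension 128, 16-bit values) is hypothetical and chosen only to make the numbers concrete.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Memory needed to cache keys and values for `seq_len` tokens."""
    # Leading factor of 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)

# Hypothetical configuration, not any specific published model.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **config)              # 131,072 bytes (128 KiB)
at_128k = kv_cache_bytes(seq_len=128_000, **config) / 1e9    # about 16.8 GB
at_1m = kv_cache_bytes(seq_len=1_000_000, **config) / 1e9    # about 131 GB
```

Under these assumptions a single million-token request needs on the order of 131 GB for its cache alone, before counting the model weights.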
One of the most discussed phenomena in LLM research is the concept of emergent abilities: capabilities that appear in larger models but are absent or negligible in smaller ones. Examples include the ability to perform multi-step arithmetic, follow complex instructions, and reason about abstract concepts.
LLMs can learn new tasks from examples provided directly in the prompt, without any weight updates. This capability, known as in-context learning (ICL), scales with model size and context length. With expanded context windows, "many-shot" in-context learning (providing hundreds or thousands of examples rather than just a few) has shown significant performance gains across generative and discriminative tasks. A 2024 paper on many-shot ICL was accepted as a Spotlight Presentation at NeurIPS 2024, documenting performance improvements across a wide variety of tasks [48].
Chain-of-thought (CoT) prompting guides LLMs to break complex problems into intermediate reasoning steps. By prefacing a prompt with "Let's think step by step" or providing worked examples, models produce more accurate answers on math, logic, and science problems. This capability emerges primarily in models above approximately 100 billion parameters and is the foundation for dedicated reasoning models like OpenAI's o1/o3 and DeepSeek R1.
The existence and nature of emergent abilities is debated. Wei et al. (2022) documented numerous tasks where performance appeared to jump discontinuously at certain model scales [49]. However, Schaeffer et al. (2023) argued that apparent emergence may be an artifact of the choice of evaluation metric; when smooth, continuous metrics are used instead of sharp accuracy thresholds, performance improvements look gradual rather than sudden [50].
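The metric argument can be made concrete with a small calculation. Suppose, as a simplifying assumption, that a model gets each token of a 10-token answer right independently with probability p. Per-token accuracy then improves smoothly, while exact-match accuracy, which requires all 10 tokens to be right, stays near zero and then rises steeply.

```python
def exact_match_accuracy(per_token_accuracy, answer_length):
    """Probability that every token is right, assuming independent errors."""
    return per_token_accuracy ** answer_length

per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
exact_match = [exact_match_accuracy(p, 10) for p in per_token]

for p, em in zip(per_token, exact_match):
    print(f"per-token {p:.2f} -> exact match {em:.3f}")
```

Going from 0.5 to 0.8 per-token accuracy moves exact match only from about 0.001 to about 0.107, while the last step from 0.9 to 0.99 moves it from about 0.35 to about 0.90, which looks like a sudden jump even though the underlying skill improved gradually.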
Regardless of the theoretical debate, it is empirically clear that larger and better-trained models can perform tasks that smaller models cannot. The practical question for researchers and engineers is whether a given capability requires a model above a certain size threshold or whether clever training techniques (better data, improved architectures, distillation) can bring that capability to smaller models.
Reasoning models are LLMs trained specifically to spend more computation at inference time by generating extended chains of thought before producing a final answer. OpenAI's o1 was the first widely available example; it and its successors (o3, o4-mini) generate "thinking tokens" that are not shown to the user but allow the model to work through intermediate steps, backtrack when it detects errors, and approach problems more methodically.
Test-time compute scaling refers to the finding that, for reasoning-trained models, performance on hard problems improves with more inference-time computation, whether through longer reasoning chains or through sampling multiple solutions and choosing the best. This creates a second scaling axis beyond model parameters and training tokens: a smaller reasoning model given a larger compute budget at inference can match a larger model that generates answers directly.
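The "sample multiple solutions and choose the best" strategy can be sketched with simple majority voting. The solver below is a hypothetical stand-in for one sampled reasoning chain: it returns the right answer 40% of the time and otherwise one of several scattered wrong answers. Those probabilities, and the function names, are assumptions made for illustration.

```python
import random
from collections import Counter

def noisy_solver(rng, correct_answer=42, p_correct=0.4):
    """Hypothetical stand-in for one sampled reasoning chain."""
    if rng.random() < p_correct:
        return correct_answer
    return rng.choice([40, 41, 43, 44, 45, 46])  # wrong answers are spread out

def majority_vote(n_samples, rng):
    """Sample n answers and return the most common one."""
    answers = [noisy_solver(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    return sum(majority_vote(n_samples, rng) == 42 for _ in range(trials)) / trials

for n in (1, 5, 25):
    print(f"{n:>2} samples per question -> accuracy {accuracy(n):.2f}")
```

Because wrong answers disagree with each other while correct answers agree, accuracy rises with the number of samples. The gain depends on that assumption: if the model's errors are systematic, voting amplifies the wrong answer instead.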
DeepSeek-R1 showed the recipe could be reproduced openly: Group Relative Policy Optimization (GRPO) combined with reinforcement learning with verifiable rewards (RLVR), applied to the DeepSeek-V3 base, yielded reasoning capability matching o1 and was released as MIT-licensed open weights [26]. Qwen3 introduced hybrid thinking/non-thinking modes within a single model family, letting users toggle extended reasoning on or off per request [27], and Google's Gemini 2.5 Deep Think mode uses parallel thinking, generating many candidate reasoning paths simultaneously before selecting the best answer [23].
The limitations of test-time scaling have also become clearer: extended reasoning does not reliably improve performance on knowledge-intensive tasks requiring factual accuracy, and models can reach a correct intermediate step and then deviate toward an incorrect conclusion during prolonged reasoning chains.
As LLMs grow larger, efficient inference becomes increasingly important. A 2025 ACL study found that proper inference optimization techniques can reduce energy usage by up to 73% compared to naive serving, typically translating to a 2-3x reduction in cloud costs [51].
Generating text from an LLM is a token-by-token loop. At each step, the model produces a probability distribution over the vocabulary, a sampling rule picks one token, and the new token is appended to the context for the next step. The main sampling controls are:
| Parameter | Effect |
|---|---|
| Temperature | Sharpens (low) or flattens (high) the next-token distribution; 0 reduces to greedy decoding |
| Top-k | Restricts sampling to the k highest-probability tokens |
| Top-p (nucleus) | Restricts sampling to the smallest set of tokens whose cumulative probability exceeds p |
| Min-p | Drops tokens whose probability is below a fraction of the most likely token |
| Beam search | Maintains multiple candidate sequences and keeps the highest-scoring overall |
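The token-level controls in the table can be combined in a single sampling step. The following is a minimal sketch in plain Python, assuming raw logits as a list of floats; the function name `sample_next_token` and the order in which the filters are applied are illustrative choices, and real inference libraries differ in ordering and defaults. Beam search is omitted because it tracks whole sequences rather than choosing a single token.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, min_p=None, rng=random):
    """Pick a token index from raw logits using temperature, top-k, min-p, and top-p."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy decoding

    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate indices sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        order = order[:top_k]
    if min_p is not None:
        threshold = min_p * probs[order[0]]
        order = [i for i in order if probs[i] >= threshold]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalizes the surviving weights.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits over a 4-token vocabulary
print(sample_next_token(logits, temperature=0.7, top_k=3, top_p=0.9))
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is useful when comparing settings.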
Quantization reduces the numerical precision of model weights from their training precision (typically 16-bit floating point) to lower-bit representations such as 8-bit, 4-bit, or even 2-bit. Relative to 16-bit weights, this cuts memory usage by roughly 2x at 8-bit, 4x at 4-bit, and 8x at 2-bit with modest quality loss, making models that would otherwise require multiple high-end GPUs runnable on consumer hardware, and can speed up inference significantly. NVIDIA's NVFP4 format, for instance, enables 4-bit quantization with minimal accuracy loss, delivering up to 4x throughput improvement on B200 GPUs compared to FP8 on H100 [52]. Common quantization approaches include GPTQ and AWQ; the GGUF file format is widely used to distribute quantized models for llama.cpp.
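The basic idea can be shown with symmetric round-to-nearest quantization to 8-bit integers. This is a deliberately simple sketch, not GPTQ or AWQ, which use calibration data to choose better roundings and scales; the weight values are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127] with one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 for all-zero input
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integers and the shared scale."""
    return [v * scale for v in q]

weights = [0.12, -0.50, 0.33, 0.02, -0.27]  # hypothetical weight values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.5f}", f"max abs error={max_error:.5f}")
```

Each weight is stored in one byte instead of two, and the reconstruction error is bounded by half the scale. Production schemes typically use one scale per small group of weights rather than one per tensor, which keeps the error small when a few outlier weights are present.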
Speculative decoding uses a small, fast "draft" model to generate candidate tokens, which are then verified in parallel by the larger target model. Since the large model can verify multiple tokens simultaneously (a single forward pass over several positions), this approach achieves 2-3x speedups without changing the output distribution. It works best when the draft model's distribution closely matches the target's, which holds for models of the same family at different sizes [53]. NVIDIA's TensorRT-LLM demonstrated up to 3.55x throughput improvement with Llama 3.3 70B using speculative decoding [54].
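The claim that the output distribution is unchanged rests on the accept/reject rule used during verification. The sketch below assumes toy distributions over a three-token vocabulary in place of real models: a drafted token is accepted with probability min(1, p/q), where q and p are the draft and target probabilities of that token, and a rejected token is replaced by a sample from the renormalized residual max(0, p - q). The function name and the probability values are hypothetical, and the extra token that full implementations sample when every draft is accepted is omitted.

```python
import random

def verify_draft(draft_tokens, draft_probs, target_probs, rng):
    """Accept or reject drafted tokens so the result is distributed as if sampled from the target.

    draft_probs[i] and target_probs[i] are the full next-token distributions (lists of floats)
    that the draft and target models assign at drafted position i.
    Returns the accepted tokens plus one replacement token if a draft token was rejected.
    """
    output = []
    for token, q, p in zip(draft_tokens, draft_probs, target_probs):
        if rng.random() < min(1.0, p[token] / q[token]):
            output.append(token)  # accepted: keep the draft token
            continue
        # Rejected: resample from the residual distribution max(0, p - q), renormalized.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        output.append(rng.choices(range(len(p)), weights=residual, k=1)[0])
        break  # everything drafted after a rejection is discarded
    return output

rng = random.Random(0)
q = [0.7, 0.2, 0.1]  # hypothetical draft distribution
p = [0.5, 0.3, 0.2]  # hypothetical target distribution
counts = [0, 0, 0]
for _ in range(20000):
    drafted = rng.choices(range(3), weights=q, k=1)[0]
    counts[verify_draft([drafted], [q], [p], rng)[0]] += 1
print([c / 20000 for c in counts])  # empirical frequencies track p rather than q
```

The closer q is to p, the more often drafts are accepted and the more tokens each target forward pass yields, which is why drafts from the same model family tend to work well.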
Techniques like PagedAttention (used in vLLM) manage the key-value cache more efficiently, reducing memory waste during batched inference. NVFP4 KV cache quantization can cut KV cache memory by up to 50%, effectively doubling context budgets and unlocking larger batch sizes [52].
Traditional static batching waits for a batch of requests to complete before starting a new batch, leaving GPUs idle. Continuous batching (also called in-flight batching) allows new requests to enter mid-batch and completed requests to exit immediately, dramatically improving GPU utilization and throughput.
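A small simulation shows where the gain comes from. The sketch below counts decode steps only, assuming each active request produces one token per step and ignoring prefill and scheduling overhead; the request lengths and function names are hypothetical.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes before the next batch starts."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot immediately and the next queued request takes it."""
    queue = list(lengths)
    active = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.pop(0))
        steps += 1  # one decode step generates one token for every active request
        active = [r - 1 for r in active if r > 1]
    return steps

# Hypothetical output lengths (in tokens) for eight queued requests.
lengths = [200, 10, 15, 180, 12, 20, 8, 190]
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With static batching, short requests hold their slots idle until the longest request in the batch completes; with continuous batching those slots are refilled at once, so the same workload finishes in fewer steps.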
vLLM, released in 2023, introduced PagedAttention to manage the KV cache as pages of virtual memory, dramatically reducing memory fragmentation; it became the dominant open-source serving framework and supports speculative decoding, tensor parallelism, and most major model families [53]. SGLang, developed at UC Berkeley, uses RadixAttention to cache and reuse KV states across requests that share a common prefix; benchmarks show roughly 29% higher throughput than vLLM on 7-8B models on H100 GPUs, with the gap narrowing to 3-5% on 70B+ models. TensorRT-LLM (NVIDIA) and TGI (Hugging Face) round out the major serving options; TensorRT-LLM achieves the highest raw token throughput on NVIDIA hardware through custom CUDA kernels.
No single number captures LLM quality. The benchmark stack used in 2025-2026 includes:
| Benchmark | Domain | Notes |
|---|---|---|
| MMLU | 57 academic subjects, multiple choice | Frontier models exceed 88%; largely saturated [55] |
| GPQA Diamond | Expert biology, chemistry, physics (198 questions) | Non-expert PhDs score ~34%; top models exceed 85% |
| HumanEval | 164 Python coding problems, unit-tested | Top models exceed 90% pass@1 |
| SWE-bench Verified | Real GitHub issues, patch must pass project tests | Gold standard for agentic coding; GPT-5 hit 74.9% [20] |
| GSM8K | Grade-school math word problems | Near-saturated; top models exceed 95% |
| MATH | Competition-level math | Harder than GSM8K; still discriminating |
| AIME 2025 | American Invitational Mathematics Examination (olympiad-qualifier problems) | GPT-5 achieved 94.6% without tools [20] |
| ARC-AGI | Abstract visual grid reasoning | Tests general intelligence; GPT-5.5 scored 95.0% [56] |
| Humanity's Last Exam (HLE) | 2,500 expert questions across 100+ subjects | Early 2025 models scored under 22%; Grok 4 reached 24% [57] |
| FrontierMath | Research-level mathematics | GPT-5.2 Thinking solved 40.3% on tiers 1-3 |
| BIG-Bench Hard | Reasoning and knowledge tasks | Broad collection for probing model capability |
| TruthfulQA / HaluEval | Hallucination and truthfulness | Adversarial truthfulness evaluation |
Benchmark saturation is a chronic problem. MMLU reached near-ceiling scores by 2024 [55]. The reaction has been to introduce harder benchmarks (GPQA Diamond, FrontierMath, Humanity's Last Exam) and to lean on agentic, real-world evaluations like SWE-bench Verified that are harder to game with narrow optimization.
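The pass@1 figures quoted for unit-tested coding benchmarks such as HumanEval are usually computed with the unbiased pass@k estimator: generate n samples per problem, count the c that pass, and estimate the chance that at least one of k randomly chosen samples passes as 1 - C(n-c, k)/C(n, k). The sketch below applies that formula; the per-problem results are hypothetical.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n generated samples, of which c pass the unit tests."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem results: (samples generated, samples that passed).
results = [(20, 20), (20, 5), (20, 0), (20, 1)]
for k in (1, 10):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k} = {score:.3f}")
```

For k = 1 the estimate reduces to the fraction of samples that pass, c/n, averaged over problems.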
The LLM ecosystem is split between proprietary (closed) models and open-weight (open) models, with ongoing debate about the advantages of each approach.
Closed models like GPT-5, Claude, and Gemini are developed by companies that do not release the model weights. Users access them through APIs or chat interfaces. Advantages include strong safety measures, regular updates, and state-of-the-art performance. Drawbacks include vendor lock-in, limited customization, unpredictable pricing changes, and data privacy concerns (since user inputs are sent to third-party servers).
Open-weight models like LLaMA 4, Mistral Large 3, DeepSeek V3, and Qwen 3 release their trained weights for anyone to download and run. This allows full customization, fine-tuning for specific domains, and local deployment without sending data to external servers. Other notable open-weight families include Google's Gemma (distilled from Gemini) and the Technology Innovation Institute's Falcon series (7B to 180B parameters), trained on RefinedWeb [36]. Licenses span a spectrum from permissive (Apache 2.0 for Mistral, Qwen, and many Gemma releases) to bespoke and restrictive (the Llama Community License, Gemma terms). Meta widely described Llama 2 as "open source," but the Open Source Initiative has argued that the term is misleading for models whose licenses impose redistribution and use limits, preferring the label "open weights" [58]; the OSI's Open Source AI Definition (OSAID) 1.0, published in October 2024, sets formal criteria that most open-weight models fail because they do not release their training data [59].
DeepSeek-V3 and R1 marked a turning point: the first time a freely downloadable open-weight model from outside the United States matched the reasoning quality of frontier closed models on widely cited benchmarks, while reportedly using a much smaller training budget [26]. This intensified an already active debate about whether open weights are a safety risk (because alignment training can be undone with cheap fine-tuning) or a safety asset (because the wider research community can study and patch the models).
By early 2026, the performance gap between open and closed models has narrowed substantially. Open-weight models trail proprietary frontier models by only about three months on average across standard benchmarks [60]. However, closed models maintain a lead on complex agentic tasks, production-quality coding benchmarks (SWE-bench), and overall human preference ratings on platforms like Chatbot Arena. For domain-specific applications such as legal document analysis or medical coding, a fine-tuned 7B open model can often outperform a general-purpose frontier model while running on a single consumer GPU.
Parameter counts, where reported, are total parameters; context windows are at the standard pricing tier where applicable.
| Model | Provider | Released | Parameters | Context | License | Notes |
|---|---|---|---|---|---|---|
| BERT base/large | Google | Oct 2018 | 110M / 340M | 512 | Apache 2.0 | Encoder-only, masked LM [8] |
| GPT-2 | OpenAI | 2019 | 1.5B (largest) | 1024 | MIT (weights) | Staged release; full 1.5B weights released Nov 2019 [7] |
| T5 (11B) | Google | Oct 2019 | 11B | 512 | Apache 2.0 | Text-to-text encoder-decoder [11] |
| GPT-3 | OpenAI | May 2020 | 175B | 2048 | API only | Demonstrated in-context few-shot learning [9] |
| InstructGPT | OpenAI | Mar 2022 | 1.3B / 6B / 175B | 2048 | API only | First major RLHF deployment [13] |
| ChatGPT | OpenAI | Nov 2022 | not disclosed | 4096 (initial) | Product | Brought LLMs to general public |
| GPT-4 | OpenAI | Mar 2023 | not disclosed | 8K / 32K | API only | Multimodal vision, no published params [14] |
| Llama 2 | Meta | Jul 2023 | 7B / 13B / 70B | 4096 | Llama 2 Community | First weights-available chat-tuned Llama [58] |
| Mistral 7B | Mistral AI | Sep 2023 | 7.3B | 8192 | Apache 2.0 | Strong small dense model |
| Mixtral 8x7B | Mistral AI | Dec 2023 | 46.7B (12.9B active) | 32K | Apache 2.0 | Sparse MoE [30] |
| Gemini 1.0 | Google DeepMind | Dec 2023 | not disclosed | 32K | API only | Native multimodal training |
| GPT-4o | OpenAI | May 2024 | not disclosed | 128K | API only | Native text, audio, image I/O [16] |
| Llama 3.1 | Meta | Jul 2024 | 8B / 70B / 405B | 128K | Llama 3 Community | 405B trained on 15T+ tokens, 16K H100s [17] |
| Qwen 2.5 | Alibaba | Sep 2024 | 0.5B to 72B | up to 128K | Apache 2.0 (most) | Pretrained on 18T tokens [37] |
| Gemma 2 | Google | Jun 2024 | 2B / 9B / 27B | 8192 | Gemma terms | Open-weight, distilled from Gemini |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 671B (37B active) | 128K | MIT (weights) | RL-trained reasoning model on V3 base [26] |
| Gemini 2.5 Pro | Google DeepMind | Mar 2025 | not disclosed | 1M | API only | Thinking model; Deep Think variant [23] |
| Llama 4 Scout | Meta | Apr 2025 | 109B (17B active) | 10M | Llama 4 Community | Natively multimodal MoE, 16 experts [25] |
| Llama 4 Maverick | Meta | Apr 2025 | 400B (17B active) | 1M | Llama 4 Community | 128 experts, natively multimodal MoE [25] |
| Qwen3 235B-A22B | Alibaba | Apr 2025 | 235B (22B active) | 131K | Apache 2.0 | Hybrid thinking/non-thinking, 36T tokens [27] |
| GPT-4.1 | OpenAI | Apr 2025 | not disclosed | 1M | API only | 54.6% on SWE-bench Verified [19] |
| Claude Opus 4 | Anthropic | May 2025 | not disclosed | 200K | API only | Released alongside Sonnet 4 [21] |
| DeepSeek-V3.1 | DeepSeek | Aug 2025 | 685B | 128K | MIT (weights) | Hybrid thinking/non-thinking mode |
| Model | Developer | Release date | Total parameters | Active parameters | Context window | Architecture | License |
|---|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | Aug 2025 | Undisclosed | Undisclosed | 400K tokens | Decoder-only | Proprietary |
| Claude Opus 4.6 | Anthropic | Feb 2026 | Undisclosed | Undisclosed | 1M tokens | Decoder-only | Proprietary |
| Gemini 3.1 Pro | Google DeepMind | Feb 2026 | Undisclosed | Undisclosed | 1M tokens | Decoder-only | Proprietary |
| LLaMA 4 Maverick | Meta | Apr 2025 | 400B | 17B | 1M tokens | MoE | Open-weight (Llama license) |
| Mistral Large 3 | Mistral AI | Dec 2025 | 675B | 41B | 256K tokens | MoE | Apache 2.0 |
| DeepSeek V3 | DeepSeek | Dec 2024 | 671B | 37B | 128K tokens | MoE | Open-weight (MIT) |
| DeepSeek R1 | DeepSeek | Jan 2025 | 671B | 37B | 128K tokens | MoE | Open-weight (MIT) |
The frontier LLM market is concentrated among a small number of well-funded organizations with access to large GPU clusters and proprietary training data.
OpenAI, founded in 2015 and based in San Francisco, released the GPT series and ChatGPT, which catalyzed mainstream adoption. It operates as a capped-profit company partially owned by Microsoft, and ChatGPT had over 400 million weekly users as of early 2025. The o-series reasoning models (o1, o3, o4-mini) form a separate product line optimized for test-time compute scaling, and GPT-5.5 achieved 84.9% on the GDPval knowledge-work benchmark and led the ARC-AGI leaderboard at 95.0% [56].
Anthropic, founded in 2021 by former OpenAI researchers including Dario Amodei and Daniela Amodei, focuses on AI safety research alongside model development; its Claude family uses Constitutional AI and RLAIF for alignment. Claude 3 Opus briefly held the top spot on multiple benchmarks when released in March 2024, and Claude Opus 4.7 (April 2026) scored 87.6% on SWE-bench Verified, leading on agentic coding benchmarks.
Google DeepMind, formed through the 2023 merger of Google Brain and DeepMind, trains the Gemini family, distributed through the Gemini consumer product, Google Cloud Vertex AI, and the Gemini API; the open-weight Gemma family provides smaller models under permissive terms.
Meta AI releases its Llama family as open weights, making Meta the dominant provider of open-weight base models. Part of its strategic motivation is to keep proprietary model vendors from controlling the AI infrastructure costs of Meta's own products.
xAI, founded by Elon Musk in 2023, trains the Grok series on its Colossus supercluster. Grok 3 (February 2025) was trained with 10x the compute of previous xAI models and achieved 84.6% on GPQA Diamond [61]; Grok 4 (mid-2025) set then-record scores on GPQA Diamond (88%) and Humanity's Last Exam (24%), achieving an Artificial Analysis Intelligence Index of 73, ahead of competing frontier models at the time.
DeepSeek, a Chinese AI lab affiliated with the quantitative hedge fund High-Flyer, released the MIT-licensed V3 and R1 models that sparked the 2025 debate about AI training economics. DeepSeek-V3.1 followed in August 2025 with hybrid thinking mode, and DeepSeek-V3.2 later reportedly matched GPT-5 on several benchmarks.
Mistral AI, a French startup founded in 2023 by former Google DeepMind and Meta researchers, focuses on efficient open models. Its Mistral 7B and Mixtral 8x7B were widely adopted in the open-source community; Codestral targets code generation and Mistral Large targets enterprise use.
Alibaba's Qwen team produces the Qwen family, covering sizes from 0.5B to over 235B parameters with strong multilingual coverage (Qwen3 supports 119 languages [27]). Qwen 2.5-72B-Instruct was reported to compete with Llama 3.1 405B-Instruct, which has roughly five times its parameter count [37].
Modern LLMs demonstrate a broad range of capabilities that have expanded significantly with each generation.
Text generation and conversation: LLMs can produce fluent, coherent text on virtually any topic. They power chatbots, writing assistants, and content generation tools used by millions of people daily.
Reasoning and problem-solving: Frontier models can perform multi-step logical reasoning, solve mathematical problems, and pass standardized exams. GPT-5 scored 94.6% on the AIME 2025 math benchmark without tools, and reasoning-focused models like OpenAI's o1 and DeepSeek R1 can tackle complex problems using extended chain-of-thought processing [20][26].
Code generation: LLMs have become powerful programming assistants. Claude Sonnet 4.5 achieved 77.2% on SWE-bench Verified (a benchmark of real-world software engineering tasks), and models can write, debug, refactor, and explain code in dozens of programming languages [22].
Translation: LLMs perform high-quality translation between many language pairs, often rivaling or exceeding dedicated machine translation systems. LLaMA 4 was trained across over 200 languages [25].
Summarization: Models can condense long documents into concise summaries while preserving key information, a capability that improves substantially with larger context windows.
Agentic behavior: A significant development in 2025-2026 has been the emergence of agentic LLMs that can plan multi-step tasks, use external tools, browse the web, write and execute code, and interact with computer interfaces. Anthropic's Model Context Protocol (MCP) and Google's Agent-to-Agent Protocol (A2A) are establishing standards for how agents connect to external tools and APIs [62]. Gartner predicts that 40% of enterprise applications will embed AI agents by the end of 2026.
Multimodal LLMs extend the standard text-only framework by accepting, and often generating, non-text modalities. GPT-4 introduced image understanding in 2023, and by 2025 native multimodal training (jointly on text, images, and video) had become standard for frontier models.
GPT-4V (November 2023) and GPT-4o accept images as part of the prompt, enabling tasks like chart interpretation, document understanding, and visual question answering. Gemini was designed from the start to be natively multimodal, trained jointly on text, images, audio, and video rather than adding vision as a bolt-on capability. Claude 3 (March 2024) added vision across all model tiers, and Claude Opus 4.7 (April 2026) features a 3x jump in image resolution, reaching 2,576px for professional-grade visual analysis.
Open-source vision-language models became highly capable through 2024-2025: the LLaVA, InternVL, and Qwen-VL families achieved GPT-4V-level performance in open-weight form, and Meta's LLaMA 4 models are jointly pretrained on text, image, and video tokens [25].
GPT-4o extended the multimodal stack to native audio input and output, enabling near-real-time voice conversations [16], and Gemini 1.5 Pro supports audio as a native input modality within its long-context window. Specialized audio models such as Whisper (OpenAI, 2022) handle speech-to-text transcription upstream of text-only models.
Gemini 1.5 Pro and 2.0 support video input directly within the context window, enabling temporal reasoning over hours of footage. Several open-source video-language models (LLaVA-Video, InternVideo) followed in 2024-2025.
LLMs have found applications across nearly every sector of the economy.
| Sector | Applications | Examples |
|---|---|---|
| Software development | Code generation, debugging, testing, refactoring | GitHub Copilot, Cursor, Claude Code |
| Customer service | Chatbots, virtual assistants, ticket routing | ChatGPT Enterprise, Intercom Fin |
| Healthcare | Clinical documentation, literature review, patient communication | Med-PaLM, ambient scribes |
| Legal | Contract analysis, legal research, document drafting | Harvey AI, CoCounsel |
| Education | Personalized tutoring, grading, content generation | Khan Academy Khanmigo, Duolingo |
| Scientific research | Literature review, hypothesis generation, data analysis | Elicit, Consensus |
| Finance | Sentiment analysis, compliance, report generation | Bloomberg GPT, FinGPT |
| Content creation | Writing assistance, marketing copy, creative writing | Jasper, Copy.ai |
Agentic coding capability has improved especially rapidly: SWE-bench resolution rates went from under 5% in 2023 to 74.9% in 2025 [20], and autonomous coding agents now tackle multi-file refactors and resolve real-world GitHub issues without step-by-step guidance. Orchestration frameworks such as LangChain and LlamaIndex automate retrieval and tool use, and multi-agent systems assign different roles to different model instances to decompose complex tasks.
Industry analysts project that the agentic AI market will grow from $7.8 billion in 2025 to over $52 billion by 2030 [62].
Despite rapid progress, LLMs face several fundamental limitations.
LLMs sometimes generate plausible but factually incorrect information, a phenomenon known as hallucination. Theoretical work has shown that hallucination is an inherent property of LLMs and cannot be completely eliminated through architecture, data, or algorithmic improvements alone [63]. The problem stems from the fact that LLMs learn statistical patterns rather than grounding their knowledge in verified facts: the model is rewarded for producing plausible-sounding text, not for refusing to answer when uncertain, so it will fabricate citations, invent code that calls non-existent functions, and confidently give wrong answers in long-tail domains. On constraint satisfaction tasks, hallucination rates scale linearly with problem complexity.
Mitigation approaches include RAG (grounding responses in retrieved documents), chain-of-verification (having the model check its own outputs), and calibrated uncertainty (systems that transparently signal doubt and can safely refuse to answer rather than guessing). A 2025 multi-model study showed that simple prompt-based mitigation cut GPT-4o's hallucination rate from 53% to 23% [63]. While frontier models have reduced hallucination rates significantly (a reported rate of approximately 6.2% for GPT-5), the problem persists.
While LLMs have improved substantially at reasoning tasks, they still fail on problems that require genuine logical deduction, spatial reasoning, or common sense in unfamiliar contexts. State-of-the-art models perform poorly on certain clinical reasoning tasks and can struggle with novel problem formulations that differ from their training distribution [64]. Reasoning models (o1, DeepSeek R1) have partially addressed this through extended chain-of-thought processing, but at the cost of significantly increased inference time and expense.
Since LLMs are trained on internet text, they can learn and reproduce societal biases present in the training data. These biases can manifest in harmful stereotypes, uneven performance across languages and demographics, and skewed representations. Alignment techniques (RLHF, DPO) mitigate but do not eliminate these issues.
A model trained through a given date knows nothing about later events except through retrieval or tools; the weights encode a snapshot of the world as of the training cutoff. This is why almost all chat products now ship with web search, and why RAG pipelines are standard in enterprise deployments.
LLMs can be exploited for generating disinformation, phishing emails, malicious code, and other harmful content. Prompt injection attacks can manipulate LLM-powered applications into ignoring their instructions; the OWASP 2025 list ranks prompt injection as the top vulnerability for LLM-integrated applications [65]. Three related but distinct concerns dominate the security literature:
Defenses combine input filtering, separate trust levels for system, developer, and user content, output checks, and defense-in-depth rather than reliance on the model's own safety training. Defending against these attacks remains an active area of research.
Alignment research asks whether the stated goal of producing helpful, harmless, and honest outputs can be durably encoded into model weights. Anthropic's Constitutional AI and scalable oversight research are two published frameworks for pursuing this at scale without requiring human labeling of every output [40]. At the organizational level, OpenAI's Preparedness Framework and Anthropic's Responsible Scaling Policy describe commitments to evaluate models at capability thresholds before deployment.
Training and deploying LLMs requires enormous computational resources, raising significant environmental concerns.
Training GPT-3 consumed an estimated 1,287 megawatt-hours (MWh) of electricity and produced over 550 metric tons of CO2 equivalent emissions, while requiring more than 700 kiloliters of water for cooling [66]. As models have grown, costs have scaled accordingly. GPT-4's training is estimated at $78-100 million, and Gemini Ultra 1.0 reached approximately $192 million. Epoch AI estimates that the cost of frontier training runs has grown by 2-3x per year over the past eight years [38].
Recent research indicates that inference, rather than training, is emerging as the primary contributor to ongoing environmental costs, since inference occurs continuously at massive scale while training is a one-time event. A 2025 study estimated that GPT-4o inference alone would require approximately 391,000 to 463,000 MWh of electricity annually at current usage levels, comparable to the annual electricity use of 35,000 U.S. homes [66]. The most energy-intensive models consume over 29 Wh per long prompt, more than 65 times the most efficient systems.
Inference pricing has fallen as dramatically as capability has risen. GPT-4 launched in 2023 with an 8K-token context window at $0.03 per 1,000 input tokens; by 2025 GPT-4.1 offered a 1M-token window at lower per-token prices [19], and open-weight models on commodity hardware pushed marginal inference cost to near zero for many use cases.
Research has also shown that LLMs can have dramatically lower environmental impact than human labor for equivalent output. For a typical LLM like Llama-3-70B, the human-to-LLM emissions ratio ranges from 40:1 to 150:1, meaning the LLM produces 40 to 150 times less carbon per unit of output than the human equivalent [66]. Optimization techniques (quantization, efficient serving, renewable-powered data centers) continue to improve the efficiency of LLM deployment.
As of early 2026, the LLM field is characterized by several major trends.
Million-token context windows are now standard among frontier models. Claude Opus 4.6 and Gemini 3.1 Pro both offer 1-million-token windows, and Llama 4 Scout pushes to 10 million tokens. These expanded windows enable processing of entire codebases, book-length documents, and multi-hour conversation histories in a single pass.
Agentic capabilities have become a defining feature. Frontier models can use tools, browse the web, write and execute code, manage files, and carry out multi-step tasks with minimal human supervision. Frameworks built on MCP and A2A allow agents to connect to external services and APIs through standardized protocols. Multi-agent systems, where orchestrated teams of specialized agents collaborate on tasks, saw a 1,445% increase in interest from Q1 2024 to Q2 2025 according to Gartner [62]. Agent reliability remains the open problem for commercial deployment: models still make errors in long agentic loops, and reducing error rates in tool use, code execution, and long-horizon planning is central to converting chat assistants into autonomous workers.
Reasoning models represent a distinct category. OpenAI's o1/o3 series and DeepSeek R1 use extended internal "thinking" to solve complex problems, trading speed for accuracy on mathematical, scientific, and coding tasks.
Mixture-of-experts architectures have become widespread, allowing models to scale total parameter counts into the hundreds of billions or trillions while keeping inference costs practical by activating only a fraction of parameters per token.
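The "activate only a fraction of parameters per token" mechanism is typically a learned router that selects a few experts for each token. The following is a minimal sketch of top-k routing for a single token, assuming gate weights come from a softmax over the selected experts' router logits (one common convention; implementations vary). The experts are trivial stand-in functions and the router logits are hypothetical.

```python
import math

def route_token(router_logits, top_k=2):
    """Pick the top-k experts for one token and return {expert_index: gate_weight}."""
    chosen = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Softmax over only the selected experts so their gate weights sum to 1.
    m = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, top_k=2):
    """Weighted sum of the selected experts' outputs; the other experts are never evaluated."""
    gates = route_token(router_logits, top_k)
    return sum(weight * experts[i](x) for i, weight in gates.items())

# Eight hypothetical "experts", each a trivial function standing in for a feed-forward block.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # hypothetical router output
gates = route_token(router_logits)
print(gates, moe_layer(1.0, experts, router_logits))
print(f"active fraction of experts: {len(gates) / len(experts):.2f}")
```

Only two of the eight experts run for this token, so compute per token scales with the active experts while total capacity scales with all of them.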
The open-weight ecosystem continues to mature. Models like LLaMA 4, DeepSeek V3, Mistral Large 3, and Qwen 3 provide near-frontier capabilities with full weight access, enabling fine-tuning, local deployment, and research that would be impossible with closed models.
Hybrid architectures are beginning to appear. NVIDIA's Nemotron 3 family (announced December 2025) combines Mamba (a state-space model) with Transformer layers in an MoE configuration, targeting improved inference throughput and long-context efficiency for agent workloads [67].
The pace of frontier model releases continued through the second quarter of 2026. OpenAI launched GPT-5.5 on April 23, 2026, describing it as its most capable and intuitive model to date, with particular gains in agentic coding, scientific research, and computer use. The model offers a roughly 1-million-token context window with up to 128,000 output tokens, and is priced at $5 per million input tokens and $30 per million output tokens, double the cost of GPT-5.4 [68]. It scored 82.7% on Terminal-Bench 2.0 and posted strong results on the FrontierMath benchmark [68]. On May 5, 2026, OpenAI released GPT-5.5 Instant as the new default model for all ChatGPT users, replacing GPT-5.3 Instant [69].
Google introduced Gemini 3.5 Flash at Google I/O on May 19, 2026, positioning it as its strongest agentic and coding model and reporting that it runs roughly four times faster (in output tokens per second) than comparable frontier models. It scored 76.2% on Terminal-Bench 2.1, with Gemini 3.5 Pro slated to follow [70].
Anthropic released Claude Opus 4.8 on May 28, 2026, citing improved agentic coding, reasoning, and honesty. The company reported the model reaches 84% on the Online-Mind2Web browser-agent benchmark and is about four times less likely than its predecessor to let flaws in its own code pass unremarked [71]. Pricing remained $5 per million input tokens and $25 per million output tokens [71]. The same day, Anthropic announced a $65 billion Series H round at a $965 billion post-money valuation [72], a figure that news outlets reported surpassed OpenAI's valuation, making Anthropic the most valuable AI startup at the time [73].
Several notable frontier model pages were added as dedicated entries rather than being covered only inside this overview:
| Model | Developer | Why it matters |
|---|---|---|
| GPT-5.4 | OpenAI | Mainline reasoning model with computer use and tool search |
| GPT-4.1 | OpenAI | API-only family focused on coding and 1M context |
| Gemini 3 Pro | Google DeepMind | Gemini 3-series flagship preview model |
| Claude Opus 4.7 | Anthropic | Anthropic's April 2026 flagship generally available model |
| Grok 4.1 Fast | xAI | 2M-context tool-calling model for agentic tasks |
A large language model is like a super-smart computer program that has read billions of books, articles, and web pages. By reading all that text, it learned how words and sentences fit together. When you ask it a question or give it a task, it figures out what words should come next, one at a time, to write a helpful answer. It can do lots of things: translate languages, answer questions, write stories, help with homework, or even write computer code. But it is not perfect. Sometimes it makes things up that sound right but are not true, because it learned patterns in language rather than actually understanding the world the way people do.