LLMs
Last reviewed
May 7, 2026
Sources
39 citations
Review status
Source-backed
Revision
v5 ยท 6,957 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 7, 2026
Sources
39 citations
Review status
Source-backed
Revision
v5 ยท 6,957 words
Add missing citations, update stale details, or suggest a clearer explanation.
Large language models (LLMs) are a class of neural network trained on very large bodies of text, typically with billions to trillions of parameters, that learn to predict tokens in a sequence and can be steered to follow instructions, write code, summarize documents, translate, reason about images, and call external tools. The category is fuzzy, since there is no formal parameter threshold that makes a model "large," but in practice the term is used for transformer-based models trained with self-supervised objectives on web-scale text and then post-trained for chat or task use. Wikipedia's working definition is simply "a neural network trained on a vast amount of text for natural language processing tasks, especially language generation" [1].
Modern LLMs sit at the center of generative AI products such as ChatGPT, Claude, Gemini, Microsoft Copilot, and Meta AI. They are also the substrate for the open-weight ecosystem around Llama, Mistral, Qwen, DeepSeek, and Gemma.
The "large" in LLM has shifted with hardware. GPT-2's 1.5 billion-parameter model was treated as too dangerous to release in early 2019; by 2025, models with several hundred billion total parameters were running in commercial chat products [2][3]. Three properties are usually present:
Models below roughly 1 billion parameters are sometimes called "small" language models, but the boundary is informal.
Language modeling predates deep learning. Statistical n-gram models from the 1990s and 2000s estimated the probability of the next word from counts of short sequences in a fixed corpus, and were the workhorse of speech recognition and machine translation for decades. By 2001, smoothed n-gram models trained on roughly 300 million words held the state of the art in perplexity [1].
The shift to learned distributed representations began with neural probabilistic language models (Bengio et al., 2003) and accelerated with word embeddings. Word2vec, published by Tomas Mikolov and colleagues at Google in 2013, made dense word vectors cheap to train and showed that arithmetic on those vectors captured surprising semantic structure, including the famous king minus man plus woman example [4]. GloVe followed in 2014 with a co-occurrence-based formulation [5]. ELMo (2018) extended this to contextual embeddings using bidirectional LSTMs.
The modern era began with the transformer paper, "Attention Is All You Need" (Vaswani et al., NeurIPS 2017), which dropped recurrence in favor of multi-head self-attention and dramatically improved parallelism on GPUs and TPUs [6]. Two complementary directions then split off:
T5 (Raffel et al., 2019) explored encoder-decoder transformers in the "text-to-text" framing, training the same architecture on translation, summarization, and classification by recasting all tasks as sequence-to-sequence problems; the largest checkpoint had 11B parameters [9].
The transition from raw language model to chat product happened with InstructGPT (Ouyang et al., March 2022), which combined supervised fine-tuning with reinforcement learning from human feedback (RLHF). Human labelers preferred outputs from a 1.3B-parameter InstructGPT model over the 175B GPT-3 base model, despite a 100x parameter gap [10]. ChatGPT, released by OpenAI on November 30, 2022, applied this recipe at scale and brought LLMs to a general audience in a way that earlier demos had not. The service reached 100 million users in two months, faster than any prior consumer application.
GPT-4 followed on March 14, 2023, with improved reasoning and a multimodal vision capability; OpenAI did not publish parameter counts or training compute [11]. Anthropic launched Claude in March 2023 and Claude 2 in July 2023, emphasizing a "Constitutional AI" approach to safety. Google launched Bard in March 2023 and later rebranded it as Gemini, backed by the Gemini family of natively multimodal models. Meta released Llama 2 in July 2023 as an open-weights model for research and commercial use.
2024 and 2025 were the years of multimodal-by-default models, longer context windows, and reasoning-trained variants. GPT-4o launched May 13, 2024 with native text, image, and audio input/output and audio response times around 320 milliseconds [12]. Llama 3.1, including a 405B-parameter version trained on more than 15 trillion tokens with a 128K context window, shipped July 23, 2024 [13]. DeepSeek-V3 (December 2024) and DeepSeek-R1 (January 2025) introduced a 671B-parameter mixture-of-experts model trained on 14.8 trillion tokens that matched frontier closed models on reasoning benchmarks at a fraction of the reported training cost [14].
Meta released Llama 4 on April 5, 2025, with the Scout and Maverick models natively multimodal and trained on more than 30 trillion tokens using a mixture-of-experts architecture [33]. Qwen3 from Alibaba, released April 28, 2025, was trained on 36 trillion tokens and introduced hybrid thinking/non-thinking modes across a family of dense and MoE models [34].
Anthropicf released Claude Opus 4 and Sonnet 4 on May 22, 2025 [15]. Google's Gemini 2.5 Pro, released March 20, 2025, shipped a 1-million-token context window and a "thinking" reasoning mode, with a Deep Think variant rolled out in August 2025 using parallel thinking techniques [17]. OpenAI released GPT-5 in 2025, achieving 94.6% on AIME 2025 without tools and 74.9% on SWE-bench Verified for agentic coding [32]. GPT-4.1 followed on April 14, 2025 with a 1-million-token context window and large coding-benchmark gains over GPT-4o [18].
Nearly every production LLM as of 2026 is a transformer. The core unit is the self-attention layer: each token is projected to a query, key, and value vector, attention weights are computed by a softmax over query-key dot products, and the output is a weighted sum of value vectors. Stacking dozens to hundreds of these layers, interleaved with feed-forward networks and normalization, gives the model the capacity to mix information across long token spans [6].
Three architectural families coexist:
| Family | Pretraining objective | Typical use | Examples |
|---|---|---|---|
| Encoder-only | Masked-language modeling, next-sentence prediction | Classification, retrieval, embeddings | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal next-token prediction | Generation, chat, agents | GPT-3, Llama, Claude, Gemini, Mistral |
| Encoder-decoder | Span corruption (T5) or denoising | Translation, summarization, instruction-following | T5, BART, Flan-T5 |
Decoder-only autoregressive transformers became the default for general-purpose chat models because next-token prediction works for any task that can be written as text and because the same model serves both prompt encoding and generation.
The original transformer used fixed sinusoidal positional encodings [6]. Modern models almost always use learned alternatives. Rotary Position Embedding (RoPE), introduced by Su et al. in RoFormer (2021), encodes position by rotating query and key vectors, preserves relative position under shifts, and extrapolates more gracefully than absolute encodings; it is used in Llama, GPT-NeoX, and most newer open models [19]. ALiBi (Press et al., 2022) instead biases attention scores by a linear function of token distance and continues to work well past the training context length.
Mixture of Experts (MoE) routes each token through a small subset of expert feed-forward networks rather than running the full network on every token. Mistral AI's Mixtral 8x7B, released December 11, 2023, has 46.7 billion total parameters but uses only about 12.9 billion per token, giving it the inference cost of a much smaller dense model while matching or beating Llama 2 70B on many benchmarks [20].
DeepSeek-V3 pushed this further: 671 billion total parameters, 37 billion active per token, 256 routed experts plus a shared expert per layer, with auxiliary-loss-free load balancing [14]. Llama 4 Scout uses 16 experts with 17B active parameters out of 109B total; Llama 4 Maverick uses 128 experts with 17B active parameters out of 400B total, both natively multimodal [33].
Mamba, introduced by Gu and Dao (December 2023), uses selective state space models rather than attention as the core sequence-mixing operation [36]. Mamba scales linearly with sequence length in both computation and memory, compared to the quadratic cost of standard attention, making it attractive for very long sequences.
Hybrid architectures that interleave Mamba layers with attention layers have shown that combining the two can outperform either alone. AI21 Labs' Jamba family interleaves Mamba and attention layers and achieved production deployment, with Jamba 1.5 scaling to 398B total parameters (94B active) using 16 MoE experts [37]. As of 2025, pure Mamba models have not displaced transformers in frontier chat products, but hybrid research remains active.
A frontier LLM is built in stages. The terminology varies between labs, but the structure is fairly stable.
The model is trained with self-supervised next-token prediction on a corpus of web text, books, code, scientific papers, and increasingly synthetic data. The standard public source is Common Crawl, a non-profit web archive that has been crawling the web since 2007 and releases monthly snapshots of 200 to 400 TiB [21]. Derivative datasets clean and deduplicate it. RefinedWeb (2023) produced 5 trillion English tokens and was used to train Falcon. FineWeb (2024) is a 15-trillion-token dataset distilled from 96 Common Crawl snapshots [22]. Llama 3.1 was trained on more than 15 trillion tokens; Qwen 2.5 used 18 trillion; Qwen3 used 36 trillion [13][23][34].
The key hyperparameters at this stage are model size (parameter count), dataset size (tokens), and compute budget. The relationships between these were formalized as scaling laws.
Kaplan et al. (OpenAI, January 2020) showed that test loss scales as a power law in model size, dataset size, and compute, with the power-law trend holding over more than seven orders of magnitude in compute. Their conclusion was that, given a fixed compute budget, you should spend most of it on a larger model and undertrain it on relatively few tokens [24].
The DeepMind Chinchilla paper (Hoffmann et al., March 2022) revisited this by training more than 400 models from 70 million to 16 billion parameters on between 5 and 500 billion tokens. They found that for compute-optimal training, model size and training tokens should grow at the same rate: roughly 20 tokens per parameter, not the much smaller ratios used by GPT-3 and similar models. They tested this by training Chinchilla, a 70B model on 1.4 trillion tokens, with the same compute budget as the 280B Gopher; Chinchilla beat Gopher, GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a wide range of downstream tasks [25].
The practical effect was that post-2022 models got smaller and trained on more data. Llama 2 70B was trained on 2 trillion tokens; Llama 3.1 8B on more than 15 trillion. The cost-optimal frontier moved toward more data per parameter, then later toward investing more in inference compute, a shift sometimes called the test-time scaling regime.
A freshly pretrained model is a competent next-token predictor but not a useful assistant. Post-training converts it into one, in roughly this order:
Direct Preference Optimization (DPO), introduced by Rafailov et al. at NeurIPS 2023, replaced the reward-model-plus-PPO pipeline with a single supervised classification loss on preference pairs. The trick is that the optimal RLHF policy can be written in closed form as a function of the reward, so the reward model implicit in the policy can be optimized directly. DPO matches or beats PPO-based RLHF on summarization and dialogue tasks while being simpler to implement, and is now the default in many open-source post-training stacks [27].
Group Relative Policy Optimization (GRPO), introduced in the DeepSeek-R1 paper (January 2025), is a variant that dispenses with the separate critic model used in PPO. Instead of computing a per-step advantage with a value network, GRPO generates a group of candidate responses to a prompt, scores them with a reward function, and estimates the advantage from the relative scores within the group. This significantly reduces memory requirements compared to PPO and was central to DeepSeek-R1's training pipeline [14].
Reinforcement Learning with Verifiable Rewards (RLVR) uses rule-based or programmatic reward signals rather than a learned reward model. For math problems, the reward is 1 if the final answer matches the ground truth and 0 otherwise; for code, the reward is whether the code passes test cases. Because verifiable rewards are less prone to reward hacking, larger-scale RL training can be performed with less risk of training collapse. RLVR was central to DeepSeek-R1's training, improving AIME 2024 pass@1 from 15.6% to 71.0% [14].
More recent rounds of post-training add tool-use traces (function calling, code execution, web search), agentic behavior (multi-step planning), and reasoning chains generated by either a teacher model or a separate verifiable reward signal.
Reasoning models are LLMs specifically trained to spend more computation at inference time by generating extended chains of thought before producing a final answer. OpenAI's o1, released September 2024, was the first widely available example. o1 and its successors (o3, o4-mini) generate "thinking tokens" that are not shown to the user but allow the model to work through intermediate steps, backtrack when it detects errors, and approach problems more methodically.
Test-time compute scaling refers to the finding that, for reasoning-trained models, performance on hard problems improves with more inference-time computation, whether through longer reasoning chains or through sampling multiple solutions and choosing the best. This creates a second scaling axis beyond model parameters and training tokens: a smaller reasoning model given more compute budget at inference can match a larger model that generates answers directly.
DeepSeek-R1, released January 2025, demonstrated that GRPO combined with RLVR applied to the DeepSeek-V3 base model could produce reasoning capabilities matching OpenAI o1, as open-weights MIT-licensed models [14]. Qwen3 (April 2025) introduced hybrid thinking/non-thinking mode in the same model family, allowing users to toggle extended reasoning on or off per request [34]. Google's Gemini 2.5 Deep Think mode (August 2025) uses parallel thinking techniques, generating many candidate reasoning paths simultaneously before selecting the best answer [17].
Limitations of test-time scaling have also become clearer: extended reasoning does not reliably improve performance on knowledge-intensive tasks requiring factual accuracy, and models can reach a correct intermediate step and then deviate toward incorrect conclusions during prolonged reasoning chains.
Multimodal LLMs extend the standard text-only framework by accepting and often generating non-text modalities. Most frontier models as of 2025 are multimodal by default.
GPT-4V (November 2023) and GPT-4o (May 2024) accept images as part of the prompt, enabling tasks like chart interpretation, document understanding, and visual question answering. Gemini was designed from the start to be natively multimodal, trained jointly on text, images, audio, and video rather than adding vision as a bolt-on capability. Claude 3 (March 2024) added vision across all model tiers; Claude Opus 4.7 (2025-2026) features a 3x jump in image resolution, reaching 2,576px for professional-grade visual analysis.
Open-source vision-language models became highly capable through 2024-2025. LLaVA, InternVL, and Qwen-VL families achieved GPT-4V-level performance in open-weight form. Meta's Llama 4 Scout and Maverick (April 2025) are natively multimodal, jointly pretrained on text, image, and video tokens with 10 million token context windows [33].
GPT-4o extended the multimodal stack to native audio input and output, enabling near-real-time voice conversations. Gemini 1.5 Pro supports audio as a native input modality within its long-context window. Specialized audio models such as Whisper (OpenAI, 2022) handle speech-to-text transcription upstream of text-only models.
Gemini 1.5 Pro and 2.0 support video input directly within the context window, enabling temporal reasoning over hours of footage. Several open-source video-language models (LLaVA-Video, InternVideo) followed in 2024-2025.
This is a non-exhaustive list of LLMs that have shaped the field. Parameter counts, where reported, are total parameters; context windows are at standard pricing tier when applicable.
| Model | Provider | Released | Parameters | Context | License | Notes |
|---|---|---|---|---|---|---|
| BERT base/large | Oct 2018 | 110M / 340M | 512 | Apache 2.0 | Encoder-only, masked LM [7] | |
| GPT-2 | OpenAI | 2019 | 1.5B (largest) | 1024 | MIT (weights) | Staged release; full 1.5B weights released Nov 2019 [2] |
| T5 (11B) | Oct 2019 | 11B | 512 | Apache 2.0 | Text-to-text encoder-decoder [9] | |
| GPT-3 | OpenAI | May 2020 | 175B | 2048 | API only | Demonstrated in-context few-shot learning [8] |
| InstructGPT | OpenAI | Mar 2022 | 1.3B / 6B / 175B | 2048 | API only | First major RLHF deployment [10] |
| ChatGPT | OpenAI | Nov 2022 | not disclosed | 4096 (initial) | Product | Brought LLMs to general public |
| GPT-4 | OpenAI | Mar 2023 | not disclosed | 8K / 32K | API only | Multimodal vision, no published params [11] |
| Llama 2 | Meta | Jul 2023 | 7B / 13B / 70B | 4096 | Llama 2 Community | First weights-available chat-tuned Llama [29] |
| Mistral 7B | Mistral AI | Sep 2023 | 7.3B | 8192 | Apache 2.0 | Strong small dense model |
| Mixtral 8x7B | Mistral AI | Dec 2023 | 46.7B (12.9B active) | 32K | Apache 2.0 | Sparse MoE [20] |
| Gemini 1.0 | Google DeepMind | Dec 2023 | not disclosed | 32K | API only | Native multimodal training |
| GPT-4o | OpenAI | May 2024 | not disclosed | 128K | API only | Native text, audio, image I/O [12] |
| Llama 3.1 | Meta | Jul 2024 | 8B / 70B / 405B | 128K | Llama 3 Community | 405B trained on 15T+ tokens, 16K H100s [13] |
| Qwen 2.5 | Alibaba | Sep 2024 | 0.5B to 72B | up to 128K | Apache 2.0 (most) | Pretrained on 18T tokens [23] |
| Gemma 2 | Jun 2024 | 2B / 9B / 27B | 8192 | Gemma terms | Open-weight, distilled from Gemini | |
| DeepSeek-V3 | DeepSeek | Dec 2024 | 671B (37B active) | 128K | MIT (weights) | MoE, 14.8T tokens, low reported training cost [14] |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 671B (37B active) | 128K | MIT (weights) | RL-trained reasoning model on V3 base [14] |
| Llama 4 Scout | Meta | Apr 2025 | 109B (17B active) | 10M | Llama 4 Community | Natively multimodal MoE, 16 experts [33] |
| Llama 4 Maverick | Meta | Apr 2025 | 400B (17B active) | 10M | Llama 4 Community | 128 experts, natively multimodal MoE [33] |
| Qwen3 235B-A22B | Alibaba | Apr 2025 | 235B (22B active) | 131K | Apache 2.0 | Hybrid thinking/non-thinking, 36T tokens [34] |
| GPT-5 | OpenAI | 2025 | not disclosed | not disclosed | API only | 94.6% AIME 2025, 74.9% SWE-bench Verified [32] |
| Gemini 2.5 Pro | Google DeepMind | Mar 2025 | not disclosed | 1M | API only | Thinking model; Deep Think variant [17] |
| GPT-4.1 | OpenAI | Apr 2025 | not disclosed | 1M | API only | 54.6% on SWE-bench Verified [18] |
| Claude Opus 4 | Anthropic | May 2025 | not disclosed | 200K | API only | Released alongside Sonnet 4 [15] |
| Claude Sonnet 4 | Anthropic | May 2025 | not disclosed | 200K (1M beta) | API only | Long-context beta |
| DeepSeek-V3.1 | DeepSeek | Aug 2025 | 685B | 128K | MIT (weights) | Hybrid thinking/non-thinking mode |
The frontier LLM market is concentrated among a small number of well-funded organizations with access to large GPU clusters and proprietary training data.
OpenAI, founded in 2015 and based in San Francisco, released the GPT series and ChatGPT, which catalyzed mainstream adoption. OpenAI operates as a capped-profit company partially owned by Microsoft and is the operator of the ChatGPT product (over 400 million weekly users as of early 2025) and the OpenAI API. The o-series reasoning models (o1, o3, o4-mini) represent a separate product line optimized for test-time compute scaling. GPT-5.5 achieved 84.9% on the GDPval knowledge-work benchmark and led the ARC-AGI leaderboard with a score of 95.0%.
Anthropic, founded in 2021 by former OpenAI researchers including Dario Amodei and Daniela Amodei, focuses on AI safety research alongside model development. Its Claude model family uses Constitutional AI and RLAIF for alignment. Claude 3 Opus briefly held the top spot on multiple benchmarks when released in March 2024. Claude Opus 4.7 (2025-2026) scored 87.6% on SWE-bench Verified and leads on agentic coding benchmarks.
Google DeepMind, formed through the merger of Google Brain and DeepMind in 2023, trains the Gemini family. Gemini 2.5 Pro and its Deep Think variant support 1-million-token context windows and native multimodal inputs across text, images, audio, and video. Google distributes Gemini through the Gemini consumer product, Google Cloud Vertex AI, and the Gemini API. The open-weight Gemma family provides smaller models under permissive terms.
Meta AI open-sources its Llama family, making Meta the dominant provider of open-weight base models. Llama 3.1 405B set a high bar for open-weight performance in mid-2024; Llama 4 (April 2025) introduced native multimodality and mixture-of-experts architecture. Meta's strategic motivation for open-weighting its models is partly to prevent proprietary models from controlling AI infrastructure costs for Meta's own products.
xAI, founded by Elon Musk in 2023, trains the Grok series on xAI's Colossus supercluster. Grok 3 (February 2025) was trained with 10x the compute of previous xAI models and achieved 84.6% on GPQA Diamond [35]. Grok 4 (mid-2025) set all-time-high scores on GPQA Diamond (88%) and Humanity's Last Exam (24%), achieving an Artificial Analysis Intelligence Index of 73, ahead of competing frontier models at the time.
DeepSeek, a Chinese AI lab affiliated with the quantitative hedge fund High-Flyer, released DeepSeek-V3 and R1 in late 2024 and early 2025 under MIT licenses. These models matched frontier closed models at a reported fraction of the training cost, sparking substantial industry debate about AI economics. DeepSeek-V3.1 followed in August 2025 with hybrid thinking mode; DeepSeek-V3.2 later reportedly matched GPT-5 on several benchmarks.
Mistral AI, a French startup founded in 2023 by former Google DeepMind and Meta researchers, focuses on efficient open models. Its Mistral 7B and Mixtral 8x7B were widely adopted in the open-source community. Mistral's Codestral model targets code generation; Mistral Large targets enterprise use cases.
Alibaba's Qwen team produces the Qwen family, which covers sizes from 0.5B to over 235B parameters with strong multilingual coverage. Qwen3 supports 119 languages and was trained on 36 trillion tokens. Qwen 2.5-72B-Instruct was reported to compete with Llama 3.1 405B-Instruct, which has roughly five times its parameter count [23].
Generating text from an LLM is a token-by-token loop. At each step, the model produces a probability distribution over the vocabulary, a sampling rule picks one token, and the new token is appended to the prompt for the next step. The main sampling controls are:
| Parameter | Effect |
|---|---|
| Temperature | Sharpens (low) or flattens (high) the next-token distribution; 0 reduces to greedy decoding |
| Top-k | Restricts sampling to the k highest-probability tokens |
| Top-p (nucleus) | Restricts sampling to the smallest set of tokens whose cumulative probability exceeds p |
| Min-p | Drops tokens whose probability is below a fraction of the most likely token |
| Beam search | Maintains multiple candidate sequences and keeps the highest-scoring overall |
vLLM, released in 2023, introduced PagedAttention to manage the KV cache as pages of virtual memory, dramatically reducing memory fragmentation and enabling much higher throughput for batched inference. vLLM became the dominant open-source serving framework and supports speculative decoding, tensor parallelism, and most major model families [28].
SGLang, developed at UC Berkeley, uses RadixAttention to cache and reuse KV states across requests sharing a common prefix, achieving higher throughput on workloads with structured prompts and agentic patterns. Benchmarks show SGLang achieving approximately 29% higher throughput than vLLM on 7-8B models on H100 GPUs, with the gap narrowing to 3-5% on 70B+ models.
TensorRT-LLM (NVIDIA) and TGI (Hugging Face Text Generation Inference) round out the major serving options. TensorRT-LLM achieves the highest raw token throughput on NVIDIA hardware through custom CUDA kernels.
Speculative decoding uses a small draft model to propose multiple tokens that the large model verifies in parallel, giving 2-3x latency speedups without changing the output distribution [28]. This works well when the draft model's distribution closely matches the target model's, which holds for models of the same family at different sizes.
Quant formats (GPTQ, AWQ, GGUF) reduce model weights from 16-bit or 32-bit floats to 4-bit or even 2-bit integers, reducing memory requirements by 4-8x with modest quality loss. This makes models that would otherwise require multiple high-end GPUs runnable on consumer hardware or a single small GPU.
No single number captures LLM quality. The benchmark stack used in 2025-2026 includes:
| Benchmark | Domain | Notes |
|---|---|---|
| MMLU | 57 academic subjects, multiple choice | Frontier models exceed 88%; largely saturated [30] |
| GPQA Diamond | Expert biology, chemistry, physics (198 questions) | Non-expert PhDs score ~34%; top models exceed 85% |
| HumanEval | 164 Python coding problems, unit-tested | Top models exceed 90% pass@1 |
| SWE-bench Verified | Real GitHub issues, patch must pass project tests | Gold standard for agentic coding; GPT-5 hit 74.9% [32] |
| GSM8K | Grade-school math word problems | Near-saturated; top models exceed 95% |
| MATH | Competition-level math | Harder than GSM8K; still discriminating |
| AIME 2025 | US math olympiad problems | GPT-5 achieved 94.6% without tools [32] |
| ARC-AGI | Abstract visual grid reasoning | Tests general intelligence; GPT-5.5 scored 95.0% [38] |
| Humanity's Last Exam (HLE) | 2,500 expert questions across 100+ subjects | Early 2025 models scored under 22%; Grok 4 reached 24% [39] |
| FrontierMath | Research-level mathematics | GPT-5.2 Thinking solved 40.3% on tiers 1-3 |
| BIG-Bench Hard | Reasoning and knowledge tasks | Broad collection for probing model capability |
| TruthfulQA / HaluEval | Hallucination and truthfulness | Adversarial truthfulness evaluation |
Benchmark saturation is a chronic problem. MMLU reached near-ceiling scores by 2024. The reaction has been to introduce harder benchmarks (GPQA Diamond, FrontierMath, Humanity's Last Exam) and to lean on agentic, real-world evaluations like SWE-bench Verified that are harder to game with narrow optimization.
GitHub Copilot, Cursor, Claude Code, and Windsurf pair LLMs with editor integration and tool use to assist with code completion, test generation, debugging, and code review. The SWE-bench trajectory from under 5% in 2023 to 74.9% in 2025 illustrates how rapidly agentic coding capability has improved. Autonomous coding agents can now tackle multi-file refactors and resolve real-world GitHub issues without step-by-step human guidance.
LLMs are used in contract analysis, legal research, financial modeling, customer support, and document summarization across industries. Retrieval-augmented generation (RAG) systems ground LLM responses in up-to-date enterprise knowledge bases by retrieving relevant documents at query time and appending them to the prompt, reducing hallucination and knowledge-cutoff problems [Lewis et al., 2020]. RAG systems convert documents into vector embeddings stored in a vector database, retrieve the most relevant chunks at query time, and feed them alongside the user's question into the LLM context.
LLMs accelerate literature review, hypothesis generation, and structured data extraction from research papers. Specialized fine-tuned models target protein structure prediction, drug discovery, genomics, and clinical note summarization.
ChatGPT, Claude.ai, Google Gemini, and Microsoft Copilot serve hundreds of millions of users as general-purpose assistants for writing, research, learning, and productivity. Voice interfaces powered by native audio-capable models have extended the modality beyond text.
Agentic systems loop an LLM with planning, memory, and tool calls (browsers, shells, code interpreters). Frameworks such as LangChain and LlamaIndex automate retrieval and tool orchestration. Multi-agent systems assign different roles to different model instances running in parallel or sequence, enabling decomposition of complex tasks.
The LLM market in 2025-2026 splits into two main camps. Closed-weights labs (OpenAI, Anthropic, Google DeepMind for the Gemini frontier tier) ship via API and reveal little about parameter counts, training data, or training compute. Open-weights labs (Meta with Llama, Mistral, DeepSeek, Alibaba with Qwen, Google with Gemma, UAE's Technology Innovation Institute with Falcon) publish weights under licenses that range from permissive (Apache 2.0 for Mistral, Qwen, Gemma in many cases) to bespoke and restrictive (Llama Community License, Gemma terms).
DeepSeek-V3 and R1 were a turning point: the first time a freely downloadable open-weights model from outside the United States matched the reasoning quality of frontier closed-weights models on widely cited benchmarks, while reportedly using a much smaller training budget [14]. This intensified an already active debate about whether open weights are a safety risk (because alignment training can be undone with cheap fine-tuning) or a safety asset (because the wider research community can study and patch the models).
Llama 2 was widely described as "open source" by Meta but the Llama 2 Community License imposes redistribution and use limits. The Open Source Initiative has argued that the term is misleading; the more accurate label is "open weights" [29].
| Family | Provider | Latest sizes | Notes |
|---|---|---|---|
| Llama 3.1 / 4 | Meta | 8B, 70B, 405B; Scout 109B, Maverick 400B | Most-downloaded open-weights base; Llama Community License [13][33] |
| Mistral / Mixtral | Mistral AI | 7B dense, 8x7B and 8x22B MoE | Apache 2.0 for most variants [20] |
| Qwen 2.5 / 3 | Alibaba | 0.5B to 235B | Apache 2.0; Qwen3 trained on 36T tokens [34] |
| DeepSeek V3 / R1 / V3.1 | DeepSeek | 671B-685B MoE | MIT-licensed weights; reasoning-trained variant [14] |
| Gemma 2 | 2B, 9B, 27B | Distilled from Gemini; Gemma terms | |
| Falcon | Technology Innovation Institute | 7B, 40B, 180B | Trained on RefinedWeb [22] |
LLMs do not understand text in the way a human reader does. They are statistical models, and several failure modes follow from that.
Hallucination, the production of confident but false statements, is intrinsic to probabilistic generation. The model is rewarded for producing plausible-sounding text, not for refusing to answer when uncertain, so it will fabricate citations, invent code that calls non-existent functions, and confidently give wrong answers in long-tail domains. RAG, tool use, and citation training reduce the rate but do not eliminate it.
Long-context degradation is the gap between the advertised context window and the model's actual ability to use information deep inside it. Even with 1-million-token windows, recall drops for content in the middle of the context, and reasoning over information scattered across long documents is harder than tasks that fit in a few thousand tokens.
Bias and toxicity inherited from the training data show up in outputs. Models can refuse requests on the basis of demographic cues, generate stereotyped descriptions, or produce harmful content under adversarial prompting. Safety training reduces some of these but trades off against helpfulness on sensitive topics.
Knowledge cutoffs are intrinsic. A model trained through, say, late 2024 knows nothing about events after that date except through retrieval or tools. This is why almost all chat products now ship with web search.
Cost and energy are nontrivial. Training a frontier model requires tens of thousands of high-end GPUs running for weeks. Llama 3.1 405B used more than 16,000 H100 GPUs [13]. Inference at scale is itself a major datacenter workload, which is why providers invest heavily in quantization, KV-cache reuse, and speculative decoding.
Test-time compute scaling has its own limitations: extended reasoning chains do not reliably improve performance on knowledge-intensive tasks, and models can reach a correct intermediate step and then deviate toward an incorrect conclusion during prolonged reasoning.
The security literature treats LLMs as a system component with its own threat model. The OWASP 2025 list ranks prompt injection as the top vulnerability for LLM-integrated applications [31]. Three related but distinct concerns:
Defenses combine input filtering, separate trust levels for system, developer, and user content, output checks, and defense-in-depth rather than reliance on the model's own safety training.
Alignment research asks whether the stated goal of producing helpful, harmless, and honest outputs can be durably encoded into model weights. Anthropic's Constitutional AI (2022) and Scalable Oversight (2022) are two published frameworks for doing this at scale without requiring human labeling of every output. OpenAI's Preparedness Framework and Anthropic's Responsible Scaling Policy describe commitments to evaluate models at capability thresholds before deployment.
The question of whether open-weight release increases risk remains actively debated. Proponents argue that published weights allow independent safety audits and community-driven patches; critics argue that any alignment training can be undone with modest fine-tuning budgets, making open weights net-negative for safety.
Frontier-model training is expensive but the absolute numbers are usually closely held. Public anchors:
Inference economics shifted just as dramatically. GPT-4 launched in 2023 at $0.03 per 1,000 input tokens. By GPT-4.1 in 2025, the same provider was offering models with eight times the context window at lower per-token prices [18]. Open-weights models running on commodity hardware pushed marginal inference cost down to near zero for many use cases.
LLMs are one face of a broader category called foundation models, which also includes vision-language models, code models, and protein models. They are the language backbone behind:
The research community uses LLMs as a substrate for nearly every applied NLP problem, from clinical note summarization to legal-document review.
Several trajectories appear likely to shape the next few years:
Test-time compute scaling is developing into a second axis alongside training-time scaling. The o-series and DeepSeek-R1 demonstrated that RL-trained reasoning models can solve problems that overwhelm direct-answer models of the same size. Improving the reliability of long reasoning chains and extending test-time scaling to knowledge-heavy domains are active research areas.
Longer and more reliable context continues to be a focus. Context windows have grown from 2K tokens in GPT-3 to 1-10 million tokens in 2025 frontier models. The practical bottleneck has shifted from window size to the model's ability to use context reliably.
Native multimodality is becoming standard. The transition from add-on vision to natively joint-trained multimodal models (Gemini, GPT-4o, Llama 4) is largely complete at the frontier. Video understanding and audio generation are the next modalities receiving heavy investment.
Agent reliability is the open problem for commercial deployment. LLMs can generate plausible multi-step plans but still make errors in long agentic loops. Reducing error rates in tool use, code execution, and long-horizon planning is central to converting LLMs from chat assistants into autonomous workers.
Model efficiency is advancing on multiple fronts: MoE architectures reduce active parameter counts, quantization reduces memory, speculative decoding reduces latency, and distillation allows smaller models to approach larger-model quality. These trends are pushing capable models further down the hardware cost curve.