LLMs

Large language models (LLMs) are a class of neural network trained on very large bodies of text, typically with billions to trillions of parameters, that learn to predict tokens in a sequence and can be steered to follow instructions, write code, summarize documents, translate, reason about images, and call external tools. The category is fuzzy, since there is no formal parameter threshold that makes a model "large," but in practice the term is used for transformer-based models trained with self-supervised objectives on web-scale text and then post-trained for chat or task use. Wikipedia's working definition is simply "a neural network trained on a vast amount of text for natural language processing tasks, especially language generation" ^[1].

Modern LLMs sit at the center of generative AI products such as ChatGPT, Claude, Gemini, Microsoft Copilot, and Meta AI. They are also the substrate for the open-weight ecosystem around Llama, Mistral, Qwen, DeepSeek, and Gemma.

What counts as an LLM

The "large" in LLM has shifted with hardware. GPT-2's 1.5 billion-parameter model was treated as too dangerous to release in early 2019; by 2025, models with several hundred billion total parameters were running in commercial chat products ^[2]^[3]. Three properties are usually present:

a transformer (or close variant) backbone with attention as the dominant mixing operator,
self-supervised pretraining on a corpus large enough that the model never sees the same example twice,
a separate post-training stage that turns the raw next-token predictor into something usable, typically supervised fine-tuning followed by preference optimization.

Models below roughly 1 billion parameters are sometimes called "small" language models, but the boundary is informal.

History

Statistical foundations and early neural models

Language modeling predates deep learning. Statistical n-gram models from the 1990s and 2000s estimated the probability of the next word from counts of short sequences in a fixed corpus, and were the workhorse of speech recognition and machine translation for decades. By 2001, smoothed n-gram models trained on roughly 300 million words held the state of the art in perplexity ^[1].

The shift to learned distributed representations began with neural probabilistic language models (Bengio et al., 2003) and accelerated with word embeddings. Word2vec, published by Tomas Mikolov and colleagues at Google in 2013, made dense word vectors cheap to train and showed that arithmetic on those vectors captured surprising semantic structure, including the famous king minus man plus woman example ^[4]. GloVe followed in 2014 with a co-occurrence-based formulation ^[5]. ELMo (2018) extended this to contextual embeddings using bidirectional LSTMs.

The transformer era (2017-2019)

The modern era began with the transformer paper, "Attention Is All You Need" (Vaswani et al., NeurIPS 2017), which dropped recurrence in favor of multi-head self-attention and dramatically improved parallelism on GPUs and TPUs ^[6]. Two complementary directions then split off:

Encoder-only masked-language models such as BERT (Devlin et al., 2018), trained to predict masked tokens given both left and right context, dominated discriminative NLP benchmarks like GLUE and SuperGLUE ^[7].
Decoder-only autoregressive models such as GPT-1 (2018), GPT-2 (1.5B parameters, full release November 2019), and GPT-3 (175B parameters, May 2020) scaled the next-token-prediction objective and discovered in-context learning, where a model picks up new tasks from a few examples in its prompt without gradient updates ^[3]^[8].

T5 (Raffel et al., 2019) explored encoder-decoder transformers in the "text-to-text" framing, training the same architecture on translation, summarization, and classification by recasting all tasks as sequence-to-sequence problems; the largest checkpoint had 11B parameters ^[9].

RLHF and the chat product era (2022-2023)

The transition from raw language model to chat product happened with InstructGPT (Ouyang et al., March 2022), which combined supervised fine-tuning with reinforcement learning from human feedback (RLHF). Human labelers preferred outputs from a 1.3B-parameter InstructGPT model over the 175B GPT-3 base model, despite a 100x parameter gap ^[10]. ChatGPT, released by OpenAI on November 30, 2022, applied this recipe at scale and brought LLMs to a general audience in a way that earlier demos had not. The service reached 100 million users in two months, faster than any prior consumer application.

GPT-4 followed on March 14, 2023, with improved reasoning and a multimodal vision capability; OpenAI did not publish parameter counts or training compute ^[11]. Anthropic launched Claude in March 2023 and Claude 2 in July 2023, emphasizing a "Constitutional AI" approach to safety. Google launched Bard in March 2023 and later rebranded it as Gemini, backed by the Gemini family of natively multimodal models. Meta released Llama 2 in July 2023 as an open-weights model for research and commercial use.

Scale, open weights, and reasoning (2024-2026)

2024 and 2025 were the years of multimodal-by-default models, longer context windows, and reasoning-trained variants. GPT-4o launched May 13, 2024 with native text, image, and audio input/output and audio response times around 320 milliseconds ^[12]. Llama 3.1, including a 405B-parameter version trained on more than 15 trillion tokens with a 128K context window, shipped July 23, 2024 ^[13]. DeepSeek-V3 (December 2024) and DeepSeek-R1 (January 2025) introduced a 671B-parameter mixture-of-experts model trained on 14.8 trillion tokens that matched frontier closed models on reasoning benchmarks at a fraction of the reported training cost ^[14].

Meta released Llama 4 on April 5, 2025, with the Scout and Maverick models natively multimodal and trained on more than 30 trillion tokens using a mixture-of-experts architecture ^[33]. Qwen3 from Alibaba, released April 28, 2025, was trained on 36 trillion tokens and introduced hybrid thinking/non-thinking modes across a family of dense and MoE models ^[34].

Anthropicf released Claude Opus 4 and Sonnet 4 on May 22, 2025 ^[15]. Google's Gemini 2.5 Pro, released March 20, 2025, shipped a 1-million-token context window and a "thinking" reasoning mode, with a Deep Think variant rolled out in August 2025 using parallel thinking techniques ^[17]. OpenAI released GPT-5 in 2025, achieving 94.6% on AIME 2025 without tools and 74.9% on SWE-bench Verified for agentic coding ^[32]. GPT-4.1 followed on April 14, 2025 with a 1-million-token context window and large coding-benchmark gains over GPT-4o ^[18].

Architecture

Transformer backbone

Nearly every production LLM as of 2026 is a transformer. The core unit is the self-attention layer: each token is projected to a query, key, and value vector, attention weights are computed by a softmax over query-key dot products, and the output is a weighted sum of value vectors. Stacking dozens to hundreds of these layers, interleaved with feed-forward networks and normalization, gives the model the capacity to mix information across long token spans ^[6].

Three architectural families coexist:

Family	Pretraining objective	Typical use	Examples
Encoder-only	Masked-language modeling, next-sentence prediction	Classification, retrieval, embeddings	BERT, RoBERTa, DeBERTa
Decoder-only	Causal next-token prediction	Generation, chat, agents	GPT-3, Llama, Claude, Gemini, Mistral
Encoder-decoder	Span corruption (T5) or denoising	Translation, summarization, instruction-following	T5, BART, Flan-T5

Decoder-only autoregressive transformers became the default for general-purpose chat models because next-token prediction works for any task that can be written as text and because the same model serves both prompt encoding and generation.

Positional encoding

The original transformer used fixed sinusoidal positional encodings ^[6]. Modern models almost always use learned alternatives. Rotary Position Embedding (RoPE), introduced by Su et al. in RoFormer (2021), encodes position by rotating query and key vectors, preserves relative position under shifts, and extrapolates more gracefully than absolute encodings; it is used in Llama, GPT-NeoX, and most newer open models ^[19]. ALiBi (Press et al., 2022) instead biases attention scores by a linear function of token distance and continues to work well past the training context length.

Sparse mixture-of-experts (MoE)

Mixture of Experts (MoE) routes each token through a small subset of expert feed-forward networks rather than running the full network on every token. Mistral AI's Mixtral 8x7B, released December 11, 2023, has 46.7 billion total parameters but uses only about 12.9 billion per token, giving it the inference cost of a much smaller dense model while matching or beating Llama 2 70B on many benchmarks ^[20].

DeepSeek-V3 pushed this further: 671 billion total parameters, 37 billion active per token, 256 routed experts plus a shared expert per layer, with auxiliary-loss-free load balancing ^[14]. Llama 4 Scout uses 16 experts with 17B active parameters out of 109B total; Llama 4 Maverick uses 128 experts with 17B active parameters out of 400B total, both natively multimodal ^[33].

State space models and hybrid architectures

Mamba, introduced by Gu and Dao (December 2023), uses selective state space models rather than attention as the core sequence-mixing operation ^[36]. Mamba scales linearly with sequence length in both computation and memory, compared to the quadratic cost of standard attention, making it attractive for very long sequences.

Hybrid architectures that interleave Mamba layers with attention layers have shown that combining the two can outperform either alone. AI21 Labs' Jamba family interleaves Mamba and attention layers and achieved production deployment, with Jamba 1.5 scaling to 398B total parameters (94B active) using 16 MoE experts ^[37]. As of 2025, pure Mamba models have not displaced transformers in frontier chat products, but hybrid research remains active.

Training pipeline

A frontier LLM is built in stages. The terminology varies between labs, but the structure is fairly stable.

Pretraining

The model is trained with self-supervised next-token prediction on a corpus of web text, books, code, scientific papers, and increasingly synthetic data. The standard public source is Common Crawl, a non-profit web archive that has been crawling the web since 2007 and releases monthly snapshots of 200 to 400 TiB ^[21]. Derivative datasets clean and deduplicate it. RefinedWeb (2023) produced 5 trillion English tokens and was used to train Falcon. FineWeb (2024) is a 15-trillion-token dataset distilled from 96 Common Crawl snapshots ^[22]. Llama 3.1 was trained on more than 15 trillion tokens; Qwen 2.5 used 18 trillion; Qwen3 used 36 trillion ^[13]^[23]^[34].

The key hyperparameters at this stage are model size (parameter count), dataset size (tokens), and compute budget. The relationships between these were formalized as scaling laws.

Scaling laws

Kaplan et al. (OpenAI, January 2020) showed that test loss scales as a power law in model size, dataset size, and compute, with the power-law trend holding over more than seven orders of magnitude in compute. Their conclusion was that, given a fixed compute budget, you should spend most of it on a larger model and undertrain it on relatively few tokens ^[24].

The DeepMind Chinchilla paper (Hoffmann et al., March 2022) revisited this by training more than 400 models from 70 million to 16 billion parameters on between 5 and 500 billion tokens. They found that for compute-optimal training, model size and training tokens should grow at the same rate: roughly 20 tokens per parameter, not the much smaller ratios used by GPT-3 and similar models. They tested this by training Chinchilla, a 70B model on 1.4 trillion tokens, with the same compute budget as the 280B Gopher; Chinchilla beat Gopher, GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a wide range of downstream tasks ^[25].

The practical effect was that post-2022 models got smaller and trained on more data. Llama 2 70B was trained on 2 trillion tokens; Llama 3.1 8B on more than 15 trillion. The cost-optimal frontier moved toward more data per parameter, then later toward investing more in inference compute, a shift sometimes called the test-time scaling regime.

Post-training: SFT, RLHF, DPO, GRPO, RLVR

A freshly pretrained model is a competent next-token predictor but not a useful assistant. Post-training converts it into one, in roughly this order:

Supervised fine-tuning (SFT) on curated demonstrations of how an assistant should respond. This is also called instruction tuning when the demonstrations follow an instruction-response format.
Preference alignment, where a separate reward model learns to score responses based on human-labeled comparisons, and the policy is updated against that reward. The classic recipe is RLHF with proximal policy optimization (PPO), as described in InstructGPT ^[10].
Optional safety-specific training, often using a list of written principles. Anthropic's Constitutional AI (Bai et al., December 2022) replaces most of the human harm-labeling step with model-generated critiques and revisions guided by a written constitution, and uses Reinforcement Learning from AI Feedback (RLAIF) to update the model ^[26].

Direct Preference Optimization (DPO), introduced by Rafailov et al. at NeurIPS 2023, replaced the reward-model-plus-PPO pipeline with a single supervised classification loss on preference pairs. The trick is that the optimal RLHF policy can be written in closed form as a function of the reward, so the reward model implicit in the policy can be optimized directly. DPO matches or beats PPO-based RLHF on summarization and dialogue tasks while being simpler to implement, and is now the default in many open-source post-training stacks ^[27].

Group Relative Policy Optimization (GRPO), introduced in the DeepSeek-R1 paper (January 2025), is a variant that dispenses with the separate critic model used in PPO. Instead of computing a per-step advantage with a value network, GRPO generates a group of candidate responses to a prompt, scores them with a reward function, and estimates the advantage from the relative scores within the group. This significantly reduces memory requirements compared to PPO and was central to DeepSeek-R1's training pipeline ^[14].

Reinforcement Learning with Verifiable Rewards (RLVR) uses rule-based or programmatic reward signals rather than a learned reward model. For math problems, the reward is 1 if the final answer matches the ground truth and 0 otherwise; for code, the reward is whether the code passes test cases. Because verifiable rewards are less prone to reward hacking, larger-scale RL training can be performed with less risk of training collapse. RLVR was central to DeepSeek-R1's training, improving AIME 2024 pass@1 from 15.6% to 71.0% ^[14].

More recent rounds of post-training add tool-use traces (function calling, code execution, web search), agentic behavior (multi-step planning), and reasoning chains generated by either a teacher model or a separate verifiable reward signal.

Reasoning models and test-time compute

Reasoning models are LLMs specifically trained to spend more computation at inference time by generating extended chains of thought before producing a final answer. OpenAI's o1, released September 2024, was the first widely available example. o1 and its successors (o3, o4-mini) generate "thinking tokens" that are not shown to the user but allow the model to work through intermediate steps, backtrack when it detects errors, and approach problems more methodically.

Test-time compute scaling refers to the finding that, for reasoning-trained models, performance on hard problems improves with more inference-time computation, whether through longer reasoning chains or through sampling multiple solutions and choosing the best. This creates a second scaling axis beyond model parameters and training tokens: a smaller reasoning model given more compute budget at inference can match a larger model that generates answers directly.

DeepSeek-R1, released January 2025, demonstrated that GRPO combined with RLVR applied to the DeepSeek-V3 base model could produce reasoning capabilities matching OpenAI o1, as open-weights MIT-licensed models ^[14]. Qwen3 (April 2025) introduced hybrid thinking/non-thinking mode in the same model family, allowing users to toggle extended reasoning on or off per request ^[34]. Google's Gemini 2.5 Deep Think mode (August 2025) uses parallel thinking techniques, generating many candidate reasoning paths simultaneously before selecting the best answer ^[17].

Limitations of test-time scaling have also become clearer: extended reasoning does not reliably improve performance on knowledge-intensive tasks requiring factual accuracy, and models can reach a correct intermediate step and then deviate toward incorrect conclusions during prolonged reasoning chains.

Multimodal variants

Multimodal LLMs extend the standard text-only framework by accepting and often generating non-text modalities. Most frontier models as of 2025 are multimodal by default.

Vision-language models

GPT-4V (November 2023) and GPT-4o (May 2024) accept images as part of the prompt, enabling tasks like chart interpretation, document understanding, and visual question answering. Gemini was designed from the start to be natively multimodal, trained jointly on text, images, audio, and video rather than adding vision as a bolt-on capability. Claude 3 (March 2024) added vision across all model tiers; Claude Opus 4.7 (2025-2026) features a 3x jump in image resolution, reaching 2,576px for professional-grade visual analysis.

Open-source vision-language models became highly capable through 2024-2025. LLaVA, InternVL, and Qwen-VL families achieved GPT-4V-level performance in open-weight form. Meta's Llama 4 Scout and Maverick (April 2025) are natively multimodal, jointly pretrained on text, image, and video tokens with 10 million token context windows ^[33].

Audio and speech

GPT-4o extended the multimodal stack to native audio input and output, enabling near-real-time voice conversations. Gemini 1.5 Pro supports audio as a native input modality within its long-context window. Specialized audio models such as Whisper (OpenAI, 2022) handle speech-to-text transcription upstream of text-only models.

Video understanding

Gemini 1.5 Pro and 2.0 support video input directly within the context window, enabling temporal reasoning over hours of footage. Several open-source video-language models (LLaVA-Video, InternVideo) followed in 2024-2025.

Major models

This is a non-exhaustive list of LLMs that have shaped the field. Parameter counts, where reported, are total parameters; context windows are at standard pricing tier when applicable.

Model	Provider	Released	Parameters	Context	License	Notes
BERT base/large	Google	Oct 2018	110M / 340M	512	Apache 2.0	Encoder-only, masked LM ^[7]
GPT-2	OpenAI	2019	1.5B (largest)	1024	MIT (weights)	Staged release; full 1.5B weights released Nov 2019 ^[2]
T5 (11B)	Google	Oct 2019	11B	512	Apache 2.0	Text-to-text encoder-decoder ^[9]
GPT-3	OpenAI	May 2020	175B	2048	API only	Demonstrated in-context few-shot learning ^[8]
InstructGPT	OpenAI	Mar 2022	1.3B / 6B / 175B	2048	API only	First major RLHF deployment ^[10]
ChatGPT	OpenAI	Nov 2022	not disclosed	4096 (initial)	Product	Brought LLMs to general public
GPT-4	OpenAI	Mar 2023	not disclosed	8K / 32K	API only	Multimodal vision, no published params ^[11]
Llama 2	Meta	Jul 2023	7B / 13B / 70B	4096	Llama 2 Community	First weights-available chat-tuned Llama ^[29]
Mistral 7B	Mistral AI	Sep 2023	7.3B	8192	Apache 2.0	Strong small dense model
Mixtral 8x7B	Mistral AI	Dec 2023	46.7B (12.9B active)	32K	Apache 2.0	Sparse MoE ^[20]
Gemini 1.0	Google DeepMind	Dec 2023	not disclosed	32K	API only	Native multimodal training
GPT-4o	OpenAI	May 2024	not disclosed	128K	API only	Native text, audio, image I/O ^[12]
Llama 3.1	Meta	Jul 2024	8B / 70B / 405B	128K	Llama 3 Community	405B trained on 15T+ tokens, 16K H100s ^[13]
Qwen 2.5	Alibaba	Sep 2024	0.5B to 72B	up to 128K	Apache 2.0 (most)	Pretrained on 18T tokens ^[23]
Gemma 2	Google	Jun 2024	2B / 9B / 27B	8192	Gemma terms	Open-weight, distilled from Gemini
DeepSeek-V3	DeepSeek	Dec 2024	671B (37B active)	128K	MIT (weights)	MoE, 14.8T tokens, low reported training cost ^[14]
DeepSeek-R1	DeepSeek	Jan 2025	671B (37B active)	128K	MIT (weights)	RL-trained reasoning model on V3 base ^[14]
Llama 4 Scout	Meta	Apr 2025	109B (17B active)	10M	Llama 4 Community	Natively multimodal MoE, 16 experts ^[33]
Llama 4 Maverick	Meta	Apr 2025	400B (17B active)	10M	Llama 4 Community	128 experts, natively multimodal MoE ^[33]
Qwen3 235B-A22B	Alibaba	Apr 2025	235B (22B active)	131K	Apache 2.0	Hybrid thinking/non-thinking, 36T tokens ^[34]
GPT-5	OpenAI	2025	not disclosed	not disclosed	API only	94.6% AIME 2025, 74.9% SWE-bench Verified ^[32]
Gemini 2.5 Pro	Google DeepMind	Mar 2025	not disclosed	1M	API only	Thinking model; Deep Think variant ^[17]
GPT-4.1	OpenAI	Apr 2025	not disclosed	1M	API only	54.6% on SWE-bench Verified ^[18]
Claude Opus 4	Anthropic	May 2025	not disclosed	200K	API only	Released alongside Sonnet 4 ^[15]
Claude Sonnet 4	Anthropic	May 2025	not disclosed	200K (1M beta)	API only	Long-context beta
DeepSeek-V3.1	DeepSeek	Aug 2025	685B	128K	MIT (weights)	Hybrid thinking/non-thinking mode

Frontier labs

The frontier LLM market is concentrated among a small number of well-funded organizations with access to large GPU clusters and proprietary training data.

OpenAI

OpenAI, founded in 2015 and based in San Francisco, released the GPT series and ChatGPT, which catalyzed mainstream adoption. OpenAI operates as a capped-profit company partially owned by Microsoft and is the operator of the ChatGPT product (over 400 million weekly users as of early 2025) and the OpenAI API. The o-series reasoning models (o1, o3, o4-mini) represent a separate product line optimized for test-time compute scaling. GPT-5.5 achieved 84.9% on the GDPval knowledge-work benchmark and led the ARC-AGI leaderboard with a score of 95.0%.

Anthropic

Anthropic, founded in 2021 by former OpenAI researchers including Dario Amodei and Daniela Amodei, focuses on AI safety research alongside model development. Its Claude model family uses Constitutional AI and RLAIF for alignment. Claude 3 Opus briefly held the top spot on multiple benchmarks when released in March 2024. Claude Opus 4.7 (2025-2026) scored 87.6% on SWE-bench Verified and leads on agentic coding benchmarks.

Google DeepMind

Google DeepMind, formed through the merger of Google Brain and DeepMind in 2023, trains the Gemini family. Gemini 2.5 Pro and its Deep Think variant support 1-million-token context windows and native multimodal inputs across text, images, audio, and video. Google distributes Gemini through the Gemini consumer product, Google Cloud Vertex AI, and the Gemini API. The open-weight Gemma family provides smaller models under permissive terms.

Meta AI

Meta AI open-sources its Llama family, making Meta the dominant provider of open-weight base models. Llama 3.1 405B set a high bar for open-weight performance in mid-2024; Llama 4 (April 2025) introduced native multimodality and mixture-of-experts architecture. Meta's strategic motivation for open-weighting its models is partly to prevent proprietary models from controlling AI infrastructure costs for Meta's own products.

xAI

xAI, founded by Elon Musk in 2023, trains the Grok series on xAI's Colossus supercluster. Grok 3 (February 2025) was trained with 10x the compute of previous xAI models and achieved 84.6% on GPQA Diamond ^[35]. Grok 4 (mid-2025) set all-time-high scores on GPQA Diamond (88%) and Humanity's Last Exam (24%), achieving an Artificial Analysis Intelligence Index of 73, ahead of competing frontier models at the time.

DeepSeek

DeepSeek, a Chinese AI lab affiliated with the quantitative hedge fund High-Flyer, released DeepSeek-V3 and R1 in late 2024 and early 2025 under MIT licenses. These models matched frontier closed models at a reported fraction of the training cost, sparking substantial industry debate about AI economics. DeepSeek-V3.1 followed in August 2025 with hybrid thinking mode; DeepSeek-V3.2 later reportedly matched GPT-5 on several benchmarks.

Mistral AI

Mistral AI, a French startup founded in 2023 by former Google DeepMind and Meta researchers, focuses on efficient open models. Its Mistral 7B and Mixtral 8x7B were widely adopted in the open-source community. Mistral's Codestral model targets code generation; Mistral Large targets enterprise use cases.

Alibaba (Qwen team)

Alibaba's Qwen team produces the Qwen family, which covers sizes from 0.5B to over 235B parameters with strong multilingual coverage. Qwen3 supports 119 languages and was trained on 36 trillion tokens. Qwen 2.5-72B-Instruct was reported to compete with Llama 3.1 405B-Instruct, which has roughly five times its parameter count ^[23].

Inference techniques

Generating text from an LLM is a token-by-token loop. At each step, the model produces a probability distribution over the vocabulary, a sampling rule picks one token, and the new token is appended to the prompt for the next step. The main sampling controls are:

Parameter	Effect
Temperature	Sharpens (low) or flattens (high) the next-token distribution; 0 reduces to greedy decoding
Top-k	Restricts sampling to the k highest-probability tokens
Top-p (nucleus)	Restricts sampling to the smallest set of tokens whose cumulative probability exceeds p
Min-p	Drops tokens whose probability is below a fraction of the most likely token
Beam search	Maintains multiple candidate sequences and keeps the highest-scoring overall

Serving frameworks

vLLM, released in 2023, introduced PagedAttention to manage the KV cache as pages of virtual memory, dramatically reducing memory fragmentation and enabling much higher throughput for batched inference. vLLM became the dominant open-source serving framework and supports speculative decoding, tensor parallelism, and most major model families ^[28].

SGLang, developed at UC Berkeley, uses RadixAttention to cache and reuse KV states across requests sharing a common prefix, achieving higher throughput on workloads with structured prompts and agentic patterns. Benchmarks show SGLang achieving approximately 29% higher throughput than vLLM on 7-8B models on H100 GPUs, with the gap narrowing to 3-5% on 70B+ models.

TensorRT-LLM (NVIDIA) and TGI (Hugging Face Text Generation Inference) round out the major serving options. TensorRT-LLM achieves the highest raw token throughput on NVIDIA hardware through custom CUDA kernels.

Speculative decoding

Speculative decoding uses a small draft model to propose multiple tokens that the large model verifies in parallel, giving 2-3x latency speedups without changing the output distribution ^[28]. This works well when the draft model's distribution closely matches the target model's, which holds for models of the same family at different sizes.

Quantization

Quant formats (GPTQ, AWQ, GGUF) reduce model weights from 16-bit or 32-bit floats to 4-bit or even 2-bit integers, reducing memory requirements by 4-8x with modest quality loss. This makes models that would otherwise require multiple high-end GPUs runnable on consumer hardware or a single small GPU.

Evaluation benchmarks

No single number captures LLM quality. The benchmark stack used in 2025-2026 includes:

Benchmark	Domain	Notes
MMLU	57 academic subjects, multiple choice	Frontier models exceed 88%; largely saturated ^[30]
GPQA Diamond	Expert biology, chemistry, physics (198 questions)	Non-expert PhDs score ~34%; top models exceed 85%
HumanEval	164 Python coding problems, unit-tested	Top models exceed 90% pass@1
SWE-bench Verified	Real GitHub issues, patch must pass project tests	Gold standard for agentic coding; GPT-5 hit 74.9% ^[32]
GSM8K	Grade-school math word problems	Near-saturated; top models exceed 95%
MATH	Competition-level math	Harder than GSM8K; still discriminating
AIME 2025	US math olympiad problems	GPT-5 achieved 94.6% without tools ^[32]
ARC-AGI	Abstract visual grid reasoning	Tests general intelligence; GPT-5.5 scored 95.0% ^[38]
Humanity's Last Exam (HLE)	2,500 expert questions across 100+ subjects	Early 2025 models scored under 22%; Grok 4 reached 24% ^[39]
FrontierMath	Research-level mathematics	GPT-5.2 Thinking solved 40.3% on tiers 1-3
BIG-Bench Hard	Reasoning and knowledge tasks	Broad collection for probing model capability
TruthfulQA / HaluEval	Hallucination and truthfulness	Adversarial truthfulness evaluation

Benchmark saturation is a chronic problem. MMLU reached near-ceiling scores by 2024. The reaction has been to introduce harder benchmarks (GPQA Diamond, FrontierMath, Humanity's Last Exam) and to lean on agentic, real-world evaluations like SWE-bench Verified that are harder to game with narrow optimization.

Use cases

Coding and software development

GitHub Copilot, Cursor, Claude Code, and Windsurf pair LLMs with editor integration and tool use to assist with code completion, test generation, debugging, and code review. The SWE-bench trajectory from under 5% in 2023 to 74.9% in 2025 illustrates how rapidly agentic coding capability has improved. Autonomous coding agents can now tackle multi-file refactors and resolve real-world GitHub issues without step-by-step human guidance.

Enterprise and professional applications

LLMs are used in contract analysis, legal research, financial modeling, customer support, and document summarization across industries. Retrieval-augmented generation (RAG) systems ground LLM responses in up-to-date enterprise knowledge bases by retrieving relevant documents at query time and appending them to the prompt, reducing hallucination and knowledge-cutoff problems [Lewis et al., 2020]. RAG systems convert documents into vector embeddings stored in a vector database, retrieve the most relevant chunks at query time, and feed them alongside the user's question into the LLM context.

Scientific research

LLMs accelerate literature review, hypothesis generation, and structured data extraction from research papers. Specialized fine-tuned models target protein structure prediction, drug discovery, genomics, and clinical note summarization.

Conversational AI and consumer products

ChatGPT, Claude.ai, Google Gemini, and Microsoft Copilot serve hundreds of millions of users as general-purpose assistants for writing, research, learning, and productivity. Voice interfaces powered by native audio-capable models have extended the modality beyond text.

Agents and automation

Agentic systems loop an LLM with planning, memory, and tool calls (browsers, shells, code interpreters). Frameworks such as LangChain and LlamaIndex automate retrieval and tool orchestration. Multi-agent systems assign different roles to different model instances running in parallel or sequence, enabling decomposition of complex tasks.

Open-weight versus closed-API models

The LLM market in 2025-2026 splits into two main camps. Closed-weights labs (OpenAI, Anthropic, Google DeepMind for the Gemini frontier tier) ship via API and reveal little about parameter counts, training data, or training compute. Open-weights labs (Meta with Llama, Mistral, DeepSeek, Alibaba with Qwen, Google with Gemma, UAE's Technology Innovation Institute with Falcon) publish weights under licenses that range from permissive (Apache 2.0 for Mistral, Qwen, Gemma in many cases) to bespoke and restrictive (Llama Community License, Gemma terms).

DeepSeek-V3 and R1 were a turning point: the first time a freely downloadable open-weights model from outside the United States matched the reasoning quality of frontier closed-weights models on widely cited benchmarks, while reportedly using a much smaller training budget ^[14]. This intensified an already active debate about whether open weights are a safety risk (because alignment training can be undone with cheap fine-tuning) or a safety asset (because the wider research community can study and patch the models).

Llama 2 was widely described as "open source" by Meta but the Llama 2 Community License imposes redistribution and use limits. The Open Source Initiative has argued that the term is misleading; the more accurate label is "open weights" ^[29].

Notable open-weight families

Family	Provider	Latest sizes	Notes
Llama 3.1 / 4	Meta	8B, 70B, 405B; Scout 109B, Maverick 400B	Most-downloaded open-weights base; Llama Community License ^[13]^[33]
Mistral / Mixtral	Mistral AI	7B dense, 8x7B and 8x22B MoE	Apache 2.0 for most variants ^[20]
Qwen 2.5 / 3	Alibaba	0.5B to 235B	Apache 2.0; Qwen3 trained on 36T tokens ^[34]
DeepSeek V3 / R1 / V3.1	DeepSeek	671B-685B MoE	MIT-licensed weights; reasoning-trained variant ^[14]
Gemma 2	Google	2B, 9B, 27B	Distilled from Gemini; Gemma terms
Falcon	Technology Innovation Institute	7B, 40B, 180B	Trained on RefinedWeb ^[22]

Limitations

LLMs do not understand text in the way a human reader does. They are statistical models, and several failure modes follow from that.

Hallucination, the production of confident but false statements, is intrinsic to probabilistic generation. The model is rewarded for producing plausible-sounding text, not for refusing to answer when uncertain, so it will fabricate citations, invent code that calls non-existent functions, and confidently give wrong answers in long-tail domains. RAG, tool use, and citation training reduce the rate but do not eliminate it.

Long-context degradation is the gap between the advertised context window and the model's actual ability to use information deep inside it. Even with 1-million-token windows, recall drops for content in the middle of the context, and reasoning over information scattered across long documents is harder than tasks that fit in a few thousand tokens.

Bias and toxicity inherited from the training data show up in outputs. Models can refuse requests on the basis of demographic cues, generate stereotyped descriptions, or produce harmful content under adversarial prompting. Safety training reduces some of these but trades off against helpfulness on sensitive topics.

Knowledge cutoffs are intrinsic. A model trained through, say, late 2024 knows nothing about events after that date except through retrieval or tools. This is why almost all chat products now ship with web search.

Cost and energy are nontrivial. Training a frontier model requires tens of thousands of high-end GPUs running for weeks. Llama 3.1 405B used more than 16,000 H100 GPUs ^[13]. Inference at scale is itself a major datacenter workload, which is why providers invest heavily in quantization, KV-cache reuse, and speculative decoding.

Test-time compute scaling has its own limitations: extended reasoning chains do not reliably improve performance on knowledge-intensive tasks, and models can reach a correct intermediate step and then deviate toward an incorrect conclusion during prolonged reasoning.

Safety and alignment

The security literature treats LLMs as a system component with its own threat model. The OWASP 2025 list ranks prompt injection as the top vulnerability for LLM-integrated applications ^[31]. Three related but distinct concerns:

Prompt injection: an attacker hides instructions in untrusted text (a webpage, an email, a tool output) that the model follows when it processes them, potentially overriding the developer's system prompt.
Jailbreaking: a user crafts a prompt that bypasses the model's safety training, persuading it to produce content it was trained to refuse.
Data exfiltration through tool use: a compromised model in an agentic loop can be tricked into reading private data and writing it to an attacker-controlled destination.

Defenses combine input filtering, separate trust levels for system, developer, and user content, output checks, and defense-in-depth rather than reliance on the model's own safety training.

Alignment research asks whether the stated goal of producing helpful, harmless, and honest outputs can be durably encoded into model weights. Anthropic's Constitutional AI (2022) and Scalable Oversight (2022) are two published frameworks for doing this at scale without requiring human labeling of every output. OpenAI's Preparedness Framework and Anthropic's Responsible Scaling Policy describe commitments to evaluate models at capability thresholds before deployment.

The question of whether open-weight release increases risk remains actively debated. Proponents argue that published weights allow independent safety audits and community-driven patches; critics argue that any alignment training can be undone with modest fine-tuning budgets, making open weights net-negative for safety.

Economics

Frontier-model training is expensive but the absolute numbers are usually closely held. Public anchors:

Google's PaLM (540B parameters, 2022) was estimated to cost on the order of $8 million to train ^[1].
DeepSeek-V3's technical report describes a final pretraining run on 14.8 trillion tokens; the company gave a (much-discussed) figure of around $5.6 million in GPU-hour cost for that run, which excluded prior research, failed experiments, and post-training ^[14].
Llama 3.1 405B used more than 16,000 H100 GPUs for its training run ^[13].

Inference economics shifted just as dramatically. GPT-4 launched in 2023 at $0.03 per 1,000 input tokens. By GPT-4.1 in 2025, the same provider was offering models with eight times the context window at lower per-token prices ^[18]. Open-weights models running on commodity hardware pushed marginal inference cost down to near zero for many use cases.

Relationship to other AI systems

LLMs are one face of a broader category called foundation models, which also includes vision-language models, code models, and protein models. They are the language backbone behind:

AI coding assistants like GitHub Copilot, Cursor, Claude Code, and Windsurf, which combine an LLM with editor integration and tool use.
Agentic systems that loop an LLM with planning, memory, and tool calls (browsers, shells, code interpreters).
Retrieval-augmented systems that ground generation in an external corpus through a vector index.
Multimodal models that pair an LLM with image, audio, or video encoders, often trained jointly.

The research community uses LLMs as a substrate for nearly every applied NLP problem, from clinical note summarization to legal-document review.

Future outlook

Several trajectories appear likely to shape the next few years:

Test-time compute scaling is developing into a second axis alongside training-time scaling. The o-series and DeepSeek-R1 demonstrated that RL-trained reasoning models can solve problems that overwhelm direct-answer models of the same size. Improving the reliability of long reasoning chains and extending test-time scaling to knowledge-heavy domains are active research areas.

Longer and more reliable context continues to be a focus. Context windows have grown from 2K tokens in GPT-3 to 1-10 million tokens in 2025 frontier models. The practical bottleneck has shifted from window size to the model's ability to use context reliably.

Native multimodality is becoming standard. The transition from add-on vision to natively joint-trained multimodal models (Gemini, GPT-4o, Llama 4) is largely complete at the frontier. Video understanding and audio generation are the next modalities receiving heavy investment.

Agent reliability is the open problem for commercial deployment. LLMs can generate plausible multi-step plans but still make errors in long agentic loops. Reducing error rates in tool use, code execution, and long-horizon planning is central to converting LLMs from chat assistants into autonomous workers.

Model efficiency is advancing on multiple fronts: MoE architectures reduce active parameter counts, quantization reduces memory, speculative decoding reduces latency, and distillation allows smaller models to approach larger-model quality. These trends are pushing capable models further down the hardware cost curve.

References

Wikipedia. "Large language model." https://en.wikipedia.org/wiki/large_language_model
OpenAI. "GPT-2: 1.5B Release." November 5, 2019. https://openai.com/index/gpt-2-1-5b-release/
Wikipedia. "GPT-2." https://en.wikipedia.org/wiki/GPT-2
Mikolov, Tomas et al. "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP, 2014. https://nlp.stanford.edu/pubs/glove.pdf
Vaswani, Ashish et al. "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762. https://arxiv.org/abs/1706.03762
Devlin, Jacob et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv:1810.04805, 2018. https://arxiv.org/abs/1810.04805
Brown, Tom B. et al. "Language Models are Few-Shot Learners." arXiv:2005.14165, NeurIPS 2020. https://arxiv.org/abs/2005.14165
Raffel, Colin et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research, 2020. https://huggingface.co/google-t5/t5-11b
Ouyang, Long et al. "Training language models to follow instructions with human feedback." arXiv:2203.02155, NeurIPS 2022. https://arxiv.org/abs/2203.02155
OpenAI. "GPT-4 Technical Report." March 14, 2023. https://en.wikipedia.org/wiki/GPT-4
OpenAI. "Hello GPT-4o." May 13, 2024. https://openai.com/index/hello-gpt-4o/
Meta AI. "Introducing Llama 3.1: Our most capable models to date." July 23, 2024. https://ai.meta.com/blog/meta-llama-3-1/
DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, December 2024. https://arxiv.org/abs/2412.19437 ; DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, January 2025. https://arxiv.org/abs/2501.12948
Anthropic. "Introducing Claude 4." May 22, 2025. https://www.anthropic.com/news/claude-4
Anthropic. "Models overview." Claude API documentation. https://platform.claude.com/docs/en/about-claude/models/overview
Google. "Gemini 2.5 Deep Think rolling out now for Google AI Ultra." August 2025. https://9to5google.com/2025/08/01/gemini-2-5-deep-think/
OpenAI. "Introducing GPT-4.1 in the API." April 14, 2025. https://openai.com/index/gpt-4-1/
Su, Jianlin et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864, 2021. https://arxiv.org/abs/2104.09864
Mistral AI. "Mixtral of experts." December 11, 2023. https://mistral.ai/news/mixtral-of-experts
Common Crawl Foundation. https://commoncrawl.org
Penedo, Guilherme et al. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557, 2024. https://arxiv.org/abs/2406.17557 ; Penedo et al. "The RefinedWeb Dataset for Falcon LLM." arXiv:2306.01116, 2023.
Qwen Team. "Qwen2.5 Technical Report." arXiv:2412.15115, December 2024. https://arxiv.org/abs/2412.15115
Kaplan, Jared et al. "Scaling Laws for Neural Language Models." arXiv:2001.08361, January 2020. https://arxiv.org/abs/2001.08361
Hoffmann, Jordan et al. "Training Compute-Optimal Large Language Models." arXiv:2203.15556, March 2022. https://arxiv.org/abs/2203.15556
Bai, Yuntao et al. "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073, December 2022. https://arxiv.org/abs/2212.08073
Rafailov, Rafael et al. "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model." arXiv:2305.18290, NeurIPS 2023. https://arxiv.org/abs/2305.18290
vLLM Project. "Speculative decoding." Documentation. https://docs.vllm.ai/en/v0.6.6/usage/spec_decode.html
Wikipedia. "Llama (language model)." https://en.wikipedia.org/wiki/Llama_(language_model)
Stanford HAI. "The 2025 AI Index Report: Technical Performance." https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance
OWASP Foundation. "LLM01:2025 Prompt Injection." OWASP Gen AI Security Project. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
OpenAI. "Introducing GPT-5." https://openai.com/index/introducing-gpt-5/
Meta AI. "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation." April 5, 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/
Alibaba Cloud. "Alibaba Introduces Qwen3, Setting New Benchmark in Open-Source AI with Hybrid Reasoning." April 28, 2025. https://www.alibabacloud.com/blog/alibaba-introduces-qwen3-setting-new-benchmark-in-open-source-ai-with-hybrid-reasoning_602192
xAI. "Grok 3 Beta - The Age of Reasoning Agents." February 2025. https://x.ai/news/grok-3
Gu, Albert and Tri Dao. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752, December 2023. https://arxiv.org/abs/2312.00752
AI21 Labs. "Introducing Jamba: AI21's Groundbreaking SSM-Transformer Model." https://www.ai21.com/blog/announcing-jamba/
ARC Prize. "ARC Prize 2025 Results and Analysis." https://arcprize.org/blog/arc-prize-2025-results-analysis
Humanity's Last Exam benchmark. https://agi.safe.ai/

What counts as an LLM

History

Statistical foundations and early neural models

The transformer era (2017-2019)

RLHF and the chat product era (2022-2023)

Scale, open weights, and reasoning (2024-2026)

Architecture

Transformer backbone

Positional encoding

Sparse mixture-of-experts (MoE)

State space models and hybrid architectures

Training pipeline

Pretraining

Scaling laws

Post-training: SFT, RLHF, DPO, GRPO, RLVR

Reasoning models and test-time compute

Multimodal variants

Vision-language models

Audio and speech

Video understanding

Major models

Frontier labs

OpenAI

Anthropic

Google DeepMind

Meta AI

xAI

DeepSeek

Mistral AI

Alibaba (Qwen team)

Inference techniques

Serving frameworks

Speculative decoding

Quantization

Evaluation benchmarks

Use cases

Coding and software development

Enterprise and professional applications

Scientific research

Conversational AI and consumer products

Agents and automation

Open-weight versus closed-API models

Notable open-weight families

Limitations

Safety and alignment

Economics

Relationship to other AI systems

Future outlook

See also

References

Improve this article

Related Articles

Context engineering

Reasoning models

DeepSeek 3.0

Agentic Context Engineering

Claude Sonnet 4.5

Context window

What counts as an LLM

History

Statistical foundations and early neural models

The transformer era (2017-2019)

RLHF and the chat product era (2022-2023)

Scale, open weights, and reasoning (2024-2026)

Architecture

Transformer backbone

Positional encoding

Sparse mixture-of-experts (MoE)

State space models and hybrid architectures

Training pipeline

Pretraining

Scaling laws

Post-training: SFT, RLHF, DPO, GRPO, RLVR

Reasoning models and test-time compute

Multimodal variants

Vision-language models

Audio and speech

Video understanding

Major models

Frontier labs