Large language models (LLMs) are a class of neural network trained on very large bodies of text, typically with billions to trillions of parameters, that learn to predict tokens in a sequence and can be steered to follow instructions, write code, summarize documents, translate, reason about images, and call external tools. The category is fuzzy, since there is no formal parameter threshold that makes a model "large," but in practice the term is used for transformer-based models trained with self-supervised objectives on web-scale text and then post-trained for chat or task use. Wikipedia's working definition is simply "a neural network trained on a vast amount of text for natural language processing tasks, especially language generation" [1].
Modern LLMs sit at the center of generative AI products such as ChatGPT, Claude, Gemini, Microsoft Copilot, and Meta AI. They are also the substrate for the open-weight ecosystem around Llama, Mistral, Qwen, DeepSeek, and Gemma.
The "large" in LLM has shifted with hardware. GPT-2's 1.5 billion-parameter model was treated as too dangerous to release in early 2019; by 2025, models with several hundred billion total parameters were running in commercial chat products [2][3]. Three properties are usually present:
Models below roughly 1 billion parameters are sometimes called "small" language models, but the boundary is informal.
Language modeling predates deep learning. Statistical n-gram models from the 1990s and 2000s estimated the probability of the next word from counts of short sequences in a fixed corpus, and were the workhorse of speech recognition and machine translation for decades. By 2001, smoothed n-gram models trained on roughly 300 million words held the state of the art in perplexity [1].
The shift to learned distributed representations began with neural probabilistic language models (Bengio et al., 2003) and accelerated with word embeddings. Word2vec, published by Tomas Mikolov and colleagues at Google in 2013, made dense word vectors cheap to train and showed that arithmetic on those vectors captured surprising semantic structure, like the famous king minus man plus woman example [4]. GloVe followed in 2014 with a co-occurrence-based formulation [5]. ELMo (2018) extended this to contextual embeddings using bidirectional LSTMs.
The modern era began with the transformer paper, "Attention Is All You Need" (Vaswani et al., NeurIPS 2017), which dropped recurrence in favor of multi-head self-attention and dramatically improved parallelism on GPUs and TPUs [6]. Two complementary directions then split off:
T5 (Raffel et al., 2019) explored encoder-decoder transformers in the "text-to-text" framing, training the same architecture on translation, summarization, and classification by recasting all tasks as sequence-to-sequence problems; the largest checkpoint had 11B parameters [9].
The transition from raw language model to chat product happened with InstructGPT (Ouyang et al., March 2022), which combined supervised fine-tuning with reinforcement learning from human feedback (RLHF). Human labelers preferred outputs from a 1.3B-parameter InstructGPT model over the 175B GPT-3 base model, despite a 100x parameter gap [10]. ChatGPT, released by OpenAI on November 30, 2022, applied this recipe at scale and brought LLMs to a general audience. GPT-4 followed on March 14, 2023, with improved reasoning and a multimodal vision capability; OpenAI did not publish parameter counts or training compute [11].
2024 and 2025 were the years of multimodal-by-default models, longer context windows, and reasoning-trained variants. GPT-4o launched May 13, 2024 with native text, image, and audio I/O and audio response times around 320 milliseconds [12]. Llama 3.1, including a 405B-parameter version trained on more than 15 trillion tokens with a 128K context window, shipped on July 23, 2024 [13]. DeepSeek-V3 (December 2024) and DeepSeek-R1 (January 2025) introduced a 671B-parameter mixture-of-experts model with 37B active per token, trained on 14.8 trillion tokens, that matched frontier closed models on reasoning benchmarks at a fraction of the reported training cost [14]. Anthropic released Claude Opus 4 and Sonnet 4 on May 22, 2025 [15][16]. Google's Gemini 2.5 Pro, released March 20, 2025, shipped a 1-million-token context window and "thinking" reasoning [17]. GPT-4.1 followed on April 14, 2025 with a 1-million-token context window and large coding-benchmark gains over GPT-4o [18].
Nearly every production LLM as of 2026 is a transformer. The core unit is the self-attention layer: each token is projected to a query, key, and value vector, attention weights are computed by a softmax over query-key dot products, and the output is a weighted sum of value vectors. Stacking dozens to hundreds of these layers, interleaved with feed-forward networks and normalization, gives the model the capacity to mix information across long token spans [6].
Three architectural families coexist:
| Family | Pretraining objective | Typical use | Examples |
|---|---|---|---|
| Encoder-only | Masked-language modeling, next-sentence prediction | Classification, retrieval, embeddings | BERT, RoBERTa, DeBERTa |
| Decoder-only | Causal next-token prediction | Generation, chat, agents | GPT-3, Llama, Claude, Gemini, Mistral |
| Encoder-decoder | Span corruption (T5) or denoising | Translation, summarization, instruction-following | T5, BART, Flan-T5 |
Decoder-only autoregressive transformers became the default for general-purpose chat models because next-token prediction works for any task that can be written as text and because the same model serves both prompt encoding and generation.
The original transformer used fixed sinusoidal positional encodings [6]. Modern models almost always use learned alternatives. Rotary Position Embedding (RoPE), introduced by Su et al. in RoFormer (2021), encodes position by rotating query and key vectors, preserves relative position under shifts, and extrapolates more gracefully than absolute encodings; it is used in Llama, GPT-NeoX, and most newer open models [19]. ALiBi (Press et al., 2022) instead biases attention scores by a linear function of token distance and continues to work well past the training context length.
Mixture of Experts (MoE) routes each token through a small subset of expert feed-forward networks rather than running the full network on every token. Mistral AI's Mixtral 8x7B, released December 11, 2023, has 46.7 billion total parameters but uses only about 12.9 billion per token, giving it the inference cost of a much smaller dense model while matching or beating Llama 2 70B on many benchmarks [20]. DeepSeek-V3 pushed this further: 671 billion total parameters, 37 billion active per token, 256 routed experts plus a shared expert per layer, with auxiliary-loss-free load balancing [14].
A frontier LLM is built in stages. The terminology varies between labs, but the structure is fairly stable.
The model is trained with self-supervised next-token prediction on a corpus of web text, books, code, scientific papers, and (increasingly) synthetic data. The standard public source is Common Crawl, a non-profit web archive that has been crawling the web since 2007 and releases monthly snapshots of 200 to 400 TiB [21]. Derivative datasets clean and deduplicate it. RefinedWeb (2023) produced 5 trillion English tokens and was used to train Falcon. FineWeb (2024) is a 15-trillion-token dataset distilled from 96 Common Crawl snapshots and is large enough to train a Chinchilla-optimal model with more than 500 billion parameters [22]. Llama 3.1 was trained on more than 15 trillion tokens; Qwen 2.5 used 18 trillion [13][23].
The key hyperparameters at this stage are model size (parameter count), dataset size (tokens), and compute budget. The relationships between these were formalized as scaling laws.
Kaplan et al. (OpenAI, January 2020) showed that test loss scales as a power law in model size, dataset size, and compute, with the power-law trend holding over more than seven orders of magnitude in compute. Their conclusion was that, given a fixed compute budget, you should spend most of it on a larger model and undertrain it on relatively few tokens [24].
The DeepMind Chinchilla paper (Hoffmann et al., March 2022) revisited this by training more than 400 models from 70 million to 16 billion parameters on between 5 and 500 billion tokens. They found that for compute-optimal training, model size and training tokens should grow at the same rate: roughly 20 tokens per parameter, not the much smaller ratios used by GPT-3 and similar models. They tested this by training Chinchilla, a 70B model on 1.4 trillion tokens, with the same compute budget as the 280B Gopher; Chinchilla beat Gopher, GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a wide range of downstream tasks [25].
The practical effect was that post-2022 models got smaller and trained on more data. Llama 2 70B was trained on 2 trillion tokens, Llama 3.1 8B on more than 15 trillion. The cost-optimal frontier moved toward more data per parameter, then later toward investing more in inference compute (the "reasoning" or test-time scaling regime that produced the o-series and DeepSeek-R1).
A freshly pretrained model is a competent next-token predictor but is not a useful assistant. Three things have to happen, in roughly this order:
Direct Preference Optimization (DPO), introduced by Rafailov et al. at NeurIPS 2023, replaced the reward-model-plus-PPO pipeline with a single supervised classification loss on preference pairs. The trick is that the optimal RLHF policy can be written in closed form as a function of the reward, so the reward model implicit in the policy can be optimized directly. DPO matches or beats PPO-based RLHF on summarization and dialogue tasks while being much simpler to implement, and is now the default in many open-source post-training stacks [27].
More recent rounds of post-training add tool-use traces (function calling, code execution, web search), agentic behavior (multi-step planning), and reasoning chains generated by either a teacher model or a separate reward signal that grades correctness on math and code problems.
A modern LLM, given a text prompt, can:
In-context learning was the surprise finding from GPT-3: the same frozen weights could do translation, arithmetic, SAT analogies, and unscrambling words, with task specification handled entirely through the prompt [8]. Chain-of-thought prompting, popularized by Wei et al. (2022), showed that asking the model to "think step by step" before producing an answer raised math and logic benchmark scores by large margins. The reasoning-trained models that emerged in late 2024 (OpenAI's o1, then DeepSeek-R1) made step-by-step reasoning a built-in capability rather than a prompting trick.
Generating text from an LLM is a token-by-token loop. At each step, the model produces a probability distribution over the vocabulary, a sampling rule picks one token, and the new token is appended to the prompt for the next step. The main sampling controls are:
| Parameter | Effect |
|---|---|
| Temperature | Sharpens (low) or flattens (high) the next-token distribution; 0 reduces to greedy argmax decoding |
| Top-k | Restricts sampling to the k highest-probability tokens |
| Top-p (nucleus) | Restricts sampling to the smallest set of tokens whose cumulative probability exceeds p |
| Min-p | Drops tokens whose probability is below a fraction of the most likely token |
| Beam search | Maintains multiple candidate sequences and keeps the highest-scoring overall |
| Speculative decoding | Uses a small draft model to propose multiple tokens that the large model verifies in parallel, giving 2-3x latency speedups without changing the output distribution [28] |
For production chat, temperature is typically held between 0.5 and 1.0 with top-p around 0.9. Code-completion settings often use lower temperatures and rely more on greedy or speculative decoding to reduce latency.
This is a non-exhaustive list of LLMs that have shaped the field. Parameter counts, where reported, are total parameters; context windows are at standard pricing tier when applicable.
| Model | Provider | Released | Parameters | Context | License | Notes |
|---|---|---|---|---|---|---|
| GPT-2 | OpenAI | 2019 | 1.5B (largest) | 1024 | MIT (weights) | Staged release; full 1.5B weights released Nov 2019 [2] |
| GPT-3 | OpenAI | May 2020 | 175B | 2048 | API only | Demonstrated in-context few-shot learning [8] |
| T5 (11B) | Oct 2019 | 11B | 512 | Apache 2.0 | Text-to-text encoder-decoder [9] | |
| BERT base/large | Oct 2018 | 110M / 340M | 512 | Apache 2.0 | Encoder-only, masked LM [7] | |
| InstructGPT | OpenAI | Mar 2022 | 1.3B / 6B / 175B | 2048 | API only | First major RLHF deployment [10] |
| ChatGPT | OpenAI | Nov 2022 | not disclosed | 4096 (initial) | Product | Brought LLMs to general public |
| GPT-4 | OpenAI | Mar 2023 | not disclosed | 8K / 32K | API only | Multimodal vision, no published params [11] |
| Llama 2 | Meta | Jul 2023 | 7B / 13B / 70B | 4096 | Llama 2 Community | First weights-available chat-tuned Llama [29] |
| Mistral 7B | Mistral | Sep 2023 | 7.3B | 8192 | Apache 2.0 | Strong small dense model |
| Mixtral 8x7B | Mistral | Dec 2023 | 46.7B (12.9B active) | 32K | Apache 2.0 | Sparse MoE [20] |
| Gemini 1.0 | Google DeepMind | Dec 2023 | not disclosed | 32K | API only | Native multimodal training |
| GPT-4o | OpenAI | May 2024 | not disclosed | 128K | API only | Native text, audio, image I/O [12] |
| Llama 3.1 | Meta | Jul 2024 | 8B / 70B / 405B | 128K | Llama 3 Community | 405B trained on 15T+ tokens, 16K H100s [13] |
| Qwen 2.5 | Alibaba | Sep 2024 | 0.5B to 72B | up to 128K | Apache 2.0 (most) | Pretrained on 18T tokens [23] |
| Gemma 2 | Jun 2024 | 2B / 9B / 27B | 8192 | Gemma terms | Open-weight, distilled from Gemini | |
| DeepSeek-V3 | DeepSeek | Dec 2024 | 671B (37B active) | 128K | MIT (weights) | MoE, 14.8T tokens, low reported training cost [14] |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 671B (37B active) | 128K | MIT (weights) | RL-trained reasoning model on V3 base [14] |
| Gemini 2.5 Pro | Google DeepMind | Mar 2025 | not disclosed | 1M | API only | Thinking model; 2M context announced [17] |
| GPT-4.1 | OpenAI | Apr 2025 | not disclosed | 1M | API only | 54.6% on SWE-bench Verified [18] |
| Claude Opus 4 | Anthropic | May 2025 | not disclosed | 200K | API only | Released alongside Sonnet 4 [15] |
| Claude Sonnet 4 | Anthropic | May 2025 | not disclosed | 200K (1M beta) | API only | Long-context beta to Apr 2026 [16] |
Llama 2 was widely described as "open source" by Meta but the Llama 2 Community License imposes redistribution and use limits. The Open Source Initiative and others have argued that the term is misleading; in practice the more accurate label is "open weights" [29].
No single number captures LLM quality. The benchmark stack used in 2025 includes:
Benchmark saturation is a chronic problem. MMLU was state of the art in 2020 and is now a near-ceiling task. The reaction has been to introduce harder benchmarks (GPQA Diamond, FrontierMath, Humanity's Last Exam) and to lean on agentic, real-world evaluations like SWE-bench Verified that are harder to game.
LLMs do not understand text in the way a human reader does. They are statistical models, and several failure modes follow from that.
Hallucination, the production of confident but false statements, is intrinsic to probabilistic generation. The model is rewarded for producing plausible-sounding text, not for refusing to answer when uncertain, so it will fabricate citations, invent code that calls non-existent functions, and confidently give wrong answers in long-tail domains. Retrieval-augmented generation, tool use, and citation training reduce the rate but do not eliminate it.
Long-context degradation is the gap between the advertised context window and the model's actual ability to use information deep inside it. Even with 1-million-token windows, recall drops in the middle of the context, and reasoning over information scattered across long documents is harder than tasks that fit in a few thousand tokens.
Bias and toxicity inherited from the training data show up in outputs. Models can refuse requests on the basis of demographic cues, generate stereotyped descriptions, or produce harmful content under adversarial prompting. Safety training reduces some of these but trades off against helpfulness on sensitive topics.
Knowledge cutoffs are intrinsic. A model trained through, say, late 2024, knows nothing about events after that date except through retrieval or tools. This is why almost all chat products now ship with web search built in.
Cost and energy are nontrivial. Training a frontier model requires tens of thousands of high-end GPUs running for weeks. Llama 3.1 405B used more than 16,000 H100 GPUs [13]. Inference at scale (hundreds of millions of users) is itself a major datacenter workload, which is why providers invest heavily in techniques like quantization, KV-cache reuse, and speculative decoding.
The security literature treats LLMs as a system component with its own threat model. The OWASP 2025 list ranks prompt injection as the top vulnerability for LLM-integrated applications [31]. Three related but distinct concerns:
Defenses combine input filtering, separate trust levels for system, developer, and user content, output checks, and "defense in depth" rather than reliance on the model's own safety training.
The LLM market in 2025 splits into roughly two camps. Closed-weights labs (OpenAI, Anthropic, Google DeepMind for the Gemini frontier tier) ship via API and reveal little about parameter counts, training data, or training compute. Open-weights labs (Meta with Llama, Mistral, DeepSeek, Alibaba with Qwen, Google with Gemma, the UAE's TII with Falcon) publish weights under licenses that range from permissive (Apache 2.0 for Mistral, Qwen, Gemma in many cases) to bespoke and restrictive (Llama Community License, Gemma terms).
DeepSeek-V3 and R1 were a turning point: the first time a freely downloadable open-weights model from outside the United States matched the reasoning quality of frontier closed-weights models on widely cited benchmarks, while reportedly using a much smaller training budget [14]. This intensified an already lively debate about whether open weights are a safety risk (because alignment training can be undone with cheap fine-tuning) or a safety asset (because the wider research community can study and patch the models).
| Family | Provider | Latest sizes | Notes |
|---|---|---|---|
| Llama 3.1 / 3.3 | Meta | 8B, 70B, 405B | Most-downloaded open-weights base; Llama Community License [13] |
| Mistral / Mixtral | Mistral AI | 7B dense, 8x7B and 8x22B MoE | Apache 2.0 for most variants [20] |
| Qwen 2.5 | Alibaba | 0.5B to 72B | Apache 2.0 for most sizes; 18T pretraining tokens [23] |
| DeepSeek V3 / R1 | DeepSeek | 671B MoE (37B active) | MIT-licensed weights; reasoning-trained variant [14] |
| Gemma 2 | 2B, 9B, 27B | Distilled from Gemini; Gemma terms | |
| Falcon | Technology Innovation Institute | 7B, 40B, 180B | Trained on RefinedWeb [22] |
Qwen 2.5-72B-Instruct was reported to outperform a number of larger open and proprietary models, and to compete with Llama 3.1 405B-Instruct, which has roughly five times its parameter count [23].
Frontier-model training is expensive but the absolute numbers are usually closely held. Public anchors:
Inference economics shifted just as dramatically. GPT-4 launched in 2023 at $0.03 per 1,000 input tokens and $0.06 per 1,000 output tokens [11]. By GPT-4.1 in 2025, the same provider was offering models with eight times the context window at lower per-token prices [18]. Open-weights models running on commodity hardware further pushed marginal inference cost down to near zero for many use cases.
LLMs are one face of a broader category called foundation models, which also includes vision-language models, code models, and protein models. They are the language backbone behind:
The research community uses LLMs as a substrate for nearly every applied NLP problem, from clinical note summarization to legal-document review. Whether this is a net good remains contested, and the same debate plays out around copyright, labor displacement, and educational use.
[1] Wikipedia. "Large language model." https://en.wikipedia.org/wiki/large_language_model
[2] OpenAI. "GPT-2: 1.5B Release." November 5, 2019. https://openai.com/index/gpt-2-1-5b-release/
[3] Wikipedia. "GPT-2." https://en.wikipedia.org/wiki/GPT-2
[4] Mikolov, Tomas et al. "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781
[5] Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP, 2014. https://nlp.stanford.edu/pubs/glove.pdf
[6] Vaswani, Ashish et al. "Attention Is All You Need." NeurIPS 2017. arXiv:1706.03762. https://arxiv.org/abs/1706.03762
[7] Devlin, Jacob et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv:1810.04805, 2018. https://arxiv.org/abs/1810.04805
[8] Brown, Tom B. et al. "Language Models are Few-Shot Learners." arXiv:2005.14165, NeurIPS 2020. https://arxiv.org/abs/2005.14165
[9] Raffel, Colin et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research, 2020. T5 model card: https://huggingface.co/google-t5/t5-11b
[10] Ouyang, Long et al. "Training language models to follow instructions with human feedback." arXiv:2203.02155, NeurIPS 2022. https://arxiv.org/abs/2203.02155
[11] OpenAI. "GPT-4 Technical Report." March 14, 2023. Wikipedia summary: https://en.wikipedia.org/wiki/GPT-4
[12] OpenAI. "Hello GPT-4o." May 13, 2024. https://openai.com/index/hello-gpt-4o/
[13] Meta AI. "Introducing Llama 3.1: Our most capable models to date." July 23, 2024. https://ai.meta.com/blog/meta-llama-3-1/
[14] DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, December 2024. https://arxiv.org/abs/2412.19437
[15] Anthropic. "Introducing Claude 4." May 22, 2025. https://www.anthropic.com/news/claude-4
[16] Anthropic. "Models overview." Claude API documentation. https://platform.claude.com/docs/en/about-claude/models/overview
[17] Google. "Gemini 2.5: Our newest Gemini model with thinking." March 25, 2025. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-model-thinking-updates-march-2025/
[18] OpenAI. "Introducing GPT-4.1 in the API." April 14, 2025. https://openai.com/index/gpt-4-1/
[19] Su, Jianlin et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864, 2021. https://arxiv.org/abs/2104.09864
[20] Mistral AI. "Mixtral of experts." December 11, 2023. https://mistral.ai/news/mixtral-of-experts
[21] Common Crawl Foundation. https://commoncrawl.org
[22] Penedo, Guilherme et al. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557, 2024. https://arxiv.org/abs/2406.17557 ; Penedo et al. "The RefinedWeb Dataset for Falcon LLM," arXiv:2306.01116, 2023.
[23] Qwen Team. "Qwen2.5 Technical Report." arXiv:2412.15115, December 2024. https://arxiv.org/abs/2412.15115
[24] Kaplan, Jared et al. "Scaling Laws for Neural Language Models." arXiv:2001.08361, January 2020. https://arxiv.org/abs/2001.08361
[25] Hoffmann, Jordan et al. "Training Compute-Optimal Large Language Models." arXiv:2203.15556, March 2022. https://arxiv.org/abs/2203.15556
[26] Bai, Yuntao et al. "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073, December 2022. https://arxiv.org/abs/2212.08073
[27] Rafailov, Rafael et al. "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model." arXiv:2305.18290, NeurIPS 2023. https://arxiv.org/abs/2305.18290
[28] vLLM Project. "Speculative decoding." Documentation. https://docs.vllm.ai/en/v0.6.6/usage/spec_decode.html
[29] Wikipedia. "Llama (language model)." https://en.wikipedia.org/wiki/Llama_(language_model)
[30] Stanford HAI. "The 2025 AI Index Report: Technical Performance." https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance ; Vellum LLM Leaderboard, https://www.vellum.ai/llm-leaderboard
[31] OWASP Foundation. "LLM01:2025 Prompt Injection." OWASP Gen AI Security Project. https://genai.owasp.org/llmrisk/llm01-prompt-injection/