Large Language Model

Artificial Intelligence Deep Learning Large Language Models Machine Learning Natural Language Processing

64 min read

Updated Jul 9, 2026

Fact-checked

Jul 9, 2026

Sources

84 citations

Revision

v11 · 12,880 words

What is a large language model?

A large language model (LLM) is an artificial intelligence system built on a transformer neural network with billions to trillions of parameters, trained on massive text corpora to predict the next token in a sequence and, through that single objective, learn to understand and generate human language. LLMs perform translation, summarization, question answering, code generation, and open-ended conversation, and they are the engines behind ChatGPT, Claude, Gemini, and Microsoft Copilot. The decisive demonstration came with GPT-3 in 2020: OpenAI reported training "an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model" and showed that "scaling up language models greatly improves task-agnostic, few-shot performance" ^[9]. Since the release of GPT-1 in 2018, LLMs have become one of the most consequential developments in the history of computing, powering products used by hundreds of millions of people worldwide.

LLMs work by learning statistical patterns in text. During training, a model reads vast quantities of text from books, websites, academic papers, and code repositories, building an internal representation of language structure, factual knowledge, and reasoning patterns. The resulting model can then generate text one token at a time, each time predicting a probability distribution over the next token given everything that came before it and selecting a token from that distribution. Despite this relatively simple mechanism, LLMs exhibit surprisingly complex behavior, including the ability to follow instructions, write software, solve math problems, and engage in multi-step reasoning.

The term "large" in LLM is relative and has shifted over time. In 2018, GPT-1's 117 million parameters qualified as large. By 2025, models with fewer than a billion parameters are generally considered small, and frontier LLMs contain hundreds of billions to trillions of parameters. The "language model" part refers to the core training objective: predicting the probability distribution over the next token in a sequence, a form of self-supervised learning that requires no manually labeled data.

There is no formal parameter threshold that makes a model "large" ^[1]; in practice, three properties are usually present:

a transformer (or close variant) backbone with attention as the dominant mixing operator,
self-supervised pretraining on a corpus large enough that the model sees most examples only once or a handful of times,
a separate post-training stage that turns the raw next-token predictor into a usable assistant, typically supervised fine-tuning followed by preference optimization.

Modern LLMs sit at the center of generative AI products such as ChatGPT, Claude, Gemini, and Microsoft Copilot, are the substrate for the open-weight ecosystem around Llama, Mistral, Qwen, DeepSeek, and Gemma, and form one face of the broader category of foundation models, which also includes vision-language, code, and protein models.

History and evolution

The development of LLMs can be traced through several distinct phases, each marked by significant increases in model size, training data, and capability.

Statistical and neural precursors (1990s-2016)

Language modeling predates deep learning. Statistical n-gram models from the 1990s and 2000s estimated the probability of the next word from counts of short sequences in a fixed corpus, and were the workhorse of speech recognition and machine translation for decades. By 2001, smoothed n-gram models trained on roughly 300 million words held the state of the art in perplexity ^[1].

The shift to learned distributed representations began with neural probabilistic language models (Bengio et al., 2003) and accelerated with word embeddings. Word2vec, published by Tomas Mikolov and colleagues at Google in 2013, made dense word vectors cheap to train and showed that arithmetic on those vectors captured surprising semantic structure, including the famous king minus man plus woman example ^[2]. GloVe followed in 2014 with a co-occurrence-based formulation ^[3], and ELMo (2018) extended the idea to contextual embeddings using bidirectional LSTMs.

Early foundations (2017-2019)

The transformer architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," laid the groundwork for all modern LLMs ^[4]. The paper proposed "a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely" ^[4]. By replacing the recurrent neural networks that dominated sequence modeling with a self-attention mechanism that processes an entire sequence in parallel, the design dramatically improved both training speed and the model's ability to capture long-range dependencies in text.

OpenAI released GPT-1 in June 2018 with 117 million parameters ^[5]. It was trained on the BookCorpus dataset and demonstrated that generative pre-training followed by discriminative fine-tuning could achieve strong results on a variety of natural language processing benchmarks. In February 2019, GPT-2 followed with 1.5 billion parameters, trained on WebText, a 40-gigabyte dataset of 8 million web pages ^[6]. OpenAI initially withheld the full model citing concerns about potential misuse for generating disinformation, releasing it in stages over several months; the full 1.5-billion-parameter weights were published in November 2019 ^[7].

Google introduced BERT (Bidirectional Encoder Representations from Transformers) in October 2018 with 340 million parameters. Unlike GPT, BERT used a bidirectional training approach (masked language modeling), making it particularly effective for understanding tasks like classification and question answering rather than text generation ^[8]. Encoder-only models of this family went on to dominate discriminative NLP benchmarks such as GLUE and SuperGLUE.

The scaling era (2020-2022)

The release of GPT-3 in May 2020 marked a turning point. With 175 billion parameters, a 2,048-token context window, and pre-training on roughly 300 billion tokens, GPT-3 showed that scaling up model size could unlock qualitatively new capabilities ^[9]. The paper that introduced it was titled "Language Models are Few-Shot Learners," and it documented impressive "few-shot learning" abilities: the model could perform tasks it had never been explicitly trained for simply by being given a few examples in the prompt ^[9]. GPT-3's training cost was estimated at $4.6 million in cloud compute, and the model required approximately 350 GB of storage for its weights alone (175 billion parameters at 2 bytes each).

Google responded with PaLM (Pathways Language Model) in April 2022, scaling to 540 billion parameters. PaLM demonstrated strong performance across NLP benchmarks and showed particular strength in reasoning tasks when combined with chain-of-thought prompting ^[10]. Google also developed LaMDA, a model focused specifically on natural conversational abilities.

This period also saw the emergence of several important open-source efforts. EleutherAI released GPT-Neo and GPT-J, providing the research community with openly accessible alternatives to proprietary models. BigScience, an international collaboration, released BLOOM, a 176-billion-parameter multilingual model, in July 2022. Google released T5 (Text-to-Text Transfer Transformer; Raffel et al., 2019), which framed all NLP tasks as text-to-text problems using an encoder-decoder architecture, with checkpoints up to 11 billion parameters ^[11], and later Flan-T5, an instruction-tuned variant that demonstrated the power of multi-task fine-tuning ^[12].

The transition from raw language model to chat assistant began with InstructGPT (Ouyang et al., March 2022), which combined supervised fine-tuning with reinforcement learning from human feedback (RLHF). Human labelers preferred outputs from a 1.3-billion-parameter InstructGPT model over the 175-billion-parameter GPT-3 base model, despite a 100x parameter gap ^[13].

The ChatGPT moment and beyond (2023-2024)

The launch of ChatGPT on November 30, 2022 (built on GPT-3.5) brought LLMs into mainstream public awareness. Within two months, it had over 100 million users, making it the fastest-growing consumer application in history at the time.

OpenAI released GPT-4 on March 14, 2023, a multimodal model capable of processing both text and images. GPT-4 was widely praised for its increased accuracy and reasoning capabilities ^[14]. Anthropic launched the Claude model family in March 2023, followed by Claude 2 in July 2023, emphasizing its Constitutional AI approach to safety. Google launched its Bard chatbot in March 2023 and released the Gemini 1.0 model family with Nano, Pro, and Ultra variants in December 2023, renaming Bard to Gemini in February 2024 ^[74].

Meta released LLaMA (Large Language Model Meta AI) in February 2023, a family of models ranging from 7 billion to 65 billion parameters. The 13B-parameter LLaMA model outperformed GPT-3 (175B) on most NLP benchmarks, demonstrating that smaller, well-trained models could match or exceed much larger ones ^[15]. LLaMA's open release catalyzed a wave of community fine-tuning projects including Alpaca, Vicuna, and Koala. Meta followed with LLaMA 2 in July 2023 and the LLaMA 3 family in 2024, which reached 405 billion parameters with the 3.1 release.

GPT-4o launched on May 13, 2024 with native text, image, and audio input and output, achieving audio response times around 320 milliseconds ^[16]. Meta's Llama 3.1, including a 405-billion-parameter version trained on more than 15 trillion tokens with a 128,000-token context window, shipped on July 23, 2024 ^[17].

In September 2024, OpenAI released o1-preview, the first in a new series of "reasoning models" trained specifically for extended chain-of-thought problem solving, representing a new paradigm in LLM capability ^[18].

The frontier era (2025-2026)

By 2025, LLMs entered a new phase characterized by massive context windows, native multimodality, mixture-of-experts architectures, and agentic capabilities.

OpenAI released GPT-4.1 on April 14, 2025, an API-only family with a 1-million-token context window and large coding-benchmark gains over GPT-4o ^[19]. GPT-5 followed on August 7, 2025, featuring a 400,000-token context window and significantly improved reliability: OpenAI reported that with web search enabled its responses were about 45% less likely to contain a factual error than GPT-4o's, and about 80% less likely than o3's when using extended thinking ^[20]. It scored 94.6% on the AIME 2025 math benchmark without tools and 74.9% on SWE-bench Verified for agentic coding ^[20]. GPT-5.2 followed in December 2025 with improved tool use and long-context processing ^[20].

Anthropic released Claude Opus 4 and Claude Sonnet 4 on May 22, 2025 ^[21], designed explicitly for agentic use cases including tool invocation, file access, and long-horizon reasoning. Claude Sonnet 4 gained a 1-million-token context window by August 2025. Claude Opus 4.5 arrived in November 2025, and the Claude 4.6 family launched in February 2026 with 1M-token context and up to 128K output tokens ^[22].

Google's Gemini 2.5 Pro, released March 25, 2025, shipped a 1-million-token context window and a "thinking" reasoning mode, with a Deep Think variant rolled out in August 2025 using parallel thinking techniques ^[23]. Google then released Gemini 3 Pro in November 2025, followed by Gemini 3.1 Pro in February 2026, which led on 12 of 18 tracked benchmarks and offered a 1-million-token context window ^[24].

Meta released the LLaMA 4 family on April 5, 2025, marking an architectural shift to mixture-of-experts (MoE) design with native multimodality, trained on more than 30 trillion tokens. LLaMA 4 Scout featured a 10-million-token context window, capable of processing approximately 7,500 pages of text ^[25].

In December 2024, the Chinese AI lab DeepSeek released DeepSeek-V3, a 671-billion-parameter mixture-of-experts model trained on 14.8 trillion tokens, and followed in January 2025 with DeepSeek R1, an open-weight reasoning model that performed comparably to OpenAI's o1 at a fraction of the cost per token ^[26]. Alibaba's Qwen3 family, released April 28, 2025, was trained on 36 trillion tokens and introduced hybrid thinking/non-thinking modes across dense and MoE models ^[27]. Mistral Large 3 launched in December 2025 with 675 billion total parameters (41 billion active) under the Apache 2.0 open-source license ^[28].

How does a large language model work?

At inference time, an LLM is a function that takes a sequence of tokens and returns a probability distribution over the next token. Text is generated by sampling one token from that distribution, appending it to the input, and repeating, so a 500-word answer is produced by running the model hundreds of times in sequence. Everything the model "knows" is encoded in the weights of its transformer layers, learned during pre-training. The sections below describe the components that make this work: the transformer backbone, how text is converted into tokens, and how the raw next-token predictor is trained into a usable assistant.
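The generation loop described above can be sketched in a few lines. This is a minimal illustration, not how any production system is implemented: `TOY_MODEL` is a hypothetical lookup table standing in for a real network, and it conditions only on the last token, whereas a real LLM conditions on the whole sequence.

```python
import random

# Hypothetical stand-in for an LLM: maps the last token to a next-token
# distribution. A real model would compute this from the full sequence.
TOY_MODEL = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
}


def next_token_distribution(tokens):
    """Return a probability distribution over the next token."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt, max_new_tokens=10, greedy=True, seed=0):
    """Autoregressive decoding: predict, pick a token, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        if greedy:
            token = max(dist, key=dist.get)  # most likely token
        else:
            token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(token)
    return tokens


print(generate(["<s>"]))  # ['<s>', 'the', 'cat', 'sat']
```

Each pass through the loop corresponds to one run of the model, which is why output length drives inference cost.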

The transformer foundation

Virtually all modern LLMs are based on the Transformer architecture. Transformers rely on attention mechanisms (specifically self-attention) that allow the model to weigh the importance of different tokens in a sequence relative to each other. In each self-attention layer, every token is projected to a query, key, and value vector; attention weights are computed by a softmax over query-key dot products, and the output is a weighted sum of value vectors. Stacking dozens to hundreds of these layers, interleaved with feed-forward networks and normalization, gives the model the capacity to mix information across long token spans ^[4]. This enables the model to learn complex linguistic patterns and generate coherent, context-aware text across long sequences.
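As a rough sketch of the computation just described, the following single-head attention function uses plain Python lists in place of tensors. It assumes the queries, keys, and values have already been projected, divides scores by the square root of the key dimension as in the original Transformer paper, and applies a causal mask of the kind decoder-only models use. The function names are illustrative.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=True):
    """Scaled dot-product attention for one head.

    queries, keys: lists of d_k-dimensional vectors, one per token.
    values: list of d_v-dimensional vectors, one per token.
    """
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # token i cannot see the future
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Output is the attention-weighted sum of value vectors.
        out = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
print(attention(q, k, v))
```

With the causal mask on, the first token can only attend to itself, so its output equals its own value vector.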

The original Transformer had both an encoder (for understanding input) and a decoder (for generating output). Modern LLMs have diverged into distinct architectural families:

Architecture type	How it works	Training objective	Strengths	Example models
Decoder-only	Generates text left-to-right using causal (unidirectional) attention	Next-token prediction	Text generation, conversation, code	GPT series, Claude, LLaMA, Mistral
Encoder-only	Processes input bidirectionally using full (unmasked) self-attention	Masked language modeling	Classification, NER, sentence embeddings	BERT, RoBERTa, DeBERTa
Encoder-decoder	Maps input to output via cross-attention between encoder and decoder	Span corruption or text-to-text	Translation, summarization, question answering	T5, BART, Flan-T5

The decoder-only architecture has become dominant for large-scale language models because it naturally supports autoregressive text generation, scales efficiently with increasing parameter counts, and uses the same network for both prompt encoding and generation. A 2024 study found that at small scales, encoder-decoder models can outperform decoder-only models by several points on complex tasks, but this advantage diminishes at larger scales where decoder-only models match or exceed them ^[29].

Mixture-of-experts (MoE)

A significant architectural trend in 2024-2025 has been the adoption of Mixture-of-Experts designs. In an MoE model, only a fraction of the total parameters are activated for any given input token. A routing mechanism selects which "expert" subnetworks to use, allowing models to have very large total parameter counts while keeping inference costs manageable.

The first widely deployed open example was Mistral AI's Mixtral 8x7B, released December 11, 2023, with 46.7 billion total parameters but only about 12.9 billion used per token, giving it the inference cost of a much smaller dense model while matching or beating Llama 2 70B on many benchmarks ^[30]. DeepSeek-V3 pushed the approach further: 671 billion total parameters, 37 billion active per token, and 256 routed experts plus a shared expert per layer, with auxiliary-loss-free load balancing ^[26]. Frontier models followed the same pattern: Mistral Large 3 activates 41 of its 675 billion parameters per token, LLaMA 4 Scout uses 16 experts (17 billion active of 109 billion total), and LLaMA 4 Maverick uses 128 experts (17 billion active of 400 billion total) ^[25]^[28].
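A simplified sketch of top-k routing follows. It assumes the router logits are supplied directly (in a real layer they come from a learned projection of the token), renormalizes gate weights over the selected experts, and omits load balancing entirely. The helper `active_fraction` just restates the parameter figures quoted above as ratios.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Select the k highest-scoring experts and normalize their gate weights."""
    order = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    chosen = order[:k]
    gates = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, gates))


def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(x)
    for idx, gate in route_top_k(router_logits, k):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


def active_fraction(active_params, total_params):
    """Share of parameters used per token."""
    return active_params / total_params


# Four toy "experts" that simply scale their input by 1, 2, 3, and 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
print(moe_layer([1.0, 2.0], experts, [0.0, 5.0, 5.0, 0.0]))
print(f"Mixtral 8x7B: {active_fraction(12.9, 46.7):.1%} active per token")
print(f"DeepSeek-V3:  {active_fraction(37, 671):.1%} active per token")
```

The ratio makes the trade-off concrete: total parameters set the memory footprint, while active parameters set the per-token compute.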

State space models and hybrid architectures

Mamba, introduced by Gu and Dao in December 2023, uses selective state space models rather than attention as the core sequence-mixing operation ^[31]. Mamba scales linearly with sequence length in both computation and memory, compared with the quadratic cost of standard attention, making it attractive for very long sequences. Hybrid architectures that interleave Mamba layers with attention layers have shown that combining the two can outperform either alone: AI21 Labs' Jamba family achieved production deployment, with Jamba 1.5 scaling to 398 billion total parameters (94 billion active) using 16 MoE experts ^[32]. As of 2025, pure Mamba models have not displaced Transformers in frontier chat products, but hybrid designs remain an active research direction.

Key architectural components

Multi-head attention: Allows the model to attend to information from different representation subspaces at different positions simultaneously. Each "head" learns to focus on different types of relationships (syntactic, semantic, positional).
Positional encoding: Since Transformers process all tokens in parallel, positional encodings inject information about token order. The original Transformer used fixed sinusoidal encodings; modern LLMs typically use Rotary Positional Embeddings (RoPE), introduced by Su et al. in RoFormer (2021), which encode relative position by rotating query and key vectors and can be extended to support longer context windows than those seen during training. RoPE is used in LLaMA, GPT-NeoX, and most newer open models ^[33]. An alternative, ALiBi (Press et al., 2022), biases attention scores by a linear function of token distance and continues to work past the training context length.
Layer normalization: Stabilizes training by normalizing activations within each layer. Most modern LLMs use pre-layer normalization (applying LayerNorm before rather than after the attention and feed-forward sublayers), which improves training stability.
Feed-forward networks: Each Transformer layer includes a feed-forward network (FFN) that processes each position independently. Some architectures use gated linear units (GLU) or SwiGLU activations in place of standard ReLU for improved performance.
KV cache: During inference, previously computed key-value pairs are cached to avoid redundant computation, which is essential for efficient autoregressive generation.
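To illustrate how rotary embeddings encode relative position, the sketch below rotates adjacent pairs of vector components by a position-dependent angle, following the RoFormer formulation with the commonly used base of 10,000. Production implementations vectorize this and may pair dimensions differently; the function names here are illustrative.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of components by an angle that grows with position."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Both pairs of positions are 2 apart, so the attention scores agree.
print(dot(rope(q, 5), rope(k, 3)))
print(dot(rope(q, 12), rope(k, 10)))
```

Because the query-key dot product depends only on the offset between positions, the attention score carries relative rather than absolute position information.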

Tokenization

LLMs process text as tokens rather than individual characters or whole words. A token is typically a subword unit: common words like "the" are single tokens, while less frequent words may be split into multiple tokens. On average, one token corresponds to roughly 3/4 of a word in English. Tokenization is a foundational preprocessing step that bridges the gap between raw text and the model's numerical representations.

Tokenization algorithms

Algorithm	How it works	Used by	Key characteristic
Byte Pair Encoding (BPE)	Starts with individual characters and iteratively merges the most frequent adjacent pair until reaching target vocabulary size	GPT-2, GPT-3, GPT-4, LLaMA	Most popular; byte-level variant treats every possible byte as a basic unit
WordPiece	Similar to BPE but merges based on which pair maximizes the likelihood of the training data, not just frequency	BERT, DistilBERT, Electra	Tends to keep frequent words intact while splitting rare words
SentencePiece	Language-agnostic; treats input as a raw character stream, whitespace included, and learns subword units using BPE or Unigram algorithms	T5, LLaMA, many multilingual models	Works directly on raw text without language-specific preprocessing; uses special marker for word boundaries
Unigram	Starts with a large vocabulary and iteratively prunes the tokens whose removal reduces the training data likelihood the least	SentencePiece-based models, XLNet	Probabilistic approach; can assign multiple tokenizations to the same text

Byte-level BPE, used by models like GPT-2 and later, operates at the byte level rather than the character level. This ensures that any text (including emojis, non-Latin scripts, and special characters) can be tokenized without unknown tokens, since every input can be decomposed into its constituent bytes ^[34].
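The BPE merge procedure can be sketched as follows. This toy version works on characters and a hand-made word-frequency table (an assumption for illustration); a byte-level tokenizer would start from the 256 possible byte values instead, and real tokenizers add pre-tokenization rules and special tokens.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules, most frequent pair first."""
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        best = max(counts, key=counts.get)  # ties go to the first pair seen
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, words = train_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

The learned merge list is the tokenizer: encoding new text means applying the same merges in the same order.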

Vocabulary sizes for modern LLMs typically range from 32,000 to 256,000 tokens. Larger vocabularies reduce the average number of tokens needed to represent text (improving efficiency) but increase the size of the embedding layer. GPT-4 uses a vocabulary of approximately 100,000 tokens, while LLaMA 3 expanded to 128,000 tokens to improve multilingual performance.

How are LLMs trained?

The development of a modern LLM typically follows a multi-stage pipeline: pre-training, supervised fine-tuning (SFT), and alignment. Pre-training teaches the model language and world knowledge from raw text; SFT teaches it to follow instructions; and alignment shapes its behavior to be helpful, harmless, and honest.

Stage 1: Pre-training

During pre-training, the model is exposed to enormous quantities of text, learning to predict the next token given the preceding context. This self-supervised phase is by far the most computationally expensive step. GPT-3, for example, was trained on roughly 300 billion tokens ^[9]. More recent models use far more data: the LLaMA 3 family was trained on over 15 trillion tokens, which for the 8B model works out to roughly 1,875 tokens per parameter ^[17].
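The pre-training objective reduces to average cross-entropy on the true next token. The sketch below assumes the model's predictions are given as plain dictionaries of probabilities, which is a simplification of the logits a real network produces; perplexity is the exponential of the loss.

```python
import math


def next_token_loss(predicted_dists, targets):
    """Mean negative log-likelihood of each true next token."""
    nll = [-math.log(dist[t]) for dist, t in zip(predicted_dists, targets)]
    return sum(nll) / len(nll)


def perplexity(loss):
    return math.exp(loss)


# A model that spreads probability evenly over a 4-token vocabulary.
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
loss = next_token_loss([uniform, uniform, uniform], ["a", "c", "d"])
print(loss, perplexity(loss))

# A model that puts most of its mass on the right token scores a lower loss.
confident = {"a": 0.97, "b": 0.01, "c": 0.01, "d": 0.01}
print(next_token_loss([confident], ["a"]))
```

A uniform guess over four tokens has perplexity 4; training pushes probability toward the observed token and lowers the loss.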

Pre-training data typically includes web crawls (Common Crawl), books, Wikipedia, academic papers, code repositories (GitHub), and increasingly synthetic data. Common Crawl, a non-profit web archive that has been crawling the web since 2007, releases monthly snapshots of 200 to 400 TiB and is the standard public source ^[35]. Derivative datasets clean and deduplicate it: RefinedWeb (2023) produced 5 trillion English tokens and was used to train Falcon, and FineWeb (2024) distilled 15 trillion tokens from 96 Common Crawl snapshots ^[36]. Data quality matters enormously; deduplication, filtering, and careful curation of training data have been shown to significantly improve model performance relative to simply adding more data. Token budgets keep climbing: Qwen 2.5 was pretrained on 18 trillion tokens and Qwen3 on 36 trillion ^[37]^[27].

Pre-training requires massive compute infrastructure. LLaMA 4 was trained on a cluster of thousands of NVIDIA GPUs, Mistral Large 3 used approximately 3,000 H200 GPUs ^[28], and Llama 3.1 405B used more than 16,000 H100 GPUs ^[17]. Training runs for frontier models cost tens to hundreds of millions of dollars, though efficiency outliers exist: DeepSeek-V3's technical report gave a much-discussed figure of around $5.6 million in GPU-hour cost for its final pre-training run, a number that excluded prior research, failed experiments, and post-training ^[26]. Epoch AI estimates that training costs for frontier models have grown by a factor of 2 to 3 per year over the past eight years, with projections suggesting the largest models may cost over a billion dollars by 2027 ^[38].

Model	Year	Estimated training cost
GPT-3	2020	$4.6 million
PaLM (540B)	2022	~$8-12 million (estimates vary)
GPT-4	2023	$78-100+ million
Gemini Ultra 1.0	2023	~$192 million
DeepSeek-V3	2024	~$5.6 million (final run GPU-hours only)
GPT-5	2025	Undisclosed (est. $200M+)

Stage 2: Supervised fine-tuning (SFT)

After pre-training, the model is fine-tuned on a smaller, curated dataset of high-quality instruction-response pairs. Human annotators or AI systems write examples of ideal responses to various prompts, and the model is trained to mimic these responses. This stage, also called instruction tuning when the demonstrations follow an instruction-response format, transforms the base model from a raw text predictor into an assistant that can follow instructions and engage in conversation.

Stage 3: Alignment

Alignment techniques adjust the model's behavior to be helpful, harmless, and honest. The two dominant approaches are:

RLHF (Reinforcement Learning from Human Feedback): Introduced by OpenAI and refined by Anthropic, RLHF involves three sub-steps: (1) collecting human preference data by having annotators rank model outputs, (2) training a reward model to predict human preferences, and (3) using reinforcement learning (specifically PPO, Proximal Policy Optimization) to fine-tune the LLM to maximize the reward model's score ^[13]. Notable RLHF-trained models include ChatGPT, Claude, and Gemini.

DPO (Direct Preference Optimization): Introduced by Rafailov et al. in 2023, DPO simplifies alignment by eliminating the separate reward model and RL loop. Instead, it directly optimizes the LLM on preference pairs using a classification-style loss function, exploiting the fact that the optimal RLHF policy can be written in closed form as a function of the reward. DPO is simpler to implement and computationally cheaper than PPO-based RLHF, since it does not require training a separate reward model or sampling from the policy during training, and it has been shown to produce comparable results in many settings ^[39].
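The two preference losses can be compared side by side in a short sketch. The reward-model loss is the pairwise (Bradley-Terry style) objective used in RLHF step 2, and the DPO loss applies the same form to log-probability ratios between the policy and a frozen reference model. The inputs are assumed to be summed sequence log-probabilities, and the default `beta` of 0.1 is an illustrative choice, not a recommendation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise loss: push the chosen response's reward above the rejected one's."""
    return -log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, with no separate reward model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))


# Policy identical to the reference: the loss starts at log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

In DPO the implicit reward is the scaled log-ratio between policy and reference, which is why the reward model and RL loop can be dropped.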

Constitutional AI, published by Anthropic (Bai et al., December 2022), replaces most of the human harm-labeling step with model-generated critiques and revisions guided by a written constitution, and uses Reinforcement Learning from AI Feedback (RLAIF) to update the model ^[40].

Several newer methods extend this toolkit. Group Relative Policy Optimization (GRPO), introduced in the DeepSeekMath paper in February 2024 ^[75] and applied at scale in the DeepSeek-R1 work, dispenses with the separate critic model used in PPO: the model generates a group of candidate responses to a prompt, scores them with a reward function, and estimates the advantage from the relative scores within the group, significantly reducing memory requirements ^[26]. Reinforcement Learning with Verifiable Rewards (RLVR) uses rule-based or programmatic reward signals instead of a learned reward model: for math problems the reward is 1 if the final answer matches the ground truth and 0 otherwise, and for code it is whether the output passes test cases. Because verifiable rewards are less prone to reward hacking, larger-scale RL training can be performed with less risk of collapse; RLVR was central to DeepSeek-R1's training, and in the RL-only DeepSeek-R1-Zero run it improved AIME 2024 pass@1 from 15.6% to 71.0% ^[26]. Meta's LLaMA 4 uses a multi-round alignment process combining SFT, rejection sampling, PPO, and DPO ^[25]. Recent post-training rounds also add tool-use traces (function calling, code execution, web search), agentic behavior, and teacher-generated reasoning chains.
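The core of GRPO with a verifiable reward can be sketched as follows: score each sampled response with a rule, then standardize the scores within the group to get advantages. This is an illustration under simplifying assumptions (exact string match as the reward, population standard deviation, a small epsilon for stability) and leaves out the policy-gradient update, clipping, and KL penalty.

```python
import statistics


def verifiable_reward(model_answer, ground_truth):
    """RLVR-style reward: 1 if the final answer matches, else 0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical sampled answers to the same math prompt.
answers = ["42", "41", "7", " 42 "]
rewards = [verifiable_reward(a, "42") for a in answers]
print(rewards)
print(group_relative_advantages(rewards))
```

Responses that beat the group average get positive advantages and are reinforced. If every response earns the same reward, the advantages are zero and the prompt contributes no learning signal.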

Parameter-efficient fine-tuning

Full fine-tuning (updating all model parameters) is prohibitively expensive for most practitioners. Parameter-efficient fine-tuning (PEFT) methods enable adaptation of LLMs by modifying only a small fraction of the model's weights.

LoRA

LoRA (Low-Rank Adaptation), introduced by Hu et al. in 2021, injects trainable low-rank decomposition matrices into specific layers of the frozen pre-trained model ^[41]. Instead of updating a full weight matrix W of dimension d x d, LoRA learns two smaller matrices B (d x r) and A (r x d), where r is much smaller than d (typically 8 to 64). The effective weight is W + BA, which adds only 2dr trainable parameters while capturing task-specific adaptations. LoRA typically trains 0.1% to 1% of the original parameters.
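A small sketch makes the shapes and the parameter savings concrete. It takes B as d x r and A as r x d so that the product BA has the same d x d shape as W, starts B at zero as in the LoRA paper so the adapted model initially matches the base model, and omits the scaling factor that practical implementations apply to the update. Nested lists stand in for tensors.

```python
def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def lora_forward(W, A, B, x):
    """Compute (W + BA) x without ever forming the d x d product BA."""
    base = matvec(W, x)              # frozen pre-trained path
    delta = matvec(B, matvec(A, x))  # low-rank trainable path
    return [b + d for b, d in zip(base, delta)]


def lora_param_fraction(d, r):
    """Trainable adapter parameters relative to one d x d weight matrix."""
    return (2 * d * r) / (d * d)


d, r = 4, 2
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # identity
A = [[0.1] * d for _ in range(r)]   # r x d, small initial values
B = [[0.0] * r for _ in range(d)]   # d x r, zero at initialization
x = [1.0, 2.0, 3.0, 4.0]

print(lora_forward(W, A, B, x))      # equals W x while B is zero
print(lora_param_fraction(4096, 8))  # 0.00390625, about 0.4%
```

For a 4096 x 4096 matrix at rank 8, the adapter holds 65,536 parameters against 16.8 million in the frozen matrix.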

QLoRA

QLoRA (Quantized LoRA), introduced by Dettmers et al. in 2023, combines LoRA with aggressive quantization of the base model ^[42]. The pre-trained model weights are quantized to 4-bit precision using a new data type called NormalFloat4 (NF4), which is information-theoretically optimal for normally distributed weights. LoRA adapters are then trained in 16-bit precision on top of the frozen quantized base. Key innovations include double quantization (quantizing the quantization constants themselves) and paged optimizers to handle memory spikes. QLoRA makes it possible to fine-tune a 65-billion-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning performance.
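A back-of-the-envelope calculation shows why 4-bit storage matters. The helper below counts weight storage only, using decimal gigabytes, and ignores quantization constants, adapter weights, activations, and optimizer state, all of which add to the real footprint.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"65B parameters at {bits:>2}-bit: {weight_memory_gb(65e9, bits):6.1f} GB")
```

At 16 bits the base weights of a 65-billion-parameter model need about 130 GB, well beyond a single 48 GB GPU; at 4 bits they need about 32.5 GB, leaving headroom for adapters and activations.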

Other PEFT methods

Method	Approach	Typical parameters trained
Full fine-tuning	Updates all parameters	100%
LoRA	Low-rank adapter matrices	0.1-1%
QLoRA	LoRA on 4-bit quantized base	0.1-1%
DoRA	Decomposed weight-norm LoRA	~0.5%
Prefix tuning	Learnable prefix tokens prepended to each layer	<0.1%
Adapter layers	Small bottleneck modules inserted between layers	1-5%

With LoRA and QLoRA, practitioners can adapt a 7-billion-parameter model on a single consumer GPU in a few hours for roughly $10. Frameworks like LLaMA-Factory and Hugging Face's PEFT library integrate these methods into streamlined training pipelines.

Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a technique that enhances LLM outputs by retrieving relevant documents from an external knowledge base before generating a response ^[43]. RAG addresses several core LLM limitations: it provides access to up-to-date information beyond the training cutoff, reduces hallucination by grounding responses in retrieved evidence, and enables source attribution so users can verify claims.

A typical RAG pipeline involves three steps:

Indexing: Documents are split into chunks, converted to vector embeddings, and stored in a vector database.
Retrieval: When a user submits a query, the system retrieves the most relevant document chunks using semantic similarity search.
Generation: The retrieved chunks are appended to the prompt as context, and the LLM generates a response grounded in that evidence.
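The three steps can be sketched end to end with a toy retriever. This is a deliberately simplified illustration: word-count vectors stand in for learned embeddings, a Python list stands in for a vector database, and the final call to an LLM is left out, so the pipeline stops at the assembled prompt. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Indexing: store each chunk alongside its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Retrieval: rank chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, retrieved):
    """Generation input: retrieved chunks placed in the prompt as context."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


index = build_index([
    "Self-attention weighs tokens against each other.",
    "BPE merges the most frequent adjacent pairs of symbols.",
    "LoRA trains low-rank adapter matrices on a frozen model.",
])
query = "How does BPE merge pairs?"
print(build_prompt(query, retrieve(index, query, k=1)))
```

Swapping `embed` for a neural embedding model and the list for a vector store changes the retrieval quality but not the shape of the pipeline.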

RAG saw explosive research growth in 2024, with over 1,200 RAG-related papers published on arXiv compared to fewer than 100 the previous year ^[43]. Advanced variants include GraphRAG (Microsoft, 2024), which builds knowledge graphs from documents for more structured retrieval, and Agentic RAG, where an LLM-powered agent plans multi-step retrieval strategies before generating. For enterprise applications, RAG offers a cost-effective alternative to full fine-tuning: rather than retraining the model on proprietary data, organizations can simply index their documents and retrieve relevant passages at query time.

What are scaling laws for LLMs?

Scaling laws describe the predictable relationship between a model's performance (measured by loss on held-out data) and the resources used to train it: model size (parameters), dataset size (tokens), and compute (FLOPs). They are the reason it became possible to forecast that a bigger model trained on more data would be better before spending the money to train it.

Kaplan scaling laws

In January 2020, researchers at OpenAI (Kaplan et al.) published one of the first systematic studies of neural language model scaling. They found that "the loss scales as a power-law with model size, dataset size, and the amount of compute used for training, with some trends spanning more than seven orders of magnitude" ^[44]. Their work suggested that, given a fixed compute budget, model size should be prioritized over dataset size when scaling up ^[44].

Chinchilla scaling laws

In 2022, DeepMind's "Chinchilla" paper (Hoffmann et al.) challenged this view, reporting that "current large language models are significantly undertrained, a consequence of the recent focus on scaling language models whilst keeping the amount of training data constant" ^[45]. The researchers trained more than 400 models ranging from 70 million to 16 billion parameters on between 5 and 500 billion tokens, and concluded that for compute-optimal training, "for every doubling of model size the number of training tokens should also be doubled," a ratio of approximately 20 tokens per parameter ^[45]. The 70-billion-parameter Chinchilla model, trained on 1.4 trillion tokens with the same compute budget as the much larger 280-billion-parameter Gopher, "uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks" ^[45].
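The 20-tokens-per-parameter rule can be turned into a quick calculator. The sketch relies on the common approximation that training costs about 6 FLOPs per parameter per token, which is an assumption brought in here rather than a figure from the text above, and it treats the 20:1 ratio as exact.

```python
import math


def training_flops(params, tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def compute_optimal(flops, tokens_per_param=20.0):
    """Split a compute budget into model size and token count at a fixed ratio."""
    params = math.sqrt(flops / (6 * tokens_per_param))
    return params, tokens_per_param * params


budget = training_flops(70e9, 1.4e12)  # Chinchilla: 70B parameters, 1.4T tokens
params, tokens = compute_optimal(budget)
print(f"budget: {budget:.2e} FLOPs")
print(f"optimal: {params / 1e9:.0f}B parameters, {tokens / 1e12:.1f}T tokens")
```

Chinchilla's own configuration sits exactly on the 20:1 line, so feeding its budget back in recovers 70 billion parameters and 1.4 trillion tokens.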

Beyond Chinchilla

Subsequent research has pushed well beyond the Chinchilla-optimal ratio. Practitioners discovered that models intended for wide deployment benefit from being trained on far more tokens than Chinchilla recommends, because the marginal cost of additional training is small compared to the ongoing cost of serving a larger model to millions of users. The practical effect was that post-2022 models got smaller and trained on more data: Llama 2 70B was trained on 2 trillion tokens, while the LLaMA 3 family's 15-trillion-token corpus gives its smallest (8B) model a ratio of roughly 1,875 tokens per parameter, nearly 100 times the Chinchilla-optimal ratio; the 405B model trained on the same corpus sits near 37 tokens per parameter ^[17]. Research from Tsinghua University's MiniCPM project suggested a ratio of 192:1 may be more practical for many settings ^[46]. Loss continues to decrease well beyond the Chinchilla-optimal point, though with diminishing returns.
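The ratios quoted above follow from simple division, which the sketch below reproduces. The parameter and token counts are the ones given in the paragraph; the dictionary keys are just labels.

```python
CHINCHILLA_RATIO = 20  # approximate tokens per parameter

# (parameters, training tokens) as quoted in the paragraph above
models = {
    "Llama 2 70B": (70e9, 2e12),
    "Llama 3 8B": (8e9, 15e12),
    "Llama 3 405B": (405e9, 15e12),
}

def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params

for name, (params, tokens) in models.items():
    ratio = tokens_per_param(params, tokens)
    print(f"{name}: {ratio:,.0f} tokens/param ({ratio / CHINCHILLA_RATIO:.1f}x Chinchilla)")
```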

A 2024 paper from researchers at MosaicML ("Beyond Chinchilla-Optimal") formalized this intuition, showing that when inference costs are factored in, the optimal strategy is to train smaller models for longer than the original Chinchilla prescription ^[47]. The frontier later shifted again toward investing more in inference compute, a regime sometimes called test-time scaling.

What is the context window of an LLM?

The context window (or context length) is the maximum number of tokens the model can process in a single forward pass, including both the input prompt and the generated output. Larger context windows allow the model to work with longer documents, maintain coherence over extended conversations, and perform tasks like whole-codebase analysis or book-length summarization.

Context windows have grown by a factor of approximately 20,000 since 2018, from 512 tokens to 10 million tokens in LLaMA 4 Scout.

Model	Year	Context window
GPT-1	2018	512 tokens
GPT-2	2019	1,024 tokens
GPT-3	2020	2,048 tokens
GPT-3.5-Turbo	2023	16,384 tokens
GPT-4 Turbo	2023	128,000 tokens
Claude 3 Opus	2024	200,000 tokens
Gemini 1.5 Pro	2024	1,000,000 tokens
GPT-5	2025	400,000 tokens
Claude Opus 4.6	2026	1,000,000 tokens
Gemini 3.1 Pro	2026	1,000,000 tokens
LLaMA 4 Scout	2025	10,000,000 tokens

This expansion has been driven by algorithmic improvements (RoPE and its extensions like LongRoPE and YaRN), more efficient attention mechanisms (FlashAttention, sparse attention), and hardware advances in memory capacity. However, longer context windows introduce new challenges. Performance can degrade when relevant information is buried in the middle of a long document (the "lost in the middle" problem), and processing long contexts increases both latency and cost. KV-cache memory usage grows linearly with sequence length, making million-token contexts expensive to serve at scale. The practical bottleneck has accordingly shifted from advertised window size to the model's actual ability to use information deep inside the context reliably.
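The linear growth of the KV cache can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a standard attention layout in which every layer stores one key and one value vector per KV head per token; the configuration values are illustrative for an 8B-class model with grouped-query attention and are not taken from this article.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2, batch=1):
    """Bytes needed to cache keys and values for `seq_len` tokens.

    The leading 2 covers one key and one value vector per layer and KV head.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len * batch

# Hypothetical 8B-class configuration (illustrative values)
config = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for seq_len in (8_192, 128_000, 1_000_000):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    print(f"{seq_len:>9,} tokens -> {gib:6.1f} GiB of 16-bit KV cache")
```

Under these assumptions each token costs 128 KiB of cache, so a single million-token request needs on the order of 120 GiB before any weights are loaded.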

Emergent abilities

One of the most discussed phenomena in LLM research is the concept of emergent abilities: capabilities that appear in larger models but are absent or negligible in smaller ones. Examples include the ability to perform multi-step arithmetic, follow complex instructions, and reason about abstract concepts.

In-context learning

LLMs can learn new tasks from examples provided directly in the prompt, without any weight updates. This capability, known as in-context learning (ICL), scales with model size and context length. With expanded context windows, "many-shot" in-context learning (providing hundreds or thousands of examples rather than just a few) has shown significant performance gains across generative and discriminative tasks. A 2024 paper on many-shot ICL was accepted as a Spotlight Presentation at NeurIPS 2024, documenting performance improvements across a wide variety of tasks ^[48].

Chain-of-thought reasoning

Chain-of-thought (CoT) prompting guides LLMs to break complex problems into intermediate reasoning steps. By prefacing a prompt with "Let's think step by step" or providing worked examples, models produce more accurate answers on math, logic, and science problems. This capability emerges primarily in models above approximately 100 billion parameters and is the foundation for dedicated reasoning models like OpenAI's o1/o3 and DeepSeek R1.

The emergence debate

The existence and nature of emergent abilities is debated. Wei et al. (2022) documented numerous tasks where performance appeared to jump discontinuously at certain model scales ^[49]. However, Schaeffer et al. (2023) argued that apparent emergence may be an artifact of the choice of evaluation metric; when smooth, continuous metrics are used instead of sharp accuracy thresholds, performance improvements look gradual rather than sudden ^[50].
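A toy calculation shows how the choice of metric can manufacture an apparent jump. The sketch assumes, purely for illustration, that per-token errors are independent, so the chance of getting a 10-token answer exactly right is the per-token accuracy raised to the 10th power; the accuracy values are hypothetical.

```python
def exact_match(per_token_acc, answer_len):
    """Probability that every token is right, assuming independent per-token errors."""
    return per_token_acc ** answer_len

# Hypothetical per-token accuracies for a series of increasingly capable models
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

for p in per_token:
    print(f"per-token {p:.2f} -> exact match on 10 tokens {exact_match(p, 10):.3f}")
```

The per-token column rises steadily, while the exact-match column stays near zero for most of the range and then climbs steeply, which is the pattern the metric-artifact argument predicts.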

Regardless of the theoretical debate, it is empirically clear that larger and better-trained models can perform tasks that smaller models cannot. The practical question for researchers and engineers is whether a given capability requires a model above a certain size threshold or whether clever training techniques (better data, improved architectures, distillation) can bring that capability to smaller models.

Reasoning models and test-time compute

Reasoning models are LLMs trained specifically to spend more computation at inference time by generating extended chains of thought before producing a final answer. OpenAI's o1 was the first widely available example; it and its successors (o3, o4-mini) generate "thinking tokens" that are not shown to the user but allow the model to work through intermediate steps, backtrack when it detects errors, and approach problems more methodically.

Test-time compute scaling refers to the finding that, for reasoning-trained models, performance on hard problems improves with more inference-time computation, whether through longer reasoning chains or through sampling multiple solutions and choosing the best. This creates a second scaling axis beyond model parameters and training tokens: a smaller reasoning model given a larger compute budget at inference can match a larger model that generates answers directly.
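The "sample several solutions and choose the best" strategy can be illustrated with majority voting over sampled answers. In this sketch `noisy_solver` is a stand-in for one sampled reasoning chain, not a real model call, and its 60% per-sample accuracy is an arbitrary assumption.

```python
import random
from collections import Counter

def noisy_solver(rng, correct="42", p_correct=0.6):
    """Stand-in for one sampled reasoning chain: right with probability p_correct."""
    if rng.random() < p_correct:
        return correct
    return rng.choice(["41", "43", "44"])

def majority_vote(sample_fn, n_samples):
    """Sample n answers and return the most common one."""
    answers = [sample_fn() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of simulated questions answered correctly with n samples each."""
    rng = random.Random(seed)
    hits = sum(majority_vote(lambda: noisy_solver(rng), n_samples) == "42"
               for _ in range(trials))
    return hits / trials

for n in (1, 5, 25):
    print(f"{n:>2} samples per question -> accuracy {accuracy(n):.2f}")
```

Accuracy rises with the number of samples because wrong answers are scattered while correct ones agree. The same weights are used throughout, so the gain is paid for entirely in inference compute.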

DeepSeek-R1 showed the recipe could be reproduced openly: Group Relative Policy Optimization (GRPO) with reinforcement learning from verifiable rewards (RLVR), applied to the DeepSeek-V3 base, yielded an MIT-licensed open-weight model with reasoning capability matching o1 ^[26]. Qwen3 introduced hybrid thinking/non-thinking modes within a single model family, letting users toggle extended reasoning on or off per request ^[27], and Google's Gemini 2.5 Deep Think mode uses parallel thinking, generating many candidate reasoning paths simultaneously before selecting the best answer ^[23].

The limitations of test-time scaling have also become clearer: extended reasoning does not reliably improve performance on knowledge-intensive tasks requiring factual accuracy, and models can reach a correct intermediate step and then deviate toward an incorrect conclusion during prolonged reasoning chains.

Inference optimization

As LLMs grow larger, efficient inference becomes increasingly important. A 2025 ACL study found that proper inference optimization techniques can reduce energy usage by up to 73% compared to naive serving, typically translating to a 2-3x reduction in cloud costs ^[51].

Sampling and decoding

Generating text from an LLM is a token-by-token loop. At each step, the model produces a probability distribution over the vocabulary, a sampling rule picks one token, and the new token is appended to the context for the next step. The main sampling controls are:

Parameter	Effect
Temperature	Sharpens (low) or flattens (high) the next-token distribution; 0 reduces to greedy decoding
Top-k	Restricts sampling to the k highest-probability tokens
Top-p (nucleus)	Restricts sampling to the smallest set of tokens whose cumulative probability exceeds p
Min-p	Drops tokens whose probability is below a fraction of the most likely token
Beam search	Maintains multiple candidate sequences and keeps the highest-scoring overall
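The sampling controls in the table can be sketched as filters applied to the next-token distribution. This is a minimal pure-Python illustration, not any particular library's implementation. The function names are hypothetical, beam search is omitted, and conventions differ between libraries on details such as whether the top-p cutoff is inclusive.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_probs(probs, top_k=None, top_p=None, min_p=None):
    """Zero out tokens excluded by top-k, top-p, or min-p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)
    if top_k is not None:
        keep &= set(order[:top_k])
    if top_p is not None:
        nucleus, cumulative = [], 0.0
        for i in order:                      # add tokens until the mass reaches top_p
            nucleus.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep &= set(nucleus)
    if min_p is not None:
        threshold = min_p * probs[order[0]]  # fraction of the most likely token
        keep &= {i for i in order if probs[i] >= threshold}
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def sample(logits, temperature=1.0, rng=random, **filters):
    """Pick one token id; temperature 0 is treated as greedy decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = filter_probs(softmax(logits, temperature), **filters)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical scores for a 4-token vocabulary
print(sample(logits, temperature=0))
print(filter_probs(softmax(logits), top_p=0.8))
```

In a generation loop, `sample` would be called once per step and the chosen token appended to the context before the next forward pass.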

Quantization

Quantization reduces the numerical precision of model weights from their training precision (typically 16-bit floating point) to lower-bit representations such as 8-bit, 4-bit, or even 2-bit. Relative to 16-bit weights, this cuts memory usage by 2-8x (2x at 8-bit, 4x at 4-bit, 8x at 2-bit) with modest quality loss, making models that would otherwise require multiple high-end GPUs runnable on consumer hardware, and can speed up inference significantly. NVIDIA's NVFP4 format, for instance, enables 4-bit quantization with minimal accuracy loss, delivering up to 4x throughput improvement on B200 GPUs compared to FP8 on H100 ^[52]. Common post-training quantization methods include GPTQ and AWQ; GGUF is a file format, used by llama.cpp, for distributing quantized models.
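The basic idea can be sketched with symmetric round-to-nearest quantization, the simplest scheme. Methods such as GPTQ and AWQ are considerably more sophisticated, and this sketch does not implement them. The sample weights are made up, and the memory figure counts weights only.

```python
def quantize(weights, bits=8):
    """Symmetric absmax quantization: map floats to signed integers plus one scale."""
    qmax = 2 ** (bits - 1) - 1                                # 127 for 8-bit, 7 for 4-bit
    scale = (max(abs(w) for w in weights) / qmax) or 1.0      # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the shared scale."""
    return [v * scale for v in q]

def weight_memory_gb(n_params, bits):
    """Memory for the weights alone, ignoring scales and other overhead."""
    return n_params * bits / 8 / 1e9

weights = [0.12, -0.53, 0.98, -0.07, 0.31]   # hypothetical weight values
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit: ints={q}, worst error={worst:.3f}")

for bits in (16, 8, 4, 2):
    print(f"70B params at {bits}-bit: {weight_memory_gb(70e9, bits):.1f} GB")
```

The rounding error grows as the bit width shrinks, while the weight footprint falls in direct proportion to it.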

Speculative decoding

Speculative decoding uses a small, fast "draft" model to generate candidate tokens, which are then verified in parallel by the larger target model. Since the large model can verify multiple tokens simultaneously (a single forward pass over several positions), this approach achieves 2-3x speedups without changing the output distribution. It works best when the draft model's distribution closely matches the target's, which holds for models of the same family at different sizes ^[53]. NVIDIA's TensorRT-LLM demonstrated up to 3.55x throughput improvement with Llama 3.3 70B using speculative decoding ^[54].
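The reason the output distribution is unchanged comes down to the accept-or-resample rule used in the published speculative sampling algorithms. The sketch below simulates that rule for a single token position with fixed, made-up distributions; a real system drafts several tokens and scores them in one target-model forward pass, which is omitted here.

```python
import random

def speculative_step(p, q, rng):
    """Draft one token from q, then accept or correct it so the result follows p.

    p: target-model probabilities, q: draft-model probabilities (same vocabulary).
    Returns (token, accepted).
    """
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q, k=1)[0]            # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):             # accept with prob min(1, p/q)
        return x, True
    # On rejection, resample from the normalized positive part of (p - q)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0], False

p = [0.6, 0.3, 0.1]   # hypothetical target distribution
q = [0.4, 0.4, 0.2]   # hypothetical draft distribution
rng = random.Random(0)
n = 50_000
counts = [0, 0, 0]
accepted = 0
for _ in range(n):
    token, ok = speculative_step(p, q, rng)
    counts[token] += 1
    accepted += ok
print([round(c / n, 2) for c in counts], f"acceptance rate {accepted / n:.2f}")
```

The empirical token frequencies track `p` rather than `q`. The acceptance rate equals the overlap between the two distributions, which is why a closely matched draft model gives larger speedups.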

KV-cache optimization

Techniques like PagedAttention (used in vLLM) manage the key-value cache more efficiently, reducing memory waste during batched inference. NVFP4 KV cache quantization can cut KV cache memory by up to 50%, effectively doubling context budgets and unlocking larger batch sizes ^[52].

Continuous batching

Traditional static batching waits for a batch of requests to complete before starting a new batch, leaving GPUs idle. Continuous batching (also called in-flight batching) allows new requests to enter mid-batch and completed requests to exit immediately, dramatically improving GPU utilization and throughput.
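A small simulation makes the utilization difference visible. It counts decode steps only, treats every step as equally expensive, and uses made-up request lengths; real schedulers also account for prefill and memory limits.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next waiting request immediately."""
    waiting = list(lengths)
    running = []          # remaining decode steps for each in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        running = [r - 1 for r in running if r > 1]
    return steps

# Hypothetical output lengths (in tokens) for eight queued requests
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print("static:", static_batching_steps(lengths, 4))
print("continuous:", continuous_batching_steps(lengths, 4))
```

With static batching, each short request waits for the 100-token request in its batch. With continuous batching, the second long request starts as soon as a slot opens, so the two long generations overlap.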

Serving frameworks

vLLM, released in 2023, introduced PagedAttention to manage the KV cache as pages of virtual memory, dramatically reducing memory fragmentation; it became the dominant open-source serving framework and supports speculative decoding, tensor parallelism, and most major model families ^[53]. SGLang, developed at UC Berkeley, uses RadixAttention to cache and reuse KV states across requests that share a common prefix; benchmarks show roughly 29% higher throughput than vLLM on 7-8B models on H100 GPUs, with the gap narrowing to 3-5% on 70B+ models. TensorRT-LLM (NVIDIA) and TGI (Hugging Face) round out the major serving options; TensorRT-LLM achieves the highest raw token throughput on NVIDIA hardware through custom CUDA kernels.

Other techniques

Knowledge distillation: Training a smaller "student" model to replicate the behavior of a larger "teacher" model, producing compact models suitable for edge deployment.
Pruning: Removing less important weights or attention heads to reduce model size.
Sparse attention: Methods like DeepSeek's Fine-Grained Sparse Attention selectively compute attention only over relevant parts of the context, improving efficiency by up to 50% for long sequences ^[26].
FlashAttention: A memory-efficient exact attention algorithm that reduces the number of memory reads/writes by tiling the computation, achieving 2-4x speedups over standard attention.
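The tiling idea behind FlashAttention rests on computing softmax incrementally, one block of keys at a time, while keeping a running maximum and running sum. The sketch below shows that for a single query vector in plain Python and checks it against the standard computation. It omits the 1/sqrt(d) scaling and everything that makes the real algorithm fast (a fused GPU kernel working on tiles held in on-chip memory), and the input values are arbitrary.

```python
import math

def naive_attention(q, keys, values):
    """Standard softmax attention for one query vector."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def tiled_attention(q, keys, values, block=2):
    """Same result, computed block by block with a running max and running sum."""
    dim = len(values[0])
    m, denom, acc = -math.inf, 0.0, [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)        # corrects earlier blocks for the new max
        weights = [math.exp(s - m_new) for s in scores]
        denom = denom * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        m = m_new
    return [a / denom for a in acc]

q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention(q, keys, values))
print(tiled_attention(q, keys, values, block=2))
```

Because only one block of scores exists at a time, the full score matrix never has to be written out, which is where the savings in memory traffic come from.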

Evaluation benchmarks

No single number captures LLM quality. The benchmark stack used in 2025-2026 includes:

Benchmark	Domain	Notes
MMLU	57 academic subjects, multiple choice	Frontier models exceed 88%; largely saturated ^[55]
GPQA Diamond	Expert biology, chemistry, physics (198 questions)	Skilled non-experts score only ~22% even with web access ^[76]; top models exceed 85%
HumanEval	164 Python coding problems, unit-tested	Top models exceed 90% pass@1
SWE-bench Verified	Real GitHub issues, patch must pass project tests	Gold standard for agentic coding; GPT-5 hit 74.9% ^[20]
GSM8K	Grade-school math word problems	Near-saturated; top models exceed 95%
MATH	Competition-level math	Harder than GSM8K; still discriminating
AIME 2025	American Invitational Mathematics Examination (US olympiad qualifier)	GPT-5 achieved 94.6% without tools ^[20]
ARC-AGI	Abstract visual grid reasoning	Tests general intelligence; GPT-5.5 scored 95.0% ^[56]
Humanity's Last Exam (HLE)	2,500 expert questions across 100+ subjects	Early 2025 models scored under 22%; Grok 4 reached 24% ^[57]
FrontierMath	Research-level mathematics	GPT-5.2 Thinking solved 40.3% on tiers 1-3
BIG-Bench Hard	Reasoning and knowledge tasks	Broad collection for probing model capability
TruthfulQA / HaluEval	Hallucination and truthfulness	Adversarial truthfulness evaluation

Benchmark saturation is a chronic problem. MMLU reached near-ceiling scores by 2024 ^[55]. The reaction has been to introduce harder benchmarks (GPQA Diamond, FrontierMath, Humanity's Last Exam) and to lean on agentic, real-world evaluations like SWE-bench Verified that are harder to game with narrow optimization.
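Coding benchmarks such as HumanEval report pass@k, the probability that at least one of k sampled solutions passes the unit tests. The sketch below uses the unbiased estimator introduced with HumanEval (Chen et al., 2021), which is not described elsewhere in this article; the sample counts are made up.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of pass@k from n samples, of which c passed the tests."""
    if n - c < k:
        return 1.0          # too few failures to fill k draws: some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 10 samples generated, 3 passed
for k in (1, 5):
    print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

Averaging this quantity over all problems in the benchmark gives the headline score; pass@1 is the strictest setting because the model gets a single attempt.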

Are LLMs open source?

The LLM ecosystem is split between proprietary (closed) models and open-weight (open) models, with ongoing debate about the advantages of each approach. Strictly speaking, very few LLMs are "open source" in the traditional sense: most open releases publish the trained weights but not the training data or full training code, which is why the term "open weights" is now preferred.

Closed models

Closed models like GPT-5, Claude, and Gemini are developed by companies that do not release the model weights. Users access them through APIs or chat interfaces. Advantages include strong safety measures, regular updates, and state-of-the-art performance. Drawbacks include vendor lock-in, limited customization, unpredictable pricing changes, and data privacy concerns (since user inputs are sent to third-party servers).

Open-weight models

Open-weight models like Llama 4, Mistral Large 3, DeepSeek V3, and Qwen3 release their trained weights for anyone to download and run. This allows full customization, fine-tuning for specific domains, and local deployment without sending data to external servers. Other notable open-weight families include Google's Gemma (distilled from Gemini) and the Technology Innovation Institute's Falcon series (7B to 180B parameters), trained on RefinedWeb ^[36]. Licenses span a spectrum from permissive (Apache 2.0 for many Mistral and Qwen releases) to bespoke and restrictive (the Llama Community License, Gemma terms). Meta widely described Llama 2 as "open source," but the Open Source Initiative has argued that the term is misleading for models whose licenses impose redistribution and use limits, preferring the label "open weights" ^[58]; the OSI's Open Source AI Definition (OSAID) 1.0, published in October 2024, sets formal criteria that most open-weight models fail because they do not release their training data ^[59].

DeepSeek-V3 and R1 marked a turning point: the first time a freely downloadable open-weight model from outside the United States matched the reasoning quality of frontier closed models on widely cited benchmarks, while reportedly using a much smaller training budget ^[26]. This intensified an already active debate about whether open weights are a safety risk (because alignment training can be undone with cheap fine-tuning) or a safety asset (because the wider research community can study and patch the models).

By early 2026, the performance gap between open and closed models has narrowed substantially. Open-weight models trail proprietary frontier models by only about three months on average across standard benchmarks ^[60]. However, closed models maintain a lead on complex agentic tasks, production-quality coding benchmarks (SWE-bench), and overall human preference ratings on platforms like Chatbot Arena. For domain-specific applications such as legal document analysis or medical coding, a fine-tuned 7B open model can often outperform a general-purpose frontier model while running on a single consumer GPU.

Notable models

Parameter counts, where reported, are total parameters; context windows are at the standard pricing tier where applicable.

Landmark models (2018-2025)

Model	Provider	Released	Parameters	Context	License	Notes
BERT base/large	Google	Oct 2018	110M / 340M	512	Apache 2.0	Encoder-only, masked LM ^[8]
GPT-2	OpenAI	2019	1.5B (largest)	1024	MIT (weights)	Staged release; full 1.5B weights released Nov 2019 ^[7]
T5 (11B)	Google	Oct 2019	11B	512	Apache 2.0	Text-to-text encoder-decoder ^[11]
GPT-3	OpenAI	May 2020	175B	2048	API only	Demonstrated in-context few-shot learning ^[9]
InstructGPT	OpenAI	Mar 2022	1.3B / 6B / 175B	2048	API only	First major RLHF deployment ^[13]
ChatGPT	OpenAI	Nov 2022	not disclosed	4096 (initial)	Product	Brought LLMs to general public
GPT-4	OpenAI	Mar 2023	not disclosed	8K / 32K	API only	Multimodal vision, no published params ^[14]
Llama 2	Meta	Jul 2023	7B / 13B / 70B	4096	Llama 2 Community	First weights-available chat-tuned Llama ^[58]
Mistral 7B	Mistral AI	Sep 2023	7.3B	8192	Apache 2.0	Strong small dense model
Mixtral 8x7B	Mistral AI	Dec 2023	46.7B (12.9B active)	32K	Apache 2.0	Sparse MoE ^[30]
Gemini 1.0	Google DeepMind	Dec 2023	not disclosed	32K	API only	Native multimodal training
GPT-4o	OpenAI	May 2024	not disclosed	128K	API only	Native text, audio, image I/O ^[16]
Gemma 2	Google	Jun 2024	2B / 9B / 27B	8192	Gemma terms	Open-weight, distilled from Gemini
Llama 3.1	Meta	Jul 2024	8B / 70B / 405B	128K	Llama 3 Community	405B trained on 15T+ tokens, 16K H100s ^[17]
Qwen 2.5	Alibaba	Sep 2024	0.5B to 72B	up to 128K	Apache 2.0 (most)	Pretrained on 18T tokens ^[37]
DeepSeek-R1	DeepSeek	Jan 2025	671B (37B active)	128K	MIT (weights)	RL-trained reasoning model on V3 base ^[26]
Gemini 2.5 Pro	Google DeepMind	Mar 2025	not disclosed	1M	API only	Thinking model; Deep Think variant ^[23]
Llama 4 Scout	Meta	Apr 2025	109B (17B active)	10M	Llama 4 Community	Natively multimodal MoE, 16 experts ^[25]
Llama 4 Maverick	Meta	Apr 2025	400B (17B active)	1M	Llama 4 Community	128 experts, natively multimodal MoE ^[25]
Qwen3 235B-A22B	Alibaba	Apr 2025	235B (22B active)	131K	Apache 2.0	Hybrid thinking/non-thinking, 36T tokens ^[27]
GPT-4.1	OpenAI	Apr 2025	not disclosed	1M	API only	54.6% on SWE-bench Verified ^[19]
Claude Opus 4	Anthropic	May 2025	not disclosed	200K	API only	Released alongside Sonnet 4 ^[21]
DeepSeek-V3.1	DeepSeek	Aug 2025	685B	128K	MIT (weights)	Hybrid thinking/non-thinking mode

Frontier model comparison (as of mid-2026)

Model	Developer	Release date	Total parameters	Active parameters	Context window	Architecture	License
GPT-5	OpenAI	Aug 2025	Undisclosed	Undisclosed	400K tokens	Decoder-only	Proprietary
Claude Opus 4.6	Anthropic	Feb 2026	Undisclosed	Undisclosed	1M tokens	Decoder-only	Proprietary
Gemini 3.1 Pro	Google DeepMind	Feb 2026	Undisclosed	Undisclosed	1M tokens	Decoder-only	Proprietary
LLaMA 4 Maverick	Meta	Apr 2025	400B	17B	1M tokens	MoE	Open-weight (Llama license)
Mistral Large 3	Mistral AI	Dec 2025	675B	41B	256K tokens	MoE	Apache 2.0
DeepSeek V3	DeepSeek	Dec 2024	671B	37B	128K tokens	MoE	Open-weight (MIT)
DeepSeek R1	DeepSeek	Jan 2025	671B	37B	128K tokens	MoE	Open-weight (MIT)
DeepSeek V4 Pro	DeepSeek	Apr 2026	1.6T	49B	1M tokens	MoE	Open-weight (MIT) ^[77]
Claude Fable 5	Anthropic	Jun 2026	Undisclosed	Undisclosed	1M tokens	Undisclosed	Proprietary ^[78]
GLM-5.2	Zhipu AI	Jun 2026	~750B	~40B	1M tokens	MoE	Open-weight (MIT) ^[79]

Frontier labs

The frontier LLM market is concentrated among a small number of well-funded organizations with access to large GPU clusters and proprietary training data.

OpenAI

OpenAI, founded in 2015 and based in San Francisco, released the GPT series and ChatGPT, which catalyzed mainstream adoption. In an October 28, 2025 recapitalization the capped-profit structure was eliminated: the for-profit arm became OpenAI Group PBC, a public benefit corporation controlled by the nonprofit OpenAI Foundation, with Microsoft holding roughly 27% ^[80]. ChatGPT surpassed 800 million weekly users in October 2025 ^[81]. The o-series reasoning models (o1, o3, o4-mini) form a separate product line optimized for test-time compute scaling, and GPT-5.5 achieved 84.9% on the GDPval knowledge-work benchmark and led the ARC-AGI leaderboard at 95.0% ^[56].

Anthropic

Anthropic, founded in 2021 by former OpenAI researchers including Dario Amodei and Daniela Amodei, focuses on AI safety research alongside model development; its Claude family uses Constitutional AI and RLAIF for alignment. Claude 3 Opus briefly held the top spot on multiple benchmarks when released in March 2024, and Claude Opus 4.7 (2025-2026) scored 87.6% on SWE-bench Verified, leading on agentic coding benchmarks. In June 2026 Anthropic released Claude Fable 5, its first generally available Mythos-class model (a capability tier above Opus), followed by Claude Sonnet 5 on June 30, 2026 as the new default model for free and Pro users ^[78]^[82].

Google DeepMind

Google DeepMind, formed through the 2023 merger of Google Brain and DeepMind, trains the Gemini family, distributed through the Gemini consumer product, Google Cloud Vertex AI, and the Gemini API; the open-weight Gemma family provides smaller models under permissive terms.

Meta AI

Meta AI releases its Llama family as open weights under its own community license, making Meta the dominant provider of open-weight base models. Its strategic motivation is partly to prevent proprietary models from controlling AI infrastructure costs for Meta's own products.

xAI

xAI, founded by Elon Musk in 2023, trains the Grok series on its Colossus supercluster. Grok 3 (February 2025) was trained with 10x the compute of previous xAI models and achieved 84.6% on GPQA Diamond ^[61]; Grok 4 (mid-2025) set then-record scores on GPQA Diamond (88%) and Humanity's Last Exam (24%), achieving an Artificial Analysis Intelligence Index of 73, ahead of competing frontier models at the time.

DeepSeek

DeepSeek, a Chinese AI lab affiliated with the quantitative hedge fund High-Flyer, released the MIT-licensed V3 and R1 models that sparked the 2025 debate about AI training economics. DeepSeek-V3.1 followed in August 2025 with hybrid thinking mode, and DeepSeek-V3.2 later reportedly matched GPT-5 on several benchmarks. In April 2026 the lab shipped the MIT-licensed DeepSeek V4 family in V4-Pro (1.6 trillion total parameters, 49 billion active per token) and V4-Flash (284 billion total, 13 billion active) variants, both with 1-million-token context windows ^[77].

Mistral AI

Mistral AI, a French startup founded in 2023 by former Google DeepMind and Meta researchers, focuses on efficient open models. Its Mistral 7B and Mixtral 8x7B were widely adopted in the open-source community; Codestral targets code generation and Mistral Large targets enterprise use.

Alibaba (Qwen team)

Alibaba's Qwen team produces the Qwen family, covering sizes from 0.5B to over 235B parameters with strong multilingual coverage (Qwen3 supports 119 languages ^[27]). Qwen 2.5-72B-Instruct was reported to compete with Llama 3.1 405B-Instruct, which has roughly five times its parameter count ^[37].

What are LLMs used for?

Modern LLMs demonstrate a broad range of capabilities that have expanded significantly with each generation.

Text generation and conversation: LLMs can produce fluent, coherent text on virtually any topic. They power chatbots, writing assistants, and content generation tools used by millions of people daily.

Reasoning and problem-solving: Frontier models can perform multi-step logical reasoning, solve mathematical problems, and pass standardized exams. GPT-5 scored 94.6% on the AIME 2025 math benchmark without tools, and reasoning-focused models like OpenAI's o1 and DeepSeek R1 can tackle complex problems using extended chain-of-thought processing ^[20]^[26].

Code generation: LLMs have become powerful programming assistants. Claude Sonnet 4.5 achieved 77.2% on SWE-bench Verified (a benchmark of real-world software engineering tasks), and models can write, debug, refactor, and explain code in dozens of programming languages ^[22].

Translation: LLMs perform high-quality translation between many language pairs, often rivaling or exceeding dedicated machine translation systems. Llama 4 was trained on data covering over 200 languages ^[25].

Summarization: Models can condense long documents into concise summaries while preserving key information, a capability that improves substantially with larger context windows.

Agentic behavior: A significant development in 2025-2026 has been the emergence of agentic LLMs that can plan multi-step tasks, use external tools, browse the web, write and execute code, and interact with computer interfaces. Two protocols are establishing standards here: Anthropic's Model Context Protocol (MCP) covers how agents connect to external tools and APIs, and Google's Agent2Agent (A2A) protocol covers how agents communicate with one another ^[62]. Gartner predicts that 40% of enterprise applications will embed AI agents by the end of 2026.

Multimodal models

Multimodal LLMs extend the standard text-only framework by accepting, and often generating, non-text modalities. GPT-4 introduced image understanding in 2023, and by 2025 native multimodal training (jointly on text, images, and video) had become standard for frontier models.

Vision-language models

GPT-4V (November 2023) and GPT-4o accept images as part of the prompt, enabling tasks like chart interpretation, document understanding, and visual question answering. Gemini was designed from the start to be natively multimodal, trained jointly on text, images, audio, and video rather than adding vision as a bolt-on capability. Claude 3 (March 2024) added vision across all model tiers, and Claude Opus 4.7 (2025-2026) features a 3x jump in image resolution, reaching 2,576px for professional-grade visual analysis.

Open-weight vision-language models became highly capable through 2024-2025: the LLaVA, InternVL, and Qwen-VL families achieved GPT-4V-level performance, and Meta's Llama 4 models are jointly pretrained on text, image, and video tokens ^[25].

Audio and speech

GPT-4o extended the multimodal stack to native audio input and output, enabling near-real-time voice conversations ^[16], and Gemini 1.5 Pro supports audio as a native input modality within its long-context window. Specialized audio models such as Whisper (OpenAI, 2022) handle speech-to-text transcription upstream of text-only models.

Video understanding

Gemini 1.5 Pro and 2.0 support video input directly within the context window, enabling temporal reasoning over hours of footage. Several open-source video-language models (LLaVA-Video, InternVideo) followed in 2024-2025.

Applications

LLMs have found applications across nearly every sector of the economy.

Sector	Applications	Examples
Software development	Code generation, debugging, testing, refactoring	GitHub Copilot, Cursor, Claude Code
Customer service	Chatbots, virtual assistants, ticket routing	ChatGPT Enterprise, Intercom Fin
Healthcare	Clinical documentation, literature review, patient communication	Med-PaLM, ambient scribes
Legal	Contract analysis, legal research, document drafting	Harvey AI, CoCounsel
Education	Personalized tutoring, grading, content generation	Khan Academy Khanmigo, Duolingo
Scientific research	Literature review, hypothesis generation, data analysis	Elicit, Consensus
Finance	Sentiment analysis, compliance, report generation	BloombergGPT, FinGPT
Content creation	Writing assistance, marketing copy, creative writing	Jasper, Copy.ai

Agentic coding capability has improved especially rapidly: SWE-bench resolution rates went from under 5% in 2023 to 74.9% in 2025 ^[20], and autonomous coding agents now tackle multi-file refactors and resolve real-world GitHub issues without step-by-step guidance. Orchestration frameworks such as LangChain and LlamaIndex automate retrieval and tool use, and multi-agent systems assign different roles to different model instances to decompose complex tasks.

Industry analysts project that the agentic AI market will grow from $7.8 billion in 2025 to over $52 billion by 2030 ^[62].

What are the limitations of LLMs?

Despite rapid progress, LLMs face several fundamental limitations. The most important are hallucination (confident but false output), reasoning failures on novel problems, learned bias, fixed knowledge cutoffs, and vulnerability to misuse and prompt injection.

Hallucinations

LLMs sometimes generate plausible but factually incorrect information, a phenomenon known as hallucination. Theoretical work has shown that hallucination is an inherent property of LLMs and cannot be completely eliminated through architecture, data, or algorithmic improvements alone ^[63]. The problem stems from the fact that LLMs learn statistical patterns rather than grounding their knowledge in verified facts: the model is rewarded for producing plausible-sounding text, not for refusing to answer when uncertain, so it will fabricate citations, invent code that calls non-existent functions, and confidently give wrong answers in long-tail domains. On constraint satisfaction tasks, hallucination rates scale linearly with problem complexity.

Mitigation approaches include RAG (grounding responses in retrieved documents), chain-of-verification (having the model check its own outputs), and calibrated uncertainty (systems that transparently signal doubt and can safely refuse to answer rather than guessing). A 2025 multi-model study showed that simple prompt-based mitigation cut GPT-4o's hallucination rate from 53% to 23% ^[63]. While frontier models have reduced hallucination rates significantly (OpenAI reported that GPT-5's responses were about 45% less likely than GPT-4o's to contain a factual error, and about 80% less likely than o3's when thinking ^[20]), the problem persists.

Reasoning failures

While LLMs have improved substantially at reasoning tasks, they still fail on problems that require genuine logical deduction, spatial reasoning, or common sense in unfamiliar contexts. State-of-the-art models perform poorly on certain clinical reasoning tasks and can struggle with novel problem formulations that differ from their training distribution ^[64]. Reasoning models (o1, DeepSeek R1) have partially addressed this through extended chain-of-thought processing, but at the cost of significantly increased inference time and expense.

Bias

Since LLMs are trained on internet text, they can learn and reproduce societal biases present in the training data. These biases can manifest in harmful stereotypes, uneven performance across languages and demographics, and skewed representations. Alignment techniques (RLHF, DPO) mitigate but do not eliminate these issues.

Knowledge cutoffs

A model trained through a given date knows nothing about later events except through retrieval or tools; the weights encode a snapshot of the world as of the training cutoff. This is why almost all chat products now ship with web search, and why RAG pipelines are standard in enterprise deployments.

Security and misuse

LLMs can be exploited to generate disinformation, phishing emails, malicious code, and other harmful content. Prompt injection attacks can manipulate LLM-powered applications into ignoring their instructions; the 2025 OWASP Top 10 for LLM Applications ranks prompt injection as the top vulnerability ^[65]. Three related but distinct concerns dominate the security literature:

Prompt injection: an attacker hides instructions in untrusted text (a webpage, an email, a tool output) that the model follows when it processes them, potentially overriding the developer's system prompt.
Jailbreaking: a user crafts a prompt that bypasses the model's safety training, persuading it to produce content it was trained to refuse.
Data exfiltration through tool use: a compromised model in an agentic loop can be tricked into reading private data and writing it to an attacker-controlled destination.

Defenses combine input filtering, separate trust levels for system, developer, and user content, output checks, and defense-in-depth rather than reliance on the model's own safety training. Defending against these attacks remains an active area of research.

Safety and alignment

Alignment research asks whether the stated goal of producing helpful, harmless, and honest outputs can be durably encoded into model weights. Anthropic's Constitutional AI and scalable oversight research are two published frameworks for pursuing this at scale without requiring human labeling of every output ^[40]. At the organizational level, OpenAI's Preparedness Framework and Anthropic's Responsible Scaling Policy describe commitments to evaluate models at capability thresholds before deployment.

Environmental and computational costs

Training and deploying LLMs requires enormous computational resources, raising significant environmental concerns.

Training costs

Training GPT-3 consumed an estimated 1,287 megawatt-hours (MWh) of electricity and produced over 550 metric tons of CO2-equivalent emissions, while requiring more than 700 kiloliters of water for cooling ^[66]. As models have grown, costs have scaled accordingly: GPT-4's training cost is estimated at $78-100 million, and Gemini Ultra 1.0's at approximately $192 million. Epoch AI estimates that the cost of frontier training runs has grown by 2-3x per year over the past eight years ^[38].
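
To see what a 2-3x annual growth rate implies when compounded, the short calculation below may help. It assumes the rate stayed constant across the whole eight-year window, which the cited range does not guarantee; the function name is hypothetical.

```python
def cumulative_growth(annual_factor: float, years: int) -> float:
    """Total multiplier after compounding a constant annual factor."""
    return annual_factor ** years


low = cumulative_growth(2, 8)   # 2x per year for eight years
high = cumulative_growth(3, 8)  # 3x per year for eight years
print(f"Cumulative cost growth over 8 years: {low:,.0f}x to {high:,.0f}x")
# prints: Cumulative cost growth over 8 years: 256x to 6,561x
```

Even the lower bound of the range corresponds to a cost increase of more than two orders of magnitude over the period.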

Inference costs

Recent research suggests that inference, rather than training, is emerging as the primary contributor to ongoing environmental costs, since inference runs continuously at massive scale while training a given model is a one-time event. A 2025 study estimated that GPT-4o inference alone would require approximately 391,000 to 463,000 MWh of electricity annually at current usage levels, comparable to the annual electricity consumption of 35,000 U.S. homes ^[66]. The most energy-intensive models consume over 29 Wh per long prompt, more than 65 times as much as the most efficient systems.
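
The annual totals are easier to picture as a continuous power draw. The sketch below converts the study's range into average megawatts and into the per-home consumption implied by the 35,000-home comparison. It uses only the figures quoted above; the helper names are hypothetical, and the year is taken as 8,760 hours.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def average_power_mw(annual_mwh: float) -> float:
    """Average continuous power draw implied by an annual energy total."""
    return annual_mwh / HOURS_PER_YEAR


def mwh_per_home(annual_mwh: float, homes: int) -> float:
    """Annual energy per home if the total were spread across `homes`."""
    return annual_mwh / homes


for annual_mwh in (391_000, 463_000):
    print(
        f"{annual_mwh:,} MWh/year is about "
        f"{average_power_mw(annual_mwh):.1f} MW on average, or "
        f"{mwh_per_home(annual_mwh, 35_000):.1f} MWh per home "
        f"across 35,000 homes"
    )
```

Under these assumptions the estimate corresponds to roughly 45 to 53 MW of sustained demand.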

Inference pricing has fallen as dramatically as capability has risen. GPT-4 launched in 2023 at $0.03 per 1,000 input tokens ($30 per million); by 2025 GPT-4.1 offered a 1-million-token context window, roughly eight times that of GPT-4o, at lower per-token prices ^[19], and open-weight models on commodity hardware pushed marginal inference cost to near zero for many use cases.
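
Prices in this article appear both per 1,000 tokens and per million tokens, so a small conversion makes them comparable. The sketch below normalizes the GPT-4 launch price and then costs a hypothetical request at the GPT-5.5 prices quoted later in this article ($5 per million input tokens, $30 per million output tokens). The request size and the function names are assumptions chosen for illustration.

```python
def per_million(price_per_thousand: float) -> float:
    """Convert a per-1,000-token price to a per-million-token price."""
    return price_per_thousand * 1000


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Dollar cost of one request given per-million-token prices."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


gpt4_input = per_million(0.03)  # GPT-4 launch input price, now per million
# Hypothetical request: 10,000 input tokens and 1,000 output tokens.
cost = request_cost(10_000, 1_000, input_price_per_m=5, output_price_per_m=30)
print(f"GPT-4 launch input price: ${gpt4_input:.2f} per million tokens")
print(f"Example request cost: ${cost:.2f}")
```

Output tokens are billed at a higher rate than input tokens in every price pair quoted in this article, so long generations dominate the cost of a request sooner than long prompts do.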

Comparative perspective

Research has also shown that LLMs can have a dramatically lower environmental impact than human labor for equivalent output. For a typical LLM like Llama-3-70B, the human-to-LLM emissions ratio ranges from 40:1 to 150:1, meaning a human produces 40 to 150 times the carbon emissions of the LLM per unit of output ^[66]. Optimization techniques (quantization, efficient serving, renewable-powered data centers) continue to improve the efficiency of LLM deployment.

Current state (2025-2026)

As of early 2026, the LLM field is characterized by several major trends.

Million-token context windows are now standard among frontier models. Claude 4.6 and Gemini 3.1 Pro both offer 1-million-token windows, and LLaMA 4 Scout pushes to 10 million tokens. These expanded windows enable processing of entire codebases, book-length documents, and multi-hour conversation histories in a single pass.

Agentic capabilities have become a defining feature. Frontier models can use tools, browse the web, write and execute code, manage files, and carry out multi-step tasks with minimal human supervision. Frameworks built on MCP and A2A allow agents to connect to external services and APIs through standardized protocols. Multi-agent systems, where orchestrated teams of specialized agents collaborate on tasks, saw a 1,445% increase in interest from Q1 2024 to Q2 2025 according to Gartner ^[62]. Agent reliability remains the open problem for commercial deployment: models still make errors in long agentic loops, and reducing error rates in tool use, code execution, and long-horizon planning is central to converting chat assistants into autonomous workers.
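
One way to see why long agentic loops are fragile is to compound a per-step success rate. The sketch below assumes every step succeeds independently with the same probability, a simplification that ignores error recovery and correlated failures. The 99% rate and the function names are hypothetical.

```python
import math


def chain_success(step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return step_success ** steps


def steps_until_below(step_success: float, target: float) -> int:
    """Smallest step count at which chain success drops below `target`.

    Requires 0 < step_success < 1 and 0 < target < 1.
    """
    return math.floor(math.log(target) / math.log(step_success)) + 1


print(f"100 steps at 99% each: {chain_success(0.99, 100):.1%} overall")
print(f"Falls below 50% after {steps_until_below(0.99, 0.5)} steps")
```

Under this model a task of 100 steps completes without error only about a third of the time even at 99% per-step reliability, which is why small reductions in per-step error rates matter so much for long-horizon work.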

Reasoning models represent a distinct category. OpenAI's o1/o3 series and DeepSeek R1 use extended internal "thinking" to solve complex problems, trading speed for accuracy on mathematical, scientific, and coding tasks.

Mixture-of-experts architectures have become widespread, allowing models to scale total parameter counts into the hundreds of billions or trillions while keeping inference costs practical by activating only a fraction of parameters per token.
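
The routing idea can be sketched in a few lines. In the toy version below each expert is a plain function on a single number and the gate scores are passed in directly, whereas in a real model the experts are feed-forward networks and the scores come from a learned router. The names `moe_forward` and `gate_logits` are hypothetical, and top-2 selection with renormalized weights is one common choice, not the only one.

```python
import math
from typing import Callable


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    token: float,
    gate_logits: list[float],
    experts: list[Callable[[float], float]],
    k: int = 2,
) -> tuple[float, list[int]]:
    """Run only the k highest-scoring experts and mix their outputs."""
    top = sorted(
        range(len(experts)), key=lambda i: gate_logits[i], reverse=True
    )[:k]
    weights = softmax([gate_logits[i] for i in top])
    output = sum(w * experts[i](token) for w, i in zip(weights, top))
    return output, top


# Four toy experts; only two run for this token.
experts = [lambda x: x * 1, lambda x: x * 2, lambda x: x * 3, lambda x: x * 4]
output, used = moe_forward(1.0, [0.1, 2.0, 0.3, 2.0], experts, k=2)
print(f"experts used: {used}, active fraction: {len(used) / len(experts):.0%}")
```

Here half of the experts run per token. Production models activate a much smaller share: GLM-5.2, described later in this article, activates about 40 billion of roughly 750 billion parameters, a little over 5%.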

The open-weight ecosystem continues to mature. Models like LLaMA 4, DeepSeek V3, Mistral Large 3, and Qwen 3 provide near-frontier capabilities with full weight access, enabling fine-tuning, local deployment, and research that would be impossible with closed models.

Hybrid architectures are beginning to appear. NVIDIA's Nemotron 3 family (announced December 2025) combines Mamba (a state-space model) with Transformer layers in an MoE configuration, targeting improved inference throughput and long-context efficiency for agent workloads ^[67].

Recent developments (2026)

The pace of frontier model releases continued through the second quarter of 2026. OpenAI launched GPT-5.5 on April 23, 2026, describing it as its most capable and intuitive model to date, with particular gains in agentic coding, scientific research, and computer use. The model offers a roughly 1-million-token context window with up to 128,000 output tokens, and is priced at $5 per million input tokens and $30 per million output tokens, double the cost of GPT-5.4 ^[68]. It scored 82.7% on Terminal-Bench 2.0 and posted strong results on the FrontierMath benchmark ^[68]. On May 5, 2026, OpenAI released GPT-5.5 Instant as the new default model for all ChatGPT users, replacing GPT-5.3 Instant ^[69].

Google introduced Gemini 3.5 Flash at Google I/O on May 19, 2026, positioning it as its strongest agentic and coding model and reporting that it runs roughly four times faster (in output tokens per second) than comparable frontier models. It scored 76.2% on Terminal-Bench 2.1, with Gemini 3.5 Pro slated to follow ^[70].

Anthropic released Claude Opus 4.8 on May 28, 2026, citing improved agentic coding, reasoning, and honesty. The company reported the model reaches 84% on the Online-Mind2Web browser-agent benchmark and is about four times less likely than its predecessor to let flaws in its own code pass unremarked ^[71]. Pricing remained $5 per million input tokens and $25 per million output tokens ^[71]. The same day, Anthropic announced a $65 billion Series H round at a $965 billion post-money valuation ^[72], a figure that news outlets reported surpassed OpenAI's valuation, making Anthropic the most valuable AI startup at the time ^[73].

June 2026 brought the year's densest release wave. Anthropic announced Claude Fable 5 and Claude Mythos 5 on June 9, 2026. Fable 5, the first generally available model in the Mythos class (a capability tier above Opus), launched with a 1-million-token context window, up to 128,000 output tokens, and pricing of $10 per million input tokens and $50 per million output tokens; Mythos 5, the same underlying model with fewer cyber-capability restrictions, was initially limited to vetted cyberdefense partners through Project Glasswing ^[78]. The U.S. government placed both models under export controls on June 12, then cleared Mythos 5 for U.S. critical-infrastructure defenders and lifted the controls on Fable 5 by June 30 ^[83]. Anthropic closed the month with Claude Sonnet 5 on June 30, 2026, which became the default model for free and Pro users and which the company described as approaching Opus 4.8 performance at much lower cost, with introductory pricing of $2 per million input tokens and $10 per million output tokens ^[82].

OpenAI began a limited preview of the GPT-5.6 family on June 26, 2026, spanning the flagship Sol, the mid-tier Terra, and the fast, low-cost Luna; access was initially restricted to a small group of trusted partners while a U.S. government safety review completed, and general availability followed on July 9, 2026 ^[84]. Among open-weight models, Zhipu AI released GLM-5.2 in mid-June 2026 under an MIT license. A mixture-of-experts design with roughly 750 billion total parameters (about 40 billion active) and a 1-million-token context window, it posted the strongest open-model results on agentic coding benchmarks at release, including 62.1% on SWE-bench Pro and 81.0% on Terminal-Bench 2.1 ^[79]. Together with DeepSeek V4, released April 24, 2026 in V4-Pro and V4-Flash variants under the MIT license, GLM-5.2 continued to narrow the gap between open-weight and closed frontier models ^[77].

Several notable frontier model pages were added as dedicated entries rather than being covered only inside this overview:

Model	Developer	Why it matters
GPT-5.4	OpenAI	Mainline reasoning model with computer use and tool search
GPT-4.1	OpenAI	API-only family focused on coding and 1M context
Gemini 3 Pro	Google DeepMind	Gemini 3-series flagship preview model
Claude Opus 4.7	Anthropic	Anthropic's April 2026 flagship generally available model
Grok 4.1 Fast	xAI	2M-context tool-calling model for agentic tasks

Explain like I'm 5 (ELI5)

A large language model is like a super-smart computer program that has read billions of books, articles, and web pages. By reading all that text, it learned how words and sentences fit together. When you ask it a question or give it a task, it figures out what words should come next, one at a time, to write a helpful answer. It can do lots of things: translate languages, answer questions, write stories, help with homework, or even write computer code. But it is not perfect. Sometimes it makes things up that sound right but are not true, because it learned patterns in language rather than actually understanding the world the way people do.

References

Wikipedia. "Large language model." https://en.wikipedia.org/wiki/large_language_model
Mikolov, Tomas et al. "Efficient Estimation of Word Representations in Vector Space." arXiv:1301.3781, 2013. https://arxiv.org/abs/1301.3781
Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." EMNLP, 2014. https://nlp.stanford.edu/pubs/glove.pdf
Vaswani, A., et al. "Attention Is All You Need." Advances in Neural Information Processing Systems, 2017. arXiv:1706.03762. https://arxiv.org/abs/1706.03762
Radford, A., et al. "Improving Language Understanding by Generative Pre-Training." OpenAI, June 2018.
Radford, A., et al. "Language Models are Unsupervised Multitask Learners." OpenAI, February 2019.
OpenAI. "GPT-2: 1.5B Release." November 5, 2019. https://openai.com/index/gpt-2-1-5b-release/
Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019. arXiv:1810.04805. https://arxiv.org/abs/1810.04805
Brown, T., et al. "Language Models are Few-Shot Learners." Advances in Neural Information Processing Systems, 2020. arXiv:2005.14165. https://arxiv.org/abs/2005.14165
Chowdhery, A., et al. "PaLM: Scaling Language Modeling with Pathways." Google Research, 2022.
Raffel, Colin et al. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." Journal of Machine Learning Research, 2020. https://huggingface.co/google-t5/t5-11b
Chung, H. W., et al. "Scaling Instruction-Finetuned Language Models." Google Research, 2022.
Ouyang, L., et al. "Training language models to follow instructions with human feedback." NeurIPS, 2022. arXiv:2203.02155. https://arxiv.org/abs/2203.02155
OpenAI. "GPT-4 Technical Report." March 2023.
Touvron, H., et al. "LLaMA: Open and Efficient Foundation Language Models." Meta AI, February 2023.
OpenAI. "Hello GPT-4o." May 13, 2024. https://openai.com/index/hello-gpt-4o/
Meta AI. "Introducing Llama 3.1: Our most capable models to date." July 23, 2024. https://ai.meta.com/blog/meta-llama-3-1/
OpenAI. "Learning to Reason with LLMs." September 2024.
OpenAI. "Introducing GPT-4.1 in the API." April 14, 2025. https://openai.com/index/gpt-4-1/
OpenAI. "Introducing GPT-5." August 2025. https://openai.com/index/introducing-gpt-5/
Anthropic. "Introducing Claude 4." May 22, 2025. https://www.anthropic.com/news/claude-4
Anthropic. "Claude 4 Model Card and System Prompt." May 2025.
Google. "Gemini 2.5: Our newest Gemini model with thinking." March 25, 2025. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/ . See also: Google. "Gemini 2.5 Deep Think rolling out now for Google AI Ultra." August 2025. https://9to5google.com/2025/08/01/gemini-2-5-deep-think/
Google DeepMind. "Gemini 3.1 Pro Technical Report." February 2026.
Meta AI. "The Llama 4 Herd: The Beginning of a New Era of Natively Multimodal AI Innovation." April 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/
DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, December 2024. https://arxiv.org/abs/2412.19437 . See also: DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, January 2025. https://arxiv.org/abs/2501.12948
Alibaba Cloud. "Alibaba Introduces Qwen3, Setting New Benchmark in Open-Source AI with Hybrid Reasoning." April 28, 2025. https://www.alibabacloud.com/blog/alibaba-introduces-qwen3-setting-new-benchmark-in-open-source-ai-with-hybrid-reasoning_602192
Mistral AI. "Introducing Mistral 3." December 2025.
"Encoder-Decoder or Decoder-Only? Revisiting Encoder-Decoder Large Language Model." arXiv:2510.26622, 2025.
Mistral AI. "Mixtral of experts." December 11, 2023. https://mistral.ai/news/mixtral-of-experts
Gu, Albert and Tri Dao. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752, December 2023. https://arxiv.org/abs/2312.00752
AI21 Labs. "Introducing Jamba: AI21's Groundbreaking SSM-Transformer Model." https://www.ai21.com/blog/announcing-jamba/
Su, Jianlin et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864, 2021. https://arxiv.org/abs/2104.09864
Sennrich, R., Haddow, B., and Birch, A. "Neural Machine Translation of Rare Words with Subword Units." ACL, 2016.
Common Crawl Foundation. https://commoncrawl.org
Penedo, Guilherme et al. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557, 2024. https://arxiv.org/abs/2406.17557 . See also: Penedo et al. "The RefinedWeb Dataset for Falcon LLM." arXiv:2306.01116, 2023.
Qwen Team. "Qwen2.5 Technical Report." arXiv:2412.15115, December 2024. https://arxiv.org/abs/2412.15115
Epoch AI. "How Much Does It Cost to Train Frontier AI Models?" 2024.
Rafailov, R., et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." NeurIPS, 2023. arXiv:2305.18290. https://arxiv.org/abs/2305.18290
Bai, Yuntao et al. "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073, December 2022. https://arxiv.org/abs/2212.08073
Hu, E., et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR, 2022.
Dettmers, T., et al. "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS, 2023.
Lewis, P., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS, 2020. See also: "A Comprehensive Survey of Retrieval-Augmented Generation (RAG)." arXiv:2410.12837, 2024.
Kaplan, J., et al. "Scaling Laws for Neural Language Models." OpenAI, January 2020. arXiv:2001.08361. https://arxiv.org/abs/2001.08361
Hoffmann, J., et al. "Training Compute-Optimal Large Language Models." DeepMind, NeurIPS 2022. arXiv:2203.15556. https://arxiv.org/abs/2203.15556
Hu, Shengding, et al. "MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies." arXiv:2404.06395, 2024. https://arxiv.org/abs/2404.06395 . See also: Muennighoff, Niklas, et al. "Scaling Data-Constrained Language Models." NeurIPS 2023. arXiv:2305.16264.
Sardana, Nikhil, Jacob Portes, Sasha Doubov, and Jonathan Frankle. "Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws." MosaicML (Databricks), ICML 2024. arXiv:2401.00448. https://arxiv.org/abs/2401.00448
Agarwal, R., et al. "Many-Shot In-Context Learning." NeurIPS, 2024.
Wei, J., et al. "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research, 2022.
Schaeffer, R., et al. "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS, 2023.
ACL 2025. "Energy-Efficient LLM Serving: A Benchmark Study." 2025.
NVIDIA. "Model Optimizer: Quantization, Pruning, Distillation, Speculative Decoding." GTC, 2025.
vLLM Project. "Speculative decoding." Documentation. https://docs.vllm.ai/en/v0.6.6/usage/spec_decode.html
NVIDIA. "Boost Llama 3.3 70B Inference Throughput 3x with TensorRT-LLM Speculative Decoding." 2024.
Stanford HAI. "The 2025 AI Index Report: Technical Performance." https://hai.stanford.edu/ai-index/2025-ai-index-report/technical-performance
ARC Prize. "ARC Prize 2025 Results and Analysis." https://arcprize.org/blog/arc-prize-2025-results-analysis
Humanity's Last Exam benchmark. https://agi.safe.ai/
Wikipedia. "Llama (language model)." https://en.wikipedia.org/wiki/Llama_(language_model)
Open Source Initiative. "Open Source AI Definition (OSAID) 1.0." October 2024.
Interconnects. "2025 Open Models Year in Review." 2025.
xAI. "Grok 3 Beta - The Age of Reasoning Agents." February 2025. https://x.ai/news/grok-3
Gartner. "Agentic AI Market Forecast." 2025.
Xu, Z., et al. "Hallucination is Inevitable: An Innate Limitation of Large Language Models." arXiv:2401.11817, 2024. See also: Frontiers in AI. "Survey and analysis of hallucinations in large language models." 2025.
Nature Scientific Reports. "Limitations of large language models in clinical problem-solving arising from inflexible reasoning." 2025.
OWASP Foundation. "LLM01:2025 Prompt Injection." OWASP Gen AI Security Project. https://genai.owasp.org/llmrisk/llm01-prompt-injection/
Nature Scientific Reports. "Reconciling the contrasting narratives on the environmental impact of large language models." 2024. See also: "How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference." arXiv:2505.09598, 2025.
NVIDIA. "Nemotron 3: Hybrid Mamba-Transformer MoE for Agent Workloads." December 2025.
OpenAI. "Introducing GPT-5.5." April 2026. https://openai.com/index/introducing-gpt-5-5/ . See also OpenAI API model reference: https://developers.openai.com/api/docs/models/gpt-5.5
Wiggers, K. "OpenAI releases GPT-5.5 Instant, a new default model for ChatGPT." TechCrunch, May 5, 2026. https://techcrunch.com/2026/05/05/openai-releases-gpt-5-5-instant-a-new-default-model-for-chatgpt/
Google. "Gemini 3.5: frontier intelligence with action." The Keyword (Google Blog), May 19, 2026. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/
Anthropic. "Introducing Claude Opus 4.8." May 28, 2026. https://www.anthropic.com/news/claude-opus-4-8
Anthropic. "Anthropic raises $65B in Series H funding at $965B post-money valuation." May 28, 2026. https://www.anthropic.com/news/series-h
Bloomberg. "Anthropic's Valuation Nears $1 Trillion After Raising $65 Billion." May 28, 2026. https://www.bloomberg.com/news/articles/2026-05-28/anthropic-raises-at-965-billion-valuation-eclipsing-openai
Google. "Introducing Gemini: our largest and most capable AI model." December 6, 2023. https://blog.google/technology/ai/google-gemini-ai/ . See also: Google. "Bard becomes Gemini." February 8, 2024.
Shao, Zhihong, et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300, February 2024. https://arxiv.org/abs/2402.03300
Rein, David, et al. "GPQA: A Graduate-Level Google-Proof Q&A Benchmark." arXiv:2311.12022, November 2023. https://arxiv.org/abs/2311.12022
DeepSeek-AI. "DeepSeek-V4-Pro." Hugging Face model card, April 2026. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro . See also: DataCamp. "DeepSeek V4: Features, Benchmarks, and Comparisons." April 2026. https://www.datacamp.com/blog/deepseek-v4
Anthropic. "Claude Fable 5 and Claude Mythos 5." June 9, 2026. https://www.anthropic.com/news/claude-fable-5-mythos-5 . See also: TechCrunch. "Anthropic's Claude Fable 5 is a version of Mythos the public can access today." June 9, 2026.
VentureBeat. "Z.ai's open-weights GLM-5.2 beats GPT-5.5 on multiple long-horizon coding benchmarks for 1/6th the cost." June 2026. https://venturebeat.com/technology/z-ais-open-weights-glm-5-2-beats-gpt-5-5-on-multiple-long-horizon-coding-benchmarks-for-1-6th-the-cost
CNBC. "OpenAI completes restructure, solidifying Microsoft as a major shareholder." October 28, 2025. https://www.cnbc.com/2025/10/28/open-ai-for-profit-microsoft.html . See also: Microsoft. "The next chapter of the Microsoft-OpenAI partnership." October 28, 2025. https://blogs.microsoft.com/blog/2025/10/28/the-next-chapter-of-the-microsoft-openai-partnership/
Zeff, Maxwell. "Sam Altman says ChatGPT has hit 800M weekly active users." TechCrunch, October 6, 2025. https://techcrunch.com/2025/10/06/sam-altman-says-chatgpt-has-hit-800m-weekly-active-users/
Anthropic. "Introducing Claude Sonnet 5." June 30, 2026. https://www.anthropic.com/news/claude-sonnet-5 . See also: TechCrunch. "Anthropic launches Claude Sonnet 5 as a cheaper way to run agents." June 30, 2026.
Fortune. "Anthropic's Mythos 5 AI model cleared by U.S. for wider use." June 27, 2026. https://fortune.com/2026/06/27/anthropic-mythos-5-ai-model-us-commerce-department-clearance-fable/ . See also: Anthropic. "Redeploying Claude Fable 5." June 2026. https://www.anthropic.com/news/redeploying-fable-5
OpenAI. "Previewing GPT-5.6 Sol: a next-generation model." June 26, 2026. https://openai.com/index/previewing-gpt-5-6-sol/ . See also: Axios. "OpenAI releases powerful new GPT-5.6 model under restrictions." June 26, 2026; Wikipedia. "GPT-5.6."
