See also: Machine learning terms, Natural language processing, Transformer
A large language model (LLM) is a type of artificial intelligence system built on neural networks with billions (or trillions) of parameters, trained on massive text corpora to understand and generate human language. These models can perform a wide range of tasks including translation, summarization, question answering, code generation, and open-ended conversation. Since the release of GPT-1 in 2018, LLMs have become one of the most consequential developments in the history of computing, powering products used by hundreds of millions of people worldwide.
LLMs work by learning statistical patterns in text. During training, a model reads vast quantities of text from books, websites, academic papers, and code repositories, building an internal representation of language structure, factual knowledge, and reasoning patterns. The resulting model can then generate text one token at a time, predicting the most likely next token given everything that came before it. Despite this relatively simple mechanism, LLMs exhibit surprisingly complex behavior, including the ability to follow instructions, write software, solve math problems, and engage in multi-step reasoning.
The term "large" in LLM is relative and has shifted over time. In 2018, GPT-1's 117 million parameters qualified as large. By 2025, models with fewer than a billion parameters are generally considered small, and frontier LLMs contain hundreds of billions to trillions of parameters. The "language model" part refers to the core training objective: predicting the probability distribution over the next token in a sequence, a form of self-supervised learning that requires no manually labeled data.
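The next-token objective can be illustrated with a toy bigram model that simply counts which token follows which, an extreme simplification of what a neural network learns, but the same self-supervised principle (the corpus below is illustrative):

```python
from collections import Counter, defaultdict

# Toy illustration of the language-modeling objective: estimate
# P(next token | previous token) from raw text. No labels are needed,
# because the text itself supplies the prediction targets.
corpus = "the cat sat on the mat the cat ran".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1          # self-supervised: the data labels itself

def next_token_distribution(token):
    c = counts[token]
    total = sum(c.values())
    return {t: n / total for t, n in c.items()}

print(next_token_distribution("the"))   # {'cat': 0.666..., 'mat': 0.333...}
```

A real LLM replaces the count table with a neural network conditioned on thousands of preceding tokens, but generation still proceeds by sampling from exactly this kind of next-token distribution.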
The development of LLMs can be traced through several distinct phases, each marked by significant increases in model size, training data, and capability.
The transformer architecture, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," laid the groundwork for all modern LLMs [1]. The Transformer replaced recurrent neural networks with a self-attention mechanism that could process entire sequences in parallel, dramatically improving both training speed and the model's ability to capture long-range dependencies in text.
OpenAI released GPT-1 in June 2018 with 117 million parameters [2]. It was trained on the BookCorpus dataset and demonstrated that generative pre-training followed by discriminative fine-tuning could achieve strong results on a variety of natural language processing benchmarks. In February 2019, GPT-2 followed with 1.5 billion parameters, trained on WebText, a 40-gigabyte dataset of 8 million web pages [3]. OpenAI initially withheld the full model citing concerns about potential misuse for generating disinformation, releasing it in stages over several months.
Google introduced BERT (Bidirectional Encoder Representations from Transformers) in October 2018 with 340 million parameters. Unlike GPT, BERT used a bidirectional training approach (masked language modeling), making it particularly effective for understanding tasks like classification and question answering rather than text generation [4].
The release of GPT-3 in May 2020 marked a turning point. With 175 billion parameters and a 2,048-token context window, GPT-3 showed that scaling up model size could unlock qualitatively new capabilities [5]. The model demonstrated impressive "few-shot learning" abilities, performing tasks it had never been explicitly trained for simply by being given a few examples in the prompt. GPT-3's training was estimated to cost $4.6 million in cloud compute, and the model's weights alone required approximately 350 GB of storage.
Google responded with PaLM (Pathways Language Model) in April 2022, scaling to 540 billion parameters. PaLM demonstrated strong performance across NLP benchmarks and showed particular strength in reasoning tasks when combined with chain-of-thought prompting [6]. Google also developed LaMDA, a model focused specifically on natural conversational abilities.
This period also saw the emergence of several important open-source efforts. EleutherAI released GPT-Neo and GPT-J, providing the research community with openly accessible alternatives to proprietary models. BigScience, an international collaboration, released BLOOM, a 176-billion-parameter multilingual model, in July 2022. Google released T5 (Text-to-Text Transfer Transformer), which framed all NLP tasks as text-to-text problems using an encoder-decoder architecture, and later Flan-T5, an instruction-tuned variant that demonstrated the power of multi-task fine-tuning [7].
The launch of ChatGPT in November 2022 (built on GPT-3.5) brought LLMs into mainstream public awareness. Within two months, it had over 100 million users, making it the fastest-growing consumer application in history at the time.
OpenAI released GPT-4 in March 2023, a multimodal model capable of processing both text and images. GPT-4 was widely praised for its increased accuracy and reasoning capabilities [8]. Anthropic launched the Claude model family, and Google released the Gemini models (which replaced its earlier Bard chatbot) in Nano, Pro, and Ultra variants.
Meta released LLaMA (Large Language Model Meta AI) in February 2023, a family of models ranging from 7 billion to 65 billion parameters. The 13B-parameter LLaMA model outperformed GPT-3 (175B) on most NLP benchmarks, demonstrating that smaller, well-trained models could match or exceed much larger ones [9]. LLaMA's open release catalyzed a wave of community fine-tuning projects including Alpaca, Vicuna, and Koala. Meta followed with LLaMA 2 in July 2023 and LLaMA 3 (with models up to 405 billion parameters) in 2024.
In September 2024, OpenAI released o1-preview, the first in a new series of "reasoning models" trained specifically for extended chain-of-thought problem solving, representing a new paradigm in LLM capability [10].
By 2025, LLMs entered a new phase characterized by massive context windows, native multimodality, mixture-of-experts architectures, and agentic capabilities.
OpenAI released GPT-5 on August 7, 2025, featuring a 400,000-token context window and significantly improved reliability, with hallucination rates reduced to approximately 6.2%. It achieved a perfect score on the AIME 2025 math benchmark. GPT-5.2 followed in December 2025 with improved tool use and long-context processing [11].
Anthropic released Claude Opus 4 and Claude Sonnet 4 in May 2025, designed explicitly for agentic use cases including tool invocation, file access, and long-horizon reasoning. Claude Sonnet 4 gained a 1-million-token context window by August 2025. Claude Opus 4.5 arrived in November 2025, and the Claude 4.6 family launched in February 2026 with 1M-token context and up to 128K output tokens [12].
Google released Gemini 3 Pro in November 2025, followed by Gemini 3.1 Pro in February 2026, which led on 12 of 18 tracked benchmarks and offered a 1-million-token context window [13].
Meta released the LLaMA 4 family on April 5, 2025, marking an architectural shift to mixture-of-experts (MoE) design with native multimodality. LLaMA 4 Scout featured a 10-million-token context window, capable of processing approximately 7,500 pages of text [14].
In January 2025, the Chinese AI lab DeepSeek released DeepSeek R1, a 671-billion-parameter open-weight reasoning model that performed comparably to OpenAI's o1 at a fraction of the cost per token [15]. Mistral Large 3 launched in December 2025 with 675 billion total parameters (41 billion active) under the Apache 2.0 open-source license [16].
Virtually all modern LLMs are based on the Transformer architecture. Transformers rely on attention mechanisms (specifically self-attention) that allow the model to weigh the importance of different tokens in a sequence relative to each other. This enables the model to learn complex linguistic patterns and generate coherent, context-aware text across long sequences.
The original Transformer had both an encoder (for understanding input) and a decoder (for generating output). Modern LLMs have diverged into distinct architectural families:
| Architecture type | How it works | Training objective | Strengths | Example models |
|---|---|---|---|---|
| Decoder-only | Generates text left-to-right using causal (unidirectional) attention | Next-token prediction | Text generation, conversation, code | GPT series, Claude, LLaMA, Mistral |
| Encoder-only | Processes input bidirectionally using masked attention | Masked language modeling | Classification, NER, sentence embeddings | BERT, RoBERTa, DeBERTa |
| Encoder-decoder | Maps input to output via cross-attention between encoder and decoder | Span corruption or text-to-text | Translation, summarization, question answering | T5, BART, Flan-T5 |
The decoder-only architecture has become dominant for large-scale language models because it naturally supports autoregressive text generation and scales efficiently with increasing parameter counts. A 2024 study found that at small scales, encoder-decoder models can outperform decoder-only models by several points on complex tasks, but this advantage diminishes at larger scales where decoder-only models match or exceed them [17].
A significant architectural trend in 2024-2025 has been the adoption of Mixture-of-Experts designs. In an MoE model, only a fraction of the total parameters are activated for any given input token. A routing mechanism selects which "expert" subnetworks to use, allowing models to have very large total parameter counts while keeping inference costs manageable. For example, Mistral Large 3 has 675 billion total parameters but only 41 billion active per token, and LLaMA 4 Maverick has 400 billion total parameters with 17 billion active [14][16].
LLMs process text as tokens rather than individual characters or whole words. A token is typically a subword unit: common words like "the" are single tokens, while less frequent words may be split into multiple tokens. On average, one token corresponds to roughly 3/4 of a word in English. Tokenization is a foundational preprocessing step that bridges the gap between raw text and the model's numerical representations.
| Algorithm | How it works | Used by | Key characteristic |
|---|---|---|---|
| Byte Pair Encoding (BPE) | Starts with individual characters and iteratively merges the most frequent adjacent pair until reaching target vocabulary size | GPT-2, GPT-3, GPT-4, LLaMA | Most popular; byte-level variant treats every possible byte as a basic unit |
| WordPiece | Similar to BPE but merges based on which pair maximizes the likelihood of the training data, not just frequency | BERT, DistilBERT, Electra | Tends to keep frequent words intact while splitting rare words |
| SentencePiece | Language-agnostic; treats input as raw byte stream and learns subword units using BPE or Unigram algorithms | T5, LLaMA, many multilingual models | Works directly on raw text without language-specific preprocessing; uses special marker for word boundaries |
| Unigram | Starts with a large vocabulary and iteratively removes tokens that least reduce the training data likelihood | SentencePiece-based models, XLNet | Probabilistic approach; can assign multiple tokenizations to the same text |
Byte-level BPE, used by models like GPT-2 and later, operates at the byte level rather than the character level. This ensures that any text (including emojis, non-Latin scripts, and special characters) can be tokenized without unknown tokens, since every input can be decomposed into its constituent bytes [18].
Vocabulary sizes for modern LLMs typically range from 32,000 to 256,000 tokens. Larger vocabularies reduce the average number of tokens needed to represent text (improving efficiency) but increase the size of the embedding layer. GPT-4 uses a vocabulary of approximately 100,000 tokens, while LLaMA 3 expanded to 128,000 tokens to improve multilingual performance.
The development of a modern LLM typically follows a multi-stage pipeline: pre-training, supervised fine-tuning (SFT), and alignment.
During pre-training, the model is exposed to enormous quantities of text, learning to predict the next token given the preceding context. This self-supervised phase is by far the most computationally expensive step. GPT-3, for example, was trained on roughly 300 billion tokens. More recent models use far more data: LLaMA 3 was trained on over 15 trillion tokens, a ratio of roughly 1,875 tokens per parameter [9].
Pre-training data typically includes web crawls (Common Crawl), books, Wikipedia, academic papers, code repositories (GitHub), and curated datasets. Data quality matters enormously; deduplication, filtering, and careful curation of training data have been shown to significantly improve model performance relative to simply adding more data.
Pre-training requires massive compute infrastructure. LLaMA 4 was trained on a cluster of thousands of NVIDIA GPUs, and Mistral Large 3 used approximately 3,000 H200 GPUs [16]. Training runs for frontier models cost tens to hundreds of millions of dollars. Epoch AI estimates that training costs for frontier models have grown by a factor of 2 to 3 per year over the past eight years, with projections suggesting the largest models may cost over a billion dollars by 2027 [19].
| Model | Year | Estimated training cost |
|---|---|---|
| GPT-3 | 2020 | $4.6 million |
| PaLM (540B) | 2022 | ~$12 million |
| GPT-4 | 2023 | $78-100+ million |
| Gemini Ultra 1.0 | 2023 | ~$192 million |
| GPT-5 | 2025 | Undisclosed (est. $200M+) |
After pre-training, the model is fine-tuned on a smaller, curated dataset of high-quality instruction-response pairs. Human annotators or AI systems write examples of ideal responses to various prompts, and the model is trained to mimic these responses. This stage transforms the base model from a raw text predictor into an assistant that can follow instructions and engage in conversation.
Alignment techniques adjust the model's behavior to be helpful, harmless, and honest. The two dominant approaches are:
RLHF (Reinforcement Learning from Human Feedback): Introduced by OpenAI and refined by Anthropic, RLHF involves three sub-steps: (1) collecting human preference data by having annotators rank model outputs, (2) training a reward model to predict human preferences, and (3) using reinforcement learning (specifically PPO, Proximal Policy Optimization) to fine-tune the LLM to maximize the reward model's score [20]. Notable RLHF-trained models include ChatGPT, Claude, and Gemini.
DPO (Direct Preference Optimization): Introduced by Rafailov et al. in 2023, DPO simplifies alignment by eliminating the separate reward model and RL loop. Instead, it directly optimizes the LLM on preference pairs using a classification-style loss function. DPO is simpler to implement, cuts compute costs by approximately 40% compared to RLHF, and has been shown to produce comparable results in many settings [21]. By 2024, Hugging Face reported a 210% year-over-year increase in DPO usage.
Emerging alignment methods include Group Relative Policy Optimization (GRPO), which approximates the value function using average rewards from multiple completions, and Reinforcement Learning from AI Feedback (RLAIF), which uses LLM-generated preferences to reduce human labeling costs. Meta's LLaMA 4 uses a multi-round alignment process combining SFT, rejection sampling, PPO, and DPO [14].
Full fine-tuning (updating all model parameters) is prohibitively expensive for most practitioners. Parameter-efficient fine-tuning (PEFT) methods enable adaptation of LLMs by modifying only a small fraction of the model's weights.
LoRA (Low-Rank Adaptation), introduced by Hu et al. in 2021, injects trainable low-rank decomposition matrices into specific layers of the frozen pre-trained model [22]. Instead of updating a full weight matrix W of dimension d x d, LoRA learns two much smaller matrices B (d x r) and A (r x d), where the rank r is much smaller than d (typically 8 to 64). The effective weight becomes W + BA, adding only a tiny number of trainable parameters while capturing task-specific adaptations. LoRA typically trains 0.1% to 1% of the original parameter count.
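The parameter savings follow directly from the shapes involved. A quick back-of-envelope calculation, using illustrative sizes (d = 4096, rank r = 16):

```python
# Parameter-count sketch for LoRA on a single d x d weight matrix.
d, r = 4096, 16

full_update = d * d                 # fine-tuning W directly: 16,777,216 values
lora_update = d * r + r * d         # B (d x r) plus A (r x d): 131,072 values

print(full_update)
print(lora_update)
print(lora_update / full_update)    # ~0.78% of the full update
```

Because the adapters are additive, they can be merged into W after training (W + BA) so inference pays no extra latency, one reason LoRA became the default PEFT method.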
QLoRA (Quantized LoRA), introduced by Dettmers et al. in 2023, combines LoRA with aggressive quantization of the base model [23]. The pre-trained model weights are quantized to 4-bit precision using a new data type called NormalFloat4 (NF4), which is information-theoretically optimal for normally distributed weights. LoRA adapters are then trained in 16-bit precision on top of the frozen quantized base. Key innovations include double quantization (quantizing the quantization constants themselves) and paged optimizers to handle memory spikes. QLoRA makes it possible to fine-tune a 65-billion-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning performance.
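The headline claim, a 65B model on a single 48GB GPU, checks out with simple memory arithmetic. This sketch ignores real overheads (quantization constants, activations, optimizer state for the adapters), and the ~1% adapter fraction is an illustrative assumption:

```python
# Back-of-envelope memory for a QLoRA setup: 4-bit base weights
# (0.5 bytes per parameter) plus small LoRA adapters in 16-bit.
params = 65e9
base_gb = params * 0.5 / 1e9            # 4-bit quantized base model
adapter_gb = 0.01 * params * 2 / 1e9    # ~1% of params as fp16 adapters

print(base_gb)      # 32.5 GB for the frozen base
print(adapter_gb)   # ~1.3 GB for the trainable adapters
```

The frozen 4-bit base dominates the budget, which is why only the adapter weights (and their gradients and optimizer state) need to fit in the remaining headroom.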
| Method | Approach | Typical parameters trained |
|---|---|---|
| Full fine-tuning | Updates all parameters | 100% |
| LoRA | Low-rank adapter matrices | 0.1-1% |
| QLoRA | LoRA on 4-bit quantized base | 0.1-1% |
| DoRA | Decomposed weight-norm LoRA | ~0.5% |
| Prefix tuning | Learnable prefix tokens prepended to each layer | <0.1% |
| Adapter layers | Small bottleneck modules inserted between layers | 1-5% |
With LoRA and QLoRA, practitioners can adapt a 7-billion-parameter model on a single consumer GPU in a few hours for roughly $10. Frameworks like LLaMA-Factory and Hugging Face's PEFT library integrate these methods into streamlined training pipelines.
Retrieval-Augmented Generation (RAG) is a technique that enhances LLM outputs by retrieving relevant documents from an external knowledge base before generating a response [24]. RAG addresses several core LLM limitations: it provides access to up-to-date information beyond the training cutoff, reduces hallucination by grounding responses in retrieved evidence, and enables source attribution so users can verify claims.
A typical RAG pipeline involves three steps: (1) indexing, in which documents are split into chunks, embedded as vectors, and stored in a vector database; (2) retrieval, in which the user's query is embedded and the most similar chunks are looked up; and (3) generation, in which the retrieved chunks are prepended to the prompt so the LLM can produce a grounded, attributable answer.
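The retrieval and prompt-assembly steps can be sketched in a few lines, here using bag-of-words cosine similarity as a stand-in for a learned embedding model (the documents and query are illustrative):

```python
import math
from collections import Counter

# Minimal RAG retrieval sketch. Real systems use dense neural embeddings
# and an approximate-nearest-neighbor index instead of this toy scorer.
docs = [
    "The Transformer architecture was introduced in 2017.",
    "LoRA injects low-rank adapter matrices into a frozen model.",
    "RAG retrieves documents to ground LLM responses.",
]

def embed(text):
    return Counter(text.lower().split())   # stand-in for an embedding model

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

# Assemble the augmented prompt from the top-ranked chunk
question = "what does rag do?"
context = retrieve(question)[0]
prompt = f"Context: {context}\n\nQuestion: {question}"
print(prompt)
```

The final prompt, context plus question, is what actually gets sent to the LLM; the model never sees the full document store.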
RAG saw explosive research growth in 2024, with over 1,200 RAG-related papers published on arXiv compared to fewer than 100 the previous year [24]. Advanced variants include GraphRAG (Microsoft, 2024), which builds knowledge graphs from documents for more structured retrieval, and Agentic RAG, where an LLM-powered agent plans multi-step retrieval strategies before generating. For enterprise applications, RAG offers a cost-effective alternative to full fine-tuning: rather than retraining the model on proprietary data, organizations can simply index their documents and retrieve relevant passages at query time.
Scaling laws describe the predictable relationship between a model's performance (measured by loss on held-out data) and the resources used to train it: model size (parameters), dataset size (tokens), and compute (FLOPs).
In January 2020, researchers at OpenAI (Kaplan et al.) published one of the first systematic studies of neural language model scaling, finding that loss follows a power-law relationship with each of these three factors. Their work suggested that model size should be prioritized over dataset size when scaling up [25].
In 2022, DeepMind's "Chinchilla" paper (Hoffmann et al.) challenged this view. The researchers found that for compute-optimal training, model size and training data should be scaled roughly equally. They proposed a ratio of approximately 20 tokens per parameter as optimal [26]. The 70-billion-parameter Chinchilla model, trained on 1.4 trillion tokens, outperformed the much larger 280-billion-parameter Gopher trained on fewer tokens.
Subsequent research has pushed well beyond the Chinchilla-optimal ratio. Practitioners discovered that models intended for wide deployment benefit from being trained on far more tokens than Chinchilla recommends, because the marginal cost of additional training is small compared to the ongoing cost of serving a larger model to millions of users. LLaMA 3 models were trained at a ratio of roughly 1,875 tokens per parameter, nearly 100 times the Chinchilla-optimal ratio [9]. Research from Tsinghua University suggested a ratio of 192:1 may be more practical for many settings [27]. Loss continues to decrease well beyond the Chinchilla-optimal point, though with diminishing returns.
A 2024 paper from UC Berkeley ("Beyond Chinchilla-Optimal") formalized this intuition, showing that when inference costs are factored in, the optimal strategy is to train smaller models for longer than the original Chinchilla prescription [28].
The context window (or context length) is the maximum number of tokens the model can process in a single forward pass, including both the input prompt and the generated output. Larger context windows allow the model to work with longer documents, maintain coherence over extended conversations, and perform tasks like whole-codebase analysis or book-length summarization.
Context windows have grown by a factor of approximately 20,000 since 2018, from 512 tokens to 10 million tokens in LLaMA 4 Scout.
| Model | Year | Context window |
|---|---|---|
| GPT-1 | 2018 | 512 tokens |
| GPT-2 | 2019 | 1,024 tokens |
| GPT-3 | 2020 | 2,048 tokens |
| GPT-3.5-Turbo | 2023 | 16,384 tokens |
| GPT-4 Turbo | 2023 | 128,000 tokens |
| Claude 3 Opus | 2024 | 200,000 tokens |
| Gemini 1.5 Pro | 2024 | 1,000,000 tokens |
| GPT-5 | 2025 | 400,000 tokens |
| Claude Opus 4.6 | 2026 | 1,000,000 tokens |
| Gemini 3.1 Pro | 2026 | 1,000,000 tokens |
| LLaMA 4 Scout | 2025 | 10,000,000 tokens |
This expansion has been driven by algorithmic improvements (RoPE and its extensions like LongRoPE and YaRN), more efficient attention mechanisms (FlashAttention, sparse attention), and hardware advances in memory capacity. However, longer context windows introduce new challenges. Performance can degrade when relevant information is buried in the middle of a long document (the "lost in the middle" problem), and processing long contexts increases both latency and cost. KV-cache memory usage grows linearly with sequence length, making million-token contexts expensive to serve at scale.
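The linear growth of KV-cache memory is easy to quantify with the standard formula (2 for keys and values, times layers, KV heads, head dimension, sequence length, and bytes per value). The model shape below is illustrative of a 70B-class model using grouped-query attention:

```python
# Back-of-envelope KV-cache size per sequence.
# Illustrative shape: 80 layers, 8 KV heads (grouped-query attention),
# head_dim 128, fp16 values (2 bytes each).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

gib = kv_cache_bytes(80, 8, 128, 128_000) / 2**30
print(f"{gib:.0f} GiB per sequence")   # ~39 GiB at a 128K context
```

Doubling the context doubles this figure, which is why million-token contexts demand aggressive KV-cache compression, quantization, or paging to serve economically.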
One of the most discussed phenomena in LLM research is the concept of emergent abilities: capabilities that appear in larger models but are absent or negligible in smaller ones. Examples include the ability to perform multi-step arithmetic, follow complex instructions, and reason about abstract concepts.
LLMs can learn new tasks from examples provided directly in the prompt, without any weight updates. This capability, known as in-context learning (ICL), scales with model size and context length. With expanded context windows, "many-shot" in-context learning (providing hundreds or thousands of examples rather than just a few) has shown significant performance gains across generative and discriminative tasks. A 2024 paper on many-shot ICL was accepted as a Spotlight Presentation at NeurIPS 2024, documenting performance improvements across a wide variety of tasks [29].
Chain-of-thought (CoT) prompting guides LLMs to break complex problems into intermediate reasoning steps. By prefacing a prompt with "Let's think step by step" or providing worked examples, models produce more accurate answers on math, logic, and science problems. This capability emerges primarily in models above approximately 100 billion parameters and is the foundation for dedicated reasoning models like OpenAI's o1/o3 and DeepSeek R1.
The existence and nature of emergent abilities is debated. Wei et al. (2022) documented numerous tasks where performance appeared to jump discontinuously at certain model scales [30]. However, Schaeffer et al. (2023) argued that apparent emergence may be an artifact of the choice of evaluation metric; when smooth, continuous metrics are used instead of sharp accuracy thresholds, performance improvements look gradual rather than sudden [31].
Regardless of the theoretical debate, it is empirically clear that larger and better-trained models can perform tasks that smaller models cannot. The practical question for researchers and engineers is whether a given capability requires a model above a certain size threshold or whether clever training techniques (better data, improved architectures, distillation) can bring that capability to smaller models.
As LLMs grow larger, efficient inference becomes increasingly important. A 2025 ACL study found that proper inference optimization techniques can reduce energy usage by up to 73% compared to naive serving, typically translating to a 2-3x reduction in cloud costs [32].
Quantization reduces the numerical precision of model weights from their training precision (typically 16-bit floating point) to lower-bit representations such as 8-bit, 4-bit, or even lower. This cuts memory usage proportionally and can speed up inference significantly. NVIDIA's NVFP4 format, for instance, enables 4-bit quantization with minimal accuracy loss, delivering up to 4x throughput improvement on B200 GPUs compared to FP8 on H100 [33]. Common quantization approaches include GPTQ, AWQ, and GGUF.
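The simplest form of the idea, symmetric round-to-nearest quantization of a weight vector, fits in a few lines. Production schemes like GPTQ and AWQ are considerably more sophisticated (per-group scales, calibration data, error compensation), so treat this as a conceptual sketch:

```python
# Symmetric 8-bit quantization: scale by the max absolute value so
# weights map into [-127, 127], round to integers, and dequantize
# by multiplying the scale back.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.12, -0.83, 0.40, 0.05]
q, s = quantize_int8(w)
print(q)                   # [18, -127, 61, 8]
print(dequantize(q, s))    # close to the original weights
```

Each value now occupies one byte instead of two (fp16) or four (fp32), and the rounding error is bounded by half the scale, which is why quantization degrades accuracy only modestly for well-behaved weight distributions.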
Speculative decoding uses a small, fast "draft" model to generate candidate tokens, which are then verified in parallel by the larger target model. Since the large model can verify multiple tokens simultaneously (a single forward pass over several positions), this approach achieves 2-3x speedups without changing the output distribution. NVIDIA's TensorRT-LLM demonstrated up to 3.55x throughput improvement with Llama 3.3 70B using speculative decoding [34].
Techniques like PagedAttention (used in vLLM) manage the key-value cache more efficiently, reducing memory waste during batched inference. NVFP4 KV cache quantization can cut KV cache memory by up to 50%, effectively doubling context budgets and unlocking larger batch sizes [33].
Traditional static batching waits for a batch of requests to complete before starting a new batch, leaving GPUs idle. Continuous batching (also called in-flight batching) allows new requests to enter mid-batch and completed requests to exit immediately, dramatically improving GPU utilization and throughput.
The LLM ecosystem is split between proprietary (closed) models and open-weight (open) models, with ongoing debate about the advantages of each approach.
Closed models like GPT-5, Claude, and Gemini are developed by companies that do not release the model weights. Users access them through APIs or chat interfaces. Advantages include strong safety measures, regular updates, and state-of-the-art performance. Drawbacks include vendor lock-in, limited customization, unpredictable pricing changes, and data privacy concerns (since user inputs are sent to third-party servers).
Open-weight models like LLaMA 4, Mistral Large 3, DeepSeek V3, and Qwen 3 release their trained weights for anyone to download and run. This allows full customization, fine-tuning for specific domains, and local deployment without sending data to external servers. The Open Source Initiative published the Open Source AI Definition (OSAID) 1.0 in October 2024, establishing criteria for what counts as truly "open source" in AI; most open-weight models do not meet this standard because they do not release their training data [35].
By early 2026, the performance gap between open and closed models has narrowed substantially. Open-weight models trail proprietary frontier models by only about three months on average across standard benchmarks [36]. However, closed models maintain a lead on complex agentic tasks, production-quality coding benchmarks (SWE-bench), and overall human preference ratings on platforms like Chatbot Arena. For domain-specific applications such as legal document analysis or medical coding, a fine-tuned 7B open model can often outperform a general-purpose frontier model while running on a single consumer GPU.
| Model | Developer | Release date | Total parameters | Active parameters | Context window | Architecture | License |
|---|---|---|---|---|---|---|---|
| GPT-5 | OpenAI | Aug 2025 | Undisclosed | Undisclosed | 400K tokens | Decoder-only | Proprietary |
| Claude Opus 4.6 | Anthropic | Feb 2026 | Undisclosed | Undisclosed | 1M tokens | Decoder-only | Proprietary |
| Gemini 3.1 Pro | Google DeepMind | Feb 2026 | Undisclosed | Undisclosed | 1M tokens | Decoder-only | Proprietary |
| LLaMA 4 Maverick | Meta | Apr 2025 | 400B | 17B | 1M tokens | MoE | Open-weight (Llama license) |
| Mistral Large 3 | Mistral AI | Dec 2025 | 675B | 41B | 256K tokens | MoE | Apache 2.0 |
| DeepSeek V3 | DeepSeek | Dec 2024 | 671B | 37B | 128K tokens | MoE | Open-weight (MIT) |
| DeepSeek R1 | DeepSeek | Jan 2025 | 671B | 37B | 128K tokens | MoE | Open-weight (MIT) |
Modern LLMs demonstrate a broad range of capabilities that have expanded significantly with each generation.
Text generation and conversation: LLMs can produce fluent, coherent text on virtually any topic. They power chatbots, writing assistants, and content generation tools used by millions of people daily.
Reasoning and problem-solving: Frontier models can perform multi-step logical reasoning, solve mathematical problems, and pass standardized exams. GPT-5 achieved a perfect score on the AIME 2025 math benchmark, and reasoning-focused models like OpenAI's o1 and DeepSeek R1 can tackle complex problems using extended chain-of-thought processing [11][15].
Code generation: LLMs have become powerful programming assistants. Claude 4.5 achieved 77.2% on SWE-bench Verified (a benchmark of real-world software engineering tasks), and models can write, debug, refactor, and explain code in dozens of programming languages [12].
Translation: LLMs perform high-quality translation between many language pairs, often rivaling or exceeding dedicated machine translation systems. LLaMA 4 was trained across over 200 languages [14].
Summarization: Models can condense long documents into concise summaries while preserving key information, a capability that improves substantially with larger context windows.
Multimodal understanding: Many frontier models now process images, audio, and video alongside text. GPT-4 introduced image understanding in 2023, and by 2025, native multimodal training (jointly on text, images, and video) became standard for frontier models.
Agentic behavior: A significant development in 2025-2026 has been the emergence of agentic LLMs that can plan multi-step tasks, use external tools, browse the web, write and execute code, and interact with computer interfaces. Anthropic's Model Context Protocol (MCP) and Google's Agent-to-Agent Protocol (A2A) are establishing standards for how agents connect to external tools and APIs [37]. Gartner predicts that 40% of enterprise applications will embed AI agents by the end of 2026.
LLMs have found applications across nearly every sector of the economy.
| Sector | Applications | Examples |
|---|---|---|
| Software development | Code generation, debugging, testing, refactoring | GitHub Copilot, Cursor, Claude Code |
| Customer service | Chatbots, virtual assistants, ticket routing | ChatGPT Enterprise, Intercom Fin |
| Healthcare | Clinical documentation, literature review, patient communication | Med-PaLM, ambient scribes |
| Legal | Contract analysis, legal research, document drafting | Harvey AI, CoCounsel |
| Education | Personalized tutoring, grading, content generation | Khan Academy Khanmigo, Duolingo |
| Scientific research | Literature review, hypothesis generation, data analysis | Elicit, Consensus |
| Finance | Sentiment analysis, compliance, report generation | Bloomberg GPT, FinGPT |
| Content creation | Writing assistance, marketing copy, creative writing | Jasper, Copy.ai |
Industry analysts project that the agentic AI market will grow from $7.8 billion in 2025 to over $52 billion by 2030 [37].
Despite rapid progress, LLMs face several fundamental limitations.
LLMs sometimes generate plausible but factually incorrect information, a phenomenon known as hallucination. Theoretical work has shown that hallucination is an inherent property of LLMs and cannot be completely eliminated through architecture, data, or algorithmic improvements alone [38]. The problem stems from the fact that LLMs learn statistical patterns rather than grounding their knowledge in verified facts. On constraint satisfaction tasks, hallucination rates scale linearly with problem complexity.
Mitigation approaches include RAG (grounding responses in retrieved documents), chain-of-verification (having the model check its own outputs), and calibrated uncertainty (systems that transparently signal doubt and can safely refuse to answer rather than guessing). A 2025 multi-model study showed that simple prompt-based mitigation cut GPT-4o's hallucination rate from 53% to 23% [38]. While frontier models have reduced hallucination rates significantly (GPT-5 reports approximately 6.2%), the problem persists.
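The RAG approach mentioned above can be sketched in miniature: retrieve the documents most relevant to a query, then constrain the model to answer from that context. This toy uses word-overlap scoring (real systems use vector embeddings and a retrieval index; the document snippets here are illustrative):

```python
# Toy sketch of retrieval-augmented generation (RAG): ground the answer
# in retrieved text rather than the model's parametric memory.
# Scoring is simple word overlap; production systems use embeddings.

DOCS = [
    "The transformer architecture was introduced in 2017.",
    "GPT-3 was trained on roughly 300 billion tokens.",
    "Mixture-of-experts models activate only some parameters per token.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Assemble a grounded prompt that restricts the model to the context."""
    context = "\n".join(retrieve(query, DOCS))
    return f"Answer using ONLY this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("When was the transformer architecture introduced?"))
```

Grounding reduces hallucination because the model's claim can be checked against the retrieved passage, and a system can refuse to answer when retrieval returns nothing relevant.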
While LLMs have improved substantially at reasoning tasks, they still fail on problems that require genuine logical deduction, spatial reasoning, or common sense in unfamiliar contexts. State-of-the-art models perform poorly on certain clinical reasoning tasks and can struggle with novel problem formulations that differ from their training distribution [39]. Reasoning models (o1, DeepSeek R1) have partially addressed this through extended chain-of-thought processing, but at the cost of significantly increased inference time and expense.
Since LLMs are trained on internet text, they can learn and reproduce societal biases present in the training data. These biases can manifest in harmful stereotypes, uneven performance across languages and demographics, and skewed representations. Alignment techniques (RLHF, DPO) mitigate but do not eliminate these issues.
LLMs can be exploited for generating disinformation, phishing emails, malicious code, and other harmful content. Prompt injection attacks can manipulate LLM-powered applications into ignoring their instructions. Defending against these attacks remains an active area of research.
Training and deploying LLMs requires enormous computational resources, raising significant environmental concerns.
Training GPT-3 consumed an estimated 1,287 megawatt-hours (MWh) of electricity and produced over 550 metric tons of CO2 equivalent emissions, while requiring more than 700 kiloliters of water for cooling [40]. As models have grown, costs have scaled accordingly. GPT-4's training cost is estimated at $78-100 million, and Gemini Ultra 1.0 reached approximately $192 million. Epoch AI estimates that the cost of frontier training runs has grown by 2-3x per year over the past eight years [19].
Recent research reveals that inference (rather than training) is emerging as the primary contributor to ongoing environmental costs, since inference occurs continuously at massive scale while training is a one-time event. A 2025 study estimated that GPT-4o inference alone would require approximately 391,000 to 463,000 MWh of electricity annually at current usage levels, comparable to the consumption of 35,000 U.S. homes [40]. The most energy-intensive models consume over 29 Wh per long prompt, more than 65 times the consumption of the most efficient systems.
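A quick back-of-the-envelope check makes the household comparison concrete. Dividing the low-end inference estimate by 35,000 homes yields the per-home figure the study implicitly assumes, which can be compared against the U.S. average of roughly 10.5-11 MWh per household per year (an external figure, not from the cited study):

```python
# Sanity check of the "35,000 U.S. homes" comparison above.
low_mwh, high_mwh = 391_000, 463_000  # estimated annual GPT-4o inference energy
homes = 35_000

implied_mwh_per_home = low_mwh / homes
print(f"{implied_mwh_per_home:.1f} MWh/home/year")  # 11.2 MWh/home/year
```

An implied figure of about 11.2 MWh per home per year is consistent with typical U.S. residential consumption, so the comparison holds for the low end of the estimate range.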
Research has also shown that LLMs can have dramatically lower environmental impact than human labor for equivalent output. For a typical LLM like Llama-3-70B, the human-to-LLM emissions ratio ranges from 40:1 to 150:1, meaning a human produces 40 to 150 times more carbon than the LLM for the same unit of output [40]. Optimization techniques (quantization, efficient serving, renewable-powered data centers) continue to improve the efficiency of LLM deployment.
As of early 2026, the LLM field is characterized by several major trends.
Million-token context windows are now standard among frontier models. Claude 4.6 and Gemini 3.1 Pro both offer 1-million-token windows, and LLaMA 4 Scout pushes to 10 million tokens. These expanded windows enable processing of entire codebases, book-length documents, and multi-hour conversation histories in a single pass.
Agentic capabilities have become a defining feature. Frontier models can use tools, browse the web, write and execute code, manage files, and carry out multi-step tasks with minimal human supervision. Frameworks built on MCP and A2A allow agents to connect to external services and APIs through standardized protocols. Multi-agent systems, where orchestrated teams of specialized agents collaborate on tasks, saw a 1,445% increase in interest from Q1 2024 to Q2 2025 according to Gartner [37].
Reasoning models represent a distinct category. OpenAI's o1/o3 series and DeepSeek R1 use extended internal "thinking" to solve complex problems, trading speed for accuracy on mathematical, scientific, and coding tasks.
Mixture-of-experts architectures have become widespread, allowing models to scale total parameter counts into the hundreds of billions or trillions while keeping inference costs practical by activating only a fraction of parameters per token.
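The routing idea behind mixture-of-experts can be shown in a few lines: a gate scores every expert, but only the top-k actually run, so per-token compute grows with k rather than with the total expert count. This is a pure-Python toy with scalar "experts" and hand-picked gate scores, not a real MoE layer:

```python
# Minimal sketch of mixture-of-experts (MoE) routing: score all experts,
# run only the top-k, and mix their outputs with renormalized weights.
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(token, experts, gate_scores, k=2):
    """Route `token` to the k highest-scoring experts and mix their outputs."""
    top = sorted(range(len(experts)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    weights = softmax([gate_scores[i] for i in top])  # renormalize over top-k
    return sum(w * experts[i](token) for w, i in zip(weights, top))

# Eight tiny "experts"; only two execute per token despite eight being defined.
experts = [lambda x, s=s: x * s for s in range(1, 9)]
scores = [0.1, 2.0, 0.3, 1.5, 0.2, 0.05, 0.4, 0.0]
print(moe_layer(10.0, experts, scores, k=2))
```

With eight experts and k=2, only a quarter of the parameters are touched per token; frontier MoE models apply the same principle with far larger expert counts, which is how trillion-parameter totals remain practical to serve.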
The open-weight ecosystem continues to mature. Models like LLaMA 4, DeepSeek V3, Mistral Large 3, and Qwen 3 provide near-frontier capabilities with full weight access, enabling fine-tuning, local deployment, and research that would be impossible with closed models.
Hybrid architectures are beginning to appear. NVIDIA's Nemotron 3 family (announced December 2025) combines Mamba (a state-space model) with Transformer layers in an MoE configuration, targeting improved inference throughput and long-context efficiency for agent workloads [41].
A large language model is like a super-smart computer program that has read billions of books, articles, and web pages. By reading all that text, it learned how words and sentences fit together. When you ask it a question or give it a task, it figures out what words should come next, one at a time, to write a helpful answer. It can do lots of things: translate languages, answer questions, write stories, help with homework, or even write computer code. But it is not perfect. Sometimes it makes things up that sound right but are not true, because it learned patterns in language rather than actually understanding the world the way people do.