In-context learning (ICL) is the ability of large language models to perform new tasks by conditioning on examples or instructions provided directly in the input prompt, without any updates to the model's weights. The model reads a few input-output examples (demonstrations) and then generates the correct output for a new query, effectively "learning" the task from the context alone. This capability was first systematically documented in the GPT-3 paper by Brown et al. in 2020, where it was shown that a sufficiently large language model could perform tasks ranging from translation to arithmetic simply by being shown a handful of examples in the prompt [1].
In-context learning has fundamentally changed how practitioners interact with language models. Rather than fine-tuning a separate model for each task, users can describe the task through examples and natural language instructions. This flexibility is the foundation of prompt engineering, and it explains why general-purpose language models have largely displaced task-specific models in production NLP systems.
The GPT-3 paper, "Language Models are Few-Shot Learners" (Brown et al., 2020), introduced in-context learning as a core capability of large autoregressive models. The authors evaluated GPT-3 (175 billion parameters) on dozens of NLP benchmarks under three conditions: zero-shot, one-shot, and few-shot [1].
Across these conditions, GPT-3 demonstrated remarkably strong performance without any gradient updates. On some tasks, few-shot GPT-3 matched or exceeded the performance of models that had been specifically fine-tuned on thousands of labeled examples. For example, on the TriviaQA benchmark, few-shot GPT-3 achieved state-of-the-art performance, outperforming fine-tuned T5-11B [1].
Critically, the paper showed that in-context learning ability scaled with model size. Larger models were consistently better at learning from in-context examples. While a 125M parameter model showed minimal improvement from zero-shot to few-shot, the 175B parameter model showed dramatic gains. This suggested that in-context learning is an emergent capability that arises from sufficient scale in both model parameters and training data [1].
| Setting | Description | Example Format | GPT-3 175B Performance (avg. across tasks) |
|---|---|---|---|
| Zero-shot | Task description only | "Translate to French: [input]" | Moderate |
| One-shot | Description + 1 example | Description + "sea otter -> loutre de mer" + [input] | Good |
| Few-shot | Description + K examples | Description + K examples + [input] | Strong (often near fine-tuned) |
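The three prompt formats in the table can be assembled mechanically. The sketch below is illustrative: the instruction wording and the `build_prompt` helper are placeholders, not the exact prompts used in the GPT-3 evaluations.

```python
# Illustrative construction of zero-, one-, and few-shot prompts for the
# translation task above. Instruction text and examples are placeholders.

def build_prompt(instruction, examples, query):
    """Assemble a prompt: instruction, optional demonstrations, then the query."""
    lines = [instruction]
    for src, tgt in examples:          # each demonstration is an input-output pair
        lines.append(f"{src} -> {tgt}")
    lines.append(f"{query} ->")        # the model completes after the arrow
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]

zero_shot = build_prompt("Translate English to French:", [], "plush giraffe")
one_shot = build_prompt("Translate English to French:", demos[:1], "plush giraffe")
few_shot = build_prompt("Translate English to French:", demos, "plush giraffe")
```

The only difference between the three settings is how many demonstration lines precede the query; the model's weights are identical in all cases.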
In zero-shot in-context learning, the model receives only a task instruction with no demonstrations. The model must rely entirely on its pre-training knowledge to understand what is being asked and how to respond. For example, a prompt might read: "Classify the following movie review as positive or negative: [review text]." Zero-shot performance depends heavily on how well the instruction aligns with patterns the model encountered during pre-training [2].
Zero-shot capabilities have improved substantially with instruction tuning, where models are further trained on datasets of instruction-response pairs. Models like FLAN-T5, ChatGPT, and Claude are explicitly trained to follow instructions, which dramatically improves their zero-shot performance compared to base language models.
Few-shot learning provides the model with a small number of demonstrations (typically 2-32 examples) in the prompt. Each demonstration consists of an input paired with its correct output. The model uses these examples to infer the task pattern and apply it to the final query. Few-shot ICL is the most commonly studied and practically used form of in-context learning [1].
The number of demonstrations is constrained by the model's context window. Early models like GPT-3 had a 2,048-token context window, limiting demonstrations to perhaps 10-20 examples depending on their length. Modern models with 128K-1M+ token context windows can accommodate far more examples, enabling a new regime called many-shot learning.
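The demonstration budget implied by a context window is simple arithmetic. The token counts below are illustrative assumptions, not measurements from any particular tokenizer.

```python
# Back-of-the-envelope demonstration budget for a given context window.
# All token counts are illustrative assumptions.

def max_demonstrations(context_window, tokens_per_example,
                       instruction_tokens=50, query_tokens=200,
                       output_reserve=256):
    """Tokens left after the instruction, query, and reserved output,
    divided by the cost of one demonstration."""
    budget = context_window - instruction_tokens - query_tokens - output_reserve
    return max(budget // tokens_per_example, 0)

# GPT-3-era 2,048-token window vs. a modern 128K window, at ~100 tokens/example
print(max_demonstrations(2_048, 100))    # → 15
print(max_demonstrations(128_000, 100))  # → 1274
```

Under these assumptions the jump from a 2K to a 128K window moves the budget from the few-shot regime (~15 examples) into the many-shot regime (over a thousand).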
Many-shot in-context learning, studied systematically by Agarwal et al. at Google DeepMind in 2024, leverages the massively expanded context windows of modern LLMs to provide hundreds or thousands of demonstrations. Using Gemini 1.5 Pro with its 1M-token context window, the researchers found that ICL performance continues to improve as the number of demonstrations increases into the hundreds and thousands, following a power-law scaling relationship [3].
Many-shot ICL can significantly outperform few-shot ICL and, in some cases, match or exceed fine-tuned models. The study showed gains on complex tasks including mathematical problem solving, question answering, code generation, and translation of low-resource languages. Notably, many-shot ICL enabled the model to learn genuinely new skills from the prompt alone, such as translating a language not well represented in the training data, by providing a sufficiently large grammar reference and example translations [3].
However, many-shot ICL also introduces risks. Anthropic documented "many-shot jailbreaking" in 2024, where providing hundreds of examples of the model complying with harmful requests can override safety training, exploiting the same in-context learning mechanism that makes many-shot ICL powerful for legitimate tasks [4].
The mechanism underlying in-context learning is one of the most actively debated topics in language model research. Several competing (and potentially complementary) hypotheses have been proposed.
The simplest hypothesis is that in-context examples help the model identify a task it already learned during pre-training, rather than teaching it something genuinely new. Under this view, the demonstrations serve as a kind of lookup key that retrieves the relevant behavior from the model's weights. The model already "knows" how to perform sentiment classification, translation, and so on; the examples simply disambiguate which task is being requested [5].
Evidence for this hypothesis comes from studies showing that in-context learning can work even with random or incorrect labels in the demonstrations. Min et al. (2022) found that replacing correct labels with random labels in few-shot prompts degraded performance only modestly on many tasks, suggesting the model was largely recognizing the task format rather than learning the input-label mapping from the examples [5].
Xie et al. (2022) proposed that in-context learning can be understood as implicit Bayesian inference. In this framework, the pre-training data is modeled as a mixture of latent "concepts" (topics, tasks, styles), and the model learns to infer which concept generated the current prompt based on the provided examples. The demonstrations provide evidence about the latent concept, and the model's prediction is effectively the posterior predictive distribution given this evidence [6].
Formally, if θ represents the latent concept and D the demonstrations, the model computes something analogous to p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ. The demonstrations narrow the posterior over concepts, concentrating probability on the relevant task. This framework explains why more examples generally improve performance (more evidence tightens the posterior) and why model scale helps (larger models can represent more concepts and compute better posteriors) [6].
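The posterior-concentration effect can be shown with a toy model. The two labeling rules and the noise level below are invented purely for illustration; the point is that each consistent demonstration multiplies the evidence for the matching concept.

```python
# Toy illustration of the Bayesian-inference view of ICL: two latent
# "concepts" (labeling rules), and demonstrations as evidence about which
# concept is active. Rules and noise level are invented for illustration.

def concept_a(x):  # rule A: label is 1 when the input is positive
    return 1 if x > 0 else 0

def concept_b(x):  # rule B: the opposite labeling
    return 0 if x > 0 else 1

NOISE = 0.1  # assumed probability that a demonstration's label is flipped

def likelihood(concept, demos):
    """p(demonstrations | concept) under independent label noise."""
    p = 1.0
    for x, y in demos:
        p *= (1 - NOISE) if concept(x) == y else NOISE
    return p

def posterior_a(demos, prior_a=0.5):
    """p(concept A | demonstrations) via Bayes' rule."""
    pa = prior_a * likelihood(concept_a, demos)
    pb = (1 - prior_a) * likelihood(concept_b, demos)
    return pa / (pa + pb)

# Demonstrations all consistent with concept A: the posterior concentrates.
demos = [(2, 1), (-1, 0), (3, 1), (-4, 0)]
for k in range(len(demos) + 1):
    print(k, round(posterior_a(demos[:k]), 4))
```

With zero demonstrations the posterior is the 50/50 prior; each additional consistent example multiplies the odds in favor of concept A, mirroring the claim that more examples tighten the posterior.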
Von Oswald et al. (2023) provided a striking theoretical result: linear transformer layers can implement a single step of gradient descent on a regression loss defined by the in-context examples. They showed an explicit weight construction for a single linear self-attention layer that makes the data transformation equivalent to one step of gradient descent on a least-squares objective, where the "training data" consists of the in-context examples [7].
Extending this, they demonstrated that multi-layer transformers can implement multiple steps of an algorithm similar to gradient descent (which they termed GD++). Trained transformers converge to solutions that closely match this GD++ algorithm. This means that during the forward pass, the transformer is effectively running an optimization algorithm, with each layer corresponding to one optimization step, using the in-context examples as training data [7].
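The single-step correspondence can be checked numerically. The sketch below does not reproduce the paper's explicit weight construction; it verifies the resulting identity: an unnormalized linear-attention readout over the in-context pairs equals the prediction after one gradient step (from w = 0, learning rate η) on the least-squares loss.

```python
import numpy as np

# Numerical check of the von Oswald et al. correspondence: linear attention
# over in-context (x_i, y_i) pairs matches one gradient-descent step on the
# least-squares loss, starting from w = 0.

rng = np.random.default_rng(0)
d, n = 4, 16
X = rng.normal(size=(n, d))             # in-context inputs x_1..x_n
w_true = rng.normal(size=d)
y = X @ w_true                          # in-context targets y_1..y_n
x_q = rng.normal(size=d)                # query input
eta = 0.1                               # learning rate

# One explicit gradient step on L(w) = 0.5 * sum_i (w @ x_i - y_i)^2 at w = 0:
grad_at_zero = -X.T @ y                 # gradient of L at w = 0
w_one_step = -eta * grad_at_zero        # w_1 = eta * sum_i y_i x_i
pred_gd = w_one_step @ x_q

# Unnormalized linear attention: query x_q attends to keys x_i with values
# y_i (no softmax), scaled by eta.
pred_attn = eta * np.sum((X @ x_q) * y)

print(np.isclose(pred_gd, pred_attn))   # → True
```

Both expressions reduce to η Σᵢ yᵢ (xᵢ · x_q), which is why a single linear self-attention layer suffices for one optimization step.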
This perspective connects in-context learning to mesa-optimization, the idea that a learned model can itself contain an internal optimization process. The transformer's weights encode an optimizer, and the in-context examples are the data that this internal optimizer trains on during inference.
Garg et al. (2022) provided foundational empirical evidence that transformers can learn to perform in-context learning over various function classes. They trained transformers from scratch on synthetic datasets where each training sequence consisted of input-output pairs drawn from a function class (linear functions, two-layer neural networks, decision trees, etc.), followed by a query input [8].
The trained transformers successfully learned to perform in-context learning on these function classes, matching or approaching the performance of optimal estimators (such as ordinary least squares for linear regression). Remarkably, the models generalized to function classes not seen during training, suggesting that they learned a general-purpose learning algorithm rather than merely memorizing specific function classes [8].
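The training setup can be sketched for the linear-function class. Each sequence carries a fresh random task, and the ordinary-least-squares fit to the demonstrations is the optimal baseline a perfect in-context learner should match; the dimensions and sampling choices below are illustrative, not the paper's exact configuration.

```python
import numpy as np

# Sketch of the Garg et al.-style setup for ICL on linear functions: each
# sequence is (x_1, y_1, ..., x_k, y_k, x_query) with a fresh random linear
# function per sequence. Dimensions and distributions are illustrative.

rng = np.random.default_rng(1)

def sample_sequence(d=8, k=20):
    """One training sequence: k demonstrations from a random linear task,
    plus a query input and its true answer."""
    w = rng.normal(size=d)              # a fresh task for this sequence
    X = rng.normal(size=(k, d))
    y = X @ w
    x_q = rng.normal(size=d)
    return X, y, x_q, w @ x_q

def ols_predict(X, y, x_q):
    """Least-squares baseline: the optimal in-context predictor here."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat @ x_q

X, y, x_q, target = sample_sequence()
print(np.isclose(ols_predict(X, y, x_q), target))  # noiseless, k > d → exact
```

A transformer trained on many such sequences is evaluated by how closely its query prediction tracks this OLS baseline.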
Olsson et al. (2022), from Anthropic, identified specific attention patterns called "induction heads" that appear to be a key mechanism for in-context learning. An induction head is a circuit consisting of two attention heads working together: one head copies information from previous tokens to later positions, and the other uses this information to predict the next token by looking for matching patterns earlier in the context [9].
Induction heads implement a simple but powerful algorithm: if the model has seen the pattern [A][B] earlier in the context, and it now encounters [A] again, the induction head predicts [B]. This pattern-matching capability generalizes beyond exact token matches to semantic similarity, enabling more abstract forms of in-context learning. The researchers found that induction heads form during a specific phase transition in training and that their emergence coincides with a sharp improvement in in-context learning ability [9].
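The exact-match core of this algorithm fits in a few lines. The sketch below handles only literal token matches; real induction heads operate on learned representations and generalize to fuzzy, semantic matches.

```python
# Toy version of the induction-head algorithm for exact token matches: if the
# pattern [A][B] occurred earlier in the context and the current token is [A]
# again, predict [B]. Real induction heads also handle fuzzy matches.

def induction_predict(tokens):
    """Predict the token after tokens[-1] by copying what followed its most
    recent earlier occurrence; None if there is no earlier match."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards for a match
        if tokens[i] == current:
            return tokens[i + 1]
    return None

context = ["The", "cat", "sat", ".", "The", "cat"]
print(induction_predict(context))  # → "sat"
```

Seeing "cat" again, the toy head looks back, finds the earlier "cat", and copies forward the token that followed it.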
In-context learning performance is highly sensitive to both which examples are chosen and the order in which they appear in the prompt. Different selections of demonstrations from the same training set can produce dramatically different performance, with variance sometimes exceeding 20 percentage points on classification tasks [10].
| Factor | Effect on ICL Performance | Mitigation Strategy |
|---|---|---|
| Example selection | Different examples can cause 10-30% variance | Select examples similar to the query (e.g., using embedding similarity) |
| Example order | Permuting examples can cause 5-20% variance | Use calibration, test multiple orderings, or curriculum ordering |
| Label balance | Imbalanced demonstrations bias predictions | Ensure balanced representation of classes |
| Prompt format | Template wording affects performance significantly | Test multiple templates, use instruction-tuned models |
| Recency bias | Model may over-weight the last few examples | Place the most representative examples last |
Several strategies have been developed to address example selection sensitivity. Retrieval-based selection chooses demonstrations that are semantically similar to the query, typically using embedding cosine similarity. Liu et al. (2022) showed that selecting examples nearest to the test input in embedding space significantly outperforms random selection [10].
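Retrieval-based selection reduces to a nearest-neighbor search in embedding space. In the sketch below the embeddings are random placeholders; in practice they would come from a sentence-embedding model.

```python
import numpy as np

# Sketch of retrieval-based demonstration selection: pick the k candidate
# examples whose embeddings have the highest cosine similarity to the query
# embedding. Embeddings here are random placeholders.

def select_demonstrations(query_emb, example_embs, k=4):
    """Indices of the k examples nearest to the query by cosine similarity,
    highest similarity first."""
    q = query_emb / np.linalg.norm(query_emb)
    E = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = E @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(2)
pool = rng.normal(size=(100, 32))               # 100 candidate example embeddings
query = pool[17] + 0.01 * rng.normal(size=32)   # query very close to example 17
print(select_demonstrations(query, pool)[0])    # → 17
```

The selected indices would then be used to pull the corresponding input-output pairs into the prompt, in similarity order.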
For example order, research has explored curriculum-based ordering (easy to hard examples based on perplexity), diversity-maximizing orderings, and calibration techniques that estimate and correct for the model's bias toward certain labels.
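One calibration technique in this family is contextual calibration (Zhao et al., 2021): query the model with a content-free input such as "N/A" to measure its label bias under the current prompt, then divide that bias out of real predictions. The probabilities below are made up for illustration.

```python
import numpy as np

# Sketch of contextual calibration: rescale predicted label probabilities by
# the bias the model shows on a content-free input, then renormalize.
# All probability values here are made up for illustration.

def calibrate(label_probs, content_free_probs):
    """Divide out the bias measured on a content-free input and renormalize."""
    scaled = np.asarray(label_probs) / np.asarray(content_free_probs)
    return scaled / scaled.sum()

# The prompt biases the model toward "positive": even "N/A" gets 70% positive.
bias = [0.7, 0.3]           # p(positive | "N/A"), p(negative | "N/A")
raw = [0.6, 0.4]            # raw prediction for an ambiguous review
print(calibrate(raw, bias)) # "negative" wins once the bias is removed
```

After calibration, a raw 60/40 split in favor of the over-predicted label flips, since the model assigned that label 70% probability even to contentless input.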
In-context learning ability improves with model scale along multiple dimensions: larger models extract more benefit from each demonstration, show much larger gains when moving from zero-shot to few-shot prompting, and can exploit longer sequences of examples [1][3].
A central question is whether in-context learning truly enables learning of novel tasks or merely activates pre-existing capabilities. Evidence suggests both occur. For tasks closely aligned with pre-training (sentiment analysis, translation between common languages), ICL primarily activates existing knowledge. For genuinely novel tasks (new symbolic rules, rare language translation with grammar references), many-shot ICL can achieve performance that goes substantially beyond what the model could do without demonstrations [3][8].
In-context learning is the theoretical foundation underlying prompt engineering. The practical techniques of prompt engineering, including writing effective instructions, selecting good demonstrations, structuring prompts with clear formatting, and using techniques like chain-of-thought prompting, are all methods for improving the quality of in-context learning.
Chain-of-thought prompting (Wei et al., 2022) is a particularly important extension. By including reasoning steps in the demonstrations (not just input-output pairs but input-reasoning-output triples), the model is induced to generate intermediate reasoning steps for new queries. This dramatically improves performance on mathematical, logical, and multi-step reasoning tasks. Chain-of-thought can be seen as teaching the model a problem-solving procedure through in-context learning, rather than just the input-output mapping [11].
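A chain-of-thought demonstration is an input-reasoning-output triple rather than a bare input-output pair. The sketch below is a placeholder in the style of the arithmetic word problems discussed in the CoT literature, not a verbatim prompt from the paper.

```python
# Illustrative chain-of-thought prompt: the demonstration includes the
# intermediate reasoning steps, not just the final answer. Wording is a
# placeholder in the style of arithmetic CoT examples.

cot_demo = (
    "Q: A cafeteria had 23 apples. They used 20 and bought 6 more. "
    "How many apples do they have?\n"
    "A: They started with 23 apples. After using 20, they had 23 - 20 = 3. "
    "After buying 6 more, they had 3 + 6 = 9. The answer is 9.\n"
)

query = ("Q: Roger has 5 balls. He buys 2 cans of 3 balls each. "
         "How many balls does he have?\nA:")

prompt = cot_demo + "\n" + query
# The worked reasoning in the demonstration induces the model to emit its own
# intermediate steps before the final answer for the new query.
```

Compare this with a standard few-shot demonstration, which would contain only "A: 9" with no intermediate steps.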
Other prompting techniques that build on ICL include self-consistency, which samples multiple chain-of-thought completions and takes a majority vote over the final answers, and meta-prompting, where an LLM is used to generate or refine its own prompt.
Despite dramatic growth in context windows (from 2K tokens in GPT-3 to 1M+ in modern models), the context window still imposes a hard limit on the number and length of demonstrations. For tasks that require many long examples (e.g., code generation with complex specifications), even a 128K token context window may be insufficient. Additionally, models may not utilize information uniformly across very long contexts; the "lost in the middle" phenomenon, where models attend less to information in the middle of long contexts, can affect ICL quality with many demonstrations [12].
In-context learning cannot teach a model computations that are fundamentally outside its architectural capacity. A transformer with finite depth and width has bounded computational complexity per forward pass. Tasks requiring unbounded recursion, very large working memory, or computation that exceeds the model's effective capacity will not be learnable through in-context examples alone, regardless of how many demonstrations are provided [8].
ICL remains more brittle than fine-tuning for many tasks. Small changes to the prompt format, example selection, or wording can cause significant performance fluctuations. Fine-tuned models, having updated their weights on the training data, tend to be more robust to input variations. For applications requiring reliable, consistent performance, fine-tuning (or PEFT methods like LoRA) is often preferred over in-context learning [2].
Including demonstrations in the prompt increases the input length, which increases both latency and cost (since most API providers charge per token). A few-shot prompt with 20 examples might be 10x longer than a zero-shot prompt, incurring 10x the input token cost. For high-volume applications, this cost can be substantial compared to fine-tuning, which has a one-time training cost but no per-request overhead [3].
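The cost comparison is straightforward arithmetic. The token counts and the per-token price below are illustrative assumptions, not any provider's actual pricing.

```python
# Rough per-request input-token cost comparison between zero-shot and
# few-shot prompting. Token counts and price are illustrative assumptions.

def input_cost(prompt_tokens, price_per_million=2.00):
    """Dollar cost of the input tokens at an assumed per-million-token price."""
    return prompt_tokens * price_per_million / 1_000_000

zero_shot_tokens = 150                  # instruction + query only
few_shot_tokens = 150 + 20 * 70         # plus 20 demonstrations at ~70 tokens

per_request_zero = input_cost(zero_shot_tokens)
per_request_few = input_cost(few_shot_tokens)
print(round(per_request_few / per_request_zero, 1))   # → 10.3

# At 1M requests/month, the monthly overhead of the demonstrations alone:
print(round((per_request_few - per_request_zero) * 1_000_000, 2))  # → 2800.0
```

Under these assumptions the demonstrations alone add thousands of dollars per month at high volume, which is the kind of recurring overhead that a one-time fine-tuning cost can undercut.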
In-context learning connects to several broader theoretical frameworks:
| Framework | Key Idea | Primary Reference |
|---|---|---|
| Meta-learning | ICL is a form of "learning to learn" from pre-training | Brown et al., 2020 [1] |
| Bayesian inference | Demonstrations provide evidence for latent task inference | Xie et al., 2022 [6] |
| Gradient descent in forward pass | Attention layers implement optimization steps | von Oswald et al., 2023 [7] |
| Mesa-optimization | Pre-trained models contain internal optimizers | Hubinger et al., 2019 |
| Function learning | Transformers learn general-purpose function approximation algorithms | Garg et al., 2022 [8] |
| Kernel regression | Attention mechanism implements a form of kernel smoothing | Han et al., 2023 |
The gradient descent perspective has been particularly influential. It suggests that pre-training on diverse text implicitly trains the transformer to implement a general-purpose learning algorithm. When given in-context examples at inference time, this internal algorithm runs on those examples, producing behavior adapted to the demonstrated task. This unifies the seemingly separate phenomena of pre-training and in-context learning: pre-training creates the optimizer, and in-context examples are the data it optimizes on.
As of early 2026, in-context learning remains a central paradigm for interacting with language models, and research continues on multiple fronts.
The expansion of context windows to 1M+ tokens has enabled many-shot ICL to become practically relevant. Anthropic's Claude and Google's Gemini both support context windows large enough to include hundreds of detailed demonstrations, and empirical work consistently shows that more examples improve performance up to very high counts.
Theoretical understanding has deepened. Building on the work of Garg et al. and von Oswald et al., researchers in 2025 proved that transformers can act as universal in-context learners under certain conditions, approximating any continuous function class given sufficient demonstrations and model capacity. Li et al. (2025) showed that transformers can implement in-context learning algorithms that approximate Bayes-optimal predictors for various function classes [13].
The boundary between in-context learning and fine-tuning has blurred. Methods like context distillation compress the effect of long in-context examples into shorter prompts or adapter weights, combining the flexibility of ICL with the efficiency of fine-tuning. Conversely, techniques like activation steering and representation engineering modify model behavior at inference time in ways that resemble fine-tuning but operate without gradient updates.
In-context learning for agentic applications has become an active area. When LLMs are used as AI agents that take actions in the world (calling tools, browsing the web, writing code), the "demonstrations" in the prompt often include examples of multi-step action sequences. This form of ICL, sometimes called in-context reinforcement learning, allows agents to adapt their behavior to new environments or tools without retraining [14].
The sensitivity and brittleness issues documented in earlier research have been partially mitigated by better models (instruction tuning helps), better prompting techniques (chain-of-thought, self-consistency), and meta-prompting approaches where an LLM generates its own optimal prompt. However, for applications requiring maximum reliability, the combination of ICL with lightweight fine-tuning (using PEFT methods) remains the most robust approach.