See also: large language model, few-shot learning, prompt engineering, transfer learning, fine-tuning
In-context learning (ICL) is the ability of a large language model to learn a new task at inference time by conditioning on a prompt that contains a few input-output examples (demonstrations), without updating any of the model's parameters. The term was introduced by Brown et al. in the GPT-3 paper "Language Models are Few-Shot Learners" (2020), where the authors showed that a sufficiently large autoregressive language model could perform a wide range of natural language processing tasks simply by being given a handful of examples in the prompt.
What makes in-context learning remarkable is that it requires no gradient descent, no backpropagation, and no changes to the model's weights. The model receives a sequence of demonstrations followed by a query input, and it produces the appropriate output by leveraging patterns it learned during pretraining. This stands in sharp contrast to the traditional machine learning paradigm of fine-tuning, where a pretrained model's weights are explicitly updated on task-specific data.
Since the GPT-3 paper, in-context learning has become one of the most widely studied phenomena in modern deep learning. Researchers have proposed multiple theoretical explanations for why it works, identified the factors that influence its effectiveness, and documented its limitations. This article covers the definition, mechanisms, theoretical frameworks, practical considerations, and open questions surrounding in-context learning.
In a standard in-context learning setup, the user constructs a prompt consisting of three components:

- An optional task instruction describing what the model should do.
- A small set of demonstrations, each pairing an example input with its desired output.
- The query input, for which the model must produce an output.
For example, a sentiment analysis prompt might look like this:
Review: "The food was excellent." Sentiment: Positive
Review: "Terrible service, never coming back." Sentiment: Negative
Review: "The ambiance was lovely but the pasta was overcooked." Sentiment:
The model processes this entire sequence and generates a completion for the final, incomplete example. Because the model's weights are frozen during this process, all of the "learning" happens through the forward pass of the transformer architecture. The attention mechanism allows the model to attend to the demonstration examples, identify the pattern connecting inputs to outputs, and apply that pattern to the query.
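As a concrete illustration, the sketch below assembles a few-shot prompt like the one above from a list of demonstrations; the helper name, data, and separator format are illustrative choices rather than any standard API.

```python
# Minimal sketch of few-shot prompt construction (names and format are illustrative).
demonstrations = [
    ("The food was excellent.", "Positive"),
    ("Terrible service, never coming back.", "Negative"),
]
query = "The ambiance was lovely but the pasta was overcooked."

def build_prompt(demos, query):
    """Concatenate demonstrations and the query into a single prompt string."""
    lines = [f'Review: "{text}" Sentiment: {label}' for text, label in demos]
    lines.append(f'Review: "{query}" Sentiment:')
    return "\n".join(lines)

prompt = build_prompt(demonstrations, query)
# The prompt is then sent to a frozen language model, which completes the
# final line with the predicted label (via any text-completion interface).
print(prompt)
```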
The self-attention layers in a transformer are central to in-context learning. Each attention head can, in principle, compare the query input against the demonstration examples and extract task-relevant information. As the prompt passes through multiple transformer layers, the model progressively refines its internal representation of the task.
Research by Olsson et al. (2022) identified specific attention heads called "induction heads" that play a mechanistic role in this process. An induction head detects a pattern of the form [A][B]...[A] in the input sequence and predicts that B should follow the second occurrence of A. While this is a simple pattern-completion algorithm, the authors present evidence that induction heads serve as a building block for the more general in-context learning ability observed in large transformers.
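A rough sketch of the pattern-completion rule an induction head implements, written on token strings and ignoring the attention machinery that realizes it inside the network, might look like this:

```python
def induction_completion(tokens):
    """Toy version of the induction-head rule: find the most recent earlier
    occurrence of the final token and predict the token that followed it."""
    last = tokens[-1]
    for i in range(len(tokens) - 3, -1, -1):   # earlier positions, most recent first
        if tokens[i] == last:
            return tokens[i + 1]               # [A][B] ... [A]  ->  predict [B]
    return None                                # the token has not occurred before

# "the cat sat . the cat" -> "sat"
print(induction_completion(["the", "cat", "sat", ".", "the", "cat"]))
```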
In-context learning is closely related to several prompting paradigms that differ primarily in the number of demonstration examples provided.
| Paradigm | Number of demonstrations | Description |
|---|---|---|
| Zero-shot | 0 | The model receives only a task instruction or query with no examples. It must rely entirely on knowledge acquired during pretraining. |
| One-shot | 1 | The model receives a single demonstration example before the query. |
| Few-shot | 2 to ~30 | The model receives a small set of demonstrations. This is the most common ICL setting. |
| Many-shot | Dozens to thousands | The model receives a large number of demonstrations, enabled by expanded context windows. |
Brown et al. (2020) evaluated GPT-3 across all of these settings and found that performance generally improved as more demonstrations were provided. They also found that the gap between zero-shot and few-shot performance widened with model scale: larger models benefited more from additional demonstrations.
Agarwal et al. (2024) from Google DeepMind extended this analysis to the many-shot regime, showing that models with long context windows (such as Gemini 1.5 Pro) can achieve significant performance gains when given hundreds or even thousands of in-context examples. Their work also introduced "Reinforced ICL," which replaces human-written demonstrations with model-generated chain-of-thought rationales, making many-shot ICL practical even when human-labeled examples are scarce.
One of the central questions in in-context learning research is: what computational process is the model actually performing when it learns from demonstrations? Several complementary theoretical frameworks have been proposed.
Multiple research groups have independently drawn connections between in-context learning and gradient-based optimization.
Dai et al. (2023) showed that the transformer attention mechanism has a mathematical "dual form" that resembles gradient descent. In their framework, the demonstration examples produce "meta-gradients" that are implicitly applied to the model's representations, effectively fine-tuning the model's behavior for the current task without actually modifying the stored weights. They provided empirical evidence that in-context learning and explicit fine-tuning produce similar behavioral patterns across multiple tasks.
Von Oswald et al. (2023) proved a stronger result for linear self-attention layers: a single linear self-attention layer can exactly replicate one step of gradient descent on a regression loss. They showed that trained transformers become "mesa-optimizers," meaning they learn to implement an optimization algorithm (gradient descent) within their forward pass. Their experiments demonstrated that transformers can even learn to apply curvature corrections analogous to second-order optimization methods, outperforming plain gradient descent on regression tasks.
Akyurek et al. (2023) investigated what specific learning algorithm in-context learning implements by studying transformers trained on linear regression tasks. They found that the in-context predictions of trained transformers closely match the predictions of gradient descent, ridge regression, and exact least-squares regression, depending on the model's depth and the noise level in the training data.
| Study | Key finding | Setting |
|---|---|---|
| Dai et al. (2023) | Attention has a dual form of gradient descent; ICL behaves like implicit fine-tuning | Real NLP tasks |
| Von Oswald et al. (2023) | Linear self-attention exactly implements gradient descent; transformers are mesa-optimizers | Linear regression |
| Akyurek et al. (2023) | ICL predictions match gradient descent, ridge regression, or least-squares depending on depth/noise | Linear models |
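The correspondence in the linear case can be checked directly. The toy numpy sketch below (sizes and the learning rate are arbitrary choices, and it demonstrates only the underlying algebraic identity, not Von Oswald et al.'s trained transformer) shows that the prediction after one gradient-descent step from zero weights equals a linear-attention-style weighted sum over the demonstration targets:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 32                       # input dimension, number of demonstrations (toy values)
w_true = rng.normal(size=d)        # ground-truth linear task
X = rng.normal(size=(n, d))        # demonstration inputs x_1..x_n
y = X @ w_true                     # demonstration targets y_i = <w_true, x_i>
x_q = rng.normal(size=d)           # query input
eta = 0.1                          # learning rate, arbitrary for the comparison

# One gradient-descent step on the squared loss, starting from w0 = 0,
# followed by a prediction on the query.
w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y) / n
w1 = w0 - eta * grad
pred_gd = w1 @ x_q

# The same prediction written as a linear-attention readout: a sum over the
# demonstration targets weighted by inner products between each x_i and the query.
pred_attn = (eta / n) * np.sum(y * (X @ x_q))

print(pred_gd, pred_attn)          # identical up to floating-point error
```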
Xie et al. (2022) proposed an alternative theoretical framework, arguing that in-context learning is best understood as implicit Bayesian inference. Their central insight is that when pretraining data consists of documents drawn from a mixture of latent concepts (for example, different topics, writing styles, or domains), the language model learns to infer the latent concept underlying a given sequence. At inference time, the demonstration examples in a prompt provide evidence about the latent concept, and the model performs approximate Bayesian updating to narrow down which concept best explains the examples.
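Schematically, this view writes the model's prediction as a marginalization over latent concepts, with the prompt acting as evidence about which concept is in play (the notation below is an informal summary of the framework, not a formula from the paper):

$$
p(\text{output} \mid \text{prompt}) = \int_{\theta} p(\text{output} \mid \text{prompt}, \theta)\; p(\theta \mid \text{prompt})\, d\theta
$$

As demonstrations accumulate, the posterior $p(\theta \mid \text{prompt})$ concentrates on the concept that best explains them, so the prediction increasingly reflects the demonstrated task.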
To test this theory, Xie et al. constructed GINC (Generative In-Context learning dataset), a synthetic pretraining dataset with explicit latent concept structure based on mixtures of hidden Markov models. They proved theoretically that in-context learning emerges in this setting and showed empirically that both transformers and LSTMs pretrained on GINC exhibit in-context learning behavior. This work was published at ICLR 2022.
The Bayesian perspective and the gradient descent perspective are not mutually exclusive. Bayesian inference can be implemented through iterative optimization, and some researchers have argued that the two views describe the same underlying computation at different levels of abstraction.
Olsson et al. (2022), working at Anthropic, proposed a mechanistic explanation rooted in the internal circuitry of transformers. They identified "induction heads" as a specific type of attention head that implements a token-level pattern-completion algorithm: given a sequence of the form [A][B]...[A], an induction head predicts that B will follow.
The key evidence comes from training dynamics. The authors observed that induction heads emerge at a specific point during training, and this emergence coincides precisely with a sudden, sharp increase in the model's in-context learning ability (measured as decreasing loss at later positions in the sequence). This co-occurrence is visible as a distinctive "bump" in the training loss curve.
For small, attention-only transformer models, Olsson et al. presented strong causal evidence: ablating induction heads directly impairs in-context learning performance. For larger models with MLP layers, the evidence is correlational, since induction heads are more difficult to isolate in complex architectures. The authors hypothesize that induction heads may constitute the primary mechanism underlying in-context learning in large transformers, though this remains an area of active investigation.
Garg et al. (2022) took an empirical approach, training transformers from scratch on synthetic in-context learning tasks to study what function classes transformers can learn in-context. Their key findings include:

- Transformers can in-context learn linear functions with accuracy comparable to the optimal least-squares estimator.
- They can also in-context learn more complex function classes, including sparse linear functions, decision trees, and two-layer neural networks, in some cases matching or approaching specialized learning algorithms for those classes.
- This in-context learning ability shows some robustness to distribution shift between the prompts seen during training and those seen at evaluation time.
This work, published at NeurIPS 2022, helped establish that transformers can implement surprisingly powerful learning algorithms within their forward pass.
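As a rough sketch of the training setup used in this line of work (the dimensions and the linear function class below are illustrative choices), each synthetic prompt interleaves inputs with their function values, and the transformer is trained to predict each value from the preceding pairs in the same prompt:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_points = 8, 16                       # input dimension, pairs per prompt (toy values)

def sample_prompt():
    """One synthetic in-context regression prompt: (x_1, f(x_1), ..., x_n, f(x_n))."""
    w = rng.normal(size=d)                # a fresh random linear function per prompt
    xs = rng.normal(size=(n_points, d))
    ys = xs @ w
    return xs, ys                         # presented to the model as one interleaved sequence

xs, ys = sample_prompt()
# A transformer trained on many such prompts is then evaluated on how well it
# predicts f(x_i) given only the earlier (x, f(x)) pairs in the same prompt.
```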
A parallel line of research has investigated how the "knowledge" of a task is represented inside the model during in-context learning.
Hendel et al. (2023) discovered that in-context learning often compresses the information from all demonstration examples into a single vector in the model's residual stream, which they called a "task vector." This task vector acts as a compact representation of the task being performed. When extracted from one prompt and injected into a different forward pass (replacing the demonstrations), the task vector alone is sufficient to steer the model toward the correct task behavior. This work was published in the Findings of EMNLP 2023.
Independently and simultaneously, Todd et al. (2024) identified a closely related phenomenon they called "function vectors." Using causal mediation analysis, they showed that function vectors exist inherently within the transformer architecture and have strong causal effects on model outputs. When a function vector is added to the model's residual stream at inference time, it causes the model to apply the corresponding function, even without any demonstration examples in the prompt.
These findings suggest that in-context learning involves two stages: (1) the model reads the demonstrations and compresses them into an internal task representation, and (2) the model applies this task representation to the query input. This two-stage view aligns with both the Bayesian inference framework (where the task vector encodes the inferred latent concept) and the gradient descent framework (where the task vector encodes the implicit weight update).
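Hendel et al. and Todd et al. work with specific models, layers, and task suites; the sketch below only illustrates the general read-then-patch procedure using a small Hugging Face model, where the model name, layer index, and prompts are placeholder assumptions (there is no guarantee this particular combination steers gpt2 successfully):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()
LAYER = 6                                            # arbitrary block to read from / write to

demos = "France -> Paris\nJapan -> Tokyo\nItaly ->"  # demonstrations ending at a dummy query
query = "Spain ->"                                   # zero-shot prompt to be patched

# Stage 1: read the residual stream at the final demonstration token.
with torch.no_grad():
    out = model(**tok(demos, return_tensors="pt"), output_hidden_states=True)
task_vector = out.hidden_states[LAYER + 1][0, -1, :]  # hidden_states[i + 1] is block i's output

# Stage 2: add that vector to the query's residual stream at the same block.
def add_task_vector(module, inputs, output):
    hidden = output[0]
    hidden[:, -1, :] += task_vector
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_task_vector)
with torch.no_grad():
    patched = model(**tok(query, return_tensors="pt"))
handle.remove()

print(tok.decode(patched.logits[0, -1].argmax().item()))
```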
Model scale is one of the strongest predictors of in-context learning ability. Brown et al. (2020) observed that while zero-shot performance improves steadily with model size, few-shot performance increases more rapidly. This means that the benefit of providing demonstrations grows as models become larger. At the scale of GPT-3's 175 billion parameters, in-context learning becomes reliable across a broad range of tasks.
Wei et al. (2023) showed that model scale also affects how models use in-context demonstrations. Small models tend to rely on semantic priors from pretraining and largely ignore the input-label mappings in demonstrations. Large models, by contrast, can override their semantic priors when the demonstrations present conflicting information (for example, flipped labels where positive examples are labeled "Negative"). This ability to override priors is an emergent property that appears at sufficient scale.
The choice of which examples to include in the prompt significantly affects in-context learning performance.
Liu et al. (2022) proposed KATE (kNN-Augmented in-conText Example selection), a method that selects demonstration examples based on their semantic similarity to the query input in an embedding space. Their approach retrieves the nearest neighbors of the test input and uses them as demonstrations. This consistently outperformed random example selection across multiple NLU and NLG benchmarks.
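A minimal sketch of this kind of similarity-based retrieval is shown below, assuming a sentence-embedding encoder such as the one named here; the candidate pool, encoder choice, and value of k are illustrative, not the exact configuration used in the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Candidate pool of labeled examples and a test input (illustrative data).
pool = [
    ("The food was excellent.", "Positive"),
    ("Terrible service, never coming back.", "Negative"),
    ("Great value for the price.", "Positive"),
    ("The room was dirty and noisy.", "Negative"),
]
query = "The staff were friendly but the soup was cold."
k = 2

encoder = SentenceTransformer("all-MiniLM-L6-v2")
pool_emb = encoder.encode([text for text, _ in pool])
query_emb = encoder.encode(query)

# Cosine similarity between the query and each candidate demonstration.
sims = pool_emb @ query_emb / (
    np.linalg.norm(pool_emb, axis=1) * np.linalg.norm(query_emb)
)
nearest = np.argsort(-sims)[:k]            # indices of the k most similar examples
demonstrations = [pool[i] for i in nearest]
```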
Other research has shown that selecting diverse examples that cover different aspects of the task can also improve performance. The optimal selection strategy depends on the task, the model, and the number of demonstrations available.
Lu et al. (2022) demonstrated that the order in which demonstration examples appear in the prompt can dramatically affect performance. In some cases, the difference between the best and worst orderings spans from near state-of-the-art accuracy to random-chance accuracy. This sensitivity exists across model sizes, meaning that even the largest models are not immune to it.
The authors proposed a method based on entropy statistics to identify high-performing orderings without access to a development set. Their approach uses the language model itself to generate a synthetic probing set, then ranks candidate orderings by entropy statistics computed over the model's predictions on that set, filtering out orderings whose predictions collapse onto a single label. This method achieved a 13% relative improvement for GPT-family models across eleven text classification tasks.
A surprising finding by Min et al. (2022) revealed that correct input-label mappings in demonstrations are not always necessary for in-context learning. Randomly replacing the labels in demonstration examples barely affected performance on a range of classification and multiple-choice tasks, consistently across 12 different models including GPT-3.
What does matter, according to their analysis, is that the demonstrations convey:

- The label space: the set of possible output labels for the task.
- The distribution of the input text: what kind of inputs the task involves.
- The overall format of the sequence: how inputs and outputs are paired and arranged in the prompt.
This result suggests that at least for certain task types, in-context learning relies more on task specification (telling the model what kind of task to perform) than on task learning (learning the input-output mapping from examples). However, Wei et al. (2023) later showed that larger models do learn from the actual input-label mappings, especially when the mappings contradict the model's semantic priors. So the importance of correct labels appears to increase with model scale.
Beyond example selection and ordering, the specific formatting of the prompt matters. Small changes to the template (such as the separator between input and output, the label words used, or the presence of a task instruction) can meaningfully affect performance. This sensitivity has motivated research on automatic prompt optimization and prompt tuning methods.
In-context learning and fine-tuning represent two fundamentally different approaches to adapting a pretrained model to a new task.
| Aspect | In-context learning | Fine-tuning |
|---|---|---|
| Parameter updates | None | Yes (gradient-based) |
| Training data needed | A few examples (fits in prompt) | Typically hundreds to thousands of examples |
| Computational cost | Single forward pass | Multiple epochs of training |
| Task switching | Instant (change the prompt) | Requires separate fine-tuned model per task |
| Performance ceiling | Generally lower than fine-tuning for specialized tasks | Higher with sufficient data |
| Risk of catastrophic forgetting | None (weights unchanged) | Yes, especially without careful regularization |
| Storage | Single model serves all tasks | Separate model (or adapter) per task |
| Requires labeled data | Minimally (a few examples) | Yes (labeled dataset) |
Dai et al. (2023) showed that in-context learning and fine-tuning produce similar internal representations and behavioral patterns, supporting the view that ICL is a form of implicit fine-tuning. However, fine-tuning generally achieves higher performance when sufficient labeled data is available, because it can make persistent changes to the model's weights rather than relying on limited prompt space.
In practice, the choice between ICL and fine-tuning depends on the use case. In-context learning is preferred when labeled data is scarce, rapid task switching is needed, or the cost of fine-tuning is prohibitive. Fine-tuning is preferred when maximum task performance is required and sufficient training data exists.
Traditional few-shot ICL is limited by the model's context window. Early models like GPT-3 had a context window of 2,048 or 4,096 tokens, constraining the number of demonstrations to a handful. As context windows have expanded to 128,000 tokens and beyond (for example, in GPT-4 Turbo, Claude, and Gemini 1.5 Pro), a new regime of "many-shot" in-context learning has become feasible.
Agarwal et al. (2024) systematically studied this regime and found:

- Scaling from few-shot to many-shot demonstrations yields large performance gains across a wide variety of generative and discriminative tasks.
- Unlike few-shot ICL, many-shot ICL can override pretraining biases, for example learning label mappings that contradict the model's semantic priors.
- Many-shot ICL can also handle tasks outside typical natural language settings, such as learning high-dimensional functions with numerical inputs.
The main bottleneck for many-shot ICL is the availability of human-labeled demonstrations. To address this, Agarwal et al. proposed Reinforced ICL (using model-generated chain-of-thought rationales as demonstrations) and Unsupervised ICL (using unlabeled examples). Both approaches proved effective, particularly on complex reasoning tasks.
Despite its practical utility, in-context learning has several well-documented limitations.
As discussed above, in-context learning is sensitive to the choice, ordering, and formatting of demonstration examples. This fragility means that small, seemingly inconsequential changes to the prompt can cause large swings in performance. Reliably optimizing prompts often requires experimentation or automated search methods.
Although context windows have grown substantially, they still impose an upper bound on the amount of information that can be provided as demonstrations. For tasks that require learning from large datasets, in-context learning cannot match the capacity of fine-tuning, which can iterate over arbitrary amounts of data through multiple epochs.
While in-context learning works well for pattern matching and classification tasks, it struggles with tasks that require multi-step reasoning or compositional generalization beyond the patterns in the demonstrations. Chain-of-thought prompting can partially address this limitation by providing intermediate reasoning steps in the demonstrations, but it does not fully close the gap.
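For instance, a chain-of-thought demonstration embeds the intermediate steps in the example output, so the model imitates the reasoning format as well as the final answer. The snippet below uses a widely cited example from the chain-of-thought prompting literature; the exact wording is illustrative.

```python
# One chain-of-thought demonstration followed by a new question (illustrative wording).
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many does he have now?
A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more. How many apples do they have?
A:"""
# A capable model is expected to continue with a step-by-step rationale ending in "The answer is 9."
```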
In-context learning is ephemeral. Each new prompt starts from scratch, with no memory of previous interactions. The model cannot accumulate knowledge across sessions or build on previous in-context learning episodes. This is by design (the weights are frozen), but it means that in-context learning is not a substitute for training or fine-tuning when persistent adaptation is needed.
While the theoretical frameworks discussed in this article (gradient descent, Bayesian inference, induction heads) have provided valuable insights, none of them fully explains in-context learning in large, practical language models. Most theoretical results apply to simplified settings (linear models, small transformers, synthetic data), and the extent to which they generalize to models with hundreds of billions of parameters and MLP layers remains an open question.
Because in-context learning depends heavily on the provided demonstrations, biased or unrepresentative examples can lead the model to produce biased outputs. Unlike fine-tuning, where bias mitigation techniques can be applied during training, there is limited ability to control for bias in the in-context learning setting beyond careful curation of the demonstrations.
In-context learning has found widespread use across many areas of NLP and beyond.
| Application | How ICL is used |
|---|---|
| Text classification | Demonstrations show input texts paired with category labels |
| Machine translation | Source-target sentence pairs serve as demonstrations |
| Question answering | Question-answer pairs demonstrate the desired format and reasoning |
| Code generation | Input-output pairs or natural language descriptions paired with code |
| Summarization | Document-summary pairs establish the desired compression level and style |
| Data extraction | Examples show how to extract structured information from unstructured text |
| Reasoning tasks | Chain-of-thought demonstrations provide step-by-step reasoning templates |
| Format conversion | Examples demonstrate the mapping between data formats (e.g., JSON to CSV) |
Imagine you are doing a new kind of worksheet at school that you have never seen before. Your teacher shows you two completed examples at the top of the page so you can see the pattern. Then you try to do the next one on your own by copying what the examples did.
That is basically what in-context learning is. A big language model (like a very smart parrot that has read billions of sentences) gets shown a few "here is the question, here is the answer" examples right before a new question. The model looks at those examples, figures out the pattern, and then answers the new question the same way. The interesting part is that nobody had to re-teach the model anything. It just looked at the examples and figured it out on the spot.