In-Context Learning
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v8 · 7,539 words
See also: large language model, few-shot learning, zero-shot learning, prompt engineering, transfer learning, fine-tuning, chain-of-thought
In-context learning (ICL) is the ability of a large language model to learn a new task at inference time by conditioning on a prompt that contains a few input-output examples (demonstrations), without updating any of the model's parameters. The term was introduced by Brown et al. in the GPT-3 paper "Language Models are Few-Shot Learners" (2020), where the authors showed that a sufficiently large autoregressive language model could perform a wide range of natural language processing tasks simply by being given a handful of examples in the prompt.
What makes in-context learning remarkable is that it requires no gradient descent, no backpropagation, and no changes to the model's weights. The model receives a sequence of demonstrations followed by a query input, and it produces the appropriate output by leveraging patterns it learned during pretraining. This stands in sharp contrast to the traditional machine learning paradigm of fine-tuning, where a pretrained model's weights are explicitly updated on task-specific data.
Since the GPT-3 paper, in-context learning has become one of the most studied phenomena in modern deep learning and is widely viewed as a defining emergent property of large transformer language models. Researchers have proposed multiple theoretical explanations for why it works, identified the factors that influence its effectiveness, and documented its limitations. The expansion of context windows from 2,048 tokens in early GPT-3 deployments to 200,000 tokens in the Claude 3 family and 1 million tokens or more in Gemini 1.5 Pro has further reshaped how ICL is used in practice. This article covers the definition, mechanisms, theoretical frameworks, practical considerations, and open questions surrounding in-context learning.
The term "in-context learning" was popularized by Brown et al. (2020), though the underlying behavior had been observed on a smaller scale in earlier work. Radford et al. (2019), in the GPT-2 paper "Language Models are Unsupervised Multitask Learners," had already demonstrated that a large language model could perform downstream tasks zero-shot when prompted with natural-language task descriptions. Brown et al. extended this idea by systematically studying the few-shot setting and giving it a name.
In the GPT-3 paper, in-context learning describes the procedure of conditioning a frozen language model on a prompt of the form
[task description]
[example 1 input] -> [example 1 output]
[example 2 input] -> [example 2 output]
...
[query input] ->
and reading off the model's continuation as the predicted output. The number of examples is referred to as the "shot" count.
| Term | Number of demonstrations | Description |
|---|---|---|
| Zero-shot | 0 | The model receives only a task instruction or query with no examples. It must rely entirely on knowledge acquired during pretraining. |
| One-shot | 1 | The model receives a single demonstration example before the query. |
| Few-shot | 2 to roughly 30 | The model receives a small set of demonstrations. This is the most common ICL setting in research and production. |
| Many-shot | Dozens to thousands | The model receives a large number of demonstrations, enabled by long context windows of 100,000 tokens or more. |
| k-shot | k examples | A general notation used in research to denote that exactly k demonstrations are provided. |
Brown et al. (2020) evaluated GPT-3 in the zero-shot, one-shot, and few-shot settings and found that performance generally improved as more demonstrations were provided. They also reported that the gap between zero-shot and few-shot performance widened with model scale: larger models benefited more from additional demonstrations, suggesting that ICL itself becomes stronger with scale.
In modern usage, "in-context learning" is sometimes restricted to the case where labeled demonstrations are provided, while purely instruction-based zero-shot prompting is treated as a separate paradigm. This article uses the broader interpretation that includes zero-shot prompting as a degenerate case of ICL with no demonstrations.
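In a standard in-context learning setup, the user constructs a prompt consisting of three components: an optional task description or instruction, a set of demonstrations (input-output pairs), and a query input whose output the model must produce.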
For example, a sentiment analysis prompt might look like this:
Review: "The food was excellent." Sentiment: Positive
Review: "Terrible service, never coming back." Sentiment: Negative
Review: "The ambiance was lovely but the pasta was overcooked." Sentiment:
The model processes this entire sequence and generates a completion for the final, incomplete example. Because the model's weights are frozen during this process, all of the "learning" happens through the forward pass of the transformer architecture. The attention mechanism allows the model to attend to the demonstration examples, identify the pattern connecting inputs to outputs, and apply that pattern to the query.
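The prompt layout described above can be assembled mechanically. The following is a minimal sketch that reuses the "Review: ... Sentiment: ..." template from the example; the function name `build_few_shot_prompt` and its parameters are illustrative choices, not part of any library.

```python
def build_few_shot_prompt(demonstrations, query, instruction=None):
    """Assemble a k-shot prompt: optional instruction, k demonstrations, then the query."""
    lines = []
    if instruction:
        lines.append(instruction)
    for review, sentiment in demonstrations:
        lines.append(f'Review: "{review}" Sentiment: {sentiment}')
    # The final example is left incomplete so the model fills in the label.
    lines.append(f'Review: "{query}" Sentiment:')
    return "\n".join(lines)


demos = [
    ("The food was excellent.", "Positive"),
    ("Terrible service, never coming back.", "Negative"),
]
prompt = build_few_shot_prompt(
    demos, "The ambiance was lovely but the pasta was overcooked."
)
print(prompt)
```

The number of pairs passed in `demonstrations` is the shot count; passing an empty list with an instruction yields a zero-shot prompt.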
The self-attention layers in a transformer are central to in-context learning. Each attention head can, in principle, compare the query input against the demonstration examples and extract task-relevant information. As the prompt passes through multiple transformer layers, the model progressively refines its internal representation of the task. The model's behavior can be thought of as performing a kind of conditional generation: given the context window of demonstrations, the model produces the most likely continuation, which for well-formatted demonstrations often corresponds to the correct output for the query.
Research by Olsson et al. (2022) identified specific attention heads called "induction heads" that play a mechanistic role in this process. An induction head detects a pattern of the form [A][B]...[A] in the input sequence and predicts that B should follow the second occurrence of A. While this is a simple pattern-completion algorithm, the authors present evidence that induction heads serve as a building block for the more general in-context learning ability observed in large transformers.
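The [A][B]...[A] rule can be written as an ordinary function over a token list. This is only a sketch of the input-output behavior attributed to induction heads, not of how attention implements it; the function name is hypothetical.

```python
def induction_predict(tokens):
    """Predict the next token by copying what followed the most recent
    earlier occurrence of the current (last) token: [A][B]...[A] -> [B]."""
    current = tokens[-1]
    # Scan backwards over earlier positions, excluding the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence, so the rule makes no prediction


print(induction_predict(["the", "cat", "sat", "the"]))  # prints: cat
```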
A central conceptual shift introduced by ICL is that the forward pass of the transformer can itself be interpreted as a small-scale learning algorithm. The pretrained weights play the role of a fixed meta-learner, and the prompt plays the role of a tiny dataset. The transformer produces the output as if it had performed a brief training procedure on that dataset. This view, sometimes called the meta-learning view of ICL, motivates much of the theoretical work discussed below.
The study of in-context learning has progressed quickly since 2020. The following table summarizes key papers in the development of the field.
| Year | Paper | Authors | Contribution |
|---|---|---|---|
| 2019 | "Language Models are Unsupervised Multitask Learners" | Radford et al. (OpenAI) | Demonstrated zero-shot task transfer in GPT-2, the precursor to ICL. |
| 2020 | "Language Models are Few-Shot Learners" (arXiv:2005.14165) | Brown et al. (OpenAI) | Introduced the term in-context learning, demonstrated few-shot ICL in GPT-3 across many tasks. |
| 2021 | "Calibrate Before Use" (arXiv:2102.09690) | Zhao et al. | Identified majority-label, recency, and common-token biases in ICL and proposed contextual calibration. |
| 2021 | "Fantastically Ordered Prompts and Where to Find Them" (arXiv:2104.08786) | Lu et al. | Demonstrated that demonstration order can swing performance from near state-of-the-art to random. |
| 2021 | "What Makes Good In-Context Examples for GPT-3?" (arXiv:2101.06804) | Liu et al. | Proposed retrieval of semantically similar demonstrations (KATE). |
| 2022 | "An Explanation of In-context Learning as Implicit Bayesian Inference" (arXiv:2111.02080) | Xie et al. | Framed ICL as implicit Bayesian inference over latent concepts; introduced GINC. |
| 2022 | "MetaICL: Learning to Learn In Context" (arXiv:2110.15943) | Min et al. | Showed that meta-training on a diverse task mixture improves few-shot ICL on held-out tasks. |
| 2022 | "Rethinking the Role of Demonstrations" (arXiv:2202.12837) | Min et al. | Found that randomized labels barely hurt ICL; format and label space matter most. |
| 2022 | "Chain-of-Thought Prompting Elicits Reasoning" (arXiv:2201.11903) | Wei et al. | Introduced chain-of-thought prompting as a form of ICL for reasoning tasks. |
| 2022 | "Self-Consistency Improves Chain of Thought" (arXiv:2203.11171) | Wang et al. | Augmented chain-of-thought with majority voting over sampled reasoning paths. |
| 2022 | "What Can Transformers Learn In-Context?" (arXiv:2208.01066) | Garg et al. | Trained transformers from scratch on synthetic ICL tasks and characterized learnable function classes. |
| 2022 | "In-context Learning and Induction Heads" (arXiv:2209.11895) | Olsson et al. (Anthropic) | Mechanistic explanation: induction heads emerge during a phase change and drive ICL. |
| 2022 | "Transformers Learn In-Context by Gradient Descent" (arXiv:2212.07677) | von Oswald et al. | Showed linear self-attention layers can implement gradient descent steps; transformers as mesa-optimizers. |
| 2022 | "Why Can GPT Learn In-Context?" (arXiv:2212.10559) | Dai et al. | Argued attention has a dual form of gradient descent and that ICL behaves like implicit fine-tuning. |
| 2023 | "What Learning Algorithm is In-Context Learning?" (arXiv:2211.15661) | Akyurek et al. | Identified that trained transformers match gradient descent or ridge regression depending on depth and noise. |
| 2023 | "Larger Language Models Do In-Context Learning Differently" (arXiv:2303.03846) | Wei et al. | Showed that large models can override semantic priors and learn from flipped labels. |
| 2023 | "In-Context Learning Creates Task Vectors" (arXiv:2310.15916) | Hendel et al. | Identified compact internal task vectors produced by ICL. |
| 2024 | "Function Vectors in Large Language Models" (arXiv:2310.15213) | Todd et al. | Identified causally relevant function vectors in autoregressive transformers. |
| 2024 | "Many-Shot In-Context Learning" (arXiv:2404.11018) | Agarwal et al. (Google DeepMind) | Studied ICL with hundreds to thousands of demonstrations; introduced Reinforced and Unsupervised ICL. |
| 2024 | "A Survey on In-context Learning" (arXiv:2301.00234) | Dong et al. | Comprehensive survey of ICL methods, theory, and applications. |
One of the central questions in in-context learning research is: what computational process is the model actually performing when it learns from demonstrations? Several complementary theoretical frameworks have been proposed. None is universally accepted, and they may describe the same underlying phenomenon at different levels of abstraction.
Multiple research groups have independently drawn connections between in-context learning and gradient-based optimization.
Dai et al. (2023) showed that the transformer attention mechanism has a mathematical "dual form" that resembles gradient descent. In their framework, the demonstration examples produce "meta-gradients" that are implicitly applied to the model's representations, effectively fine-tuning the model's behavior for the current task without actually modifying the stored weights. They provided empirical evidence that in-context learning and explicit fine-tuning produce similar behavioral patterns across multiple tasks. They also designed a momentum-based attention variant, inspired by gradient descent with momentum, that improved performance, offering further support for the gradient-descent interpretation.
Von Oswald et al. (2023) proved a stronger result for linear self-attention layers: a single linear self-attention layer can exactly replicate one step of gradient descent on a regression loss. They showed that trained transformers become "mesa-optimizers," meaning they learn to implement an optimization algorithm (gradient descent) within their forward pass. Their experiments demonstrated that transformers can even learn to apply curvature corrections analogous to second-order optimization methods, outperforming plain gradient descent on regression tasks. They also showed that this perspective subsumes induction-head behavior as a specific case of in-context learning by gradient descent.
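The simplest special case of this correspondence can be checked numerically. The sketch below assumes a zero weight initialization and the loss 0.5 * sum((w·x_i − y_i)^2); under those assumptions, the prediction after one gradient step equals a softmax-free attention readout with keys x_i, values y_i, and the query input as the attention query. The data values and function names are made up for illustration, and the general construction in the paper is more involved.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gd_one_step_prediction(xs, ys, x_query, lr):
    """Take one gradient step from w = 0 on 0.5 * sum((w.x_i - y_i)^2), then predict."""
    dim = len(x_query)
    w = [0.0] * dim
    # Gradient of the squared loss at w: sum_i (w.x_i - y_i) * x_i
    grad = [0.0] * dim
    for x, y in zip(xs, ys):
        residual = dot(w, x) - y
        for j in range(dim):
            grad[j] += residual * x[j]
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, lr):
    """Softmax-free attention: keys x_i, values y_i, query x_query, scaled by lr."""
    return lr * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


# Hypothetical in-context regression examples.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
x_query = [0.5, 2.0]
lr = 0.1

gd = gd_one_step_prediction(xs, ys, x_query, lr)
attn = linear_attention_prediction(xs, ys, x_query, lr)
print(abs(gd - attn) < 1e-12)  # the two predictions agree up to rounding
```

The agreement follows because the gradient at w = 0 is −Σ y_i x_i, so the updated weights are lr · Σ y_i x_i and the prediction is lr · Σ y_i (x_i · x_query).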
Akyurek et al. (2023) investigated what specific learning algorithm in-context learning implements by studying transformers trained on linear regression tasks. They found that the in-context predictions of trained transformers closely match the predictions of gradient descent, ridge regression, and exact least-squares regression, depending on the model's depth and the noise level in the training data. Their probing analyses suggested that deeper layers of the transformer encode weight vectors and moment matrices in ways that resemble classical statistical estimators.
| Study | Key finding | Setting |
|---|---|---|
| Dai et al. (2023) | Attention has a dual form of gradient descent; ICL behaves like implicit fine-tuning | Real NLP tasks |
| Von Oswald et al. (2023) | Linear self-attention exactly implements gradient descent; transformers are mesa-optimizers | Linear regression |
| Akyurek et al. (2023) | ICL predictions match gradient descent, ridge regression, or least-squares depending on depth and noise | Linear models |
Xie et al. (2022) proposed an alternative theoretical framework, arguing that in-context learning is best understood as implicit Bayesian inference. Their central insight is that when pretraining data consists of documents drawn from a mixture of latent concepts (for example, different topics, writing styles, or domains), the language model learns to infer the latent concept underlying a given sequence in order to maintain long-range coherence. At inference time, the demonstration examples in a prompt provide evidence about the latent concept, and the model performs approximate Bayesian updating to narrow down which concept best explains the examples.
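The updating step can be illustrated with a toy posterior over two hand-written "concepts". This is a sketch of Bayes' rule only: the concepts, the 0.1 noise level, and all names are assumptions made for illustration and are far simpler than the latent-concept structure in the paper.

```python
# Two hypothetical latent concepts, each a candidate input-output rule.
CONCEPTS = {
    "copy": lambda s: s,
    "uppercase": lambda s: s.upper(),
}


def likelihood(concept, demo, noise=0.1):
    """Probability of a demonstration under a concept, with a small noise floor."""
    x, y = demo
    return 1.0 - noise if CONCEPTS[concept](x) == y else noise


def posterior_over_concepts(prior, likelihood_fn, demonstrations):
    """Bayes' rule: p(concept | demos) is proportional to p(concept) * prod p(demo | concept)."""
    unnormalized = {}
    for concept, p in prior.items():
        for demo in demonstrations:
            p *= likelihood_fn(concept, demo)
        unnormalized[concept] = p
    total = sum(unnormalized.values())
    return {c: p / total for c, p in unnormalized.items()}


prior = {"copy": 0.5, "uppercase": 0.5}
demos = [("cat", "CAT"), ("dog", "DOG")]
posterior = posterior_over_concepts(prior, likelihood, demos)
print(posterior)
```

With two consistent demonstrations, almost all posterior mass moves to the "uppercase" concept; each further demonstration multiplies the odds by the same likelihood ratio.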
To test this theory, Xie et al. constructed GINC (Generative In-Context learning dataset), a synthetic pretraining dataset with explicit latent concept structure based on mixtures of hidden Markov models. They proved theoretically that in-context learning emerges in this setting and showed empirically that both transformers and LSTMs pretrained on GINC exhibit in-context learning behavior. They also reproduced realistic phenomena from the literature, such as sensitivity to demonstration ordering and cases where zero-shot can outperform few-shot, within the GINC setting. This work was published at ICLR 2022.
The Bayesian perspective and the gradient descent perspective are not mutually exclusive. Bayesian inference can be implemented through iterative optimization, and some researchers have argued that the two views describe the same underlying computation at different levels of abstraction.
Olsson et al. (2022), working at Anthropic, proposed a mechanistic explanation rooted in the internal circuitry of transformers. They identified "induction heads" as a specific type of attention head that implements a token-level pattern-completion algorithm: given a sequence of the form [A][B]...[A], an induction head predicts that B will follow.
A full induction circuit requires a composition of two attention heads in different layers. A previous-token head in an earlier layer marks each token with information about its predecessor, and an induction head in a later layer searches for that mark and copies the continuation. Because the circuit requires composition across layers, induction heads cannot exist in single-layer attention models, which is consistent with the observation that single-layer transformers do not show the abrupt improvement in ICL seen in deeper models.
The key evidence for the induction-heads hypothesis comes from training dynamics. The authors observed that induction heads emerge at a specific point during training, and this emergence coincides precisely with a sudden, sharp increase in the model's in-context learning ability (measured as decreasing loss at later positions in the sequence). This co-occurrence is visible as a distinctive "bump" in the training loss curve, sometimes called the induction bump. Perturbations to the architecture that shift the bump in training time also shift the formation of induction heads correspondingly, providing causal evidence for the link.
For small, attention-only transformer models, Olsson et al. presented strong causal evidence: ablating induction heads directly impairs in-context learning performance. For larger models with MLP layers, the evidence is correlational, since induction heads are more difficult to isolate in complex architectures. The authors hypothesize that induction heads may constitute the primary mechanism underlying in-context learning in large transformers, though this remains an area of active investigation. Subsequent work, including Akyurek et al. (2024) on "In-context language learning: Architectures and algorithms," has explored richer circuits including "semantic" induction heads that match abstract token relationships rather than literal repetitions.
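Garg et al. (2022) took an empirical approach, training transformers from scratch on synthetic in-context learning tasks to study what function classes transformers can learn in-context. They found that transformers can be trained to learn linear functions in-context with performance comparable to the optimal least-squares estimator, that this ability holds up under some distribution shift between training and inference prompts, and that more complex classes such as sparse linear functions, two-layer neural networks, and decision trees can also be learned in-context.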
This work, published at NeurIPS 2022, helped establish that transformers can implement surprisingly powerful learning algorithms within their forward pass.
The table below contrasts the major theoretical accounts.
| Account | Core claim | Supporting work | Strength | Limitation |
|---|---|---|---|---|
| Implicit gradient descent | Forward pass executes steps of gradient descent on demonstrations | Dai et al. (2023); von Oswald et al. (2023); Akyurek et al. (2023) | Mathematically clean, explains regression-style ICL | Mostly proved for linear self-attention or simplified settings |
| Implicit Bayesian inference | Demonstrations narrow a posterior over latent concepts learned during pretraining | Xie et al. (2022) | Naturally explains why scale and pretraining diversity help | Hard to test directly in real models; requires latent-concept structure |
| Induction heads | Attention circuits perform pattern completion ([A][B]...[A] -> [B]) | Olsson et al. (2022) | Strong mechanistic, causal evidence in small models | Correlational only in large MLP-equipped models |
| Function-class learning | Transformers can implement statistical estimators for specific function classes | Garg et al. (2022) | Quantifies the space of tasks transformers can solve in-context | Synthetic tasks; does not directly explain language ICL |
A parallel line of research has investigated how the "knowledge" of a task is represented inside the model during in-context learning.
Hendel et al. (2023) discovered that in-context learning often compresses the information from all demonstration examples into a single vector in the model's residual stream, which they called a "task vector." This task vector acts as a compact representation of the task being performed. When extracted from one prompt and injected into a different forward pass (replacing the demonstrations), the task vector alone is sufficient to steer the model toward the correct task behavior. This work was published in the Findings of EMNLP 2023.
Independently and at nearly the same time, Todd et al. (2024), in work published at ICLR 2024, identified a closely related phenomenon they called "function vectors." Using causal mediation analysis, they showed that a small number of attention heads transport a compact representation of the demonstrated task, and this representation can be extracted as a vector with strong causal effects on model outputs. When a function vector is added to the model's residual stream at inference time, it causes the model to apply the corresponding function, even without any demonstration examples in the prompt. Function vectors also exhibit partial compositionality: summing the vectors for two tasks can produce behavior corresponding to a combined task.
These findings suggest that in-context learning involves two stages: first, the model reads the demonstrations and compresses them into an internal task representation; second, the model applies this task representation to the query input. This two-stage view aligns with both the Bayesian inference framework (where the task vector encodes the inferred latent concept) and the gradient descent framework (where the task vector encodes the implicit weight update).
A particularly important variant of ICL is chain-of-thought (CoT) prompting, introduced by Wei et al. (2022). Rather than presenting demonstrations as bare input-output pairs, CoT demonstrations include intermediate reasoning steps written in natural language. For example, a math word problem demonstration might include the explicit calculation steps before the final numeric answer.
Wei et al. showed that this simple change unlocks substantial gains on tasks requiring multi-step reasoning, including arithmetic, commonsense, and symbolic reasoning. With only eight CoT demonstrations, a 540-billion-parameter model achieved state-of-the-art performance on the GSM8K math benchmark, surpassing a fine-tuned GPT-3 baseline equipped with a verifier. Crucially, the gains from CoT only appeared at sufficient scale: smaller models did not benefit and sometimes performed worse. This made CoT one of the canonical demonstrations of an emergent capability tied to model size.
CoT can be combined with self-consistency, introduced by Wang et al. (2022). Instead of generating a single CoT and reading the final answer, self-consistency samples multiple chains of thought from the model and selects the most common final answer by majority vote. This approach improved chain-of-thought performance on GSM8K by 17.9 percentage points and produced large gains on a range of arithmetic and commonsense benchmarks.
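The voting step itself is a few lines of code. In this sketch the sampled final answers are hard-coded placeholders; in practice each would be parsed from a separately sampled chain of thought, and the model call is omitted here.

```python
from collections import Counter


def self_consistent_answer(final_answers):
    """Return the most common final answer among sampled reasoning chains."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(final_answers)
    answer, _ = counts.most_common(1)[0]
    return answer


# Hypothetical final answers extracted from five sampled chains of thought.
sampled = ["18", "18", "26", "18", "20"]
print(self_consistent_answer(sampled))  # prints: 18
```

Answers are compared as exact strings here, so some normalization (for example stripping units or whitespace) would normally precede the vote.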
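Later extensions further refined CoT-style ICL, including least-to-most prompting (Zhou et al., 2022), which decomposes a problem into subproblems and chains their solutions, and step-back prompting (Zheng et al., 2023), which first abstracts the question to a higher-level concept before answering.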
Model scale is one of the strongest predictors of in-context learning ability. Brown et al. (2020) observed that while zero-shot performance improves steadily with model size, few-shot performance increases more rapidly. This means that the benefit of providing demonstrations grows as models become larger. At the scale of GPT-3's 175 billion parameters, in-context learning becomes reliable across a broad range of tasks.
Wei et al. (2023) showed that model scale also affects how models use in-context demonstrations. Small models tend to rely on semantic priors from pretraining and largely ignore the input-label mappings in demonstrations. Large models, by contrast, can override their semantic priors when the demonstrations present conflicting information (for example, flipped labels where positive examples are labeled "Negative"). This ability to override priors is an emergent property that appears at sufficient scale. The same study found that instruction tuning strengthens both reliance on semantic priors and the ability to learn arbitrary input-label mappings, but tilts the balance more toward the former.
The choice of which examples to include in the prompt significantly affects in-context learning performance.
Liu et al. (2021), in a paper later commonly cited as the KATE method (kNN-Augmented in-conText Example selection), proposed selecting demonstration examples based on their semantic similarity to the query input in an embedding space. Their approach retrieves the nearest neighbors of the test input and uses them as demonstrations. This consistently outperformed random example selection across multiple NLU and NLG benchmarks, including a 41.9% gain on table-to-text generation (ToTTo) and 45.5% on open-domain question answering (Natural Questions) when paired with task-tuned sentence encoders.
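Nearest-neighbor demonstration selection can be sketched as follows. The bag-of-words `embed` function is a deliberately crude stand-in for the sentence encoders used in the paper, and the example pool and function names are invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_demonstrations(pool, query, k):
    """Return the k (input, output) pairs whose inputs are most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda pair: cosine(embed(pair[0]), q), reverse=True)
    return ranked[:k]


pool = [
    ("who wrote hamlet", "William Shakespeare"),
    ("what is the capital of france", "Paris"),
    ("who wrote macbeth", "William Shakespeare"),
    ("what is the boiling point of water", "100 degrees Celsius"),
]
chosen = select_demonstrations(pool, "who wrote othello", k=2)
print(chosen)
```

For the query above, the two "who wrote ..." pairs are retrieved because they share the most tokens with it; a learned encoder would capture similarity that does not depend on exact word overlap.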
Other research has shown that selecting diverse examples that cover different aspects of the task can also improve performance. Active selection schemes that combine similarity and coverage often outperform either alone. The optimal selection strategy depends on the task, the model, and the number of demonstrations available.
Lu et al. (2022) demonstrated that the order in which demonstration examples appear in the prompt can dramatically affect performance. In some cases, the difference between the best and worst orderings spans from near state-of-the-art accuracy to random-chance accuracy. This sensitivity exists across model sizes, meaning that even the largest models are not immune to it. The optimal order is also model-specific: a strong order for one model does not necessarily transfer to another.
The authors proposed a method based on entropy statistics to identify high-performing orderings without access to a development set. Their approach generates a synthetic development (probing) set using the language model itself and ranks candidate orderings by the entropy of their predictions on this set, favoring orderings that avoid highly unbalanced or overconfident predictions. This method achieved a 13% relative improvement for GPT-family models across eleven classification tasks.
A surprising finding by Min et al. (2022) revealed that correct input-label mappings in demonstrations are not always necessary for in-context learning. Randomly replacing the labels in demonstration examples barely affected performance on a range of classification and multiple-choice tasks, consistently across 12 different models including GPT-3.
What does matter, according to their analysis, is that the demonstrations convey the label space, the distribution of the input text, and the overall format of the sequence.
This result suggests that at least for certain task types, in-context learning relies more on task specification (telling the model what kind of task to perform) than on task learning (learning the input-output mapping from examples). However, Wei et al. (2023) later showed that larger models do learn from the actual input-label mappings, especially when the mappings contradict the model's semantic priors. So the importance of correct labels appears to increase with model scale, and the Min et al. result is best interpreted as a property of mid-scale models rather than a universal claim about ICL.
Beyond example selection and ordering, the specific formatting of the prompt matters. Small changes to the template (such as the separator between input and output, the label words used, the use of an explicit task instruction, or the casing and punctuation of label tokens) can meaningfully affect performance. This sensitivity has motivated research on automatic prompt optimization, soft prompt tuning, and structured-output formats such as JSON schemas, which can stabilize behavior on classification tasks while constraining the output.
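Zhao et al. (2021), in "Calibrate Before Use," identified three systematic biases in few-shot ICL: majority-label bias (a tendency to predict labels that appear frequently among the demonstrations), recency bias (a tendency to repeat labels that appear near the end of the prompt), and common-token bias (a tendency to prefer tokens that were frequent in the pretraining data).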
They proposed a contextual calibration procedure: estimate the model's bias by querying it with a content-free input such as "N/A," then fit calibration parameters that make the prediction uniform across labels. This adjustment improved few-shot accuracy of GPT-3 and GPT-2 by up to 30.0 percentage points and reduced variance across prompt choices. Subsequent methods such as batch calibration extended these ideas to broader settings.
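A worked example makes the adjustment concrete. The sketch below uses one simple reading of the procedure: divide each label probability by the probability the model assigns to that label for the content-free input, then renormalize. The probabilities are hypothetical, and the renormalization step is a simplification assumed here rather than a detail taken from the paper.

```python
def calibrate(probs, content_free_probs):
    """Rescale label probabilities by the content-free baseline and renormalize."""
    scaled = [p / p_cf for p, p_cf in zip(probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical numbers: the model leans toward "Positive" even for the input "N/A".
content_free = [0.7, 0.3]  # p(Positive), p(Negative) for the content-free input
raw = [0.6, 0.4]           # p(Positive), p(Negative) for a real review
calibrated = calibrate(raw, content_free)
print(calibrated)
```

In this example the uncalibrated prediction is "Positive", but after dividing out the baseline preference the calibrated prediction is "Negative". Applying `calibrate` to the content-free probabilities themselves yields a uniform distribution, which is the property the procedure aims for.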
While in-context learning emerges naturally from ordinary pretraining at sufficient scale, several research efforts have investigated training procedures designed to amplify it.
Min et al. (2022), in "MetaICL: Learning to Learn In Context," explicitly meta-trained a pretrained language model on 142 NLP datasets, presenting tasks in an in-context format during fine-tuning. The resulting model showed improved few-shot ICL on held-out tasks, outperforming in-context baselines with nearly eight times as many parameters and in some cases approaching or exceeding models fully fine-tuned on the target task. The gains were largest when meta-training tasks were diverse and when target tasks involved domain shift.
Instruction tuning (such as FLAN, T0, and the post-training of GPT-4, Claude, and Gemini models) can be viewed as a related strategy: by training the model on a wide range of instruction-following examples, the model becomes better at zero-shot and few-shot generalization to new tasks specified in natural language. Wei et al. (2023) noted, however, that instruction tuning shifts the relative balance between reliance on semantic priors and learning from in-context demonstrations.
In-context learning and fine-tuning represent two fundamentally different approaches to adapting a pretrained model to a new task.
| Aspect | In-context learning | Fine-tuning |
|---|---|---|
| Parameter updates | None | Yes (gradient-based) |
| Training data needed | A few examples (fits in prompt) | Typically hundreds to thousands of examples |
| Computational cost | Single forward pass | Multiple epochs of training |
| Task switching | Instant (change the prompt) | Requires separate fine-tuned model per task |
| Performance ceiling | Generally lower than fine-tuning for specialized tasks | Higher with sufficient data |
| Risk of catastrophic forgetting | None (weights unchanged) | Yes, especially without careful regularization |
| Storage | Single model serves all tasks | Separate model (or adapter) per task |
| Requires labeled data | Minimally (a few examples) | Yes (labeled dataset) |
| Inference cost per request | Higher (prompt tokens are paid each call) | Lower once weights absorb the task |
| Privacy of training data | Data is kept inside prompts (which may be logged by providers) | Data is absorbed into weights; not directly readable, though memorized examples can sometimes be extracted |
| Suitability for long horizons | Limited to context window | Not bounded by the context window |
Dai et al. (2023) showed that in-context learning and fine-tuning produce similar internal representations and behavioral patterns, supporting the view that ICL is a form of implicit fine-tuning. However, fine-tuning generally achieves higher performance when sufficient labeled data is available, because it can make persistent changes to the model's weights rather than relying on limited prompt space.
In practice, the choice between ICL and fine-tuning depends on the use case. In-context learning is preferred when labeled data is scarce, rapid task switching is needed, the cost of fine-tuning is prohibitive, or low-latency adaptation is required. Fine-tuning is preferred when maximum task performance is required, sufficient training data exists, and the cost of running long prompts at inference would be prohibitive.
With the rise of cheap parameter-efficient fine-tuning techniques such as LoRA and adapters, fine-tuning has become a more attractive alternative to ICL even for moderately sized datasets. Practical systems often combine the two: a model is fine-tuned for a task family and then steered with in-context examples within each invocation.
Traditional few-shot ICL is limited by the model's context window. Early models like GPT-3 had a context window of 2,048 or 4,096 tokens, constraining the number of demonstrations to a few dozen at most for typical tasks. As context windows have expanded to 128,000 tokens (GPT-4 Turbo), 200,000 tokens (Claude 3 family), and 1 million or more tokens (Gemini 1.5 Pro and Gemini 2.0), a new regime of "many-shot" in-context learning has become feasible.
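The effect of window size on the shot count is simple arithmetic. The sketch below assumes every demonstration costs the same number of tokens; the figure of 100 tokens per demonstration, the query and output budgets, and the function name are illustrative assumptions.

```python
def max_demonstrations(context_window, tokens_per_demo, query_tokens,
                       reserved_output_tokens=0, instruction_tokens=0):
    """Number of equal-length demonstrations that fit alongside the query and output."""
    budget = (context_window - query_tokens
              - reserved_output_tokens - instruction_tokens)
    return max(budget // tokens_per_demo, 0)


# Assumed: 100 tokens per demonstration, 100-token query, 50 tokens reserved for output.
print(max_demonstrations(2_048, 100, 100, 50))      # prints: 18
print(max_demonstrations(1_000_000, 100, 100, 50))  # prints: 9998
```

Under these assumptions a 2,048-token window holds fewer than twenty demonstrations, while a 1-million-token window holds several thousand.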
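Agarwal et al. (2024) systematically studied this regime and found that moving from few-shot to many-shot prompting yields significant gains across a wide range of generative and discriminative tasks, that many-shot prompts can override biases acquired during pretraining, that models can learn high-dimensional functions with numerical inputs in context, and that many-shot ICL can perform comparably to fine-tuning.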
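The main bottleneck for many-shot ICL is the availability of human-labeled demonstrations. To address this, Agarwal et al. proposed two variants: Reinforced ICL, which replaces human-written rationales with model-generated chain-of-thought rationales, and Unsupervised ICL, which removes rationales from the prompt altogether and provides only domain-specific inputs.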
Both approaches proved effective, particularly on complex reasoning tasks.
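The idea of using model-generated rationales as demonstrations can be sketched as a filtering loop. The sketch assumes that sampled rationales are kept only when their final answer matches a known correct answer; `sample_rationales` is a hypothetical callable standing in for a model call, and `fake_sampler` is a canned stub so the example runs without a model.

```python
def build_reinforced_demos(problems, sample_rationales, num_samples=4):
    """Keep model-generated rationales whose final answer matches the known answer."""
    demos = []
    for question, gold_answer in problems:
        for rationale, answer in sample_rationales(question, num_samples):
            if answer == gold_answer:
                demos.append((question, rationale, answer))
                break  # one verified rationale per problem is enough for this sketch
    return demos


def fake_sampler(question, n):
    """Canned stand-in for sampling n (rationale, final_answer) pairs from a model."""
    canned = {
        "2 + 3 = ?": [("2 plus 3 is 6.", "6"), ("2 plus 3 is 5.", "5")],
        "4 * 2 = ?": [("4 times 2 is 6.", "6")],
    }
    return canned[question][:n]


problems = [("2 + 3 = ?", "5"), ("4 * 2 = ?", "8")]
demos = build_reinforced_demos(problems, fake_sampler)
print(demos)
```

Problems for which no sampled rationale reaches the correct answer contribute no demonstration, as happens for the second problem here.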
The Gemini 1.5 technical report (Reid et al., 2024) provided a striking real-world demonstration of long-context ICL: when given a 500-page reference grammar, a bilingual dictionary, and roughly 400 parallel sentences for Kalamang, a Papuan language with fewer than 200 speakers, Gemini 1.5 Pro learned to translate from English into Kalamang at a quality comparable to a human learner who studied the same materials. Because Kalamang has virtually no presence on the web, the model is unlikely to have seen meaningful amounts of it during pretraining, making the result a close approximation of pure in-context language acquisition.
Many-shot ICL has reshaped practical thinking about the ICL-versus-fine-tuning tradeoff: for many tasks, simply pasting hundreds or thousands of examples into the prompt is now a viable alternative to building a custom training pipeline.
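As a rough illustration of many-shot prompting, the sketch below packs as many demonstrations as fit into a token budget. The function names are hypothetical, the `Input:`/`Output:` template is an assumption, and `approx_tokens` is a crude whitespace-based stand-in for the model's real tokenizer, which should be used in practice.

```python
from typing import List, Tuple


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: count whitespace-separated words.
    return len(text.split())


def pack_demonstrations(
    demos: List[Tuple[str, str]], query: str, token_budget: int
) -> Tuple[str, int]:
    """Add demonstrations in order until the next one would exceed the budget."""
    query_block = f"Input: {query}\nOutput:"
    used = approx_tokens(query_block)  # always reserve room for the query
    blocks = []
    for x, y in demos:
        block = f"Input: {x}\nOutput: {y}"
        cost = approx_tokens(block)
        if used + cost > token_budget:
            break
        blocks.append(block)
        used += cost
    return "\n\n".join(blocks + [query_block]), len(blocks)
```

The function returns both the prompt and the number of demonstrations that fit, which makes it easy to see how shot count scales with the context window.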
The following table summarizes practical techniques drawn from the ICL literature.
| Technique | Idea | Reference / origin |
|---|---|---|
| Random selection | Pick demonstrations uniformly from a pool | Brown et al. (2020) |
| Similarity-based selection (KATE) | Retrieve demonstrations by embedding similarity to query | Liu et al. (2021) |
| Diversity-aware selection | Combine similarity with coverage to avoid redundancy | Various follow-ups (e.g., Su et al., 2023) |
| Order optimization | Search over orderings using entropy on a synthetic dev set | Lu et al. (2022) |
| Contextual calibration | Rescale output probabilities by the model's prediction on a content-free input (e.g., "N/A") | Zhao et al. (2021) |
| Instructional priming | Add a clear natural-language task instruction before demonstrations | Brown et al. (2020) |
| Chain-of-thought demonstrations | Include explicit intermediate reasoning steps | Wei et al. (2022) |
| Self-consistency | Sample multiple CoTs and majority-vote answers | Wang et al. (2022) |
| Least-to-most prompting | Decompose into subproblems and chain solutions | Zhou et al. (2022) |
| Step-back prompting | Abstract to higher-level concept before answering | Zheng et al. (2023) |
| Many-shot prompting | Include hundreds to thousands of examples in long contexts | Agarwal et al. (2024) |
| Reinforced ICL | Use model-generated rationales as demonstrations | Agarwal et al. (2024) |
| Structured output schemas | Force JSON or tagged output format to reduce parsing errors | Production practice; OpenAI structured outputs (2024) |
| Format consistency | Keep separator, capitalization, and field names consistent across demonstrations | Min et al. (2022); Wei et al. (2023) |
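Similarity-based selection from the table can be sketched as follows. This is a toy version only: the bag-of-words `embed` function is an assumption made to keep the example dependency-free, whereas systems in the style of KATE retrieve with learned sentence embeddings. All function names here are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    # Toy embedding (assumption): lowercase bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def select_demonstrations(
    query: str, pool: List[Tuple[str, str]], k: int = 2
) -> List[Tuple[str, str]]:
    """Return the k (input, label) pairs whose inputs are most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda d: cosine(q, embed(d[0])), reverse=True)
    return ranked[:k]
```

Swapping `embed` for a real embedding model leaves the rest of the selection logic unchanged.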
A practical few-shot prompt construction workflow looks roughly like this:
Format every demonstration with the same template (for example, Input: ... Output: ...) and end the prompt with the query input followed by the same separator.
In modern production systems, in-context demonstrations are increasingly used not just to teach a task but to teach a format. JSON schemas, function-calling specifications, and tool-use traces are typical patterns. A small number of input-output examples followed by a JSON schema can reliably produce parseable outputs, even from models that were not explicitly fine-tuned on that schema.
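A minimal sketch of this pattern is shown below: demonstrations share one template, the prompt ends with the query and the same separator, and the completion is validated as JSON before use. The helper names, the `Input:`/`Output:` template, and the required-key check are illustrative assumptions; no particular model API is implied.

```python
import json
from typing import Dict, List, Optional, Tuple


def build_prompt(
    instruction: str, demos: List[Tuple[str, Dict]], query: str
) -> str:
    """Render every demonstration with the same template, then the query."""
    parts = [instruction]
    for text, record in demos:
        # sort_keys keeps field order identical across demonstrations.
        parts.append(f"Input: {text}\nOutput: {json.dumps(record, sort_keys=True)}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def parse_output(completion: str, required_keys: List[str]) -> Optional[Dict]:
    """Return the parsed record, or None if it is malformed or missing fields."""
    try:
        record = json.loads(completion.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or not set(required_keys).issubset(record):
        return None
    return record
```

Returning `None` on a malformed completion lets the caller decide whether to retry, fall back, or surface an error.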
In-context learning is not limited to text. Several efforts have demonstrated ICL in multimodal settings.
Flamingo, introduced by Alayrac et al. (2022) at DeepMind, is a vision-language model that bridges a frozen pretrained language model with a vision encoder via gated cross-attention layers. By training on interleaved image-text data, Flamingo acquired strong few-shot ICL on vision-language tasks such as visual question answering and captioning. Subsequent open-source models including OpenFlamingo and Idefics replicate this capability. Studies of ICL in vision-language models, including Chen et al. (2024) and Baldassini et al. (2024), have found that the textual portion of demonstrations carries most of the task information, with images contributing relatively less.
In-context learning behavior has also been observed in models trained on tabular data, time series, and reinforcement-learning trajectories. Garg et al. (2022) and follow-ups demonstrated ICL on synthetic regression and classification tasks, while Laskin et al. (2023), in "In-context Reinforcement Learning with Algorithm Distillation," showed that transformers trained on long sequences of trajectories can learn new RL tasks in-context.
Despite its practical utility, in-context learning has several well-documented limitations.
As discussed above, in-context learning is sensitive to the choice, ordering, and formatting of demonstration examples. This fragility means that small, seemingly inconsequential changes to the prompt can cause large swings in performance. Reliably optimizing prompts often requires experimentation, automated search methods, or explicit calibration.
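One form of explicit calibration, in the spirit of contextual calibration (Zhao et al., 2021), is sketched below. It is a simplified variant: label probabilities are divided by the probabilities the model assigns when the query is replaced with a content-free input, then renormalized. The function name is hypothetical, and the sketch assumes the caller can obtain per-label probabilities from the model.

```python
from typing import List


def calibrate(probs: List[float], content_free_probs: List[float]) -> List[float]:
    """Rescale label probabilities by a content-free baseline and renormalize.

    `content_free_probs` are the label probabilities the model produces for
    the same prompt when the query is a content-free string such as "N/A".
    """
    # Guard against division by zero for labels the baseline never predicts.
    scaled = [p / max(cf, 1e-12) for p, cf in zip(probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]
```

For example, if the prompt alone pushes the model toward the first label with probabilities `[0.75, 0.25]`, a raw prediction of `[0.6, 0.4]` calibrates to roughly `[0.33, 0.67]`, flipping the predicted label.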
Although context windows have grown substantially, they still impose an upper bound on the amount of information that can be provided as demonstrations. For tasks that require learning from large datasets, in-context learning cannot match the capacity of fine-tuning, which can iterate over arbitrary amounts of data through multiple epochs. Long-context inference is also computationally expensive: with standard self-attention, the compute needed to process the prompt grows quadratically with its length and the key-value cache grows linearly, so very long prompts lead to higher latency and cost per request.
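A back-of-the-envelope calculation makes the cost concrete. The sketch below assumes standard dense self-attention, where attention compute over the prompt scales with the square of its length and key-value cache memory scales linearly. It ignores the feed-forward layers and any sparse or linear attention variants, so the numbers are only indicative.

```python
def relative_cost(prompt_tokens: int, baseline_tokens: int) -> dict:
    """Cost of a prompt relative to a shorter baseline prompt.

    Assumes dense self-attention: compute ~ n^2, key-value cache ~ n.
    """
    ratio = prompt_tokens / baseline_tokens
    return {"attention_compute": ratio ** 2, "kv_cache_memory": ratio}
```

Under these assumptions, a 128,000-token many-shot prompt needs 64 times the cache memory and about 4,096 times the attention compute of a 2,000-token few-shot prompt.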
While in-context learning works well for pattern matching and classification tasks, it struggles with tasks that require multi-step reasoning or compositional generalization beyond the patterns in the demonstrations. Chain-of-thought prompting can partially address this limitation by providing intermediate reasoning steps in the demonstrations, but it does not fully close the gap. Reasoning-trained models that perform extended internal computation at inference time, such as o1 and similar systems released in 2024 and 2025, partially substitute for ICL on reasoning-heavy tasks.
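Chain-of-thought demonstrations are often combined with self-consistency (Wang et al., 2022), which samples several reasoning paths and takes a majority vote over their final answers. The sketch below shows only the voting step; `sample_answer` is a hypothetical callable standing in for one sampled model completion reduced to its final answer, or `None` when no answer could be extracted.

```python
from collections import Counter
from typing import Callable, Optional


def self_consistent_answer(
    sample_answer: Callable[[], Optional[str]], n_samples: int = 5
) -> Optional[str]:
    """Majority vote over the final answers of several sampled reasoning paths."""
    votes: Counter = Counter()
    for _ in range(n_samples):
        answer = sample_answer()
        if answer is not None:  # skip samples with no extractable answer
            votes[answer.strip()] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Voting on the final answer rather than the full rationale means that different reasoning paths reaching the same result reinforce each other.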
In-context learning is ephemeral. Each new prompt starts from scratch, with no memory of previous interactions. The model cannot accumulate knowledge across sessions or build on previous in-context learning episodes. This is by design (the weights are frozen), but it means that in-context learning is not a substitute for training or fine-tuning when persistent adaptation is needed. Long-term memory systems and retrieval-augmented architectures attempt to compensate for this gap by externalizing the knowledge store.
While the theoretical frameworks discussed in this article (gradient descent, Bayesian inference, induction heads) have provided valuable insights, none of them fully explains in-context learning in large, practical language models. Most theoretical results apply to simplified settings (linear models, small transformers, synthetic data), and the extent to which they generalize to models with hundreds of billions of parameters and MLP layers remains an open question. Surveys such as Dong et al. (2024) and "The Mystery of In-Context Learning" (EMNLP 2024) catalog dozens of partial mechanisms that may co-occur in real models.
Because in-context learning depends heavily on the provided demonstrations, biased or unrepresentative examples can lead the model to produce biased outputs. Beyond the systematic biases identified by Zhao et al. (2021), demonstrations can implicitly encode the worldview, formatting habits, or prejudices of whoever wrote them. Unlike fine-tuning, where bias mitigation techniques can be applied during training, there is limited ability to control for bias in the in-context learning setting beyond careful curation of the demonstrations.
Because ICL treats demonstrations and user input on roughly equal footing inside a single context window, models can be manipulated by adversarial content placed in either part of the prompt. This concern, known as prompt injection, becomes more acute as models are increasingly deployed as agents that read external documents, web pages, or tool outputs as part of their context. ICL's strength (taking arbitrary instructions seriously) is also a vulnerability.
In-context learning has found widespread use across many areas of NLP and beyond.
| Application | How ICL is used |
|---|---|
| Text classification | Demonstrations show input texts paired with category labels |
| Machine translation | Source-target sentence pairs serve as demonstrations |
| Question answering | Question-answer pairs demonstrate the desired format and reasoning |
| Code generation | Input-output pairs or natural language descriptions paired with code |
| Summarization | Document-summary pairs establish the desired compression level and style |
| Data extraction | Examples show how to extract structured information from unstructured text |
| Reasoning tasks | Chain-of-thought demonstrations provide step-by-step reasoning templates |
| Format conversion | Examples demonstrate the mapping between data formats (e.g., JSON to CSV) |
| Tool and function calling | Demonstrations show how to invoke a tool or API with the right arguments |
| Multimodal tasks | Image-caption or image-question pairs serve as demonstrations in vision-language models |
| Low-resource translation | Pairing a grammar plus parallel sentences in the prompt enables translation for endangered languages (Kalamang case) |
| Personalization | A small set of user-specific examples adapts model behavior to a particular style or persona |
| Guardrails and policy enforcement | Demonstrations encode the desired refusal behavior or safety constraints |
Imagine you are doing a new kind of worksheet at school that you have never seen before. Your teacher shows you two completed examples at the top of the page so you can see the pattern. Then you try to do the next one on your own by copying what the examples did.
That is basically what in-context learning is. A big language model (like a very smart parrot that has read billions of sentences) gets shown a few "here is the question, here is the answer" examples right before a new question. The model looks at those examples, figures out the pattern, and then answers the new question the same way. The interesting part is that nobody had to re-teach the model anything. It just looked at the examples and figured it out on the spot.
If you give it more examples, it usually gets better. If you give it the wrong-looking examples or write them in a confusing order, it can get confused. And when you start a new conversation, it forgets everything, because it never actually wrote anything down in its long-term memory.